Why Your MCP Server Might Be Eating Your Context Window (and How to Fix It)
MCP was devised as a protocol to give AI agents a consistent way to interact with external systems and the structured context that sits outside the model itself. Adoption has been broad since late 2025: most major coding agents, IDEs, and platform vendors now ship MCP support.
As more servers come online and more agents wire into them, a new problem has surfaced: context bloat. Many MCP servers in production today fill agents' context windows in two ways - they front-load full tool definitions the moment a connection opens, and they return unfiltered responses on every call. Context windows fill before the agent has done any useful work.
Connect three MCP servers to your agent stack - GitHub, Slack, Sentry - and 55,000 tokens of tool definitions load before the agent reads its first user message. That's Apideck's own illustrative test, published March 2026. A different setup they documented came in at 143,000 tokens, which was 72% of Claude's context window when those benchmarks ran.
The obvious counter is that we have 1M context windows now. Claude Opus 4.6 and Sonnet 4.6 run 1M tokens at standard pricing, generally available since mid-March 2026. Opus 4.7, which launched in April, supports the same window, as does Opus 4.8, which launched recently. A 143,000-token tool dump is 14% of 1M, not 72% of 200K. For a simple three-server agent stack, that headroom probably covers it.
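The window arithmetic above is easy to check. This is a minimal sketch using only the figures quoted in the text (143,000 tokens of tool definitions, 200K and 1M windows); the function name is illustrative.

```python
def window_share(tokens_used: int, window_size: int) -> float:
    """Return the fraction of a context window consumed by tokens_used."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return tokens_used / window_size


TOOL_DEFINITION_TOKENS = 143_000  # figure quoted in the text

for label, window in (("200K", 200_000), ("1M", 1_000_000)):
    share = window_share(TOOL_DEFINITION_TOKENS, window)
    print(f"{label} window: {share:.1%} taken by tool definitions")
```

The same payload takes a smaller share of a bigger window, but the payload itself is unchanged.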
For an MCP server sitting on top of a context graph, the maths work differently.
Why more headroom doesn't change the economics
MCP and CLI reach the same services. What differs is what the trip costs in tokens.
Scalekit compared CLI and MCP on identical tasks across 75 benchmark runs. MCP cost between 4 and 32 times more tokens per operation than CLI. Repo language detection: 1,365 tokens via CLI, 44,026 via MCP. That multiplier holds at any window size. At 10,000 operations a month, you're paying $3.20 via CLI or $55.20 via MCP. A bigger ceiling doesn't change the multiplier. None of this is a reason to drop back to CLI. MCP earns its overhead by handing back structured, related data instead of raw output you'd have to parse and stitch together yourself - the goal is to pay only for what the task actually touches.
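To make the multiplier concrete, here is a small sketch that works in tokens only, using the two per-operation figures quoted above. The dollar figures depend on pricing and workload assumptions that are not given in this text, so they are not reproduced here. Function names are illustrative.

```python
CLI_TOKENS_PER_OP = 1_365    # repo language detection via CLI (quoted figure)
MCP_TOKENS_PER_OP = 44_026   # same task via MCP (quoted figure)


def token_multiplier(mcp_tokens: int, cli_tokens: int) -> float:
    """How many times more tokens the MCP path uses for one operation."""
    return mcp_tokens / cli_tokens


def monthly_tokens(tokens_per_op: int, ops_per_month: int) -> int:
    """Total tokens for a month of identical operations."""
    return tokens_per_op * ops_per_month


multiplier = token_multiplier(MCP_TOKENS_PER_OP, CLI_TOKENS_PER_OP)
print(f"multiplier: {multiplier:.2f}x")
print(f"CLI at 10,000 ops: {monthly_tokens(CLI_TOKENS_PER_OP, 10_000):,} tokens")
print(f"MCP at 10,000 ops: {monthly_tokens(MCP_TOKENS_PER_OP, 10_000):,} tokens")
```

Window size never appears in the calculation, which is why a larger window leaves the multiplier untouched.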
The same benchmark recorded a 28% failure rate on calls to GitHub's Copilot MCP server - TCP timeouts, not protocol errors. You pay the per-operation token cost on calls that don't always complete.
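A rough way to see what a failure rate does to cost, under assumptions that are ours and not Scalekit's: every failed call is retried until it succeeds, failures are independent, and a failed attempt costs the same tokens as a successful one. Under those assumptions the expected number of attempts per completed call is 1 / (1 - failure rate).

```python
def expected_attempts(failure_rate: float) -> float:
    """Expected attempts per completed call, assuming independent retries."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return 1.0 / (1.0 - failure_rate)


def tokens_per_completed_call(tokens_per_attempt: float, failure_rate: float) -> float:
    """Effective token cost of one completed call (pessimistic assumption:
    a failed attempt is charged the full per-attempt cost)."""
    return tokens_per_attempt * expected_attempts(failure_rate)


# 28% is the failure rate quoted in the text; 44,026 is the MCP figure
# quoted earlier, reused here purely for illustration.
print(f"attempts per completed call: {expected_attempts(0.28):.2f}")
print(f"effective tokens: {tokens_per_completed_call(44_026, 0.28):,.0f}")
```

Real retry behaviour, and how much a timed-out call actually costs, will vary by client, so treat this as an upper-bound style estimate.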
Front-loaded tool definitions and unfiltered responses are real problems on any MCP implementation. On a graph-backed server, the second tends to compound faster than you'd expect. Many general-purpose servers load a few dozen tool definitions and return bounded responses. Graph-backed servers tend to be a different shape - they can return tags, annotations, relationship graphs, and deployment history on every entity query, if you let them. Roadie's Context Graph serves 200-300K agent API calls per day via MCP, against graphs built to hold millions of rows. If each of those calls returned full records by default, the responses wouldn't fit in 1M tokens any more than they fit in 200K - we'd saturate the window before the agent did any useful work. Bigger windows give you more margin, but they don't change that cost shape, which is why returning everything by default was never an option for us.
One pattern that works
Progressive disclosure is one of the approaches the industry is converging on, and it's the one we've leant into: each query returns what the agent needs for its current decision, not everything it might conceivably need across all possible tasks. A discovery query asks what's available and gets a compact summary - entity kinds, counts, namespaces. A scoped query, "services owned by the payments team", returns names, owners, and current status. A detail query, against a specific service the agent has already identified, returns the fields the workflow actually needs: ownership chain, recent deployments, open incidents, attached runbooks. Full records are available when asked for. They're just not the default.
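The three tiers described above can be sketched against a small in-memory catalog. Everything here is hypothetical: the records, field names, and function names are illustrative and do not describe Roadie's actual API or schema.

```python
from collections import Counter

# Hypothetical catalog records; field names and values are illustrative only.
CATALOG = [
    {"kind": "service", "namespace": "default", "name": "checkout-api",
     "owner": "payments", "status": "healthy",
     "recent_deployments": ["2026-03-02"], "open_incidents": [],
     "runbooks": ["checkout-rollback"], "annotations": {"tier": "1"}},
    {"kind": "service", "namespace": "default", "name": "search-api",
     "owner": "discovery", "status": "degraded",
     "recent_deployments": [], "open_incidents": ["INC-1"],
     "runbooks": [], "annotations": {}},
]

DETAIL_FIELDS = ("name", "owner", "status",
                 "recent_deployments", "open_incidents", "runbooks")


def discover(catalog):
    """Discovery tier: kinds, counts, and namespaces. No records."""
    return {
        "kinds": dict(Counter(e["kind"] for e in catalog)),
        "namespaces": sorted({e["namespace"] for e in catalog}),
    }


def scoped(catalog, owner):
    """Scoped tier: just enough to identify targets."""
    return [{"name": e["name"], "owner": e["owner"], "status": e["status"]}
            for e in catalog if e["owner"] == owner]


def detail(catalog, name, full=False):
    """Detail tier: workflow fields by default, full record only on request."""
    for e in catalog:
        if e["name"] == name:
            return dict(e) if full else {k: e[k] for k in DETAIL_FIELDS}
    raise KeyError(name)
```

An agent would typically call `discover` once, `scoped` to find its targets, and `detail` only for the entities it has decided to act on; `full=True` stays available but is never the default.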
Port of Context tested the same idea at the interaction-model level with a 12-task Stripe benchmark. They ran identical workflows across CLI, raw MCP, and Code Mode - where Code Mode lets the agent write a short TypeScript program to orchestrate calls internally rather than looping back through the model. That collapses 12-turn workflows into 4. Token totals across all 12 tasks: 711,555 via CLI, 506,970 via raw MCP, 294,924 via Code Mode. Same protocol, same Stripe server. The 42% reduction against raw MCP comes from two directions at once: scoped tool definitions, and fewer model round trips because batching moves into code rather than into the model loop. You keep what makes MCP useful for graph work - structured data, relationship graphs, consistent query semantics - and stop paying for what the agent didn't ask for.
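The reduction figures follow directly from the quoted totals. A minimal check, using only the three numbers above (the dictionary keys and function name are illustrative):

```python
# Token totals across all 12 tasks, as quoted in the text.
TOTALS = {"cli": 711_555, "raw_mcp": 506_970, "code_mode": 294_924}


def reduction(baseline: int, improved: int) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


vs_raw_mcp = reduction(TOTALS["raw_mcp"], TOTALS["code_mode"])
vs_cli = reduction(TOTALS["cli"], TOTALS["code_mode"])
print(f"Code Mode vs raw MCP: {vs_raw_mcp:.1%}")
print(f"Code Mode vs CLI: {vs_cli:.1%}")
```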
How Roadie approaches this
Our MCP server sits on top of a Context Graph. Integrations pull structural data from systems of record into the graph; Relations link items across those systems (a GitHub user matched to an AWS IAM identity, say); Context Groups collapse the linked items into single concepts an agent can reason about - an Employee that resolves to every linked source at once. Different queries return different slices. The guiding question is "give me what's relevant to this incident on this service," not "give me the catalog."
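One way to picture how Relations collapse into Context Groups is as connected components over linked items. The sketch below uses a plain union-find; it is an illustration of the idea under that assumption, not a description of how Roadie implements it, and the identifiers are made up.

```python
from collections import defaultdict


def build_context_groups(items, relations):
    """Group items that are connected by relations (union-find)."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relations:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for item in items:
        groups[find(item)].append(item)
    # Sort for a deterministic result.
    return sorted(sorted(members) for members in groups.values())


# Hypothetical items pulled in by integrations, and one relation between them.
items = ["github:alice", "aws-iam:alice.smith", "github:bob"]
relations = [("github:alice", "aws-iam:alice.smith")]
print(build_context_groups(items, relations))
```

Each resulting group stands for one concept, such as an Employee, that an agent can query without knowing which source system each identity came from.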
Capabilities go a layer further. A Capability is a documented procedure an agent can follow for a known process - incident investigation, employee onboarding, credential rotation. The graph supplies the entities; the Capability supplies the steps to execute against them. And the same Integrations that feed the graph are exposed directly to the agent through MCP, so live state - current open PRs, current alert status - gets fetched on demand rather than preloaded. The agent doesn't pay for what changes by the minute, and doesn't pay for what it didn't ask for.
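The split between a documented procedure and live state fetched on demand can be sketched as follows. The class names, step text, and fetchers are all hypothetical and chosen for illustration; this is not Roadie's Capability format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Capability:
    """A documented procedure: a name and ordered steps (illustrative shape)."""
    name: str
    steps: list[str]


class LiveState:
    """Fetches live values only when asked, and records what was fetched."""

    def __init__(self, fetchers: dict[str, Callable[[], object]]):
        self._fetchers = fetchers
        self.fetched: list[str] = []

    def get(self, key: str) -> object:
        # No caching: live state changes by the minute, so fetch every time.
        self.fetched.append(key)
        return self._fetchers[key]()


investigate = Capability(
    name="incident-investigation",
    steps=["check alert status", "list open PRs", "read runbook"],
)
state = LiveState({
    "alert_status": lambda: "firing",  # stand-in for a real integration call
    "open_prs": lambda: 3,
})

# Nothing is fetched until a step actually needs it.
print(state.fetched)
print(state.get("alert_status"), state.fetched)
```

The agent only pays for `open_prs` if the procedure reaches the step that needs it.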
Alongside MCP, we also serve pre-built briefings - context packages that drop directly into an agent's system prompt or working directory. The broader pattern is already familiar from tools like Claude Code's CLAUDE.md and Cursor's rules; briefings apply the same idea specifically to organisational context, for cases where you know upfront what an agent needs. The integration shape varies across these surfaces, but the disclosure discipline behind them is the same.
A 20,000-entity graph doesn't impose a 20,000-entity context cost per query. The agent pays for the entities the task touches, and graph scale becomes an asset - more complete information available when a workflow asks for it - rather than a liability that fills the window before the agent can act.
Three questions worth asking
Whether you're evaluating a graph-backed MCP server or building one yourself, three questions tend to matter most for how it behaves under real agent load. The first is what an agent sees on initial connection - a compact summary of entity kinds and scale works better than eager-loaded full schema definitions, which drive the worst initial-token numbers and don't help the agent before its first real query. The second is what a scope query returns - names, owners, and current status are usually enough for an agent to identify its targets, where full entity records load context the agent hasn't asked for. The third is what goes in a detail response - ownership, recent operational state, and attached documentation cover most workflows, while historical annotations and full schema definitions sit better behind a deliberate deeper query.
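When evaluating a server against the second and third questions, a simple audit is to compare the fields a response actually contains with the fields that tier should need. The field lists below are assumptions drawn from the description above, not a standard; adjust them to your own workflows.

```python
# Assumed field allowances per tier, taken from the prose above.
TIER_FIELDS = {
    "scope": {"name", "owner", "status"},
    "detail": {"name", "owner", "status",
               "recent_deployments", "open_incidents", "runbooks"},
}


def unexpected_fields(tier: str, record: dict) -> list[str]:
    """Return fields in `record` that the given tier was not expected to send."""
    return sorted(set(record) - TIER_FIELDS[tier])


# Hypothetical scope-query response that leaks detail-level data.
response = {"name": "checkout-api", "owner": "payments", "status": "healthy",
            "annotations": {"tier": "1"}, "schema": {}}
print(unexpected_fields("scope", response))
```

A non-empty result on a scope query is a sign the server is loading context the agent has not asked for.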
In each case, the discipline is the same: return what was requested, at the granularity it was requested, and trust the agent to come back for more. That holds at 200K and at 1M.
If you want to see Context Groups and Capabilities in practice, request a demo.
