# I Deleted All My MCP Servers and Everything Got Faster
In my last post I described a PKM system held together by Ansible, git, and a small fleet of MCP servers. Eight of them, to be precise. ArXiv, Semantic Scholar, Google Workspace, Obsidian, Thoughtbox, QMD, Markitdown, Mermaid. Each one a stdio process that Claude Code spawns, connects to, and keeps alive for the duration of the session.
If you're running MCP servers with Claude Code, you already know there's token overhead. Tool definitions aren't free. But there's a detail about *when* that cost kicks in that changed how I think about the whole architecture.
## The carry problem
Claude Code doesn't load every MCP tool definition at session start. It's smarter than that. Tools get loaded lazily — the first time you actually use a server, its definitions enter the context. So far, so reasonable.
Here's the part that bit me: once loaded, those definitions stay in context for *every subsequent turn*. They're carried forward, request after request, until compaction finally clears them out. Use your Semantic Scholar MCP once on turn 3 to look up a paper? Its 33 tool definitions — names, descriptions, full JSON parameter schemas — ride along on turns 4, 5, 6, 7... all the way until the context gets compacted.
It's not a startup cost. It's a *carry* cost. And it compounds.
I got curious about the actual weight. I cracked open GoodLemur's API logs (raw request/response JSONL I save for research) and measured what each server adds to every request once loaded:
| Server | Tools | Tokens carried per turn |
| --- | --- | --- |
| google-workspace | 142 | ~37,392 |
| semantic-scholar | 33 | ~10,776 |
| thoughtbox | 3 | ~1,555 |
| arxiv-mcp-server | 4 | ~1,185 |
| markitdown | 1 | ~76 |
Google Workspace was the worst offender. 142 tools. ~37,392 tokens. Check your calendar once on turn 5, and those definitions occupy context on every turn until compaction.
Now here's the thing: prompt caching means you're not paying full input price for those repeated definitions. The API caches them server-side after the first send. So the *dollar* cost is manageable. But caching doesn't shrink the context window. Those ~37,392 tokens still *occupy space*. They count against your 200k limit just like your actual conversation does.
Across all my servers once loaded: ~51,000 tokens per turn sitting in context. That's 25% of a 200k window that isn't available for conversation history, code, or reasoning. Compaction triggers sooner. You lose conversational context faster. Your effective session depth shrinks — not because you ran out of things to say, but because tool definitions you used once are squatting on the space.
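The measurement itself is small. Here's a sketch of how such a count can be done, assuming the logs are one JSON request per line in the standard Messages API shape, that MCP tool names follow Claude Code's `mcp__<server>__<tool>` convention, and using the crude chars÷4 token heuristic rather than a real tokenizer (the function name and log format are mine, not part of any API):

```python
# Rough carry-cost measurement over a JSONL log of raw API requests.
# Each line is assumed to be one Messages API request with a "tools" array.
import json
from collections import defaultdict

def carry_cost(jsonl_path):
    per_server = defaultdict(lambda: {"tools": set(), "chars": 0})
    with open(jsonl_path) as f:
        for line in f:
            req = json.loads(line)
            for tool in req.get("tools", []):
                name = tool.get("name", "")
                if not name.startswith("mcp__"):
                    continue  # skip built-in tools like Bash, Read, etc.
                server = name.split("__")[1]
                if name not in per_server[server]["tools"]:
                    per_server[server]["tools"].add(name)
                    # Weigh the full definition: name, description, schema.
                    per_server[server]["chars"] += len(json.dumps(tool))
    return {s: {"tools": len(v["tools"]), "approx_tokens": v["chars"] // 4}
            for s, v in per_server.items()}
```

Run it against a log from late in a session, after a few servers have been touched, and the per-server carry weight falls out directly.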
There's a second angle that makes this worse. The API has a [context editing](https://platform.claude.com/docs/en/build-with-claude/context-editing) strategy called `clear_tool_uses` that clears old tool *results* from past turns — file contents you've already read, search results you've already processed. It's designed to free up exactly this kind of accumulated weight. And it works great for CLI tools. When Claude calls `s2-search` via Bash, that Bash output is a tool result. On later turns, `clear_tool_uses` can sweep it away.
But MCP tool *definitions* aren't tool results. They're structural. They live in the `tools` parameter, sent with every API request, immune to context editing. No clearing strategy can touch them. So MCP tools get the worst of both: their definitions persist forever (until compaction), and they don't benefit from the system designed to manage exactly this problem.
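The asymmetry is visible in the shape of a raw request: definitions travel in the top-level `tools` parameter, while results are ordinary content blocks inside `messages` — and only the latter lives where context editing can reach. A sketch of that structure (field values are illustrative; the block layout follows the Messages API):

```python
# Where each kind of weight lives in a Messages API request.
request = {
    "model": "claude-sonnet-4-5",  # illustrative model name
    "tools": [
        # Tool DEFINITION: resent with every request once loaded.
        # Structural — no clearing strategy can touch it.
        {"name": "mcp__semantic-scholar__paper_search",
         "description": "Search for papers...",
         "input_schema": {"type": "object",
                          "properties": {"query": {"type": "string"}}}},
    ],
    "messages": [
        {"role": "assistant", "content": [
            {"type": "tool_use", "id": "toolu_01", "name": "Bash",
             "input": {"command": "s2-search 'attention'"}}]},
        {"role": "user", "content": [
            # Tool RESULT: part of message history, so clear_tool_uses
            # can sweep it away on later turns.
            {"type": "tool_result", "tool_use_id": "toolu_01",
             "content": "...search output..."}]},
    ],
}
```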
## The idea
What if the tools were just CLI commands?
An MCP server is a process that speaks JSON-RPC over stdio. Claude Code launches it, negotiates capabilities, and loads each tool definition into context when you first use it — where it stays until compaction. But if Claude already knows a CLI exists — because you told it in CLAUDE.md — it can just call it with Bash. No process to manage. No tool definitions accumulating in context. No carry cost.
**Before (MCP):** Use Semantic Scholar once → 33 tool definitions load → carried every turn until compaction → immune to `clear_tool_uses`.
**After (CLI):** CLAUDE.md says `s2-search` exists → Claude calls it via Bash → result clearable on later turns → no definitions persist.
Same capability. The difference is what happens on the turns where you're *not* using that tool.
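A toy model makes the session-depth effect concrete. The window size and carry figure come from my logs above; the per-turn history growth is a made-up average, so treat the outputs as illustrative only:

```python
# Toy model: turns until compaction, with and without a fixed block
# of tool definitions squatting on the window.
WINDOW = 200_000   # context window
CARRY = 51_000     # tool definitions carried once all servers are loaded
AVG_TURN = 3_000   # assumed average tokens of history added per turn

def turns_until_compaction(carry):
    usable = WINDOW - carry
    return usable // AVG_TURN

print(turns_until_compaction(CARRY))  # with MCP definitions loaded
print(turns_until_compaction(0))      # CLI-only
```

Under these assumptions the CLI-only session runs roughly a third deeper before compaction fires — without saying anything different.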
## The migration
I ended up with three approaches depending on the server:
### 1. Write thin CLI wrappers (arxiv-search, s2-search)
For ArXiv and Semantic Scholar, I wrote Python scripts that hit the APIs directly. No libraries for S2 — the `semanticscholar` Python package turned out to be unusably slow without an API key. Its retry logic would block for 30+ seconds. A raw `urllib.request` call returns in milliseconds.
```python
# s2-search — direct API, no library
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_GRAPH = "https://api.semanticscholar.org/graph/v1"

def api(path, params=None, base=BASE_GRAPH, method="GET", body=None):
    url = f"{base}{path}"
    if params:
        url += "?" + urlencode({k: v for k, v in params.items() if v is not None})
    req = Request(url, method=method,
                  data=json.dumps(body).encode() if body else None,
                  headers=headers())  # headers() (elsewhere in the script) adds the API key
    with urlopen(req, timeout=30) as r:
        return json.loads(r.read())
```
13 subcommands: `search`, `bulk`, `match`, `snippets`, `get`, `refs`, `cites`, `batch`, `recommend`, `recommend-multi`, `author`, `author-search`, `autocomplete`. All from ~250 lines of Python.
The `arxiv-search` wrapper was even simpler — just `arxiv.py` with a CLI face. One gotcha: argparse subcommands collide with positional search queries (your query gets interpreted as a subcommand name). I rewrote it to parse `sys.argv` manually. Ugly but bulletproof.
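The dispatch idea fits in a few lines: treat the first argument as a subcommand only if it's in a known set, otherwise fall through to search. This is a simplified sketch of the approach, not my actual wrapper; the subcommand set is illustrative:

```python
# Manual sys.argv dispatch: a bare query like
#   arxiv-search "attention transformers"
# must not be eaten as a subcommand, which argparse subparsers would do.
import sys

SUBCOMMANDS = {"search", "get", "refs", "cites", "author"}  # illustrative

def dispatch(argv):
    if not argv:
        return ("help", [])
    head = argv[0]
    if head in SUBCOMMANDS:
        return (head, argv[1:])
    # Unknown first token: treat the whole argv as a search query.
    return ("search", argv)

if __name__ == "__main__":
    cmd, args = dispatch(sys.argv[1:])
```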
### 2. Compile MCP servers to static binaries (thoughtbox, obsidian, google-workspace)
This is the weird one. [clihub](https://github.com/thellimist/clihub) takes a running MCP server — stdio or HTTP — and generates a standalone Go binary with a subcommand for every tool. No Node runtime needed at execution time. The compiled binary just works.
```bash
# Point clihub at a running MCP server, get a CLI
clihub generate --name thoughtbox \
--transport stdio \
--command "npx -y @kastalien-research/thoughtbox"
# Result: static binary with 3 subcommands
thoughtbox mental-models --operation list_models
thoughtbox thoughtbox --raw '{"thought":"...", "thoughtNumber":1, "totalThoughts":5}'
```
For Google Workspace, this turned 142 MCP tools into 142 CLI subcommands. One binary. No Node, no Python, no uvx at runtime.
```bash
# Before: MCP server with 142 tool definitions eating ~37,392 tokens/turn
# After: one line in CLAUDE.md
gw search-gmail-messages --user_google_email [email protected] \
  --query "from:someone subject:thing"
```
I hit two problems with clihub along the way:
**TLS certificates.** The Obsidian MCP server runs on localhost with a self-signed cert. Go's TLS stack doesn't honor `NODE_TLS_REJECT_UNAUTHORIZED`. Solution: use the HTTP endpoint on port 3001 instead of HTTPS on 3443.
**Codegen collision.** When an MCP tool has a parameter named `raw`, it collides with clihub's own `--raw` flag. The generated Go code has two variables both trying to be `flagRaw`. I patched it with a sed script and [filed an issue](https://github.com/thellimist/clihub/issues/8).
For servers that need credentials, a wrapper script handles it:
```bash
#!/bin/sh
# /opt/homebrew/bin/gw (wrapper)
export GOOGLE_OAUTH_CLIENT_ID="..."
export GOOGLE_OAUTH_CLIENT_SECRET="..."
export USER_GOOGLE_EMAIL="[email protected]"
exec /opt/homebrew/bin/gw-bin "$@"
```
### 3. Already had a CLI (markitdown, qmd, mmdr)
Some tools already existed as CLIs. `markitdown` is a pip package. `qmd` has both MCP and CLI interfaces (CLI is faster). `mmdr` is a Rust binary for Mermaid rendering. For these, I just killed the MCP server and added a one-liner to CLAUDE.md.
## Teaching Claude about the tools
This is the part that surprised me with how simple it was. CLAUDE.md is a file loaded into every session's context. Adding a CLI tool means adding a few lines:
```markdown
## CLI Tools
- **s2-search**: Semantic Scholar CLI. Hits the API directly.
- `s2-search "query"` — search papers
- `s2-search get <id>` — paper details
- `s2-search refs <id>` / `s2-search cites <id>` — references and citations
- `--json` for structured output. `-n` controls result count.
```
That's it. Claude reads the description, knows the command exists, and uses Bash to call it. The CLAUDE.md entry for all eight tools is maybe 80 lines total. Compare that to ~51,000 tokens of MCP tool definitions occupying context every turn.
I also updated two skills — `/recall` switched from `mcp__qmd__search` to `qmd search`, and `/think` switched from MCP thoughtbox tools to `thoughtbox mental-models`. Same behavior, lighter context.
## Cleaning up
With the CLIs in place, I ripped out every MCP server declaration:
```python
# One script to clear ~/.claude.json
import json, os

with open(os.path.expanduser('~/.claude.json'), 'r+') as f:
    d = json.load(f)
    d['mcpServers'] = {}
    f.seek(0); json.dump(d, f, indent=2); f.truncate()
```
Then updated Ansible so the next playbook run doesn't re-add them:
```yaml
# group_vars/mac.yml — before
mcp_servers:
arxiv-mcp-server:
command: uv
args: ["tool", "run", "arxiv-mcp-server"]
semantic-scholar:
command: uvx
args: ["semantic-scholar-mcp"]
google-workspace:
command: uvx
args: ["workspace-mcp", "--single-user"]
# ... 5 more
# group_vars/mac.yml — after
mcp_servers: {}
```
One `ansible-playbook site.yml --limit vps` applied the same change to the VPS. The Python CLI scripts (`arxiv-search`, `s2-search`) got `scp`'d over and pointed at a venv.
I also found a BasicMemory MCP ghost — a `UserPromptSubmit` hook echoing MCP tool instructions into every session even though the server was long gone. Removed that too. Archaeology.
## The numbers
| What | Before | After |
| --- | --- | --- |
| MCP servers | 8 | 0 |
| Tokens occupying context per turn (all loaded) | ~51,000 | 0 |
| Context window lost to tool definitions | ~25% | 0% |
| Compaction triggered | Sooner | Later |
| CLAUDE.md lines for equivalent capability | 0 | ~80 |
The capability is identical. Same searches. Same graph traversal. Same Gmail queries. The difference is how much of the context window is actually available for the work.
## What I learned
**MCP is great for discovery, bad for production.** When you're experimenting with a new tool, MCP is perfect. Install it, see the tools appear, try them out. But once you know what you need, the protocol overhead isn't paying for itself anymore. The tool definitions are documentation Claude doesn't need if you've already told it what exists.
**CLAUDE.md is underrated.** A few lines of natural language in a config file replaced thousands of tokens of JSON schema. Claude doesn't need a formal tool definition to use `s2-search get <id>`. It just needs to know the command exists and what the flags are.
**clihub is a cheat code.** Compiling a 142-tool MCP server into a static Go binary felt like it shouldn't work. It did. The generated code is readable, the binaries are fast, and the only bug I hit (the `raw` flag collision) was minor and patchable. For anyone running MCP servers in production, this tool is worth knowing about.
**Measure before optimizing.** I wouldn't have done any of this if I hadn't looked at the API logs. The carry cost was invisible until I counted it. If you're running multiple MCP servers, check your actual per-turn payload size after a few tool uses. You might be surprised at what's riding along.
**The best abstraction is no abstraction.** An MCP server is an abstraction over a CLI command (or an API call). Sometimes the abstraction adds value — multiplexing, capability negotiation, sampling. For simple tool use, it's just overhead. `curl` has been calling APIs since the late 1990s. We don't need a protocol for it.
## The current stack
Eight fewer processes. ~51,000 tokens freed from every turn. Compaction happens later. Sessions go deeper. Same capabilities. The CLI tools are version-controlled in `~/.venv/bin/` and `/opt/homebrew/bin/`, documented in CLAUDE.md, and deployed to both machines via Ansible.
Is it less elegant than MCP? Maybe. But `s2-search get 1706.03762` returning in 200ms without eating a quarter of my context window is hard to argue with.
---
Built with Claude Code, clihub, urllib.request, and the realization that JSON-RPC is not always the answer.