Skip to main content
A short, honest playbook for cutting Claude Code costs without switching to a worse tool.

Claude Code on a Token Diet - Trim the Bill, Not the Brain

A short, honest playbook for cutting Claude Code costs without switching to a worse tool.

TL;DR

Claude Code’s bill snowballs because every turn replays the entire session as input tokens, and output tokens cost roughly 5× more than input. Three habits cut my usage by ~60%: a lean CLAUDE.md, a strict .claudeignore, and disciplined use of /clear, /compact, and per-task model routing. Everything else is gravy.

The two facts nobody says out loud

Claude Code is not a chatbot - it’s an agent. Every message it sends carries the whole session as input: every file it read, every command output, every previous turn.

Two consequences:

  1. The snowball. Message 1 is cheap. Message 50 is expensive even when you type five words - because those five words ride on top of 50 turns of accumulated context.
  2. Output is 5× input. Verbose answers aren’t just noisy - they’re literally more expensive than the question that produced them.

Once those click, the rest is mechanics.

The three files that actually move the bill

1. CLAUDE.md - the silent tax

CLAUDE.md loads before every single message. All of it. Every turn.

3,000 tokens of project lore × 100 messages = 300,000 tokens of overhead before you’ve typed real work.

Keep it short and structural - not narrative.

Belongs:

  • How to run tests (pytest -v, npm test, etc.)
  • Which package manager
  • 1-2 genuinely non-obvious conventions
  • Folders Claude must avoid

Doesn’t:

  • Pasted Slack threads or chat transcripts kept “just in case”
  • Historical “why we did it this way” essays
  • Anything Claude already knows how to do

Target: under 150 lines, bullets only. Mine went from ~800 lines to 85 - and Claude behaved better, not worse. Less noise, clearer signal.

2. .claudeignore - set once, save forever

By default Claude can read everything in the repo. That includes virtualenvs, build artifacts, .pytest_cache/, generated code - useless context, all billed.

Drop a .claudeignore in the root, .gitignore syntax. For a typical Python project:

__pycache__/
*.pyc
*.pyo
.venv/
venv/
env/
*.egg-info/
build/
dist/
.pytest_cache/
.mypy_cache/
.ruff_cache/
.tox/
htmlcov/
.coverage
coverage.xml
*.log
poetry.lock
uv.lock
*.generated.*

For monorepos or multi-package layouts, ignore at both workspace and package level. One-time setup. Every session afterward is cheaper by default.

Bonus: also ignore IDE and editor folders. They’re full of settings files, workspace caches, and machine-specific noise Claude never needs:

.vscode/
.idea/
*.iml
.vs/
.nova/
*.swp
*.swo
.DS_Store
Thumbs.db

3. Use skills, not a fatter CLAUDE.md

Skills are CLAUDE.md-like files that load only when Claude pattern-matches the task to the skill’s description. Ten skills cost ~100 tokens each as metadata on a normal message - the heavy content (~2k tokens) loads only when relevant.

The mechanism: only the name + description sit in Claude’s context every turn (~100 tokens per skill). The body of SKILL.md (often 1-3k tokens) is fetched only when Claude reads your message and decides the description matches the task. Ten skills idle in the background for ~1k tokens total; only the one or two that match actually load.

That makes the description the single most important line in the whole file. Vague descriptions never trigger - the body never gets loaded - the skill is dead weight:

  • Bad: description: Helps with code (matches everything, so it matches nothing reliably)
  • Good: description: Reviews Python code for security, performance, and team standards. Use when asked to review code, audit a PR, or check a diff for risks.

Specific verbs and concrete triggers (“review”, “audit a PR”, “check a diff”) give Claude clear signals to match against. Write the description like you’re writing a search query the user would type.

Mid-session: /clear, /compact, and don’t switch models

/clear drops the session history. Treat it as the default move whenever the topic changes. The auth bug you just fixed adds nothing to your next React tweak - but it rides along on every prompt until you clear it. Dead context is weight you’re paying to carry.

/compact summarizes history and replaces it with the summary. Use mid-task when the window fills up. Trick: tell it what to preserve before you run it (“keep the decision to use optimistic locking”), or you’ll lose conclusions that took twenty messages to reach.

Don’t switch models mid-session. Prompt caches are per-model, so when you switch from Sonnet to Opus, the new model has no cache for your history and reads the full transcript as fresh, uncached input - you effectively pay for it twice. If you need Opus for one subtask, spawn a subagent instead. It runs in its own context window with its own cache; only the summary returns, and your Sonnet cache stays warm.

Two environment variables worth setting

# Cap extended thinking - output tokens cost 5x
export MAX_THINKING_TOKENS=10000

# Disable background suggestions/tips
export DISABLE_NON_ESSENTIAL_MODEL_CALLS=1

The first is the big lever. Default thinking budgets can blow tens of thousands of output tokens on Opus. 10k is plenty for normal coding. Use /effort low|medium for routine work and save /effort xhigh for the actually-hard problems.

Model routing in one line

  • Haiku - triage, classify, syntax lookups.
  • Sonnet - 80-85% of real coding work.
  • Opus - genuine architecture decisions, hard multi-file refactors, bugs Sonnet already failed on.

Sonnet on a clean session beats Opus on a bloated one. Every time.

The cheatsheet: a session loop that works

  1. One task per session.
  2. CLAUDE.md short, bullets only.
  3. .claudeignore covers build artifacts and lockfiles.
  4. MAX_THINKING_TOKENS=10000 in shell profile.
  5. /compact when a single task runs long.
  6. git commit, then /clear on task switch.
  7. End of session: ask Claude to summarize decisions and next steps. Save the note. Load it tomorrow.

That last step is the underrated one. A 200-token handoff note tomorrow is infinitely cheaper than a 50,000-token zombie context.

The clean way to automate it is a dedicated skill. Drop the file below at .claude/skills/end-of-session/SKILL.md and it’ll fire whenever you wrap up:

---
name: end-of-session
description: Summarizes the current session's decisions, open threads, and next steps into a short handoff note saved to disk. Use when the user is wrapping up a working session, ending a coding day, or asking for a "handoff" or "carry-over" note.
---

# End-of-session handoff

When triggered, produce a short handoff note for the next session.

## What to capture

1. **Decisions made** - one bullet per concrete choice (chose X over Y because Z).
2. **Open threads** - questions or sub-tasks that did NOT get resolved.
3. **Next steps** - 2-5 prioritized items, each one action verb + concrete target.
4. **Touched files** - paths only, no diffs.

## Output format

Write the note to `.claude/handoffs/YYYY-MM-DD-HHmm.md` (UTC). Use the structure:

```markdown
# Handoff - <ISO date>

## Decisions
- ...

## Open threads
- ...

## Next steps
1. ...

## Touched files
- path/to/file
```

## Rules

- Be short and precise. Don't pad, don't restate the obvious - the next session pays input tokens for every line.
- Write for a human reader. The note should be scannable in 30 seconds. No JSON dumps, no copy-pasted logs, no diagnostic noise no human would actually read. Laser focus.
- Keep the whole note under 200 lines.
- No prose recap of the conversation - bullets only.
- If a decision is uncertain, mark it `(tentative)` rather than burying it.
- Do not paste code. Reference files and line numbers instead.

Why this works: the description gives Claude clean verbal triggers (“wrap up”, “end of session”, “handoff”) so it activates exactly when you want it. The body carries the actual prompt - what to capture, where to save, in what format - so you shape the handoff once and never re-explain it. Tomorrow, opening the latest handoff file as your first message loads ~200 tokens instead of replaying a 50,000-token session.

And that’s it - you are good to go!