A context window is a budget. Model vendors publish the maximum number of tokens the window can hold. Most of the interesting engineering is about what you put in it. The mistake I kept making in the first few months of running an agentic development environment was treating the context window as a container you fill, rather than as an artifact you manufacture. The fix, once I finally built it, is a skill called /deep-context that pre-assembles a task-specific file before the task starts. The file is the context for the task. The task is spawned as a sub-session that reads only that file, plus the brief.
This is a worked description of the pipeline, the incident that forced it, the benchmark that gates its release, and the places where it has already caught things a default context load would have missed.
The incident that forced it
The school governors’ application was the forcing function. The corpus is roughly 918 megabytes of school documents, compressed to around 150,000 tokens after extraction. The application loads that corpus into the context window for every query. For the first few weeks, this seemed to work. Then I asked it a question about a specific set of meeting minutes. The answer it gave was plausible, fluent, and wrong. The correct minutes were in the corpus. The model had not found them.
I went looking for why. Liu et al.’s “Lost in the Middle” paper, published in 2023, quantified the failure mode: relevant information buried in the middle of a long context is found around 25% of the time, compared to around 42% at the end of the context. The drop is sharper for information that is not semantically obvious from surrounding text. School minutes are exactly the kind of document where the relevance marker (a specific date, a specific agenda item) is a single token surrounded by several thousand tokens of procedural boilerplate.
The fix for the governors’ application was query-type routing. Date-specific queries start with keyword search against a SQLite Full-Text Search index (FTS5) and inject only the matching minutes into context. Open-ended policy questions start with semantic search against a ChromaDB index and inject the top matches. Full-corpus loading is the fallback, not the default. Accuracy on the class of question that had been failing moved from “usually wrong” to “usually right”. That experience convinced me context management is a correctness requirement, not an optimisation.
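The routing is simple enough to sketch. The following is a minimal stand-in, not the production code: the date pattern, table schema, and query-term handling are my assumptions, and the semantic branch (ChromaDB) is omitted; only the routing decision and the FTS5 keyword path are shown, against an in-memory index.

```python
import re
import sqlite3

# Hypothetical date pattern: "12 March 2024" or ISO "2024-03-12".
DATE_PAT = re.compile(r"\b(\d{1,2}\s+\w+\s+\d{4}|\d{4}-\d{2}-\d{2})\b")

def route(query: str) -> str:
    """Date-specific queries go to keyword search; open-ended ones to semantic."""
    return "keyword" if DATE_PAT.search(query) else "semantic"

def keyword_search(db: sqlite3.Connection, query: str, k: int = 5) -> list[str]:
    """FTS5 MATCH over the minutes index; only the hits get injected into context."""
    terms = " OR ".join(re.findall(r"\w+", query))
    rows = db.execute(
        "SELECT doc_id FROM minutes WHERE minutes MATCH ? ORDER BY rank LIMIT ?",
        (terms, k),
    ).fetchall()
    return [r[0] for r in rows]

# Tiny in-memory stand-in for the real index.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE minutes USING fts5(doc_id, body)")
db.execute("INSERT INTO minutes VALUES (?, ?)",
           ("2024-03-12", "Minutes of the meeting held 12 March 2024: admissions review."))
db.execute("INSERT INTO minutes VALUES (?, ?)",
           ("2024-06-05", "Minutes of the meeting held 5 June 2024: safeguarding audit."))
hits = keyword_search(db, "minutes 12 March 2024")
```

The point of the sketch is the shape: classification first, then the cheapest retrieval path that answers that class of question, with full-corpus loading nowhere in the happy path.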
The question that followed was: if this is true for the governors’ app, where is it also true for me in my own coding sessions?
The answer was “everywhere”. Every high-stakes task I do as a coder is a task where the relevant context is scattered across topic files, previous conversations, code, and the raw session logs that underlie the search indices. The default context load is “whatever is auto-loaded by the harness plus whatever Claude decides to read once the task starts”. That default is fine for small tasks. For architectural changes, migrations, and incidents, it misses things. The pattern the governors’ app exposed was general.
What /deep-context does
/deep-context <brief> is a skill that runs before a high-stakes task and produces a file called context.md. The file is under 50,000 tokens (tunable), structured, and deduplicated. The task is spawned as a sub-session with context.md as its only input. The sub-session does not need to search. The search has already happened.
The skill has four stages. Each stage exists to do one job.
Stage 1: pre-filter
The corpus of things to consider spans three separate stores: the topic files (curated current truth, hand-maintained); compressed session summaries (one per closed session, generated by a separate compression pass, stored at memory/sessions/compressed/YYYY/); and the raw session transcripts (the JSON Lines (JSONL) files Claude Code writes, about 800 of them, totalling roughly 600 megabytes).
The pre-filter runs five queries against these stores and takes the union of the matches. The queries are: time window (last 90 days by default, extended when the brief contains keywords implying history), topic overlap with the brief, file-path overlap with the brief, keyword match via FTS5, and semantic match via ChromaDB. The union typically returns 20 to 80 candidate sessions plus all relevant topic files. This is the pool the next stage works with.
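The union logic can be sketched as follows. This is an illustration, not the production pre-filter: the history keyword set is invented, and the FTS5 and ChromaDB calls are replaced by pre-computed per-session scores so the sketch stays self-contained.

```python
from datetime import date

# Keywords that widen the time window are an assumption for this sketch.
HISTORY_KEYWORDS = {"history", "originally", "regression", "why"}

def prefilter(brief: str, sessions: list[dict], today: date) -> set[str]:
    """Union of the five candidate queries from stage 1. Each session dict
    carries id, date, topics, paths, and keyword/semantic scores that stand
    in for the live FTS5 and ChromaDB lookups."""
    window = 365 if any(k in brief.lower() for k in HISTORY_KEYWORDS) else 90
    words = set(brief.lower().split())
    picked = set()
    for s in sessions:
        if (today - s["date"]).days <= window:   # 1. time window
            picked.add(s["id"])
        if words & set(s["topics"]):             # 2. topic overlap with the brief
            picked.add(s["id"])
        if words & set(s["paths"]):              # 3. file-path overlap
            picked.add(s["id"])
        if s["fts_score"] > 0:                   # 4. keyword match (FTS5 stand-in)
            picked.add(s["id"])
        if s["semantic_score"] > 0.7:            # 5. semantic match (Chroma stand-in)
            picked.add(s["id"])
    return picked
```

Because it is a union rather than an intersection, a session only needs to clear one of the five bars to reach stage 2; recall is cheap here and precision is the fan-out's job.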
Stage 2: fan-out
Three agents run in parallel. Agent A reads the topic-file index, then the topic files it thinks are relevant. Agent B reads the candidate compressed sessions from stage 1. Agent C walks the codebase via Glob and Grep. None of them reads the full raw session transcripts; they read the compressed summaries only.
Each agent returns two outputs. A set of relevant excerpts (bounded, deduplicated, with sources tagged). A list of session identifiers flagged for deeper reading. The flagged-session list is the way the fan-out says “the compressed summary is suggestive but not enough; someone should read the raw transcript before answering this part of the brief”.
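The two-output contract can be written down as a small schema. The type names here are mine, not the pipeline's, and the merge shows one plausible way to deduplicate across the three agents:

```python
from dataclasses import dataclass, field

@dataclass
class Excerpt:
    text: str
    source: str  # e.g. "topic:memory/topics/x.md" or "session:2026-04-01-a"

@dataclass
class AgentResult:
    excerpts: list[Excerpt]
    flagged_sessions: list[str] = field(default_factory=list)

def merge(results: list[AgentResult]) -> AgentResult:
    """Union the agents' outputs: dedupe excerpts by text (keeping the
    first-seen source) and flagged sessions by id, preserving order."""
    seen, excerpts = set(), []
    for r in results:
        for e in r.excerpts:
            if e.text not in seen:
                seen.add(e.text)
                excerpts.append(e)
    flagged = list(dict.fromkeys(s for r in results for s in r.flagged_sessions))
    return AgentResult(excerpts, flagged)
```

The flagged-session list is deliberately just identifiers: the fan-out defers the expensive raw-transcript read to the aggregator rather than doing it itself.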
Stage 3: aggregation
The aggregator reads the flagged session transcripts (pre-stripped to remove tool definitions and truncated tool outputs) and assembles the final context.md. The structure is fixed:
# Context for: <brief>
## Recent state
(from topics. What IS true now.)
## Relevant history
(from compressed + raw re-reads. How we got here, what we tried, what failed.)
## Unresolved threads
(flagged but not acted on; touches this brief)
## Files likely to touch
(from code fan-out. Targeted list, not dump.)
## Citations
(every claim tagged: [topic:path] | [session:id] | [raw:id] | [code:path:line])
Every claim in the context file is tagged by source. The tagging is mechanical, done at aggregation time, not self-reported by the model. This matters because the model’s own account of where information came from is untrustworthy; the tags are the record the system itself kept.
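Because the tag grammar is fixed, tagging can also be checked mechanically. A minimal lint pass over a generated context.md might look like this (the skip rules for headings and parenthetical scaffolding are a simplification of mine):

```python
import re

# Matches the tag grammar from the Citations section:
# [topic:path] | [session:id] | [raw:id] | [code:path:line]
TAG = re.compile(r"\[(topic|session|raw|code):[^\]\s]+\]\s*$")

def untagged_claims(context_md: str) -> list[str]:
    """Return claim lines that lack a trailing provenance tag.
    Headings, blank lines, and parenthetical scaffolding are skipped."""
    bad = []
    for line in context_md.splitlines():
        s = line.strip()
        if not s or s.startswith("#") or s.startswith("("):
            continue
        if not TAG.search(s):
            bad.append(s)
    return bad
```

A non-empty return means the aggregator emitted a claim the system cannot trace, which is exactly the failure mode the tags exist to catch.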
Stage 4: sub-session
The original task is spawned as a fresh Claude session with the brief and context.md, and nothing else. The sub-session has the full context it needs, pre-arranged, pre-deduplicated, with provenance. It does not need to search for anything, so it does not waste context window on search outputs. The whole context budget is pre-committed to substantive information.
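The spawn itself is a one-liner against whatever harness you use. The sketch below assumes Claude Code's headless `claude -p` invocation; the prompt wording is mine, and you would adapt the command to your own tooling:

```python
import subprocess
from pathlib import Path

def spawn_subsession(brief: str, context_path: Path) -> list[str]:
    """Build the command for a fresh headless session that sees only the
    brief and the manufactured context file (assumes `claude -p`)."""
    prompt = (
        f"{brief}\n\n"
        f"All context for this task is in {context_path}. "
        "Read it first; do not search elsewhere."
    )
    return ["claude", "-p", prompt]

cmd = spawn_subsession("Migrate the index writer", Path("context.md"))
# subprocess.run(cmd, check=True)  # uncomment to actually launch the sub-session
```

The important property is what is absent: no session resume, no auto-loaded memory, nothing the pipeline did not deliberately put in context.md.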
The single point of failure
The aggregator is the most dangerous component. If it fabricates, the whole pipeline fabricates. If it drops something load-bearing, the whole pipeline misses it. This is the piece I stress-tested before shipping.
The test was a golden-context benchmark. I picked three past tasks where I knew the right context ex post: a specific iOS SwiftUI state-capture bug, a LaunchAgent mass-failure root-cause analysis, and an architecture discussion about whether to decompose a monolith. For each task I hand-curated the ideal context.md from memory and session logs: every claim that should be there, every citation that should be there, every piece of failed-attempt history that should be there. Then I ran the aggregator against the same briefs and scored the generated output against the golden answer.
Ship threshold was 80% claim coverage. Below that, iterate. Above that, ship. The first version of the aggregator averaged 61% across the three benchmarks. The second version averaged 74%. The version that shipped averaged 87%. The scoring is partially automatic (citation-tag matching) and partially manual (I read the outputs and ticked off the claims myself on a spreadsheet). I would not build this pipeline again without the benchmark.
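The automatic half of the scoring reduces to a coverage ratio and a threshold check. This is a sketch of the arithmetic, not the scoring script itself; in practice matching was partly manual, whereas here it is exact set membership:

```python
def claim_coverage(golden: set[str], generated: set[str]) -> float:
    """Fraction of golden-context claims present in the generated context.
    Exact matching is a simplification; the real pass included manual tick-off."""
    if not golden:
        return 1.0
    return len(golden & generated) / len(golden)

def ship(score: float, threshold: float = 0.80) -> bool:
    """The release gate: at or above threshold ships, below iterates."""
    return score >= threshold
```

Note the gate is on coverage of the golden set, not on the size of the generated file: an aggregator that pads context.md with extra material gains nothing here.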
The benchmark is not a one-time gate. It is a regression test. Any future change to the aggregator prompt, the pre-filter, or the fan-out has to be rerun against the golden set before being accepted. This is the discipline that keeps the pipeline from drifting.
Precedence, or why topics beat compressed beats raw
The pipeline reads from three layers that can disagree. The governing rule is in CLAUDE.md:
## Memory precedence
When information conflicts across layers:
1. Topics. CURATED CURRENT TRUTH. Wins.
2. Compressed sessions. RECALL/ROUTING. Not authoritative on facts.
3. Raw JSONL. ARBITRATION when summary and topic disagree.
4. Chroma/FTS5. DERIVED, regenerate from source.
Compressed entries are APPEND-ONLY.
/dream may promote insights FROM compressed INTO topics, never the reverse.
Topics are the hand-maintained truth. Compressed sessions are summaries generated from raw sessions; they are routing, not facts. Raw sessions are the final arbiter when compressed summaries and topic files disagree. The vector and keyword indices are derived and can be regenerated at any time; they are not a source of truth.
This precedence matters because without it, a contamination event (a bad compressed summary gets written because the compression prompt had a bug that week) can bias retrieval for months. With the precedence rule, raw arbitrates. If a claim in context.md is sourced only from compressed sessions and conflicts with a topic, the topic wins. If the raw transcript shows something different from both, the raw transcript wins.
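The arbitration rule above is mechanical enough to write down. This is my reading of the CLAUDE.md rule as code, under the assumption that raw is only consulted when the curated and compressed layers disagree; derived indices never appear because they are not a source:

```python
def resolve(topic=None, compressed=None, raw=None):
    """Conflict resolution across memory layers: topics beat compressed;
    the raw transcript arbitrates when those two disagree; Chroma/FTS5
    are derived and never count as a source."""
    if topic is not None and compressed is not None and topic != compressed:
        # Disagreement between curated truth and a summary: go to raw.
        return raw if raw is not None else topic
    if topic is not None:
        return topic
    if compressed is not None:
        return compressed
    return raw
```

The contamination scenario maps directly onto the first branch: a bad week of compression prompts produces wrong `compressed` values, and the rule routes every resulting conflict to the transcript instead of letting the summary win by recency.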
When /deep-context runs and when it does not
Explicit invocation only. I do not auto-trigger it from plan mode, because the overhead (running the pipeline takes a few minutes and some compute) is only worth it for tasks that warrant the preparation.
The invocation heuristic: if a task warrants plan mode (architectural changes, migrations, multi-file refactors in core systems, anything with blast radius), it warrants /deep-context. If it is a small edit, a quick question, or iterating on a user interface, it does not.
I run /deep-context maybe twice a week. The rest of the time the default context load is sufficient and the overhead is not justified.
What it has caught
Three classes of problem, each of which I can name with specifics.
Forgotten failed attempts. Several times the pipeline surfaced a previous, abandoned attempt at the same problem. The reasons it had been abandoned were captured in a compressed summary of the old session. Without /deep-context, I would have repeated the attempt and hit the same failure. With it, the failure history was in context.md and I skipped it.
Un-migrated consumers. Architectural changes that touch multiple files frequently miss downstream consumers. The code fan-out stage explicitly searches for references to the identifiers being changed and lists them. Several times this list has been longer than I expected; the ones I had not thought of were the ones that would have broken.
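The consumer search is the easiest part of the fan-out to show concretely. In the real pipeline the agent uses Grep; this sketch does the same walk in pure Python, with the file glob and output shape as my assumptions:

```python
import re
import tempfile
from pathlib import Path

def find_consumers(root: Path, identifier: str, glob: str = "*.py") -> list[tuple[str, int]]:
    """Every (file, line-number) that references an identifier being
    changed: the list that catches un-migrated downstream consumers."""
    pat = re.compile(rf"\b{re.escape(identifier)}\b")
    hits = []
    for path in sorted(root.rglob(glob)):
        for n, line in enumerate(path.read_text().splitlines(), 1):
            if pat.search(line):
                hits.append((path.name, n))
    return hits

# Demo on a throwaway tree: one consumer file, one unrelated file.
root = Path(tempfile.mkdtemp())
(root / "writer.py").write_text("from store import write_index\nwrite_index()\n")
(root / "other.py").write_text("x = 1\n")
consumers = find_consumers(root, "write_index")
```

The word-boundary anchors matter: without them, renaming `write_index` would also flag `write_index_v2`, and the list stops being trustworthy.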
Policy reversals. In a few cases the pipeline surfaced a decision from months earlier that I had forgotten and was about to reverse without justification. The compressed summary captured the why behind the original decision. Sometimes reading it changed my mind; in other cases it confirmed the reversal was correct and I proceeded with the reasoning intact.
None of these is exotic. They are the normal consequence of being the only person maintaining a system that has accumulated several hundred decisions and several hundred thousand lines of transcript. They are the things a good human colleague with a long memory would catch, and I do not have that colleague. /deep-context is my substitute for one.
The limits
I want to be specific about where the pipeline does not help.
It does not help when the relevant information is not in any of the three stores. If a decision was made verbally and never written down, the pipeline cannot know about it. The pipeline’s coverage is the coverage of the written record. I maintain the written record carefully for exactly this reason.
It does not help when the task is small. The five minutes of overhead to run the pipeline is not justified for an edit that would take three minutes anyway. The rule of thumb is “if the task costs more than twenty minutes of my attention, the pipeline is worth it”.
It does not help when the task has no history. New kinds of work, where the relevant context is external documentation or a single specification, do not benefit from a pipeline that searches internal history. For those, I read the specification directly.
It does not replace judgement. The output of /deep-context is a context file, not a decision. The task that runs against it is a sub-session doing the work; the sub-session is still capable of getting the work wrong. The pipeline gives it the best possible starting point, not a guaranteed ending point.
Building it
The code lives under ~/code/deep_context/ on the Mac Mini. The skill definition is at ~/.claude/skills/deep-context/SKILL.md. The compressed-session schema is a small YAML frontmatter block plus a markdown body (target 500 to 1500 tokens per entry, hard cap 2000). The full pipeline, with pre-strip, complex-session routing (Sonnet or Opus depending on session complexity), unified index writes to ChromaDB and FTS5, and the aggregator prompt, is around 1,500 lines of Python. It will be published in the control plane repository along with the rest of the system.
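The schema caps are enforceable with a few lines. This validator is a sketch of mine, not the production check: it approximates token count as characters divided by four, which is a stated assumption rather than a real tokenizer:

```python
def validate_compressed(entry: str) -> list[str]:
    """Check a compressed-session entry against the schema caps: YAML
    frontmatter, then a markdown body targeting 500-1500 tokens with a
    2000-token hard cap. Token count approximated as chars/4 (assumption)."""
    problems = []
    if not entry.startswith("---\n"):
        problems.append("missing YAML frontmatter")
    body = entry.split("---", 2)[-1]  # text after the closing frontmatter fence
    tokens = len(body) // 4
    if tokens > 2000:
        problems.append(f"over hard cap: ~{tokens} tokens")
    elif not 500 <= tokens <= 1500:
        problems.append(f"outside target range: ~{tokens} tokens")
    return problems
```

Running a check like this at compression time keeps entries in the band where the fan-out can read dozens of them without blowing its own context budget.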
If you want to build a similar pipeline for your own setup, the three components I would strongly recommend inheriting are: the precedence rule (topics over compressed over raw), the golden-context benchmark (three tasks, 80% threshold), and the explicit-invocation policy (no auto-trigger). Everything else is tunable; those three are load-bearing.
The underlying shift
The broader change in how I work is that I have stopped treating context as a free resource. Before this pipeline, my assumption was that the model would read the relevant things when it needed to and the context window was “big enough”. Both assumptions were wrong for the kinds of tasks I care about most. The fix, once I made it, is not subtle: manufacture the context before the task starts, benchmark the manufacturing, enforce a precedence rule between layers. The pipeline is not elegant; it is a scaffold. But it is the scaffold that makes the important tasks reliable, and reliability is the binding constraint on how ambitious a task I am willing to start.
Context windows have grown dramatically. Context quality has not grown at the same rate. The gap is where the engineering happens.
The pipeline is working end-to-end as of 2026-04-23. The backfill to compress the full historical session corpus is paced against the subscription quota and will run over the coming days. Golden-context benchmark scores above are from the production aggregator; regressions require a new benchmark run before any aggregator change is accepted.