Feb 03, 2026 · 7 min read · Alignment · Position

Why Context Engineering — Not Scale — Is the Real Bottleneck

The most expensive thing in modern AI is not parameters or compute. It is the silent, underexamined choice of what the model gets to see at inference time.

“Context engineering” is the unglamorous cousin of prompt engineering, and it is doing most of the actual work behind every deployed system worth using. Retrieval, routing, summarization, memory, scratch buffers, tool schemas, role priming — these are not implementation details. They are the system. The model is a fixed asset; the context window is where design happens.

Three claims

One: for any model above a certain capability threshold, performance on a deployed task is governed by the quality of the context construction, not the size of the underlying model. Doubling parameters helps less than halving the irrelevant tokens in the prompt.

Two: the failure modes of long-running agents are almost entirely context failures — stale memory, conflicting evidence, role bleed (instructions meant for one persona leaking into another), tool-output overflow. Models do not “forget” in any meaningful sense; their context curators forget for them.

Three: we are evaluating context engineering with the wrong instruments. Single-turn benchmarks reward models that ignore context cleverly. Multi-turn benchmarks that hold context construction fixed reward models that compensate for bad context. Neither is what we want.

Figure 1. Stylized comparison: returns from doubling parameters vs. doubling investment in context curation. Numbers illustrative.
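What would the right instrument look like? A minimal sketch, under loud assumptions: hold the task and the model fixed, vary only the context constructor, and score the grid. Every name below (run_task, naive_concat, curated) is a hypothetical placeholder, not an API from any real harness.

```python
from typing import Callable

# A context constructor maps raw task input to an assembled prompt.
ContextBuilder = Callable[[str], str]

def naive_concat(task_input: str) -> str:
    # Baseline: dump everything into the window.
    return task_input

def curated(task_input: str) -> str:
    # Stand-in for real retrieval, rewriting, and budgeting.
    return task_input[:2000]

def evaluate_grid(
    run_task: Callable[[str], float],   # prompt -> task score, model held fixed
    builders: dict[str, ContextBuilder],
    inputs: list[str],
) -> dict[str, float]:
    """Score each context-construction strategy on the same inputs."""
    return {
        name: sum(run_task(build(x)) for x in inputs) / len(inputs)
        for name, build in builders.items()
    }
```

The point is the axis being varied: the constructor changes and the model does not, which is the opposite of both benchmark families above.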

What this looks like in practice

Concretely, the wins I keep seeing in deployed systems come from the same handful of moves: retrieval that rewrites the query before searching, summaries that are structured rather than free-text, hard token budgets for tool output with explicit truncation policies, and an explicit memory schema the agent is allowed to update rather than a free-form scratchpad.
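Here is a minimal sketch of the last two moves: hard token budgets with an explicit truncation policy, and a typed memory schema updated through explicit operations rather than appended to. Everything here is illustrative; the budget numbers are made up, and count_tokens is a crude stand-in for a real tokenizer. Query rewriting is usually a separate, cheap model call, so it is omitted.

```python
from dataclasses import dataclass, field

def count_tokens(text: str) -> int:
    # Crude stand-in; use a real tokenizer (e.g. tiktoken) in practice.
    return max(1, len(text) // 4)

def truncate(text: str, budget: int) -> str:
    """Explicit truncation policy: keep the head, mark the cut. Never overflow silently."""
    if count_tokens(text) <= budget:
        return text
    return text[: budget * 4] + "\n[truncated: over token budget]"

@dataclass
class Memory:
    """A schema the agent updates, not a free-form scratchpad."""
    goal: str = ""
    facts: dict[str, str] = field(default_factory=dict)
    open_questions: list[str] = field(default_factory=list)

    def set_fact(self, key: str, value: str) -> None:
        self.facts[key] = value  # overwrite by key, so stale values cannot pile up

    def render(self) -> str:
        # Structured, fixed-field output rather than free text.
        lines = [f"GOAL: {self.goal}"]
        lines += [f"FACT {k}: {v}" for k, v in sorted(self.facts.items())]
        lines += [f"OPEN: {q}" for q in self.open_questions]
        return "\n".join(lines)

# Per-section caps; the numbers are placeholders, not recommendations.
BUDGETS = {"system": 400, "memory": 800, "retrieval": 2000, "tool_output": 1200}

def assemble_context(sections: dict[str, str]) -> str:
    """Fixed section order plus hard caps keeps the prompt layout stable."""
    parts = [
        f"## {name}\n{truncate(sections[name], cap)}"
        for name, cap in BUDGETS.items()
        if name in sections
    ]
    return "\n\n".join(parts)
```

The same schema discipline is what guards against the failure modes in claim two: an overwritable fact table cannot accumulate stale memory, and a hard tool_output cap cannot overflow.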

None of this is research-paper exciting. All of it works.

Why this is unfashionable

Context engineering does not show up as a line on a scaling chart. It does not produce a leaderboard number. It is system-design work, and system design is hard to publish. So we publish the model and underplay the context, and then wonder why open replications underperform on the same task.

I don't have a tidy conclusion. I think we should be writing more papers about prompts, retrievers, and memory, and we should be embarrassed that we treat them as mere engineering rather than as science.