Everyone is talking about model capability. Which model scores highest on SWE-Bench. Whether Opus beats Codex at agentic tasks. How many tokens per second Mercury 2 can push. But after two months of watching the best engineering teams ship with AI agents, I’m convinced the conversation is aimed at the wrong layer.
The model is not the product. The harness is.
What’s a harness, and why should you care?
A harness is the software layer that wraps an LLM and turns it into something useful. It assembles prompts, exposes tools, tracks file state, applies edits, runs commands, manages permissions, caches context, and stores memory. Sebastian Raschka put it best in his deep dive on coding agent architecture: “much of the apparent ‘model quality’ is really context quality.”
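To make that concrete, here is a minimal sketch of a harness loop. Everything in it is invented for illustration (the `Harness` class, the `CALL name:arg` tool convention, the stubbed model): the point is only that the harness, not the model, owns prompt assembly, tool dispatch, and state.

```python
# A deliberately tiny harness: the model is stubbed out, but the loop
# shows the division of labor -- the harness assembles context,
# executes tools, and keeps memory; the model only maps text to text.
from dataclasses import dataclass, field

@dataclass
class Harness:
    tools: dict = field(default_factory=dict)   # name -> callable
    memory: list = field(default_factory=list)  # conversation/tool history

    def assemble_prompt(self, user_msg: str) -> str:
        # Context quality lives here: tool docs + history + the message.
        tool_docs = "\n".join(f"- {name}" for name in self.tools)
        history = "\n".join(self.memory[-10:])  # crude context-window cap
        return f"TOOLS:\n{tool_docs}\nHISTORY:\n{history}\nUSER: {user_msg}"

    def run(self, user_msg: str, model) -> str:
        reply = model(self.assemble_prompt(user_msg))
        if reply.startswith("CALL "):           # e.g. "CALL read_file:app.py"
            name, _, arg = reply[5:].partition(":")
            result = self.tools[name](arg)      # the harness executes, not the model
            self.memory.append(f"TOOL {name} -> {result}")
            return self.run(user_msg, model)    # loop until a plain answer
        self.memory.append(f"ASSISTANT: {reply}")
        return reply
```

Real harnesses add permissions, caching, and error handling around every one of these steps; the skeleton is the same.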
Think about it. Claude Code, Cursor, Codex, Devin. They all use frontier models. The models are increasingly similar in raw capability. What separates them is everything around the model. The control loop. The context window management. The tool definitions. The memory system. The prompt engineering.
When Anthropic accidentally leaked Claude Code’s source in March, the most interesting thing wasn’t the model. It was the 30+ conditional components that get assembled into a system prompt. Half of them appear dynamically based on your environment, your configuration, your tools. That’s not a wrapper. That’s a product.
Context engineering is the new software engineering
Here’s what I keep seeing in the best teams: the engineers who get the most out of AI agents aren’t the ones with the best prompts. They’re the ones who build the best context.
This happens at three levels. LangChain’s piece on continual learning for AI agents breaks it down clearly: learning can happen at the model layer (weights), the harness layer (code and instructions), or the context layer (configuration and memory). Most people immediately jump to “we need to fine-tune the model” when what they actually need is better harness design.
The Claude Code leak proved this. The system doesn’t just send your message to Claude and wait. It reads your git status. It checks which shell you’re running. It loads your CLAUDE.md files. It discovers your MCP servers. It compresses old conversation history. It manages subagent spawning with bounded context. All before the model sees a single token.
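A toy version of that pre-flight gathering might look like this. This is a sketch of the idea, not Claude Code's actual implementation; `gather_context` and its return shape are hypothetical.

```python
import os
import subprocess
from pathlib import Path

def gather_context(cwd: str = ".") -> dict:
    """Collect environment facts before the model sees a single token."""
    root = Path(cwd).resolve()
    ctx = {"shell": os.environ.get("SHELL", "unknown"), "cwd": str(root)}
    try:
        # Working-tree state, one file per line.
        ctx["git_status"] = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, cwd=cwd, timeout=5,
        ).stdout
    except (OSError, subprocess.TimeoutExpired):
        ctx["git_status"] = "(unavailable)"
    # Project instructions: collect CLAUDE.md files from the root of the
    # filesystem down to cwd, nearest last so it can override its parents.
    ctx["instructions"] = [
        str(p / "CLAUDE.md")
        for p in [*reversed(root.parents), root]
        if (p / "CLAUDE.md").is_file()
    ]
    return ctx
```

The output of something like this, not the user's raw message, is what actually fills the context window.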
This is context engineering. And it’s the skill that matters most right now.
The spec layer: constraining agents instead of prompting them
Matt Rickard coined the term “spec layer” to describe what sits between human intent and machine execution. His argument is simple but powerful: when decisions aren’t documented, agents must redecide them repeatedly within limited context windows. Written specs narrow the search space.
This reframes how we think about working with agents. Instead of asking “how do I prompt this better?”, the question becomes “how do I constrain this better?” Specs constrain intent. Plans constrain approach. Tasks constrain sequencing. Tests, schemas, and linters constrain behavior. Harnesses constrain execution.
The practical takeaway: if a rule can be enforced mechanically, get it out of your prompt and into a linter, a schema, a test, or the harness itself. Use less prose. Enforce more. Agents don’t need better instructions. They need tighter guardrails.
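For instance, a prompt rule like "avoid magic numbers" can become a small AST check that fails CI. This is a crude illustrative sketch (the `find_magic_numbers` helper and its allowlist are invented, and real linters handle many cases this doesn't), but it shows the move from prose to enforcement.

```python
import ast

def find_magic_numbers(source: str, allowed=frozenset({0, 1})) -> list:
    """Flag bare numeric literals in Python source.

    A mechanical stand-in for the prompt rule 'avoid magic numbers':
    the agent no longer has to remember it, because CI enforces it.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            if node.value not in allowed:
                violations.append((node.lineno, node.value))
    return violations
```

Run it over a diff in CI and a violation becomes a failed build rather than a forgotten instruction.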
Silent drift: the failure mode nobody talks about
The most dangerous thing about AI agents isn’t when they fail loudly. It’s when they fail silently.
One engineer documented finding agent-generated code that compiled, passed all tests, and looked perfectly reasonable. Except it had quietly reverted a design system migration, introduced magic numbers into CSS, and bypassed naming conventions that took months to establish. Everything worked. Nothing was right.
This is “silent drift.” Agents lack the intuitive understanding that makes experienced engineers catch architectural violations. A senior dev looks at a PR and thinks “this doesn’t feel right.” An agent looks at the same code and sees that it passes the test suite.
The fix isn’t better prompts. It’s better feedback loops. Convert your recurring PR comments into lint rules. Add screenshot tests for critical pages. Set complexity limits that force decomposition. Make the right behavior the only behavior that passes CI.
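A complexity limit, for example, can be a few lines of Python in CI rather than a sentence in a prompt. A rough sketch: the branch-counting heuristic and the threshold of 8 are assumptions, not a real cyclomatic-complexity tool.

```python
import ast

COMPLEXITY_LIMIT = 8  # assumed threshold; tune per codebase

def function_complexity(fn: ast.FunctionDef) -> int:
    # One point per branch point -- a rough cyclomatic-complexity proxy.
    return 1 + sum(
        isinstance(node, (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp))
        for node in ast.walk(fn)
    )

def check(source: str, limit: int = COMPLEXITY_LIMIT) -> list[str]:
    # Returns one message per function over the limit; empty means pass.
    return [
        f"{fn.name} (complexity {c}) exceeds limit {limit}"
        for fn in ast.walk(ast.parse(source))
        if isinstance(fn, ast.FunctionDef)
        and (c := function_complexity(fn)) > limit
    ]
```

Wire the non-empty case to a nonzero exit code and the decomposition the limit forces happens on every commit, agent-authored or not.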
As one practitioner put it: “Linters don’t sleep, and CI doesn’t get tired. That’s more than I can say for myself.”
MCP vs Skills: the two layers of a harness
The MCP vs Skills debate is really a debate about harness architecture. And it reveals something important about how this layer is evolving.
MCP (Model Context Protocol) is a connector layer. It gives agents actual access to services: calendars, browsers, databases, APIs. The agent doesn’t need to understand the “how.” It just needs to know the “what.” This is why MCP hit 97 million installs by March and why every major AI lab shipped support for it.

Skills are a knowledge layer. They teach agents how to think about problems, what patterns to follow, what conventions to respect. Pure documentation that shapes behavior without providing direct access.
The mistake most teams make is conflating the two. They build Skills that depend on CLI tools (which half of AI environments can’t run) or they try to encode business logic into MCP servers (which should be standardized connectors, not business rules).
The clean architecture: MCP for access, Skills for knowledge. Connectors and manuals. Both part of the harness, serving different purposes.
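In code, the distinction is easy to see. This example is purely illustrative: `query_calendar` stands in for an MCP-style connector and `SCHEDULING_SKILL` for a Skill; neither reflects the actual MCP wire protocol or Skills file format.

```python
# Connector layer: an MCP-style tool grants *access* -- it executes
# something and returns data. (In a real server this would call a
# calendar API; stubbed here so the example is self-contained.)
def query_calendar(date: str) -> list[dict]:
    return [{"date": date, "title": "standup", "time": "09:30"}]

# Knowledge layer: a Skill is inert text that shapes *behavior*.
# It grants no access; the harness simply loads it into context.
SCHEDULING_SKILL = """\
When scheduling, prefer the attendee's working hours,
never double-book, and confirm the timezone explicitly.
"""

def assemble(agent_prompt: str) -> str:
    # The harness composes both layers: manuals go into the prompt,
    # connectors go into the tool list the model can call.
    return f"{SCHEDULING_SKILL}\n{agent_prompt}"
```

Business rules belong in the Skill text, where they are cheap to change; the connector stays a dumb, standardized pipe.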
Extended thinking is load-bearing
One of the most striking findings from the last two months came from an engineer who meticulously tracked Claude Code’s performance before and after Anthropic reduced thinking token visibility. The data is damning.
When thinking depth was reduced, the read-to-edit ratio collapsed from 6.6 to 2.0. The model stopped reading code before modifying it. Stop hook violations jumped from zero to 173. User frustration indicators nearly doubled. Session autonomy dropped from 30+ minute runs to stalling every couple of minutes.
The same user effort produced drastically worse results. The engineer estimates that degraded thinking wastes 15-20x more compute per useful outcome than full thinking depth. Cutting thinking tokens to save money actually costs more.
This is a harness insight, not a model insight. The thinking tokens aren’t just the model “being smart.” They’re structurally required for planning, convention adherence, and self-correction. The harness architecture that allocates token budgets determines whether you get a capable agent or an expensive autocomplete.
What this means for engineering teams
If you’re building with AI agents or building products that use them, here’s the shift: stop optimizing for model selection and start optimizing for harness design.
Practically:
Invest in context, not prompts. Build systems that automatically gather relevant context (git status, project conventions, architectural decisions) before the model sees anything. The best prompt in the world can’t compensate for missing context.
Enforce mechanically, instruct minimally. Every rule in your CLAUDE.md that could be a lint rule or a CI check is a rule that will eventually be violated. Move constraints downstream into deterministic enforcement.
Design your feedback loops. The gap between “agent makes a mistake” and “something catches it” is the gap that determines your code quality. Shrink it relentlessly.
Treat your harness as a product. Version it. Test it. Iterate on it. The difference between a team that ships well with AI and a team that fights it daily is almost always in the harness, not the model.
The model race gets all the headlines. But the teams that are actually shipping great software with AI agents right now? They figured out that the harness is where the leverage lives.