⚙️ Harness Engineering

Introduction

You would have heard “AI Agent” frequently and many times

Most of the time it would be related to “SOTA Models”, “Bigger Context Window”, “Managing Tools” etc.,

But one thing that will make Agent work in Production, it’s not the Model alone but also everything around it

That Everything Around It has name now……

Harness

💡 Inception

💭 Ah, Story? Please don’t worry we won’t see boring one and let us keep short

On February 5, 2026, Mitchell Hashimoto published a short Personal Blog Post describing a habit he had Developed, like every time one of his AI agents made a mistake, he didn’t just patch the Prompt but he Engineered a permanent fix into the Agent’s Environment so that specific mistake became structurally impossible to repeat

He called the practice “Engineering the Harness”

The core principle, in his own framing :

“Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.”

Few days later, OpenAI, Anthropic and everyothers took the direction

💭 Then? Then, And we get the Vocabulary 😉

⚙️ Harness Engineering

We could roughly and generically say,

Agent = Model + Harness

That’s the whole insight

🧠 The Model

The Model, the LLM, is the Brain

It takes Input, Reasons, and Generates Completions

As we all know, That is all it does

It cannot remember yesterday’s conversation, It cannot Execute Code on its own(without Tool), it cannot retry when something fails

When an LLM says “I want to search the web,” it isn’t searching anything, it is producing a String of Text i.e. Completions, that describes the intent to search

🐴 The Harness

The Harness is everything wrapped around that Brain which is here LLM, the System that converts Raw Reasoning into Reliable Action :

  • Tool Execution : Actually Running the Search, Executing Code, the Bash Command, the API Call, etc.,
  • Memory and State Management : Remembering what happened across Steps and Sessions
  • Context Management : Deciding what the Model sees and when to Compact it
  • Guardrails and Safety Constraints : Limits like “max five Tool Calls” that prevent runaway Loops
  • Error Handling and Retries : Recovering when something inevitably breaks
  • The Agent Loop : The Orchestration Logic that Drives Iteration
  • Verification Steps : Deterministic checks (like, Linters, Tests, Validators) that confirm the work was actually done

🤝 The Agent

The Agent is what emerges when the Brain and the Harness work together

Simple Analogy

This is just for understanding

The Horse is Raw Power

The Harness : reins, saddle, road boundaries which is structure and control

The Horse without a Harness is impressive but unsteerable. A Harness without a Horse is just leather

And putting them together, and you have a working Transport System, that’s an Agent

Why a Harness, and Why Now? 🤔

The one word answer is Reliability

Most builders are paying for Inference, we work against Rate Limits, and we usually treat the Model itself as a Black Box

But we can control is the Environment around the Model and that Environment Determines whether the Agent succeeds 30% of the time or 95% of the time on the same Task with the same Model

Here is what Numbers show :

  • On SWE-bench, swapping the Harness can shift scores by more than 20 points and swapping the Model typically shifts them by 1 ([2] Morph, 2026)

So, the Harness is to an Agent what a climbing Harness is to a Mountaineer, an anchor to reality

Its job is to keep the Agent from “going off the rails” regardless of which black box Model would happen to be Running underneath

🏗️ The Anatomy of a Production Harness

A real world Harnessed Agent Runtime usually contains five layers working together

  1. Tool Registry : The set of Tools the Agent can call (File I/O, Shell, Search, Internal APIs, etc.,), each with a clear Schema
  2. Context Primitives : Utilities that automatically Compact and manage the Context Window so the Agent stays operational across long Runs
  3. Guardrails : Hard limits (max steps, max Tokens, forbidden actions, allowed domains) that can stop a Run before it does damage
  4. The Agent Loop : The Orchestration Code that Iterates like receive Output from LLM -> Parse -> Execute Tools -> Feed Results back -> Repeat
  5. The Verify Step : A deterministic check at the end. Did the Linter pass? Did the Test suite go green? Is the Output Schema Valid? If not, the work isn’t done, no matter what the Model claims(Some times in Ralph Loops)

This is also where the discipline reveals itself as deeply cumulative

Some Call it the Ratchet Principle : the Harness only tightens, never loosens

Every guide you add traces back to a real failure. Every sensor traces back to a real Bug

Over time, the Harness accumulates institutional knowledge about what goes wrong, and what is been permanently fixed

All these are now easily available in LangChain, LangGraph, Claude Agent SDK, etc.,

Examples Systems for Harness Engineering

💭 What Example 1, 2, …..? Don’t worry, Only 2 examples with very brief explanation, please read through

💻 Example 1 : Claude Code or Cursor or Any Agentic Engineering (Coding) Tools :

This might be the widely Deployed Harness in production today

  • A Developer working with Claude Code drops a CLAUDE.md file in the project root that tells the Agent : “TypeScript strict mode, no any, Functional Components only, tests live next to source files.” That’s a guide

  • Then they add a .claude/settings.json with explicit deny rules : Bash(rm -rf:*), etc., so the Agent literally cannot run a destructive Command, no matter how confused it gets

  • Finally, they wire up a PostToolUse Hook that runs npx tsc --noEmit after every File Edit. If the type check fails, the Agent sees the Error immediately and fixes it before moving on

  • That is a Sensor (feedback)

LangChain reported a 13.7 point jump on Terminal Bench 2.0 just by changing the Harness, the same Model, the Model didn’t get Smarter, but the Environment did ([1] Trivedy, 2026)

🌐 Example 2 : A browser-use Agent Booking Flights

An Agent is asked to book a flight

Without a Harness, it might happily click “Confirm Purchase” on a $4,000 mistake

With a Harness :

  • The Tool registry doesn’t expose a generic click(coordinates) for any element matching confirm|purchase|pay — those require explicit human-in-the-loop confirmation

  • A Sensor takes a screenshot after every Click and runs a Fast vision check: “Are we on a payment page? Is the total above the user’s stated budget?” If yes, halt

  • Credentials are injected by the Harness from a vault, never visible to the Model

The LLM’s “judgment” never has to be right about money, the Harness is the thing standing between it and your Credit Card

Additional Information (You could skip if not needed)

The Word “Harness” Didn’t Come From Nowhere

Before AI Engineering borrowed it, “Harness” already had a long history in Software

A Test Harness :

A concept dating back decades in Classical Software Engineering which is a collection of Stubs, Drivers, Scripts, and Test Data that surrounds a Piece of Code so it can be exercised in a controlled, repeatable way

Sounds familiar? It should

  • A Test Harness wraps Code with the Environment it needs to behave reliably under Test
  • An AI Harness wraps a Model with the Environment it needs to behave reliably in Production

Same instinct, new layer

A few related cousins from the same family of ideas:

Evaluation harnesses (often called Eval Harnesses)

Frameworks like EleutherAI’s lm-evaluation-harness that standardize how LLMs are benchmarked across thousands of Tasks

Benchmark Harnesses in Systems Engineering

Controlled environment for Measuring Performance, Throughput, or Latency under Load

Wiring Harnesses in Hardware

So, the concept through all of them is the same : a Harness is the Structured Environment that turns an unpredictable component into a dependable one

Harness Engineering for AI Agents is the natural extension of an Idea

Conclusion

The Model Thinks

The Harness acts

The Agent is what you get when both work together

When your Agent fails in Production, generally it is to blame the Model, swap it out or Fine Tune it or Prompt it Harder

Resist that reflex, The Model is usually(not all the case) not the problem

Fix the Harness

That is the discipline or way of 2026

And once you see it, you can’t unsee it

References

[1] : Trivedy, V. (2026, February 17). Improving Deep Agents with harness engineering. LangChain Blog. https://www.langchain.com/blog/improving-deep-agents-with-harness-engineering

[2] : Morph. (2026, March 17). Best AI for Coding (2026): Every Model Ranked by Real Benchmarks. https://www.morphllm.com/best-ai-model-for-coding

Happy Building! 🎉

Thanks for Reading!