Everyone talks about the model. Nobody talks about the harness.
Give Claude Sonnet or GPT-4o a chat interface and you get a conversational AI. Wrap it in a loop that can call external tools, maintain state across turns, enforce budget limits, and validate its own outputs — and you get an agent. The difference isn’t the LLM. It’s everything around the LLM.
The AWS team published a guide on “agent harnesses” this week, and it got me thinking: most tutorials show you how to call an LLM or how to register a tool. Almost none show you the orchestration layer that makes those individual pieces behave as a coherent system.
I’ve built agents that run autonomously on production infrastructure 24/7. The mistakes I made early on weren’t about picking the wrong model. They were about skipping the harness — assuming the model would “just figure it out.” It won’t. The harness is what makes an agent reliable, and reliability is the only metric that matters once you move past the demo phase.
Here’s how to build one from scratch.
## What Is an Agent Harness, Really?
An agent harness is the execution environment that sits between the user and the LLM. It’s not the prompt. It’s not the model. It’s the infrastructure that:
- Manages the conversation loop — receiving input, calling the model, routing tool calls, feeding results back, repeating until termination
- Registers and dispatches tools — maintaining a catalog of callable functions, validating arguments, executing them safely, and returning structured results
- Maintains memory — storing conversation history, injecting relevant context, compressing old messages to stay within context limits
- Enforces guardrails — limiting token budgets, capping tool call counts, preventing infinite loops, blocking dangerous actions
- Handles failures — retrying on transient errors, degrading gracefully when a tool is unavailable, escalating to human review when confidence is low
Without a harness, you have a stateless API call. With a harness, you have a system.
## The Minimal Agent Harness
Let’s start with the smallest useful version. A harness needs three things: a model interface, a tool registry, and a loop.
```python
import json
from typing import Callable, Any
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict  # JSON Schema
    fn: Callable

class AgentHarness:
    def __init__(self, model, system_prompt: str = ""):
        self.model = model
        self.system_prompt = system_prompt
        self.tools: dict[str, Tool] = {}
        self.max_iterations = 10

    def register_tool(self, tool: Tool):
        self.tools[tool.name] = tool

    def tool_list(self) -> list[dict]:
        return [
            {"type": "function", "function": {
                "name": t.name, "description": t.description,
                "parameters": t.parameters,
            }}
            for t in self.tools.values()
        ]

    def run(self, user_input: str) -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input},
        ]
        for _ in range(self.max_iterations):
            response = self.model.chat(
                messages=messages, tools=self.tool_list() if self.tools else None,
            )
            if not response.tool_calls:
                return response.content
            messages.append(response.message)
            for call in response.tool_calls:
                tool = self.tools.get(call.function.name)
                if not tool:
                    result = f"Error: Unknown tool '{call.function.name}'"
                else:
                    try:
                        args = json.loads(call.function.arguments)
                        result = tool.fn(**args)
                    except Exception as e:
                        result = f"Error: {type(e).__name__}: {e}"
                messages.append({"role": "tool", "content": str(result), "tool_call_id": call.id})
        return "Max iterations reached."
```
That’s the skeleton. It loops: call model, check for tool calls, execute, feed back. Seven lines of core logic. It works for demos. It breaks in production. Let’s see why.
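To see the dispatch step in isolation, here's the error-tolerant tool execution from the loop above, run against a hand-built stand-in for a model's tool call. The `SimpleNamespace` shapes mimic an OpenAI-style response object; real client libraries return their own types, so adapt the attribute names accordingly:

```python
import json
from types import SimpleNamespace

# Hand-built stand-in for one tool call a model might emit.
# The shape (.function.name, .function.arguments) mimics OpenAI-style
# responses; it is illustrative, not a real client object.
call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="add", arguments='{"a": 2, "b": 3}'),
)

tools = {"add": lambda a, b: a + b}

fn = tools.get(call.function.name)
if fn is None:
    result = f"Error: Unknown tool '{call.function.name}'"
else:
    try:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        result = fn(**args)
    except Exception as e:
        # Capture the failure as text so it can be fed back to the model
        result = f"Error: {type(e).__name__}: {e}"

print(result)  # 5
```

Note that failures become strings rather than raised exceptions: the model, not the caller, is the one that has to react to them.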
## Problem 1: The Tool Registry Lies
You register a tool, the agent calls it, and it crashes because input validation is wrong. The tool description promised certain parameters, the model complied, but the underlying function has tighter requirements. This isn’t the model’s fault — it’s a harness problem: the tool registry should validate before dispatch.
```python
class ToolRegistry:
    def __init__(self):
        self.tools: dict[str, Tool] = {}
        self.call_counts: dict[str, int] = {}

    def register(self, tool: Tool):
        self.tools[tool.name] = tool
        self.call_counts[tool.name] = 0

    def validate_call(self, tool_name: str, arguments: dict) -> tuple[bool, str]:
        if tool_name not in self.tools:
            return False, f"Unknown tool: {tool_name}"
        schema = self.tools[tool_name].parameters
        for required in schema.get("required", []):
            if required not in arguments:
                return False, f"Missing required parameter: {required}"
        for arg_name in arguments:
            if arg_name not in schema.get("properties", {}):
                return False, f"Unexpected parameter: {arg_name}"
        return True, "OK"

    def execute(self, tool_name: str, arguments: dict) -> Any:
        # Assumes validate_call() has already passed for this call
        self.call_counts[tool_name] += 1
        return self.tools[tool_name].fn(**arguments)
```
The registry acts as a gatekeeper, not just a dispatcher. Before any tool fires, the harness confirms the tool exists, every required parameter is present, and no hallucinated parameters slipped through. In my experience this catches 60-70% of tool-call errors before they reach application code.
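A condensed, function-level version of that validation step shows what gets caught. The schema and the hallucinated `location` argument below are invented for the demo:

```python
# Condensed version of the registry's validate_call, as a standalone function.
def validate_call(schema: dict, arguments: dict) -> tuple[bool, str]:
    for required in schema.get("required", []):
        if required not in arguments:
            return False, f"Missing required parameter: {required}"
    for name in arguments:
        if name not in schema.get("properties", {}):
            return False, f"Unexpected parameter: {name}"
    return True, "OK"

weather_schema = {
    "type": "object",
    "properties": {"city": {"type": "string"}},
    "required": ["city"],
}

# The model hallucinated 'location' instead of the declared 'city':
print(validate_call(weather_schema, {"location": "Paris"}))
# -> (False, 'Missing required parameter: city')
print(validate_call(weather_schema, {"city": "Paris"}))
# -> (True, 'OK')
```

The rejection message goes back to the model as the tool result, which usually prompts it to retry with the declared parameter name.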
## Problem 2: Memory Bloat Kills Context
Ten turns in, the conversation contains the original prompt, four tool call/response pairs, and a partial draft. The context window is filling up. By turn 20, the model starts forgetting the system prompt. The solution is intelligent context management: compress what you don’t need, preserve what you do.
```python
import tiktoken
from dataclasses import dataclass

@dataclass
class MemoryConfig:
    max_context_tokens: int = 64_000
    keep_recent_messages: int = 8
    always_preserve_system: bool = True

class AgentMemory:
    def __init__(self, config: MemoryConfig):
        self.config = config
        self.messages: list[dict] = []
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def add(self, role: str, content: str, **kwargs):
        self.messages.append({"role": role, "content": content, **kwargs})

    def get_messages(self) -> list[dict]:
        total = sum(len(self.encoder.encode(m.get("content", ""))) + 4 for m in self.messages)
        if total <= self.config.max_context_tokens:
            return self.messages
        return self._compress()

    def _compress(self) -> list[dict]:
        keep = self.config.keep_recent_messages
        system_msg = None
        if self.config.always_preserve_system:
            system_msgs = [m for m in self.messages if m["role"] == "system"]
            if system_msgs:
                system_msg = system_msgs[0]
        recent = self.messages[-keep:]
        old = self.messages[:-keep]
        if not old:
            return [system_msg] + recent if system_msg else recent
        # Summarize old messages (in production, call a cheap model like Haiku).
        # Here: a crude keyword filter that keeps tool-related lines.
        old_text = "\n".join(f"[{m['role']}]: {m.get('content', '')[:200]}" for m in old)
        summary = " | ".join(
            [line[:100] for line in old_text.split("\n")
             if any(kw in line.lower() for kw in ["tool", "result", "error"])][:10]
        )
        compressed = [{"role": "system", "content": f"[EARLIER CONTEXT: {summary}]"}]
        if system_msg:
            compressed = [system_msg] + compressed
        compressed.extend(recent)
        return compressed
```
Treat the context window like OS memory: recent messages are your hot cache, old messages are swap space, and the system prompt is kernel memory — never page it out.
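The memory class above depends on tiktoken, which fetches encoding files on first use. When that dependency (or network access) isn't available, a character-based estimate is a common fallback. The function names here are mine, and the chars-per-token ratio is a rule of thumb for English text, not an exact count:

```python
def estimate_tokens(text: str) -> int:
    # ~4 characters per token is a common rule of thumb for English text
    return max(1, len(text) // 4)

def estimate_message_tokens(messages: list[dict]) -> int:
    # +4 per message mirrors the per-message overhead used by AgentMemory above
    return sum(estimate_tokens(m.get("content", "")) + 4 for m in messages)

msgs = [{"role": "system", "content": "You are a helpful agent."},
        {"role": "user", "content": "Summarize the last deploy log."}]
print(estimate_message_tokens(msgs))  # rough count; typically within ~20% for English
```

Since the estimate only gates *when* compression triggers, being off by 20% costs little: pick a conservative `max_context_tokens` and the heuristic is good enough.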
## Problem 3: The Loop Runs Forever
The model enters a reasoning spiral. It calls `search_database`, gets a result, calls it again with slightly different parameters, and repeats indefinitely. Tokens pile up. Budget enforcement is the most critical guardrail, and it belongs in the harness, not the prompt.
```python
from dataclasses import dataclass
import time

@dataclass
class BudgetConfig:
    max_tokens: int = 30_000
    max_tool_calls: int = 25
    max_time_seconds: float = 300.0
    max_per_tool_calls: int = 5

class BudgetEnforcer:
    def __init__(self, config: BudgetConfig):
        self.config = config
        self.tokens_used = 0
        self.tool_calls_total = 0
        self.tool_calls_per_tool: dict[str, int] = {}
        self.start_time = time.time()

    def record_tokens(self, input_tokens: int, output_tokens: int):
        self.tokens_used += input_tokens + output_tokens

    def record_tool_call(self, tool_name: str):
        self.tool_calls_total += 1
        self.tool_calls_per_tool[tool_name] = self.tool_calls_per_tool.get(tool_name, 0) + 1

    def check(self) -> str | None:
        if self.tokens_used >= self.config.max_tokens:
            return f"Token budget exceeded: {self.tokens_used} (limit {self.config.max_tokens})"
        if self.tool_calls_total >= self.config.max_tool_calls:
            return f"Tool call budget exceeded: {self.tool_calls_total}"
        if time.time() - self.start_time >= self.config.max_time_seconds:
            return "Time budget exceeded"
        for tool, count in self.tool_calls_per_tool.items():
            if count >= self.config.max_per_tool_calls:
                return f"Per-tool limit: '{tool}' called {count} times"
        return None
```
Four budgets, any of which stops the agent before costs spiral: token budget, tool call budget, time budget, and per-tool budget.
## Problem 4: Errors Swallowed, Not Handled
A tool call raises `ConnectionError`. The harness catches it, returns `"Error: ConnectionError"`, and the model gets confused. It doesn’t know whether it should retry, try a different tool, or give up. Error formatting is an agent design problem. The model needs structured error messages that tell it what went wrong and what to do next.
```python
from enum import Enum
from dataclasses import dataclass

class ErrorType(Enum):
    TRANSIENT = "transient"      # retrying may succeed
    PERMANENT = "permanent"      # retrying the same call will fail again
    UNAVAILABLE = "unavailable"  # the capability is down entirely

@dataclass
class ToolError:
    error_type: ErrorType
    message: str
    suggestion: str

def format_tool_error(error: ToolError) -> str:
    parts = [f"[TOOL ERROR: {error.error_type.value.upper()}]"]
    parts.append(error.message)
    if error.suggestion:
        parts.append(f"Suggested action: {error.suggestion}")
    return "\n".join(parts)
```
Examples:
- Transient: Rate limit hit → “Retry with different parameters or try an alternative tool.”
- Permanent: DELETE query rejected → “Use SELECT queries to read data instead.”
- Unavailable: Weather service down → “Inform the user data is unavailable.”
A bare exception traceback tells the model nothing. A structured error with a suggested action gives it a decision tree.
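To produce those structured errors, the harness needs a mapping from raw exceptions onto the taxonomy. One possible mapping is sketched below, with the error classes repeated so the block is self-contained; the specific `isinstance` checks and suggestion strings are assumptions to tune for your own tool set:

```python
from enum import Enum
from dataclasses import dataclass

class ErrorType(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    UNAVAILABLE = "unavailable"

@dataclass
class ToolError:
    error_type: ErrorType
    message: str
    suggestion: str

def classify_exception(exc: Exception) -> ToolError:
    # Network blips and timeouts: worth one retry
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ToolError(ErrorType.TRANSIENT, str(exc),
                         "Retry once; if it fails again, try an alternative tool.")
    # Bad arguments or forbidden operations: retrying identically is pointless
    if isinstance(exc, (ValueError, TypeError, PermissionError)):
        return ToolError(ErrorType.PERMANENT, str(exc),
                         "Do not retry with the same arguments; adjust the request.")
    # Everything else: treat the capability as down
    return ToolError(ErrorType.UNAVAILABLE, str(exc),
                     "Inform the user this capability is currently unavailable.")

err = classify_exception(ConnectionError("upstream reset"))
print(err.error_type.value, "|", err.suggestion)
```

The classifier sits in the harness's `except` block, so individual tools never need to know about the taxonomy.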
## Problem 5: The Harness Has No State
The minimal harness is stateless between runs. For cross-session persistence, you need a state layer:
```python
import json
import sqlite3
from datetime import datetime, UTC

class AgentState:
    def __init__(self, db_path: str = "agent_state.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""CREATE TABLE IF NOT EXISTS sessions (
            session_id TEXT PRIMARY KEY, created_at TEXT,
            last_active TEXT, user_id TEXT)""")
        self.db.execute("""CREATE TABLE IF NOT EXISTS tool_invocations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            session_id TEXT, turn_number INTEGER,
            tool_name TEXT, arguments TEXT, result TEXT,
            success INTEGER, duration_ms INTEGER, timestamp TEXT)""")
        self.db.commit()

    def create_session(self, session_id: str, user_id: str):
        now = datetime.now(UTC).isoformat()
        self.db.execute(
            "INSERT INTO sessions VALUES (?, ?, ?, ?)",
            (session_id, now, now, user_id))
        self.db.commit()

    def record_tool_invocation(self, session_id: str, turn: int,
                               tool: str, args: dict, result: str,
                               success: bool, duration_ms: int):
        self.db.execute(
            "INSERT INTO tool_invocations VALUES (NULL, ?, ?, ?, ?, ?, ?, ?, ?)",
            (session_id, turn, tool, json.dumps(args), result,
             int(success), duration_ms, datetime.now(UTC).isoformat()))
        self.db.commit()

    def get_analytics(self, session_id: str) -> dict:
        total = self.db.execute(
            "SELECT COUNT(*) FROM tool_invocations WHERE session_id = ?",
            (session_id,)).fetchone()[0]
        rate = self.db.execute(
            "SELECT AVG(success) FROM tool_invocations WHERE session_id = ?",
            (session_id,)).fetchone()[0] or 0
        return {"total_invocations": total, "success_rate": round(rate * 100, 1)}
```
The state layer gives you session persistence, tool invocation audit logs, and built-in analytics — essential for debugging failed sessions.
## The Complete Architecture
All five pieces fit together:
```
User Input
    │
    ▼
┌───────────────────────────────┐
│ Budget Enforcer               │ ← checks before every iteration
├───────────────────────────────┤
│ Agent Memory                  │ ← compresses old context
├───────────────────────────────┤
│ LLM Call                      │ ← with tool definitions
├───────────────────────────────┤
│ tool calls?                   │ ── no → return final answer
├───────────────────────────────┤
│ Tool Registry                 │ ← schema + argument validation
├───────────────────────────────┤
│ Safe Execute                  │ ← structured errors with suggestions
├───────────────────────────────┤
│ Agent State                   │ ← log turn + tool invocation
└───────────────────────────────┘
    │ yes → loop back to the top
```
Each component has a single responsibility. The harness coordinates them. The model is just one node in the graph.
## Where Managed Platforms Fit In
Building this harness from scratch teaches you exactly what’s involved. But the five components — tool registry, memory management, budget enforcement, error handling, and state persistence — are infrastructure, not business logic. They’re identical whether you’re building a GitHub agent, a content agent, or a customer support agent.
Platforms like Nebula abstract exactly this layer. You define the tools (automatically MCP-exposed), the system prompt, and constraints like max iterations and token budgets. The platform handles the harness: tool validation, context compression, budget tracking, error formatting, and session persistence. Every agent execution is traced end-to-end with cost attribution, and the observability dashboard shows tool call distributions, success rates, and budget consumption in real time.
You focus on what the agent does. The platform ensures you can see when it goes wrong.
## Actionable Takeaways

- Start with the loop, not the model. The call-observe-decide-repeat pattern is fundamental. Pick any capable LLM and focus on getting the harness right.
- Validate tool calls before dispatch. Schema validation catches 60-70% of tool-call errors before they hit application code.
- Compress context aggressively. Use a hot-cache pattern: keep recent messages, summarize old ones, and never page out the system prompt.
- Enforce budgets in code, not prompts. A `max_iterations` instruction in your prompt is a suggestion. A `BudgetEnforcer` that halts execution is a guarantee.
- Structure your errors. Classify errors as transient (retry), permanent (redirect), or unavailable (graceful degradation), always with a suggested action.
- Log everything. Tool invocations with arguments, results, durations, and success status. When a session goes wrong, logs are the only way to reconstruct what happened.
- Build the harness first, optimize the model second. A well-harnessed GPT-3.5 outperforms an unharnessed GPT-4o every time.
The agent harness isn’t glamorous. But it’s the difference between an agent that works once in a notebook and one that works at 2 AM on a Tuesday when nobody’s watching. Build it right, and the model becomes the least interesting part of your system.
This article is part of the Building Production AI Agents series on Dev.to.
