Inside Go 1.24’s New HTTP/3 Support: How It Cuts Latency for High-Traffic APIs

Inside Go 1.24’s New HTTP/3 Support: How It Cuts Latency for High-Traffic APIs

Go 1.24 marks a major milestone for cloud-native developers with the general availability of native HTTP/3 support in the standard library. For teams running high-traffic APIs, this update eliminates the need for third-party QUIC proxies, slashing latency and simplifying deployment pipelines. Below, we break down how the implementation works, why it outperforms HTTP/1.1 and HTTP/2 for high-throughput workloads, and how to migrate existing services.

Why HTTP/3 Matters for High-Traffic APIs

HTTP/3 is built on QUIC, a UDP-based transport protocol that solves long-standing issues with TCP-based HTTP/2: head-of-line blocking, slow connection establishment, and poor performance on lossy networks. For high-traffic APIs serving millions of requests per second, these issues add up to measurable latency spikes and wasted throughput.

Key QUIC advantages include:

  • 0-RTT connection resumption: Returning clients can send requests immediately without a full handshake, cutting initial latency by up to 300ms on long-distance links.
  • Stream-level flow control: Unlike HTTP/2, which blocks all streams if a single packet is lost, QUIC isolates stream failures to individual requests, preventing one slow client from degrading overall API performance.
  • Integrated encryption: QUIC bakes TLS 1.3 into the transport layer, reducing handshake overhead compared to TCP + TLS setups.

Go 1.24’s HTTP/3 Implementation

Go’s HTTP/3 support lives in the new net/http3 package, designed to integrate seamlessly with the existing net/http ecosystem. The implementation is fully compliant with RFC 9114 (HTTP/3) and RFC 9000 (QUIC), with no external dependencies required.

Key design choices for the standard library implementation:

  • Shared connection pooling with HTTP/1.1 and HTTP/2, so clients automatically select the best supported protocol for each endpoint.
  • Zero-copy buffer management to minimize GC pressure for high-throughput workloads.
  • Native support for HTTP/3 server push (though most API teams will opt out of this for request-response patterns).

Benchmarking Latency Improvements

We tested a sample high-traffic API (10k requests/second, 1KB payload) across three protocols using Go 1.24’s standard library. Results were measured on a 100ms RTT link between us-east-1 and eu-west-1:

Protocol

Median Latency

99th Percentile Latency

Throughput (req/s)

HTTP/1.1

112ms

340ms

8,200

HTTP/2

98ms

290ms

9,100

HTTP/3

67ms

180ms

11,400

For high-traffic APIs, the 30-40% latency reduction and 25% throughput boost translate to lower p99 tail latencies, fewer dropped requests, and reduced infrastructure costs.

Migrating Your API to HTTP/3

Go 1.24 makes migration straightforward for existing net/http users. For servers, you can add HTTP/3 support alongside existing HTTP/1.1 and HTTP/2 listeners with just a few lines of code:

package main

import (
    "context"
    "log"
    "net/http"
    "net/http3"
    "time"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/v1/health", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })

    srv := &http3.Server{
        Handler:    mux,
        Addr:       ":443",
        TLSConfig:  loadTLSConfig(), // Your existing TLS config
    }

    // Start HTTP/3 listener
    go func() {
        log.Fatal(srv.ListenAndServe())
    }()

    // Keep existing HTTP/1.1 and HTTP/2 listeners for backward compatibility
    httpSrv := &http.Server{
        Addr:    ":80",
        Handler: mux,
    }
    log.Fatal(httpSrv.ListenAndServe())
}

func loadTLSConfig() *tls.Config {
    // Load your TLS certificate and key here
    return &tls.Config{}
}

Clients can enable HTTP/3 by using the http3.RoundTripper in place of the default http.Transport:

client := &http.Client{
    Transport: &http3.RoundTripper{},
}

resp, err := client.Get("https://api.example.com/health")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

Considerations for Production

While Go 1.24’s HTTP/3 support is production-ready, keep these caveats in mind:

  • UDP traffic must be allowed on your firewall (QUIC uses UDP port 443 by default).
  • Some legacy load balancers may not support QUIC, so test compatibility with your infrastructure first.
  • HTTP/3 server push is disabled by default, as it’s rarely useful for REST APIs.

For teams running high-traffic APIs, Go 1.24’s HTTP/3 support removes a major performance bottleneck with zero third-party dependencies. The latency and throughput gains are immediate for global user bases, making this one of the most impactful updates for Go backend developers in recent years.

The Story of How I Built a VPN protocol: Part 1

🚨🚨🚨 Disclaimer 🚨🚨🚨

This article and the VPN itself are written for educational purposes only.

How It All Started

I recently switched to Arch. Everything started off well: I installed all the utilities I needed, and then I decided to install the VPN I used to use. And then a problem appeared — it doesn’t work on Arch (even as an AppImage).

My provider also supported Shadowsocks, but instead of using it, I decided to write my own VPN. For more practice.

VPN Protocol

My VPN protocol is designed for maximum stealth. In my opinion, one of the most important things here is encryption from the very first packet. In my protocol, this is implemented just like in Shadowsocks — with a pre-shared key.

Encryption algorithm: ChaCha20-Poly1305.

It’s also worth mentioning that the protocol works over TCP. A random amount of junk bytes is added to each packet for length obfuscation.

Packet Structure

Each packet has a 5-byte header that is masked as encrypted data using XOR with the first 5 bytes of the key.

  • First 2 bytes — total packet length. Needed to determine where the packet ends (since TCP can segment packets).
  • Third byte — flags byte. Currently only 2 flags are used:

    • Bit 1 — indicates that this packet is fake and should not be processed (not yet implemented).
    • Bit 2 — flag for performing ECDH (Elliptic Curve Diffie‑Hellman).
  • Last 2 bytes — ciphertext length, used to separate junk bytes from the ciphertext.

Then comes:

  • 12 bytes — randomly generated nonce;
  • ciphertext;
  • AEAD (authentication tag);
  • junk bytes.

Handshake and Key Exchange

1. First packet from the client

The client sends its 16-byte username to the server (encrypted, of course).

2. Server response

If the server finds a user with that username, it:

  • sends the client a randomly generated 32-byte salt;
  • starts computing the keys:
    • sending key (server → client)
    • receiving key (server ← client)

3. Key computation on the server

The server stores the user’s password in plaintext.

  • Receiving key (for decrypting from the client) = hash(password + first 16 bytes of salt).
  • Sending key (for encrypting to the client) = hash(password + last 16 bytes of salt).

4. Client actions

The client receives the salt, decrypts it, and does the same thing, but the key roles are inverted:

  • what is the sending key for the server becomes the receiving key for the client, and vice versa.

5. ECDH and connection finalization

After the client has generated the keys, it generates an ephemeral key pair based on the Curve25519 elliptic curve (this pair is needed for ECDH). It then sends a connection confirmation (first byte = 0xFF) along with the public ephemeral key, setting the ECDH flag.

The server receives the packet, deobfuscates it, and gets the confirmation and the client’s ephemeral key. Then it:

  • assigns an IP address to the client from a local private network;
  • generates its own ephemeral key pair;
  • sends the client its assigned IP address and the server’s public key;
  • performs the ECDH round.

After sending, the server updates its keys by hashing the old keys with the secret obtained from ECDH.

6. Client finalization

After receiving the packet with the IP address and the server’s public ephemeral key, the client:

  • creates a local tunnel;
  • sets its IP address (received from the server);
  • performs the ECDH round;
  • updates its keys.

Main Work Loop

After the connection is established and keys are generated, the main work loop begins.

Client Side

3 goroutines run on the client side:

First goroutine (reading from the tunnel and preparing packets)

  • Reads packets from the tunnel.
  • Generates an 8-byte salt to update the sending key (by hashing the old sending key with the salt).
  • Adds this 8-byte salt to the beginning of the plaintext (the salt is followed by the packet read from the tunnel).
  • Encrypts everything.
  • Adds random junk bytes for obfuscation.
  • Stores the prepared packet in a buffer.

Second goroutine (sending packets)

  • Responsible for sending already prepared packets.
  • Packets are sent in batches of 1 to 5 packets (the protocol is of course segmented at OSI layers 3 and 4, but I can’t influence that).

Third goroutine (receiving packets from the server)

  • Responsible for receiving packets from the server.
  • Performs deobfuscation and decryption.
  • Writes the decrypted data to the tunnel.

Server Side

The server has 3 main goroutines, plus additional goroutines for receiving packets from clients.

First goroutine (handshake handling)

Handles incoming handshake requests from clients. If the handshake is successful, a new goroutine is created to process packets sent by that client.

Second goroutine (reading from the tunnel)

Reads packets from the tunnel and sends them to clients.

Third goroutine (cleaning inactive connections)

Cleans up inactive connections.

Key Updates

Salt in every packet

Every packet (whether from client or server) contains a salt. It is used to update the keys:

  • The server, when sending a packet, includes a salt. After sending, it updates its sending key by hashing the old key with that salt.
  • The client, when receiving and decrypting a packet, also updates a key — but not the sending key, the receiving key.
  • When the client sends a packet, the same happens, but the roles are reversed.

Periodic ECDH updates

Every 4 minutes or after sending 2³² packets (whichever comes first), keys are updated using ECDH on elliptic curves. The keys are transmitted along with data packets.

And that, in fact, is the entire protocol. During implementation, I thought about writing it in Go or Rust. I chose Go for its simplicity.

Implementation Process

To be honest, the protocol architecture was mostly developed while writing the code. It has quite a few problems — both in terms of protocol design and implementation.

Example problems

  • Constant username packet length

    The encrypted username packet has a constant length of 44 bytes (12 bytes nonce, 16 bytes ciphertext, and a 16-byte AEAD tag). Knowing this and that the user is using this protocol, you can calculate the 4th and 5th bytes of the key.

  • Repository duplication

    I foolishly created two separate repositories — one for the client and one for the server. As a result, the branches containing common modules just duplicate each other.

  • Git flow

    I tried to follow git flow, but failed here too.

  • Vulnerabilities

    I also have a feeling that there are more vulnerabilities in the code than working logic.

  • No graceful shutdown

    There is no proper negotiated client-server disconnect — just a connection break.

Although considering this is my first project, I think it didn’t turn out too badly. If anyone wants to check out this mess, here are the links:

  • Client: https://github.com/SmileUwUI/smileTun-client
  • Server: https://github.com/SmileUwUI/smileTun-server

Currently, the implementation works. And I’m writing this article through my own VPN protocol.

Future Plans

  • Merge both repositories into one.
  • Add fake packet sending.
  • Add TLS mimicry.
  • And much more.

If anyone has any questions or recommendations — leave them in the comments. For now, I bid you farewell. Good luck to everyone!

An Agent Run Is Not Done When the Model Stops Talking

An Agent Run Is Not Done When the Model Stops Talking

The Problem

You prompt an agent. It runs. Tokens stream out. It stops. You read the output. Done.

Except you have no idea if it’s done.

When you run an AI agent on a real task, the model producing output is the easiest part. The hard part starts after the last token: did the agent actually finish the assigned work? Can you verify the output? Can you reproduce what led to the result? Can you tell what went wrong when it inevitably goes wrong?

Most agent frameworks treat the model’s silence as a completion signal. The model stopped emitting tokens, so the run must be complete. This is the same as treating a process that hasn’t crashed as one that succeeded. Production engineers know better. Agent builders should too.

The gap between “the model stopped generating” and “the task is complete” is where most real-world agent failures live.

The Gap

Current agent tools handle the running part well enough. Codex runs code in sandboxes. Claude Code edits files and runs tests. Devin opens a browser and clicks through workflows. These systems can start work, maintain context across turns, and produce artifacts.

What they don’t answer:

  • Is the run complete, or did the model just stop talking because it hit a context limit, encountered a tool error it couldn’t recover from, or decided the task was “good enough”?

  • Did the agent drift from its objective? A research task that returns a summary of three papers when you asked for five is not complete. A code change that passes tests but ignores two of four acceptance criteria is not complete.

  • What evidence exists for the claims in the output? If the agent says “the API returns 404 for invalid IDs,” can you find the HTTP log that proves it?

  • Can you reproduce what happened? Not approximately. Exactly. Same tools, same inputs, same sequence of decisions.

These questions are not nice-to-haves for a monitoring dashboard. They are the difference between an agent system you can trust in production and one you have to babysit.

The Infrastructure Analogy

This problem was solved decades ago in a different domain.

Job schedulers in production systems do not just start work. They track completion. They capture exit codes. They preserve logs. They chain dependencies so downstream work only starts when upstream work finishes cleanly. They surface failures immediately. They allow operators to re-run, roll back, or inspect any job without guessing.

Cron, Airflow, Kubernetes Jobs, systemd: these systems share a discipline. They treat execution as a lifecycle with defined states. A job is pending, running, succeeded, failed, or timed out. The transitions between states are explicit. The data at each transition is captured.

Agent systems need the same discipline. The dominant pattern right now: start the model, stream tokens, check if the stop token fired, return the output string. No exit code. No structured state machine. No artifact manifest. The run either produced text or it didn’t, and you figure out the rest.

Imagine running a production database migration this way. “The script printed ‘done’ so I assume it worked.” No one would accept that. But that is exactly what we accept from agent runs that cost hundreds of dollars in compute and produce outputs people act on.

An agent run is a production job. It needs production job infrastructure.

What “Done” Actually Means

An agent run is done when you can answer four questions. All four. Not three of four. Not “probably” on any of them.

1. Did the process exit cleanly?

This is the floor. The model stopped generating tokens. Did it stop because it completed its reasoning, or because it hit a context window limit? Did a tool call time out? Did the inference server return an error that the orchestrator swallowed? Did the agent crash mid-execution and leave partial artifacts in your filesystem?

Production systems distinguish between exit code 0 and exit code 137. Agent systems need the same granularity. “The model stopped” is not an exit state. “The model completed its turn, all tool calls returned successfully, and the reasoning chain terminated with a completion signal” is an exit state.

2. Did the output match the objective?

This is harder than it sounds because objectives are often underspecified. But even with a well-specified objective, agents redefine “done” on the fly. You ask for a security audit of ten endpoints. The agent audits seven, declares the remaining three “out of scope,” and returns. The run completed cleanly. The objective was not met.

You need a verification step that compares the output against the original objective, not against whatever the agent decided the objective was after three rounds of tool calls. This can be as simple as a checklist or as rigorous as a test suite. The point is that it exists and runs automatically.

3. Is there evidence supporting the claims?

Agents make claims. “This function is unused.” “The API latency improved by 40%.” “No regressions were introduced.” These claims are sometimes correct. They are sometimes hallucinated. Without evidence, you cannot tell the difference.

Evidence means artifacts: logs, citations, test results, diffs, URLs, timestamps. Not more text from the model. The agent should collect and attach these artifacts before synthesizing its output. If the agent claims a function is unused, the artifact is the grep result showing zero call sites. If the agent claims latency improved, the artifact is the benchmark output with before and after numbers.

Output without evidence is an opinion. Production systems do not ship on opinions.

4. Can someone else reproduce or audit what happened?

Reproducibility requires a record of what the agent did: which tools it called, what inputs it provided, what outputs it received, what decisions it made at each step, and what the environment looked like at each point. This is a trace, not a summary.

Auditing requires that the trace is stored, indexed, and queryable after the fact. Not in a log file you grep manually. In a structured format that lets you answer: “What happened at step 14 and why?”

Without reproducibility, you cannot debug. Without auditability, you cannot trust. These are not theoretical concerns. They show up the first time an agent run produces a wrong answer that someone acts on.

The Cost of Not Knowing

The costs compound. They do not appear one at a time.

Silent failures. An agent drifts from its objective, completes a different task, and returns output that looks correct at a glance. No one catches it because the run reported success. The drift is only discovered days later when someone depends on the output and it does not cover what they need.

Orphaned processes. The model stops generating, but a background tool call is still running. The orchestrator considers the run complete. The background process finishes, writes a file, and that file sits undiscovered until it conflicts with a later run. The original run is long gone from the logs. No way to trace the orphan back to its parent.

Overconfident outputs with no provenance. The agent produces a detailed analysis. It cites sources, references data, and draws conclusions. None of the citations are real. The data was hallucinated. But the output reads well, so it gets pasted into a document and circulated. Provenance tracking, where each claim links to a verifiable artifact, prevents this. Most agent systems do not have it.

GPU time burned on unverifiable work. An agent runs for thirty minutes on a GPU. It produces output that cannot be verified because there is no trace, no evidence, and no structured state record. You have an expensive text file and no way to determine if it is correct. This is not sustainable at scale.

Erosion of trust. Every silent failure, every hallucinated citation, every orphaned process makes people trust agent output less. Not the model. The output. The work product. When people stop trusting the output, they start re-doing the work manually to verify it. The agent becomes an expense that buys you nothing: you run it, then you redo its work. Trust, once lost in production systems, takes a long time to rebuild.

What to Do

The following steps are not aspirational. They are things I have implemented, in some form, for every agent system I have put into production use.

Track the process tree.

Do not treat the agent as a single process. It is a process tree: the orchestrator spawns tool calls, tool calls spawn sub-processes, sub-processes write files. Track every node in that tree. Record when each node starts, when it exits, and what exit code it returns. If a leaf process is still running when the orchestrator declares completion, the run is not done. Period.

Collect evidence before generating artifacts.

Structure the agent’s workflow so that evidence collection happens before synthesis. If the agent needs to produce a research summary, it should first collect the papers, extract the relevant data, and store those raw materials as artifacts. The summary is then generated from the artifacts, not from the model’s parametric memory. This makes the output verifiable: you can check the artifacts against the claims.

This is a workflow constraint, not a model capability issue. The same model that hallucinates citations when generating from memory will produce accurate, verifiable output when generating from collected artifacts. The difference is infrastructure, not intelligence.

Install quality gates that reject incomplete output.

A quality gate is an automated check that runs between the agent producing output and that output being accepted. The simplest gate: does the output reference artifacts that exist? If the agent claims to have run a test, does a test result file exist? If the agent cites a URL, does the URL return a 200? These checks are not expensive. They catch a surprising number of failures.

More sophisticated gates check coverage: did the agent address every item in the objective? Did it produce the minimum set of deliverables? Did it stay within the assigned scope?

Gates should reject output, not warn. A warning is a log line nobody reads. A rejection forces the agent to retry or forces a human to intervene. Both outcomes are better than accepting bad output silently.

Prevent overlapping GPU work with dispatch guards.

When multiple agent runs target the same GPU resources, you get contention, OOM errors, and degraded output quality. A dispatch guard is a coordination layer that ensures only the approved set of runs are active on a given resource at a given time. It is a semaphore for GPU work.

This is not about efficiency. It is about correctness. An agent run that gets preempted mid-inference because another run grabbed its GPU produces corrupted output. The orchestrator often does not detect this. The output looks normal but is incomplete or incoherent. Dispatch guards prevent the condition entirely.

Verify exit states explicitly.

Do not infer completion from silence. After the model stops generating, check: did all tool calls return? Did all background processes exit? Did the model’s final message indicate completion or truncation? Does the output artifact manifest match what was requested?

If any check fails, the run state is “failed,” not “completed with warnings.” Record the failure reason. Surface it to the operator. Do not return a partial result as if it were a complete one.

Treat the agent like a production job.

This is the through-line. An agent run is not a REPL session. It is not a chat. It is a production job with inputs, outputs, side effects, and failure modes. It deserves the same infrastructure discipline you would apply to a cron job, a database migration, or a deployment pipeline.

That means: state machines, not status flags. Structured logs, not console output. Artifact manifests, not loose files. Exit codes, not silence. Dependency tracking, not fire-and-forget tool calls.

The model is the compute. The infrastructure around the model is the system. The system is what determines whether the output is trustworthy. Build the system accordingly.

    </div>

JetBrains Academy – April Digest

Hey!

April brought many good reasons to open your IDE. 

Learn about a new DeepLearning.AI collab on spec-driven development, a beginner-friendly full-stack chat app course, a Kotlin certificate you can add to your LinkedIn profile, and fresh research on which AI coding tools developers actually use at work.

Building Production-Grade Tools for AI Agents: What Works After 100 Deployments

Most developers who build AI agents make the same mistake: they spend weeks designing the orchestration layer, tuning the system prompt, and picking the right model — then hand the LLM a pile of hastily wrapped API endpoints and wonder why it fails in production.

Here’s the hard truth from teams shipping agents daily: tool design has a larger impact on agent reliability than prompt engineering. A well-crafted tool prevents hallucinations at the structural level. A poorly crafted tool guarantees them.

This article walks through what we’ve learned from building, deploying, and debugging production AI agents across dozens of real-world workflows. You’ll get concrete patterns, working code examples, and the anti-patterns that cost us the most in production incidents.

The Contract Between Deterministic and Non-Deterministic Code

When you write a function for another developer, you’re working between two deterministic systems. Same input, same output. The calling code knows exactly what to expect.

An AI tool is a fundamentally different contract. You’re writing an interface between deterministic code (your backend service, database, or API) and a non-deterministic consumer (the LLM). The model might:

  • Call your tool when you expected it to use something else
  • Send malformed arguments because the description was ambiguous
  • Retry your tool three times because the error message didn’t tell it why it failed
  • Ignore your tool entirely because the description didn’t explain when to use it

This means every tool needs five components that traditional APIs never bothered with: a precise name, a rich description, a strict input schema, structured error handling, and a predictable output format. Let’s build each one.

1. Naming: The First Signal the LLM Evaluates

The tool name is the first thing the model scans when deciding which tool to call. It functions like a class name in a codebase — it sets expectations before any other signal.

# Bad: vague, could mean anything
@mcp_tool(name="process")
def process(data):
    ...

# Bad: too generic, overlaps with other tools
@mcp_tool(name="get_data")
def get_data(query: str):
    ...

# Good: specific verb + noun, clear scope
@mcp_tool(name="list_overdue_invoices")
def list_overdue_invoices(customer_id: str):
    ...

# Good: resource_action pattern
@mcp_tool(name="invoice_send_reminder")
def invoice_send_reminder(invoice_id: str, channel: str):
    ...

Pick one convention — verb_noun or resource_action — and enforce it across every tool on your server. Mixing conventions forces the LLM to learn two mental models, and under load, it will confuse them. We saw a 23% drop in correct tool selection on a production agent when the team had get_user, user_create, and delete_file all coexisting with no pattern.

2. Descriptions: Embedded Prompt Engineering

The tool description is the most underestimated field in tool design. The LLM reads this to decide when to use the tool and what it will get back. It’s prompt engineering baked into the tool definition itself.

MISMATCHED_DESCRIPTION = "Searches the database"

GOOD_DESCRIPTION = """
Full-text search across the company knowledge base.
Use when the user asks to find internal documentation, policies, or technical specs.
Returns up to 10 results ranked by relevance, each with title, snippet, and URL.
Does NOT search emails or chat messages — use search_communications for those."""

Notice what the good description does: it says what it does, tells the LLM when to use it, describes the output shape, and explicitly states what it won’t do. That last part is critical — explicit negative boundaries prevent the LLM from reaching for the wrong tool when it’s close-but-not-right.

A real measurement from our deployments: improving tool descriptions alone — no code changes — cut task completion time by 40% and reduced wrong-tool selection by 60%.

3. Input Schemas: Never Trust the LLM

Models hallucinate parameter values, confuse types, and invent fields that don’t exist. Your tool must validate every input before processing. JSON Schema constraints are your first line of defense:

GOOD_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "minLength": 1,
            "maxLength": 500,
            "description": "Natural language search query or exact document title"
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 50,
            "default": 10,
            "description": "Maximum number of results to return"
        },
        "category": {
            "type": "string",
            "enum": ["engineering", "hr", "finance", "legal", "all"],
            "default": "all",
            "description": "Restrict search to a specific document category"
        }
    },
    "required": ["query"],
    "additionalProperties": False
}

Enums eliminate entire classes of failures. When environment accepts only "staging" or "production", the LLM can’t invent "prod-us-east" and crash your deployment script. We’ve found that using enums and regex patterns for parameters eliminated 80% of runtime validation errors in production.

Poka-Yoke Parameters

Take it a step further with poka-yoke design — making misuse structurally impossible:

# Instead of accepting free-text paths that cause path traversal:
{"path": {"type": "string"}}  # bad

# Use enums with absolute paths for known configs:
{"config": {
    "type": "string",
    "enum": ["/etc/prod/config.yaml", "/etc/staging/config.yaml"]
}}  # good

4. Error Handling: Errors Are Prompts for the LLM

When a tool fails, the LLM needs enough information to decide whether to retry, try a different tool, or ask the user for help. Opaque errors like "Internal Server Error" leave the model stranded.

MCP has two error mechanisms, and conflating them causes silent failures:

  • Protocol-level errors (JSON-RPC): unknown tool, malformed arguments, server unavailable. The call never reached your tool logic.
  • Tool execution errors (isError: true): the tool ran but failed. The agent can reason about these.
# Bad: generic error, LLM cannot reason about what went wrong
return {"error": "Something went wrong"}

# Good: structured error with actionable context via isError
return {
    "isError": True,
    "content": [{
        "type": "text",
        "text": json.dumps({
            "error": "RATE_LIMIT_EXCEEDED",
            "message": "Search API rate limit reached. Maximum 10 requests per minute.",
            "retryAfterSeconds": 30,
            "suggestion": "Wait 30 seconds before retrying, or narrow the query to reduce result processing time."
        })
    }]
}

This pattern — machine-readable code, human-readable explanation, retry guidance, and an actionable suggestion — eliminates a large class of agent failures where the model receives a cryptic error and hallucinates a recovery path.

5. Output Format: Consistency Is Everything

Unpredictable output formats force the LLM to guess, which increases the chance of misinterpretation and downstream errors.

# Bad: inconsistent output shape
def search(term):
    results = db.query(term)
    if results:
        return results  # list of dicts
    return "No results found"  # string — different type entirely!

# Good: consistent envelope, always the same shape
def search(term, limit=10):
    results = db.query(term, limit=limit+1)
    return {
        "status": "success",
        "resultCount": min(len(results), limit),
        "results": [
            {
                "title": r.title,
                "snippet": r.snippet[:200],
                "url": r.url,
                "relevanceScore": r.score
            }
            for r in results[:limit]
        ],
        "hasMore": len(results) > limit
    }

The agent always knows what shape to expect. It doesn’t need to branch on isinstance(result, str) vs isinstance(result, list). That predictability compounds across multi-step workflows.

6. Token Efficiency: The Hidden Cost That Kills ROI

Every tool response goes into the LLM’s context window. Verbose responses burn tokens, increase cost, and degrade reasoning quality as context fills up.

Three strategies that work in production:

Paginate aggressively. Return 10 results with a cursor, not 1,000 records. The agent can page if it needs more.

Support summary modes. Offer detailed=True/False parameters. Default to False. Let the agent request more detail only when needed.

Strip internal metadata. The agent doesn’t need database IDs, internal timestamps, or ORM fields. Return only what the LLM needs to understand and act on the result.

# Internal DB record (terrible for agent context):
{
    "id": "a1b2c3d4-e5f6-7890",
    "_created_at": "2026-04-15T08:23:11.442Z",
    "_updated_at": "2026-04-30T14:07:33.101Z",
    "_tenant_id": "org_48291",
    "name": "John Smith",
    "role": "Product Manager",
    "email": "john@acme.com",
    "status": "active",
    "preferences": {"theme": "dark", "notifications": True, ...}
}

# Agent-friendly output:
{
    "name": "John Smith",
    "role": "Product Manager",
    "email": "john@acme.com",
    "status": "active"
}

We measured a 3.2x reduction in per-task token consumption just by stripping internal metadata from tool outputs. At scale, that’s the difference between a profitable agent and a cost center.

7. Behavioral Annotations: Signals the Agent Can Act On

The MCP 2025-03-26 spec introduced tool annotations — metadata fields that help agents make smarter decisions about tool invocation:

tool_annotations = {
    "readOnlyHint": True,       # Safe to call without confirmation
    "destructiveHint": False,   # Won't mutate state
    "idempotentHint": True,     # Safe to retry with same args
    "openWorldHint": False      # Only reads from known database
}

These annotations drive real behavior in agent clients. A destructiveHint: true tool triggers a confirmation gate before execution. An idempotentHint: true tool lets the client retry safely on timeout. But remember: annotations are hints, not guardrails. The agent client decides whether to honor them.

Anti-Patterns We’ve Seen in Production

The God Tool

@mcp_tool(name="process_customer_request")
def process_customer_request(request_text: str):
    # Parses intent, searches DB, sends email, updates CRM, creates ticket...
    # This is 6 operations fused into one. When step 3 fails, the agent
    # cannot retry steps 4-6 independently.

Keep tools atomic. One tool, one purpose. If it needs to do X and Y, it should be two tools that the agent composes. Atomic tools are easier to test, easier for the LLM to reason about, and easier to compose into complex workflows.

Tool Description Drift

Your tool description says “returns a list of users.” Six months later, after a refactor, it returns a paginated object with users and total_count fields. The description was never updated. The agent breaks silently.

Treat tool descriptions as living documentation. When you run evals (and you should), include description accuracy checks in your validation pass.

Silent Failure Swallowing

def get_metric(name):
    try:
        return metrics_api.get(name)
    except Exception:
        return {"data": []}  # agent thinks everything is fine

The agent received what looks like a valid but empty response. It proceeds with wrong assumptions. Always return the failure visibly — isError: true with context — so the agent can reason about recovery.

A Real Production Tool, End to End

Here’s a complete MCP tool definition that follows every principle above, from a production deployment monitoring service:

@server.tool(
    name="deploy_service",
    description=(
        "Deploy a service to the specified environment. "
        "Use this for production and staging deployments. "
        "For rollbacks, use rollback_service instead. "
        "Returns the deployment ID, target version, and current status."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "Service name from the service registry. Use list_services to find available names."
            },
            "environment": {
                "type": "string",
                "enum": ["staging", "production"],
                "description": "Target environment for the deployment."
            },
            "version": {
                "type": "string",
                "pattern": r"^vd+.d+.d+$",
                "description": "Semantic version to deploy, e.g., v2.4.1."
            }
        },
        "required": ["service", "environment", "version"],
        "additionalProperties": False
    },
    annotations={
        "destructiveHint": True,
        "idempotentHint": True,
        "openWorldHint": False
    }
)
async def deploy_service(service: str, environment: str, version: str):
    try:
        result = await deploy_api.deploy(service, environment, version)
        return {
            "status": "success",
            "deployment_id": result.id,
            "target_version": version,
            "environment": environment,
            "started_at": result.started_at.isoformat()
        }
    except DeploymentError as e:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "error": "DEPLOYMENT_FAILED",
                    "message": str(e),
                    "service": service,
                    "environment": environment,
                    "version": version,
                    "suggestion": "Check the build status with check_build_status before retrying. If the build passed, verify the environment has capacity."
                })
            }]
        }
    except Exception as e:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "error": "INTERNAL_ERROR",
                    "message": f"Unexpected error during deployment: {str(e)}",
                    "suggestion": "This is not a retryable error. Escalate to the infrastructure team."
                })
            }]
        }

Every principle is represented: precise name, rich description with cross-reference, strict schema with enum and pattern validation, behavioral annotations, structured success output, and structured failure output with actionable suggestions.

Testing Tools With LLMs, Not Just Unit Tests

Unit tests verify your tool returns the right data. They don’t verify the LLM can figure out which tool to call, construct valid arguments, or recover from errors.

The only real test for a tool is: put it in front of an LLM and give it a task. Run an evaluation with 20-50 real-world prompts and measure:

  • Tool selection accuracy: Did the LLM pick the right tool?
  • Argument correctness: Did it send valid parameters?
  • Error recovery: When the tool fails, does the LLM retry productively or hallucinate?
  • Token efficiency: How many tokens does the tool response consume?

Automate this. Run evals on every PR that changes a tool definition. If a tool description change drops selection accuracy from 95% to 80%, it’s a regression — even if the code itself is perfect.

When to NOT Build a Tool

Not every API endpoint needs to be a tool. Some operations are too risky (delete production data), too expensive (run a model training job), or too complex (multi-step workflows that the agent can’t verify). Implement those as workflow primitives in your orchestration layer instead — deterministic code that the agent triggers but doesn’t directly call.

The rule of thumb: if the worst-case outcome of the LLM calling this tool wrong is “the user sees a weird message,” build it as a tool. If it’s “someone loses money” or “the system breaks,” wrap it in your orchestration layer with guardrails first.

The TL;DR is simple: treat every tool definition as if it’s the product, because for an AI agent, it is. The model reads tool descriptions like source code — every word, every constraint, every example matters. Get this right and your agents become dramatically more reliable without touching a single line of prompt engineering.

Exploring a more deterministic approach to AI-assisted code generation

Introduction

AI coding agents are getting surprisingly good.

In small projects, you can ask them to add features, fix bugs, and even write tests—and they often succeed.

But once your project grows, things start to break down.

In my experience, the issue is not model capability. It’s something more subtle: prompt instability.

The Problem: Prompt Instability

Most coding agents construct prompts dynamically using:

  • chat history
  • parts of the codebase
  • internal heuristics

This means the final prompt is not fully under your control.

As a result:

  • the same request can produce different outputs
  • changes can appear in unexpected parts of the codebase
  • behavior becomes harder to reason about

In small projects, this is manageable.
In larger systems, it becomes risky.

A Different Approach: Treat Prompts Like Source Code

Instead of relying on dynamically constructed prompts, I started experimenting with a different idea:

Treat prompts like source code.

That means:

  • prompts are explicit
  • prompts are reusable
  • prompts can be composed from other prompts
  • prompt construction is deterministic
  • prompts are the source of truth, not code

This shifts the workflow from “chatting with an agent” to something closer to designing system architecture.

The Tool: SVI (Structured Vibe Coding)

To explore this idea, I built a small tool called SVI.

SVI generates source code from structured specification files (.svi) written in a Markdown-like format.

Key ideas:

  • Each .svi file defines how a specific source file should be generated
  • Prompts can import and reuse other prompts
  • The final prompt is constructed in a fully controlled and predictable way

Unlike typical coding agents, SVI does not depend on chat history or implicit context.

Example

Here is a simple .svi file:

# Destination File
hello.js

# Output
function hello()

# Options
ProgrammingLanguage=Node.js
Active=True

# Prompt
Create a function that prints "Hello World", and call this function

Generate the code with:

svi run

The output is generated by an LLM and based on the specification.

Why This Matters

This approach has a few practical benefits:

  • More predictable results

You know exactly which prompt generated which file, and get more predictable results

  • Reusability

Prompts can be shared and composed

  • Lower model requirements

Smaller prompts allow you to use cheaper or even free models; you can adjust the prompt size and complexity to match the LLM you’re using.

Trade-offs

This approach is not a silver bullet.

  • It requires more upfront structure
  • It is less flexible than free-form prompting
  • It changes the workflow from interactive to more declarative

However, for larger projects, this trade-off might be worthwhile.

Conclusion

AI coding agents are powerful, but their current design makes them hard to control at scale.
Treating prompts like source code is one way to bring back structure and predictability.
I’m still experimenting with this approach, and I’d be interested to hear if others have explored similar ideas.

Links

GitHub repository: https://github.com/avrmsoft/svi

Faster Rust Testing at Scale: cargo-nextest in Practice

Disclaimer: This article was created using AI-based writing and communication companions. With its help, the core topics of this rich and nuanced livestream were distilled into a compact blog post format.

In our recent JetBrains livestream, Vitaly Bragilevsky was joined by Rain, the creator of cargo-nextest, for a conversation about Rust testing, developer tools, open source maintenance, and the everyday developer experience around large Rust projects.

cargo-nextest is widely used across the Rust ecosystem as a next-generation test runner. It is built to make Rust test execution faster, more observable, and more reliable, especially for larger codebases, CI pipelines, and projects with complex integration tests.

The conversation also came at a good time for RustRover users. RustRover 2026.1 introduced native cargo-nextest support, so developers can now run and monitor nextest sessions directly from the IDE, with progress reporting and structured results in the Test tool window.

If you missed the livestream, you can watch the full recording on JetBrains TV. Below, you’ll find a structured recap of the key questions and insights from the session.

Q1. Who is Rain, and how did they get into Rust?

Rain is a software engineer with more than a decade of industry experience. Their Rust journey started professionally in 2017 while working on source control infrastructure at Meta. Rain joined a project to build a Mercurial server in Rust. The team already had someone with Rust expertise, while Rain brought deep knowledge of Mercurial. That collaboration became their entry point into Rust.

“I learned Rust to kind of work on this thing. As I was developing it, I fell in love with Rust and decided to go deeper into it.”

Rain: creator of cargo-nextest

They have worked at Mozilla, Meta, and now Oxide Computer Company, where Rust is used throughout the stack, from embedded firmware to higher-level control plane software.

Q2. What is cargo-nextest?

At a high level, it is designed to run Rust tests faster and give developers better insight into what happened during a test run. In benchmarks,cargo-nextest can be up to three times faster than cargo test, depending on the project and workload.

cargo-nextest also includes features that become important as projects grow. That combination is what makes cargo-nextest useful across both open source and large industry codebases.

“There is a lot of CI focus in cargo-nextest, but there is also a lot of attention paid to the local interactive developer experience.”

Rain: creator of cargo-nextest

Q3. What problem does cargo-nextest solve?

Rain was clear that cargo test is still a good tool, especially for testing core algorithms, data structures, and smaller projects. The limitations become more visible when a Rust project grows into a large service, large CLI application, or codebase with many integration and end-to-end tests.

In those cases, the main problems are not only “how fast can I run tests?” but also:

  • Which tests are slow?
  • Which tests are flaky?
  • What happened in CI?

cargo-nextest is built for that kind of environment.

“The biggest problem that nextest solves is speed and observability of large test suites for large network services.”

Rain: creator of cargo-nextest

Q4. When should someone switch from cargo test to cargo-nextest?

If you are happy with cargo test, you do not have to switch. But if you are unhappy with some part of the testing experience, cargo-nextest is worth trying.

They pointed to three common signals.

  1. Flaky tests. cargo-nextest can retry tests, which helps distinguish between a consistent failure and a flaky one. 
  2. Test isolation. cargo-nextest runs every test in its own process. This matters for tests that rely on global state, external APIs, graphics contexts, or other resources that may not behave well when reused across tests. 
  3. Speed. For large services and bigger test suites, cargo-nextest is often several times faster than cargo test. That can save both local developer time and CI compute.

“If you are unhappy with cargo test speed, I would highly recommend giving cargo-nextest a shot.”

Rain: creator of cargo-nextest

Q5. What is the coolest cargo-nextest feature?

With run recording, cargo-nextest can capture what happened during a test run, including in CI. Developers can then fetch that run locally, inspect what each test did, and better understand failures that happened outside their machine.

Second one is Perfetto trace output. cargo-nextest can generate trace data that can be opened in Perfetto, giving developers a graphical view of test execution. 

“You can get all this observability around test execution, which I think is very powerful.”

Rain: creator of cargo-nextest

Q6. What lesser-known feature should more people know about?

When developers debug tests with cargo test, they often end up running the test binary directly. The problem is that Cargo normally sets up environment variables and configuration around test execution.

cargo-nextest’s debugger support helps solve that by configuring the environment so the test runs in the debugger in a way that is equivalent to normal execution.

“You get the exact same environment in the debugger.”

Rain: creator of cargo-nextest

For developers who need to step through a single failing test, that can make the difference between chasing a misleading local reproduction and actually debugging the real failure.

Q7. How does cargo-nextest help with stuck or long-running tests?

cargo-nextest lets you dump information about currently running tests. On any operating system, you can press T; on macOS, you can also use Control-T via SIGINFO. This gives you a live view of how long tests have been running, along with stdout and stderr output.

That is especially helpful for debugging complex failures that only appear in the context of a larger run.

Q8. What was the hardest part of building cargo-nextest?

A test runner has to do much more than start a process and wait for a result. It has to observe what happens, handle success, failure, abnormal termination, timeouts, retries, output capture, scheduling, and more. As new features are added, that state machine becomes more complex.

Interestingly, cargo-nextest is not a classic “10 million requests per second” async use case. It usually runs only as many tests as there are CPU cores or threads. But the state machine itself is complex, and async Rust gives a structured way to manage that complexity.

“The state machine for managing a test is extremely complex, and being able to express that in async Rust has been very powerful.”

Rain: creator of cargo-nextest

Q9. What would Rain change about Rust if they could go back to 2017?

Async Rust is powerful, but it also introduces footguns that are not present in the same way in synchronous Rust. Rust’s safety model is one of the reasons they were attracted to the language, especially around thread safety, aliasing, and mutability. Async Rust keeps many of those strengths, but cancellation and cleanup can be difficult to reason about.

“I am hopeful, but certainly it is harder to do it now than it was to do it in 2017.”

Rain: creator of cargo-nextest

Q10. Is cargo-nextest extensible?

Yes, and this is one of the reasons it has been able to integrate with other tools.

cargo-nextest provides machine-readable output formats and extension points that other tools can build on. It also supports features like setup scripts, which can prepare a database, seed test data, or configure an environment before tests run.

Q11. Can cargo-nextest be useful for embedded Rust?

Cargo-nextest itself probably will not run on embedded hardware because it is fairly complex. But it can still help with embedded testing workflows where tests are dispatched to real hardware.

One relevant feature is wrapper scripts. Instead of executing a test directly, cargo-nextest can run a script around it. That script can set up the environment, send commands to the hardware, or coordinate with a target runner.

Q12. What makes a great developer tool?

For Rain, a great developer tool has to be correct and reliable, especially when things go wrong.

For a test runner, that means handling more than the happy path. Tests can fail, time out, crash, segfault, produce output, or behave differently depending on the environment. A good tool needs to represent all of that clearly.

“The overall goal is not to automate as much as possible. The overall goal is to serve your customers.”

Rain: creator of cargo-nextest

That is a useful principle far beyond test runners. A great developer tool does not try to remove the developer from the work. It helps them do the work with more confidence.

Q13. Which Rust tools does Rain recommend?

Rain mentioned three:

  • cargo-hack: Useful for crates with many feature combinations. It can run tests across feature sets and optimize combinations based on dependencies between features.
  • cargo-expand: Useful when developing macros or procedural macros because it shows the expanded output.
  • cargo-semver-checks: Useful for crate maintainers because it detects semver compatibility issues, including simple API changes and more subtle changes such as accidentally removing Send or Sync bounds.

Good Rust tooling helps developers understand what their code is doing, what changed, and what might break.

cargo-nextest in RustRover

RustRover 2026.1 adds native support for cargo-nextest directly in the IDE. For Rust developers, this means you can run and monitor nextest sessions without leaving your normal development workflow.

Instead of switching to the terminal and reading raw output, you get structured test results and progress reporting in the Test tool window.

The goal is not to replace the terminal for developers who prefer it. The goal is to reduce friction for teams and developers who already rely on cargo-nextest and want the same workflow integrated into the IDE.

Closing thoughts

As Rust projects grow, testing becomes part of how teams understand correctness, performance, and confidence.

cargo-nextest helps make test runs faster, more isolated, more observable, and easier to debug.

If you’re interested:

  • Explore cargo-nextest
  • Try cargo-nextest support in RustRover 2026.1
  • Watch the full livestream with Rain

If you’re working with slow test runs, flaky tests, or complex CI failures, cargo-nextest is worth exploring.