From the pit wall to open collaboration: Welcoming Formula 1 strategist Ruth Buscombe to OCX 2026


Open Community Experience (OCX) brings together developers, industry leaders, researchers, and open source communities to explore how open technologies are shaping the future of software and digital infrastructure. Taking place 21–23 April 2026 in Brussels, OCX provides a space for collaboration, knowledge sharing, and discussion across domains such as AI, developer tooling, automotive software, regulatory compliance, and open source governance.

This year, we’re delighted to welcome a keynote speaker whose experience comes from one of the most demanding and data-driven environments in the world: Formula 1.

Welcoming Ruth Buscombe to OCX

Ruth Buscombe is an expert in Formula 1 race strategy who has spent her career turning data into winning results in environments where performance is measured in milliseconds.

Ruth began her career with Ferrari, shaping race-winning strategies for drivers including Kimi Räikkönen, Fernando Alonso, and Sebastian Vettel. She later brought her expertise to the Haas F1 Team and spent eight seasons at Sauber as Head of Race Strategy, working with Charles Leclerc.

In 2024, Ruth also consulted on the Formula 1 movie starring Brad Pitt, co-produced by Lewis Hamilton, helping ensure that the strategy and racing dynamics reflected the reality of the sport.

Today, Ruth shares her knowledge as a Formula 1 analyst, strategist, and broadcaster, while also championing diversity initiatives for women in STEM and engineering.

 

The winning formula: Lessons from Formula 1

In her keynote session, “The Winning Formula: What F1 teaches us about marginal gains, teamwork and data-driven decision making,” Ruth will take the OCX audience behind the scenes of Formula 1 strategy.

While spectators often focus on the drivers, Formula 1 is fundamentally a sport of data, collaboration, and rapid decision-making. Teams must interpret vast amounts of information in real time and coordinate dozens of specialists under extreme pressure, where even the smallest improvement can determine the outcome of a race.

Drawing on her experience working with elite teams and drivers, Ruth will explore how these high-performance environments approach:

  • Marginal gains and continuous improvement
  • Collaboration across highly specialised teams
  • Data-driven decision-making under pressure
  • Communication when there is no margin for error

These lessons resonate far beyond the racetrack, particularly for open source and distributed engineering teams, where collaboration, trust, and effective decision-making are critical to building complex systems.

 

Join her at OCX 2026

OCX 2026 will bring together hundreds of developers, maintainers, researchers, and technology leaders to exchange ideas, share experiences, and explore the future of open technologies.

Ruth Buscombe’s keynote promises to offer a fascinating perspective on how high-performance teams operate and what open communities can learn from environments where every decision matters. Register for OCX.

 

Watch Ruth’s keynote introduction

Before joining us in Brussels, watch this video to learn what Formula 1 strategy can teach open source teams about collaboration and performance.

 


Natalia Loungou


The Parent Developer’s Guide to Building Games With AI

My son and I recently built a delivery truck maze. You drive delivery trucks through a maze of city streets to the right destination (bread truck to the bakery, flowers to the flower shop). There’s sparkles and audio feedback when you complete a delivery, points awarded, and increasing maze difficulty with each level.

I’ve been a developer for over a decade — web apps, APIs, infrastructure — but game dev always felt like a different discipline. Engines, physics libraries, sprite sheets. Then my three-year-old said “make me a game where a red car goes fast” and I opened Claude instead of Unity.

We made it. It was playable. He loved it. We’ve built dozens since.

Here’s everything I’ve learned about the process.

What You Actually Need

An AI chatbot. Claude, ChatGPT, Gemini — any of the major ones. The technique is the same across all of them.

A web browser. That’s it. We build simple HTML/CSS/JavaScript games that run in a browser tab. Completely sufficient for young kids. No installs. No build tools. No dependencies.

A kid with opinions. (This is the easy part.)

You do NOT need game development experience, knowledge of any game framework, art skills, sound design skills, or a CS degree (though it helps for debugging).

The Basic Flow

Here’s how a typical session goes in our house:

Step 1: The kid has an idea. “I want a game where a delivery truck drives through a maze.”

Step 2: You help translate it into a prompt. “Create a simple HTML game where the player drives a delivery truck through a maze using arrow keys. There are houses along the route — when the truck reaches a house, it delivers a package and the house lights up. Add a counter for deliveries completed. Keep it simple and colorful, suitable for a 3-year-old.”

Step 3: The AI generates code. You get back a complete HTML file with embedded CSS and JavaScript. Save it as .html, open it in your browser.

Step 4: The kid reacts. “Make the truck yellow.” “Add more houses.” “I want a warehouse where you pick up the packages first.”

Step 5: You iterate. Feed the feedback back to the AI. “Change the truck color to yellow. Add a warehouse at the start where the truck loads packages before delivering. Add more houses to the route.”

Step 6: Repeat steps 4-5 until the kid is satisfied or hungry.

That’s the entire game development cycle. Your kid’s imagination, an AI that writes code faster than you can explain what you want, and a browser.

Prompting Tips (The Practical Stuff)

After building a lot of these, here’s what works:

Start way simpler than you think.

Your first prompt should describe a game that a first-year CS student could build. One mechanic. One interaction. One thing on screen. You can always add complexity later, but starting complex usually produces buggy, tangled code that’s hard to iterate on.

Good first prompt: “Make an HTML game where a red circle follows the mouse cursor and collects yellow stars that appear randomly.”

Too ambitious first prompt: “Make a 2D platformer with multiple levels, power-ups, enemies with AI pathing, and a save system.”

Specify the audience.

Always mention that this is for a young child. It changes the AI’s output in useful ways: bigger click targets, brighter colors, simpler controls, more forgiving collision detection.

Ask for everything in one file.

“Put all HTML, CSS, and JavaScript in a single HTML file.” This makes it trivial to save and run. No build step, no dependencies, no module imports that break.

Request mobile/touch support.

“Make it work with both mouse/keyboard and touch.” Toddlers are surprisingly good with touchscreens, and this means the game works on your phone or tablet too.

When things break (and they will):

Copy the error from the browser console and paste it directly to the AI. “When I click the truck, I get this error in the console: [error]. Fix it.” Or if there’s no error, describe it: “When I press the down arrow, the page moves instead of the truck.” AI is excellent at debugging its own code.

What Your Kid Actually Learns

Here’s the part that surprised me: building games this way is sneakily educational, even though it feels like pure play.

Decision-making. Every feature request is a design decision. “Should the truck be fast or slow?” “What happens when you crash?” Your kid is learning to think about cause and effect in a system.

Iteration. The game is never right on the first try. Your kid learns that making things is a process of attempt → evaluate → adjust. That’s the most important meta-skill in all of technology.

Systems thinking. “When I added the dinosaur, the truck stopped working.” Things are connected. Changes have consequences. Welcome to software.

Creative expression. Your kid’s game is their game. Not a game they downloaded. Not a game someone else designed. It has their ideas, their aesthetics, their dinosaur-on-a-garbage-truck vision. That ownership compounds interest.

Common Pitfalls

Don’t optimize too early. Your kid doesn’t care about code quality. They care about whether the dinosaur is big enough. Ship the feature, clean up later (or don’t — these are throwaway games).

Don’t take over. It’s tempting to start adding your own ideas. “What if we add a leaderboard? What about particle effects?” Let your kid drive. Your job is to translate their ideas into prompts, not to impose your own.

Don’t expect polish. AI-generated games look like AI-generated games. They’re functional and fun, but they won’t win any design awards. That’s fine. Your kid genuinely does not care that the truck is a rectangle with two circles for wheels.

Don’t make it a lesson. The moment you say “this is teaching you about algorithms,” the magic dies. Just build. The learning happens in the background.

One More Thing

The games we build might be objectively terrible. The collision detection is approximate. The graphics are basic shapes. The physics are suggestions at best. My son’s current favorite is a car wash game – you literally wash a car. Click on water, click the car. Click on soap, click the car. Some elements overlap others and there’s a bug when you click on the sponge too early – but it’s fine. It works well enough and he’s played it about a hundred times already.

He made it. That’s why.

You don’t need game dev experience to give your kid that feeling. You need an AI, a browser, and the willingness to build a really, really bad game about garbage trucks.

It just might be the best thing you ship all year.

This post was originally published on Raising Pixels. Computational thinking for little kids, from a dev mom who builds with her toddler. Subscribe at raisingpixels.dev.

Your MCP Server Is Eating Your Context Window. There’s a Simpler Way

TL;DR: MCP tool definitions can burn 55,000+ tokens before an agent processes a single user message. We built the Apideck CLI as an AI-agent interface instead: an ~80-token agent prompt replaces tens of thousands of tokens of schema, with progressive disclosure via --help and structural safety baked into the binary. Any agent that can run shell commands can use it. No protocol support required.

The problem nobody talks about at demo scale

Here’s a scenario that’ll feel familiar if you’ve wired up MCP servers for anything beyond a demo.

You connect GitHub, Slack, and Sentry. Three services, maybe 40 tools total. Before your agent has read a single user message, 55,000 tokens of tool definitions are sitting in the context window. That’s over a quarter of Claude’s 200k limit. Gone.

It gets worse. Each MCP tool costs 550–1,400 tokens for its name, description, JSON schema, field descriptions, enums, and system instructions. Connect a real API surface, say a SaaS platform with 50+ endpoints, and you’re looking at 50,000+ tokens just to describe what the agent could do, with almost nothing left for what it should do.

One team reported three MCP servers consuming 143,000 of 200,000 tokens. That’s 72% of the context window burned on tool definitions. The agent had 57,000 tokens left for the actual conversation, retrieved documents, reasoning, and response. Good luck building anything useful in that space.

This isn’t a theoretical concern. David Zhang (@dzhng), building Duet, described ripping out their MCP integrations entirely, even after getting OAuth and dynamic client registration working. The tradeoff was impossible:

  • Load everything up front → lose working memory for reasoning and history
  • Limit integrations → agent can only talk to a few services
  • Build dynamic tool loading → add latency and middleware complexity

He called it a “trilemma.” That feels about right.

And the numbers hold up under controlled testing. A recent benchmark by Scalekit ran 75 head-to-head comparisons (same model, Claude Sonnet 4, same tasks, same prompts) and found MCP costing 4 to 32× more tokens than CLI for identical operations. Their simplest task, checking a repo’s language, consumed 1,365 tokens via CLI and 44,026 via MCP. The overhead is almost entirely schema: 43 tool definitions injected into every conversation, of which the agent uses one or two.

Three approaches to the same problem

The industry is converging on three responses to context bloat. Each has a sweet spot.

MCP with compression tricks

The first response is to keep MCP but fight the bloat. Teams compress schemas, use tool search to load definitions on demand, or build middleware that slices OpenAPI specs into smaller chunks.

This works for small, well-defined interactions like looking up an issue, creating a ticket, or fetching a document. MCP’s structured tool calls and typed schemas are genuinely useful when you have a tight set of operations that agents use frequently.

But it adds infrastructure. You need a tool registry, search logic, caching, and routing. You’re building a service to manage your services. And you’re still paying per-tool token costs every time the agent decides it needs a new capability.

Code execution (the Duet approach)

Duet’s answer was to treat the agent like a developer with a persistent workspace. When the agent needs a new integration, it reads the API docs, writes code against the SDK, runs it, and saves the script for reuse.

This is powerful for long-lived workspace agents that maintain state across sessions and need complex workflows (loops, conditionals, polling, batch operations). Things that are awkward to express as individual tool calls become natural in code.

The downside: your agent is now writing and executing arbitrary code against production APIs. The safety surface is enormous. You need sandboxing, review mechanisms, and a lot of trust in your agent’s judgment.

CLI as the agent interface

The third approach is the one we took. Instead of loading schemas into the context window or letting the agent write integration code, you give it a CLI.

A well-designed CLI is a progressive disclosure system by nature. When a human developer needs to use a tool they haven’t touched before, they don’t read the entire API reference. They run tool --help, find the subcommand they need, run tool subcommand --help, and get the specific flags for that operation. They pay attention costs proportional to what they actually need.

Agents can do exactly the same thing. And the token economics are dramatically different.

Why CLIs are the pragmatic sweet spot

Progressive disclosure saves tokens

Here’s what the Apideck CLI agent prompt looks like. This is the entire thing an AI agent needs in its system prompt:

Use `apideck` to interact with the Apideck Unified API.
Available APIs: `apideck --list`
List resources: `apideck <api> --list`
Operation help: `apideck <api> <resource> <verb> --help`
APIs: accounting, ats, crm, ecommerce, hris, ...
Auth is pre-configured. GET auto-approved. POST/PUT/PATCH prompt (use --yes). DELETE blocked (use --force).
Use --service-id <connector> to target a specific integration.
For clean output: -q -o json

That’s ~80 tokens. Compare that to the alternatives:

Approach                        Tokens consumed     When
Full OpenAPI spec in context    30,000–100,000+     Before first message
MCP tools (~3,600 per API)      10,000–50,000+      Before first message
CLI agent prompt                ~80                 Before first message
CLI --help call                 ~50–200             Only when needed

The agent starts with 80 tokens of guidance and discovers capabilities on demand:

# Level 1: What APIs are available? (~20 tokens output)
$ apideck --list
accounting ats connector crm ecommerce hris ...

# Level 2: What can I do with accounting? (~200 tokens output)
$ apideck accounting --list
Resources in accounting API:

  invoices
    list       GET  /accounting/invoices
    get        GET  /accounting/invoices/{id}
    create     POST /accounting/invoices
    delete     DELETE /accounting/invoices/{id}

  customers
    list       GET  /accounting/customers
    ...

# Level 3: How do I create an invoice? (~150 tokens output)
$ apideck accounting invoices create --help
Usage: apideck accounting invoices create [flags]

Flags:
  --data string        JSON request body (or @file.json)
  --service-id string  Target a specific connector
  --yes                Skip write confirmation
  -o, --output string  Output format (json|table|yaml|csv)
  ...

Each step costs 50–200 tokens, loaded only when the agent decides it needs that information. An agent handling an accounting query might consume 400 tokens total across three --help calls. The same surface through MCP would cost 10,000+ tokens loaded upfront whether the agent uses them or not.

This mirrors how Claude Agent Skills work. Metadata first, full details only when selected, reference material only when needed. The CLI is doing the same thing through a different mechanism.

Scalekit’s benchmark independently validated this pattern. They found that even a minimal ~800-token “skills file” (a document of CLI tips and common workflows) reduced tool calls by a third and latency by a third compared to a bare CLI. Our approach takes it further: the ~80-token agent prompt provides the same progressive discovery at a tenth of the cost. The principle is the same. A small, upfront hint about how to navigate the tool is worth more than thousands of tokens of exhaustive schema.

Reliability: local beats remote

There’s a dimension of the MCP problem that doesn’t get enough attention: availability.

Scalekit’s benchmark recorded a 28% failure rate on MCP calls to GitHub’s Copilot server. Out of 25 runs, 7 failed with TCP-level connection timeouts. The remote server simply didn’t respond in time. Not a protocol error, not a bad tool call. The connection never completed.

CLI agents don’t have this failure mode. The binary runs locally. There’s no remote server to time out, no connection pool to exhaust, no intermediary to go down. When your agent runs apideck accounting invoices list, it makes a direct HTTPS call to the Apideck API. One hop, not two.

This matters at scale. At 10,000 operations per month, a 28% failure rate means roughly 2,800 retries, each burning additional tokens and latency. Scalekit estimated the monthly cost difference at $3.20 for CLI versus $55.20 for direct MCP, a 17× cost multiplier, with the reliability tax on top.

Remote MCP servers will improve. Connection pooling, better infrastructure, and gateway layers will close the gap. But “the binary is on your machine” is a reliability guarantee that no amount of infrastructure engineering on the server side can match.

Structural safety beats prompt-based safety

Telling an agent “never delete production data” in a system prompt is like putting a sticky note on the nuclear launch button. It might work. Probably. Until a creative prompt injection peels the note off.

Security research on AI agents in CI/CD has shown how prompt injection can manipulate agents with high-privilege tokens into leaking secrets or modifying infrastructure. The pattern is always the same: untrusted input gets injected into a prompt, the agent has broad tool access, and bad things happen.

The Apideck CLI takes a structural approach. Permission classification is baked into the binary based on HTTP method:

// From internal/permission/engine.go
switch op.Permission {
case spec.PermissionRead:
    return ActionAllow      // GET → auto-approved
case spec.PermissionWrite:
    return ActionPrompt     // POST/PUT/PATCH → confirmation required
case spec.PermissionDangerous:
    return ActionBlock       // DELETE → blocked by default
}

No prompt can override this. A DELETE operation is blocked unless the caller explicitly passes --force. A POST requires --yes or interactive confirmation. GET operations run freely because they can’t modify state.

The agent frameworks reinforce this. Claude Code, Cursor, and GitHub Copilot all have permission systems that gate shell command execution. So you get two layers of structural safety: the agent framework asks “should I run this command?” and the CLI itself enforces “is this operation allowed?”

You can also customize the policy per operation:

# ~/.apideck-cli/permissions.yaml
defaults:
  read: allow
  write: prompt
  dangerous: block

overrides:
  accounting.payments.create: block    # payments are sensitive
  crm.contacts.delete: prompt          # contacts can be soft-deleted

This is the same principle behind Duda blocking destructive MCP actions, but enforced structurally in the binary, not through prompt instructions that compete with everything else in the context window.

Universal compatibility, zero protocol overhead

Every serious agent framework ships with “run shell command” as a primitive. Claude Code has Bash. Cursor has terminal access. GitHub Copilot SDK exposes shell execution. Gemini CLI runs commands natively.

MCP requires dedicated client support, connection plumbing, and server lifecycle management. A CLI requires a binary on the PATH.

This matters more than it sounds. When you’re building an agent that needs to interact with APIs, the integration path for a CLI is:

  1. Install the binary
  2. Set environment variables for auth
  3. Add ~80 tokens to the system prompt
  4. Done

The integration path for MCP is:

  1. Implement or configure an MCP client
  2. Set up server connections (transport, auth, lifecycle)
  3. Handle tool registration and schema loading
  4. Manage connection state and reconnection
  5. Deal with the token budget for tool definitions

The CLI approach also means your agent integration isn’t locked to any specific framework. The same apideck binary works from Claude Code, Cursor, a custom Python agent, a bash script, or a CI/CD pipeline.
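As a rough illustration of that last point, here is what a custom Python agent’s tool wrapper might look like: it shells out to the binary and parses the JSON that the CLI emits by default in non-TTY contexts. The flags are the ones shown earlier in this post; the wrapper itself and its helper name are assumptions for the sketch, not part of the CLI.

import json
import subprocess

def run_apideck(*args: str):
    """Run the apideck CLI and parse its JSON output (non-TTY, so JSON is the default)."""
    result = subprocess.run(
        ["apideck", *args, "-q", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)

# The same call the shell examples above make, issued from a custom agent loop.
invoices = run_apideck("accounting", "invoices", "list", "--service-id", "quickbooks")
print(f"Fetched {len(invoices)} invoices")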

How we built it

The Apideck CLI is a single static binary that parses our OpenAPI spec at startup and generates its entire command tree dynamically.

OpenAPI-native, no code generation. The binary embeds the latest Apideck Unified API spec. On startup, it parses the spec with libopenapi and builds commands for every API group, resource, and operation. When the API adds new endpoints, apideck sync pulls the latest spec. No SDK regeneration, no version bumps.

Smart output defaults. When running in a terminal, output defaults to a formatted table with colors. When piped or called from a non-TTY (which is how agents call it), output defaults to JSON. Agents get machine-parseable output without needing to remember --output json.

# Agent calls this (non-TTY) → gets JSON automatically
$ apideck accounting invoices list -q
[{"id": "inv_12345", "number": "INV-001", "total": 1500.00, ...}]

# Human runs the same command in terminal → gets a table
$ apideck accounting invoices list
┌──────────┬─────────┬──────────┐
│ ID       │ Number  │ Total    │
├──────────┼─────────┼──────────┤
│ inv_12345│ INV-001 │ 1,500.00 │
└──────────┴─────────┴──────────┘

Auth is invisible. Credentials are resolved from environment variables (APIDECK_API_KEY, APIDECK_APP_ID, APIDECK_CONSUMER_ID) or a config file, and injected into every request automatically. The agent never handles tokens, never sees auth headers, never needs to manage sessions.

Connector targeting. The --service-id flag lets agents target specific integrations. apideck accounting invoices list --service-id quickbooks hits QuickBooks. Swap to --service-id xero and the same command hits Xero. Same interface, different backend. That’s the unified API doing its job.

When CLI isn’t the answer

CLIs aren’t universally better. Here’s where the other approaches still win.

MCP is better for tightly scoped, high-frequency tools. If your agent calls the same 5–10 tools hundreds of times per session, the upfront schema cost amortizes well. A customer support agent that only ever looks up tickets, updates status, and sends replies doesn’t need progressive disclosure. It needs those tools ready immediately.

Code execution is better for complex, stateful workflows. If your agent needs to poll an API every 30 seconds, aggregate results across paginated endpoints, or orchestrate multi-step transactions with rollback logic, writing code is more natural than chaining CLI calls. Duet’s approach makes sense for agents that are essentially autonomous developers.

MCP is better when your agent acts on behalf of other people’s users. This is the dimension most CLI-vs-MCP comparisons gloss over, and it’s worth being direct about. When your agent automates your own workflow, ambient credentials are fine. You are the user, and the only person at risk is you. But if you’re building a B2B product where agents act on behalf of your customers’ employees, across organizations those customers control, the identity problem becomes three-layered: which agent is calling, which user authorized it, and which tenant’s data boundary applies. Per-user OAuth with scoped access, consent flows, and structured audit trails are real requirements at that boundary, and they’re requirements that raw CLI auth (gh auth login, environment variables) wasn’t designed to solve. MCP’s authorization model, whatever its efficiency cost, addresses this natively.

That said, the gap is narrower than it looks for unified API architectures. Apideck already centralizes auth through Vault: credentials are managed per-consumer, per-connection, and scoped by service. The --service-id flag targets a specific integration within a specific consumer’s vault. The structural permission system enforces read/write/delete boundaries in the binary. What’s missing is the per-user OAuth consent flow and tenant-scoped audit trail: real gaps, but ones that sit at the platform layer, not the agent interface layer. A CLI can be the interface while a backend handles delegated authorization. These aren’t mutually exclusive.

It’s also worth noting that MCP’s auth story is less settled than it appears. As Speakeasy’s MCP OAuth guide makes clear, user-facing OAuth exchange is not actually required by the MCP spec. Passing access tokens or API keys directly is completely valid. The real complexity kicks in when MCP clients need to handle OAuth flows dynamically, which requires Dynamic Client Registration (DCR), a capability most API providers don’t support today. Companies like Stripe and Asana have started adding DCR to accommodate MCP, but it remains a high-friction integration. The auth advantage MCP has over CLI is real in theory, but in practice, the ecosystem is still catching up to the spec.

CLIs are weaker at streaming and bi-directional communication. A CLI call is request-response. If you need server-sent events, WebSocket streams, or long-lived connections, you’ll want an SDK or MCP server that can hold a connection open.

Distribution has friction. MCP servers can theoretically live behind a URL. CLIs need a binary per platform, updates, and PATH management. For the Apideck CLI, we ship a static Go binary that runs everywhere without dependencies, but it’s still a binary you need to install.

The honest framing: MCP, code execution, and CLIs are complementary tools. The mistake is treating MCP as the universal answer when, for many integration patterns, a CLI does the job with two orders of magnitude less context overhead.

What this means for API providers

If you’re building developer tools in 2026, AI agents are becoming a primary consumer of your API surface. Not the only consumer (human developers still matter), but a rapidly growing one.

A few things are worth considering:

Your OpenAPI spec is too big for a context window. If you have 50+ endpoints, converting your spec to MCP tools will burn the budget of most agent interactions. Think about what a minimal entry point looks like.

Progressive disclosure isn’t just a UX pattern anymore. It’s a token optimization strategy. Give agents a way to discover capabilities incrementally instead of dumping everything upfront.

Structural safety is non-negotiable. Prompt-based guardrails are the security equivalent of honor system parking. Build permission models into your tools, not your prompts. Classify operations by risk level and enforce that classification in code.

Ship machine-friendly output formats. JSON by default in non-interactive contexts. Stable exit codes. Deterministic output. These are documented principles for agentic CLI design, and they matter because your next power user might not have hands.

Further reading

  • MCP vs API – How MCP and REST APIs relate (Apideck blog)
  • API Design Principles for the Agentic Era – Designing APIs with AI agents as first-class consumers
  • Understanding the Security Landscape of MCP – MCP security considerations in depth
  • The MCP Context Tax – Detailed analysis of MCP token overhead
  • Agentic CLI Design: 7 Principles – Design principles for CLIs as agent interfaces
  • MCP vs CLI Benchmark – Scalekit’s head-to-head benchmark data (75 runs, Claude Sonnet 4)

Why One AI Agent Isn’t Enough: Subagent Delegation and Context Drift

TL;DR

  • One AI agent handling everything can become a single point of failure.
  • Context drift leads to inaccuracies as tasks extend.
  • Delegate tasks to subagents for better focus and reliability.
  • Isolation of tasks helps to manage complexity in workflows.

There comes a moment in your AI journey when the initial magic starts to fade. You begin with a gleeful experience: your AI agent can research, summarize, and throw together documentation like it’s nothing. But then, as you progress through longer, more complex tasks, things start to get derailed.

You watch as the agent references constraints it pulled from nowhere, redoes work it has already completed, or presents contradicting outputs. You’re left scratching your head, questioning where everything went wrong because the agent should know better. Spoiler: context drift is likely the culprit, and no, a better prompt won’t fix it.

Understanding Context Drift

A language model doesn’t “remember” like a database does. Instead, it relies on a context window – the conversation or task’s full transcript that it has to keep in mind while generating responses.

Early on, that window is manageable; it’s all clear. But as you go on, it gets crowded. Each interaction adds layer upon layer of noise. Before long, something you established early on is buried under a heap of information accumulated during the task.

Researchers have dubbed this “lost-in-the-middle” behavior. As the context drags on, the model increasingly forgets crucial early decisions, leading to subtle yet significant misalignments in understanding.

The Old Approach: One Agent to Rule Them All

Traditionally, developers run a single agent session for extended tasks, expecting coherent behavior through the task’s progression. But as complexity increases—with tasks like refactoring the authentication layer across multiple files—the agent begins to lose the plot. Early decisions become fuzzy, leading to inconsistent rewrites and inaccurate outputs.

This approach works well for a few short tasks, but as the workload grows, coherence deteriorates.

The New Strategy: Delegating to Subagents

Enter the subagent model. Instead of letting one agent accumulate context for hours, you use a main agent to orchestrate tasks, delegating specific pieces to isolated subagents. Each of these subagents gets a clean context, specific tasks, and all relevant inputs.

Here’s how it works:

  1. The main agent defines the work.
  2. It hands off clear directives to subagents.
  3. Each subagent carries out its task without the distraction of accumulated noise.

Think about how effective teamwork operates. A project lead delegates tasks instead of bogging down in every minuscule detail, allowing for efficient progress and updates.
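A minimal sketch of the pattern in Python, assuming a generic `complete(messages)` chat helper as a stand-in for whatever LLM API you use (it is a placeholder, not a real client). The point is structural: each subagent starts from a clean message list and hands back only a short summary.

def complete(messages: list[dict]) -> str:
    """Placeholder for your actual LLM call (OpenAI, Anthropic, local model, ...)."""
    raise NotImplementedError

def run_subagent(task_brief: str, inputs: str) -> str:
    # Fresh context: only the brief and the relevant inputs, no accumulated history.
    messages = [
        {"role": "system", "content": "You are a focused subagent. Complete only the task described."},
        {"role": "user", "content": f"{task_brief}\n\nRelevant inputs:\n{inputs}"},
    ]
    return complete(messages)

def main_agent(goal: str, subtasks: list[tuple[str, str]]) -> str:
    # The orchestrator keeps only the goal and short summaries,
    # never the subagents' full transcripts.
    summaries = [run_subagent(brief, inputs) for brief, inputs in subtasks]
    wrap_up = [
        {"role": "system", "content": "You are the orchestrating agent."},
        {"role": "user", "content": f"Goal: {goal}\n\nSubagent summaries:\n" + "\n".join(summaries)},
    ]
    return complete(wrap_up)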

Implementing This in Real Life

Let’s look at how this plays out in our content production pipeline. For instance, creating a weekly blog post involves web research, topic selection, draft generation, and much more—all tasks that can bog down a single-session agent if handled together.

Instead of letting everything stack up, we spawn a subagent for the main task. The main agent fires off a clear job description, and the subagent deals with all tools and outputs in its own session. This way, if it hits 80% context utilization mid-task, that’s on the subagent alone.

When the task is done, the main agent gets a clean summary of the actions taken—the context in the main session remains limited and easily manageable.

Even our PR-waiting phases are handled by isolated cron jobs. Each runs one action and terminates, avoiding any accumulated state.

The minor overhead for setting up these subagents turns into a significant reliability improvement over time.

Key Benefits from Delegation

  1. Maintain Task Integrity: Long tasks won’t degrade mid-execution. Each subagent operates with its own focused context.
  2. Clearer Error Management: When something goes wrong, it’s easier to identify the fault within subagents than untangle errors in a sprawling context.
  3. Boost Efficiency: Multiple subagents can operate simultaneously on non-dependent tasks, significantly shortening overall completion times.
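For the third point, independent subtasks can be fanned out with nothing more exotic than the standard library. A small sketch that reuses the hypothetical `run_subagent` helper from the earlier example:

from concurrent.futures import ThreadPoolExecutor

def run_parallel_subagents(subtasks: list[tuple[str, str]], max_workers: int = 4) -> list[str]:
    # Each subagent call is an independent request with its own isolated context,
    # so non-dependent tasks can safely overlap.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_subagent, brief, inputs) for brief, inputs in subtasks]
        return [f.result() for f in futures]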

A Word of Caution

Of course, there’s no free lunch. Implementing subagent delegation does add complexity. Figuring out what each subagent needs can be tricky. Too much detail can overload their context, and too little can lead to assumptions they shouldn’t be making.

However, the investment in clarity pays off. Ensuring structured handoffs and defining success criteria means that even if there’s a slight hiccup during delegation, the isolated nature of the subagents saves the day.

Conclusion

Running everything through one agent is a risky approach. As tasks grow lengthy and convoluted, the ballooning context becomes an Achilles’ heel. Subagents, by contrast, offer tailored attention to each task. What’s more, they create a more resilient system that keeps your main agent focused.

If you’re looking to alleviate context drift in your workflows, consider shifting to a subagent architecture. It may well be the change your automation strategy needs.

Have you tried creating multi-agent workflows? What’s worked best for you in managing handoff details? Share your thoughts in the comments!

This article was originally published on OctoClaw. Read the full breakdown on the OctoClaw blog. OctoClaw provides turnkey cloud-hosted OpenClaw instances — up and running in minutes, no self-hosting pain.

5 RAG Architecture Mistakes That Kill Production Accuracy (And How to Fix Them)

I’ve built RAG systems that hit 96.8% retrieval accuracy in production. I’ve also built ones that started at 40% and needed emergency rewrites. The difference wasn’t the LLM — it was the architecture decisions made before any model was chosen.

Here are the five mistakes I see most often when teams take RAG from prototype to production.

1. Treating Chunking as an Afterthought

Most tutorials show you how to split documents into 512-token chunks with 50-token overlap and move on. This works for demos. It fails catastrophically on real business documents.

The problem: A contract clause that spans three paragraphs gets split across two chunks. Neither chunk contains the complete clause. The LLM gets partial context and hallucinates the rest.

What actually works:

Use semantic chunking that respects document structure. For structured documents (contracts, legal filings, compliance reports), chunk by logical section — not by token count. A 2,000-token chunk that contains a complete clause is far more useful than four 500-token chunks that fragment it.

# Bad: fixed-size chunking
chunks = text_splitter.split_text(document, chunk_size=512)

# Better: structure-aware chunking
chunks = split_by_sections(document, 
    section_markers=["Article", "Section", "Clause"],
    max_chunk_size=2048,
    preserve_hierarchy=True
)

In production I use a tiered approach: heading-aware splitting for structured documents, semantic similarity-based splitting for unstructured text, and table-preserving extraction for documents with embedded data.

2. Using Only Vector Search

Pure vector search is great at finding semantically similar content. It’s terrible at exact matches.

Ask a vector database “What is the termination clause in contract #2847?” and it might return clauses from contracts #2845 and #2849 because they’re semantically similar. The user asked for a specific document. Semantic similarity isn’t what they need.

The fix: hybrid search.

Combine vector search (semantic understanding) with keyword search (exact matching). Weight them based on query type:

def hybrid_search(query, documents, vector_weight=0.6, keyword_weight=0.4):
    vector_results = vector_store.similarity_search(query, k=20)
    keyword_results = bm25_search(query, documents, k=20)

    combined = reciprocal_rank_fusion(vector_results, keyword_results)
    return rerank(combined, query, top_k=5)

In my production systems I use pgvector for vector search and pg_trgm for fuzzy keyword matching — both in the same PostgreSQL database. No external services, no sync issues, and the retrieval accuracy jump from pure vector to hybrid was 23 percentage points.

The reranking step matters too. After fusion, run the top candidates through a cross-encoder reranker. This catches the cases where both retrieval methods ranked a mediocre result highly.
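One way to implement that reranking step is with a cross-encoder from the sentence-transformers library. This is a sketch rather than the setup described above: the checkpoint name is just a commonly used public model, and it assumes each candidate carries its chunk text under a "text" key.

from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly, which is slower than
# bi-encoder retrieval but much better at ordering a small candidate set.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(candidates: list[dict], query: str, top_k: int = 5) -> list[dict]:
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_k]]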

3. No Source Attribution (The Hallucination Trap)

If your RAG system returns an answer without showing where it came from, you have a hallucination machine with extra steps.

Users need to verify. Especially in high-stakes domains — legal, financial, compliance, healthcare. If the AI says “the penalty clause states a 5% charge” and there’s no link to the actual clause, nobody can trust it and nobody will use it.

In production, every answer needs:

  • The exact source document and section
  • A relevance confidence score
  • A clear signal when the system doesn’t have enough context to answer

interface RAGResponse {
  answer: string;
  sources: {
    document: string;
    section: string;
    pageNumber: number;
    relevanceScore: number;
    extractedText: string;  // The actual text the answer was derived from
  }[];
  confidence: 'high' | 'medium' | 'low';
  caveat?: string;  // "Based on documents uploaded before March 2026"
}

The confidence field is critical. When retrieval scores are below your threshold, the system should say “I don’t have enough information to answer this reliably” rather than guessing. In production, a confident “I don’t know” is worth more than a plausible-sounding hallucination.
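One way to wire that behaviour in, as a rough sketch: the threshold values are illustrative and would need tuning against your own score distribution, and generate_answer stands in for your actual generation call.

def answer_with_confidence(query: str, results: list[dict], min_score: float = 0.55) -> dict:
    # Keep only retrieval results above the relevance threshold.
    confident = [r for r in results if r["relevanceScore"] >= min_score]

    if not confident:
        return {
            "answer": "I don't have enough information to answer this reliably.",
            "sources": [],
            "confidence": "low",
        }

    answer = generate_answer(query, confident)  # placeholder for the LLM generation step
    top = max(r["relevanceScore"] for r in confident)
    return {
        "answer": answer,
        "sources": confident,
        "confidence": "high" if top >= 0.8 else "medium",
    }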

4. Ignoring Temporal Context

Documents have dates. Policies get updated. Contracts expire. Regulations change.

If your RAG system treats a 2023 compliance policy and a 2026 compliance policy as equally relevant, you’re serving stale information. Worse — in regulated industries, this creates legal liability.

Build temporal awareness into your retrieval pipeline:

  • Store document dates as metadata and use them in filtering
  • When multiple versions of a document exist, default to the most recent unless the user asks for a specific version
  • Add temporal signals to the system prompt: “The following context is from documents dated between X and Y”

def temporal_search(query, date_context=None):
    results = hybrid_search(query)

    if date_context:
        results = filter_by_date_range(results, date_context)
    else:
        # Prefer recent documents, but don't exclude older ones entirely
        results = boost_recent(results, decay_factor=0.95, half_life_days=180)

    return results

This sounds obvious. In practice, I’ve seen production RAG systems serving answers from superseded documents because nobody thought to add date filtering. The fix takes an afternoon. The risk of not doing it is significant.

5. Building a Monolith Instead of a Pipeline

The prototype RAG loop is simple: embed query → search → stuff context → generate answer. Teams ship this and then discover they can’t debug it, can’t improve one component without breaking another, and can’t add features without rewriting everything.

Production RAG is a pipeline, not a function.

Each stage should be independently testable, measurable, and replaceable:

Query Analysis → Retrieval → Reranking → Context Assembly → Generation → Validation

Each stage has its own metrics:

Stage              Key Metric
Query Analysis     Query classification accuracy
Retrieval          Recall@20
Reranking          Precision@5
Context Assembly   Context relevance score
Generation         Answer faithfulness
Validation         Hallucination detection rate

When accuracy drops, you can pinpoint which stage is failing. Is retrieval missing relevant documents? Is reranking promoting the wrong ones? Is the LLM hallucinating despite good context? Each answer leads to a different fix.
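The structural change doesn’t have to be heavyweight. Here is a sketch of the skeleton, with each stage behind its own function boundary so it can be measured and swapped independently; classify_query, assemble_context, generate_answer, and validate_answer are placeholders, while hybrid_search and rerank refer to the earlier snippets.

from dataclasses import dataclass, field

@dataclass
class PipelineTrace:
    """Per-stage artifacts, so each stage can be inspected and evaluated in isolation."""
    query: str
    query_type: str = ""
    retrieved: list = field(default_factory=list)
    reranked: list = field(default_factory=list)
    context: str = ""
    answer: str = ""
    validated: bool = False

def run_pipeline(query: str) -> PipelineTrace:
    trace = PipelineTrace(query=query)
    trace.query_type = classify_query(query)                        # metric: classification accuracy
    trace.retrieved = hybrid_search(query)                          # metric: Recall@20
    trace.reranked = rerank(trace.retrieved, query, top_k=5)        # metric: Precision@5
    trace.context = assemble_context(trace.reranked)                # metric: context relevance
    trace.answer = generate_answer(query, trace.context)            # metric: answer faithfulness
    trace.validated = validate_answer(trace.answer, trace.context)  # metric: hallucination rate
    return trace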

In a 12-component RAG system I built for enterprise document intelligence, this pipeline approach let us iterate individual components without regression — and we could A/B test retrieval strategies independently from generation strategies.

The Uncomfortable Truth About RAG in Production

RAG is not a product. It’s an architecture pattern. The difference between a demo that impresses stakeholders and a system that handles 10,000 queries a day with consistent accuracy is entirely in the engineering.

The LLM is maybe 20% of the work. The other 80% is chunking strategy, retrieval pipeline, temporal handling, source attribution, monitoring, and the unglamorous work of testing edge cases until your accuracy numbers are something you’d stake your reputation on.

If you’re building RAG for production and hitting accuracy walls, the fix is almost always in the retrieval pipeline — not in switching to a bigger model.

Nic Chin is an AI Architect and Fractional CTO specialising in multi-agent systems, RAG architecture, and enterprise AI automation. He provides AI consulting to businesses across the UK, US, Europe, Malaysia, and Singapore. Portfolio and case studies at nicchin.com.

Trust, Two Truths, and the Coming Agent Swarm

Picture a typical workday.

You’re in a meeting. Someone asks the typical question: “So, how was revenue last month?”

You pull up your dashboard and respond, “Looks like we’re up 5%.”

The CFO then opens his laptop, checks his numbers, and says, “Well, from what I’m seeing, we’re down 2%.”

And from that moment on, the meeting stops being about the organization’s next decisions and turns into a comparison game.

Are we using the same date range? Order date or payment date? Gross or net revenue? UTC or local time? How exactly are “real users” defined? Are we looking at the finance mart or raw invoices?

This happens all the time in healthy, well-run companies. Two analysts can produce two clean reports and still disagree. The same metric gets defined differently across teams, and that knowledge never gets consolidated in a single place.

This is often the real problem in analytics. We’re not dealing with “bad data” in some abstract sense, but with a lack of shared meaning. This hidden cost is what you could call the trust tax: the price you pay on every important decision just to prove the numbers are real.

Now here’s the twist. In 2026, we’re moving from dashboards to AI agents and AI-driven analytics. And if two humans can generate two versions of the truth, an agent can generate twenty – fast, confidently, and on demand.

AI won’t fix the inconsistent definitions. It will scale them. And without a stronger foundation in place, you may get answers more quickly, but you’ll spend even more time arguing about which one to trust.

Self-service BI didn’t fail. We just skipped the boring part.

For years, all data professionals talked about was self-service BI and data-driven decision-making.

An endless stack of tools, including Tableau, Power BI, and Looker, was introduced to help explore data and move faster. But even with this technology, the industry kept running into the same problem.

Tool vendors gave everyone access to the library, but the books weren’t organized. Access was democratized, but meaning wasn’t.

So people did what they’ve always done: They created local definitions based on what they needed.

For marketing, “active user” meant someone who visited the website.
For product, it meant someone who completed an action.
For finance, it meant someone who paid.
And for support, it meant someone who hadn’t churned.

Everyone was right – within their own silo.

But we’ve now entered the era of AI, and companies believe that AI will solve their problems, even if the foundational work is skipped. They think they can just put an LLM on top of the chaos and it will magically understand their business logic and align everyone’s definitions.

Spoiler alert: It won’t! AI won’t make the cracks disappear – it will widen them. And worse, it will sound perfectly reasonable while doing so.

The trust tax is not just a metaphor

Though much of this may sound like organizational and behavioral factors, it shows up in very tangible economic terms.

Many studies have recently highlighted the financial impact of poor data quality. Gartner estimates it costs organizations an average of USD 12.9 million per year. According to Thomas C. Redman, the cost to the U.S. economy is USD 3 trillion annually.

In 2025, Alteryx also reported that 76% of analysts still rely on spreadsheets as their primary tool for cleaning and preparing data, revealing that even in the AI era, the most common safety net is still an Excel export.

The trust tax shows up as manual work whose sole purpose is to validate the numbers. And beyond the money, there’s also a loss of confidence in data systems and slower, more fragmented decision-making.

Why AI can’t answer “simple” business questions

People ask: “Why can an agent write a poem but not calculate my churn rate?” Because a poem is a single output, whereas a business metric is a chain.

A typical report rarely boils down to one query. It’s often a process involving 10 to 15 steps – from identifying the right tables to computing ratios. And even if an agent is 90% accurate at each step, the probability of getting the entire chain right drops to around 35%.
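The arithmetic behind that figure is just per-step reliability compounding across the chain:

p_per_step, steps = 0.90, 10
print(p_per_step ** steps)  # 0.348..., i.e. roughly a 35% chance the whole chain is right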

What you end up with is a system that is often plausible, sometimes correct, and occasionally catastrophic. And that assumes “90% per step” performance is even realistic in your environment.

In real enterprise settings, text-to-SQL gets significantly harder. Benchmarks have evolved for a reason. Newer evaluations, such as Spider 2.0-style environments, reflect messier, more realistic conditions, characterized by larger schemas, multi-step reasoning, and hidden assumptions. Performance declines accordingly. 

Lack of trust isn’t a result of AI not being smart enough. It comes from organizations lacking a shared contract for meaning.

The missing contract: A semantic layer

If you want AI to stop guessing, you need to give it something solid to anchor to: a contract.

In analytics, that contract is the semantic layer, which functions as the official dictionary of your business. Definitions like “revenue,” “active user,” and “gross margin” are formally defined in code, each with explicit rules, covering filters, time and currency logic, exclusions, and more.

Instead of letting an agent query raw tables like t_sales_v2_final, invoice_line_items_2021_backup, or prod_users_all_time, you point it to something that reflects business reality.

This isn’t about making data prettier. It’s about removing ambiguity – and that’s precisely the role of a semantic layer. 
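What that contract looks like varies by tool, but the shape is similar everywhere: the metric gets a name, a governed source, an expression, and the rules that usually live in people’s heads. The sketch below is purely illustrative Python, not any particular product’s syntax, and every name in it is made up.

from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    source: str                # a governed model, not a raw table like t_sales_v2_final
    expression: str
    filters: tuple[str, ...]
    time_grain: str            # which date the metric is measured against
    owner: str

NET_REVENUE = MetricDefinition(
    name="net_revenue",
    source="finance.fct_invoices",                 # illustrative model name
    expression="sum(amount_net_usd)",
    filters=("status = 'paid'", "is_test_account = false"),
    time_grain="payment_date",                     # settled once: payment date, not order date
    owner="finance-data-team",
)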

Next week, in a second part, we’ll look at what this means in practice. How the role of data analysts is evolving, what a reliable AI analytics stack looks like, and why semantic infrastructure is becoming one of the key parts of modern data systems.

To be continued…

About Databao

Databao is a new data product from JetBrains that helps data teams create and maintain a shared semantic context and build their own data agents on top of it. Our goal is to provide an AI-native analytics experience that business users can trust, enabling them to query and analyze data in plain language.

Databao’s modular components, the context engine and data agent, can run independently, either locally or within your existing infrastructure, using your own API keys.

We are also inviting data teams to build a proof of concept with us: we’ll explore your use case, define a context-building process, and grant agent access to a selected group of business users. Together, we will then evaluate the quality of responses and the overall value.

TALK TO THE TEAM

What are your goals for the week? #170

Spring Break is over but now the temp has dropped back to the 30s. There are even some flurries. Tennessee false spring. Or as some call it, wrong coat season. Cause whatever coat you grab in the morning will be the wrong one at the end of the day.

What are your goals for the week?

  • What are you building this week?
  • What do you want to learn?
  • What events are you attending this week?

This Week’s Goals.

  • Job Search.
    • Network
    • Apply
  • Project work.
    • Content for side project.
    • Work on my own project.
    • Use my Content & Project Calendar for March.
  • Blog
  • Events.
    • Wednesday State of React
    • Thursday Virtual Coffee
  • Run a goal setting thread on Virtual Coffee (VC) Slack.

How I did last week.

  • Job Search.
    • ✅ Network
    • ✅ Apply
  • Project work.
    • Content for side project.
    • Work on my own project.
    • ✅ Use my Content & Project Calendar for March.
  • ❌ Blog
  • Events.
    • Thursday Virtual Coffee
    • ✅ Friday LinkedIn event
  • ✅ Run a goal setting thread on Virtual Coffee (VC) Slack.

Your turn, what do you plan to do this week?

  • What are you building this week?
  • What do you want to learn?
  • What events are you attending this week?

Cover image is my LEGO photography. It’s Daffy Duck working at a desktop computer. In the background is a mini rubber duck.

-$JarvisScript git commit -m "edition 170"

Sunsetting Code With Me

Code With Me has been part of JetBrains IDEs for years, providing real-time collaborative coding and pair programming directly inside your development environment. It enabled teams to share a workspace, tackle issues together, and learn from one another without leaving the IDE.

Today, we’re announcing plans to gradually sunset Code With Me.

In this post, we’ll explain why we’re making this change, what the sunset timeline looks like, and what it means for existing users. We’ll also outline how the transition will work and answer common questions in the FAQ below to help make the process as smooth as possible.

Why we’re making this change

Demand for built-in pair programming and real-time collaboration tools like Code With Me peaked during the pandemic and has since shifted, with many teams adopting different collaboration workflows. At the same time, maintaining Code With Me alongside the evolving IntelliJ Platform requires ongoing engineering investment.

After reviewing usage trends and the long-term direction of our IDEs, we’ve decided to discontinue Code With Me. This will allow us to focus our efforts on areas that deliver the most value to developers and align with how teams collaborate today.

Timeline and what to expect

2026.1 release

  • Code With Me will be unbundled from all JetBrains IDEs and made available as a separate plugin via JetBrains Marketplace.
  • 2026.1 will be the last IDE release to officially support Code With Me.
  • No new features will be developed from this point forward.

Transition period (2026.1 → Q1 2027)

  • The plugin will continue to function on supported IDE versions.
  • Security updates will be provided during this period.
  • Public relay infrastructure will remain operational.

Final shutdown (Q1 2027)

  • The public relay infrastructure will be turned off.
  • The service will be fully deactivated.

What this means for existing Code With Me users

We understand that Code With Me is part of some teams’ workflows, and we want to make this transition as smooth as possible. 

If you currently use Code With Me:

  • You can continue installing and using the plugin from JetBrains Marketplace throughout the transition period outlined above.
  • It will work on supported IDE versions, with security updates provided until the final sunset date in Q1 2027.
  • Existing subscriptions will remain active until support ends. New sales and renewals will be discontinued.
  • Our Support team will remain available during the transition to assist with any questions or compatibility concerns.

Depending on your workflow, you may find that general-purpose collaboration tools cover your needs. If your primary use case is remote access to development environments, our remote development features, which have been improved significantly in recent releases, may be a better fit.

For more details, see the FAQ below. If you have questions or feedback, please leave a comment or contact our Support team.

Looking ahead

As we invest in the future of our tools, we remain focused on delivering tools that support modern software development and bring the greatest value to developers and teams.

We’re grateful to everyone who used Code With Me, shared feedback, and contributed to its journey. Your input has helped shape the product. Thank you!

The JetBrains team

FAQ

Below, we have compiled answers to the most common questions about the discontinuation of Code With Me and the migration options available.

What is the last IDE release to support Code With Me?

2026.1 will be the last IDE release to officially support Code With Me.

Will I still be able to use Code With Me after the 2026.1 release?

Yes, you will be able to use Code With Me on all supported IDE versions for at least one year until Q1 2027. After that, our public relays will be shut down, and public sessions won’t be available anymore.

What alternatives are available for Code With Me users?

Depending on your workflow, you may find that general-purpose collaboration tools cover your needs. If your primary use case is remote access to development environments, our remote development features may be a better fit.

Does this decision affect remote development within JetBrains IDEs?

No. The discontinuation of Code With Me does not affect JetBrains IDEs’ remote development functionality.

We continue to actively invest in and evolve our remote development capabilities as part of the IntelliJ Platform. Remote development remains a strategic focus area, and progress in this direction will continue.

What will happen to my Code With Me license?

You can continue using Code With Me on supported IDE versions until the end of your current subscription term. Code With Me subscriptions will not renew.

For more details about your specific subscription, please contact our Support team.

I’m using Code With Me Enterprise. What does this mean for me?

If you are using Code With Me as part of a JetBrains IDE Services (Enterprise) agreement, your current contract terms remain valid during the supported period.

As we approach the end of the sunset period, renewal of Code With Me Enterprise will no longer be available. For contracts with specific provisions or custom arrangements, we will work individually to define the appropriate transition path.

If you have questions about how this change affects your agreement, please contact your JetBrains representative.

What should I do if I recently purchased a Code With Me license?

Our standard refund policy applies to recent purchases. If you have questions about your eligibility for a refund, please contact JetBrains support.

Where can I find more information or assistance?

For any further questions or support inquiries, please visit our Support page or reach out to us directly. We sincerely appreciate the Code With Me community’s support and look forward to continuing to provide the best solutions within our JetBrains IDEs.

Kotlin 2.3.20 Released

The Kotlin 2.3.20 release is out! Here are the main highlights:

  • Gradle: Compatibility with Gradle 9.3.0, and Kotlin/JVM compilation now uses the Build Tools API by default.
  • Maven: Simplified setup for Kotlin projects.
  • Kotlin compiler plugins: The Lombok plugin is in Alpha, and JPA support in the kotlin.plugin.jpa plugin has been improved.
  • Language: Support for name-based destructuring declarations.
  • Standard library: New API for creating immutable copies of Map.Entry.
  • Kotlin/Native: New interoperability mode for C and Objective-C libraries.

For the complete list of changes, refer to What’s new in Kotlin 2.3.20 or the release notes on GitHub.

How to install Kotlin 2.3.20

The latest version of Kotlin is included in the latest versions of IntelliJ IDEA and Android Studio.

To update to the new Kotlin version, make sure your IDE is updated to the latest version and change the Kotlin version to 2.3.20 in your build scripts.

If you need the command-line compiler, download it from the GitHub release page.

If you run into any problems:

  • Find help on Slack (get an invite).
  • Report issues to our issue tracker, YouTrack.

Stay up to date with the latest Kotlin features! Subscribe to receive Kotlin updates by filling out the form at the bottom of this post. ⬇️

Special thanks to our EAP Champions

  • Alexander Nozik
  • Benoit Lubek
  • Bernd Prünster
  • David Lopez
  • Dayan Ruben
  • Florian Schreiber
  • Johannes Svensson
  • Josh Stagg
  • Łukasz Wasylkowski
  • Rick Clephas
  • Sechaba Mofokeng
  • Yang
  • Zac Sweers

Further reading

  • What’s new in Kotlin 2.3.20 documentation
  • Kotlin 2.3 compatibility guide
  • Kotlin EAP Champions

Share Your Opinion of Qodana for the Chance to WIN!

At Qodana, we’re always looking for ways to make our code quality platform more useful for development teams. Whether you’re using it to automate code reviews, enforce quality gates in CI, or monitor the health of your codebase, your feedback plays a vital role in helping us improve our product and how we provide features and functionality to you.

That’s why we’re inviting Qodana users and trial users (past and present) to give us your feedback!

Take Survey & Enter

Why it matters

As development workflows evolve, so do the expectations around code quality, security, automation, and visibility. We want to better understand how teams are currently using Qodana, what challenges you’re solving with it, and where we can make the biggest improvements.

Your responses will help us:

  • Improve existing features
  • Prioritise future development
  • Better support the workflows of DevOps teams, developers, and engineering leaders
  • Ensure Qodana continues to integrate smoothly with modern CI/CD pipelines

The survey should only take approximately 8 minutes to complete, and your insights will directly influence how Qodana evolves.

Win one of 25 spot prizes

As a thank you for your time, participants will be entered into a draw to win one of 25 spot prizes. The prizes include:

  • One-year All Products Pack personal subscription
  • One-year personal license for a JetBrains IDE of your choice
  • USD 100 coupon for the JetBrains Merchandise Store

These prizes are designed to give you access to some of the tools developers love most across the JetBrains ecosystem, as well as something you can use to bring more joy into your work every day.

Take the Survey

If you’re using Qodana or exploring code quality automation, we’d love to hear from you.

Take Survey & Enter