[AutoBe] Qwen 3.5-27B Just Built Complete Backends from Scratch — 100% Compilation, 25x Cheaper


We ran Qwen 3.5-27B on 4 backend generation tasks — from a todo app to a full ERP system. Every single project compiled. The output was nearly identical to Claude Opus 4.6's, at 25x lower cost.

This is AutoBe — an open-source system that turns natural language into complete, compilable backend applications.

AutoBe generating a Shopping Mall backend with Qwen 3.5-27B

1. Generated Examples

All generated by Qwen 3.5-27B. All compiled. All open source.

  • Todo
  • Reddit
  • Shopping

    • Entity Relationship Diagram
    • API Schema
    • Controller
    • E2E Test
  • ERP (Enterprise Resource Planning)

From a simple todo app to a full-scale ERP system. Each includes a database schema, OpenAPI spec, API implementation, E2E tests, and a type-safe SDK.

2. The Benchmark

Benchmark: 11 AI models all scoring near-identically on backend generation

11 models benchmarked. Scores are nearly uniform — from Qwen 3.5-27B to Claude Sonnet 4.6.

A 27B model shouldn’t match a frontier model. So why are the outputs identical? Because the compiler decides output quality — not the model.

3. Cost

Model                       | Input / 1M tokens | Output / 1M tokens
Claude Opus 4.6             | $5.00             | $25.00
Qwen 3.5-27B (OpenRouter)   | $0.195            | $1.56

~25x cheaper on input. ~16x on output. Self-host Qwen and it drops to electricity.

4. How Is This Possible?

AutoBe doesn't generate code as raw text. Instead, LLMs fill in the AST structures of AutoBe's custom-built compilers through a function-calling harness.

AutoBe's 4 compiler AST pipeline — Database, OpenAPI, Test, and Hybrid compilers validating LLM output through function calling

Four compilers validate every output, and when something fails, the compiler’s diagnoser feeds back exactly what broke and why. The LLM corrects only the broken parts and resubmits — looping until every compiler passes.

This harness is tight enough that model capability differences don’t produce quality differences. They only affect how many retries it takes — Claude Opus gets there in 1-2 attempts, Qwen 3.5-27B in 3-4. Both converge to the same output. That’s why the benchmark distribution is so uniform.
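The retry loop can be sketched in a few lines of Python. Everything here (the `generate_ast` and `compile_ast` callables, the diagnostics shape) is an illustrative stand-in, not AutoBe's actual API:

```python
# Illustrative verify-and-retry harness. generate_ast and compile_ast are
# stand-ins for an LLM function call and a validating compiler; the real
# AutoBe internals are not shown here.

def build_until_valid(generate_ast, compile_ast, max_attempts=10):
    """Loop until the compiler accepts the output, feeding diagnostics back."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        ast = generate_ast(feedback)        # model fills in the AST structure
        ok, diagnostics = compile_ast(ast)  # compiler validates the result
        if ok:
            return ast, attempt             # converged: every check passes
        feedback = diagnostics              # report exactly what broke and why
    raise RuntimeError(f"did not converge within {max_attempts} attempts")
```

A stronger model exits this loop in fewer iterations, but any model that eventually satisfies `compile_ast` produces the same validated output, which is the article's point.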

“If you can verify, you converge.”

5. Coming Soon: Qwen 3.5-35B-A3B

Qwen 3.5-35B-A3B benchmark showing near-complete compilation success

Only 3B active parameters. Not at 100% yet — but close.

When it gets there: 77x cheaper, running on a normal laptop.

No cloud. No high-end GPU. Just your machine building entire backends.

6. Try It

git clone https://github.com/wrtnlabs/autobe
cd autobe
pnpm install
pnpm playground

Star the repo if this is useful: https://github.com/wrtnlabs/autobe

7. Deep Dives

  • Function Calling Harness: From 6.75% to 100%
  • AutoBe vs. Claude Code: 3rd-Gen Coding Agent

Claude Code Leak: Why Every Developer Building AI Systems Should Be Paying Attention

“If your code gets exposed, how much damage can someone actually do?”
That’s the question I kept coming back to when the Claude Code discussions started surfacing across developer forums and security channels in early 2025. Reports indicated that portions of internal tooling, module structure, and system architecture associated with Anthropic’s Claude Code — an agentic coding assistant built on Claude — were exposed or reconstructable through a combination of leaked artefacts and reverse engineering.
And before the “it’s just a leak” crowd closes this tab: I want to make the case that this one is different. Not because of who it happened to. But because of what got exposed and why that matters for every developer building AI-driven products right now.

What the Claude Code Leak Actually Involved

To be precise: this wasn’t a single catastrophic breach where source code was dumped publicly. What made this incident notable was the partial exposure of internal system architecture — things like file structure, module naming conventions, agent workflow patterns, and tool orchestration logic.

In traditional software, a leaked file structure is mildly embarrassing. In an AI system, it’s a blueprint.

Here’s why. When you expose:

  • File structure → you reveal how the system is decomposed and what abstractions it uses
  • Module naming → you signal what capabilities exist and how they’re scoped
  • Agent workflow patterns → you expose the decision-making logic and tool-call sequences
  • Safety layer positioning → you reveal where guardrails sit, which tells an attacker where they don’t

Understanding the system architecture of an AI agent doesn’t just tell you how it works. It tells you exactly how to manipulate it.

Why AI Codebases Are Uniquely Vulnerable

Traditional application security assumes a relatively stable attack surface. You protect your API, your auth layer, your database. You patch CVEs. You rotate secrets.

AI systems change that calculus fundamentally. The attack surface in an LLM-powered system includes things that don’t exist in conventional software:

  1. Prompt Engineering as Infrastructure

In a standard app, business logic lives in code. In an AI system, a significant portion of business logic lives in prompts — system prompts, tool descriptions, chain-of-thought scaffolds. These are text, often stored as strings or markdown files. They’re not compiled. They’re not obfuscated. And they encode your product’s entire decision-making philosophy.

Expose a system prompt and you expose the rules of the game. An attacker can now craft inputs that navigate around your guardrails with surgical precision instead of brute force.

  2. Tool Orchestration Is a Dependency Graph

Modern AI agents don’t just generate text — they call tools. Search, code execution, file access, API calls. The orchestration logic that decides when to call which tool, and with what parameters, is often the most competitively sensitive part of the system.

Leaking that orchestration logic is the equivalent of leaking your microservices architecture and your internal API contracts simultaneously.

  3. Safety Layers Are Positional

In a well-designed AI system, safety measures are layered — input filtering, output validation, human-in-the-loop triggers, rate limiting. But these layers have positions in the pipeline. Once an attacker knows where a guardrail sits, they know what comes before it and what comes after it. They can craft inputs that appear clean at the filter point and only reveal their intent downstream.

This is why security through obscurity, while never a strategy to rely on, is far more costly to lose in an AI system than in a traditional one.

A Hypothetical Attack Scenario

Let’s make this concrete. Imagine you’ve built a customer-facing AI assistant for a SaaS product. Your system architecture includes:

  1. An input classifier that blocks obvious jailbreak attempts
  2. A system prompt that defines the assistant’s role and access permissions
  3. Tool calls that can query your internal database and send emails on behalf of users

Now imagine a researcher (or attacker) reverse-engineers enough of your architecture to know:

  • Your input classifier runs before the system prompt is injected
  • Your tool-call permissions are enforced by a description in the system prompt, not by a hard-coded permission layer
  • Your email tool doesn’t validate the recipient domain

With that knowledge, they don't need to brute-force anything. They craft a single, clean-looking input that passes your classifier, then use indirect prompt injection to override your system prompt's tool-permission language and trigger an email to an external domain.

That’s not a theoretical attack. Variants of it have been demonstrated in research settings against production AI systems. The Claude Code leak is notable because it suggests even well-resourced AI labs can have enough of their internals reconstructable to enable this kind of targeted exploitation.

The “Systems Still Under Development” Problem

Here’s the angle that worries me most as someone actively building an AI product.

When a mature, production-hardened system gets partially exposed, it’s bad — but the blast radius is somewhat contained. The security assumptions have been tested. The edge cases have been handled. The architecture is, at least in theory, stable.

When a system still under active development gets exposed, the attacker doesn’t just find bugs. They find intentions.

They find the module you haven’t wired up yet. The permission check that’s commented out during testing. The hardcoded API key in the dev config. The tool that’s been scaffolded but not yet rate-limited.

Early-stage AI systems — which describes most of what the developer community is building right now — are architecturally porous by design. Speed of iteration is the priority. Security hardening comes later. The Claude Code incident is a reminder that “later” has a way of arriving before you’re ready.

How to Actually Build for This

These aren’t abstract recommendations. Here’s what I’d implement on any AI system today:

Design for the Inevitable Breach

Assume your prompts, your tool descriptions, and your agent workflows will eventually be exposed. Design them such that exposure doesn’t immediately translate to exploitation. This means:

  • No security by prompt alone. Permissions enforced only in a system prompt are not permissions — they’re suggestions. Enforce access control at the infrastructure layer.
  • Validate tool inputs at the tool level. Don’t rely on the LLM to self-police what parameters it passes to your tools. Treat every tool call as an untrusted external input.
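The second point is concrete enough to show in code. Here is a minimal sketch, with hypothetical `send_email`/`deliver` names, of what "validate at the tool level" looks like for the email tool from the scenario above — the permission check lives in code, where no prompt injection can reach it:

```python
# Sketch: enforce tool permissions in code, not in the prompt.
# ALLOWED_DOMAINS, send_email, and deliver are hypothetical names.
ALLOWED_DOMAINS = {"example.com"}  # assumption: one internal domain

def deliver(recipient: str, body: str) -> None:
    """Placeholder for the real mail transport."""
    print(f"sent to {recipient}")

def send_email(recipient: str, body: str) -> None:
    """Email tool that validates its own inputs instead of trusting the LLM."""
    domain = recipient.rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_DOMAINS:
        # Hard failure at the infrastructure layer: a jailbroken prompt
        # cannot talk its way past this check.
        raise PermissionError(f"recipient domain {domain!r} is not allowed")
    deliver(recipient, body)
```

The LLM can still *ask* to email anyone; the tool simply refuses. That distinction is the whole difference between a permission and a suggestion.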

Reduce Blast Radius

Segment your agent’s capabilities. An agent that can read files and send emails and make external API calls is a single prompt injection away from a multi-vector breach. Apply least-privilege to tools the same way you’d apply it to IAM roles.

# Instead of one god-agent with all capabilities:
agent.tools = [read_files, send_email, call_api, query_db]

# Scope tools to the task:
research_agent.tools = [read_files, web_search]
comms_agent.tools = [send_email]  # scoped to internal domains only

Treat Internal Architecture as Public

CI/CD configurations, agent workflow diagrams, prompt files — if they live in a repo, on a shared drive, or in a Notion doc accessible to more than three people, treat them as potentially public. Not because your team is untrustworthy, but because attack surfaces compound.

Red Team Your Prompts Before Shipping

Run adversarial prompt testing before any agent capability ships to production. This doesn’t require a dedicated security team — a single afternoon with a structured prompt injection checklist will surface more issues than you expect. Resources like OWASP’s LLM Top 10 are a solid starting point.

Secure the CI/CD Pipeline Specifically

AI systems often have unique CI/CD patterns — model fine-tuning pipelines, prompt version registries, embedding generation jobs. These are as sensitive as your application code and are frequently less scrutinised. Audit what has access to your prompt store and model configuration with the same rigour you’d apply to your production database credentials.

The Uncomfortable Truth About AI Security Maturity

The wider developer community — and I include myself here — is building AI systems at a pace that has significantly outrun our collective security intuition.

We’ve spent decades developing mental models for securing web applications. We know about SQL injection, XSS, CSRF, broken auth. We have frameworks, checklists, and automated tooling.

For AI systems? We’re still writing the playbook. Prompt injection, indirect prompt injection, model inversion, training data extraction, agent goal hijacking — these are real attack classes with real-world implications, and most developers building AI products today have limited formal exposure to any of them.

The Claude Code incident, whatever its precise scope, is valuable as a forcing function. It makes the abstract concrete. It invites the question: if this happened to Anthropic, what’s my exposure?

Final Thought

We’re not just writing code anymore. We’re building systems that reason, plan, and act — often with access to real data, real APIs, and real users.

When a traditional application fails, it crashes. When an AI agent gets exploited, it executes — just not in the direction you intended.

Security for AI systems isn’t a feature you bolt on at the end of the sprint. It’s an architectural decision you make on day one, and revisit every time you add a new tool, a new agent, or a new capability.

The Claude Code leak is a reminder that no one is immune. The question is whether it changes how you build.

What’s your current approach to securing AI agents in production? Drop a comment — I’d genuinely like to know what others are doing.

If you found this useful, follow for more on building real-world AI systems — covering architecture, security, and the hard lessons from shipping.

Spread vs Rest Operators in JavaScript

JavaScript gives us powerful tools to work with data more easily—and two of the most useful are the spread (...) and rest (...) operators.

Even though they look the same, they behave very differently depending on where you use them. Let’s break it down step by step.

What the Spread Operator Does

The spread operator is used to expand (spread out) values. Think of it like unpacking items from a box.

Example with Arrays

const numbers = [1, 2, 3];
const newNumbers = [...numbers, 4, 5];

console.log(newNumbers);
// [1, 2, 3, 4, 5]

Here, ...numbers takes each element and expands it into the new array.

What the Rest Operator Does

The rest operator does the opposite—it collects multiple values into one.
Think of it like packing items into a box.

function sum(...nums) {
  return nums.reduce((total, num) => total + num, 0);
}

console.log(sum(1, 2, 3, 4));
// 10

...nums collects all arguments into a single array.

Using Spread with Arrays and Objects

Arrays:

const arr1 = [1, 2];
const arr2 = [3, 4];

const combined = [...arr1, ...arr2];

console.log(combined);
// [1, 2, 3, 4]

Objects:

const user = { name: "Alice", age: 25 };

const updatedUser = {
  ...user,
  age: 26
};

console.log(updatedUser);
// { name: "Alice", age: 26 }

Spread is commonly used to copy and update data without modifying the original.

Use Cases:

1. Copying Arrays (Without Mutation)

const original = [1, 2, 3];
const copy = [...original];

2. Merging Objects

const defaults = { theme: "light" };
const settings = { theme: "dark", fontSize: 16 };

const finalSettings = { ...defaults, ...settings };

3. Passing Arguments to Functions

const nums = [5, 10, 15];

Math.max(...nums); // 15

4. Extracting Values with Rest

const [first, ...others] = [10, 20, 30, 40];

console.log(first);  // 10
console.log(others); // [20, 30, 40]

How to Master SQLAlchemy I/O: Testing Queries in CI to Prevent Database Disasters 🚨

It’s 3:00 AM. Your pager is screaming.

The application is completely unresponsive, the database CPU is pegged at 100%, and connection pools are exhausted. Desperate customers with critical systems offline are flooding the support channels. To stop the bleeding, your team scales up to the biggest AWS RDS instance available, literally burning thousands of USD per minute just to keep the lights on.

You scramble to find the root cause, expecting a massive infrastructure failure. Instead, you find a single, seemingly harmless Python loop that was recently deployed.

Your CI pipeline was completely green. All the unit tests passed. The API returned the correct JSON schema. But beneath that green checkmark, your ORM was quietly executing 5,000 individual SELECT statements per request.

Testing what your application does is no longer enough. If you aren’t testing how it communicates with your database, you are exposing your business to catastrophic financial and operational risk. Let’s explore how to take control of your execution footprint.

🏢 The Cultural Divide: Whose Problem is the Database?

For years, software development has suffered from a toxic, siloed mentality: “Writing the code is my job; the database performance is the DBA’s problem.”

This culture is a massive financial liability. C-Level executives are painfully aware that the “database black box” directly inflates cloud infrastructure bills. You cannot simply throw more expensive AWS compute power at poorly optimized I/O.

At the same time, developers are constantly pushed to deliver features faster, relying heavily on Object-Relational Mappers (ORMs) to abstract away the SQL layer. But abstractions are not magic. Building a resilient engineering culture requires developers to take absolute ownership of their execution footprint. You must understand the exact cost of the code you write.

🛡️ Engineering Excellence Disclaimer

Let’s get one thing straight: SQLAlchemy is not slow. When a database reaches a critical state, it is almost never the fault of the ORM itself. The ORM is doing exactly what you commanded it to do.

Modern engineering demands “agnostic generalist specialists.” You do not need to be a DBA, but you must understand relational mechanics and make architectural decisions about your I/O layer:

  • The Python GC & Object Hydration Trap: SQLAlchemy does far more than just translate Python to SQL. It manages an IdentityMap, tracks the “dirty state” of every record, and hydrates complex Python objects. If you lazily load 10,000 rows as full ORM models instead of lightweight tuples, you aren’t just stressing the database—you are suffocating Python’s memory. When the Garbage Collector (GC) eventually kicks in to clean up thousands of discarded objects, your application’s CPU will spike and the event loop will stall. You must know when to yield raw tuples or use load_only.
  • The JOIN Illusion: It is a common misconception that a massive JOIN is always the best way to avoid an N+1 problem. While a JOIN utilizes database indexes efficiently, it can easily destroy your networking performance. If you join a root table with a heavily populated child table, the database sends the root data duplicated across every single row over the network. This Cartesian explosion causes terrible I/O bottlenecks.
  • The Two-Query Strategy: Often, it is vastly superior to execute a first query, aggregate the IDs in memory, and then execute a second query using an IN (...) clause. This completely eliminates the N+1 problem while keeping the network payload incredibly lean.
  • Virtual Tables and Pushdown Logic: When dealing with heavy aggregations, doing the math in Python memory is a critical mistake. It is almost always better to create a virtual table (like a View or a CTE) to push the computational weight down to the database engine, returning only the final, lightweight result to your application.
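The two-query strategy above can be sketched with SQLAlchemy Core. The `panels`/`sensors` tables here are hypothetical stand-ins; the IN (...) second fetch is the technique:

```python
# Sketch of the two-query strategy with SQLAlchemy Core and in-memory SQLite.
# The panels/sensors tables are hypothetical; the IN (...) fetch is the point.
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, Table,
                        create_engine, insert, select)

metadata = MetaData()
panels = Table("panels", metadata, Column("id", Integer, primary_key=True))
sensors = Table(
    "sensors", metadata,
    Column("id", Integer, primary_key=True),
    Column("panel_id", Integer, ForeignKey("panels.id")),
)

engine = create_engine("sqlite://")
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(panels), [{"id": i} for i in range(1, 4)])
    conn.execute(insert(sensors),
                 [{"id": i, "panel_id": (i % 3) + 1} for i in range(1, 7)])

with engine.connect() as conn:
    # Query 1: fetch the root rows and aggregate their IDs in memory.
    panel_ids = [row.id for row in conn.execute(select(panels.c.id))]
    # Query 2: one IN (...) fetch for all children. No N+1, and no duplicated
    # parent columns shipped over the network as a big JOIN would send.
    children = conn.execute(
        select(sensors).where(sensors.c.panel_id.in_(panel_ids))
    ).all()

print(len(children))  # all 6 child rows arrive in exactly two round trips
```

Two round trips, regardless of how many parents match, and each byte on the wire is paid for exactly once.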

You must be in control of these decisions. pytest-capquery exists to make this invisible I/O battle visual. It puts you in control, commander.

💡 The Solution: Bridging the Gap with pytest-capquery

We need a way to incentivize developers to care about database I/O without forcing them to manually write and maintain brittle, hardcoded SQL assertions in their test suites.

This is why pytest-capquery was created. It intercepts the SQLAlchemy engine at the driver level, providing a strict, chronological timeline of your application’s execution footprint.

  • For the Business: This is about protecting the bottom line. Catching a database regression in CI preserves your system’s SLA and safeguards your customer reputation. You avoid emergency weekend patches, furious customers with offline security panels, and the sheer financial drain of desperately scaling up your cloud infrastructure just to keep the platform breathing.
  • For Developers: It uses a zero-friction snapshot workflow. You don’t write SQL strings; the test suite generates them for you. If an N+1 regression occurs, the test fails immediately. You use the snapshot as a debugging mechanism to continuously improve your query logic.
  • For DBAs: It automatically generates physical .sql files. DBAs can review these raw SQL artifacts during Pull Requests to validate query plans and indexes without ever reading a line of Python code.

🛠️ Getting Started: Proving Your Execution Footprint

Let’s look at how to protect a critical domain—like monitoring Alarm Panels and their associated Sensors—using a real PostgreSQL integration database.

1. The Setup (conftest.py)

First, we provision a tangible PostgreSQL engine to ensure our tests replicate production-grade execution topologies. We configure the postgres_capquery fixture to intercept the engine.

from typing import Generator
import pytest
from sqlalchemy import create_engine, Engine, text
from sqlalchemy.orm import Session, sessionmaker
from pytest_capquery.plugin import CapQueryWrapper
from pytest_capquery.snapshot import SnapshotManager
from tests.models import Base

@pytest.fixture(scope="session")
def postgres_engine() -> Generator[Engine, None, None]:
    engine = create_engine("postgresql+psycopg2://postgres@localhost:5432/capquery_test")
    Base.metadata.create_all(engine)
    yield engine
    Base.metadata.drop_all(engine)
    engine.dispose()

@pytest.fixture(scope="function")
def postgres_session(postgres_engine: Engine) -> Generator[Session, None, None]:
    SessionMaker = sessionmaker(bind=postgres_engine)
    session = SessionMaker()
    session.execute(text("TRUNCATE TABLE alarm_panels, sensors RESTART IDENTITY CASCADE"))
    session.commit()
    yield session
    session.rollback()
    session.close()

@pytest.fixture(scope="function")
def postgres_capquery(
    postgres_engine: Engine, capquery_context: SnapshotManager
) -> Generator[CapQueryWrapper, None, None]:
    with CapQueryWrapper(postgres_engine, snapshot_manager=capquery_context) as captured:
        yield captured

2. The Test (test_snapshot.py)

Instead of guessing how many queries are executed, we wrap our business logic in the capture(assert_snapshot=True) context manager.

import pytest
from sqlalchemy.orm import joinedload
from tests.models import AlarmPanel, Sensor

pytestmark = pytest.mark.xdist_group("e2e_postgres")

def test_insert_and_select_snapshot(postgres_session, postgres_capquery):
    with postgres_capquery.capture(assert_snapshot=True):
        panel = AlarmPanel(mac_address="00:11:22:33:44:55", is_online=True)
        sensor = Sensor(name="Front Door", sensor_type="Contact")
        panel.sensors.append(sensor)

        postgres_session.add(panel)
        postgres_session.flush()

        queried_panel = (
            postgres_session.query(AlarmPanel)
            .options(joinedload(AlarmPanel.sensors))
            .filter_by(mac_address="00:11:22:33:44:55")
            .first()
        )
        assert queried_panel is not None

3. The Universal Artifact (.sql Snapshot)

When you run your test suite, pytest-capquery generates this exact file. This is the ultimate source of truth. If a developer accidentally alters the fetching strategy and destroys your networking performance, the test will instantly fail because the query structure and count will deviate from this approved baseline.

-- CAPQUERY: Query 1
-- EXPECTED_PARAMS: None
-- PHASE: 1
BEGIN

-- CAPQUERY: Query 2
-- EXPECTED_PARAMS: {'mac_address': '00:11:22:33:44:55', 'is_online': True}
-- PHASE: 1
INSERT INTO alarm_panels (mac_address, is_online)
VALUES (%(mac_address)s, %(is_online)s) RETURNING alarm_panels.id

-- CAPQUERY: Query 3
-- EXPECTED_PARAMS: {'panel_id': 1, 'name': 'Front Door', 'sensor_type': 'Contact'}
-- PHASE: 1
INSERT INTO sensors (panel_id, name, sensor_type)
VALUES (%(panel_id)s, %(name)s, %(sensor_type)s) RETURNING sensors.id

-- CAPQUERY: Query 4
-- EXPECTED_PARAMS: {'mac_address_1': '00:11:22:33:44:55', 'param_1': 1}
-- PHASE: 1
SELECT anon_1.alarm_panels_id AS anon_1_alarm_panels_id,
       anon_1.alarm_panels_mac_address AS anon_1_alarm_panels_mac_address,
       anon_1.alarm_panels_is_online AS anon_1_alarm_panels_is_online,
       sensors_1.id AS sensors_1_id,
       sensors_1.panel_id AS sensors_1_panel_id,
       sensors_1.name AS sensors_1_name,
       sensors_1.sensor_type AS sensors_1_sensor_type
FROM
  (SELECT alarm_panels.id AS alarm_panels_id,
          alarm_panels.mac_address AS alarm_panels_mac_address,
          alarm_panels.is_online AS alarm_panels_is_online
   FROM alarm_panels
   WHERE alarm_panels.mac_address = %(mac_address_1)s
   LIMIT %(param_1)s) AS anon_1
LEFT OUTER JOIN sensors AS sensors_1 ON anon_1.alarm_panels_id = sensors_1.panel_id

🚀 Stop Guessing, Start Asserting

The database is the beating heart of your application. Leaving its performance up to chance and ORM black boxes is no longer an option.

By integrating tools like pytest-capquery into your CI pipeline, you transform performance testing from an afterthought into a rigorous, automated standard. You protect your cloud budget, you give your DBAs the transparency they desperately need, and you empower yourself to truly command the systems you build.

Stop guessing your execution footprint. Profile your test suite today:

🔗 fmartins/pytest-capquery on GitHub

pip install pytest-capquery

Together we can do more! If you care about engineering excellence and robust testing, jump into the repository. Issues, discussions, and Pull Requests are always welcome. Let’s build a culture that respects the database.

dotInsights | April 2026

Did you know? You can use LINQ to XML to write queries in a readable and strongly-typed way directly against an XML document, making it one of the most intuitive ways to deal with XML in .NET.


Welcome to dotInsights by JetBrains! This newsletter is the home for recent .NET and software development related information.

🔗 Links

Here’s the latest from the developer community.

  • 7 Testing Myths Every Software Developer Should STOP Believing 🎥 – Emily Bache
  • From 3 Worktrees to N: How AI Agents Changed My Parallel Development Workflow on Windows – Laurent Kempé
  • records ToString and inheritance – Steven Giesel
  • Coding isn’t the hard part… 🎥 – CodeOpinion by Derek Comartin
  • 5 UX Tips for .NET MAUI Developers – Leomaris Reyes
  • I Don’t Know If I’d Recommend Software Development Anymore 🎥 – Gui Ferreira
  • Splitting the NetEscapades.EnumGenerators packages: the road to a stable release – Andrew Lock
  • Daniel Ward: AI Agents – Episode 393 – Jeffrey Palermo hosts Daniel Ward
  • Behavioural Inference: How I Learned to Stop Worrying and Love Probabilistic Systems – Scott Galloway
  • Creating case-sensitive folders on Windows using C# – Gérald Barré
  • AI Benefits – But at What Cost? – Steve Smith
  • A Primer on Using Agent Skills 🎥 – The AI Daily Brief: Artificial Intelligence News
  • How C# Strings Silently Kill Your SQL Server Indexes in Dapper – Kevin Griffin
  • How to Implement Prototype Pattern in C#: Step-by-Step Guide – Nick Constantino
  • Writing a .NET Garbage Collector in C#  – Part 8: Interior pointers and Writing a .NET Garbage Collector in C#  – Part 9: Frozen segments and new allocation strategy – Kevin Gosse
  • How To Containerize A Twilio App With Docker – Dylan Frankcom
  • Building a Real-time Audio Processing App with SKSL Shaders in .NET MAUI – Nick Kovalsky
  • How to Create Fillable PDF Forms in C# for Server-Side .NET Apps – Arun Kumar Chandrakesan
  • C# class types explained with examples – David Grace
  • Regular Expression Performance: Supercharge Your Match Counting – David McCarter
  • CoreSync – A .NET library that provides data synchronization between databases – Adolfo Marinucci
  • Software Craftsmanship in the Age of AI – Tim O’Reilly
  • Validation Options in Wolverine – Jeremy D. Miller
  • How to Organize Minimal APIs – Assis Zang
  • When NOT to use the repository pattern in EF Core – Ali Hamza Ansari
  • 14 New Features: A Developer Guide for .NET 10 – Dirk Strauss
  • What’s the EXACT Technical Gap That Separates AI SUCCESS From AI FAILURE? – Dave Farley and Steve Smith at Modern Software Engineering
  • What 81,000 people want from AI – Anthropic

☕ Coffee Break

Take a break to catch some fun social posts.

You just know this is happening in some company out there…

10x engineers.

Rules of code…

🗞️ JetBrains News

What’s going on at JetBrains? Check it out here:

🎉 dotUltimate 2026.1 Release Party 🎉

🎉 ReSharper for Visual Studio Code, Cursor, and Compatible Editors Is Out  🎉

More JetBrains news…

  • ReSharper 2026.1 Release Candidate Released!
  • Rider 2026.1 Release Candidate Is Out!
  • Rider 2026.1: More AI Choice, Stronger .NET Tooling, and Expanded Game Dev Support
  • ReSharper 2026.1: Built-in Performance Monitoring, Expansion to VS Code, and Faster Everyday Workflows

✉️ Comments? Questions? Send us an email. 

Subscribe to dotInsights

Is your AI wrapper a “High-Risk” system? (A dev’s guide to the EU AI Act)

If you’re building AI features right now, you and your team are probably arguing about the tech stack:

  • Should we use LangChain or LlamaIndex?
  • Should we hit the OpenAI API or run Llama 3 locally?

Here is the harsh truth about the upcoming EU AI Act:

Regulators do not care about your tech stack.

They don’t care if it’s a 100B parameter model or a simple Python script using scikit-learn.

The law only cares about one thing:

Your use case.

Why This Matters

Your use case determines your risk category.

If your product falls into the High-Risk category, you are legally required to implement:

  • human oversight
  • risk management systems
  • detailed technical documentation (Annex IV)

Getting this wrong doesn’t just mean “non-compliance”.

It means:

  • failed procurement audits
  • blocked enterprise deals
  • serious regulatory exposure

🔍 5 Real-World AI Scenarios

Here are practical examples to help you understand where your system might fall.

1. AI Chatbot for Customer Support

Use case:

  • routing tickets
  • answering FAQs

Classification:

👉 Limited Risk

Dev requirement:

Add UI elements disclosing that users are interacting with AI.

The trap:

If your bot starts making decisions (e.g. auto-refunds, banning users), you might cross into High-Risk territory.

2. AI for CV Screening / Hiring

Use case:

  • parsing resumes
  • ranking candidates

Classification:

👉 High-Risk (explicitly listed under Annex III)

Dev requirement:

  • bias monitoring
  • human-in-the-loop (HITL) flows
  • full decision logging
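The decision-logging requirement is the easiest one to start on today. A minimal sketch, assuming a hiring-ranker shape (every name here — `DecisionRecord`, `log_decision`, "ranker-v4" — is illustrative, not from the Act):

```python
# Minimal sketch of per-decision audit logging for a High-Risk system.
# All names (DecisionRecord, log_decision, "ranker-v4") are illustrative.
import json
import time
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class DecisionRecord:
    candidate_id: str
    score: float
    model_version: str    # which model produced the score
    features_used: tuple  # inputs the decision relied on
    human_reviewed: bool  # HITL flag for the review workflow
    timestamp: float

def log_decision(record: DecisionRecord, sink: list) -> None:
    """Serialize one decision as a JSON line. In production, sink would be
    append-only storage, not an in-memory list."""
    sink.append(json.dumps(asdict(record)))

audit_log: list = []
log_decision(DecisionRecord(
    candidate_id="c-123",
    score=0.87,
    model_version="ranker-v4",
    features_used=("years_experience", "skills_match"),
    human_reviewed=False,
    timestamp=time.time(),
), audit_log)
print(audit_log[0][:40])  # a JSON line you can replay to explain a decision
```

The point is that each record captures enough — inputs, model version, review status — to reconstruct why a candidate was ranked the way they were, which is the substance of the traceability obligation.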

3. E-commerce Recommendation Engine

Use case:

  • tracking user behavior
  • suggesting products

Classification:

👉 Minimal Risk

Dev requirement:

Almost none under the AI Act (GDPR still applies).

4. AI Credit Scoring System

Use case:

  • determining loan eligibility

Classification:

👉 High-Risk

Dev requirement:

Full traceability — you must be able to explain decisions made by the system.

5. AI Generating Marketing Content

Use case:

  • generating blog posts
  • writing ad copy

Classification:

👉 Minimal to Limited Risk

Dev requirement:

Minimal — unless generating deepfakes (then disclosure/watermarking applies).

🛠️ The Real Risk: Feature Creep

The biggest danger isn’t writing documentation.

It’s this:

Your system can move from Limited Risk to High-Risk with a single merged PR.

A small feature change can completely change your regulatory obligations.

Quick Self-Check

If you’re targeting the EU market, ask yourself:

  • Does my system influence hiring decisions?
  • Does it impact financial outcomes?
  • Does it affect people’s rights or opportunities?

If yes:

👉 You may already be in High-Risk territory.

🧪 A Simple Way to Check

If you’re not sure, I built a free developer tool to calculate this instantly:

👉 https://www.complianceradar.dev/ai-act-risk-classification

No signup required.

Final Thought

Most AI products won’t fail because of bad code.

They’ll fail because of misunderstood regulation.

Understand your risk level early — and build with confidence.

💬 What kind of AI features are you building right now?

Drop your use case below and we can try to classify it together.

Using ACP + Deep Agents to Demystify Modern Software Engineering

This guest post comes from Jacob Lee, Founding Software Engineer at LangChain, who set out to build a coding agent more aligned with how he actually likes to work. Here, he walks through what he built using Deep Agents and the Agent Client Protocol (ACP), and what he learned along the way.

I’ve come to accept that I will delegate an ever-increasing amount of my work as a software engineer to LLMs. I was an early Claude Code superfan, and though my ego still tells me I can write better code situationally than Anthropic’s proto-geniuses in a data center, these days I’m mostly making point edits and suggestions rather than writing modules by hand.

This shift has made me far more productive, but I’ve become increasingly uncomfortable with blindly turning over such a big part of my job to an opaque third party. While training my own model was out of the question for many obvious reasons (and model interpretability is an unsolved problem anyway), the agent harness and UX on top of it is just software, and software IS something I understand. So when I had some free time during my paternity leave, I took a stab at building some tooling to my own specifications.


I work at a startup called LangChain, where we’ve been developing our own set of open-source agentic building blocks, and I settled on building an adapter between our Deep Agents framework and Agent Client Protocol (ACP). My goal was just to build a bespoke coding agent that fit my workflows, but the results were better than I expected. Over the past few months, it’s completely replaced Claude Code as my daily driver, with the added benefit of full observability into my agent’s actions by running LangSmith on top. In this post, I’ll cover how it works and how to set it up for yourself!

Why an IDE + ACP instead of a terminal + TUI?

If you’re not familiar with ACP, it’s an open protocol that defines how a client (most often used with IDEs like WebStorm or Zed) interacts with AI agents. It allows you to do cool things like quickly pass a coding agent the exact context you’re looking at in an IDE.

I’ve gotten quite used to being productive in IDEs over my decade writing software professionally, and I still find them valuable for a few reasons:

  • I do still edit code by hand occasionally. Most often, these are small edits I can make faster than explaining the problem to an agent, or because I can do something in parallel alongside a running agent, like adding debug statements, but this still provides some alpha.
  • IDEs are fantastic interfaces for viewing code in context. I most often use this to understand the general scope of a problem before prompting, or to self-review my current branch, but it’s also often just faster for me to point the agent at a file rather than asking it to grep around.

I previously used Claude Code in a separate terminal pane in an IDE, which worked but always felt like two disconnected tools. In JetBrains IDEs, the agent lives in a native tool window with tight integration. I can @mention the file or block of code I’m currently looking at, and many of my threads are littered with messages like “Take a look at this. Does it look funny? @thisFile“.

How it works

The agent

Though I could have created the various pieces for my agent from scratch, Deep Agents gave me a good, opinionated starting point, providing the following:

  • Tools around interacting with the filesystem (read/write/edit_file, ls, grep, etc.).
  • Shell access, which allows the agent to run verifications like lint, tests, and more.
    • Alongside this, human-in-the-loop support to allow restricting dangerous actions
  • A write_todos tool, which encourages the agent to take a planning step that breaks work into steps and tracks progress.
    • In practice, this makes a big difference for longer refactors to keep the agent focused.
  • Capabilities around spawning isolated sub-agents for parallel or compartmentalized work.
    • Each one gets its own context, runs independently, and reports back, keeping the model’s context window manageable.
  • Other important UX features like streaming, cancellation, prompt caching, and context summarization.

I also added some custom middleware that appends information about the current project setup in the system prompt, such as the current directory open in the IDE, whether a git repo was present, package manager detection, and more.
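As a sketch of what that middleware contributes, here is a minimal context-gathering function. The helper name and the lockfile-based package-manager detection are my own illustration, not the actual Deep Agents middleware API:

```python
from pathlib import Path

def project_context(cwd: str) -> str:
    """Build a system-prompt suffix describing the current project setup.

    Hypothetical sketch: a real middleware would append this string to the
    agent's system prompt on each run.
    """
    root = Path(cwd)
    lines = [f"Current directory: {root}"]
    if (root / ".git").is_dir():
        lines.append("This directory is a git repository.")
    # Naive package-manager detection from common lockfiles
    lockfiles = {
        "package-lock.json": "npm",
        "yarn.lock": "yarn",
        "pnpm-lock.yaml": "pnpm",
        "uv.lock": "uv",
        "poetry.lock": "poetry",
    }
    detected = [mgr for name, mgr in lockfiles.items()
                if (root / name).is_file()]
    if detected:
        lines.append("Detected package manager(s): " + ", ".join(detected))
    return "\n".join(lines)
```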

It’s also possible to add skills, tweak the system prompt, add custom tools or MCP servers, and more, directly in Python, rather than having to create a new CLI config option.

The ACP adapter

After deciding on a basic agent setup, I needed to hook that agent into the client via ACP. I created an adapter that implements the ACP interface and handles the session lifecycle, message routing, model switching, and streaming.

One nice surprise was how cleanly the agent’s capabilities mapped onto ACP concepts.

For example:

  • The agent’s planning step (write_todos) maps naturally to agent plans in ACP.
  • Interrupts from the agent (e.g. “I want to run this command”) map to permission requests.
  • Threads and session persistence were nearly 1:1 with Deep Agents checkpointers.

This meant I didn’t need to invent much glue logic – the protocol already had good primitives for most of what I wanted. The overall agent runner looks roughly like this, minus the tool call and message formatting:

current_state = None
user_decisions = []
while current_state is None or current_state.interrupts:
    # Check for cancellation
    if self._cancelled:
        self._cancelled = False  # Reset for next prompt
        return PromptResponse(stop_reason="cancelled")

    async for stream_chunk in agent.astream(
        Command(resume={"decisions": user_decisions})
        if user_decisions
        else {"messages": [{"role": "user", "content": content_blocks}]},
        config=config,
        stream_mode=["messages", "updates"],
        subgraphs=True,
    ):
        if stream_chunk.__interrupt__:
            # If Deep Agents interrupts, request next actions from
            # the client via ACP's session/request_permission method
            user_decisions = await self._handle_interrupts(
                current_state=current_state,
                session_id=session_id,
            )
            # Break out of the current Deep Agent stream. The while
            # loop above resumes it with the user decisions
            # returned from the session/request_permission method
            break

        # ...translate LangGraph output into ACP
        # Tools that do not require interrupts are called
        # internally; their results are streamed back here as well

        # current_state will be None when the agent has finished
        current_state = await agent.aget_state(config)

return PromptResponse(stop_reason="end_turn")

The human-in-the-loop flow was where I spent the most time. When the agent wants to run a shell command or make a file edit that requires approval, the adapter intercepts the interrupt from Deep Agents, and depending on what permissions mode the user has selected and what they have previously approved, either resumes immediately or sends a permission request to the IDE with options to approve, reject, or always-allow that command type.

The always-allow is session-scoped – if you approve uv sync once and choose “always allow”, subsequent uv sync calls skip the prompt automatically, but I made efforts to prevent similar commands such as uv run script.py from bypassing the permission check.
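A minimal sketch of that session-scoped check might look like the following. The class and method names are hypothetical; the point is matching on the normalized full command rather than a prefix:

```python
import shlex

class SessionPermissions:
    """Session-scoped 'always allow' list keyed on the exact command."""

    def __init__(self) -> None:
        self._always_allowed: set[tuple[str, ...]] = set()

    @staticmethod
    def _key(command: str) -> tuple[str, ...]:
        # Tokenize shell-style so "uv  sync" and "uv sync" match,
        # but "uv run script.py" stays a distinct key.
        return tuple(shlex.split(command))

    def always_allow(self, command: str) -> None:
        self._always_allowed.add(self._key(command))

    def is_allowed(self, command: str) -> bool:
        return self._key(command) in self._always_allowed

perms = SessionPermissions()
perms.always_allow("uv sync")
print(perms.is_allowed("uv sync"))           # True
print(perms.is_allowed("uv run script.py"))  # False
```

Keying on the full tokenized command (instead of, say, the first word) is what stops an approval for one `uv` subcommand from silently covering all of them.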

Here’s how the end result looks in WebStorm:

How it went

While I haven’t run formal evals, I was pleasantly surprised by how well my agent performed after only a few iterations. I didn’t actually expect to switch away from Claude Code, and it was a great dogfooding exercise as well, since our OSS team was able to upstream some of my feedback back into Deep Agents itself.

My original goal of regaining code-level, rather than config-level, control over my daily workflows has also been great. When Anthropic had an outage a few weeks ago, I was able to switch over to OpenAI’s gpt-5.4 without skipping a beat, and I even found that it had some interesting quirks. I switch back and forth between models mid-session to gain different perspectives from each model when working on tricky tasks, and have also found open-source models like GLM-5 are quite capable while offering significant cost savings.

Another boon is observability via LangSmith tracing, which allows me to debug and improve my agent when I run into issues. Being able to see exactly what context was passed to the model, which tools it called, and where it went sideways helped me understand behaviors that were previously hidden inside the harness. Here’s an example of what such a trace looks like:

For example, when I noticed that my agent was starting to take wide, slow sweeps of my filesystem, I used a trace to find a bug in my system prompt that told the agent the project was at the filesystem root rather than the current working directory.

Taking back your dev workflows for fun and profit

What started as a small late-night project I worked on around taking care of a newborn daughter turned into a huge success, both for my own understanding of agent behavior and for improving my daily workflow.

It proved to me that Claude Code isn’t magic but a bundle of very clever tricks rolled up into a neat package. The harness layer is just software, and software is something any developer can shape to fit how they want to work.

If you’re curious, I’d highly recommend trying an experiment like this yourself. Even a small prototype can teach you a lot about how these systems think and where they break. Clone the repo and follow the setup guide here to get started from source code. I’d love to know what you think. You can reach out to me on X @Hacubu to let me know!

Special thanks to @veryboldbagel and @masondxry for helping productionize the adapter and dealing with my unending questions and feedback!

Junie CLI Now Connects to Your JetBrains IDE

Until now, Junie CLI has worked like any other standalone agent. It was powerful, but disconnected from the workflows you set up for your specific projects. That changes today.

Junie CLI can now connect to your running JetBrains IDE and use its full code intelligence, including the indexing, semantic analysis, and tooling you already rely on. The agent works with your IDE the same way you do. It sees what you see, knows what you’ve been working on, and uses the same build and test configurations you’ve already set up.

No manual setup is required – Junie CLI detects your running IDE automatically. If you have a JetBrains AI subscription, everything works out of the box.


What Junie can do with your IDE

Most AI coding agents operate in isolation. They read your files, guess at your project structure, and attempt to run builds or tests without full context. This can work for simple projects, but it falls apart in real-world codebases, such as monorepos with complex build configurations, projects with hundreds of modules, or test setups that took your team weeks to get right.

Junie doesn’t guess. It asks your IDE, which gives it the power to:

Understand your context

Junie sees what you’re working on right now – which file is open, what code you’ve selected, and which builds and tests you’ve run recently. Instead of scanning your entire repository to understand what’s relevant, it starts with the same context you have.

Run tests without guessing

On a monorepo or any project with a non-trivial test setup, Junie uses the IDE’s pre-configured test runners – no guessing at commands and no broken configurations.

Refactor with precision

When Junie renames a symbol, it uses the IDE’s semantic index to find every usage – searching across files, respecting scope, and handling overloads and variables with the same name that appear in different contexts. This is the kind of refactoring that text-based search gets wrong.

Build and debug complex projects

Junie runs builds and tests using your existing IDE configurations.

Custom build commands, non-obvious test runners, cross-compilation targets – if your IDE understands them, Junie does too.

Use semantic code navigation

From the IDE’s index, Junie accesses the project structure without reading files line by line. Its synonym-aware search finds “variants” when you search for “options”. It navigates code the way you would, not the way grep does.

Installation

Junie CLI’s IDE integration works in all JetBrains IDEs. Support for Android Studio is coming soon.

Make sure your JetBrains IDE is running, then launch Junie CLI in your project directory. It will automatically detect the IDE and prompt you to install the integration plugin. One click, and you’re connected.

If you’re a JetBrains AI subscriber, authentication is automatic. Bring Your Own Key (for Anthropic, OpenAI, etc.) is also fully supported.


What’s next

This integration is currently in Beta. We’re actively expanding the capabilities Junie can access through your IDE, and your feedback will directly shape what comes next.

Try it out, and let us know what you think.

Identifying Necessary Transparency Moments In Agentic AI (Part 1)

Designing for autonomous agents presents a unique frustration. We hand a complex task to an AI, it vanishes for 30 seconds (or 30 minutes), and then it returns with a result. We stare at the screen. Did it work? Did it hallucinate? Did it check the compliance database or skip that step?

We typically respond to this anxiety with one of two extremes. We either keep the system a Black Box, hiding everything to maintain simplicity, or we panic and provide a Data Dump, streaming every log line and API call to the user.

Neither approach directly addresses the nuance needed to provide users with the ideal level of transparency.

The Black Box leaves users feeling powerless. The Data Dump creates notification blindness, destroying the efficiency the agent promised to provide. Users ignore the constant stream of information until something breaks, at which point they lack the context to fix it.

We need an organized way to find the balance. In my previous article, “Designing For Agentic AI”, we looked at interface elements that build trust, like showing the AI’s intended action beforehand (Intent Previews) and giving users control over how much the AI does on its own (Autonomy Dials). But knowing which elements to use is only part of the challenge. The harder question for designers is knowing when to use them.

How do you know which specific moment in a 30-second workflow requires an Intent Preview and which can be handled with a simple log entry?

This article provides a method to answer that question. We will walk through the Decision Node Audit. This process gets designers and engineers in the same room to map backend logic to the user interface. You will learn how to pinpoint the exact moments a user needs an update on what the AI is doing. We will also cover an Impact/Risk matrix that helps prioritize which decision nodes to display and which design pattern to pair with each one.

Transparency Moments: A Case Study Example

Consider Meridian (not its real name), an insurance company that uses an agentic AI to process initial accident claims. The user uploads photos of vehicle damage and the police report. The agent then disappears for a minute before returning with a risk assessment and a proposed payout range.

Initially, Meridian’s interface simply showed “Calculating Claim Status”. Users grew frustrated. They had submitted several detailed documents and felt uncertain about whether the AI had even reviewed the police report, which contained mitigating circumstances. The Black Box created distrust.

To fix this, the design team conducted a Decision Node Audit. They found that the AI performed three distinct, probability-based steps, with numerous smaller steps embedded:

  • Image Analysis
    The agent compared the damage photos against a database of typical car crash scenarios to estimate the repair cost. This involved a confidence score.
  • Textual Review
    It scanned the police report for keywords that affect liability (e.g., fault, weather conditions, sobriety). This involved a probability assessment of legal standing.
  • Policy Cross Reference
    It matched the claim details against the user’s specific policy terms, searching for exceptions or coverage limits. This also involved probabilistic matching.

The team turned these steps into transparency moments. The interface sequence was updated to:

  • Assessing Damage Photos: Comparing against 500 vehicle impact profiles.
  • Reviewing Police Report: Analyzing liability keywords and legal precedent.
  • Verifying Policy Coverage: Checking for specific exclusions in your plan.

The system still took the same amount of time, but the explicit communication about the agent’s internal workings restored user confidence. Users understood that the AI was performing the complex task it was designed for, and they knew exactly where to focus their attention if the final assessment seemed inaccurate. This design choice transformed a moment of anxiety into a moment of connection with the user.

Applying the Impact/Risk Matrix: What We Chose to Hide

Most AI experiences have no shortage of events and decision nodes that could potentially be displayed during processing. One of the most critical outcomes of the audit was to decide what to keep invisible. In the Meridian example, the backend logs generated 50+ events per claim. We could have defaulted to displaying each event in the UI as it was processed. Instead, we applied the risk matrix to prune them:

  • Log Event: Pinging Server West-2 for redundancy check.
    • Filter Verdict: Hide. (Low Stakes, High Technicality).
  • Log Event: Comparing repair estimate to BlueBook value.
    • Filter Verdict: Show. (High Stakes, impacts user’s payout).
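That filtering step can be sketched as a simple predicate over log events. The field names and two-axis scoring here are illustrative, not Meridian’s actual log schema:

```python
from dataclasses import dataclass

@dataclass
class LogEvent:
    message: str
    high_stakes: bool        # affects the user's outcome (e.g. the payout)?
    high_technicality: bool  # infrastructure detail meaningless to users?

def visible_events(events: list[LogEvent]) -> list[str]:
    """Keep only events worth surfacing in the UI."""
    return [e.message for e in events
            if e.high_stakes and not e.high_technicality]

backend_log = [
    LogEvent("Pinging server West-2 for redundancy check",
             high_stakes=False, high_technicality=True),   # hide
    LogEvent("Comparing repair estimate to BlueBook value",
             high_stakes=True, high_technicality=False),   # show
]
print(visible_events(backend_log))
# ['Comparing repair estimate to BlueBook value']
```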

By cutting out the unnecessary details, the important information — like the coverage verification — was more impactful. We created an open interface and designed an open experience.

This approach uses the idea that people feel better about a service when they can see the work being done. By showing the specific steps (Assessing, Reviewing, Verifying), we changed a 30-second wait from a time of worry (“Is it broken?”) to a time of feeling like something valuable is being created (“It’s thinking”).

Let’s now take a closer look at how we can review the decision-making process in our products to identify key moments that require clear information.

The Decision Node Audit

Transparency fails when we treat it as a style choice rather than a functional requirement. We have a tendency to ask, “What should the UI look like?” before we ask, “What is the agent actually deciding?”

The Decision Node Audit is a straightforward way to make AI systems easier to understand. It works by carefully mapping out the system’s internal process. The main goal is to find and clearly define the exact moments where the system stops following its set rules and instead makes a choice based on chance or estimation. By mapping this structure, creators can show these points of uncertainty directly to the people using the system. This changes system updates from being vague statements to specific, reliable reports about how the AI reached its conclusion.

In addition to the insurance case study above, I recently worked with a team building a procurement agent. The system reviewed vendor contracts and flagged risks. Originally, the screen displayed a simple progress bar: “Reviewing contracts.” Users hated it. Our research indicated they felt anxious about the legal implications of a missing clause.

We fixed this by conducting a Decision Node Audit. I’ve included a step-by-step checklist for conducting this audit at the conclusion of this article.

We ran a session with the engineers and outlined how the system works. We identified “Decision Points” — moments where the AI had to choose between two good options.

In standard computer programs, the process is clear: if A happens, then B will always happen. In AI systems, the process is often based on chance. The AI thinks A is probably the best choice, but it might only be 65% certain.

In the contract system, we found a moment when the AI checked the liability terms against our company rules. It was rarely a perfect match. The AI had to decide if a 90% match was good enough. This was a key decision point.

Once we identified this node, we exposed it to the user. Instead of “Reviewing contracts,” the interface updated to say: “Liability clause varies from standard template. Analyzing risk level.”

This specific update gave users confidence. They knew the agent checked the liability clause. They understood the reason for the delay and gained trust that the desired action was occurring on the back end. They also knew where to dig in deeper once the agent generated the contract.

To check how the AI makes decisions, you need to work closely with your engineers, product managers, business analysts, and key people who are making the choices (often hidden) that affect how the AI tool functions. Draw out the steps the tool takes. Mark every spot where the process changes direction because a probability is met. These are the places where you should focus on being more transparent.

As shown in Figure 2 below, the Decision Node Audit involves these steps:

  1. Get the team together: Bring in the product owners, business analysts, designers, key decision-makers, and the engineers who built the AI. For example,

    Think about a product team building an AI tool designed to review messy legal contracts. The team includes the UX designer, the product manager, the UX researcher, a practicing lawyer who acts as the subject-matter expert, and the backend engineer who wrote the text-analysis code.

  2. Draw the whole process: Document every step the AI takes, from the user’s first action to the final result.

    The team stands at a whiteboard and sketches the entire sequence for a key workflow that involves the AI searching for a liability clause in a complex contract. The lawyer uploads a fifty-page PDF → The system converts the document into readable text. → The AI scans the pages for liability clauses. → The user waits. → Moments or minutes later, the tool highlights the found paragraphs in yellow on the user interface. They do this for many other workflows that the tool accommodates as well.

  3. Find where things are unclear: Look at the process map for any spot where the AI compares options or inputs that don’t have one perfect match.

    The team looks at the whiteboard to spot the ambiguous steps. Converting an image to text follows strict rules. Finding a specific liability clause involves guesswork. Every firm writes these clauses differently, so the AI has to weigh multiple options and make a prediction instead of finding an exact word match.

  4. Identify the ‘best guess’ steps: For each unclear spot, check if the system uses a confidence score (for example, is it 85% sure?). These are the points where the AI makes a final choice.

    The system has to guess (give a probability) which paragraph(s) closely resemble a standard liability clause. It assigns a confidence score to its best guess. That guess is a decision node. The interface needs to tell the lawyer it is highlighting a potential match, rather than stating it found the definitive clause.

  5. Examine the choice: For each choice point, figure out the specific internal math or comparison being done (e.g., matching a part of a contract to a policy or comparing a picture of a broken car to a library of damaged car photos).

    The engineer explains that the system compares the various paragraphs against a database of standard liability clauses from past firm cases. It calculates a text similarity score to decide on a match based on probabilities.

  6. Write clear explanations: Create messages for the user that clearly describe the specific internal action happening when the AI makes a choice.

    The content designer writes a specific message for this exact moment. The text reads: Comparing document text to standard firm clauses to identify potential liability risks.

  7. Update the screen: Put these new, clear explanations into the user interface, replacing vague messages like “Reviewing contracts.”

    The design team removes the generic Processing PDF loading spinner. They insert the new explanation into a status bar located right above the document viewer while the AI thinks.

  8. Check for Trust: Make sure the new screen messages give users a simple reason for any wait time or result, which should make them feel more confident and trusting.

The Impact/Risk Matrix

Once you look closely at the AI’s process, you’ll likely find many points where it makes a choice. An AI might make dozens of small choices for a single complex task. Showing them all creates too much unnecessary information. You need to group these choices.

You can use an Impact/Risk Matrix to sort these choices based on the types of action(s) the AI is taking. Here are examples of impact/risk matrices:

First, look for low-stakes and low-impact decisions.

Low Stakes / Low Impact

  • Example: Organizing a file structure or renaming a document.
  • Transparency Need: Minimal. A subtle toast notification or a log entry suffices. Users can undo these actions easily.

Then identify the high-stakes and high-impact decisions.

High Stakes / High Impact

  • Example: Rejecting a loan application or executing a stock trade.
  • Transparency Need: High. These actions require Proof of Work. The system must demonstrate the rationale before or immediately as it acts.

Consider a financial trading bot that treats all buy/sell orders the same. It executes a $5 trade with the same opacity as a $50,000 trade. Users might question whether the tool even recognizes the higher stakes of the larger trade. They need the system to pause and show its work for the high-stakes trades. The solution is to introduce a Reviewing Logic state for any transaction exceeding a specific dollar amount, allowing the user to see the factors driving the decision before execution.

Mapping Nodes to Patterns: A Design Pattern Selection Rubric

Once you have identified your experience’s key decision nodes, you must decide which UI pattern applies to each one you’ll display. In Designing For Agentic AI, we introduced patterns like the Intent Preview (for high-stakes control) and the Action Audit (for retrospective safety). The decisive factor in choosing between them is reversibility.

We filter every decision node through the impact matrix in order to assign the correct pattern:

High Stakes & Irreversible: These nodes require an Intent Preview. Because the user cannot easily undo the action (e.g., permanently deleting a database), the transparency moment must happen before execution. The system must pause, explain its intent, and require confirmation.

High Stakes & Reversible: These nodes can rely on the Action Audit & Undo pattern. If the AI-powered sales agent moves a lead to a different pipeline, it can do so autonomously as long as it notifies the user and offers an immediate Undo button.

By strictly categorizing nodes this way, we avoid “alert fatigue.” We reserve the high-friction Intent Preview only for the truly irreversible moments, while relying on the Action Audit to maintain speed for everything else.

|  | Reversible | Irreversible |
| --- | --- | --- |
| **Low Impact** | Type: Auto-Execute<br>UI: Passive Toast / Log<br>Ex: Renaming a file | Type: Confirm<br>UI: Simple Undo option<br>Ex: Archiving an email |
| **High Impact** | Type: Review<br>UI: Notification + Review Trail<br>Ex: Sending a draft to a client | Type: Intent Preview<br>UI: Modal / Explicit Permission<br>Ex: Deleting a server |

Table 1: The impact and reversibility matrix can then be used to map your moments of transparency to design patterns.
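The rubric in Table 1 reduces to a small lookup. A minimal sketch, with the pattern labels taken from the table:

```python
def select_pattern(high_impact: bool, reversible: bool) -> str:
    """Map an (impact, reversibility) pair to a Table 1 pattern."""
    if high_impact and not reversible:
        return "Intent Preview (modal / explicit permission)"
    if high_impact and reversible:
        return "Review (notification + review trail)"
    if not high_impact and not reversible:
        return "Confirm (simple undo option)"
    return "Auto-Execute (passive toast / log)"

print(select_pattern(high_impact=True, reversible=False))
# Intent Preview (modal / explicit permission)
```

Encoding the rubric this explicitly, even as pseudocode in a design doc, keeps the team honest about which nodes genuinely earn the high-friction Intent Preview.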

Qualitative Validation: “The Wait, Why?” Test

You can identify potential nodes on a whiteboard, but you must validate them with human behavior. You need to verify whether your map matches the user’s mental model. I use a protocol called the “Wait, Why?” Test.

Ask a user to watch the agent complete a task. Instruct them to speak aloud. Whenever they ask a question like “Wait, why did it do that?”, “Is it stuck?”, or “Did it hear me?”, you mark a timestamp.

These questions signal user confusion. The user feels their control slipping away. For example, in a study for a healthcare scheduling assistant, users watched the agent book an appointment. The screen sat static for four seconds. Participants consistently asked, “Is it checking my calendar or the doctor’s?”

That question revealed a missing Transparency Moment. The system needed to split that four-second wait into two distinct steps: “Checking your availability” followed by “Syncing with provider schedule.”

This small change reduced users’ expressed levels of anxiety.

Transparency fails when it only describes a system action. The interface must connect the technical process to the user’s specific goal. A screen displaying “Checking your availability” falls flat because it lacks context. The user understands that the AI is looking at a calendar, but they do not know why.

We must pair the action with the outcome. The system needs to split that four-second wait into two distinct steps. First, the interface displays “Checking your calendar to find open times.” Then it updates to “Syncing with the provider’s schedule to secure your appointment.” This grounds the technical process in the user’s actual life.

Consider an AI managing inventory for a local cafe. The system encounters a supply shortage. An interface reading “contacting vendor” or “reviewing options” creates anxiety. The manager wonders if the system is canceling the order or buying an expensive alternative. A better approach is to explain the intended result: “Evaluating alternative suppliers to maintain your Friday delivery schedule.” This tells the user exactly what the AI is trying to achieve.

Operationalizing the Audit

You have completed the Decision Node Audit and filtered your list through the Impact and Risk Matrix. You now have a list of essential moments for being transparent. Next, you need to create them in the UI. This step requires teamwork across different departments. You can’t design transparency by yourself using a design tool. You need to understand how the system works behind the scenes.

Start with a Logic Review. Meet with your lead system designer. Bring your map of decision nodes. You need to confirm that the system can actually share these states. I often find that the technical system doesn’t reveal the exact state I want to show. The engineer might say the system just returns a general “working” status. You must push for a detailed update. You need the system to send a specific notice when it switches from reading text to checking rules. Without that technical connection, your design is impossible to build.

Next, involve the Content Design team. You have the technical reason for the AI’s action, but you need a clear, human-friendly explanation. Engineers provide the underlying process, but content designers provide the way it’s communicated. Do not write these messages alone. A developer might write “Executing function 402,” which is technically correct but meaningless to the user. A designer might write “Thinking,” which is friendly but too vague. A content strategist finds the right middle ground. They create specific phrases, such as “Scanning for liability risks”, that show the AI is working without confusing the user.

Finally, test the transparency of your messages. Don’t wait until the final product is built to see if the text works. I conduct comparison tests on simple prototypes where the only thing that changes is the status message. For example, I show one group (Group A) a message that says “Verifying identity” and another group (Group B) a message that says “Checking government databases” (made-up examples, but you get the point). Then I ask them which AI feels safer. You’ll often discover that certain words cause worry while others build trust. Treat the wording as something you must test and prove effective.

How This Changes the Design Process

Conducting these audits strengthens how a team works together. We stop handing off polished design files. We start using messy prototypes and shared spreadsheets. The core tool becomes a transparency matrix. Engineers and content designers edit this spreadsheet together, mapping the exact technical codes to the words the user will read.

Teams will experience friction during the logic review. Imagine a designer asking the engineer how the AI decides to decline a transaction submitted on an expense report. The engineer might say the backend only outputs a generic status code like “Error: Missing Data.” The designer points out that this isn’t actionable information on the screen. The designer negotiates with the engineer to create a specific technical hook, and the engineer writes a new rule so the system reports exactly what is missing, such as a missing receipt image.

Content designers act as translators during this phase. A developer might write a technically accurate string like “Calculating confidence threshold for vendor matching.” A content designer translates that string into a phrase that builds trust for a specific outcome. The strategist rewrites it as “Comparing local vendor prices to secure your Friday delivery.” The user understands the action and the result.

The entire cross-functional team sits in on user testing sessions. They watch a real person react to different status messages. Seeing a user panic because the screen says “Executing trade” forces the team to rethink their approach. The engineers and designers align on better wording. They change the text to “Verifying sufficient funds” before buying stock. Testing together ensures the final interface serves both the system logic and the user’s peace of mind.

It does require time to incorporate these additional activities into the team’s calendar. However, the end result should be a team that communicates more openly, and users who have a better understanding of what their AI-powered tools are doing on their behalf (and why). This integrated approach is a cornerstone of designing truly trustworthy AI experiences.

Trust Is A Design Choice

We often view trust as an emotional byproduct of a good user experience. It is more useful to view it as the mechanical result of predictable communication.

We build trust by showing the right information at the right time. We destroy it by overwhelming the user or hiding the machinery completely.

Start with the Decision Node Audit, particularly for agentic AI tools and products. Find the moments where the system makes a judgment call. Map those moments to the Risk Matrix. If the stakes are high, open the box. Show the work.

In the next article, we will look at how to design these moments: how to write the copy, structure the UI, and handle the inevitable errors when the agent gets it wrong.

Appendix: The Decision Node Audit Checklist

Phase 1: Setup and Mapping

✅ Get the team together: Bring in the product owners, business analysts, designers, key decision-makers, and the engineers who built the AI.

Hint: You need the engineers to explain the actual backend logic. Do not attempt this step alone.

✅ Draw the whole process: Document every step the AI takes, from the user’s first action to the final result.

Hint: A physical whiteboard session often works best for drawing out these initial steps.

Phase 2: Locating the Hidden Logic

✅ Find where things are unclear: Look at the process map for any spot where the AI compares options or inputs that do not have one perfect match.

✅ Identify the best guess steps: For each unclear spot, check if the system uses a confidence score. For example, ask if the system is 85 percent sure. These are the points where the AI makes a final choice.

✅ Examine the choice: For each choice point, figure out the specific internal math or comparison being done. An example is matching a part of a contract to a policy. Another example involves comparing a picture of a broken car to a library of damaged car photos.

Phase 3: Creating the User Experience

✅ Write clear explanations: Create messages for the user that clearly describe the specific internal action happening when the AI makes a choice.

Hint: Ground your messages in concrete reality. If an AI books a meeting with a client at a local cafe, tell the user the system is checking the cafe reservation system.

✅ Update the screen: Put these new, clear explanations into the user interface. Replace vague messages like “Reviewing contracts” with your specific explanations.

✅ Check for Trust: Make sure the new screen messages give users a simple reason for any wait time or result. This should make them feel confident and trusting.

Hint: Test these messages with actual users to verify they understand the specific outcome being achieved.

ChromeFlash

I built a Chrome extension to track where Chrome’s RAM actually goes.
Chrome uses a lot of memory. We all know this. But when I actually tried to figure out which tabs were eating my RAM, I realized Chrome doesn’t make it easy.

Task Manager gives you raw process IDs. chrome://memory-internals is a wall of text. Neither tells you “your 12 active tabs are using ~960 MB and your 2 YouTube tabs are using ~300 MB.”

So I built ChromeFlash — a Manifest V3 extension that estimates Chrome’s memory by category and gives you tools to reclaim it.

What it looks like

The popup shows a breakdown of Chrome’s estimated RAM:

  • Browser Core — ~250 MB for Chrome’s internal processes
  • Active Tabs — ~80 MB each
  • Pinned Tabs — ~50 MB each (lighter footprint)
  • Media Tabs — ~150 MB each (audio/video)
  • Suspended Tabs — ~1 MB each
  • Extensions — estimated overhead

A stacked color bar visualizes the proportions at a glance.
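The per-category averages above can be combined into a simple estimator. This is a sketch of the kind of calculation ChromeFlash performs; the constants come from the list above, and estimateChromeMemory is a hypothetical name, not the extension’s actual code:

```javascript
// Rough per-category averages in MB, taken from the list above (illustrative).
const MB = {
  browserCore: 250,
  activeTab: 80,
  pinnedTab: 50,
  mediaTab: 150,
  suspendedTab: 1,
};

// Estimate total Chrome RAM from tab counts per category.
// Hypothetical helper, not the extension's exact implementation.
function estimateChromeMemory({ active = 0, pinned = 0, media = 0, suspended = 0 } = {}) {
  return (
    MB.browserCore +
    active * MB.activeTab +
    pinned * MB.pinnedTab +
    media * MB.mediaTab +
    suspended * MB.suspendedTab
  );
}
```

With 12 active tabs and 2 media tabs this yields 250 + 960 + 300 = 1510 MB, matching the “12 active tabs ≈ 960 MB, 2 YouTube tabs ≈ 300 MB” figures quoted earlier.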

The honest caveat

Chrome’s extension APIs in Manifest V3 don’t expose per-tab memory. The chrome.processes API exists but is limited to dev channel. So these are estimates based on real-world averages — not exact measurements.

If you know a better approach, I’d genuinely love to hear it.

Tab suspension

The biggest win. Calling chrome.tabs.discard() on an inactive tab drops it from ~80 MB to ~1 MB. The tab stays in your tab bar, and when you click it, Chrome reloads it.

ChromeFlash lets you:

  • Suspend inactive tabs manually or on a timer (1–120 min)
  • Protect pinned tabs and tabs playing audio
  • Detect and close duplicate tabs

The auto-suspend runs via chrome.alarms, since MV3 service workers are terminated when idle, so setInterval timers don’t survive between events.

// The core of tab suspension
chrome.alarms.create('tab-audit', { periodInMinutes: 5 });

// Only discard tabs that are safe to unload: not focused, not pinned,
// not playing audio, and not already discarded.
function shouldDiscard(tab) {
  return !tab.active && !tab.pinned && !tab.audible && !tab.discarded;
}

chrome.alarms.onAlarm.addListener(async (alarm) => {
  if (alarm.name === 'tab-audit') {
    const tabs = await chrome.tabs.query({});
    for (const tab of tabs) {
      if (shouldDiscard(tab)) {
        await chrome.tabs.discard(tab.id);
      }
    }
  }
});
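The duplicate-tab detection mentioned above reduces to a pure helper over the chrome.tabs.query result. A sketch (findDuplicateTabs is a hypothetical name; ChromeFlash’s actual logic may differ):

```javascript
// Given an array of tabs ({ id, url }), return the ids of every tab whose
// URL has already been seen: candidates for closing via chrome.tabs.remove.
function findDuplicateTabs(tabs) {
  const seen = new Set();
  const duplicates = [];
  for (const tab of tabs) {
    if (seen.has(tab.url)) {
      duplicates.push(tab.id);
    } else {
      seen.add(tab.url);
    }
  }
  return duplicates;
}

// In the extension, it would be wired up roughly like:
// const tabs = await chrome.tabs.query({});
// await chrome.tabs.remove(findDuplicateTabs(tabs));
```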

Optimization profiles

Four presets that configure tab suspension + Chrome settings in one click:

Profile         Suspend   Services Off                                   Est. RAM Saved
Gaming          1 min     5 (DNS, spell, translate, autofill, search)    ~500–2000 MB
Productivity    15 min    0                                              ~200–600 MB
Battery Saver   5 min     4 (DNS, spell, translate, search)              ~400–1500 MB
Privacy         30 min    7 (+ Topics, FLEDGE, Do Not Track ON)          ~150–400 MB

Each profile shows the exact numbers and what changes — no vague “optimizes your browser” marketing.
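One plausible way to represent these presets is as plain data that the rest of the extension interprets. A hypothetical sketch, not ChromeFlash’s actual structure (the setting keys are illustrative):

```javascript
// Each profile is just data: a suspend timer plus the chrome.privacy
// services to switch off. Counts match the table above.
const PROFILES = {
  gaming:       { suspendAfterMin: 1,  disable: ['dns', 'spell', 'translate', 'autofill', 'search'] },
  productivity: { suspendAfterMin: 15, disable: [] },
  batterySaver: { suspendAfterMin: 5,  disable: ['dns', 'spell', 'translate', 'search'] },
  privacy:      {
    suspendAfterMin: 30,
    disable: ['dns', 'spell', 'translate', 'autofill', 'search', 'topics', 'fledge'],
    doNotTrack: true, // the one setting this profile turns ON
  },
};
```

Keeping the presets as data means applying one is just a loop over its `disable` list plus resetting the suspend alarm, rather than four separate code paths.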

Hidden settings via chrome.privacy

Chrome exposes several settings through the chrome.privacy API that most users never touch:

// Toggle DNS prefetching
chrome.privacy.network.networkPredictionEnabled.set({ value: false });

// Disable cloud spell check
chrome.privacy.services.spellingServiceEnabled.set({ value: false });

// Disable Topics API (ad tracking)
chrome.privacy.websites.topicsEnabled.set({ value: false });

// Disable FLEDGE / Protected Audiences
chrome.privacy.websites.fledgeEnabled.set({ value: false });

The extension exposes 8 of these as toggle switches. Disabling background services like spell check and translation reduces both network calls and CPU usage — not dramatically, but it adds up.
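Every entry in chrome.privacy implements the same ChromeSetting interface (get/set), so a single generic helper could drive all eight toggles. A sketch, with the setting passed in as a parameter so it is easy to test outside Chrome:

```javascript
// Flip any chrome.types.ChromeSetting, e.g.
// chrome.privacy.services.spellingServiceEnabled.
// Returns the new value. MV3 supports promises on get/set.
async function toggleSetting(setting) {
  const { value } = await setting.get({});
  await setting.set({ value: !value });
  return !value;
}

// Usage in the extension (illustrative):
// await toggleSetting(chrome.privacy.websites.topicsEnabled);
```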

Chrome Flags guide

Chrome flags (chrome://flags) can meaningfully improve performance, but extensions can’t modify them programmatically — Chrome blocks this for security reasons.

So ChromeFlash includes a curated database of 21 performance-relevant flags:

  • Rendering — GPU Rasterization, Zero-Copy, Skia Graphite
  • Network — QUIC Protocol, WebSocket over HTTP/2
  • Memory — Automatic Tab Discarding, High Efficiency Mode
  • JavaScript — V8 Sparkplug, V8 Maglev compilers
  • Loading — Back/Forward Cache, Parallel Downloads

Each flag shows a risk level, impact rating, and a button that opens it directly in chrome://flags/#flag-name.

Architecture

ChromeFlash/
  manifest.json
  src/
    background/service-worker.js    # Alarms, tab audit, memory pressure
    modules/
      tab-manager.js                # Suspend, discard, duplicates
      memory-optimizer.js           # Chrome RAM breakdown
      network-optimizer.js          # DNS prefetch toggle
      privacy-optimizer.js          # 8 privacy setting toggles
      performance-monitor.js        # CPU/memory stats, score
      profiles.js                   # 4 profiles with detailed stats
      flags-database.js             # 21 curated flags
      settings.js / storage.js      # Persistence
    popup/                          # Main UI
    pages/                          # Dashboard + Flags Guide

No build step. No framework. No bundler. ES modules loaded natively by Chrome. The entire extension is under 50 KB.

What I learned

MV3 service workers are stateless. Every alarm fires into a fresh context. You can’t store state in module-level variables — it has to go in chrome.storage. This tripped me up early.
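A minimal sketch of the pattern, assuming a simple counter that must survive service worker restarts. The in-memory fallback exists only so the snippet runs outside Chrome, and incrementAuditCount is a hypothetical helper:

```javascript
// Use chrome.storage.local when available; fall back to an in-memory Map
// so this sketch also runs outside Chrome (for illustration only).
const memoryStore = new Map();
const store = globalThis.chrome?.storage?.local ?? {
  get: async (key) => ({ [key]: memoryStore.get(key) }),
  set: async (obj) => {
    for (const [k, v] of Object.entries(obj)) memoryStore.set(k, v);
  },
};

// BAD in MV3: a module-level `let auditCount = 0` resets whenever the
// service worker is torn down between alarms.

// GOOD: read-modify-write through storage, which survives restarts.
async function incrementAuditCount() {
  const { auditCount = 0 } = await store.get('auditCount');
  await store.set({ auditCount: auditCount + 1 });
  return auditCount + 1;
}
```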

chrome.tabs.discard() is underrated. It’s the single highest-impact thing an extension can do for memory. 85–92% reduction per tab with zero user friction — the tab just reloads when you click it.

chrome.privacy is powerful but underdiscovered. Most developers don’t know you can programmatically toggle DNS prefetching, Topics API, or FLEDGE from an extension. The API surface is small but useful.

Flags can’t be automated. I spent time looking for workarounds before accepting that chrome://flags is intentionally walled off. The guide approach works well enough.

Try it

ChromeFlash is free, open source, and collects zero data. No analytics, no remote servers, no tracking. Everything stays in chrome.storage.local.