Teaching an AI Agent to Debug Flaky Tests

If you’ve been connected to the internet for a while, you’ve surely heard of AI Agent Skills. They teach your agent to do this and that. You might have even used or written a couple of them yourself.

If you aren’t yet familiar with them, the idea is simple: Instead of prompting instructions for a specific task each time, you define them once and reuse them later. A Skill is an AI equivalent of a knowledge base article: a plain text document that lives in a discoverable location and describes steps, a set of conventions, or domain-specific knowledge.

Most Skills you see in the wild are for simple things like enforcing code style or commit message conventions. But they can be much more powerful than that. In this article, we’ll combine AI Skills, good old developer tools, and a bit of creative thinking to address a notoriously challenging task: making AI deterministically find the root cause of flaky tests.

The problem

Quoting the TeamCity CI/CD guide:

Flaky tests are defined as tests that return both passes and failures despite no changes to the code or the test itself.

Flakiness undermines the whole point of tests: When a test fails, you can’t tell whether something is actually broken. You can’t fully rely on the test results, and at the same time, you can’t ignore them. This wastes both human and infrastructure resources.

And as if the underlying bugs weren’t difficult enough on their own, flaky tests often have this property of failing once in several thousand runs, making them extremely hard to reproduce and debug.

Example project

For the example project, let’s take the webshop demo from this article: Your Programs Are Not Single-Threaded. It is a Spring Boot project, in which one of the services has a TOCTOU (time-of-check to time-of-use) problem: It checks a condition and then acts on it, but another thread can change the state in between. In this particular case, it may sometimes cause duplicate invoice numbers and also makes the corresponding test flaky.

Here’s the problematic test:

@SpringBootTest
class InvoiceServiceTest {

    @Autowired
    private OrderService orderService;

    @Test
    void firstTwoOrdersGetInvoiceNumbersOneAndTwo() {
        CompletableFuture<Invoice> alice = CompletableFuture.supplyAsync(
                () -> orderService.checkout("Alice", BigDecimal.TEN));
        CompletableFuture<Invoice> bob = CompletableFuture.supplyAsync(
                () -> orderService.checkout("Bob", BigDecimal.TEN));

        String num1 = alice.join().getInvoiceNumber();
        String num2 = bob.join().getInvoiceNumber();

        assertEquals(Set.of("INV-00001", "INV-00002"), Set.of(num1, num2));
    }
}

The test creates two orders concurrently and checks that the resulting invoices get numbers INV-00001 and INV-00002. Because of a bug in InvoiceService, it can either pass or fail randomly.

Note: If you’re using IntelliJ IDEA, you can test whether a test is actually flaky by using the Run until failure option in the test runner. Leave the suspect spinning for some time and see if it eventually fails.


If we knew nothing about the underlying bug, and only had the test, is there a tool that could help us find the root cause? Or can we make one ourselves? Furthermore, could we delegate both building and using the tool to AI?

The intuition

Let’s come up with some intuition for this class of problem.

To produce two kinds of results, the execution must follow different code paths. The difference might be minimal, possibly just one extra method call or one if branch taken instead of another. But it has to be there; otherwise, the result would be consistent. So, if we could record the code path for a passing run and a failing run and then compare them, the diff should at least point us in the right direction. And ideally, by following the call tree, we could find the place where execution splits. That split point must be exactly where the flakiness originates.

Does this reasoning make sense? Let’s put it to the test.

Build the tools

What tool can we use for recording code paths? While not designed specifically for tracing, a test coverage tool can give us the information we’re after.

There are a couple of Java coverage tools to choose from, such as JaCoCo and IntelliJ IDEA’s coverage tool. We’ll go with IntelliJ IDEA’s, because it includes a hit counting feature: we may need this extra granularity, since the flakiness might stem not only from what is executed, but also from how many times.

Run coverage from the command line

IntelliJ IDEA’s coverage tool has a familiar UI, but we need a way to launch it programmatically. Fortunately, coverage can also be collected from the command line by attaching the coverage agent to the JVM via Maven Surefire:

mvn surefire:test \
  -Dtest=com.example.webshop.service.InvoiceServiceTest \
  "-DargLine=-Didea.coverage.calculate.hits=true \
    -javaagent:$AGENT_JAR=$IC_FILE,true,false,false,true,com.example.webshop.*"

The -Didea.coverage.calculate.hits=true flag tells the agent to record invocation counts per line rather than just a boolean hit/not-hit mask. After the test finishes, the results are written to a binary .ic file.

So far so good, but we need the report in a human (and AI)-readable format.

Add text output

Luckily, the IntelliJ coverage agent is open-source. Let’s clone the project and ask AI to add a text reporter that converts binary reports to plain text.

The agent creates a new class called TextCoverageStatistics. After we build the project and run the reporter against our .ic file, we get something like this:

=== Coverage Summary ===

  Instructions: 236/618  38,2%
  Branches    : 0/20   0,0%
  Lines       : 56/150  37,3%
  ...

=== Per-Class Coverage ===

Class                                                           Lines    Line%  Methods    Meth%
--------------------------------------------------------------------------------------------
...
com.example.webshop.service.InvoiceNumberGenerator              4/4    100,0%    2/2    100,0%
com.example.webshop.service.InvoiceService                     10/10   100,0%    3/3    100,0%
com.example.webshop.service.OrderService                        6/6    100,0%    2/2    100,0%
...

The first part of the report gives a high-level overview: How many lines, branches, and methods were covered across the entire project. Below that, there’s a per-class breakdown showing the same metrics for each class individually.

That breakdown is followed by per-line hit counts for each class:

--- com.example.webshop.service.InvoiceService ---
  Line       Hits  Branch
  19            2
  20            1
  22            2
  23            2
  24            2
  ...

For every line that the coverage agent instrumented, we see how many times it was executed and whether any branches were taken. The actual report is longer, but you get the idea. Now we have a text representation of which lines were executed, and exactly how many times.

This is the raw material we need for the diff.

Diff the reports

Presumably, the reports contain the necessary information, and a very determined developer could pore over them and find the bug. But we’re not here for mundane tasks like that, right?

Let’s upgrade the tool so that it collects multiple report variations and presents the diff. The most controllable way would be to do one “brick” at a time, but I think we’re safe to delegate the entire thing to AI here, including the automation.

The resulting script runs the test in a loop until both of the following happen:

  • We get at least one passing and one failing run.
  • The specified number of runs have passed.

Both conditions are important: test failures can be very rare, so the specified number of runs might not be enough to catch one. At the same time, there can be finer-grained variations within passing and failing runs, so we want to catch those too.
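As a rough illustration, the collection loop can be as simple as the following bash sketch. The run-test-with-coverage.sh helper is a hypothetical stand-in for the mvn invocation shown earlier, and the real script generated in the session may differ:

#!/usr/bin/env bash
# Hypothetical sketch: run the test repeatedly, keeping one coverage report per run,
# until we have at least MIN_RUNS runs and at least one pass and one fail.
MIN_RUNS=20
pass=0; fail=0; run=0
mkdir -p reports

while (( run < MIN_RUNS || pass == 0 || fail == 0 )); do
  run=$((run + 1))
  if ./run-test-with-coverage.sh "reports/run-$run.ic"; then
    pass=$((pass + 1)); mv "reports/run-$run.ic" "reports/pass-$run.ic"
  else
    fail=$((fail + 1)); mv "reports/run-$run.ic" "reports/fail-$run.ic"
  fi
done

echo "Collected $run runs: $pass pass, $fail fail"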

After the reports are collected, the script summarizes the lines that have variations between the runs. Here’s what it looks like:

Collected 20 runs: 12 pass, 8 fail

Lines that vary across runs:

  Invoice:29                           Hits(1,2)
  Invoice:31                           Hits(1,2)
  Invoice:32                           Hits(1,2)
  InvoiceNumberGenerator:15            Hits(1,2)
  InvoiceService:19                    Hits(1,2)  Branch(1/2)
  InvoiceService:20                    Hits(1,2)
  InvoiceService:22                    Hits(1,2)
  InvoiceService:24                    Hits(1,2)

All variations have the same pattern: the difference is not which lines were executed, but how many times. As we expected, the hit counting feature of IntelliJ IDEA’s coverage agent proved useful!

The varying lines point at a lazy initialization block in InvoiceService and its downstream effects in InvoiceNumberGenerator and Invoice. The variation in hit counts means that the initialization sometimes runs more than once, which shouldn’t happen. That’s exactly where the flakiness comes from.

If you missed the article that describes the problem, here’s why double initialization causes this bug. The createGenerator() method queries the database for the last used invoice number and creates a counter starting from that value. When two threads both enter the if (generator == null) block before either finishes, each reads the same number from the database and creates its own generator starting from the same value. The result is duplicate invoice numbers.
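Schematically, the racy path looks something like this. This is a minimal, self-contained sketch with hypothetical names, not the demo project’s actual code:

class InvoiceNumberGenerator {
    private long value;
    InvoiceNumberGenerator(long start) { value = start; }
    long next() { return value++; }
}

class InvoiceService {
    private InvoiceNumberGenerator generator;  // shared, unsynchronized state

    long checkout() {
        if (generator == null) {                     // check
            generator = new InvoiceNumberGenerator(  // act: a second thread can pass the
                    lastNumberFromDatabase() + 1);   // null check before this assignment
        }
        return generator.next();  // two generators starting from the same value => duplicates
    }

    private long lastNumberFromDatabase() { return 0; }  // stands in for the DB query
}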

The coverage diff has pointed us at the very same TOCTOU race discussed in more detail in the previous article. What’s novel in our current approach is that it doesn’t rely solely on human expertise and is easily accessible to AI.

Turning it into a Skill

Now, I’d say that AI-assisted modifications to open-source tools that help you solve the task at hand, all within minutes, are amazing on their own. But let’s keep our eyes on the bigger picture.

Here’s what we’ve done so far. We started with an intuition: flaky tests take different code paths, and coverage analysis can reveal where they diverge. Then we turned that intuition into a concrete, repeatable procedure. Does this warrant a knowledge base article, or perhaps an AI Agent Skill? Yes!

In the same agent session, let’s ask the agent to:

  1. Make sure all the scripts are self-contained and runnable.
  2. Document the entire procedure in a SKILL.md file, step by step, so that another agent can follow it without any prior context.

The agent packages everything and writes a guide that describes when to apply the Skill, what tools are needed, and what steps to follow.

The only follow-up during review was to align the Skill with the specification: the Skill originally written by the agent lacked meta in its frontmatter. Agents can usually cope with Skills that omit minor details, but meta is important for discoverability. Without it, a Skill might not be picked up by an agent in the first place.
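For reference, the frontmatter is just a small block at the top of SKILL.md. Here’s a minimal sketch with illustrative values; the exact required fields depend on the Skill specification your agent uses:

---
name: flaky-test-diagnosis
description: >
  Diagnose flaky JVM tests by collecting per-line coverage with hit counts
  across many runs and diffing passing runs against failing ones. Use when
  a test passes and fails intermittently with no code changes.
---

# Flaky test diagnosis
...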

Testing the Skill

To verify that the Skill actually works, let’s start a fresh agent session. No warm-up, no hints. Instead, let’s deliberately phrase the request in a very general way, something like “find and fix the cause of flakiness in InvoiceServiceTest”.

An agent uses the skill

The agent matches the Skill description from SKILL.md with the problem description, discovers the instructions, and executes them: It runs the coverage script, reads the diff, and identifies the race condition. Instead of guesswork, it follows the established steps and arrives at the same conclusion every time. That’s about as deterministic as generative AI can get!

Summary

The changes that we’ve made to the coverage agent are already published in version 1.0.774, and the Skill is available here.

In this article, we started with an intuition about flaky tests, built custom tooling around an open-source coverage agent, used it to find a race condition, and packaged the entire procedure into a reusable AI Skill. You can use this Skill for finding flaky tests in your own projects, but I hope this post conveys the bigger idea.

AI Skills allow you to teach agents to solve virtually anything, as long as you can stack text interfaces together. Many hard programming problems can be broken down into simpler ones and solved using familiar tools. And with AI orchestrating all this, we can even make the process enjoyable. As was the case long before AI, curiosity is the only real prerequisite.

Have you been inspired to solve a tough problem in your own work? Would you like to share the Skills you wrote or find most useful? Let us know in the comments!

Happy debugging!

Kodee’s Kotlin Roundup: Golden Kodee Finalists, Kotlin 2.4.0-Beta2, and New Learning Resources

Hi everyone! April brought exciting community news with the announcement of the Golden Kodee finalists, along with Kotlin and tooling releases, multiplatform progress, and fresh backend resources. I also came across the new Kotlin Professional Certificate on LinkedIn Learning, which is a great way to build your skills. And if you want something more playful, I found a fun way to practice coroutines. Here are the stories that stood out to me most.

Where you can learn more

  • Workshops – KotlinConf 2026, May 20–22, Munich
  • Introducing the Skill Manager and Skill Repository
  • Give AI Something Worth Amplifying: Three Priorities for Technical Leaders
  • Using ACP + Deep Agents to Demystify Modern Software Engineering
  • Prototype LLM calls on the JVM using Kotlin Notebook and LangChain4j in IntelliJ IDEA
  • Next-Level Observability with OpenTelemetry
  • Exposed now supports array types out of the box for PostgreSQL
  • Using Spring Data JDBC With Kotlin
  • Ktor 3.4.3 has been released
  • Dokka 2.2.0 is out

YouTube highlights

  • How KMP Helped a 378-Year-Old Company
  • How Uber Uses AI to Move from Java to Kotlin
  • Exposed 1.0 and Beyond | Talking Kotlin
  • Best KMP libraries with Klibs.io

How much energy, water, money, and infrastructure are we willing to spend to sustain it?

The conversation about AI shouldn’t stop at what the technology can do.
We should also ask ourselves:

How much energy, water, money, and infrastructure are we willing to spend to sustain it?

Today we’re seeing an accelerating race to build bigger models, more data centers, more GPUs, and more automation. Large companies are investing billions as if the economic return were guaranteed, yet there is still not enough clarity about the real cost per user, the net margin of many AI products, or the long-term energy impact.

And here a critical point appears:

Even if models are optimized, total consumption can keep growing.
If a technology becomes more efficient and cheaper to use, it usually gets used more. This is known as the rebound effect. The same can happen with AI: faster, cheaper models could lead to AI being integrated into everything: programming, marketing, support, advertising, CRM, ERP, education, healthcare, finance, agriculture, logistics, and autonomous agents working in the background.

The problem is not that AI exists.

The problem is AI without limits, without clear measurement, and without environmental, economic, and social accountability.

AI can be a great tool for improving productivity, science, education, healthcare, and access to knowledge. But if it is used mainly to replace people, produce junk content, automate endless advertising, and increase consumption without measuring the impact, it stops being progress and starts to look like resource extraction.
We are not ready to hand AI systems complete control of our resources.

We are not ready for an AI that grows faster than the electrical grid, regulation, and the planet’s environmental capacity.
That’s why I believe we need a more serious conversation about:

  • Mandatory measurement of energy, water, and emissions per data center.
  • Large models only when they are truly necessary.
  • Small, specialized models for specific tasks.
  • Clear reporting of the real costs and revenues of AI products.
  • Regional limits where the power grid or water supply can’t keep up.
  • External audits of environmental and economic impact.
  • Regulation before approving giant new infrastructure build-outs.

The question shouldn’t just be:

“What can AI do?”

The more important question should be:

“What cost are we willing to pay as a society to use it without limits?”

AI can be part of the future, yes.

But it shouldn’t become an excuse to consume energy, water, talent, money, and infrastructure without control.

AI, yes, but with limits.
Useful AI, not runaway AI.
Supervised AI, not AI that owns our resources.

Feature Flags That Actually Ship: Lessons From the Trenches

It was 2:47 AM when the alerts started. A seemingly straightforward database migration had triggered a cascading failure across three downstream services, and our payment processing pipeline was dropping roughly 12% of transactions. The on-call engineer didn’t need to wake anyone, locate a rollback script, or wait for a CI pipeline to churn through another deploy. She opened the LaunchDarkly dashboard, toggled one kill switch, and the system reverted to the stable path within seconds. The migration was still there, still deployed — just no longer live.

That moment crystallized something I’d been learning across two and a half decades of building software: separating deployment from release isn’t a nice-to-have. It’s the difference between a system you trust and one you fear touching on a Friday afternoon.

This article captures what I’ve learned using feature flags in production — the patterns that held up under pressure, the mistakes I’ve watched teams repeat (and made myself), and the practical steps you can take whether you’re evaluating LaunchDarkly or already deep into your feature flag journey. I’m publishing this here first because the developer community gives the most honest feedback, and I’d rather refine these ideas with you before they land on LeadDev and DZone.

The Patterns That Actually Matter

When you first start with feature flags, everything looks like a toggle. The key consideration here is understanding that not all flags serve the same purpose, and conflating them creates the very fragility you’re trying to avoid.

Release Flags

These gate unfinished features. They’re temporary by design — the flag exists while the feature stabilizes, then gets removed. The mistake I see most often is teams treating release flags as permanent configuration knobs. When a flag has been at 100% for three months, nobody remembers which code path is the “real” one, and your test matrix silently doubles.

In practice, this means setting a removal date the moment you create the flag. Our team attaches an expiration tag to every release flag and runs a weekly script that surfaces anything past its removal window. We borrowed from the FlagShark playbook here: flags older than 90 days that aren’t operational kill switches get an automatic ticket filed.

Centralize your flag keys in a single file: it gives you a one-glance inventory and prevents the typo-driven debugging sessions that scattered string literals create:

// code/src/flags.js — single source of truth for all flag keys
// See companion project: code/src/flags.js

const FLAGS = {
  // Kill switch: wraps the payment provider integration.
  // Defaults to FALSE (safe path) if SDK is unreachable.
  PAYMENT_PROVIDER_KILL_SWITCH: "ops_payments_new_provider",

  // Release flag: gates the new checkout UI.
  // Temporary — remove after 100% rollout + 14 days stable.
  NEW_CHECKOUT_UI: "release_checkout_redesigned_ui",

  // Experiment flag: percentage rollout of recommendation engine.
  RECOMMENDATION_ENGINE: "experiment_recommendations_v2",

  // Permission flag: enterprise-only feature.
  ENTERPRISE_ANALYTICS: "permission_enterprise_analytics",
};

The naming convention follows a pattern: {type}_{team/domain}_{feature}_{detail}. This tells you at a glance what a flag does, who owns it, and when it should be removed. Release flags should be short-lived. Ops flags (kill switches) should be reviewed annually. Experiment flags expire when the experiment ends.

Here’s the LaunchDarkly client initialization — a singleton that streams flag rules and caches them locally so evaluations work even during network interruptions:

// code/src/launchdarkly.js — LD client singleton
// See companion project: code/src/launchdarkly.js

const LaunchDarkly = require("@launchdarkly/node-server-sdk");

async function initLaunchDarkly(sdkKey) {
  const ldClient = LaunchDarkly.init(sdkKey);

  try {
    await ldClient.waitForInitialization({ timeout: 5 });
    console.log("[LaunchDarkly] Client initialized successfully");
  } catch (err) {
    console.warn(
      "[LaunchDarkly] Initialization timed out — operating from cache or defaults"
    );
  }

  return ldClient;
}

Kill Switches

A kill switch is a different animal entirely. It’s not about shipping features — it’s about operational safety. Every integration point with an external system, every experimental code path, every performance-sensitive refactor gets wrapped in one.

The pattern that saved us at 2:47 AM looked like this:

// code/src/server.js — Kill Switch pattern
// See companion project: code/src/server.js, GET /api/payment/status

app.get("/api/payment/status", async (req, res) => {
  const context = { kind: "user", key: req.query.user || req.ip };

  // Default: false = use safe fallback path.
  // If LaunchDarkly is unreachable, the SDK returns the default.
  const useNewProvider = await client.boolVariation(
    FLAGS.PAYMENT_PROVIDER_KILL_SWITCH,
    context,
    false   // <-- THE CRITICAL DEFAULT: safe path
  );

  if (useNewProvider) {
    return res.json({ provider: "new-payment-provider", status: "ok" });
  }

  // Safe fallback: the existing, battle-tested provider.
  res.json({ provider: "existing-payment-provider", status: "ok" });
});

The critical design requirement: the fallback path must be the one that works. If your kill switch guards a new payment provider integration, the fallback routes through the existing, battle-tested provider. If the flag evaluation itself fails due to a network issue, LaunchDarkly’s SDK returns the default value you specify — which should always trigger the safe path.

Percentage Rollouts

Deterministic hashing based on a stable user attribute means the same user sees the same experience across sessions. This matters more than you’d think — users notice inconsistency, and your metrics become meaningless if a single user bounces between variants.

Our rollout cadence settled into a rhythm: internal team for one day, 1% of external users for a day, then 5%, 25%, and full release if all guardrails stay green. At each stage, we watch application error rates, API latency, and business metrics. LaunchDarkly’s Guarded Releases can automate the pause-or-rollback decision if a threshold is breached, which removes the 3 AM judgment call from the equation.

// code/src/server.js — Percentage rollout with string variation
// See companion project: code/src/server.js, GET /api/recommendations

app.get("/api/recommendations", async (req, res) => {
  const context = { kind: "user", key: req.query.user || "anonymous" };

  // stringVariation for multi-variant experiments.
  // Deterministic hashing on user key ensures the same user
  // consistently sees the same variant.
  const variant = await client.stringVariation(
    FLAGS.RECOMMENDATION_ENGINE,
    context,
    "v1"   // default: existing recommendation engine
  );

  if (variant === "v2") {
    return res.json({
      engine: "collaborative-filtering-v2",
      recommendations: ["Item-A", "Item-B", "Item-C"],
    });
  }

  res.json({
    engine: "popularity-based-v1",
    recommendations: ["Item-X", "Item-Y", "Item-Z"],
  });
});

And here’s user targeting in action — enterprise features gated by a custom attribute:

// code/src/server.js — Targeting with custom attributes
// See companion project: code/src/server.js, GET /api/analytics/dashboard

app.get("/api/analytics/dashboard", async (req, res) => {
  const context = {
    kind: "user",
    key: req.query.user || "anonymous",
    plan: req.query.plan || "free",  // custom attribute for targeting rules
  };

  const canAccess = await client.boolVariation(
    FLAGS.ENTERPRISE_ANALYTICS,
    context,
    false
  );

  if (!canAccess) {
    return res.status(403).json({
      error: "Enterprise analytics require the Enterprise plan.",
    });
  }

  res.json({
    dashboard: "advanced-analytics",
    metrics: ["revenue-per-user", "churn-prediction", "cohort-retention"],
  });
});

All the code above comes from the companion project — a fully runnable Express app in code/src/server.js. Clone it, set your SDK key, and you’ll see every pattern respond to flag toggles in real time without a server restart.

The Questions Your Team Will Ask (And How to Answer Them)

When you introduce feature flags at scale, you’ll hear the same objections. I’ve had these conversations enough times to recognize the patterns.

“Doesn’t this just create more code to maintain?”

Yes, if you treat flags as permanent. The entire discipline of flag lifecycle management exists because flags without expiration dates become technical debt with a feature flag logo. The countermeasure is mechanical, not cultural: automation that flags stale toggles, creates cleanup tasks, and blocks new flags when the ratio of creation to removal tips past 2:1.

We enforce a simple rule: every flag has an owner, an expiration date, and a ticket filed at creation time for its eventual removal. When a release flag hits 100% rollout for two weeks, the cleanup PR gets auto-generated. This isn’t optional — it’s how you prevent the flag graveyard.
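To make the sweep concrete, here’s a hedged sketch of what the weekly check can look like. The expiration metadata and the ticket-filing step are hypothetical stand-ins for whatever flag-management API and ticketing integration you actually use:

// scripts/sweep-stale-flags.js: hypothetical stale-flag sweep
// Assumes each flag carries an expiration date at creation time, per the rule above.

const FLAG_METADATA = [
  // In practice, load this from your flag-management API or a metadata file.
  { key: "release_checkout_redesigned_ui", type: "release", expiresOn: "2025-06-01" },
  { key: "ops_payments_new_provider", type: "ops", expiresOn: null }, // kill switch: reviewed annually
];

function findStaleFlags(flags, now = new Date()) {
  // Operational kill switches are exempt; everything else past its date is stale.
  return flags.filter(
    (f) => f.type !== "ops" && f.expiresOn && new Date(f.expiresOn) < now
  );
}

for (const flag of findStaleFlags(FLAG_METADATA)) {
  // Stand-in for filing the cleanup ticket or opening the auto-generated PR.
  console.log(`[stale-flag] ${flag.key} expired ${flag.expiresOn}: file cleanup ticket`);
}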

“What if the flag service goes down?”

LaunchDarkly SDKs maintain a streaming connection and cache flag rules locally. If the connection drops, evaluations continue against the cached ruleset. The boolVariation call includes a default value parameter precisely for this scenario — and every code path I write defaults to the safe, existing behavior.

In the 2:47 AM scenario, the kill switch worked because the SDK had already cached the flag state. Even if LaunchDarkly’s service had been unavailable at that exact moment, the toggle would have still evaluated correctly against the local cache.

“Can’t we just build this ourselves?”

Technically, yes. I’ve seen teams build internal feature flag systems. I’ve also seen those same teams spend sprint after sprint maintaining edge-case evaluation logic, building dashboards, and debugging deterministic hashing when they could have been building their actual product. The key consideration here isn’t whether you can build it — it’s whether maintaining a feature flag platform is where your team’s time creates the most value.

Where We Go From Here

If you’re starting with feature flags, begin with one operational kill switch on a high-risk integration. Get comfortable with the pattern, build the muscle memory for flag cleanup, then expand to release flags and progressive rollouts. The most successful adoptions I’ve seen started small and grew organically, rather than attempting a company-wide flag-everything initiative overnight.

For deeper dives, the LaunchDarkly documentation on guarded rollouts and kill switch flags is excellent. The FlagShark best practices guide informed much of our internal naming and lifecycle discipline. And if you want to understand why stale flags genuinely keep me up at night, read about the $460M Knight Capital incident — a stark reminder that unreachable code paths aren’t harmless.

The original version of this article, along with a companion project demonstrating every pattern discussed here, lives on this blog. I’ll be expanding it based on your questions and feedback before it goes to LeadDev and DZone — so if something here sparks a thought or a disagreement, I’d genuinely like to hear it in the comments.

Key Takeaways

Separate deployment from release. A deployed change that isn’t live yet is a safety net. A deployed change that’s fully live with no way to turn it off is a liability.

Treat flag cleanup as a first-class engineering practice. Naming conventions, expiration dates, and automated removal aren’t overhead — they’re what keep your codebase comprehensible six months from now.

Default to safety. Every flag evaluation should fall back to the known-good path. The time to verify your kill switch works isn’t during an incident at 2:47 AM.

Start small, automate early, and build the habits before you build the flag count. The teams I’ve watched succeed with feature flags aren’t the ones with the most sophisticated tooling — they’re the ones with the most disciplined lifecycle management.

Logic Apps Agent Loop + MCP: Two Bugs Worth Knowing About

I spent the long weekend pushing Logic Apps MCP server capabilities further than I had before — and hit two bugs worth documenting. Both are filed. If you’re building in this space, save yourself the debugging time.

Context

If you’ve been following along, the MCP server and BODMAS Agent are covered in the previous posts. This post is just about what broke when I wired them together.

Bug 1 — Intermittent duplicate key error at tool registration

What happens

The Agent Loop fails with a BadRequest before making a single MCP call:

HTTP request failed: 'An item with the same key has already been added. Key: {tool_name}'.

The key referenced in the error — BasicArithmeticMCP, ExtendedArithmeticMCP, whatever you name it — appears exactly once in the workflow definition. There is no actual duplicate in the JSON.

What makes it particularly frustrating to diagnose

It is intermittent. Some runs fail, others succeed with identical configuration and identical input. No changes between a failing and a succeeding run — same workflow, same expression, same everything.

Load test

I fired 5 to 10 parallel requests at the Agent Loop as a mini stress test. It failed — the duplicate key error appeared across multiple runs in the batch.

Sequential calls with proper spacing between them worked fine.

What you can’t do

The Agent action has a default retry policy, but it does not help here. A BadRequest (400) is not treated as a transient error — the retry policy targets server-side failures (5xx), not client errors. So even with retries configured, the duplicate key error causes an immediate terminal failure. There is no clean in-workflow workaround.

Bug 2 — MCP Connector does not support OAuth

What happens

Both the MCP server and the MCP client are Logic Apps Standard. When OAuth is configured on the MCP server side, the workflow doesn’t trigger at all — it never reaches the Logic App. The connection gets corrupted at design time with the OAuth setup, and no run is created.

  • Tools don’t load, but you can still save the workflow.
  • Pushing a request returns a 502 Bad Gateway.
  • The same endpoint called directly from Postman with a valid bearer token works fine.

Why it matters

To get the Agent Loop working, the MCP server has to run with either anonymous authentication or key-based authentication. OAuth simply does not work with the built-in MCP client connector.

Current state

Both bugs are filed on the Logic Apps GitHub repo, tracked under a single issue:

Agent Loop: “An item with the same key has already been added” when using McpClientTool

The issue covers both bugs with full workflow JSON, reproduction steps, and screenshots. If you’ve hit either of these, add a reaction or comment — the more signal on the issue, the better.

What works in the meantime

  • Set "type": "anonymous" in the McpServerEndpoints authentication block in host.json — removes the OAuth blocker for dev and demo use (a rough sketch follows after this list)
  • Accept the intermittent failure rate on the Agent Loop and re-trigger manually when it hits — not a fix, but the success rate is high enough to keep building and testing
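For the first workaround, the host.json fragment looks roughly like this. Treat the surrounding structure as an assumption; the exact nesting of the MCP server endpoint settings can differ between extension bundle versions, so check it against your own host.json:

{
  "version": "2.0",
  "extensions": {
    "workflow": {
      "McpServerEndpoints": {
        "authentication": {
          "type": "anonymous"
        }
      }
    }
  }
}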

Both bugs are filed. If you hit either of them, the GitHub issue is the right place to add signal.

Mythos Found a 27-Year-Old Bug in OpenBSD. Your Code Is Next.

Anthropic’s new Mythos Preview surfaced a 27-year-old vulnerability in OpenBSD — one of the most heavily audited operating systems in the world — and generated 181 working Firefox exploits in a benchmark where Claude Opus 4.6 managed two. Eleven organizations are inside the launch cohort. The rest of us aren’t, and the next Mythos won’t be gated.

What Mythos is, in hard numbers

On April 7, Anthropic announced Claude Mythos Preview, a frontier general-purpose model with a step-change in computer security capability. The numbers are the story:

  • A 27-year-old vulnerability in OpenBSD, surfaced by Mythos in the TCP SACK implementation. OpenBSD’s audit posture is the high bar in the industry.
  • A 16-year-old vulnerability in FFmpeg’s H.264 codec — the media component shipped in nearly every modern browser and video pipeline.
  • A 17-year-old remote code execution vulnerability in FreeBSD’s NFS implementation (CVE-2026-4747).
  • Linux kernel vulnerabilities autonomously chained by the model into a complete privilege escalation to root.
  • 181 working Firefox exploits in a benchmark where Claude Opus 4.6 produced two — an order-of-magnitude leap in a single model generation.
  • 271 vulnerabilities patched in Firefox 150 after Mozilla used an early version of Mythos Preview to scan its codebase. Mozilla described the model as “every bit as capable” as the best human security researchers.
  • Thousands of zero-days identified in operating systems, browsers, and infrastructure software in the weeks before announcement.

Anthropic was clear about something else worth dwelling on: the company did not explicitly train Mythos for these capabilities. They emerged as a downstream consequence of general improvements in code, reasoning, and autonomy. The same improvements that make the model a better defender make it a better attacker. That equivalence is the whole story.

Mythos isn’t a security tool. It’s a frontier model that happens to be very good at a security task that turns out to require general intelligence. The distinction matters: capability of this kind doesn’t stay siloed.

The asymmetry just collapsed

For thirty years, the offensive-defensive asymmetry in software security was: attackers needed to find one bug, defenders needed to find all of them. The economics favored attackers — but only because finding bugs was hard, slow, and required deep human expertise.

Mythos didn’t flip the asymmetry. It collapsed the cost difference between the two activities. The same model that can find thousands of zero-days for a defender can find thousands of zero-days for an attacker. There is no “attacker mode” and “defender mode.” There is one capability with two uses, and the user picks.

For the launch cohort inside Project Glasswing — including Microsoft, Google, Apple, AWS, JPMorganChase, Nvidia, the Linux Foundation, and major security vendors — this is a defensive windfall. They get to find and patch their own bugs before anyone else can. For everyone else, the math is uglier. When this class of capability becomes broadly available (and it will), the same scan that takes Apple a quiet weekend will take a determined adversary the same quiet weekend.

What this changes about threat modeling

Pre-Mythos, the assumption underlying most enterprise risk frameworks was that vulnerabilities cost time to discover. Post-Mythos, that assumption no longer holds for sophisticated actors. The vulnerabilities are already there, in code that’s already deployed. The only question is who finds them first.

Project Glasswing’s narrow gate

Anthropic’s response to the dual-use problem is Project Glasswing: instead of releasing Mythos publicly, the model is gated to vetted partners doing defensive security work on critical infrastructure. The launch cohort is eleven outside organizations — AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks — with another forty-plus organizations given extended access. Anthropic has committed $100M in Mythos usage credits and additional funding to upstream open-source security ($2.5M to Alpha-Omega and OpenSSF, $1.5M to the Apache Software Foundation). On April 21, Bloomberg and TechCrunch reported that a small group of unauthorized users — reportedly a third-party Anthropic contractor who guessed the model’s online location — had accessed Mythos on the same day Anthropic announced the limited release.

The Glasswing structure is a reasonable response to a hard problem. The cohort is a serious set of defenders, the Linux Foundation’s inclusion broadens the open-source impact, and the upstream funding commitments are not trivial. But the structure has implications worth thinking through:

  • The launch cohort is well-resourced and concentrated. Megacaps, major security vendors, and one open-source foundation. Most enterprises, healthcare systems, utilities, and government agencies are not in the launch cohort.
  • The cohort is the world’s biggest target. Concentrating frontier offensive capability inside a known list of well-resourced firms makes those firms exponentially more valuable to compromise. The April 21 unauthorized-access incident is the canary, not the bird.
  • The gate is temporary. The capability emerged from general intelligence improvements. Other labs are on the same trajectory. Within twelve to twenty-four months, equivalent capability will be available somewhere — through a competitor, an open-weights model, or a leak. Anthropic’s caution buys the industry time. It does not buy the industry safety.
  • The defenders inside the gate have a head start. The defenders outside the gate don’t. By the time Mythos-class capability is broadly available, the cohort will have spent a year hardening their stacks. Everyone else will be starting cold.

None of this is criticism of Glasswing. It’s a description of where the rest of the industry sits: outside the gate, on the clock, with a year-or-so head start to spend on infrastructure that doesn’t assume bug discovery is expensive.

Why your legacy stack is the easy target

If Mythos found a bug in OpenBSD that survived twenty-seven years of obsessive auditing, what does it find in code that’s been quietly running in production since 1998 with no audit at all?

Legacy systems are uniquely exposed to this class of capability for reasons that have nothing to do with their original quality:

  • The code was written in a different threat model. COBOL batch jobs, C-based middleware, and FORTRAN scientific computing were written assuming network isolation, trusted operators, and small adversary budgets. None of those assumptions hold today.
  • The maintainers are gone. The engineers who wrote the original code retired a decade ago. The people who maintain it now read it; they don’t reason about it. A capable adversary scanning the same code reasons about it just fine.
  • The scale is enormous. A typical Fortune 100 enterprise runs millions of lines of legacy code. Manual audit is impossible at this volume; automated tools were built for the threat model where bug discovery was expensive. Mythos-class capability inverts that economics.
  • The code is statistically interesting. Old code has been running long enough that bugs which never triggered in production are still latent. The defects are there. They just haven’t been found yet.
  • The patch path is brittle. Even when a bug is found in a legacy system, the cost of patching is often catastrophic — recompiling a forty-year-old build chain, validating against a forty-year-old behavior contract, regression-testing dependencies that may no longer have maintainers. “We can’t patch this” is a common honest answer for legacy systems, and adversaries know it.

The 27-year-old OpenBSD bug is the canary. OpenBSD is among the most-audited code in the world. Your COBOL payroll system, your FORTRAN actuarial engine, your C-based supply chain ETL — they have not had that audit. They have the same age. They do not have the same hardening.

The honest framing is this: Mythos-class capability does not introduce new vulnerabilities. It surfaces vulnerabilities that have been latent in your systems for years or decades. The defects are already there. The economics of finding them just changed.

The defender’s playbook for the next 90 days

If we accept that Mythos-class capability will be broadly available within twenty-four months and that legacy systems are the most exposed surface, the defensive question is what to do this quarter that materially reduces risk. Five things worth prioritizing.

1. Get an honest inventory of your legacy attack surface

Most enterprises do not have an accurate inventory of what legacy code they actually run, what it touches, and what depends on it. The first step is unglamorous: catalog the legacy systems, their network exposure, the data they process, and the dependencies that would break if they went down. You cannot defend what you cannot see.

2. Build the SBOM you should already have

A Software Bill of Materials isn’t a compliance artifact; it’s the data structure you need to answer the question “is the new zero-day in our stack?” in minutes instead of weeks. Federal contractors will need one for compliance under recent OMB guidance. Build it now, before the next Mythos disclosure forces the question.

3. Modernize the highest-exposure legacy primitives first

Total legacy modernization is a multi-year program. Prioritized modernization isn’t. Identify the legacy components with (a) network exposure, (b) sensitive data flow, and (c) no maintainer — and modernize those first. Pull the C-based parser out of the perimeter. Replace the COBOL service that processes external data with a memory-safe equivalent. Leave the back-office batch job for next year.

4. Assume the patch tsunami is coming

If Mythos-class scanning produces ten thousand findings against your stack, your security team cannot triage ten thousand findings by hand. Invest in automated patch prioritization, exploit-prediction scoring (EPSS), and patch-deployment automation now — before you need it under pressure. The bottleneck of the next two years is not finding bugs. It’s deciding which ones to patch first and shipping the patches without breaking production.

5. Threat-model with AI-assisted attackers in scope

Update your threat models to assume adversaries have Mythos-class capability. The questions change. “What’s our mean-time-to-detect?” matters more than “Is this code vulnerable?” (it almost certainly is). “What’s the blast radius if a single legacy primitive is fully compromised?” matters more than “Is this primitive likely to be compromised?” (it is more likely than it was). Defense in depth, network segmentation, and rapid containment become first-class controls, not best-practice nice-to-haves.

The shift in posture

Pre-Mythos: defenders optimize for bug-finding cost. Post-Mythos: defenders optimize for time-to-patch and blast-radius containment, because bugs will be found whether you find them first or someone else does.

A note for federal contractors

Federal contractors and agencies have an extra layer of implications: the procurement and compliance machinery that governs federal software is going to reckon with this — slowly, but inexorably. Expect SBOM and provenance requirements (already mandated under EO 14028) to get enforced in earnest. Expect NIST SSDF / SP 800-218 to shift from documentation to continuous attestation. Expect legacy waivers to become harder to defend, with risk-acceptance memos required to explicitly acknowledge Mythos-class threat. Expect patch SLAs to compress — sub-week response on high-severity findings against widely-deployed primitives is the realistic floor, not the ceiling. Vendor due-diligence will move from annual questionnaires to continuous attestation.

The realistic posture for the next twenty-four months is not “modernize everything.” It is “modernize the exposed surface, instrument the rest, and assume the rest will eventually be reached.” The agencies and primes that prepare for that reality now will not be the ones writing breach-notification letters in 2027.

The honest read

Mythos is not a doomsday model. It is a step on a curve that the entire industry has been on for several years, and Anthropic’s decision to gate it through Glasswing is, in our view, the responsible move. We don’t think the right reaction is panic, and we don’t think the right reaction is dismissal.

The right reaction is to use the Glasswing window — the twelve to twenty-four months where this capability is concentrated in twelve hands and a national-security agency — to do the unglamorous defensive work that everyone has been deferring. Inventory the legacy. Build the SBOM. Modernize the exposed primitives. Automate the patch path. Threat-model with AI-assisted attackers in scope.

We don’t know exactly when the next Mythos lands or who ships it. We do know it will not be gated like this one. The defenders who used the window will be fine. The defenders who didn’t will be writing the postmortem.

Codavyn helps enterprise and federal teams modernize the exposed surface of legacy stacks before AI-assisted scanning catches up. Custom software, modernization, and a threat model that assumes the attacker is reading your code as fast as you are. See our modernization services or book a 30-minute risk review.

How to Prevent IDOR Vulnerabilities in Django REST APIs

An authenticated user changes /api/orders/42/ to /api/orders/43/ and reads someone else’s order. No privilege escalation needed — the endpoint just returns it. This is IDOR in its simplest form, and it’s endemic in Django REST Framework code because DRF makes it trivially easy to wire up a ModelViewSet that exposes every object in a table. The authentication layer does its job; the authorization layer was never written.

How IDOR Attacks Work Against Django REST APIs

IDOR (Insecure Direct Object Reference) happens when an API accepts a user-controlled identifier — a URL path segment, query param, or request body field — and retrieves the corresponding object without verifying that the requesting user has any right to it. Authentication proves who you are. Authorization proves what you can touch. Most IDOR bugs exist because the first check was implemented and the second was skipped.

A typical attack against a vulnerable DRF app:

  1. Attacker authenticates as alice@example.com and creates an order. The response contains {"id": 101, ...}.
  2. Attacker sends GET /api/orders/100/. The API returns Bob’s order because nothing checks ownership.
  3. Attacker scripts a loop from ID 1 to 10000, dumps every order in the database. Sequential integer PKs make enumeration take seconds.

Here is the vulnerable ViewSet pattern we see most often in real codebases:

# views.py — VULNERABLE
from rest_framework import viewsets
from rest_framework.permissions import IsAuthenticated
from .models import Order
from .serializers import OrderSerializer

class OrderViewSet(viewsets.ModelViewSet):
    serializer_class = OrderSerializer
    permission_classes = [IsAuthenticated]  # proves identity, not ownership

    def get_queryset(self):
        # Returns every order in the database — any authenticated user
        # can retrieve, update, or delete any order by guessing its PK.
        return Order.objects.all()

IsAuthenticated blocks anonymous requests, which makes it look like the endpoint is secured. But any valid session token — including one the attacker registered themselves — bypasses it. The retrieve(), update(), and destroy() actions in ModelViewSet all call get_object(), which calls get_queryset() and then filters by the URL pk. Since get_queryset() returns everything, get_object() happily resolves any ID.

Fixing IDOR by Scoping Querysets to the Authenticated User

The correct fix is to scope get_queryset() to the authenticated user so that the object simply doesn’t exist from the API’s perspective if it doesn’t belong to the requester. This gives you a 404 instead of a 403, which is almost always the right behavior — a 403 confirms the resource exists and leaks information about the ID space.

Add a second layer with a custom BasePermission that implements has_object_permission. The queryset filter handles list and retrieve; the object permission handles mutating actions where DRF calls check_object_permissions explicitly.

# permissions.py
from rest_framework.permissions import BasePermission

class IsOwner(BasePermission):
    def has_object_permission(self, request, view, obj):
        # Explicit ownership check — queryset scoping is the first line,
        # but we defend in depth for any path that bypasses get_queryset.
        return obj.owner == request.user

# views.py — FIXED
from rest_framework import viewsets
from rest_framework.permissions import IsAuthenticated
from .models import Order
from .serializers import OrderSerializer
from .permissions import IsOwner

class OrderViewSet(viewsets.ModelViewSet):
    serializer_class = OrderSerializer
    permission_classes = [IsAuthenticated, IsOwner]

    def get_queryset(self):
        # Scope to the requesting user at the ORM layer — objects that don't
        # belong to this user never enter the retrieval pipeline at all.
        return Order.objects.filter(owner=self.request.user).select_related("owner")

    def perform_create(self, serializer):
        # Bind the new object to the authenticated user so the POST path
        # can't accept a user-controlled owner field.
        serializer.save(owner=self.request.user)

Filtering at the queryset layer beats checking IDs inside the view body for two reasons. First, it’s impossible to forget: every action — list, retrieve, update, partial update, destroy — goes through get_queryset(). Second, it eliminates a whole class of time-of-check / time-of-use bugs where you check ownership in get but forget to re-check in patch.

The same defense-in-depth principle applies to object-level auth in gRPC services and any RPC-style API where the framework doesn’t give you a queryset abstraction: filter first, check permissions on the resolved object second.

Use Unguessable Identifiers Instead of Sequential IDs

Sequential integer PKs are an enumeration gift. Once an attacker has one valid ID, they have a roadmap to every other record. Replacing exposed identifiers with UUIDs or opaque slugs doesn’t fix the authorization hole — that requires the fixes above — but it raises the cost of bulk enumeration from “write a loop” to “brute-force a 128-bit space.”

# models.py
import uuid
from django.db import models

class Order(models.Model):
    # Use UUIDField as the primary key to prevent sequential enumeration.
    # This is defense in depth — queryset scoping is still mandatory.
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    owner = models.ForeignKey(
        "auth.User", on_delete=models.CASCADE, related_name="orders"
    )
    total = models.DecimalField(max_digits=10, decimal_places=2)
    created_at = models.DateTimeField(auto_now_add=True)

# urls.py — router uses the UUID field as the lookup
from rest_framework.routers import DefaultRouter
from .views import OrderViewSet

router = DefaultRouter()
router.register(r"orders", OrderViewSet, basename="order")

# Override lookup_field on the ViewSet to match the UUID primary key
# so DRF resolves /api/orders/<uuid>/ instead of /api/orders/<int>/

# views.py addition
class OrderViewSet(viewsets.ModelViewSet):
    lookup_field = "id"  # matches the UUIDField name on the model
    # ... rest of ViewSet unchanged from the fix above

One tradeoff: UUIDs inflate index size and can slow joins on large tables. If that matters, use a separately-stored public_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True) alongside an integer PK, and expose only public_id in serializers and URLs. The internal integer PK never appears in any HTTP response.
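A sketch of that variant, keeping Django’s implicit integer pk internal and exposing only the opaque ID (illustrative, not a drop-in replacement for the code above):

# models.py: integer PK stays internal; public_id is what URLs and payloads see
import uuid
from django.db import models

class Order(models.Model):
    public_id = models.UUIDField(default=uuid.uuid4, editable=False, unique=True)
    owner = models.ForeignKey("auth.User", on_delete=models.CASCADE, related_name="orders")
    total = models.DecimalField(max_digits=10, decimal_places=2)

# views.py: resolve objects by the opaque field instead of pk
class OrderViewSet(viewsets.ModelViewSet):
    lookup_field = "public_id"
    # The get_queryset() scoping from the fix above still applies unchanged.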

Never treat opaque IDs as a substitute for proper authorization. We’ve reviewed APIs that switched to UUIDs, removed the queryset scoping because “users can’t guess them now,” and then leaked UUIDs in webhook payloads, browser history, or third-party analytics — instantly making every ID known to an attacker.

Enforce Authorization at the Serializer and Nested Resource Level

Queryset scoping protects URL-path-based access. IDOR also hides in writable foreign key fields where a user submits a payload referencing another tenant’s object. A user who owns projects 10 and 11 might try {"project": 99} on a task creation endpoint to attach their task to someone else’s project.

This is especially common in multi-tenant SaaS applications where related resources belong to different organizational boundaries.

# serializers.py
from rest_framework import serializers
from .models import Task, Project

class TaskSerializer(serializers.ModelSerializer):
    class Meta:
        model = Task
        fields = ["id", "title", "project", "due_date"]

    def validate_project(self, value):
        request = self.context.get("request")
        if request is None:
            raise serializers.ValidationError("No request context available.")

        # Reject foreign keys that don't belong to the authenticated user —
        # without this check, any user can write into any project by ID.
        if not Project.objects.filter(id=value.id, owner=request.user).exists():
            raise serializers.ValidationError(
                "Project not found."  # Deliberately vague — don't confirm existence
            )
        return value

Always pass request in serializer context. DRF does this automatically when you use get_serializer() inside a view, but if you instantiate serializers directly (in management commands, signals, or background tasks), you must pass context={"request": request} manually. When there’s no request context at all — background jobs, for example — you need a different mechanism to establish the authorization boundary, typically passing the owner explicitly.
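For instance, when instantiating the serializer directly in code that still has a request object (illustrative):

# Outside get_serializer(), the context must be passed by hand:
serializer = TaskSerializer(data=payload, context={"request": request})
serializer.is_valid(raise_exception=True)
serializer.save()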

The same class of bug appears in writable nested serializers. If a LineItem serializer accepts a nested order object with an id field, a user can point that id at any order. Validate every inbound relation. For more on how this nesting problem scales, the same concepts appear in authorization patterns in GraphQL APIs, where every resolver is effectively a relation that needs its own ownership check.

Test for IDOR with Automated Authorization Checks

The only reliable way to prevent IDOR regressions is to write tests that explicitly attempt cross-user access and assert they fail. Code reviews miss it. Manual QA misses it. Tests that authenticate as user B and try to touch user A’s resources catch it every time — if you write them.

# tests/test_order_idor.py
import pytest
from django.contrib.auth import get_user_model
from rest_framework.test import APIClient
from orders.models import Order

User = get_user_model()

@pytest.fixture
def alice(db):
    return User.objects.create_user(username="alice", password="testpass123")  # noqa: S106

@pytest.fixture
def bob(db):
    return User.objects.create_user(username="bob", password="testpass123")  # noqa: S106

@pytest.fixture
def alice_order(alice):
    return Order.objects.create(owner=alice, total="99.99")

@pytest.mark.django_db
class TestOrderIDOR:
    def _client_for(self, user):
        client = APIClient()
        client.force_authenticate(user=user)
        return client

    def test_bob_cannot_retrieve_alice_order(self, alice_order, bob):
        # 404, not 403 — we don't confirm the resource exists to unauthorized users.
        response = self._client_for(bob).get(f"/api/orders/{alice_order.id}/")
        assert response.status_code == 404

    def test_bob_cannot_update_alice_order(self, alice_order, bob):
        response = self._client_for(bob).patch(
            f"/api/orders/{alice_order.id}/", {"total": "0.01"}, format="json"
        )
        assert response.status_code == 404

    def test_bob_cannot_delete_alice_order(self, alice_order, bob):
        response = self._client_for(bob).delete(f"/api/orders/{alice_order.id}/")
        assert response.status_code == 404

    def test_bob_list_does_not_include_alice_order(self, alice_order, bob):
        # List endpoint must not leak cross-user data even if IDs are unknown.
        response = self._client_for(bob).get("/api/orders/")
        assert response.status_code == 200
        ids = [item["id"] for item in response.data["results"]]
        assert alice_order.id not in ids

The list-endpoint test is easy to forget and catches a different bug: get_queryset() returning everything on list() but correctly filtering on retrieve(). Write both.

Wire these into CI as required checks. A failing IDOR test should block a merge the same way a failing unit test does. This is not optional — the whole point is that a developer adding a new ModelViewSet in a Friday pull request doesn’t ship a data leak to production by Monday.

Catch IDOR in Code Review and CI

Human review of pull requests should pattern-match on a short list of high-risk constructs. Any Model.objects.get(pk=...) or Model.objects.filter(id=...) call that doesn’t chain a user-scoping filter is a candidate IDOR. Any ViewSet missing permission_classes is an unauthenticated endpoint or is inheriting from a base class that may not have adequate defaults. Any serializer field of type PrimaryKeyRelatedField with a broad queryset is a potential cross-tenant write.

Automate this with Semgrep. Here is a rule that flags the most common pattern: a DRF view calling .objects.get() without an owner filter anywhere in the same expression:

# semgrep/rules/drf-idor.yml
rules:
  - id: drf-unscoped-objects-get
    patterns:
      - pattern: $MODEL.objects.get(pk=...)
      - pattern-not: $MODEL.objects.get(pk=..., owner=...)
      - pattern-not: $MODEL.objects.get(pk=..., owner__in=...)
    message: >
      Unscoped .objects.get(pk=...) in a view — add an owner filter or replace with
      a queryset scoped in get_queryset(). Risk: IDOR.
    languages: [python]
    severity: ERROR
    metadata:
      cwe: CWE-639

Run this rule in CI on every pull request, and make it a required status check alongside your test suite, not a separate “security scan” that developers learn to ignore. That is what shifting IDOR checks left actually means in practice.
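Locally and in CI the check is the same single command (the path matches the rule file above; --error makes any finding fail the job):

semgrep scan --config semgrep/rules/drf-idor.yml --error .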

Code review checklist for IDOR-prone patterns:

  • ModelViewSet or GenericAPIView subclass with no explicit get_queryset override — check what the default queryset returns.
  • permission_classes = [] or a ViewSet that inherits permission_classes from a base class you don’t control.
  • PrimaryKeyRelatedField(queryset=Model.objects.all()) in any writable serializer — this lets any user reference any row in the table.
  • perform_create or perform_update that doesn’t pin the owner field, leaving it open to user-supplied values.
  • Tests that only assert status_code == 200 for the happy path, with no cross-user negative test.

SAST tools like Semgrep will catch structural patterns; they won’t catch logic bugs where the filter is present but uses the wrong field. Code review has to cover that gap. The combination — automated rules catching the obvious omissions, human review focused on logic — is more effective than either alone.

Hardening Checklist and Next Steps

The layered controls, in priority order:

Queryset scoping (required): get_queryset() filters by request.user. No exceptions for convenience. If an admin view needs to return all objects, it lives in a separate ViewSet with explicit admin permission checks.

Object-level permissions (required): IsOwner or equivalent BasePermission with has_object_permission as a second line of defense. Attach it to every mutating ViewSet.

Serializer-level FK validation (required for relational writes): Every PrimaryKeyRelatedField or nested writable serializer validates that the referenced object belongs to request.user.

perform_create owner binding (required): Never accept owner from request data. Always call serializer.save(owner=self.request.user).

Opaque identifiers (defense in depth): UUIDs or opaque public IDs in all URLs and serializer output. Opacity slows enumeration but does not replace the controls above, which remain mandatory.

Automated cross-user tests (required for CI gates): One test class per resource that authenticates as User B and asserts 404 on User A’s list, retrieve, update, and delete endpoints.

SAST rules in CI (defense in depth): Semgrep rules flagging unscoped .objects.get() and missing permission_classes, run as required checks on pull requests.
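Put together, a minimal ViewSet applying controls 1, 2, and 4 might look like the sketch below (control 3 lives in the serializer, as shown earlier); it assumes an Order model with an owner field, an OrderSerializer, and an IsOwner permission along the lines described above:

from rest_framework import permissions, viewsets

class OrderViewSet(viewsets.ModelViewSet):
    serializer_class = OrderSerializer
    permission_classes = [permissions.IsAuthenticated, IsOwner]

    def get_queryset(self):
        # Control 1: every action starts from a user-scoped queryset.
        return Order.objects.filter(owner=self.request.user)

    def perform_create(self, serializer):
        # Control 4: the owner comes from the session, never from request data.
        serializer.save(owner=self.request.user)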

These controls address the majority of IDOR patterns in DRF, but authorization bugs extend well beyond the patterns covered here. If you want to build systematic habits around authorization review — across frameworks, auth protocols, and API types — the Application Security Engineer learning path on Code Review Lab covers the full scope, including scenarios more complex than single-tenant ownership checks.

The part most teams skip is the test suite. You can write perfect queryset scoping today and watch a future contributor add a get_object_or_404(Order, pk=pk) shortcut that bypasses it entirely. Tests that authenticate as the wrong user and assert 404 are the only automated check that catches that regression. Write them now, gate CI on them, and review them alongside any new ViewSet. If you want a reference for how IDOR shows up in security interviews and assessments, common IDOR interview questions are a useful signal for the gaps engineers typically leave in production systems.

Further Reading

  • OWASP IDOR Prevention Cheat Sheet — authoritative guidance on access control patterns across frameworks.
  • CWE-639: Authorization Bypass Through User-Controlled Key — the formal taxonomy entry with real-world consequences and detection guidance.
  • Django REST Framework: Permissions — official DRF docs on has_permission and has_object_permission, including check_object_permissions call semantics.
  • Application Security Engineer learning path on Code Review Lab — structured curriculum for building authorization review skills across multiple API paradigms.
  • PortSwigger Web Security Academy: IDOR — interactive labs that demonstrate enumeration, parameter tampering, and horizontal privilege escalation in concrete exercises.

Deploying a Website on Amazon EC2 with Nginx

Creating and Deploying an Instance on Amazon EC2

Have you ever wondered how servers in the cloud work, or how you can publish your own website on the internet without needing a physical server?

In this lab I will guide you step by step through creating an Amazon EC2 instance, clearly explaining each of the required settings so you can understand the process and complete it without complications.

And we won’t stop at theory: we will use Nginx to deploy a real website and learn how to customize it with our own content, making it available from anywhere.

Step 1: Accessing Amazon EC2

To start launching an Amazon EC2 instance, go to the AWS console search bar and type “EC2”.

Once the service appears, click it to open the main dashboard. There you will find an orange “Launch instance” button; select it to begin creating the instance.

Step 2: Initial Instance Configuration

In this step we define the basic parameters of our Amazon EC2 instance.

First, we assign a name that makes the instance easy to identify. In this case we use “laboratorio-ec2”.

Next, we select the AMI (Amazon Machine Image), the operating system template our instance will run. The AMI includes the base system and the initial configuration it needs to work.

For this lab we choose Amazon Linux, since it is optimized for AWS, lightweight, and widely used in real-world environments.

We use t3.micro because it is the most basic and inexpensive option in AWS:

  • It is suitable for learning and testing
  • It is free under the Free Tier
  • It has enough resources for small projects

Step 3: Creating the Key Pair

In this step we create a key pair, which will let us connect securely to our Amazon EC2 instance over SSH.

First, we give the key pair a name so we can identify it easily.

Then, we select the RSA key type, one of the most widely used and compatible algorithms for SSH authentication, offering a good balance of security and ease of use.

For the format, we choose .pem, which is the most suitable for connecting from Linux, macOS, or tools like Git Bash on Windows, allowing us to use the ssh command directly.

It is worth noting that although we created a key pair in this lab, we did not use it for the connection, since we accessed the instance through EC2 Instance Connect, a tool that connects directly from the browser without configuring the private key. Even so, .pem keys are essential in real environments and are the standard practice for secure SSH connections.

Important tip

It is essential to download this file and keep it somewhere safe, since it is required to access the instance. If it is lost, it will not be possible to connect with it.
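For reference, connecting with the downloaded key from a Linux or macOS terminal looks like this (the IP is a placeholder; ec2-user is the default user on Amazon Linux):

chmod 400 laboratorio-ec2.pem
ssh -i laboratorio-ec2.pem ec2-user@YOUR_PUBLIC_IP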

Step 4: Network Configuration

In this step we configure the access rules for our Amazon EC2 instance through a Security Group, which acts as a firewall controlling inbound traffic.

For this lab, we enable the following rules:

SSH (port 22): lets us connect to the instance remotely from our machine.
HTTP (port 80): makes the website reachable from a browser.

These settings are essential: without HTTP access, it would not be possible to view the deployed web page.
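For reference, the same two rules can be added to an existing Security Group with the AWS CLI (the group ID is a placeholder):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 80 --cidr 0.0.0.0/0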

With this, the configuration needed to launch our EC2 instance is complete.

Step 5: Connecting to the Instance

Once the Amazon EC2 instance has launched, we open its details page, where we will find the option to connect.

To do so, select the instance and click “Connect”. Within that section, scroll down to the EC2 Instance Connect option, which gives us direct access from the browser with no additional setup.

Finally, click the “Connect” button, which opens a terminal where we can interact with our instance.

Step 6: Updating the System and Installing Nginx

This command updates the operating system, installing the latest available versions of the packages and fixing potential vulnerabilities:

sudo dnf update -y

This command downloads and installs Nginx on the instance, leaving it ready to be configured and used:

sudo dnf install nginx -y

Step 7: Starting and Enabling Nginx

This command starts Nginx so the web server begins serving requests:

sudo systemctl start nginx

This makes Nginx start automatically every time the instance reboots:

sudo systemctl enable nginx
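To confirm the server is running before moving on, you can check its status and request the default page locally:

sudo systemctl status nginx
curl -I http://localhost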

Step 8: Getting the Public IP Address

To reach our web server, we need the public IP address of the Amazon EC2 instance.

Go to the Instances panel, select the instance we created, and look for the “Public IPv4 address” field in the details section.

This is the address we will enter in the browser to view our web page.
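If you prefer the terminal, the same address can be retrieved with the AWS CLI (the instance ID is a placeholder):

aws ec2 describe-instances --instance-ids i-0123456789abcdef0 --query "Reservations[].Instances[].PublicIpAddress" --output text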

This is the web page we have created.

Step 9: Modifying the Web Page

To customize the content of our site on the Amazon EC2 instance, we need to access the directory where Nginx stores its web files.

First, we change into the corresponding directory:

cd /usr/share/nginx/html

Then, we open the site’s main file:

sudo nano index.html

This file contains the content shown in the browser. Here we can edit it and replace the default Nginx page with our own design.

Step 10: Editing and Saving the Web Page

To customize the site, we delete the existing content of index.html and replace it with the code of our own web page.

Once the changes are made, we save them using the nano editor:

Press Ctrl + X
Nano will ask whether to save the changes (Y/N)
Press Y (Yes)
Finally, press Enter to confirm the file name.
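As a minimal example, the file could contain something like this (any valid HTML works):

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>My EC2 lab</title>
  </head>
  <body>
    <h1>Hello from Amazon EC2 + Nginx!</h1>
  </body>
</html>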

Step 11: Viewing the Web Page

Finally, to see the result of our work, we use the public IP address of the Amazon EC2 instance once more.

Enter the address in the browser:

http://YOUR_PUBLIC_IP

And this is the final result of our web page after the modification.

Lab Takeaways

In this lab I learned, step by step, how to launch and configure an Amazon EC2 instance. I also learned how to connect remotely with EC2 Instance Connect and how to deploy a working web server using Nginx.

I also came to understand the importance of Security Groups for controlling access over SSH and HTTP, and how the public IP address makes a web page reachable from the internet.

Overall, it was a useful exercise for connecting theory with practice and understanding how an application gets published in the cloud.

AI Can Write Your Code. But It Can’t Design Your System.

We are living in the golden age of developer productivity. With tools like Copilot and ChatGPT, you can generate hundreds of lines of boilerplate and complex API endpoints in seconds.

It feels like magic. But there is a hidden danger lurking behind that flashing cursor: If you don’t possess foundational architectural knowledge, AI will just help you build a Big Ball of Mud faster than ever before.

The “Junior Developer on Steroids”

Think of AI as the most enthusiastic, tireless, and blisteringly fast Junior Developer you’ve ever managed. It knows the syntax of every language perfectly.

But it has a fatal flaw: It defaults to the easiest path, not the right one.

If you prompt an AI to “write a function to process a user order,” it will happily give you a massive, 300-line controller method. It will hard-code the database connection, mix in the business validation, trigger a third-party payment API synchronously, and tightly couple the entire thing together.

The code will compile. The tests might even pass. But architecturally? It is a ticking time bomb.

Why Foundational Knowledge is Your Superpower

The developers who will thrive in the AI era are not the ones who can type the fastest. The future belongs to the Clarity Engineers—the developers who understand system design, tradeoffs, and architectural boundaries.

When you know software architecture, your relationship with AI completely changes. Instead of accepting its first messy draft, you hand it an architected prompt like this:

“Write a service class to process user orders. Ensure the core business logic is decoupled from the database using Hexagonal Architecture (Ports and Adapters). The payment processing must not be synchronous; instead, publish a domain event to a message broker so we achieve temporal decoupling.”

Suddenly, the AI isn’t just writing code. It is executing your blueprint.

The Takeaway

AI isn’t going to replace software architects. It is going to make them 10x more powerful. But to wield that power, you need to know the rules of the game so you can instruct the AI on how to play it.

My new book, Grokking Software Architecture (published by Manning Publications Co.), is the practical, conversational guide I wish I’d had when I started my journey nearly two decades ago. It’s fun, engaging, and filled with information you can start using on DAY ONE of a new job, or TODAY at your current one.

Don’t just accept the code the AI hands you. Learn how to hand the AI a blueprint.

Grab your Early Access (MEAP) copy at 🔥 50% OFF today during Manning’s Sitewide Sale: http://hubs.la/Q03-d27Y0

Let’s build systems that last.

Never trust the client with your Stripe price

I was reading a Stripe tutorial last week and watched the author write amount: req.body.amount. That single line lets any user buy Premium for $1. It’s also a common pattern in Stripe Checkout starter code. This post is about why, and how to make it impossible.

The setup

You’re building a paywalled product. You wire up Stripe Checkout, follow a popular tutorial, ship it. Looks great. Tests pass. Users are paying.

Six months later, someone opens DevTools, edits the request body, and pays €1 for your Premium plan. Your Stripe dashboard shows a successful charge. Stripe doesn’t validate your business logic. It charged what it was told to charge. Your database shows a Premium subscription. Your billing logic is doing exactly what you wrote.

This is price tampering. It happens at the one line where the server decides what to charge.

The vulnerable pattern

Here’s the shape of the bug. Paraphrased from a tutorial I won’t link. You’ve seen this shape before:

// app/api/checkout/route.ts (don't do this)
// (assume `stripe` and `origin` are initialized elsewhere in the file)
export async function POST(req: Request) {
  const { priceId, amount, plan } = await req.json();

  const session = await stripe.checkout.sessions.create({
    mode: "payment",
    line_items: [
      {
        price_data: {
          currency: "eur",
          product_data: { name: plan },
          unit_amount: amount, // attacker controls this
        },
        quantity: 1,
      },
    ],
    success_url: `${origin}/success`,
    cancel_url: `${origin}/cancel`,
  });

  return Response.json({ url: session.url });
}

The frontend POSTs { priceId: "premium", amount: 2999, plan: "Premium" }. The server passes amount straight into Stripe. Stripe charges what it’s told.

Exploiting this needs nothing fancy:

curl -X POST https://yoursite.com/api/checkout \
  -H "Content-Type: application/json" \
  -H "Cookie: session=..." \
  -d '{"priceId":"premium","amount":100,"plan":"Premium"}'

amount: 100 is €1.00 in cents. Attacker gets a Stripe Checkout link for €1, completes the payment, and your post-checkout webhook hands them Premium.

The same bug shape applies to priceId if you trust it from the client:

// Also bad. Trusting which price the client picked.
const { priceId } = await req.json();
const session = await stripe.checkout.sessions.create({
  line_items: [{ price: priceId, quantity: 1 }],
  // ...
});

If your “Hobby” plan’s priceId is price_xxx_5eur and your “Enterprise” plan’s priceId is price_xxx_500eur, an attacker swaps the value in the request body and pays €5 for Enterprise.

Why this keeps happening

Three reasons it slips through.

1. Most Stripe tutorials are demos. They want to show you Stripe in 50 lines of code, so they wire the frontend straight to the checkout endpoint. Demos become starter templates. Starter templates become production code.

2. The bug looks like working code. Real users complete real payments. Until somebody opens DevTools, you have no signal that anything is wrong. Logs, dashboards, webhooks, all green.

3. Stripe gives you both APIs. price_data (inline price definition) and price (reference to a Price object) live side by side in their docs. Inline price_data has legitimate uses (true dynamic pricing, donations, marketplace splits). But it’s the same shape as the vulnerable pattern, so the bug hides in plain sight.

The fix in one rule

The client tells you which plan the user wants. The server decides what that plan costs.

That’s it. Implementation:

// app/api/checkout/route.ts (server-determined pricing)
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);
const origin = process.env.APP_ORIGIN!; // e.g. "https://yoursite.com"

const PLANS = {
  hobby: { priceId: process.env.STRIPE_PRICE_HOBBY },
  premium: { priceId: process.env.STRIPE_PRICE_PREMIUM },
  enterprise: { priceId: process.env.STRIPE_PRICE_ENTERPRISE },
} as const;

type PlanKey = keyof typeof PLANS;

export async function POST(req: Request) {
  const { plan } = (await req.json()) as { plan: PlanKey };

  // 1. Validate the plan key against a server-side allowlist
  if (!Object.hasOwn(PLANS, plan)) {
    return new Response("Invalid plan", { status: 400 });
  }

  // 2. Look up the priceId server-side. Never accept it from the client.
  const { priceId } = PLANS[plan];

  const session = await stripe.checkout.sessions.create({
    mode: "subscription",
    line_items: [{ price: priceId, quantity: 1 }],
    success_url: `${origin}/success`,
    cancel_url: `${origin}/cancel`,
  });

  return Response.json({ url: session.url });
}

The client sends { plan: "premium" }. That’s the most they can influence. The mapping from "premium" to a real, server-controlled priceId is unforgeable. If the attacker sends { plan: "free_lifetime" }, the allowlist check rejects it. If they send { plan: "premium", amount: 100 }, the amount field is ignored. It doesn’t exist in the server’s logic.

For genuinely dynamic amounts (donations, custom one-off charges), you compute the amount on the server from inputs you’ve validated:

// Dynamic amount, still server-determined
const { donationCents } = await req.json();

if (
  typeof donationCents !== "number" ||
  !Number.isInteger(donationCents) || // Stripe amounts are integer cents
  donationCents < 100 ||
  donationCents > 100000
) {
  return new Response("Invalid amount", { status: 400 });
}

const session = await stripe.checkout.sessions.create({
  mode: "payment",
  line_items: [
    {
      price_data: {
        currency: "eur",
        product_data: { name: "Donation" },
        unit_amount: donationCents,
      },
      quantity: 1,
    },
  ],
  // ...
});

The user can choose the amount, but only within bounds you’ve defined. They can’t pass unit_amount: 1 if your minimum is 100.

How to verify you don’t have this bug

A two-minute self-audit:

# 1. Open your /pricing page. Click your most expensive plan.
#    Watch the Network tab when you hit "Subscribe" or "Buy".

# 2. Find the request to your checkout-create endpoint. Copy it as cURL.

# 3. Replay it with a tampered body. Change priceId, amount, plan name,
#    quantity, anything money-shaped:
curl -X POST https://yoursite.com/api/checkout \
  -H "Content-Type: application/json" \
  -H "Cookie: <your auth cookie>" \
  -d '{"plan":"premium","priceId":"price_FAKE","amount":1,"quantity":-1}'

# 4. Check the response. If you got a Stripe Checkout URL, open it.
#    If the price shown is anything other than your real plan price, you have a bug.

If the resulting Stripe Checkout page shows the correct, original price regardless of what you sent, you’re safe. If it reflects the tampered fields, patch before you do anything else.

Three more places the same bug hides

Once “the server owns money-shaped values” clicks for you, you start seeing it everywhere.

1. Quantity. Same bug, different field. quantity: -1 in older Stripe versions caused weird negative-amount behavior. Validate quantity bounds explicitly.

2. Coupon / promo codes from client. If you let the client say “apply coupon XYZ,” the server has to verify XYZ is real, active, and applies to this plan for this user. Never just pass it through.

3. Customer ID. If the client sends { customerId } to attach the checkout to an existing Stripe customer, an attacker can swap their customerId for someone else’s. Always derive customerId from the authenticated session on the server.

The pattern: anything that influences money or attribution comes from authenticated server state, not from the request body.
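As a sketch, here is the checkout route with all three handled server-side, reusing PLANS, PlanKey, stripe, and origin from the fix above; getAuthenticatedUser is a hypothetical stand-in for whatever your auth layer provides:

// Sketch: money- and attribution-shaped values come from server state.
export async function POST(req: Request) {
  const user = await getAuthenticatedUser(req); // hypothetical auth helper
  if (!user) return new Response("Unauthorized", { status: 401 });

  const { plan, quantity } = (await req.json()) as {
    plan: PlanKey;
    quantity?: number;
  };
  if (!Object.hasOwn(PLANS, plan)) {
    return new Response("Invalid plan", { status: 400 });
  }

  // Clamp quantity to server-defined bounds instead of trusting it.
  const qty =
    typeof quantity === "number" &&
    Number.isInteger(quantity) &&
    quantity >= 1 &&
    quantity <= 10
      ? quantity
      : 1;

  const session = await stripe.checkout.sessions.create({
    mode: "subscription",
    customer: user.stripeCustomerId, // from your database, never the body
    line_items: [{ price: PLANS[plan].priceId, quantity: qty }],
    success_url: `${origin}/success`,
    cancel_url: `${origin}/cancel`,
  });

  return Response.json({ url: session.url });
}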

The principle

Stripe is one of the safer payment APIs because it pushes you toward the right patterns most of the time. But it can’t enforce “client doesn’t send money values”. That’s on your code. The same principle applies anywhere the client shouldn’t have authority: authorization roles, feature flags, internal IDs, prices, plan tiers, expiration dates.

Think of a request body as a wish, not a fact. The server decides what to grant.

I run MatchResume.ai, a B2C SaaS with token-based pricing. Exactly the kind of product where this bug would have been embarrassing. The pattern above is what I wish every Stripe tutorial led with, instead of saving it for a footnote.

If you ship paid features and you’ve never tampered with your own checkout request as a test, do it tonight. Two minutes, one curl, real peace of mind.