How Intelligent Agents Work — From Perception to Decision and Action

AI is not just models.

It is a system that perceives, decides, and acts.

If you only think in terms of algorithms, you miss the bigger structure.

The real question is:

How does an AI system turn input into action?

Core Idea

An intelligent agent is the simplest way to understand AI as a system.

It takes input from the environment.

Processes that information.

Then selects an action.

That loop defines AI behavior.

The Key Structure

The basic agent loop looks like this:

Environment → Perception → State → Decision → Action → Environment

Or, more compactly:

Agent = Perception + Decision + Action

This is why the agent concept matters.

It connects data, reasoning, and behavior into one structure.

Implementation View

At a high level, an agent behaves like this:

observe environment

update internal state

evaluate possible actions

choose the best action

execute action

repeat
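
Here is a minimal sketch of that loop in Python. The environment, state-update, and scoring functions are placeholders for whatever your domain provides, not part of any specific framework.

def run_agent(env, actions, update_state, score, steps=100):
    """Minimal agent loop sketch; env, update_state, and score are hypothetical."""
    state = {}                                    # internal state the agent maintains
    for _ in range(steps):
        observation = env.observe()               # perceive the environment
        state = update_state(state, observation)  # fold the observation into state
        best_action = max(actions, key=lambda a: score(state, a))  # evaluate options
        env.execute(best_action)                  # act back on the environment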

This loop appears everywhere.

Game AI.

Robotics.

Autonomous systems.

Recommendation systems.

Even large language models follow a version of this pattern.

Concrete Example

Imagine a simple robot.

It receives sensor input.

It detects obstacles.

It chooses a direction.

It moves.

That is already an intelligent agent.

Now scale that idea:

A recommendation system observes user behavior.

Updates internal preferences.

Chooses the next item to show.

That is also an agent.

Different domain.

Same structure.

Reactive vs Intelligent Agent

Not all agents are equal.

This comparison matters.

Reactive agent:

  • responds directly to input
  • no memory or internal model
  • simple and fast
  • limited flexibility

Intelligent agent:

  • maintains internal state
  • evaluates future outcomes
  • can optimize decisions
  • adapts to complex environments

So the difference is not just complexity.

It is the presence of internal reasoning.
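
A rough sketch of the contrast, with placeholder names rather than any particular library:

# Reactive: maps the current input directly to an action. No memory, no model.
def reactive_agent(observation):
    return "turn_left" if observation == "obstacle_ahead" else "go_forward"

# Intelligent: keeps internal state and evaluates candidate actions against it.
class IntelligentAgent:
    def __init__(self):
        self.visited = set()                      # a (very small) internal model

    def act(self, observation, candidate_actions, score):
        self.visited.add(observation)             # update internal state
        # pick the action with the best predicted outcome given what we know
        return max(candidate_actions, key=lambda a: score(self.visited, a))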

Why Cognition Matters

As problems become more complex, simple reaction is not enough.

The agent needs internal representation.

Memory.

Inference.

That is where cognition comes in.

Cognitive systems treat thinking as information processing.

Input is transformed into internal structure.

That structure supports reasoning.

So the flow becomes:

Perception → Representation → Reasoning → Action

Without this layer, AI is limited to simple responses.

With it, AI can plan and infer.

Action vs Understanding

This is where things get interesting.

Does acting correctly mean understanding?

A system can follow rules and produce correct outputs.

But does it truly understand meaning?

This question is not just philosophical.

It affects how we interpret AI systems.

Rule-following can look like intelligence.

But it may not imply true understanding.

That distinction matters when designing or evaluating AI.

Decision vs Free Will

If an agent chooses actions, is that the same as free will?

In humans, experiments suggest decisions may begin before conscious awareness.

In AI, decisions are the result of computation.

So the deeper question becomes:

Is decision-making just a process?

Or is there something more?

Even if you do not answer it fully, this perspective helps you see AI systems differently.

They are not just tools.

They are structured decision systems.

From Agents to Modern AI Systems

The agent view scales.

Search algorithms:

  • choose next state

Knowledge-based systems:

  • use rules and inference

Neural networks:

  • learn representations

Modern AI combines these ideas.

Perception.

Representation.

Decision.

Learning.

The agent is the unifying abstraction.

Why This Matters

If you only learn models, you miss system design.

If you understand agents, you understand AI structure.

That matters in practice.

Because real systems are not just one model.

They are pipelines.

Loops.

Decision processes.

The agent view helps you design them.

Recommended Learning Order

If this feels broad, follow this order:

  1. Agent vs Intelligent Agent
  2. Intelligent Agent
  3. Cognitive Agents
  4. Cognitivism
  5. Chinese Room Argument
  6. Free Will and Decision Systems

This order works because you first understand action.

Then internal reasoning.

Then the limits of understanding.

Takeaway

AI is best understood as an agent.

Not just a model.

Not just an algorithm.

A system that:

  • perceives
  • represents
  • decides
  • acts

The shortest version is:

Agent = perception + decision + action

If you remember one idea, remember this:

AI systems are decision loops, not isolated models.

Discussion

When designing AI systems, do you think more in terms of models, or in terms of agents that interact with environments?

Originally published at zeromathai.com.
Original article: https://zeromathai.com/en/intelligent-agent-and-cognition-hub-en/

GitHub Resources
AI diagrams, study notes, and visual guides:
https://github.com/zeromathai/zeromathai-ai

Your MCP server eats 55,000 tokens before your agent says a word — I measured the real cost

The invisible bill

I was debugging why my Claude Code sessions felt sluggish after connecting a few MCP servers. Token usage was through the roof — but I hadn’t even asked the agent to do anything yet. I rewrote my prompts three times before I thought to check where the tokens were actually going.

Turns out, the moment you connect an MCP server, every tool definition gets loaded into the context window. Names, descriptions, parameter schemas, enum values — all of it, on every single conversation turn. Not just when you call a tool. Every turn.

Think of it like walking into a library to read one book, but the librarian insists you read the entire catalog first. Every time you walk in.

The measurement: 4 servers, 1,500x cost difference

I measured the tool-definition token overhead for four MCP servers, from minimal to massive:

| MCP Server | Tools | Est. tokens | Monthly cost (10 calls) |
|---|---|---|---|
| PostgreSQL | 1 | ~35 | ~$0.0005 |
| Google Maps | 7 | ~704 | ~$0.009 |
| GitHub | 26 | ~4,242 | ~$0.06 |
| GitHub (full) | 93 | ~55,000 | ~$0.74 |

PostgreSQL to full GitHub: a 1,500x difference. Same protocol, same “MCP server” label, radically different cost profiles.

And this is just the definition overhead. The actual tool calls consume additional tokens on top.

Where the tokens go

A single MCP tool definition looks harmless:

{
  "name": "gmail_create_draft",
  "description": "Creates a draft email...",
  "inputSchema": {
    "type": "object",
    "properties": {
      "to": { "type": "string", "description": "..." },
      "subject": { "type": "string", "description": "..." },
      "body": { "type": "string", "description": "..." }
    }
  }
}

That single tool? 820 tokens. More than the entire PostgreSQL MCP server with its one tool.

Now multiply. A business API like a full accounting platform might expose 270+ tools across invoicing, HR, payroll, time tracking, and sales management. At ~65 tokens per tool average, that’s 17,500 tokens consumed before your first question.

Connect three services like that simultaneously, and you’re burning 143,000 out of 200,000 tokens on schema definitions alone. 71% of your context window is gone. Your agent is trying to think inside a closet.

At scale, the math gets uncomfortable: 1,000 requests/day with heavy MCP overhead = roughly $170/day = $5,100/month — just for loading tool schemas.

The quality cliff

Token cost isn’t even the worst part. Claude’s output quality visibly degrades after 50+ tool definitions are loaded. The model starts chasing tangents, referencing tools instead of answering your actual question.

More tools in context doesn’t mean more capability. Past a threshold, it means worse capability. I confirmed this firsthand — five servers connected, and my agent started recommending create_github_issue as the fix for a database timeout. Very confident. Very wrong.

Three strategies to cut 95%

Strategy 1: Expose only what you need

If you’re using an accounting platform’s 270 tools but only need 10 for your tax filing workflow:

{
  "mcpServers": {
    "accounting": {
      "allowedTools": [
        "create_transaction",
        "list_transactions",
        "get_trial_balance",
        "list_account_items",
        "list_partners"
      ]
    }
  }
}

10 tools instead of 270: ~650 tokens instead of ~17,500. 96% reduction.

Strategy 2: Write tighter descriptions

API docs make terrible tool descriptions. They’re written for humans who read documentation; LLMs need the compressed version.

// Before: ~80 tokens
{
  "description": "Uses the accounting API to create a new
    transaction (journal entry) for the specified company ID.
    You can specify amount, date, account item, partner name,
    memo, and more. Tax category is auto-determined."
}

// After: ~20 tokens
{
  "description": "Create transaction. Args: amount, date, account_item, partner"
}

75% fewer tokens, same functionality. The model doesn’t need a paragraph to understand what create_transaction does.

Strategy 3: Connect only when needed

Don’t keep all MCP servers connected during every conversation. Connect the accounting server when you’re doing accounting work. Disconnect it when you’re writing code. This alone zeroes out overhead for unrelated tasks.

MCP Tool Search: the protocol-level fix

In January 2026, a protocol-level solution arrived: MCP Tool Search. When tool definitions exceed 10% of your context window, the client automatically defers loading them. Instead of dumping every schema into context, the model discovers and loads tools on-demand via search.

Early reports show a 95% reduction in startup token cost. The schema bloat problem is being solved at the infrastructure level, not just through workarounds.

But Tool Search isn’t universally deployed yet. Until it is, the three strategies above are your defense.

What to check right now

1. Count your tools. Run tools/list against each connected MCP server and count the total (a rough estimation sketch follows this list). If you’re above 30 tools across all servers, you’re likely paying a meaningful overhead tax.

2. Audit descriptions. Look at the JSON schemas your servers return. Are the descriptions essay-length? Trim them. Every token in a description is paid on every conversation turn.

3. Use allowedTools. Most MCP clients support filtering which tools are exposed. Use it. There’s no reason to load 270 tools when you need 10.

4. Measure before/after. Token usage is visible in most LLM clients. Check your per-turn consumption before and after connecting each MCP server. The numbers will tell you exactly which servers are expensive.
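
As a rough way to put a number on the first check, here is a small sketch that estimates definition overhead from a saved tools/list response. It assumes you have dumped the JSON to a file named tools.json, and it uses a crude four-characters-per-token approximation instead of a real tokenizer.

import json

# Rough estimate of MCP tool-definition overhead from a saved tools/list response.
# tools.json and the ~4 chars/token heuristic are assumptions, not exact accounting.
with open("tools.json") as f:
    tools = json.load(f)["tools"]

total_chars = sum(len(json.dumps(tool)) for tool in tools)
est_tokens = total_chars // 4

print(f"{len(tools)} tools, ~{est_tokens} tokens of definitions on every conversation turn")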

The irony of MCP: the protocol designed to extend AI capabilities can end up crippling them — if you load too many tools and leave no room for actual thinking.

This article is based on Chapter 3 of MCP Security in Practice: What OWASP Won’t Tell You About AI Tool Integrations. The book covers the full token cost analysis across services, OWASP MCP Top 10 security risks, file upload limitations, and production hardening patterns.

Generation 1 — Standalone Models (2018–2022)

The Foundation of Modern AI Systems
When people think of tools like ChatGPT, they often assume the intelligence comes from a single powerful system that “remembers,” “reasons,” and “understands context.”

That intuition is misleading. To truly understand how modern AI systems evolved, we need to go back to Generation 1 — the era of Standalone Models, where everything began. Generation 1 (2018–2022) refers to the period defined by:

  • Large pre‑trained models like GPT, GPT‑2, and GPT‑3
  • Minimal system design around them, with no real external memory or tool integration

These models were powerful—but fundamentally isolated. They could generate text, but they couldn’t access information, retrieve knowledge, or take actions beyond what was encoded in their training data.

The Core Idea: AI as a Stateless Engine

At the heart of Generation 1 is a critical concept: the model is stateless. Every time you send a prompt, the model processes it independently. It does not remember previous interactions, and it does not learn in real time. This is true for GPT-3, Claude, Gemini, and Grok. Different vendors, same architectural truth.

The 3-Layer Architecture (Simplified Mental Model)
Even in Generation 1, what you interact with (like ChatGPT) is not just a model.

It can be understood as three distinct layers:

➡️Layer 1 — The UI Layer (Interaction Surface)
This is everything the user directly touches. It includes the chat window, the input box, the streaming response area, the conversation sidebar, the “regenerate” button, and even small touches like the copy‑to‑clipboard icon.

You see this layer in tools like ChatGPT, Claude.ai, Perplexity, Gemini, and chat panels inside apps like Cursor or Slack.

Core responsibilities

  • Capture user intent — text input, file uploads, voice, images, tool toggles, model selection
  • Render model output — token‑by‑token streaming, markdown, code blocks, math, citations
  • Create continuity — the illusion that the AI “remembers” the conversation
  • Manage session state — active chat, history navigation, drafts, error recovery
  • Surface controls — stop, regenerate, edit message, branch conversation, share, export

The non‑obvious insight
A great UI layer is what makes ChatGPT feel magical.
Under the hood, it’s the same model you could call with a simple API request.
But the experience is completely different.

➡️Layer 2 — The Orchestration Layer (The Hidden Middleware)
This is the layer most beginners never notice — and it’s the reason many “ChatGPT clones” feel broken or low‑quality. It sits between the UI and the model, quietly doing a huge amount of work the user never sees but always feels. When you send a message to ChatGPT, the text that reaches the model is not the raw message you typed. The orchestration layer transforms it first.

What this layer does

  • System prompt injection — Adds a long, carefully written instruction set that defines the assistant’s personality, tone, abilities, and safety rules.
  • Conversation history management — Decides which past messages to include, which to summarize, and which to drop as the context window fills.
  • Context window budgeting — Tracks token usage across system prompt + history + user message + expected output.
  • Safety and policy filtering — Checks your message before it reaches the model, and checks the model’s output before it reaches you.
  • Rate limiting and quotas — Enforces usage limits that show up as “You’ve reached your limit.”
  • Routing logic — Sends simple queries to cheaper models and complex ones to stronger models.
  • Telemetry and evaluation — Logging, A/B tests, quality checks, and feedback loops.

The non-obvious part: This is where AI products truly differentiate themselves. Two companies can use the same base model, yet one feels magical and the other feels clunky. Why?

Because most of the perceived quality comes from the orchestration layer — not the model.

Why “stateless model + stateful product” matters

The model behind ChatGPT is stateless. Every request is a fresh start.
It doesn’t remember your name, your last message, or that you said “use Python” earlier.

The illusion of memory and continuity is created by the orchestration layer, which replays the relevant parts of your conversation every single time.

This is the most important idea for beginners to understand:

Continuity is created by the UI + orchestration layer, not by the model.

Even today, “memory” features are built on top of the model — the model itself still forgets everything between calls.
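
As a minimal illustration of that replay, here is a sketch of an orchestration loop. The call_model function is a hypothetical stand-in for whatever chat completion API you use; the point is simply that the system prompt and the full history are re-sent on every turn.

SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely."

def chat_session(call_model):
    """Sketch of Layer 2's replay loop; call_model is a hypothetical stand-in."""
    history = []                                     # lives in the product, not the model
    while True:
        user_message = input("> ")
        history.append({"role": "user", "content": user_message})
        # Every turn, the stateless model sees the system prompt plus the replayed history.
        messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history
        reply = call_model(messages)
        history.append({"role": "assistant", "content": reply})
        print(reply)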

➡️Layer 3 — The Model Layer (The Engine That Generates the Output)
This is the part everyone thinks they’re interacting with — the actual AI model. In reality, it’s only one piece of the system, but it’s the piece that does the core job: turning text in → generating text out.
At this layer, things are surprisingly simple.
What the model actually does: it takes the final prompt created by the orchestration layer and predicts the next token. Then the next, and the next, until it forms a complete response. That’s it.

  • No memory.
  • No awareness.
  • No understanding of past conversations unless they’re replayed to it.

What the model doesn’t do

  • It doesn’t remember previous chats
  • It doesn’t store facts about you
  • It doesn’t know the “session” you’re in
  • It doesn’t know what it said 10 minutes ago
  • It doesn’t know what tools the product has

All of that lives in Layer 2, not here.

Why this layer still matters
Even though the model is “just” a prediction engine, it defines the system’s raw capabilities:

  • Language fluency
  • Reasoning ability
  • Knowledge encoded during training
  • Creativity and style
  • Generalization

A stronger model gives the orchestration layer more to work with — but the model alone is never the full product.

The key beginner insight
The model is stateless. Every request is a blank slate. It only knows what’s inside the prompt it receives right now. This is why the orchestration layer is so important: it builds the illusion of memory, personality, and continuity. The model simply reacts to whatever text it’s given.

Putting it all together

  1. Layer 1 (UI) makes the experience feel smooth
  2. Layer 2 (Orchestration) makes the experience feel intelligent
  3. Layer 3 (Model) generates the actual words

Most people think they’re talking to Layer 3.
In reality, they’re experiencing all three layers working together.

But the foundation remains:

UI + Orchestration + Model

Key Takeaway for Developers

If you remember one thing, make it this: LLMs don’t remember—they are made to simulate memory through prompt construction.

This insight is essential when:

  • Designing AI applications
  • Debugging responses
  • Optimizing prompts
  • Building scalable systems

What Comes Next?

Generation 1 solved text generation. But it couldn’t:

  • Fetch real-time data
  • Ground responses in facts

That led to the next evolution:

➡️ Generation 2 — RAG (Retrieval-Augmented Generation)
Where models are no longer isolated—but connected to knowledge.

Final Thought
Generation 1 was not about building “smart assistants.”
It was about discovering that a stateless probabilistic model, when scaled, can simulate intelligence. Everything that followed—RAG, agents, multi-agent systems—is built on top of this simple but powerful idea.

TaskDev – a task runner for AI coding agents (MCP)

One place for your dev tasks. One place for your logs. And your AI agent sees them too.

Like most developers working on web apps, I usually have a few long-running processes open during the day:

  • the API server
  • the frontend dev server
  • a build watcher

Usually one terminal each. That works, but it is not the handiest setup – you end up jumping between tabs to check what is running and where the logs are.

TaskDev puts them in one place – and makes them visible to your AI agent over MCP.

TaskDev sidebar showing a project node with two tasks

Why I built TaskDev

Agents can read output, but they can’t manage processes.

AI coding agents – Codex, Claude Code, Windsurf Cascade, Cursor – write code well and can read terminal output. What they lack is a stable interface for starting, stopping, and tracking long-running processes. So they spawn duplicates, lose track of what is running, fight stuck ports, and retry until the developer takes over.

The Model Context Protocol (MCP) makes a unified solution possible: one task list that both the developer and the agent can drive.

That is TaskDev:

  • a sidebar for the developer
  • an MCP server for the agent
  • one source of truth – same tasks, same processes, same logs
  • agent commands are sandboxed (see Trust and safety below)

The agent problem, in detail

Long-running tasks like a web service are the worst case:

  • the agent forgets a task is already running and starts it again – and again
  • the previous process still holds the port, so the new one fails
  • it sometimes takes several attempts to stop a task, burning tokens for no reason
  • some agents spawn tasks in hidden terminals or redirect the console output, and the developer doesn’t see what is going on
  • the agent waits forever on a command that never returns

The result: failed attempts, wasted tokens, and a developer forced to intervene.

The agent itself is not the issue. It just doesn’t have a reliable control interface to manage tasks.

TaskDev is a small, lightweight process supervisor that provides exactly that interface – start, stop, restart, status, logs.

What it is

A small extension for VS Code-based editors (VS Code, Cursor, Windsurf).

  • plain JSON config
  • local processes
  • local logs
  • no telemetry

Tasks are defined in taskdev.json at the root of the workspace.

Install TaskDev

Repository: github.com/tolbxela/taskdev – MIT license.

Install TaskDev from the Extensions panel – search for TaskDev:

  • VS Code → Visual Studio Marketplace
  • Cursor and Windsurf → Open VSX Registry

Then drop a taskdev.json in your workspace and run TaskDev: Install MCP config to wire up the agent side.

Configuration

Example for an ASP.NET Core + Vue.js project:

{
  "project": "My App",
  "tasks": [
    {
      "name": "api",
      "command": "dotnet run --project src/Api",
      "detail": "Starts the backend API",
      "icon": "server-process"
    },
    {
      "name": "ui",
      "type": "npm",
      "command": "npm run dev",
      "cwd": "ui",
      "detail": "Starts the Vite dev server",
      "icon": {
        "id": "globe",
        "color": "terminal.ansiBlue"
      }
    }
  ]
}

Each task needs a name and a command. Everything else is optional:

  • cwd – working directory for the command
  • env – extra environment variables
  • detail – short description shown in the sidebar
  • icon – a codicon id, or { id, color }
  • type – a free-form label like npm or dotnet

Add as many tasks as you want. Two shapes fit naturally:

  • long-running – dev server, build watcher, worker, tunnel, test watcher
  • repetitive – test run, lint, type-check, one-off build, data seed

Both end up in the same sidebar with the same logs, and the agent can start either one on demand.

Multi-root workspaces are supported: each folder can have its own taskdev.json.

Sidebar with the title-bar Open taskdev.json button next to the open config

The sidebar

Click the TaskDev icon in the Activity Bar. You get a tree grouped by project – one node per workspace folder that has a taskdev.json. The project header shows the task count and how many are running.

Each task row shows:

  • an icon (auto-picked from the name, or whatever you set in icon) that turns green while the task is running
  • the task name, plus either the first line of detail or running · 12m once started
  • a rich tooltip on hover with status, command, cwd, PID, uptime, and log path

Inline buttons appear on the task row:

  • play when the task is stopped
  • stop when it is running
  • log to open the current log file in the editor

Hovering a task row reveals Start task and Show log buttons

Clicking log opens the current run in a regular editor tab – searchable, scrollable, and the same file the agent reads over MCP.

Task log open beside the sidebar

The view title has three more actions:

  • Install MCP config – wire up agents (see below)
  • Open taskdev.json – jump to the config, or create one if it is missing
  • Refresh – re-read the config

TaskDev sidebar showing a project node with two tasks

The sidebar refreshes itself every 10 seconds while at least one task is running, every 60 seconds otherwise, and immediately when you edit taskdev.json. Multi-root workspaces show each project side by side.

MCP integration

Run TaskDev: Install MCP config from the command palette and pick which agents to wire up. Detected config files are pre-checked.

Install MCP config picker listing Windsurf, Claude Code, Cursor, Codex, and workspace-scoped configs

The MCP config is only written when this command runs. Nothing happens implicitly.

One necessary drawback is that the MCP config stores the installed extension path, which changes with each new TaskDev version. So you need to re-run TaskDev: Install MCP config after each update. TaskDev will prompt you after an upgrade, but the configs are only rewritten when you confirm in the picker.

The agent gets eight tools:

| Tool | Purpose |
|---|---|
| taskdev_list | list tasks with status, PID, command, cwd, log path |
| taskdev_status | status of one task or all |
| taskdev_control | start or stop a task |
| taskdev_restart | stop and start |
| taskdev_logs | read recent log lines (current run, or an older run by file) |
| taskdev_logs_history | list previous log files for a task |
| taskdev_add | add a task (with confirmation) |
| taskdev_remove | remove a stopped task (with confirmation) |

Agents communicate with TaskDev over MCP and can manage tasks efficiently.

Typical agent loop: change code → taskdev_restart api → taskdev_logs api → read the error → fix or report.

No retry loops. No hung commands. No wasted tokens.

Trust and safety

Commands in your own taskdev.json are normal shell commands – treat the file like code, and only run it in trusted workspaces.

Agent-added tasks (taskdev_add) are sandboxed:

  • no shell chaining, redirects, variables, or subshells
  • no path traversal or arguments outside the project
  • no risky env overrides (PATH, NODE_OPTIONS, dynamic-loader vars, …)
  • only known dev command shapes – npm / pnpm / yarn scripts, dotnet, cargo, go
  • explicit confirmation before any add or remove

The agent can spin up dotnet test. It cannot invent curl ... | sh.

For the exact allow-list, env rules, runtime layout, and MCP tool reference, see security-and-config.md. For setup, see the extension README.

Feedback

Found a bug or have an idea? Open an issue at github.com/tolbxela/taskdev/issues.

What Building a SAST Tool Taught Me About AppSec That 13 Years of Software Engineering Didn’t

I’ve been writing software professionally since 2011.

Java, C#, Kotlin, Node.js. Enterprise backends, microservices, APIs, data pipelines. I’ve shipped production code that millions of people have used without knowing it. I’ve led teams, reviewed architectures, mentored junior engineers, and done all the things that accumulate into what people call “senior software engineer.”

And yet, when I decided to transition into application security, I realised I had significant blind spots — not about how software works, but about how software fails. Specifically, how it fails in ways that attackers can exploit.

This is the final article in a series about building a SAST scanner from scratch, embedding it in CI/CD pipelines, writing custom detection rules, and managing false positives. But it’s really about what that whole process taught me about application security as a discipline — and what I wish I’d understood earlier.

I Knew How to Write Secure Code. I Didn’t Know Why It Was Secure.

Here’s an embarrassing admission: I’ve been using parameterised queries for SQL for at least a decade. I knew you were supposed to use them. I used them every time. I would have told you confidently that they prevent SQL injection.

But if you’d asked me, before I started studying AppSec seriously, to explain why they prevent SQL injection — the actual mechanism — I would have given you a hand-wavy answer about “the database handling it separately.”

Building the SQL injection detection rule forced me to get precise. I had to understand exactly what makes "SELECT * FROM users WHERE id = " + userId dangerous, what makes SELECT * FROM users WHERE id = ? with a bound parameter safe, and why the difference matters at the level of how the database parses and executes the statement.

The answer — that parameterised queries send the query structure and the data in separate messages, so the database never attempts to parse the data as SQL syntax — is not complicated. But I didn’t actually know it at that level of precision until I had to write a rule that distinguishes between the two patterns.
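
To make that concrete, here is a small illustration using Python’s built-in sqlite3 module (the article’s examples are Java-flavoured, but the mechanism is the same; this is just a runnable sketch).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('1', 'alice'), ('2', 'bob')")

user_id = "1 OR 1=1"  # attacker-controlled input

# Vulnerable: the input is concatenated into the statement, so the database
# parses "OR 1=1" as SQL and the condition matches every row.
print(conn.execute("SELECT * FROM users WHERE id = " + user_id).fetchall())

# Safe: the statement with a placeholder is sent first, the value is bound as data,
# and the database never parses the input as SQL syntax.
print(conn.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchall())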

This was a theme throughout the project. I knew the what of secure coding from years of following conventions and best practices. Building detection rules forced me to learn the why — the actual attack mechanics that the conventions are defending against.

The lesson: Knowing the secure pattern is not the same as understanding the vulnerability. For a software engineer, the secure pattern is enough to write safe code. For an AppSec engineer, you need to understand the attack, because your job is to find it when someone else didn’t write the safe pattern.

Security Is an Adversarial Discipline

Software engineering is largely a collaborative discipline. You’re building something. The goal is for it to work. Your mental model of the system is oriented around the happy path — the flow where inputs are valid, networks are reliable, and users do what you expect.

AppSec is adversarial. The mental shift required is genuinely disorienting at first.

When I was building the JWT algorithm none rule, I had to think like someone who wants to forge authentication tokens. Not because I want to do that, but because unless I understand exactly how the attack works — what the attacker controls, what assumptions the vulnerable code makes, what the exploit chain looks like — I can’t write a rule that reliably detects it.

This is the skill that 13 years of software engineering didn’t develop: adversarial thinking. The question isn’t “does this code do what it’s supposed to do?” It’s “how could someone make this code do something it’s not supposed to do?”

The OWASP Top 10 is, at its core, a catalogue of the assumptions developers make that attackers exploit. A03 — Injection exploits the assumption that input is data, not instructions. A07 — Authentication Failures exploits the assumption that the code correctly validates identity. A02 — Cryptographic Failures exploits the assumption that encryption means the data is protected.

Every category is a place where the developer’s mental model of the system diverges from what an attacker can actually do to it. Understanding OWASP deeply means understanding those divergences — not as a checklist, but as a way of thinking.

The lesson: You can’t find vulnerabilities you can’t imagine. Developing adversarial thinking — the habit of asking “how could this go wrong for someone who wants it to go wrong” — is the most important cognitive shift in the AppSec transition.

Tools Are Amplifiers, Not Answers

Before I built my own SAST tool, I used SAST tools. And I treated them roughly like a compiler warning: something fires, I look at it, I decide whether to fix it or ignore it.

Building one changed how I think about what a SAST tool actually is.

A SAST tool is a codified set of heuristics about what vulnerable code looks like. Those heuristics are written by humans, based on human understanding of vulnerability patterns, with human decisions about confidence levels and severity ratings. The tool doesn’t know your codebase. It doesn’t know your threat model. It doesn’t know whether the finding it just generated is actually exploitable in your specific deployment context.

This sounds like a criticism. It isn’t. It’s a description of a tool’s appropriate role.

When I run Snyk or Semgrep now, I engage with the results differently than I did before. I ask: what pattern is this rule trying to catch? Is that pattern present in my code for the reason the rule assumes? Does the vulnerability the rule targets actually apply in my context? What would an attacker need to control to exploit this?

Those are AppSec questions, not DevOps questions. A DevOps mindset treats SAST output as a compliance gate. An AppSec mindset treats it as a starting point for analysis.

The lesson: A SAST scanner is a signal generator, not an oracle. The value it provides is proportional to the quality of thinking applied to its output — not to the number of findings it generates or suppresses.

False Positives Taught Me About Risk Tolerance

Every time I suppressed a finding in my own scanner, I had to make a decision: is this actually safe, and how confident am I?

That turns out to be the central skill of AppSec: structured risk assessment under uncertainty.

You almost never have complete information. You can’t always trace every data flow through a complex system. You can’t always know whether a finding is exploitable without building a proof of concept. You have to make a judgment call about whether the risk is acceptable given what you know.

What I learned from managing false positives is that risk tolerance is not a feeling — it’s a position that needs to be documented and defensible. “I suppressed this because it looked fine” is not a risk assessment. “I suppressed this because the data being processed is always from our internal configuration system and never from user input, as confirmed by tracing the call stack in lines 42–67” is a risk assessment.

The difference matters when something goes wrong. And in security, things go wrong.

The lesson: Risk assessment is a core AppSec competency, not a soft skill. Developing a structured, documented approach to risk decisions — even informal ones — is more valuable than any specific technical knowledge.

The Gap Between Writing Secure Code and Finding Insecure Code

These are related skills. They are not the same skill.

Writing secure code is a constructive activity. You know what you’re building. You apply secure patterns. You follow established conventions. The feedback loop is relatively tight — if you use parameterised queries, you know you’re not vulnerable to SQL injection there.

Finding insecure code is a forensic activity. You’re examining code you didn’t write, often without full context, looking for patterns that indicate vulnerability. The feedback loop is loose — you might flag something, triage it, determine it’s a false positive, and never know whether your triage was correct.

The cognitive skills are different. Construction requires knowing the secure pattern. Detection requires knowing the vulnerable pattern and all its variations. It requires understanding which variations are genuinely dangerous and which are contextually safe. It requires maintaining a mental model of an attacker’s perspective while reading code that was written from a developer’s perspective.

I’ve spent 13 years getting good at construction. Building this scanner was the first systematic exercise I did in detection. It was harder than I expected — not technically, but cognitively. Shifting from “I’m building this thing to work” to “I’m looking for ways this thing could be exploited” is a genuine gear change.

The lesson: AppSec is not “software engineering plus security knowledge.” It’s a different cognitive discipline that happens to use the same raw material. Senior software engineers making this transition should expect a genuine learning curve, not just a knowledge gap.

What I’d Tell Someone Starting This Transition

If you’re a software engineer moving into AppSec — or considering it — here’s what I’d tell you based on this project and the broader transition.

Build something. Reading about OWASP is useful. Reading CVE writeups is useful. Neither teaches you what building a detection rule teaches you. The act of translating “this is a vulnerability” into “this is what the vulnerable code looks like in text” forces a precision of understanding that passive learning doesn’t produce.

Study the attacks, not just the defences. Most of your software engineering career was spent learning defences — secure patterns, safe APIs, frameworks that handle the dangerous parts for you. AppSec requires understanding the attacks those defences are designed against. Read exploit writeups. Understand how CVEs actually work. Build your own vulnerable applications and attack them.

Get comfortable with ambiguity. Software engineering has right answers. Does this code compile? Does this test pass? Does this function return the correct value? AppSec often doesn’t. Is this finding exploitable? Is this suppression justified? Is this risk acceptable? These questions frequently don’t have clean answers, and developing comfort with that ambiguity is part of the transition.

Use your engineering background as a superpower, not a crutch. The thing that makes engineers valuable in AppSec is the ability to read code at scale, understand system architecture, and reason about data flows — skills most pure security professionals develop slowly. Use that. But don’t assume that understanding how the code is supposed to work means you understand how it can be broken.

Write about what you’re learning. This series started as a way to document my own thinking. Every article forced me to be more precise about something I thought I understood. The act of explaining something to someone else reveals the gaps in your own understanding faster than almost anything else.

Where This Goes Next

Building this scanner and writing this series was one project. The transition is ongoing.

The next project is taking an old Java service and doing something I haven’t done yet in this series: running Snyk against a real dependency tree on real legacy code, remediating real CVEs, and measuring the before-and-after security posture with actual metrics.

That’s a different kind of AppSec work — Software Composition Analysis rather than static analysis, dependency vulnerabilities rather than code vulnerabilities, Snyk’s recommendations rather than my own rules. But the underlying skills are the same: understand the attack, assess the risk, make a defensible decision, measure the outcome.

The transition from software engineer to AppSec engineer is not a destination. It’s an ongoing process of developing adversarial thinking, structured risk assessment, and the forensic discipline of finding what’s broken rather than building what works.

Thirteen years in, I’m still learning. That’s the right state to be in.

The full SAST tool that this series was built around is at github.com/pgmpofu/sast-tool.

If this series was useful to you — or if you’re making a similar transition and want to compare notes — I’d genuinely like to hear from you. Find me here on dev.to or connect on LinkedIn.

Python argparse: Build CLI Tools in 10 Minutes


🎁 Free: AI Publishing Checklist — 7 steps in Python · Full pipeline: germy5.gumroad.com/l/xhxkzz (pay what you want, min $9.99)

The Problem with sys.argv[1]

You’ve been there. You write a quick script, hardcode a filename, then immediately need to change it. So you reach for sys.argv:

import sys

filename = sys.argv[1]
count = int(sys.argv[2])

This works — until it doesn’t. Run it without arguments and you get an IndexError. Pass a string where you expected an integer and it crashes. There’s no help text, no validation, no defaults. Anyone else who picks up your script has to read the source code to know how to run it.

argparse solves all of this. It’s in the standard library, requires no installation, and turns your script into a proper CLI tool in minutes.

The Basics: ArgumentParser

Every argparse script starts with a parser:

import argparse

parser = argparse.ArgumentParser(
    description="My CLI tool — does useful things."
)
args = parser.parse_args()

That one call to parse_args() handles everything: reading sys.argv, validating inputs, and printing help when the user passes --help.

Positional Arguments

Positional arguments are required and identified by position, not name:

parser.add_argument("filename", help="Path to the input file")
parser.add_argument("count", help="Number of items to process")

Optional Arguments (--flag and -f)

Optional arguments use -- prefix and can have short aliases:

parser.add_argument("--output", "-o", help="Output file path", default="output.txt")
parser.add_argument("--verbose", "-v", help="Enable verbose logging", action="store_true")

Type Validation: No More Manual Casting

Instead of int(sys.argv[1]) wrapped in a try/except, let argparse handle it:

parser.add_argument("--count", type=int, default=10, help="Number of items")
parser.add_argument("--rate", type=float, default=1.5, help="Processing rate")
parser.add_argument(
    "--format",
    choices=["json", "csv", "txt"],
    default="json",
    help="Output format"
)

If a user passes --count hello, argparse prints a clean error message and exits — no stack trace, no confusion.

Required Arguments, nargs, and Lists

Required Optional Arguments

parser.add_argument("--title", required=True, help="Article title (required)")

Accepting Multiple Values

# One or more values: --tags python beginner tutorial
parser.add_argument("--tags", nargs="+", help="One or more tags")

# Zero or more values: --tags (empty is fine)
parser.add_argument("--tags", nargs="*", help="Zero or more tags")

The result is a Python list you can iterate directly:

args = parser.parse_args()
for tag in args.tags:
    print(tag)

Boolean Flags: store_true and store_false

Boolean flags don’t take a value — their presence or absence is the value:

parser.add_argument("--dry-run", action="store_true", help="Simulate without writing")
parser.add_argument("--no-color", action="store_false", dest="color", help="Disable color output")

Usage:

python publish.py --dry-run        # args.dry_run is True
python publish.py                  # args.dry_run is False
python publish.py --no-color       # args.color is False

Subcommands: One Tool, Many Commands

Real CLI tools like git, docker, and pip use subcommands. add_subparsers() gives you the same structure.

parser = argparse.ArgumentParser(description="Publish queue manager")
subparsers = parser.add_subparsers(dest="command", required=True)

# `publish` subcommand
publish_parser = subparsers.add_parser("publish", help="Publish the next article in queue")
publish_parser.add_argument("--dry-run", action="store_true", help="Simulate without publishing")

# `list` subcommand
list_parser = subparsers.add_parser("list", help="Show the publish queue")
list_parser.add_argument("--format", choices=["table", "json"], default="table")

Now args.command tells you which subcommand was chosen, and each subcommand has its own arguments.

The --verbose / -v Pattern

A common pattern is using --verbose to set the logging level at runtime:

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--verbose", "-v", action="store_true", help="Enable debug logging")
args = parser.parse_args()

logging.basicConfig(
    level=logging.DEBUG if args.verbose else logging.INFO,
    format="%(levelname)s: %(message)s"
)

log = logging.getLogger(__name__)
log.info("Starting...")
log.debug("This only shows with --verbose")

Complete Example: Publish Queue CLI

Here’s a working CLI for managing an article publish queue — the same pattern used in the full pipeline.

#!/usr/bin/env python3
"""
publish_queue.py — CLI for managing the article publish queue.
Usage: python publish_queue.py <command> [options]
"""

import argparse
import json
import logging
import sys
from pathlib import Path

QUEUE_FILE = Path("queue.json")


def load_queue() -> list[dict]:
    if not QUEUE_FILE.exists():
        return []
    return json.loads(QUEUE_FILE.read_text())


def save_queue(queue: list[dict]) -> None:
    QUEUE_FILE.write_text(json.dumps(queue, indent=2))


def cmd_list(args: argparse.Namespace) -> None:
    queue = load_queue()
    if not queue:
        print("Queue is empty.")
        return
    for i, article in enumerate(queue, 1):
        status = "[published]" if article.get("published") else "[pending]  "
        print(f"{i}. {status} {article['title']} ({', '.join(article.get('tags', []))})")


def cmd_add(args: argparse.Namespace) -> None:
    queue = load_queue()
    article = {
        "title": args.title,
        "tags": args.tags or [],
        "published": False,
    }
    queue.append(article)
    save_queue(queue)
    logging.info("Added: %s", args.title)
    print(f"Added '{args.title}' to queue. Total: {len(queue)} articles.")


def cmd_publish(args: argparse.Namespace) -> None:
    queue = load_queue()
    pending = [a for a in queue if not a.get("published")]
    if not pending:
        print("No pending articles.")
        return
    next_article = pending[0]
    if args.dry_run:
        print(f"[DRY RUN] Would publish: {next_article['title']}")
        return
    next_article["published"] = True
    save_queue(queue)
    print(f"Published: {next_article['title']}")
    logging.info("Published: %s", next_article["title"])


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="publish_queue",
        description="Manage your article publish queue.",
    )
    parser.add_argument(
        "--verbose", "-v",
        action="store_true",
        help="Enable debug logging",
    )

    subparsers = parser.add_subparsers(dest="command", required=True)

    # list
    list_parser = subparsers.add_parser("list", help="Show the publish queue")
    list_parser.set_defaults(func=cmd_list)

    # add
    add_parser = subparsers.add_parser("add", help="Add an article to the queue")
    add_parser.add_argument("--title", required=True, help="Article title")
    add_parser.add_argument("--tags", nargs="*", help="Tags for the article")
    add_parser.set_defaults(func=cmd_add)

    # publish
    publish_parser = subparsers.add_parser("publish", help="Publish the next pending article")
    publish_parser.add_argument("--dry-run", action="store_true", help="Simulate without writing")
    publish_parser.set_defaults(func=cmd_publish)

    return parser


def main() -> None:
    parser = build_parser()
    args = parser.parse_args()

    logging.basicConfig(
        level=logging.DEBUG if args.verbose else logging.INFO,
        format="%(levelname)s: %(message)s",
    )

    args.func(args)


if __name__ == "__main__":
    main()

--help Output

$ python publish_queue.py --help
usage: publish_queue [-h] [--verbose] {list,add,publish} ...

Manage your article publish queue.

positional arguments:
  {list,add,publish}
    list              Show the publish queue
    add               Add an article to the queue
    publish           Publish the next pending article

options:
  -h, --help          show this help message and exit
  --verbose, -v       Enable debug logging

$ python publish_queue.py add --help
usage: publish_queue add [-h] --title TITLE [--tags [TAGS ...]]

options:
  -h, --help           show this help message and exit
  --title TITLE        Article title
  --tags [TAGS ...]    Tags for the article

Running It

# Add articles to the queue
python publish_queue.py add --title "Python argparse guide" --tags python beginners tutorial
python publish_queue.py add --title "Automate your workflow" --tags python automation

# List the queue
python publish_queue.py list
# 1. [pending]   Python argparse guide (python, beginners, tutorial)
# 2. [pending]   Automate your workflow (python, automation)

# Publish next (dry run first)
python publish_queue.py publish --dry-run
# [DRY RUN] Would publish: Python argparse guide

python publish_queue.py publish
# Published: Python argparse guide

# Check updated queue with debug logging
python publish_queue.py list --verbose

Key Patterns to Remember

| Pattern | When to use it |
|---|---|
| type=int / type=float | Any numeric input |
| choices=[...] | Fixed set of valid values |
| required=True | Mandatory optional args |
| nargs="+" / nargs="*" | Lists of values |
| action="store_true" | Boolean flags |
| add_subparsers() | Multi-command tools |
| set_defaults(func=...) | Dispatch to subcommand functions |

What You Get for Free

Every argparse-based script automatically has:

  • --help / -h — generated from your help= strings
  • Type validation — with clear error messages, no tracebacks
  • Default values — documented in the help output
  • Usage line — auto-generated from your argument definitions

No third-party libraries. No pip install. Just the standard library.

The publish queue CLI in the full pipeline uses argparse for its list, add, and publish commands: germy5.gumroad.com/l/xhxkzz — pay what you want, min $9.99.

Further Reading

  • Your First Automated Python Script That Validates and Runs Itself
  • Python logging: Stop Using print() in Your Automation Scripts
  • How to Schedule Python Scripts with Cron: A Beginner’s Complete Guide

Build your own AI-powered Voice To-Do Assistant using a Waveshare 1.75″ display + Cursor + DuckyClaw — from setup to full feature implementation

As a developer, I recently built a custom voice-enabled to-do assistant using the Waveshare 1.75″ display, Cursor IDE, and DuckyClaw framework. This guide breaks down my step-by-step implementation, with practical tips and pitfalls to avoid—no fluff, just actionable steps for fellow makers. No advanced embedded experience is needed, but basic familiarity with Git and hardware flashing will help.

🧭 Step-by-step Implementation Guide
Step 1 – Clone the DuckyClaw repo

  1. Navigate to the DuckyClaw official documentation and locate the Waveshare dev board quick start section.
  2. Find the “Clone the repo” step, copy the official repository URL (https://github.com/tuya/DuckyClaw.git).
  3. Open Cursor IDE, use the built-in Git integration to clone the repo. Cursor automatically installs required dependencies, eliminating manual package management—this saves time and avoids version conflicts.

Step 2 – Install TuyaOpen Dev Skills (workflow)

  1. Visit the TuyaOpen website and navigate to the developer tools section to find the TuyaOpen Dev Skills workflow installation prompt.
  2. Copy the exact prompt provided (it’s tailored for DuckyClaw integration) and paste it into the Cursor chat panel.
  3. The workflow installs automatically, establishing a direct connection between your project and TuyaOpen’s SDK—critical for accessing cloud services and hardware drivers later.

Step 3 – Create product & get credentials (PID / UUID / AuthKey)

  1. Follow the DuckyClaw quick start guide to create a new product on the Tuya Developer Platform (select “AI Agent” as the product type for seamless DuckyClaw integration).
  2. From the product dashboard, retrieve your Product ID (PID)—this identifies your custom device in the Tuya ecosystem.
  3. Navigate to the “Hardware Development” tab to download your UUID and AuthKey. These credentials are non-negotiable—store them securely, as they authenticate your board with Tuya Cloud and DuckyClaw.

Step 4 – Build & flash with Cursor

  1. In Cursor, use this precise prompt to ensure proper compilation and flashing:
    Build and flash DuckyClaw firmware for Waveshare 1.75″ display, using the PID, UUID, and AuthKey I retrieved from Tuya Developer Platform.
  2. Cursor detects your connected Waveshare board automatically, compiles the firmware with your credentials, and flashes it—no manual CLI commands or makefiles required. I tested this with three different Waveshare boards, and it worked consistently.

Step 5 – Activate in Smart Life app

  1. Download the Smart Life app (iOS/Android) and create an account if you don’t already have one.
  2. Follow the app’s “Add Device” flow to complete Wi-Fi provisioning—ensure your phone and Waveshare board are on the same Wi-Fi network for a smooth pairing process.
  3. Complete the pairing and activation steps. Once done, your board is connected to Tuya Cloud and ready to interact with DuckyClaw.

Step 6 – Add To-Do List feature
To implement the to-do functionality, I used Cursor to generate and integrate the code with DuckyClaw’s skill system. Use this specific prompt to avoid missing key features:
    Implement a To-Do system for DuckyClaw + Waveshare 1.75″ display: Swipe left to access To-Do List, swipe right for Scheduled tasks, UI styled after Apple Reminders, and smooth scrolling using the lv_example_scroll_6 component. Integrate with DuckyClaw’s CRON skill for task scheduling and heartbeat skill for reminders.

Cursor generates clean, framework-compatible code—review it briefly to ensure display dimensions match the 1.75″ screen, then adjust any UI elements if needed.

Step 7 – Build & flash again
Re-run the build and flash process in Cursor (use the same prompt as Step 4) to push the to-do feature to your board. The flash process takes 30-60 seconds—do not disconnect the board during this time. I recommend testing the UI immediately after flashing to catch any display alignment issues early.

Step 8 – Final Testing & Debugging
After flashing, test all core features to ensure stability. Here’s what to verify:
  • 🎙️ Voice input: Test DuckyClaw’s hardware ASR (ensure your board has a built-in mic or external mic connected) – it should recognize voice commands to add to-dos.
  • ✅ To-Do management: Add, edit, and mark tasks as complete—verify UI responsiveness and swipe navigation.
  • ⏰ Scheduled tasks: Set a test reminder to confirm the CRON skill triggers notifications (check the display and any connected speaker).
  • 📱 Display functionality: Ensure smooth scrolling and no UI glitches on the 1.75″ screen.
If you encounter issues, check the Cursor output log for compilation errors or the Tuya Developer Platform for device connection status.

💡 Developer Notes & Key Takeaways
This project is a practical example of combining AI, IoT, and low-code development to build a useful hardware product. Here’s what I learned during implementation:

  • DuckyClaw’s TuyaOpen foundation simplifies hardware integration—its built-in drivers for displays and ASR save hours of custom coding.
  • Cursor’s low-code approach accelerates feature development, but always review generated code to ensure compatibility with DuckyClaw’s skill system.
  • Credential management is critical—never hardcode PID/UUID/AuthKey in public repos; use DuckyClaw’s config files for secure storage.
  • Extensibility is a strong point: you can easily add more features (e.g., IoT device control, voice TTS) using DuckyClaw’s modular skills.

🔗 Resources & Contribution
Official Docs: Step-by-step hardware setup, SDK guides, and skill development tutorials — https://tuyaopen.ai/duckyclaw

GitHub Repo: GitHub – tuya/DuckyClaw: Edge-Hardware (SoC/MCU) oriented Claw🦞 (check the TODOs.md for upcoming features)

Discord Community: https://discord.com/invite/yPPShSTttG

If you build this project, share your tweaks and improvements—I’d love to see how fellow developers extend the to-do functionality or integrate additional DuckyClaw skills. Feel free to drop a comment with questions or your build details! 🦆✨

The Hidden 43% — How Teams Are Wasting Almost Half Their LLM API Budget

You look at your provider dashboard and see one number: the total bill. It’s like getting an electricity bill that just says “$5,000” with no breakdown of whether it was the AC, the fridge, or someone leaving the lights on all month.

tbh, most AI startups are flying blind right now. We recently looked into the cost breakdown for several teams and found something crazy: almost 43% of LLM API spend is completely wasted. It’s not about paying for usage; it’s about paying for bad architecture.

Here’s where the leaks are actually happening:

  1. Retry Storms (34% of waste)
    Your agent fails to parse a JSON response, so it retries. And retries. Sometimes 5-10 times in a loop. You aren’t just paying for the failure; you are paying for the massive context window sent every single time.

  2. Duplicate Calls (85% of apps have this issue)
    Multiple users asking the exact same question, or internal systems running the same RAG pipeline on the same document. Without caching at the provider level, you’re paying OpenAI to generate the identical tokens twice. (A minimal caching sketch follows this list.)

  3. Context Bloat
    Sending the entire 50-page document history when the user just asked “what’s the summary of page 2”. RAG is great, but shoving everything into the prompt “just in case” is burning your runway.

  4. Wrong Model Selection
    Using GPT-4o or Claude 3 Opus for simple classification tasks when Haiku or GPT-3.5-turbo would do it for a fraction of the cost.
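
As a minimal illustration of the caching point, here is a sketch of an in-process cache keyed by a hash of the model and messages. The call_llm function is a placeholder for your actual provider call; a real system would also want TTLs and size limits.

import hashlib
import json

_cache = {}

def cached_completion(call_llm, model, messages):
    """Return a cached response for identical (model, messages) pairs; call_llm is a placeholder."""
    key = hashlib.sha256(json.dumps([model, messages], sort_keys=True).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(model=model, messages=messages)  # only pay for the first call
    return _cache[key]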

You can’t fix what you can’t see. That’s exactly why I built LLMeter (https://llmeter.org?utm_source=devto&utm_medium=article&utm_campaign=hidden-43-percent-llm-waste). It’s an open-source dashboard that gives you per-customer and per-model cost tracking. Stop guessing who or what is draining your API budget.

Fwiw, just setting up basic budget alerts and seeing the breakdown by tenant usually drops a team’s bill by 20% in the first week. Give it a try, it’s open source (AGPL-3.0) and you can self-host or use the free tier.

Building a Multi-Agent Fleet with No Central Server

Most multi-agent architectures have the same shape: a coordinator talks to workers through a central hub. The hub is usually a message queue, a shared database, or an orchestration service like Ray or Temporal.

That hub is also the first thing that breaks. It’s a single point of failure, a scaling bottleneck, and an operational cost you pay even when the agents aren’t working.

Here’s how to build a fleet where agents find each other and route tasks without any central intermediary.

The Central Hub Problem

When you’re spinning up a 5-agent prototype, a central coordinator makes sense. It’s simple, debuggable, and gets out of your way.

At 50 agents it starts to fray. At 500 it becomes your hardest reliability problem.

The hub becomes a global lock. Every message goes through it. Every failure cascades through it. Every scaling decision has to account for it.

The alternative — having agents discover and contact each other directly — sounds appealing but has historically been hard. How does Agent A know Agent B’s address? How do you handle NAT traversal? How do you authenticate the connection?

These are solved problems in networking. We just haven’t applied the solutions to agents until now.

Peer-to-Peer at the Session Layer

Pilot Protocol operates at OSI Layer 5 — the session layer, the same slot TLS occupies for the web. It gives each agent:

  • A permanent 48-bit address (0:A91F.0000.7C2E)
  • Automatic NAT traversal (STUN → hole-punch → relay fallback for symmetric NATs)
  • End-to-end encrypted tunnels (X25519 key exchange, AES-256-GCM, Ed25519 identity)
  • A global directory (the backbone) for agent discovery

With Pilot, the hub isn’t a server you run. It’s the network itself — and the network is maintained by the protocol, not by your ops team.

A Fleet Pattern That Actually Works

Here’s a concrete pattern for a research fleet:

Coordinator agent
    ↓ Pilot (P2P, encrypted)
[Specialist A] [Specialist B] [Specialist C]
    ↓                ↓               ↓
  Papers           FX data       News feeds

Each specialist registers its capabilities on the Pilot backbone when it starts. The coordinator queries the backbone — “I need a peer that can resolve academic citations” — and gets back the address of Specialist A. Direct connection from there.

No service registry you maintain. No hardcoded addresses. No configuration file you update when a worker moves.
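
To make the discovery flow concrete, here is a toy, in-memory sketch of the register-then-query pattern. It is purely illustrative: the Registry class and the second address are hypothetical, and this is not the Pilot API; on Pilot, the backbone plays this role for you.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Registry:
    """Toy stand-in for the backbone: maps a capability to the addresses that offer it."""
    peers: dict = field(default_factory=dict)

    def register(self, address: str, capabilities: list) -> None:
        for cap in capabilities:
            self.peers.setdefault(cap, []).append(address)

    def find(self, capability: str) -> Optional[str]:
        candidates = self.peers.get(capability, [])
        return candidates[0] if candidates else None

backbone = Registry()

# Specialists announce what they can do when they start up.
backbone.register("0:4B2E.0000.1A3D", ["resolve-citations", "fetch-papers"])
backbone.register("0:7F10.0000.9C22", ["fx-rates"])

# The coordinator asks for a capability and gets back an address to dial directly.
print(backbone.find("resolve-citations"))  # 0:4B2E.0000.1A3D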

The Code

Getting an agent online:

curl -fsSL https://pilotprotocol.network/install.sh | sh
pilotctl daemon start --hostname coordinator

That’s it. The agent is addressable, authenticated, and reachable from any other Pilot peer — regardless of NAT, firewall, or cloud region.

For the specialists:

# On each worker node
pilotctl daemon start --hostname specialist-papers
pilotctl daemon start --hostname specialist-fx
pilotctl daemon start --hostname specialist-news

Each one joins the backbone automatically. The coordinator can ping them:

pilotctl ping specialist-papers
# ✓ reply from 0:4B2E.0000.1A3D · 22ms

Self-Organization: How Groups Work

Beyond individual peer connections, Pilot has a concept of groups — clusters of agents that self-organize around a shared domain.

A trading fleet might form a TRADING group. A research fleet might join RESEARCH. Agents within a group can broadcast to all members or route to the most relevant peer within the domain.

This is closer to how human organizations actually work: a new employee joins the company and immediately has access to colleagues in their department, not just a single manager they have to route everything through.

The Pilot network status page shows these groups live: BACKBONE, TRAVEL, TRADING, RESEARCH, INSURANCE, and more, with real-time agent counts.

What You Give Up

Centralized orchestration isn’t all downside. You give up some things going P2P:

Observability. A central hub is easy to instrument. A P2P mesh requires distributed tracing from day one. Plan for this.

Debuggability. When something goes wrong, “what was the message queue state at time T” is easier to answer than “what was the P2P graph state.” Log aggressively at the agent level.

Simplicity. For a 3-agent prototype, a coordinator is simpler. P2P earns its complexity at scale.

When to Switch

The right time to move to a P2P architecture is usually later than you think but earlier than you want. Signals that you’re ready:

  • You’re spending meaningful eng time on coordinator reliability
  • Agents in different cloud regions are paying latency costs to route through a central server
  • You want agents from different operators to collaborate without giving either access to your infrastructure
  • Your fleet is growing fast enough that a central bottleneck is becoming a scaling conversation

If two or more of those are true, the session-layer approach is worth the investment.

Further Reading

  • Pilot Protocol documentation — addressing, groups, NAT traversal
  • Multi-agent setups on Pilot — pre-wired fleet configurations
  • The IETF Internet-Draft — the protocol spec if you want to go deep

The network is live: ~163,000 agents, 12.7B+ requests routed, +28% growth in the past week.

One line to get started: curl -fsSL https://pilotprotocol.network/install.sh | sh

Why AI Agents Fail: 3 Failure Modes That Cost Tokens and Time

AI agents don't fail the way traditional software does: they don't crash with a stack trace. They fail silently: they return incomplete answers, freeze on slow APIs, or burn tokens calling the same tool over and over. The agent looks like it's working, but the output is wrong, late, or expensive.

This series covers the three most common failure modes, with research-backed fixes. Each technique has a runnable demo that measures the before/after difference.

Working code: github.com/aws-samples/sample-why-agents-fail

The demos use Strands Agents with OpenAI (GPT-4o-mini). The patterns are framework-agnostic: they apply to LangGraph, AutoGen, CrewAI, or any framework that supports tool calling and lifecycle hooks.

This Series: 3 Essential Solutions

  1. Context Window Overflow: the Memory Pointer Pattern for large data
  2. MCP Tools That Never Respond: the async handleId pattern for slow external APIs
  3. Reasoning Loops in AI Agents: DebounceHook plus clear tool states to block repeated calls

What Happens When Tool Outputs Overflow the Context Window?

Context window overflow happens when a tool returns more data than the LLM can process: server logs, database results, or file contents that exceed the token limit. The agent doesn't fail with an error. It degrades silently: it truncates data, loses context, or produces incomplete answers.

IBM research quantifies this: a Materials Science workflow consumed 20 million tokens and failed. The same workflow with memory pointers used 1,234 tokens and succeeded.

[Figure: an AI agent without the Memory Pointer Pattern versus with it, showing how large data stays outside the context window]

The fix, the Memory Pointer Pattern: store large data in agent.state and return a short pointer into the context. The next tool resolves the pointer to access the full data:

from strands import tool, ToolContext

@tool(context=True)
def fetch_application_logs(app_name: str, tool_context: ToolContext, hours: int = 24) -> str:
    """Fetches logs. Stores large data behind a pointer to avoid context overflow."""
    logs = generate_logs(app_name, hours)  # Demo helper; could return 200KB+ of log records

    if len(str(logs)) > 20_000:
        pointer = f"logs-{app_name}"
        tool_context.agent.state.set(pointer, logs)
        return f"Data stored as pointer '{pointer}'. Use the analysis tools to query it."
    return str(logs)

@tool(context=True)
def analyze_error_patterns(data_pointer: str, tool_context: ToolContext) -> str:
    """Analyzes errors by resolving the pointer from agent.state."""
    data = tool_context.agent.state.get(data_pointer)
    errors = [e for e in data if e["level"] == "ERROR"]
    return f"Found {len(errors)} errors across {len(set(e['service'] for e in errors))} services"

The LLM never sees the 200KB: it only sees "Data stored as pointer 'logs-payment-service'" (52 bytes).

Why Strands Agents? The ToolContext API exposes agent.state as a native key-value store scoped to each agent: no global dictionaries, no external infrastructure. For multi-agent flows, invocation_state shares data between agents in a Swarm through the same API.
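
A quick usage sketch of how the two tools fit together (assuming the standard Strands Agent constructor; the prompt is illustrative):

from strands import Agent

# Wire both tools into one agent so the pointer written by fetch_application_logs
# can be resolved later by analyze_error_patterns.
agent = Agent(tools=[fetch_application_logs, analyze_error_patterns])
agent("Fetch the last 24h of payment-service logs and summarize the error patterns.")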

Metric              Without pointers       With Memory Pointers
Data in context     214KB (full logs)      52 bytes (pointer)
Agent behavior      Truncates or fails     Processes all the data
Errors detected     Partial                Complete

[Bar chart: token usage across the different context-management strategies]

Full demo: 01-context-overflow-demo, with single-agent and multi-agent (Swarm) implementations and notebooks.

Why Do AI Agents Freeze When Calling External APIs?

AI agents freeze when MCP tools call slow or unresponsive external APIs. The agent blocks on the tool call, the user sees no progress, and after 7 seconds many implementations return a 424 error. MCP (Model Context Protocol) gives agents the ability to call external tools, but it doesn't handle timeouts or retries by default.

[Figure: a synchronous MCP tool call, with the agent blocked while it waits on a slow API]

The fix, the async handleId pattern: the tool returns a job ID immediately, and the agent polls a separate check_status tool:

import asyncio
import uuid

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("timeout-demo")
JOBS = {}

async def _process_job(job_id: str) -> None:
    """Minimal stand-in for the demo's background worker: simulates a slow external API."""
    await asyncio.sleep(15)
    JOBS[job_id].update(status="completed", result=f"Done: {JOBS[job_id]['task']}")

@mcp.tool()
async def start_long_job(task: str) -> str:
    """Returns a handle immediately, which prevents the timeout."""
    job_id = str(uuid.uuid4())[:8]
    JOBS[job_id] = {"status": "processing", "task": task}
    asyncio.create_task(_process_job(job_id))  # Work continues in the background
    return f"Job started. Handle: {job_id}. Use check_job_status to poll it."

@mcp.tool()
async def check_job_status(job_id: str) -> str:
    """Polls the job status and returns 'processing' or 'completed' with the result."""
    job = JOBS.get(job_id)
    if not job:
        return f"FAILED: Job '{job_id}' not found"
    return f"{job['status'].upper()}: {job.get('result', 'Still processing...')}"

Scenario            Response time              UX
Fast API (1s)       3s total                   OK
Slow API (15s)      18s blocked                Agent frozen
Failing API         424 error after 7s         Agent fails
Async handleId      ~4s (immediate + poll)     Agent responds

[Figure: timeline visualization of the four MCP response patterns]

Why Strands Agents? The MCPClient connects to any MCP server. The agent discovers tools at runtime via list_tools_sync(): no hardcoded tool list. When the MCP server implements the async pattern, the agent polls automatically, with no extra orchestration code.
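
A rough wiring sketch, assuming a streamable-HTTP MCP transport and the MCPClient pattern from the Strands MCP docs (the localhost URL is a placeholder for the demo server):

from mcp.client.streamable_http import streamablehttp_client
from strands import Agent
from strands.tools.mcp import MCPClient

# Connect to the MCP server and let the agent discover its tools at runtime.
mcp_client = MCPClient(lambda: streamablehttp_client("http://localhost:8000/mcp"))

with mcp_client:
    tools = mcp_client.list_tools_sync()   # no hardcoded tool list
    agent = Agent(tools=tools)
    agent("Start the long job and report back when it completes.")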

Full demo: 02-mcp-timeout-demo, a local MCP server covering all 4 scenarios, plus a notebook.

Why Do AI Agents Repeat the Same Tool Call?

Reasoning loops happen when an agent calls the same tool repeatedly with identical parameters and makes no progress. The root cause is ambiguous tool feedback: responses like "there may be more results available" make the agent think another call will produce better results. Research shows agents can loop hundreds of times without delivering an answer.

[Figure: how ambiguous tool feedback causes loops versus how clear states and a DebounceHook prevent them]

Fix 1, clear terminal states: tools return an explicit SUCCESS or FAILED instead of an ambiguous message:

# Ambiguous (causes loops)
return f"Flights found: {results}. There may be more results available."

# Clear (the agent stops)
return f"SUCCESS: Flight {conf_id} booked for {passenger}. Confirmation sent."

Fix 2, DebounceHook: detect and block duplicate tool calls at the framework level:

import json

from strands.hooks.registry import HookProvider, HookRegistry
from strands.hooks.events import BeforeToolCallEvent

class DebounceHook(HookProvider):
    """Blocks duplicate tool calls within a sliding window."""
    def __init__(self, window_size=3):
        self.call_history = []
        self.window_size = window_size

    def register_hooks(self, registry: HookRegistry) -> None:
        registry.add_callback(BeforeToolCallEvent, self.check_duplicate)

    def check_duplicate(self, event: BeforeToolCallEvent) -> None:
        key = (event.tool_use["name"], json.dumps(event.tool_use.get("input", {})))
        if self.call_history.count(key) >= 2:
            event.cancel_tool = f"BLOCKED: duplicate call to {event.tool_use['name']}"
        self.call_history.append(key)
        self.call_history = self.call_history[-self.window_size:]

Strategy                         Tool calls              Outcome
Ambiguous feedback (baseline)    14 calls                No definitive answer
DebounceHook                     12 calls (2 blocked)    Completes, with blocks
Clear SUCCESS states             2 calls                 Completes immediately

[Bar chart: tool calls under the different strategies]

Why Strands Agents? The HookProvider API intercepts tool calls via BeforeToolCallEvent before they execute. Setting event.cancel_tool blocks execution at the framework level: the LLM cannot bypass it. This makes hooks composable, so you can stack DebounceHook, LimitToolCounts, and custom validators on the same agent.
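
A short usage sketch for attaching the hook (assuming the Agent constructor accepts HookProvider instances via a hooks list, per the Strands hooks docs; the flight tools and prompt are hypothetical):

from strands import Agent

# Stack DebounceHook alongside the agent's tools; add more hooks as needed.
agent = Agent(
    tools=[search_flights, book_flight],
    hooks=[DebounceHook(window_size=3)],
)
agent("Book the cheapest flight from MAD to SCL for tomorrow.")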

Full demo: 03-reasoning-loops-demo, all 4 scenarios with hooks and a notebook.

Prerequisites

You'll need Python 3.9+, uv (a fast Python package manager), and an OpenAI API key.

git clone https://github.com/aws-samples/sample-why-agents-fail
cd sample-why-agents-fail/stop-ai-agents-wasting-tokens

# Pick any demo
cd 01-context-overflow-demo   # or 02-mcp-timeout-demo, 03-reasoning-loops-demo
uv venv && uv pip install -r requirements.txt
export OPENAI_API_KEY="your-key-here"

uv run python test_*.py

Each demo is self-contained, with its own dependencies, test script, and Jupyter notebook.

Frequently Asked Questions

What are the most common failure modes in AI agents?

The three most common failure modes are context window overflow (a tool returns more data than the LLM can process), MCP tool timeouts (external APIs block the agent indefinitely), and reasoning loops (the agent repeats the same tool call without making progress). Each of them wastes tokens and degrades answer quality.

How do I reduce an AI agent's token costs?

The two most effective techniques are memory pointers and clear tool states. The Memory Pointer Pattern stores large tool outputs in external state and passes short references into the LLM context, cutting token usage from over 200KB to under 100 bytes per tool call. Clear terminal states (SUCCESS/FAILED) in tool responses keep the agent from retrying completed operations, which can cut tool calls from 14 down to 2.

Can I use these patterns with frameworks other than Strands Agents?

Yes. The Memory Pointer Pattern works with any framework that supports tool context (passing state between tools). The async handleId pattern is an MCP server design pattern, so it works with any MCP-compatible agent. DebounceHook requires lifecycle hooks, which are available in LangGraph, AutoGen, and CrewAI under different APIs.

References

Research

  • Solving Context Window Overflow in AI Agents — IBM Research, Nov 2025
  • Towards Effective GenAI Multi-Agent Collaboration — Amazon, Dec 2024
  • Resilient AI Agents With MCP — Octopus, May 2025
  • Language models can overthink — The Decoder, Jan 2025

Implementation

  • Strands Agent State — ToolContext and agent.state
  • Strands MCP Tools — Connect any MCP server
  • Strands Hooks — Lifecycle events and tool cancellation

Which failure mode have you run into with your agents? Share it in the comments.

Thanks!
