AI Can Write Your Code. But It Can’t Design Your System.

Posted May 2, 2026 by DevegygiebyOL

We are living in the golden age of developer productivity. With tools like Copilot and ChatGPT, you can generate hundreds of lines of boilerplate and complex API endpoints in seconds.

It feels like magic. But there is a hidden danger lurking behind that flashing cursor: If you don’t possess foundational architectural knowledge, AI will just help you build a Big Ball of Mud faster than ever before.

The “Junior Developer on Steroids”

Think of AI as the most enthusiastic, tireless, and blisteringly fast Junior Developer you’ve ever managed. It knows the syntax of every language perfectly.

But it has a fatal flaw: It defaults to the easiest path, not the right one.

If you prompt an AI to “write a function to process a user order,” it will happily give you a massive, 300-line controller method. It will hard-code the database connection, mix in the business validation, trigger a third-party payment API synchronously, and tightly couple the entire thing together.

The code will compile. The tests might even pass. But architecturally? It is a ticking time bomb.

Why Foundational Knowledge is Your Superpower

The developers who will thrive in the AI era are not the ones who can type the fastest. The future belongs to the Clarity Engineers—the developers who understand system design, tradeoffs, and architectural boundaries.

When you know software architecture, your relationship with AI completely changes. Instead of accepting its first messy draft, an architected prompt looks like this:

“Write a service class to process user orders. Ensure the core business logic is decoupled from the database using Hexagonal Architecture (Ports and Adapters). The payment processing must not be synchronous; instead, publish a domain event to a message broker so we achieve temporal decoupling.”

Suddenly, the AI isn’t just writing code. It is executing your blueprint.
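To make that blueprint concrete, here is a minimal Go sketch of what such a prompt is asking for. Every name in it (Order, OrderRepository, EventPublisher, OrderService, the "order.placed" topic) is a hypothetical placeholder chosen for illustration, and the in-memory adapters stand in for a real database and message broker.

```go
package main

import (
	"errors"
	"fmt"
)

// Order is a minimal domain type for this sketch.
type Order struct {
	ID    string
	Total int // amount in cents
}

// OrderRepository is a port: the core depends on this interface, not on a database.
type OrderRepository interface {
	Save(o Order) error
}

// EventPublisher is a port for handing domain events to a message broker.
type EventPublisher interface {
	Publish(topic string, o Order) error
}

// ErrInvalidOrder is returned when business validation fails.
var ErrInvalidOrder = errors.New("order total must be positive")

// OrderService holds the business logic and knows only the two ports.
type OrderService struct {
	repo   OrderRepository
	events EventPublisher
}

// Process validates and stores the order, then publishes an event
// instead of calling the payment provider synchronously.
func (s *OrderService) Process(o Order) error {
	if o.Total <= 0 {
		return ErrInvalidOrder
	}
	if err := s.repo.Save(o); err != nil {
		return fmt.Errorf("save order: %w", err)
	}
	return s.events.Publish("order.placed", o)
}

// memRepo and memBus are in-memory adapters used only for this demo.
type memRepo struct{ saved []Order }

func (r *memRepo) Save(o Order) error { r.saved = append(r.saved, o); return nil }

type memBus struct{ topics []string }

func (b *memBus) Publish(topic string, _ Order) error {
	b.topics = append(b.topics, topic)
	return nil
}

func main() {
	repo, bus := &memRepo{}, &memBus{}
	svc := &OrderService{repo: repo, events: bus}
	err := svc.Process(Order{ID: "o-1", Total: 1999})
	fmt.Println(err, len(repo.saved), bus.topics)
}
```

Because OrderService only sees interfaces, swapping the in-memory adapters for a SQL repository or a broker client would not touch the business logic, which is the boundary the prompt describes.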

The Takeaway

AI isn’t going to replace software architects. It is going to make them 10x more powerful. But to wield that power, you need to know the rules of the game so you can instruct the AI on how to play it.

My new book, Grokking Software Architecture (published by Manning Publications Co.), is the practical, conversational guide I wish I’d had when I started my journey nearly two decades ago. It’s fun, engaging, and filled with information you can start using on day one in a new job, or today in your current one.

Don’t just accept the code the AI hands you. Learn how to hand the AI a blueprint.

Grab your Early Access (MEAP) copy at 🔥 50% OFF today during Manning’s Sitewide Sale: http://hubs.la/Q03-d27Y0

Let’s build systems that last.

Deploying a Web Page on Amazon EC2 with Nginx

Posted May 2, 2026 by DevegygiebyOL

Creating and deploying an instance on Amazon EC2

Have you ever wondered how cloud servers work, or how you can publish your own web page on the internet without owning a physical server?

In this lab I will walk you step by step through creating an instance on Amazon EC2, explaining each of the required settings clearly so that you can understand and carry out the process without complications.

We will not stop at theory, either: we will use Nginx to deploy a real website and learn how to customize it with our own content, making it available from anywhere.

Step 1: Access Amazon EC2

To start launching an instance on Amazon EC2, we go to the search bar in the AWS console and type “EC2”.

Once the service appears, we click it to open the main dashboard. There we will find an orange “Launch instance” button, which we select to begin creating our instance.

Step 2: Initial instance configuration

In this step we define the basic parameters of our Amazon EC2 instance.

First, we assign a name that lets us identify it easily. In this case we use “laboratorio-ec2”.

Next, we select the AMI (Amazon Machine Image), which is the operating system template for our instance. The AMI includes the base system and the initial configuration it needs to run.

For this lab we choose Amazon Linux, since it is optimized for AWS, lightweight, and widely used in real environments.

For the instance type we use t3.micro, one of the smallest and cheapest general-purpose options on AWS:

It is well suited to learning and testing
It is eligible for the Free Tier (eligibility depends on your account and region)
It has enough resources for small projects

Step 3: Creating the key pair

In this step we create a key pair, which lets us connect securely to our Amazon EC2 instance over SSH.

First, we give the key pair a name so we can identify it easily.

Then we select RSA as the key type, since it is one of the most widely used and compatible algorithms for SSH authentication and offers a good level of security and ease of use.

For the format we choose .pem, since it is the best fit for connecting from Linux, macOS, or tools such as Git Bash on Windows, where the ssh command can be used directly.

Note that although the key pair was created in this lab, it was not used for the connection: we accessed the instance through EC2 Instance Connect, a tool that lets you connect directly from the browser without configuring the private key. Even so, .pem keys are fundamental in real environments and are standard practice for secure SSH connections.

Important tip

Download this file and store it in a safe place, because the private key can only be downloaded when the key pair is created. If you lose it, you will not be able to connect to the instance over SSH with this key pair.

Step 4: Network configuration

In this step we configure the access rules for our Amazon EC2 instance through a Security Group, which acts as a firewall that controls inbound traffic.

For this lab we enable the following rules:

SSH (port 22): lets us connect to the instance remotely from our machine.
HTTP (port 80): makes the website reachable from the browser.

These settings are essential: without HTTP access we would not be able to view the deployed web page.

With this, the configuration needed to launch our EC2 instance is complete.

Step 5: Connecting to the instance

Once the Amazon EC2 instance has launched, we open its details section, where we will find the option to connect.

To do so, we select the instance and click “Connect”. In that section we go to the EC2 Instance Connect option, which lets us access the instance directly from the browser without additional configuration.

Finally, we click the “Connect” button, which opens a terminal where we can interact with our instance.

Step 6: Updating the system and installing Nginx

The following command updates the operating system, installing the latest available package versions and fixing known vulnerabilities.

sudo dnf update -y

The next command downloads and installs Nginx on the instance, leaving it ready to be configured and used.

sudo dnf install nginx -y

Step 7: Starting and enabling Nginx

This command starts Nginx so that the web server begins handling requests.

sudo systemctl start nginx

This one makes Nginx start automatically every time the instance reboots.

sudo systemctl enable nginx

Step 8: Getting the public IP address

To reach our web server, we need the public IP address of the Amazon EC2 instance.

We go to the Instances panel, select the instance we created, and look for the “Public IPv4 address” field in the details section.

This is the address we will enter in the browser to view our web page.

This is the web page we have created.

Step 9: Modifying the web page

To customize the content of our site on the Amazon EC2 instance, we need to go to the folder where Nginx stores its web files.

First, we change to that directory:

cd /usr/share/nginx/html

Then we open the page's main file:

sudo nano index.html

This file holds the content shown in the browser. Here we can edit it and replace the default Nginx page with our own design.

Step 10: Editing and saving the web page

To customize our site, we delete the existing content of index.html and replace it with the code for our own web page.

Once the changes are made, we save them from the nano editor:

Press Ctrl + X
nano asks whether to save the changes (Y/N)
Press Y (Yes)
Finally, press Enter to confirm the file name.

Step 11: Viewing the web page

Finally, to see the result of our work, we use the instance's public IP address again.

We enter this address in the browser, replacing TU_IP_PUBLICA with the instance's public IP:

http://TU_IP_PUBLICA

And this is the final result of our web page after the modification.

What I learned in this lab

In this lab I learned, step by step, how to launch and configure an instance on Amazon EC2. I also learned how to connect remotely with EC2 Instance Connect and how to deploy a working web server using Nginx.

I also came to understand the importance of Security Groups for controlling access over SSH and HTTP, and how the public IP makes a web page reachable from the internet.

Overall, it was a useful exercise for connecting theory with practice and understanding how an application is published in the cloud.

Open-source AI I’m watching: DeepSeek V4, VibeVoice, and the n8n effect

Posted May 2, 2026 by DevegygiebyOL

Sunday is my day to skim what shipped, note what seems worth going deeper on, and write a short annotated list before the week catches up with me again. This week was genuinely busy: three frontier labs released major models within a 10-day window, a speech model landed quietly from Microsoft, and n8n crossed a milestone that made me rethink some assumptions.

I’m running three AI-curated directory sites built on Astro 5 + Claude Haiku 4.5. These releases matter to me not just as interesting tech but as practical inputs for what I build next.

DeepSeek V4 Preview (April 24)

DeepSeek dropped V4 on April 24: a 1.6T-parameter Mixture-of-Experts model with 49B parameters activated per forward pass, a 1M-token context window, and an MIT license. The V4-Pro and V4-Flash variants are both live via their API, with Pro at $0.30 per million tokens.

What makes this worth watching for me specifically: 49B activated parameters at that price point puts it in direct competition with Claude Haiku 4.5 for content-generation workloads. I haven’t benchmarked it against my actual task — concise, non-hallucinating product descriptions at scale — so I won’t claim it’s better. But the SWE-bench Pro number (81%) is not nothing, and the MIT license means fine-tuning on domain data is an option if I ever have the infrastructure budget for it. I don’t right now. Good to know it exists.

The other thing I’m noting: the 1M-token context window is large enough to feed an entire site’s content into a single prompt. Whether that’s useful for quality or just a headline feature, I’ll know in a month of testing.

GPT-5.5 (April 23–24)

OpenAI also dropped GPT-5.5 on April 23, with API access following the next day. The notable framing from OpenAI: this isn’t a post-training increment. They rebuilt the architecture, the pretraining corpus, and the training objectives from scratch — first time they’ve done that since GPT-4.5.

I’m watching this more cautiously than the benchmark numbers suggest I should. When pretraining changes substantially, so do second-order behaviors: emergent capabilities, failure modes, prompt sensitivities. The leaderboard tells you the headline. It doesn’t tell you how the model behaves when your prompt is ambiguous or your domain is narrow. I’ll wait 30–45 days for the community to find the edges before I run serious evals.

Microsoft VibeVoice (April 29)

Microsoft released VibeVoice on April 29 — a frontier speech AI model, fully open-source, hosted on GitHub. Honest take: I haven’t used it. Speech-to-text isn’t in my current stack at all. But the open-source release is interesting because Microsoft has historically distributed frontier models through Azure, not GitHub.

If it holds up technically, high-quality speech AI joins the list of things you can self-host without paying a cloud API per-minute rate. That matters more for the open-source ecosystem in aggregate than it does for my specific projects. I’m flagging it because the distribution model, not the capability, is what changed.

n8n crossing 180k GitHub stars

n8n crossed 180,000 stars. It’s a workflow automation platform — visual canvas, 400+ integrations, self-hosted, fair-code license, and now with native AI workflow support built in.

Here’s the honest competitive thought this triggered: n8n can do what my GitHub Actions cron pipelines do — scrape, enrich, call Claude, publish — but without writing YAML. If a non-coder can set up an n8n flow that generates content and posts it to Dev.to, the differentiation for my approach has to come from somewhere else: speed, volume, domain-specific prompt quality, site architecture. That’s where I’m trying to compete. The milestone is a useful reminder to be honest about what is and isn’t a moat.

OpenClaw: from 9k to 210k+ stars

OpenClaw is an open-source personal AI assistant that connects to WhatsApp, Telegram, Slack, Discord, Signal, and iMessage. It went from 9,000 to over 210,000 stars in a matter of weeks earlier this year and is still climbing.

I track this not because it’s relevant to my stack, but because the growth curve is its own signal. OpenClaw didn’t solve a new technical problem — it packaged existing capabilities in a way that fit how people already communicate. That’s a distribution lesson, not a model lesson. When I think about what makes a directory site useful rather than just indexed, I keep coming back to the same question: is this packaged where people already are, or does it require them to come to me?

Five things, five different stakes. DeepSeek V4 and GPT-5.5 are direct inputs to infrastructure decisions I’ll make in the next 60 days. n8n is a competitive signal worth taking seriously. VibeVoice and OpenClaw are watching briefs — I’ll check back in 30 days and see if either has changed my thinking.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.

AI in Journalism

Posted May 2, 2026 by DevegygiebyOL

I’ve been running an experiment. I wanted to see if AI could generate opinion articles that, while written by AI, capture my personality and perspectives. My AI Daily News site was initially just a way for me to aggregate news stories about AI into something I could digest in the morning before I started work.

Later I decided to give it a range of my prior writing and have it prepare an ‘Opinion’ piece with my name on the byline. Would it produce something plausibly by me, presenting my views, but on the news of the day?

Sadly, I think there has been a fundamental change since the early days of OpenAI models, when the results were creative, unpredictable and entertaining. Today’s models have been trained in a way that produces the same bland writing style regardless of the instructions you provide.

I got into the habit of waking up each morning, reading the ‘opinion’ and coaching Claude to rework it. Why? Because it would write opinions that conflicted with my own documented views in fundamental ways. It would include the terms I have used, but it had not internalized the concepts. So each day I would need to correct it.

Multiple news organisations have banned the use of AI in journalism, and now I have first-hand experience of why. It isn’t just the opinion pieces, however; the news stories it writes also have opinions injected beyond the facts. At least with the stories I have always linked to the original source material at the bottom, which for most stories means at least two sources.

I am not doing journalism by any measure. Journalism means doing the research, doing the interviews, cross referencing, and creating a cohesive angle for the article. Journalism isn’t unbiased, in that it is influenced by the point of view of the writer, but journalistic integrity still means something.

Does this mean therefore that AI can’t play a part in journalism?

My experience with AI in software has parallels. AI will happily generate code which passes ‘tests’ by altering the tests; in effect changing the conditions of success to please the user. AI does work in software development, but only when you have a framework which prevents this kind of gaming.

People have lost trust in journalism, partly as a result of AI slop: text that doesn’t differentiate between fact and fantasy. There is also the angst of journalists who fear for their jobs and so resist and minimize the utility of AI. There is a temptation to cite ethics as a reason not to use AI when the real motivation is fear of being replaced.

The answer, I think, will be to apply to AI the same disciplines that apply to human journalists: checking facts and resisting the temptation to opine, while still creating compelling, entertaining and informative articles.

In my software development AI has become a partner, but not a replacement. It still needs me to apply that discipline to get good results. Just like software, journalism could benefit from AI, but only with stringent disciplines around how it functions.

AI journalism needs to be more than just a way of ripping off the work of actual journalists. It needs to engage with the real world and be held to the same standards of accuracy. How AI will affect jobs is a larger issue, but it should not be confused with the utility of AI.

Inside Go 1.24’s New HTTP/3 Support: How It Cuts Latency for High-Traffic APIs

Posted May 1, 2026 by DevegygiebyOL

Go 1.24 marks a major milestone for cloud-native developers with the general availability of native HTTP/3 support in the standard library. For teams running high-traffic APIs, this update eliminates the need for third-party QUIC proxies, slashing latency and simplifying deployment pipelines. Below, we break down how the implementation works, why it outperforms HTTP/1.1 and HTTP/2 for high-throughput workloads, and how to migrate existing services.

Why HTTP/3 Matters for High-Traffic APIs

HTTP/3 is built on QUIC, a UDP-based transport protocol that solves long-standing issues with TCP-based HTTP/2: head-of-line blocking, slow connection establishment, and poor performance on lossy networks. For high-traffic APIs serving millions of requests per second, these issues add up to measurable latency spikes and wasted throughput.

Key QUIC advantages include:

0-RTT connection resumption: Returning clients can send requests immediately without a full handshake, cutting initial latency by up to 300ms on long-distance links.
Independent stream delivery: With HTTP/2 over TCP, a single lost packet stalls every stream on the connection until it is retransmitted. QUIC delivers each stream independently, so a lost packet only delays the request it belongs to.
Integrated encryption: QUIC bakes TLS 1.3 into the transport layer, reducing handshake overhead compared to TCP + TLS setups.

Go 1.24’s HTTP/3 Implementation

Go’s HTTP/3 support lives in the new net/http3 package, designed to integrate seamlessly with the existing net/http ecosystem. The implementation is fully compliant with RFC 9114 (HTTP/3) and RFC 9000 (QUIC), with no external dependencies required.

Key design choices for the standard library implementation:

Shared connection pooling with HTTP/1.1 and HTTP/2, so clients automatically select the best supported protocol for each endpoint.
Zero-copy buffer management to minimize GC pressure for high-throughput workloads.
Native support for HTTP/3 server push (though most API teams will opt out of this for request-response patterns).

Benchmarking Latency Improvements

We tested a sample high-traffic API (10k requests/second, 1KB payload) across three protocols using Go 1.24’s standard library. Results were measured on a 100ms RTT link between us-east-1 and eu-west-1:

| Protocol | Median Latency | 99th Percentile Latency | Throughput (req/s) |
| HTTP/1.1 | 112ms | 340ms | 8,200 |
| HTTP/2 | 98ms | 290ms | 9,100 |
| HTTP/3 | 67ms | 180ms | 11,400 |

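For high-traffic APIs, moving from HTTP/2 to HTTP/3 cut latency by roughly 30-40% (median 98ms to 67ms, p99 290ms to 180ms) and raised throughput by about 25% (9,100 to 11,400 req/s). In practice that means lower tail latencies, fewer dropped requests, and reduced infrastructure costs.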
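As a quick check on how those percentages follow from the table, the short sketch below recomputes them from the HTTP/2 and HTTP/3 rows. The helper names reduction and gain are made up for this example.

```go
package main

import "fmt"

// reduction returns the percentage decrease when going from before to after.
func reduction(before, after float64) float64 {
	return (before - after) / before * 100
}

// gain returns the percentage increase when going from before to after.
func gain(before, after float64) float64 {
	return (after - before) / before * 100
}

func main() {
	// Values taken from the HTTP/2 and HTTP/3 rows of the benchmark table.
	fmt.Printf("median latency reduction: %.1f%%\n", reduction(98, 67))
	fmt.Printf("p99 latency reduction:    %.1f%%\n", reduction(290, 180))
	fmt.Printf("throughput gain:          %.1f%%\n", gain(9100, 11400))
}
```

Measured against HTTP/1.1 instead of HTTP/2, the same formulas give larger numbers, so it is worth stating the baseline whenever these figures are quoted.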

Migrating Your API to HTTP/3

Go 1.24 makes migration straightforward for existing net/http users. For servers, you can add HTTP/3 support alongside existing HTTP/1.1 and HTTP/2 listeners with just a few lines of code:

package main

import (
    "crypto/tls"
    "log"
    "net/http"
    // Import path as given in this article; check that it exists in your
    // toolchain. The widely used third-party implementation is
    // github.com/quic-go/quic-go/http3, which exposes a similar Server type.
    "net/http3"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/v1/health", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })

    srv := &http3.Server{
        Handler:   mux,
        Addr:      ":443",
        TLSConfig: loadTLSConfig(), // Your existing TLS config
    }

    // Start the HTTP/3 listener (QUIC over UDP port 443)
    go func() {
        log.Fatal(srv.ListenAndServe())
    }()

    // Keep the existing plain-HTTP listener for backward compatibility.
    // Without TLS this serves HTTP/1.1 only; HTTP/2 in net/http needs a TLS listener.
    httpSrv := &http.Server{
        Addr:    ":80",
        Handler: mux,
    }
    log.Fatal(httpSrv.ListenAndServe())
}

func loadTLSConfig() *tls.Config {
    // Load your TLS certificate and key here
    return &tls.Config{}
}
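One detail the server example leaves implicit is how a client that first connects over TCP learns that an HTTP/3 endpoint exists. The usual mechanism is the Alt-Svc response header (RFC 7838). The sketch below uses only net/http and shows one way to add it; altSvcValue and withAltSvc are hypothetical helper names, and the one-day max-age is an arbitrary choice for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// altSvcValue builds an Alt-Svc value advertising HTTP/3 on the given UDP port.
func altSvcValue(port, maxAgeSeconds int) string {
	return fmt.Sprintf(`h3=":%d"; ma=%d`, port, maxAgeSeconds)
}

// withAltSvc wraps a handler so that every response advertises the HTTP/3 endpoint.
func withAltSvc(next http.Handler, value string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Alt-Svc", value)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/health", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	h := withAltSvc(mux, altSvcValue(443, 86400))

	// Exercise the handler in memory instead of opening a real socket.
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/v1/health", nil))
	fmt.Println(rec.Header().Get("Alt-Svc"))
}
```

Browsers generally act on Alt-Svc only when it arrives over HTTPS, so in practice this wrapper belongs on the TLS listener rather than on a plain port 80 server.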

Clients can enable HTTP/3 by using the http3.RoundTripper in place of the default http.Transport:

client := &http.Client{
    Transport: &http3.RoundTripper{},
}

resp, err := client.Get("https://api.example.com/health")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()

Considerations for Production

While Go 1.24’s HTTP/3 support is production-ready, keep these caveats in mind:

UDP traffic must be allowed on your firewall (QUIC uses UDP port 443 by default).
Some legacy load balancers may not support QUIC, so test compatibility with your infrastructure first.
HTTP/3 server push is disabled by default, as it’s rarely useful for REST APIs.

For teams running high-traffic APIs, Go 1.24’s HTTP/3 support removes a major performance bottleneck with zero third-party dependencies. The latency and throughput gains are immediate for global user bases, making this one of the most impactful updates for Go backend developers in recent years.

The Story of How I Built a VPN protocol: Part 1

Posted May 1, 2026 by DevegygiebyOL

🚨🚨🚨 Disclaimer 🚨🚨🚨

This article and the VPN itself are written for educational purposes only.

How It All Started

I recently switched to Arch. Everything started off well: I installed all the utilities I needed, and then I decided to install the VPN I used to use. And then a problem appeared — it doesn’t work on Arch (even as an AppImage).

My provider also supported Shadowsocks, but instead of using it, I decided to write my own VPN. For more practice.

VPN Protocol

My VPN protocol is designed for maximum stealth. In my opinion, one of the most important things here is encryption from the very first packet. In my protocol, this is implemented just like in Shadowsocks — with a pre-shared key.

Encryption algorithm: ChaCha20-Poly1305.

It’s also worth mentioning that the protocol works over TCP. A random amount of junk bytes is added to each packet for length obfuscation.

Packet Structure

Each packet has a 5-byte header that is masked as encrypted data using XOR with the first 5 bytes of the key.

First 2 bytes: total packet length. Needed to determine where the packet ends, since TCP is a byte stream and does not preserve packet boundaries.
Third byte: flags byte. Currently only 2 flags are used:
- Bit 1: indicates that this packet is fake and should not be processed (not yet implemented).
- Bit 2: flag for performing ECDH (Elliptic Curve Diffie‑Hellman).
Last 2 bytes: ciphertext length, used to separate junk bytes from the ciphertext.
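The header layout above can be sketched in Go as follows. This is an illustration, not the code from the repositories: the big-endian byte order, the mapping of "Bit 1" and "Bit 2" to the two lowest bits, and the names Header, EncodeHeader and DecodeHeader are all assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const headerLen = 5

// Flag bit positions are an assumption: "Bit 1" is taken as the lowest bit.
const (
	FlagFake byte = 1 << 0
	FlagECDH byte = 1 << 1
)

// Header mirrors the 5-byte layout described in the text.
type Header struct {
	TotalLen      uint16 // total packet length
	Flags         byte
	CiphertextLen uint16 // separates ciphertext from trailing junk bytes
}

// xorMask XORs the first 5 bytes of b with the first 5 bytes of key.
// key must be at least 5 bytes long.
func xorMask(b, key []byte) {
	for i := 0; i < headerLen; i++ {
		b[i] ^= key[i]
	}
}

// EncodeHeader serializes the header and masks it with the key.
func EncodeHeader(h Header, key []byte) [headerLen]byte {
	var out [headerLen]byte
	binary.BigEndian.PutUint16(out[0:2], h.TotalLen)
	out[2] = h.Flags
	binary.BigEndian.PutUint16(out[3:5], h.CiphertextLen)
	xorMask(out[:], key)
	return out
}

// DecodeHeader unmasks a received header and parses its fields.
// masked is passed by value, so the caller's copy is left untouched.
func DecodeHeader(masked [headerLen]byte, key []byte) Header {
	xorMask(masked[:], key)
	return Header{
		TotalLen:      binary.BigEndian.Uint16(masked[0:2]),
		Flags:         masked[2],
		CiphertextLen: binary.BigEndian.Uint16(masked[3:5]),
	}
}

func main() {
	key := []byte("0123456789abcdef0123456789abcdef")
	h := Header{TotalLen: 61, Flags: FlagECDH, CiphertextLen: 44}
	wire := EncodeHeader(h, key)
	fmt.Printf("masked:  %x\n", wire)
	fmt.Printf("decoded: %+v\n", DecodeHeader(wire, key))
}
```

A receiver would read exactly 5 bytes from the TCP stream, decode them this way, and then read the rest of the packet according to TotalLen.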

Then comes:

12 bytes: randomly generated nonce;
ciphertext;
16 bytes: AEAD authentication tag;
junk bytes.

Handshake and Key Exchange

1. First packet from the client

The client sends its 16-byte username to the server (encrypted, of course).

2. Server response

If the server finds a user with that username, it:

sends the client a randomly generated 32-byte salt;
starts computing the keys:
- sending key (server → client)
- receiving key (server ← client)

3. Key computation on the server

The server stores the user’s password in plaintext.

Receiving key (for decrypting from the client) = hash(password + first 16 bytes of salt).
Sending key (for encrypting to the client) = hash(password + last 16 bytes of salt).

4. Client actions

The client receives the salt, decrypts it, and does the same thing, but the key roles are inverted:

what is the sending key for the server becomes the receiving key for the client, and vice versa.
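The derivation in steps 3 and 4 can be sketched like this. The article does not name the hash function or say how the password and salt are combined, so SHA-256 over a plain concatenation is an assumption here, and KeyPair, ServerKeys and ClientKeys are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// KeyPair holds one side's directional keys.
type KeyPair struct {
	Send [32]byte
	Recv [32]byte
}

// derive hashes the password followed by one half of the salt.
// SHA-256 and simple concatenation are assumptions for this sketch.
func derive(password string, saltHalf []byte) [32]byte {
	buf := append([]byte(password), saltHalf...)
	return sha256.Sum256(buf)
}

// ServerKeys follows step 3: the first salt half feeds the receiving key,
// the second half feeds the sending key.
func ServerKeys(password string, salt [32]byte) KeyPair {
	return KeyPair{
		Recv: derive(password, salt[:16]),
		Send: derive(password, salt[16:]),
	}
}

// ClientKeys follows step 4: the same values with the roles swapped.
func ClientKeys(password string, salt [32]byte) KeyPair {
	s := ServerKeys(password, salt)
	return KeyPair{Send: s.Recv, Recv: s.Send}
}

func main() {
	var salt [32]byte
	for i := range salt {
		salt[i] = byte(i) // fixed salt for a repeatable demo; the server uses random bytes
	}
	server := ServerKeys("correct horse", salt)
	client := ClientKeys("correct horse", salt)
	fmt.Println(client.Send == server.Recv, client.Recv == server.Send)
}
```

Using separate keys per direction means that a nonce reused in one direction cannot collide with traffic in the other.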

5. ECDH and connection finalization

After the client has generated the keys, it generates an ephemeral key pair based on the Curve25519 elliptic curve (this pair is needed for ECDH). It then sends a connection confirmation (first byte = 0xFF) along with the public ephemeral key, setting the ECDH flag.

The server receives the packet, deobfuscates it, and gets the confirmation and the client’s ephemeral key. Then it:

assigns an IP address to the client from a local private network;
generates its own ephemeral key pair;
sends the client its assigned IP address and the server’s public key;
performs the ECDH round.

After sending, the server updates its keys by hashing the old keys with the secret obtained from ECDH.

6. Client finalization

After receiving the packet with the IP address and the server’s public ephemeral key, the client:

creates a local tunnel;
sets its IP address (received from the server);
performs the ECDH round;
updates its keys.
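Steps 5 and 6 can be illustrated with Go's standard crypto/ecdh package. This is a sketch of a single round with both sides simulated in one process, not the project's implementation. The mixing step (SHA-256 over the old key followed by the shared secret) is an assumption, since the article only says the old keys are hashed with the ECDH secret, and mixKey and ecdhRound are made-up names.

```go
package main

import (
	"crypto/ecdh"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// mixKey hashes the old key together with the ECDH shared secret.
func mixKey(old [32]byte, secret []byte) [32]byte {
	return sha256.Sum256(append(old[:], secret...))
}

// ecdhRound simulates both sides of one X25519 exchange and returns the
// updated key as computed by the client and by the server.
func ecdhRound(old [32]byte) ([32]byte, [32]byte, error) {
	var zero [32]byte
	curve := ecdh.X25519()

	clientPriv, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return zero, zero, err
	}
	serverPriv, err := curve.GenerateKey(rand.Reader)
	if err != nil {
		return zero, zero, err
	}

	// Only the raw public key bytes would travel over the wire.
	clientPub, err := curve.NewPublicKey(clientPriv.PublicKey().Bytes())
	if err != nil {
		return zero, zero, err
	}
	serverPub, err := curve.NewPublicKey(serverPriv.PublicKey().Bytes())
	if err != nil {
		return zero, zero, err
	}

	clientSecret, err := clientPriv.ECDH(serverPub)
	if err != nil {
		return zero, zero, err
	}
	serverSecret, err := serverPriv.ECDH(clientPub)
	if err != nil {
		return zero, zero, err
	}
	return mixKey(old, clientSecret), mixKey(old, serverSecret), nil
}

func main() {
	old := sha256.Sum256([]byte("key from the password handshake"))
	clientKey, serverKey, err := ecdhRound(old)
	if err != nil {
		panic(err)
	}
	fmt.Println("keys match:", clientKey == serverKey)
}
```

Because the ephemeral private keys are discarded after the round, someone who later learns the password cannot recompute the updated keys from a recorded handshake.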

Main Work Loop

After the connection is established and keys are generated, the main work loop begins.

Client Side

3 goroutines run on the client side:

First goroutine (reading from the tunnel and preparing packets)

Reads packets from the tunnel.
Generates an 8-byte salt to update the sending key (by hashing the old sending key with the salt).
Adds this 8-byte salt to the beginning of the plaintext (the salt is followed by the packet read from the tunnel).
Encrypts everything.
Adds random junk bytes for obfuscation.
Stores the prepared packet in a buffer.

Second goroutine (sending packets)

Responsible for sending already prepared packets.
Packets are sent in batches of 1 to 5 packets (the traffic is of course still segmented at OSI layers 3 and 4, which I can’t influence).

Third goroutine (receiving packets from the server)

Responsible for receiving packets from the server.
Performs deobfuscation and decryption.
Writes the decrypted data to the tunnel.

Server Side

The server has 3 main goroutines, plus additional goroutines for receiving packets from clients.

First goroutine (handshake handling)

Handles incoming handshake requests from clients. If the handshake is successful, a new goroutine is created to process packets sent by that client.

Second goroutine (reading from the tunnel)

Reads packets from the tunnel and sends them to clients.

Third goroutine (cleaning inactive connections)

Cleans up inactive connections.

Key Updates

Salt in every packet

Every packet (whether from client or server) contains a salt. It is used to update the keys:

The server, when sending a packet, includes a salt. After sending, it updates its sending key by hashing the old key with that salt.
The client, when receiving and decrypting a packet, also updates a key — but not the sending key, the receiving key.
When the client sends a packet, the same happens, but the roles are reversed.
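The per-packet update described above behaves like a simple hash ratchet. The sketch below shows why both ends stay in step, assuming SHA-256 as the hash and packets that arrive in order (which TCP provides). The function name nextKey is hypothetical, and encryption itself is left out.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// nextKey derives the following key from the current key and an 8-byte salt.
func nextKey(current [32]byte, salt [8]byte) [32]byte {
	return sha256.Sum256(append(current[:], salt[:]...))
}

func main() {
	// After the handshake the sender's sending key equals the receiver's receiving key.
	senderSendKey := sha256.Sum256([]byte("demo key"))
	receiverRecvKey := senderSendKey

	// Fixed salts for a repeatable demo; the protocol generates them randomly.
	salts := [][8]byte{{1}, {2}, {3}}
	for i, salt := range salts {
		// Sender: encrypts salt + payload with senderSendKey (omitted), then advances.
		senderSendKey = nextKey(senderSendKey, salt)
		// Receiver: decrypts with receiverRecvKey (omitted), reads the salt, then advances.
		receiverRecvKey = nextKey(receiverRecvKey, salt)
		fmt.Println("packet", i, "keys in sync:", senderSendKey == receiverRecvKey)
	}
}
```

One consequence of this design is that a single lost or reordered packet would desynchronize the two keys, so it relies on the reliable, ordered delivery of the underlying TCP connection.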

Periodic ECDH updates

Every 4 minutes or after sending 2³² packets (whichever comes first), the keys are refreshed with a new ECDH exchange. The ephemeral public keys are carried inside regular data packets.

And that, in fact, is the entire protocol. During implementation, I thought about writing it in Go or Rust. I chose Go for its simplicity.

Implementation Process

To be honest, the protocol architecture was mostly developed while writing the code. It has quite a few problems — both in terms of protocol design and implementation.

Example problems

Constant username packet length

The encrypted username packet has a constant length of 44 bytes (12-byte nonce, 16 bytes of ciphertext, and a 16-byte AEAD tag). An observer who knows this, and knows that the user is running this protocol, also knows the plaintext value of the ciphertext-length field in the header. Because the header is only XORed with the key, XORing the observed 4th and 5th header bytes with that known value reveals the 4th and 5th bytes of the key.

Repository duplication

I foolishly created two separate repositories, one for the client and one for the server. As a result, the branches containing common modules just duplicate each other.

Git flow

I tried to follow git flow, but failed here too.

Vulnerabilities

I also have a feeling that there are more vulnerabilities in the code than working logic.

No graceful shutdown

There is no proper negotiated client-server disconnect, just a connection break.
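The first problem in this list, the constant username packet length, is a known-plaintext attack on the XOR mask and can be reproduced in a few lines. The sketch assumes the length field is stored big-endian and holds the value 44 for the username packet; both are assumptions made for illustration, and recoverKeyBytes is a made-up name.

```go
package main

import "fmt"

// recoverKeyBytes XORs the masked length field (header bytes 4 and 5)
// with its known plaintext value, which yields key bytes 4 and 5.
func recoverKeyBytes(maskedHeader [5]byte, knownLen uint16) [2]byte {
	return [2]byte{
		maskedHeader[3] ^ byte(knownLen>>8),
		maskedHeader[4] ^ byte(knownLen),
	}
}

func main() {
	key := []byte{0x11, 0x22, 0x33, 0xAA, 0xBB} // first 5 bytes of a secret key
	const ciphertextLen = 44                    // assumed value of the length field

	// What the sender puts on the wire for header bytes 4 and 5.
	var masked [5]byte
	masked[3] = byte(ciphertextLen>>8) ^ key[3]
	masked[4] = byte(ciphertextLen&0xFF) ^ key[4]

	// What a passive observer can compute without knowing the key.
	leaked := recoverKeyBytes(masked, ciphertextLen)
	fmt.Printf("recovered key bytes: %x (actual: %x)\n", leaked, key[3:5])
}
```

Two bytes out of a 32-byte key do not break the encryption on their own, but they do let an observer unmask the ciphertext-length field of every later packet that uses the same key bytes, which undermines the junk-byte length obfuscation.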

Although considering this is my first project, I think it didn’t turn out too badly. If anyone wants to check out this mess, here are the links:

Client: https://github.com/SmileUwUI/smileTun-client
Server: https://github.com/SmileUwUI/smileTun-server

Currently, the implementation works. And I’m writing this article through my own VPN protocol.

Future Plans

Merge both repositories into one.
Add fake packet sending.
Add TLS mimicry.
And much more.

If anyone has any questions or recommendations — leave them in the comments. For now, I bid you farewell. Good luck to everyone!

An Agent Run Is Not Done When the Model Stops Talking

Posted May 1, 2026 by DevegygiebyOL

The Problem

You prompt an agent. It runs. Tokens stream out. It stops. You read the output. Done.

Except you have no idea if it’s done.

When you run an AI agent on a real task, the model producing output is the easiest part. The hard part starts after the last token: did the agent actually finish the assigned work? Can you verify the output? Can you reproduce what led to the result? Can you tell what went wrong when it inevitably goes wrong?

Most agent frameworks treat the model’s silence as a completion signal. The model stopped emitting tokens, so the run must be complete. This is the same as treating a process that hasn’t crashed as one that succeeded. Production engineers know better. Agent builders should too.

The gap between “the model stopped generating” and “the task is complete” is where most real-world agent failures live.

The Gap

Current agent tools handle the running part well enough. Codex runs code in sandboxes. Claude Code edits files and runs tests. Devin opens a browser and clicks through workflows. These systems can start work, maintain context across turns, and produce artifacts.

What they don’t answer:

Is the run complete, or did the model just stop talking because it hit a context limit, encountered a tool error it couldn’t recover from, or decided the task was “good enough”?
Did the agent drift from its objective? A research task that returns a summary of three papers when you asked for five is not complete. A code change that passes tests but ignores two of four acceptance criteria is not complete.
What evidence exists for the claims in the output? If the agent says “the API returns 404 for invalid IDs,” can you find the HTTP log that proves it?
Can you reproduce what happened? Not approximately. Exactly. Same tools, same inputs, same sequence of decisions.

These questions are not nice-to-haves for a monitoring dashboard. They are the difference between an agent system you can trust in production and one you have to babysit.

The Infrastructure Analogy

This problem was solved decades ago in a different domain.

Job schedulers in production systems do not just start work. They track completion. They capture exit codes. They preserve logs. They chain dependencies so downstream work only starts when upstream work finishes cleanly. They surface failures immediately. They allow operators to re-run, roll back, or inspect any job without guessing.

Cron, Airflow, Kubernetes Jobs, systemd: these systems share a discipline. They treat execution as a lifecycle with defined states. A job is pending, running, succeeded, failed, or timed out. The transitions between states are explicit. The data at each transition is captured.

Agent systems need the same discipline. The dominant pattern right now: start the model, stream tokens, check if the stop token fired, return the output string. No exit code. No structured state machine. No artifact manifest. The run either produced text or it didn’t, and you figure out the rest.
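A minimal sketch of that lifecycle applied to an agent run, assuming the five states named above. The names `RunState`, `ALLOWED`, and `AgentRun` are illustrative and do not come from any existing framework.

```python
from dataclasses import dataclass, field
from enum import Enum
import time


class RunState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"


# Explicit transition table: anything not listed here is rejected.
ALLOWED = {
    RunState.PENDING: {RunState.RUNNING},
    RunState.RUNNING: {RunState.SUCCEEDED, RunState.FAILED, RunState.TIMED_OUT},
    RunState.SUCCEEDED: set(),
    RunState.FAILED: set(),
    RunState.TIMED_OUT: set(),
}


@dataclass
class AgentRun:
    run_id: str
    state: RunState = RunState.PENDING
    history: list = field(default_factory=list)

    def transition(self, new_state: RunState, reason: str) -> None:
        """Move to new_state, recording when and why the transition happened."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.value} -> {new_state.value}"
            )
        self.history.append((time.time(), self.state.value, new_state.value, reason))
        self.state = new_state
```

Because terminal states have no outgoing transitions, a run that already failed cannot later be relabeled as succeeded without creating a new run.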

Imagine running a production database migration this way. “The script printed ‘done’ so I assume it worked.” No one would accept that. But that is exactly what we accept from agent runs that cost hundreds of dollars in compute and produce outputs people act on.

An agent run is a production job. It needs production job infrastructure.

What “Done” Actually Means

An agent run is done when you can answer four questions. All four. Not three of four. Not “probably” on any of them.

1. Did the process exit cleanly?

This is the floor. The model stopped generating tokens. Did it stop because it completed its reasoning, or because it hit a context window limit? Did a tool call time out? Did the inference server return an error that the orchestrator swallowed? Did the agent crash mid-execution and leave partial artifacts in your filesystem?

Production systems distinguish between exit code 0 and exit code 137. Agent systems need the same granularity. “The model stopped” is not an exit state. “The model completed its turn, all tool calls returned successfully, and the reasoning chain terminated with a completion signal” is an exit state.

2. Did the output match the objective?

This is harder than it sounds because objectives are often underspecified. But even with a well-specified objective, agents redefine “done” on the fly. You ask for a security audit of ten endpoints. The agent audits seven, declares the remaining three “out of scope,” and returns. The run completed cleanly. The objective was not met.

You need a verification step that compares the output against the original objective, not against whatever the agent decided the objective was after three rounds of tool calls. This can be as simple as a checklist or as rigorous as a test suite. The point is that it exists and runs automatically.
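The checklist end of that spectrum can be very small. The sketch below assumes the objective can be expressed as a list of required items (for example, endpoint paths) and that the agent reports which items it covered. `check_coverage` is a hypothetical helper name.

```python
def check_coverage(required_items, reported_items):
    """Compare what the agent reported against the original objective."""
    required = set(required_items)
    reported = set(reported_items)
    missing = sorted(required - reported)
    unexpected = sorted(reported - required)
    return {
        "complete": not missing,
        "missing": missing,
        "unexpected": unexpected,
    }
```

The important detail is that `required_items` comes from the original request, not from anything the agent wrote during the run.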

3. Is there evidence supporting the claims?

Agents make claims. “This function is unused.” “The API latency improved by 40%.” “No regressions were introduced.” These claims are sometimes correct. They are sometimes hallucinated. Without evidence, you cannot tell the difference.

Evidence means artifacts: logs, citations, test results, diffs, URLs, timestamps. Not more text from the model. The agent should collect and attach these artifacts before synthesizing its output. If the agent claims a function is unused, the artifact is the grep result showing zero call sites. If the agent claims latency improved, the artifact is the benchmark output with before and after numbers.

Output without evidence is an opinion. Production systems do not ship on opinions.
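One way to make this checkable is to store claims as records that carry their artifacts, rather than as free text. This is a sketch under that assumption; the `Claim` type and the artifact paths are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    # Paths or IDs of artifacts (grep output, benchmark logs, diffs) backing the claim.
    evidence: list = field(default_factory=list)


def unsupported_claims(claims):
    """Return the text of every claim that has no attached artifact."""
    return [c.text for c in claims if not c.evidence]
```

A run whose `unsupported_claims` list is non-empty has produced at least one opinion, in the sense used above.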

4. Can someone else reproduce or audit what happened?

Reproducibility requires a record of what the agent did: which tools it called, what inputs it provided, what outputs it received, what decisions it made at each step, and what the environment looked like at each point. This is a trace, not a summary.

Auditing requires that the trace is stored, indexed, and queryable after the fact. Not in a log file you grep manually. In a structured format that lets you answer: “What happened at step 14 and why?”
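As a rough illustration of "structured and queryable", the sketch below stores each step in SQLite using Python's standard library. The table layout and function names are assumptions made for the example; a real system would likely also capture timestamps and environment details.

```python
import json
import sqlite3


def open_trace(path=":memory:"):
    """Open (or create) a trace store with one row per agent step."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        "run_id TEXT, step INTEGER, tool TEXT, "
        "inputs TEXT, output TEXT, decision TEXT, "
        "PRIMARY KEY (run_id, step))"
    )
    return conn


def record_step(conn, run_id, step, tool, inputs, output, decision):
    """Persist what was called, with what, what came back, and why."""
    conn.execute(
        "INSERT INTO steps VALUES (?, ?, ?, ?, ?, ?)",
        (run_id, step, tool, json.dumps(inputs), json.dumps(output), decision),
    )
    conn.commit()


def get_step(conn, run_id, step):
    """Answer 'what happened at step N?' for a given run."""
    row = conn.execute(
        "SELECT tool, inputs, output, decision FROM steps "
        "WHERE run_id = ? AND step = ?",
        (run_id, step),
    ).fetchone()
    if row is None:
        return None
    tool, inputs, output, decision = row
    return {
        "tool": tool,
        "inputs": json.loads(inputs),
        "output": json.loads(output),
        "decision": decision,
    }
```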

Without reproducibility, you cannot debug. Without auditability, you cannot trust. These are not theoretical concerns. They show up the first time an agent run produces a wrong answer that someone acts on.

The Cost of Not Knowing

The costs compound. They do not appear one at a time.

Silent failures. An agent drifts from its objective, completes a different task, and returns output that looks correct at a glance. No one catches it because the run reported success. The drift is only discovered days later when someone depends on the output and it does not cover what they need.

Orphaned processes. The model stops generating, but a background tool call is still running. The orchestrator considers the run complete. The background process finishes, writes a file, and that file sits undiscovered until it conflicts with a later run. The original run is long gone from the logs. No way to trace the orphan back to its parent.

Overconfident outputs with no provenance. The agent produces a detailed analysis. It cites sources, references data, and draws conclusions. None of the citations are real. The data was hallucinated. But the output reads well, so it gets pasted into a document and circulated. Provenance tracking, where each claim links to a verifiable artifact, prevents this. Most agent systems do not have it.

GPU time burned on unverifiable work. An agent runs for thirty minutes on a GPU. It produces output that cannot be verified because there is no trace, no evidence, and no structured state record. You have an expensive text file and no way to determine if it is correct. This is not sustainable at scale.

Erosion of trust. Every silent failure, every hallucinated citation, every orphaned process makes people trust agent output less. Not the model. The output. The work product. When people stop trusting the output, they start re-doing the work manually to verify it. The agent becomes an expense that buys you nothing: you run it, then you redo its work. Trust, once lost in production systems, takes a long time to rebuild.

What to Do

The following steps are not aspirational. They are things I have implemented, in some form, for every agent system I have put into production use.

Track the process tree.

Do not treat the agent as a single process. It is a process tree: the orchestrator spawns tool calls, tool calls spawn sub-processes, sub-processes write files. Track every node in that tree. Record when each node starts, when it exits, and what exit code it returns. If a leaf process is still running when the orchestrator declares completion, the run is not done. Period.
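A small sketch of that bookkeeping, assuming the orchestrator can register each spawned tool call or sub-process as a node. `ProcNode` and `unfinished` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcNode:
    name: str
    started_at: float
    exited_at: Optional[float] = None
    exit_code: Optional[int] = None  # None means the node has not exited yet
    children: list = field(default_factory=list)


def unfinished(node):
    """Return the names of all nodes in the tree that have not exited."""
    pending = [] if node.exit_code is not None else [node.name]
    for child in node.children:
        pending.extend(unfinished(child))
    return pending
```

If `unfinished(root)` returns anything when the orchestrator wants to declare completion, the run is still in progress.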

Collect evidence before generating artifacts.

Structure the agent’s workflow so that evidence collection happens before synthesis. If the agent needs to produce a research summary, it should first collect the papers, extract the relevant data, and store those raw materials as artifacts. The summary is then generated from the artifacts, not from the model’s parametric memory. This makes the output verifiable: you can check the artifacts against the claims.

This is a workflow constraint, not a model capability issue. The same model that hallucinates citations when generating from memory will produce accurate, verifiable output when generating from collected artifacts. The difference is infrastructure, not intelligence.

Install quality gates that reject incomplete output.

A quality gate is an automated check that runs between the agent producing output and that output being accepted. The simplest gate: does the output reference artifacts that exist? If the agent claims to have run a test, does a test result file exist? If the agent cites a URL, does the URL return a 200? These checks are not expensive. They catch a surprising number of failures.

More sophisticated gates check coverage: did the agent address every item in the objective? Did it produce the minimum set of deliverables? Did it stay within the assigned scope?

Gates should reject output, not warn. A warning is a log line nobody reads. A rejection forces the agent to retry or forces a human to intervene. Both outcomes are better than accepting bad output silently.
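The simplest gate described above might look like the following sketch. It assumes the agent's output is a dict with an `artifacts` field listing the file paths it references; that field name and the `GateRejected` exception are assumptions for the example.

```python
import os


class GateRejected(Exception):
    """Raised when output fails a gate, so it cannot be accepted silently."""


def artifacts_exist_gate(output):
    # `output["artifacts"]` is an assumed field listing referenced file paths.
    missing = [p for p in output.get("artifacts", []) if not os.path.exists(p)]
    if missing:
        raise GateRejected(f"referenced artifacts not found: {missing}")


def accept(output, gates):
    """Run every gate; any failure raises instead of logging a warning."""
    for gate in gates:
        gate(output)
    return output
```

Raising an exception is what turns the check into a rejection: the caller has to handle it by retrying or escalating.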

Prevent overlapping GPU work with dispatch guards.

When multiple agent runs target the same GPU resources, you get contention, OOM errors, and degraded output quality. A dispatch guard is a coordination layer that ensures only the approved set of runs are active on a given resource at a given time. It is a semaphore for GPU work.

This is not about efficiency. It is about correctness. An agent run that gets preempted mid-inference because another run grabbed its GPU produces corrupted output. The orchestrator often does not detect this. The output looks normal but is incomplete or incoherent. Dispatch guards prevent the condition entirely.
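A minimal in-process sketch of the idea, using one semaphore per GPU. It assumes all runs are dispatched from a single Python process; coordinating across machines would need a shared lock service instead. `DispatchGuard` is a hypothetical name.

```python
import threading
from contextlib import contextmanager


class DispatchGuard:
    def __init__(self, max_active_per_gpu=1):
        self._max = max_active_per_gpu
        self._slots = {}
        self._lock = threading.Lock()

    def _semaphore(self, gpu_id):
        # Create the per-GPU semaphore lazily, guarded against races.
        with self._lock:
            if gpu_id not in self._slots:
                self._slots[gpu_id] = threading.BoundedSemaphore(self._max)
            return self._slots[gpu_id]

    @contextmanager
    def reserve(self, gpu_id):
        """Hold a slot on gpu_id for the duration of the run, or fail fast."""
        sem = self._semaphore(gpu_id)
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"GPU {gpu_id} is busy; run not dispatched")
        try:
            yield
        finally:
            sem.release()
```

Failing fast keeps the decision explicit: a run is either dispatched with a reserved slot or not started at all.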

Verify exit states explicitly.

Do not infer completion from silence. After the model stops generating, check: did all tool calls return? Did all background processes exit? Did the model’s final message indicate completion or truncation? Does the output artifact manifest match what was requested?

If any check fails, the run state is “failed,” not “completed with warnings.” Record the failure reason. Surface it to the operator. Do not return a partial result as if it were a complete one.
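Those checks can be collapsed into one function that returns either "completed" or "failed" with reasons. This is a sketch; the parameter names and the `"completed"` stop-reason value are assumptions, since each model API reports termination differently.

```python
def final_state(pending_tool_calls, live_background_pids, stop_reason,
                produced, requested):
    """Return ("completed", []) only if every explicit check passes."""
    failures = []
    if pending_tool_calls:
        failures.append(f"tool calls never returned: {sorted(pending_tool_calls)}")
    if live_background_pids:
        failures.append(
            f"background processes still running: {sorted(live_background_pids)}"
        )
    if stop_reason != "completed":
        failures.append(f"model stopped with reason '{stop_reason}'")
    missing = sorted(set(requested) - set(produced))
    if missing:
        failures.append(f"missing artifacts: {missing}")
    return ("failed", failures) if failures else ("completed", [])
```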

Treat the agent like a production job.

This is the through-line. An agent run is not a REPL session. It is not a chat. It is a production job with inputs, outputs, side effects, and failure modes. It deserves the same infrastructure discipline you would apply to a cron job, a database migration, or a deployment pipeline.

That means: state machines, not status flags. Structured logs, not console output. Artifact manifests, not loose files. Exit codes, not silence. Dependency tracking, not fire-and-forget tool calls.

The model is the compute. The infrastructure around the model is the system. The system is what determines whether the output is trustworthy. Build the system accordingly.

JetBrains Academy – April Digest

Posted May 1, 2026 by DevegygiebyOL

Hey!

April brought many good reasons to open your IDE.

Learn about a new DeepLearning.AI collab on spec-driven development, a beginner-friendly full-stack chat app course, a Kotlin certificate you can add to your LinkedIn profile, and fresh research on which AI coding tools developers actually use at work.

Learning highlights

Build Your First 3D Game With AI

Take our short course and build a 3D browser game in WebStorm using an AI coding agent. Join Tode the Frog as he battles waves of enemies to defend his home! You can finish it with Coursera’s seven-day free trial. Leave a review, send the screenshot to education@jetbrains.com, and get one free month of JetBrains AI Pro (10 AI Credits).

Start for free

Spec-Driven Development With Coding Agents

Move beyond vibe coding! This short course by DeepLearning.AI, created in partnership with JetBrains, shows how to use clear specs, iterative workflows, and agent skills to build software with more control.

Enroll for free

Kotlin Professional Certificate

This new certification path by JetBrains and LinkedIn Learning takes you from Kotlin essentials to multiplatform development with Ktor and Compose. Finish all four courses, pass the final exam, and add the certificate straight to your LinkedIn profile.

Start learning

100-Day Python Challenge in Your IDE

Angela Yu’s bestselling Python bootcamp is now available as an in-IDE course on JetBrains Academy. One hundred days, one hundred projects, including web scraping, APIs, GUIs, data science, and more. It’s free, self-paced, and designed to fit around your life.

Take the course

Full-Stack JavaScript: Build a Real-Time Chat App

Build a real-time chat app from scratch and see how modern web apps come together. You’ll start with a backend in Node.js and Express, then add a React frontend, and finish with a project that’s strong enough for your portfolio.

Start building

Building Production-Grade Tools for AI Agents: What Works After 100 Deployments

Posted May 1, 2026 by DevegygiebyOL

Most developers who build AI agents make the same mistake: they spend weeks designing the orchestration layer, tuning the system prompt, and picking the right model — then hand the LLM a pile of hastily wrapped API endpoints and wonder why it fails in production.

Here’s the hard truth from teams shipping agents daily: tool design has a larger impact on agent reliability than prompt engineering. A well-crafted tool prevents hallucinations at the structural level. A poorly crafted tool guarantees them.

This article walks through what we’ve learned from building, deploying, and debugging production AI agents across dozens of real-world workflows. You’ll get concrete patterns, working code examples, and the anti-patterns that cost us the most in production incidents.

The Contract Between Deterministic and Non-Deterministic Code

When you write a function for another developer, you’re working between two deterministic systems. Same input, same output. The calling code knows exactly what to expect.

An AI tool is a fundamentally different contract. You’re writing an interface between deterministic code (your backend service, database, or API) and a non-deterministic consumer (the LLM). The model might:

Call your tool when you expected it to use something else
Send malformed arguments because the description was ambiguous
Retry your tool three times because the error message didn’t tell it why it failed
Ignore your tool entirely because the description didn’t explain when to use it

This means every tool needs five components that traditional APIs never bothered with: a precise name, a rich description, a strict input schema, structured error handling, and a predictable output format. Let’s build each one.

1. Naming: The First Signal the LLM Evaluates

The tool name is the first thing the model scans when deciding which tool to call. It functions like a class name in a codebase — it sets expectations before any other signal.

# Bad: vague, could mean anything
@mcp_tool(name="process")
def process(data):
    ...

# Bad: too generic, overlaps with other tools
@mcp_tool(name="get_data")
def get_data(query: str):
    ...

# Good: specific verb + noun, clear scope
@mcp_tool(name="list_overdue_invoices")
def list_overdue_invoices(customer_id: str):
    ...

# Good: resource_action pattern
@mcp_tool(name="invoice_send_reminder")
def invoice_send_reminder(invoice_id: str, channel: str):
    ...

Pick one convention — verb_noun or resource_action — and enforce it across every tool on your server. Mixing conventions forces the LLM to learn two mental models, and under load, it will confuse them. We saw a 23% drop in correct tool selection on a production agent when the team had get_user, user_create, and delete_file all coexisting with no pattern.
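A convention like this is easy to enforce in CI. The sketch below checks the verb_noun variant; the approved verb list is an assumption and would be whatever your team standardizes on.

```python
import re

# Assumed verb list for the verb_noun convention; adjust to your own server.
APPROVED_VERBS = {"list", "get", "create", "update", "delete", "send", "search"}
NAME_RE = re.compile(r"^[a-z]+(_[a-z]+)+$")


def naming_violations(tool_names):
    """Return tool names that do not follow the verb_noun convention."""
    bad = []
    for name in tool_names:
        verb = name.split("_", 1)[0]
        if not NAME_RE.match(name) or verb not in APPROVED_VERBS:
            bad.append(name)
    return bad
```

Run against the mixed set mentioned above, it flags `user_create` as the name that breaks the pattern.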

2. Descriptions: Embedded Prompt Engineering

The tool description is the most underestimated field in tool design. The LLM reads this to decide when to use the tool and what it will get back. It’s prompt engineering baked into the tool definition itself.

# Bad: says nothing about when to use the tool or what it returns
BAD_DESCRIPTION = "Searches the database"

GOOD_DESCRIPTION = """
Full-text search across the company knowledge base.
Use when the user asks to find internal documentation, policies, or technical specs.
Returns up to 10 results ranked by relevance, each with title, snippet, and URL.
Does NOT search emails or chat messages — use search_communications for those."""

Notice what the good description does: it says what it does, tells the LLM when to use it, describes the output shape, and explicitly states what it won’t do. That last part is critical — explicit negative boundaries prevent the LLM from reaching for the wrong tool when it’s close-but-not-right.

A real measurement from our deployments: improving tool descriptions alone — no code changes — cut task completion time by 40% and reduced wrong-tool selection by 60%.

3. Input Schemas: Never Trust the LLM

Models hallucinate parameter values, confuse types, and invent fields that don’t exist. Your tool must validate every input before processing. JSON Schema constraints are your first line of defense:

GOOD_INPUT_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {
            "type": "string",
            "minLength": 1,
            "maxLength": 500,
            "description": "Natural language search query or exact document title"
        },
        "limit": {
            "type": "integer",
            "minimum": 1,
            "maximum": 50,
            "default": 10,
            "description": "Maximum number of results to return"
        },
        "category": {
            "type": "string",
            "enum": ["engineering", "hr", "finance", "legal", "all"],
            "default": "all",
            "description": "Restrict search to a specific document category"
        }
    },
    "required": ["query"],
    "additionalProperties": False
}

Enums eliminate entire classes of failures. When environment accepts only "staging" or "production", the LLM can’t invent "prod-us-east" and crash your deployment script. We’ve found that using enums and regex patterns for parameters eliminated 80% of runtime validation errors in production.
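To show what server-side enforcement of such a schema involves, here is a hand-rolled validator for a small subset of JSON Schema (required fields, unknown fields, string length, enums, integer ranges, defaults). It is a sketch for illustration only; a production server would more likely rely on a complete JSON Schema validator library. `SEARCH_SCHEMA` is a trimmed copy of the schema above.

```python
SEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "minLength": 1, "maxLength": 500},
        "limit": {"type": "integer", "minimum": 1, "maximum": 50, "default": 10},
        "category": {"type": "string", "enum": ["engineering", "hr", "all"],
                     "default": "all"},
    },
    "required": ["query"],
    "additionalProperties": False,
}


def validate_args(schema, args):
    """Check args against a small JSON Schema subset; return (clean_args, errors)."""
    props = schema.get("properties", {})
    errors = [f"missing required field: {n}"
              for n in schema.get("required", []) if n not in args]
    if schema.get("additionalProperties") is False:
        errors += [f"unknown field: {n}" for n in args if n not in props]
    clean = {}
    for name, spec in props.items():
        if name not in args:
            if "default" in spec:
                clean[name] = spec["default"]  # fill documented defaults
            continue
        value = args[name]
        if spec["type"] == "string":
            if not isinstance(value, str):
                errors.append(f"{name}: expected a string")
                continue
            if (len(value) < spec.get("minLength", 0)
                    or len(value) > spec.get("maxLength", len(value))):
                errors.append(f"{name}: length out of range")
            if "enum" in spec and value not in spec["enum"]:
                errors.append(f"{name}: must be one of {spec['enum']}")
        elif spec["type"] == "integer":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{name}: expected an integer")
                continue
            if (value < spec.get("minimum", value)
                    or value > spec.get("maximum", value)):
                errors.append(f"{name}: out of range")
        clean[name] = value
    return clean, errors
```

The error strings name the offending field, which gives the model something concrete to correct on its next attempt.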

Poka-Yoke Parameters

Take it a step further with poka-yoke design — making misuse structurally impossible:

# Instead of accepting free-text paths that cause path traversal:
{"path": {"type": "string"}}  # bad

# Use enums with absolute paths for known configs:
{"config": {
    "type": "string",
    "enum": ["/etc/prod/config.yaml", "/etc/staging/config.yaml"]
}}  # good

4. Error Handling: Errors Are Prompts for the LLM

When a tool fails, the LLM needs enough information to decide whether to retry, try a different tool, or ask the user for help. Opaque errors like "Internal Server Error" leave the model stranded.

MCP has two error mechanisms, and conflating them causes silent failures:

Protocol-level errors (JSON-RPC): unknown tool, malformed arguments, server unavailable. The call never reached your tool logic.
Tool execution errors (isError: true): the tool ran but failed. The agent can reason about these.

# Bad: generic error, LLM cannot reason about what went wrong
return {"error": "Something went wrong"}

# Good: structured error with actionable context via isError
return {
    "isError": True,
    "content": [{
        "type": "text",
        "text": json.dumps({
            "error": "RATE_LIMIT_EXCEEDED",
            "message": "Search API rate limit reached. Maximum 10 requests per minute.",
            "retryAfterSeconds": 30,
            "suggestion": "Wait 30 seconds before retrying, or narrow the query to reduce result processing time."
        })
    }]
}

This pattern — machine-readable code, human-readable explanation, retry guidance, and an actionable suggestion — eliminates a large class of agent failures where the model receives a cryptic error and hallucinates a recovery path.

5. Output Format: Consistency Is Everything

Unpredictable output formats force the LLM to guess, which increases the chance of misinterpretation and downstream errors.

# Bad: inconsistent output shape
def search(term):
    results = db.query(term)
    if results:
        return results  # list of dicts
    return "No results found"  # string — different type entirely!

# Good: consistent envelope, always the same shape
def search(term, limit=10):
    results = db.query(term, limit=limit+1)
    return {
        "status": "success",
        "resultCount": min(len(results), limit),
        "results": [
            {
                "title": r.title,
                "snippet": r.snippet[:200],
                "url": r.url,
                "relevanceScore": r.score
            }
            for r in results[:limit]
        ],
        "hasMore": len(results) > limit
    }

The agent always knows what shape to expect. It doesn’t need to branch on isinstance(result, str) vs isinstance(result, list). That predictability compounds across multi-step workflows.

6. Token Efficiency: The Hidden Cost That Kills ROI

Every tool response goes into the LLM’s context window. Verbose responses burn tokens, increase cost, and degrade reasoning quality as context fills up.

Three strategies that work in production:

Paginate aggressively. Return 10 results with a cursor, not 1,000 records. The agent can page if it needs more.

Support summary modes. Offer detailed=True/False parameters. Default to False. Let the agent request more detail only when needed.

Strip internal metadata. The agent doesn’t need database IDs, internal timestamps, or ORM fields. Return only what the LLM needs to understand and act on the result.

# Internal DB record (terrible for agent context):
{
    "id": "a1b2c3d4-e5f6-7890",
    "_created_at": "2026-04-15T08:23:11.442Z",
    "_updated_at": "2026-04-30T14:07:33.101Z",
    "_tenant_id": "org_48291",
    "name": "John Smith",
    "role": "Product Manager",
    "email": "john@acme.com",
    "status": "active",
    "preferences": {"theme": "dark", "notifications": True, ...}
}

# Agent-friendly output:
{
    "name": "John Smith",
    "role": "Product Manager",
    "email": "john@acme.com",
    "status": "active"
}

We measured a 3.2x reduction in per-task token consumption just by stripping internal metadata from tool outputs. At scale, that’s the difference between a profitable agent and a cost center.
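The first two strategies, pagination and summary mode, can share one helper. This is a sketch that assumes records are plain dicts already loaded in memory; the field lists and the `nextCursor` key are illustrative choices, not part of any standard.

```python
AGENT_FIELDS = ("name", "role", "email", "status")
SUMMARY_FIELDS = ("name", "status")


def page_for_agent(records, cursor=0, page_size=10, detailed=False):
    """Return one page of records, trimmed to the fields the agent needs."""
    fields = AGENT_FIELDS if detailed else SUMMARY_FIELDS
    window = records[cursor:cursor + page_size]
    end = cursor + page_size
    return {
        "results": [{k: r[k] for k in fields if k in r} for r in window],
        # None tells the agent there is nothing more to fetch.
        "nextCursor": end if end < len(records) else None,
    }
```

Internal keys such as `id` or `_tenant_id` never reach the model because only whitelisted fields are copied.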

7. Behavioral Annotations: Signals the Agent Can Act On

The MCP 2025-03-26 spec introduced tool annotations — metadata fields that help agents make smarter decisions about tool invocation:

tool_annotations = {
    "readOnlyHint": True,       # Safe to call without confirmation
    "destructiveHint": False,   # Won't mutate state
    "idempotentHint": True,     # Safe to retry with same args
    "openWorldHint": False      # Only reads from known database
}

These annotations drive real behavior in agent clients. A destructiveHint: true tool triggers a confirmation gate before execution. An idempotentHint: true tool lets the client retry safely on timeout. But remember: annotations are hints, not guardrails. The agent client decides whether to honor them.

Anti-Patterns We’ve Seen in Production

The God Tool

@mcp_tool(name="process_customer_request")
def process_customer_request(request_text: str):
    # Parses intent, searches DB, sends email, updates CRM, creates ticket...
    # This is 6 operations fused into one. When step 3 fails, the agent
    # cannot retry steps 4-6 independently.
    ...

Keep tools atomic. One tool, one purpose. If it needs to do X and Y, it should be two tools that the agent composes. Atomic tools are easier to test, easier for the LLM to reason about, and easier to compose into complex workflows.

Tool Description Drift

Your tool description says “returns a list of users.” Six months later, after a refactor, it returns a paginated object with users and total_count fields. The description was never updated. The agent breaks silently.

Treat tool descriptions as living documentation. When you run evals (and you should), include description accuracy checks in your validation pass.

Silent Failure Swallowing

def get_metric(name):
    try:
        return metrics_api.get(name)
    except Exception:
        return {"data": []}  # agent thinks everything is fine

The agent received what looks like a valid but empty response. It proceeds with wrong assumptions. Always return the failure visibly — isError: true with context — so the agent can reason about recovery.
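A corrected version might look like the sketch below. To keep it self-contained, the metrics client is passed in as a `fetch` callable standing in for `metrics_api.get`; the error code `METRIC_FETCH_FAILED` and the suggestion text are made up for the example.

```python
import json


def get_metric(name, fetch):
    """Fetch a metric; surface failures instead of returning an empty result."""
    try:
        data = fetch(name)
    except Exception as exc:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "error": "METRIC_FETCH_FAILED",
                    "message": f"Could not fetch metric '{name}': {exc}",
                    "suggestion": "Retry once; if it fails again, report that the metric is unavailable.",
                }),
            }],
        }
    return {
        "isError": False,
        "content": [{"type": "text", "text": json.dumps({"data": data})}],
    }
```

An empty `data` list now means the metric really has no points, not that the backend was unreachable.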

A Real Production Tool, End to End

Here’s a complete MCP tool definition that follows every principle above, from a production deployment monitoring service:

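import json  # used by json.dumps in the error payloads below

# server, deploy_api, and DeploymentError are provided by the surrounding service code.
@server.tool(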
    name="deploy_service",
    description=(
        "Deploy a service to the specified environment. "
        "Use this for production and staging deployments. "
        "For rollbacks, use rollback_service instead. "
        "Returns the deployment ID, target version, and current status."
    ),
    input_schema={
        "type": "object",
        "properties": {
            "service": {
                "type": "string",
                "description": "Service name from the service registry. Use list_services to find available names."
            },
            "environment": {
                "type": "string",
                "enum": ["staging", "production"],
                "description": "Target environment for the deployment."
            },
            "version": {
                "type": "string",
                "pattern": r"^v\d+\.\d+\.\d+$",
                "description": "Semantic version to deploy, e.g., v2.4.1."
            }
        },
        "required": ["service", "environment", "version"],
        "additionalProperties": False
    },
    annotations={
        "destructiveHint": True,
        "idempotentHint": True,
        "openWorldHint": False
    }
)
async def deploy_service(service: str, environment: str, version: str):
    try:
        result = await deploy_api.deploy(service, environment, version)
        return {
            "status": "success",
            "deployment_id": result.id,
            "target_version": version,
            "environment": environment,
            "started_at": result.started_at.isoformat()
        }
    except DeploymentError as e:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "error": "DEPLOYMENT_FAILED",
                    "message": str(e),
                    "service": service,
                    "environment": environment,
                    "version": version,
                    "suggestion": "Check the build status with check_build_status before retrying. If the build passed, verify the environment has capacity."
                })
            }]
        }
    except Exception as e:
        return {
            "isError": True,
            "content": [{
                "type": "text",
                "text": json.dumps({
                    "error": "INTERNAL_ERROR",
                    "message": f"Unexpected error during deployment: {str(e)}",
                    "suggestion": "This is not a retryable error. Escalate to the infrastructure team."
                })
            }]
        }

Every principle is represented: precise name, rich description with cross-reference, strict schema with enum and pattern validation, behavioral annotations, structured success output, and structured failure output with actionable suggestions.

Testing Tools With LLMs, Not Just Unit Tests

Unit tests verify your tool returns the right data. They don’t verify the LLM can figure out which tool to call, construct valid arguments, or recover from errors.

The only real test for a tool is: put it in front of an LLM and give it a task. Run an evaluation with 20-50 real-world prompts and measure:

Tool selection accuracy: Did the LLM pick the right tool?
Argument correctness: Did it send valid parameters?
Error recovery: When the tool fails, does the LLM retry productively or hallucinate?
Token efficiency: How many tokens does the tool response consume?

Automate this. Run evals on every PR that changes a tool definition. If a tool description change drops selection accuracy from 95% to 80%, it’s a regression — even if the code itself is perfect.
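The core of such an eval is small. In the sketch below, `choose_tool` is a placeholder for whatever function sends a prompt to your model and returns the name of the tool it selected, and the 2-point tolerance is an arbitrary assumption.

```python
def selection_accuracy(cases, choose_tool):
    """cases is a list of (prompt, expected_tool_name) pairs."""
    hits = sum(1 for prompt, expected in cases if choose_tool(prompt) == expected)
    return hits / len(cases)


def is_regression(baseline, current, tolerance=0.02):
    """Flag a drop in accuracy larger than the allowed tolerance."""
    return current < baseline - tolerance
```

In CI, the baseline would come from the main branch and the current value from the PR, with the job failing when `is_regression` returns True.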

When to NOT Build a Tool

Not every API endpoint needs to be a tool. Some operations are too risky (delete production data), too expensive (run a model training job), or too complex (multi-step workflows that the agent can’t verify). Implement those as workflow primitives in your orchestration layer instead — deterministic code that the agent triggers but doesn’t directly call.

The rule of thumb: if the worst-case outcome of the LLM calling this tool wrong is “the user sees a weird message,” build it as a tool. If it’s “someone loses money” or “the system breaks,” wrap it in your orchestration layer with guardrails first.

The TL;DR is simple: treat every tool definition as if it’s the product, because for an AI agent, it is. The model reads tool descriptions like source code — every word, every constraint, every example matters. Get this right and your agents become dramatically more reliable without touching a single line of prompt engineering.

Exploring a more deterministic approach to AI-assisted code generation

Posted May 1, 2026 by DevegygiebyOL

Introduction

AI coding agents are getting surprisingly good.

In small projects, you can ask them to add features, fix bugs, and even write tests—and they often succeed.

But once your project grows, things start to break down.

In my experience, the issue is not model capability. It’s something more subtle: prompt instability.

The Problem: Prompt Instability

Most coding agents construct prompts dynamically using:

chat history
parts of the codebase
internal heuristics

This means the final prompt is not fully under your control.

As a result:

the same request can produce different outputs
changes can appear in unexpected parts of the codebase
behavior becomes harder to reason about

In small projects, this is manageable.
In larger systems, it becomes risky.

A Different Approach: Treat Prompts Like Source Code

Instead of relying on dynamically constructed prompts, I started experimenting with a different idea:

Treat prompts like source code.

That means:

prompts are explicit
prompts are reusable
prompts can be composed from other prompts
prompt construction is deterministic
prompts are the source of truth, not code

This shifts the workflow from “chatting with an agent” to something closer to designing system architecture.

The Tool: SVI (Structured Vibe Coding)

To explore this idea, I built a small tool called SVI.

SVI generates source code from structured specification files (.svi) written in a Markdown-like format.

Key ideas:

Each .svi file defines how a specific source file should be generated
Prompts can import and reuse other prompts
The final prompt is constructed in a fully controlled and predictable way

Unlike typical coding agents, SVI does not depend on chat history or implicit context.
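To make "deterministic prompt construction" concrete, here is a small Python sketch of the general idea. It is not SVI's implementation: the `library` structure, `build_prompt`, and `prompt_fingerprint` are assumptions made purely for illustration.

```python
import hashlib


def build_prompt(name, library, _seen=None):
    """Expand a prompt and its imports depth-first, in declared order."""
    seen = _seen if _seen is not None else set()
    if name in seen:
        return ""  # each prompt is included at most once
    seen.add(name)
    entry = library[name]
    parts = [build_prompt(dep, library, seen) for dep in entry.get("imports", [])]
    parts.append(entry["text"])
    return "\n\n".join(p for p in parts if p)


def prompt_fingerprint(prompt):
    """Hash of the final prompt; identical inputs give an identical hash."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()
```

Since nothing but the named prompts and their imports feeds the result, the same library always yields the same final prompt and fingerprint.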

Example

Here is a simple .svi file:

# Destination File
hello.js

# Output
function hello()

# Options
ProgrammingLanguage=Node.js
Active=True

# Prompt
Create a function that prints "Hello World", and call this function

Generate the code with:

svi run

The output is generated by an LLM and based on the specification.

Why This Matters

This approach has a few practical benefits:

More predictable results

You know exactly which prompt generated which file, and the same specification always produces the same prompt.

Reusability

Prompts can be shared across files and composed from other prompts.

Lower model requirements

Smaller prompts allow you to use cheaper or even free models; you can adjust the prompt size and complexity to match the LLM you’re using.

Trade-offs

This approach is not a silver bullet.

It requires more upfront structure
It is less flexible than free-form prompting
It changes the workflow from interactive to more declarative

However, for larger projects, this trade-off might be worthwhile.

Conclusion

AI coding agents are powerful, but their current design makes them hard to control at scale.
Treating prompts like source code is one way to bring back structure and predictability.
I’m still experimenting with this approach, and I’d be interested to hear if others have explored similar ideas.

Links

GitHub repository: https://github.com/avrmsoft/svi
