Voice-Controlled Local AI Agent

Building a Voice-Controlled Local AI Agent: Architecture, Models & Lessons Learned

A deep-dive into wiring together Groq Whisper, Ollama, and Gradio into a fully working voice agent.

Why I Built This

The promise of a voice-controlled AI agent is compelling: speak naturally, and the machine understands, decides, and acts. But most tutorials skip the hardest part — how do you get from raw audio to a reliable tool execution, without things falling apart the moment the user says something unexpected?

This article walks through every layer of the system I built: the Speech-to-Text (STT) choice, the intent classification strategy, tool execution, and the UX patterns that make it feel robust rather than brittle.

GitHub: https://github.com/anjali-kumari94/AI-Controlled-voice-agent

Architecture Overview

The system is a linear pipeline with five stages:

Audio Input → STT → Intent Classification → Tool Execution → UI Display

Each stage has a single responsibility and fails gracefully with a user-visible error rather than a silent crash. Let me walk through each.

Stage 1: Audio Input

Two input modes are supported:

  1. Live microphone — Gradio’s built-in gr.Audio(sources=["microphone"]) handles capture
  2. File upload — accepts .wav, .mp3, and .m4a

The choice of Gradio here was deliberate. Streamlit requires workarounds for microphone access, and raw HTML/JS adds maintenance overhead. Gradio abstracts both input modes into a single audio_path string — making the rest of the pipeline input-agnostic.

Stage 2: Speech-to-Text

The local vs. cloud trade-off

My first instinct was to run Whisper locally. It preserves privacy and removes API dependency. But Whisper Large v3 — the most accurate open model — requires about 6 GB of VRAM to run at real-time speed. Most developer laptops (including mine) cannot meet this without significant latency.

The benchmarks told the story clearly:

Setup | Real-time factor | Notes
Whisper Large v3 (local, CPU) | ~8× | 8 seconds of audio takes ~64 s
Whisper Large v3 (local, GPU) | ~0.8× | Requires ≥6 GB VRAM
Groq Whisper API | ~0.3× | Cloud, free tier, ~0.3 s per second of audio
OpenAI Whisper API | ~0.5× | Paid, slightly slower
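
To make the real-time factor concrete: processing time is simply audio length multiplied by the factor. A small sketch using the figures from the table (the function name and the 8-second clip are illustrative choices, not part of the project):

```python
def processing_seconds(audio_seconds: float, real_time_factor: float) -> float:
    """Estimated transcription time: audio length times real-time factor."""
    return audio_seconds * real_time_factor


# Real-time factors copied from the benchmark table.
SETUPS = {
    "Whisper Large v3 (local, CPU)": 8.0,
    "Whisper Large v3 (local, GPU)": 0.8,
    "Groq Whisper API": 0.3,
    "OpenAI Whisper API": 0.5,
}

for name, rtf in SETUPS.items():
    print(f"{name}: ~{processing_seconds(8, rtf):.1f} s for an 8 s clip")
```

Anything above 1× means the user waits longer than they spoke, which is why the CPU option was ruled out for an interactive agent.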

I chose Groq Whisper for three reasons:

  • Best latency on available hardware
  • Free tier (sufficient for a demo)
  • Identical model quality to local Whisper Large v3

For a fully air-gapped deployment, faster-whisper or whisper.cpp are solid alternatives.

Implementation

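import os

from groq import Groq

client = Groq(api_key=os.getenv("GROQ_API_KEY"))

# audio_path is the file path Gradio hands to the pipeline
with open(audio_path, "rb") as f:
    transcription = client.audio.transcriptions.create(
        file=(os.path.basename(audio_path), f),
        model="whisper-large-v3",
        response_format="text",
        language="en",
    )

# response_format="text" yields a plain string, so normalise before use
text = str(transcription).strip()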

One gotcha: Groq returns a plain string (not a dict) when response_format="text". Wrapping it in str() before .strip() avoids type errors.

Stage 3: Intent Classification

This is where most voice agent projects fall short. Naive approaches use keyword matching (“if ‘create’ in text: create_file”). This breaks on real speech: “make me a new file called notes” never mentions “create”, while “don’t create anything yet” does.

My approach: ask the LLM to return structured JSON.

The system prompt

The key insight is to give the model a contract — a specific JSON schema — and validate the output programmatically:

{
  "intents": ["write_code", "create_file"],
  "filename": "retry.py",
  "language": "python",
  "summary_target": null,
  "confidence": 0.92
}

This gives me:

  • Multiple intents in one utterance (compound commands)
  • Suggested filename (so the tool doesn’t have to guess)
  • Detected programming language
  • A confidence score for UI feedback

Fallback handling

LLMs occasionally return malformed JSON, especially with smaller models. The _safe_parse() function strips markdown fences, handles partial JSON, and always returns a valid dict — defaulting to general_chat if classification fails entirely.

Model choice: llama3 vs mistral vs phi3

I tested all three on a set of 20 representative voice commands:

Model | Accuracy (correct intent) | Latency (avg) | JSON validity
llama3 8B | 94% | 3.2 s | 96%
mistral 7B | 89% | 2.8 s | 94%
phi3-mini 3.8B | 82% | 1.6 s | 91%

llama3 wins on accuracy. phi3-mini is worth considering on machines with less than 8 GB RAM.

Stage 4: Tool Execution

Four tools, each isolated in tools.py:

create_file

Creates a blank file or directory in output/. All paths are sanitised to prevent traversal attacks:

# Keep word characters, dashes, dots, and spaces; replace everything else
name = re.sub(r"[^\w\-. ]", "_", os.path.basename(name))
filepath = os.path.join(OUTPUT_DIR, name)
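
For reference, here is a self-contained sketch of the same idea wrapped in a function. The function name, the "untitled" fallback, and the backslash normalisation are assumptions added for illustration; OUTPUT_DIR is assumed to be the output/ folder mentioned above.

```python
import os
import re

OUTPUT_DIR = "output"  # assumed to match the output/ folder


def safe_output_path(name: str) -> str:
    """Map a user-supplied name to a path that stays inside OUTPUT_DIR."""
    # basename() drops directory components, including ../ segments.
    # Backslashes are normalised first so Windows-style paths are covered.
    base = os.path.basename(name.replace("\\", "/"))
    # Replace anything outside word characters, dash, dot, and space.
    cleaned = re.sub(r"[^\w\-. ]", "_", base).strip()
    # A name made only of dots ("", ".", "..") is not a usable filename.
    if not cleaned.strip("."):
        cleaned = "untitled"
    return os.path.join(OUTPUT_DIR, cleaned)


print(safe_output_path("../../etc/passwd"))
print(safe_output_path("my file?.py"))
```

The dots-only check matters because os.path.basename("..") returns ".." unchanged, which would otherwise resolve to the parent of the output folder.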

write_code

Makes a second Ollama call — this time as a code-generation assistant. The system prompt instructs the model to return raw code only (no markdown fences). A regex strip handles the occasional fence anyway.

summarize

Also uses Ollama. If the compound intent includes create_file, the summary is additionally saved to a .md file. This is how compound commands work — the intent dict carries all context, and each tool reads what it needs.

general_chat

Passes the last 10 conversation turns as context. This is the session memory at work — the user can ask follow-up questions naturally.

Compound command routing

The dispatcher strips the meta-label “compound” and routes to each real intent:

active = [i for i in intents if i != "compound"] or ["general_chat"]
for intent_name in active:
    results.append(route_to_tool(intent_name))

This means “Summarize this text and save it to notes.md” correctly triggers both summarize and create_file — and the UI shows both results.
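
A runnable sketch of that routing, with stand-in tools so it executes without Ollama. The TOOLS registry, the dispatch function, and the stub return values are hypothetical; the real tools live in tools.py.

```python
from typing import Callable, Dict, List


# Stand-in tools: each one reads only the fields it needs from the intent dict.
def summarize(ctx: dict) -> str:
    return f"summary of {ctx.get('summary_target')!r}"


def create_file(ctx: dict) -> str:
    return f"created {ctx.get('filename')}"


def general_chat(ctx: dict) -> str:
    return "chat reply"


TOOLS: Dict[str, Callable[[dict], str]] = {
    "summarize": summarize,
    "create_file": create_file,
    "general_chat": general_chat,
}


def dispatch(intent: dict) -> List[str]:
    """Run every real intent in order, passing the full intent dict to each tool."""
    names = [i for i in intent.get("intents", []) if i != "compound"] or ["general_chat"]
    results = []
    for name in names:
        tool = TOOLS.get(name, general_chat)  # unknown labels fall back to chat
        results.append(tool(intent))
    return results


print(dispatch({
    "intents": ["compound", "summarize", "create_file"],
    "filename": "notes.md",
    "summary_target": "this text",
}))
```

Passing the whole intent dict to every tool keeps the dispatcher free of per-tool argument wiring, at the cost of tools sharing one loosely typed context.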

Stage 5: UI — Human-in-the-Loop

File operations are irreversible (at least without undo logic). A key UX decision: pause before executing file ops and ask the user to confirm.

This is toggled by a checkbox. When enabled, the pipeline returns early after intent classification, renders a confirmation panel, and waits. Approve → execute. Reject → cancel with explanation.
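
The control flow can be sketched without any UI code. Everything here is hypothetical: the function names, the status strings, and the assumption that create_file and write_code are the intents that touch the filesystem.

```python
from typing import Callable

# Assumed set of intents that write to disk.
FILE_INTENTS = {"create_file", "write_code"}


def run_or_pause(intent: dict, confirm_enabled: bool,
                 execute: Callable[[dict], str], pending: dict) -> str:
    """Execute immediately, or park the intent until the user decides."""
    touches_files = bool(FILE_INTENTS & set(intent.get("intents", [])))
    if confirm_enabled and touches_files:
        pending["intent"] = intent  # stored until Approve or Reject
        return "Awaiting confirmation"
    return execute(intent)


def resolve(approved: bool, execute: Callable[[dict], str], pending: dict) -> str:
    """Handle the Approve / Reject button for a parked intent."""
    intent = pending.pop("intent", None)
    if intent is None:
        return "Nothing to confirm"
    if not approved:
        return "Cancelled: file operation rejected by user"
    return execute(intent)
```

The pending dict is passed in rather than kept as a global so that each session can own its own copy.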

This pattern is sometimes called “human-in-the-loop” (HITL) and dramatically increases trust in autonomous agents.

Challenges & Lessons Learned

1. Ollama connection handling

Ollama must be running (ollama serve) before the app starts. If it isn’t, every Ollama call raises a ConnectionError. The fix: catch ConnectionError at every Ollama call site and surface a clear message: “Cannot connect to Ollama. Run: ollama serve”.
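
One way to avoid repeating the try/except at every call site is a small decorator. This is a sketch, not the project's code: ollama_guard and fake_chat are hypothetical names, and it catches Python's built-in ConnectionError. An HTTP client may raise its own exception class instead, in which case the except clause needs to name that class.

```python
import functools

OLLAMA_DOWN_MSG = "Cannot connect to Ollama. Run: ollama serve"


def ollama_guard(func):
    """Turn connection failures into a user-facing message instead of a crash."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except ConnectionError:
            # Built-in ConnectionError and its subclasses only; adjust this
            # clause if the HTTP client in use raises a different type.
            return OLLAMA_DOWN_MSG
    return wrapper


@ollama_guard
def fake_chat(prompt: str) -> str:
    # Stand-in for a real Ollama call; always fails to show the fallback.
    raise ConnectionRefusedError("connection refused")


print(fake_chat("hello"))
```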

2. JSON from LLMs is unreliable

Even with "format": "json" in the Ollama API call, some models wrap the JSON in a markdown code block. Always strip fences before parsing, and always have a fallback.

3. Gradio state management

Module-level Python state in a Gradio app is shared by every session. The _pending dict for confirmation state works for a single user, but in a multi-user deployment one user’s pending action could be overwritten or confirmed by another. For production, use gr.State(), which is scoped per session, or a proper database.

4. Audio format diversity

Real users upload everything: .webm, .ogg, .m4a. Groq Whisper handles most formats natively. The only failure mode I encountered was with very low bitrate .ogg files — the workaround is to convert with ffmpeg before sending.
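
A sketch of that ffmpeg workaround, assuming ffmpeg is installed and on PATH. The function names are hypothetical, and the choice of 16 kHz mono WAV is an assumption (it matches the sample rate Whisper models work with), not a setting taken from the project.

```python
import subprocess
from pathlib import Path


def ffmpeg_to_wav_cmd(src: str, dst: str) -> list:
    """Build the ffmpeg command line for a 16 kHz mono WAV conversion."""
    # -y: overwrite dst, -i: input file, -ac 1: mono, -ar 16000: sample rate
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", "16000", dst]


def convert_to_wav(src: str) -> str:
    """Convert src to a sibling .wav file and return the new path."""
    dst = str(Path(src).with_suffix(".wav"))
    if dst == src:
        return src  # already a .wav; nothing to convert
    # check=True raises CalledProcessError if ffmpeg exits with an error.
    subprocess.run(ffmpeg_to_wav_cmd(src, dst), check=True, capture_output=True)
    return dst
```

Building the command as a list (rather than a shell string) avoids quoting problems with filenames that contain spaces.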

What I’d Do Differently

  • Streaming output: Ollama supports streaming tokens. Gradio supports streaming via generators. Wiring these together would make code generation feel much faster.
  • Local STT fallback: Package faster-whisper as a fallback for when Groq is unavailable.
  • Persistent memory: Replace in-process SessionMemory with SQLite so history survives app restarts.
  • Multi-user support: Move all state into gr.State() so multiple users can interact simultaneously.

Conclusion

Building this agent taught me that the hard part of voice AI isn’t any single component — it’s the seams between them. Structured JSON intent classification + graceful fallbacks + a sandboxed execution environment is the recipe that makes the whole thing feel reliable rather than brittle.

If you build on top of this, I’d love to see what you create.

GitHub: https://github.com/anjali-kumari94/AI-Controlled-voice-agent

Published as part of the Mem0 AI/ML Developer Intern assignment.

AVIF in 2026: The Complete Guide to the Image Format That Beat JPEG, PNG, and WebP

AVIF landed quietly in 2019 and spent years as the format nobody used. Then Chrome shipped native support. Then Firefox. Then Safari 16. By 2024, Can I Use reported 93% global browser coverage. And suddenly the question flipped from “can I use AVIF?” to “why aren’t I using AVIF?”

If you’re still serving JPEG and PNG on the web in 2026, you’re leaving performance on the table. Here’s what you need to know.

What AVIF actually is

AVIF (AV1 Image File Format) is a still-image format derived from the AV1 video codec, developed by the Alliance for Open Media, a consortium that includes Google, Apple, Microsoft, Mozilla, Netflix, and Amazon. It’s royalty-free, open-source, and designed to replace JPEG, PNG, and even WebP.

The key technical facts:

  • Lossy and lossless compression in a single format
  • 10-bit and 12-bit color depth (vs JPEG’s 8-bit)
  • HDR and wide color gamut (BT.2020, PQ, HLG)
  • Alpha transparency (like PNG, unlike JPEG)
  • Film grain synthesis (stores grain parameters instead of actual noise)
  • Based on the HEIF container (ISO/IEC 23008-12)

The compression comes from AV1’s intra-frame coding tools: directional prediction, recursive partitioning (up to 128×128 superblocks), and the CDEF (Constrained Directional Enhancement Filter). These tools were designed for video but work remarkably well for still images.

The compression numbers

Let’s be specific. In independent tests by Netflix, Cloudflare, and Google, AVIF consistently outperforms:

Comparison | File size reduction
AVIF vs JPEG | 30-50% smaller at same quality
AVIF vs PNG | 50-70% smaller (lossy mode)
AVIF vs WebP | 15-25% smaller at same quality

For a concrete example: a 500KB JPEG photograph typically compresses to ~250KB as AVIF with no visible quality difference. A 2MB PNG screenshot drops to ~400KB. These are not edge cases; they’re typical results across diverse image types.

Why AVIF matters for web performance

Images account for roughly 50% of the average web page’s total weight, according to HTTP Archive data. Cutting that by 30-50% has real consequences:

  1. Core Web Vitals: Smaller images directly improve LCP (Largest Contentful Paint). Google has confirmed LCP is a ranking signal.

  2. Bandwidth costs: A site serving 1 million page views/month with 500KB of images per page transfers ~500GB. Switching to AVIF cuts that to ~300GB. At CDN rates, that’s real money.

  3. Mobile experience: On 3G connections (still common in emerging markets), a 500KB image takes ~4 seconds to load. A 250KB AVIF takes ~2 seconds. That’s the difference between a bounce and a conversion.

  4. Carbon footprint: Less data transferred means less energy consumed. The Green Web Foundation estimates that data transfer accounts for ~0.06g CO2 per MB.
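
The bandwidth and load-time figures above follow from simple arithmetic. The sketch below reproduces them; the 40% saving and the 1 Mbps effective 3G throughput are assumptions implied by the article's numbers (500GB to 300GB, and 500KB in about 4 seconds), and decimal units are used throughout.

```python
def monthly_transfer_gb(page_views: int, image_kb_per_page: float) -> float:
    """Total image transfer per month in GB (1 GB = 1,000,000 KB)."""
    return page_views * image_kb_per_page / 1_000_000


def load_seconds(size_kb: float, link_mbps: float) -> float:
    """Transfer time for one file, ignoring latency and protocol overhead."""
    kilobits = size_kb * 8
    return kilobits / (link_mbps * 1000)


jpeg_gb = monthly_transfer_gb(1_000_000, 500)  # 500 KB of images per page
avif_gb = monthly_transfer_gb(1_000_000, 300)  # assumed 40% smaller
print(f"JPEG: {jpeg_gb:.0f} GB/month, AVIF: {avif_gb:.0f} GB/month")

# Assumed ~1 Mbps effective throughput on a 3G connection.
print(f"500 KB: {load_seconds(500, 1.0):.0f} s, 250 KB: {load_seconds(250, 1.0):.0f} s")
```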

AVIF vs WebP vs JPEG vs PNG: when to use each

The format landscape isn’t “use AVIF for everything.” Each format still has its place:

Use case | Best format | Why
Photographs for the web | AVIF | Best compression, no visible quality loss
Transparent graphics | AVIF or WebP | Both support alpha; AVIF is smaller
Pixel-perfect screenshots | PNG | Lossless, universal compatibility
Email attachments | JPEG | Universal; every client supports it
Legacy system input | BMP/JPEG | Some systems can’t decode modern formats
Favicons | ICO | Broadest support for tab icons, including legacy browsers
Scalable logos | SVG | Vector; infinitely scalable
HDR photography | AVIF | HDR and wide color gamut with broad browser support

The practical advice: serve AVIF as your primary web format with JPEG/PNG fallbacks using the <picture> element or CDN-based content negotiation.

How to convert images to and from AVIF

You don’t need to install command-line tools or pay for cloud services. Kitmul provides free, browser-based AVIF converters that process everything locally; your images never leave your device.

Converting TO AVIF (reduce file sizes)

  • JPEG to AVIF Converter: The most common conversion. Shrink your photo library by 30-50%.
  • PNG to AVIF Converter: Perfect for screenshots and graphics with transparency.
  • WebP to AVIF Converter: Upgrade from WebP to the next generation; another 20% savings.
  • SVG to AVIF Converter: Rasterize vector graphics into compact AVIF files.
  • BMP to AVIF Converter: Compress raw bitmaps from legacy systems; 95%+ reduction.
  • PDF to AVIF Converter: Extract PDF pages as lightweight AVIF images.
  • ICO to AVIF Converter: Convert icon files for web galleries.

Converting FROM AVIF (for compatibility)

  • AVIF to JPEG Converter: When you need universal compatibility for email or print.
  • AVIF to PNG Converter: Lossless output with transparency for editing workflows.
  • AVIF to WebP Converter: WebP fallback for CDN pipelines.
  • AVIF to PDF Converter: Embed images in professional documents.
  • AVIF to GIF Converter: For email signatures and legacy platforms.
  • AVIF to BMP Converter: Raw pixel data for industrial systems.
  • AVIF to ICO Converter: Create favicons from AVIF source images.

The <picture> element: serving AVIF with fallbacks

The standard pattern for progressive format delivery on the web:

<picture>
  <source srcset="photo.avif" type="image/avif">
  <source srcset="photo.webp" type="image/webp">
  <img src="photo.jpg" alt="Description" loading="lazy">
</picture>

The browser picks the first format it supports. AVIF-capable browsers (93%+) get the smallest file. WebP serves as a fallback for the remaining ~7%. JPEG is the universal safety net.

For responsive images, combine with srcset:

<picture>
  <source
    srcset="photo-400.avif 400w, photo-800.avif 800w, photo-1200.avif 1200w"
    type="image/avif"
    sizes="(max-width: 800px) 100vw, 800px">
  <source
    srcset="photo-400.webp 400w, photo-800.webp 800w, photo-1200.webp 1200w"
    type="image/webp"
    sizes="(max-width: 800px) 100vw, 800px">
  <img
    src="photo-800.jpg"
    alt="Description"
    loading="lazy"
    width="800"
    height="600">
</picture>

CDN and build-tool support

Most modern infrastructure handles AVIF natively:

  • Cloudflare: Auto-converts to AVIF via Polish or Image Resizing
  • Vercel/Next.js: next/image supports AVIF via images: { formats: ['image/avif', 'image/webp'] } in next.config.js
  • Netlify: Automatic AVIF via Image CDN
  • Sharp (Node.js): Full AVIF encode/decode since v0.29
  • Squoosh: Google’s browser-based encoder supports AVIF
  • libavif: The reference implementation by the AOM

Known limitations

AVIF isn’t perfect. Be aware of these trade-offs:

  1. Encoding speed: AVIF encodes 5-10x slower than JPEG. This matters for real-time processing but not for batch/build-time conversion.

  2. Maximum dimensions: The AV1 spec limits individual tiles to 8192×4320 pixels. Larger images require tiling, which some tools don’t support.

  3. Animation support: AVIF supports animated sequences (AVIS), but tooling is immature compared to GIF/WebP animation.

  4. Older browsers: IE11 and pre-2020 browsers don’t support AVIF. Always include fallbacks.

  5. Email clients: Most email clients don’t render AVIF. Use AVIF to JPEG for email content.

  6. Print workflows: Print shops typically expect TIFF, PDF, or high-quality JPEG. Convert with AVIF to PDF before sending to print.

The bottom line

AVIF is the best general-purpose image format available in 2026. It delivers smaller files than any alternative at equivalent quality, supports features no other format offers (HDR, wide gamut, film grain synthesis), and has near-universal browser support.

The migration path is straightforward:

  1. Convert your existing images to AVIF (use our free converters)
  2. Serve AVIF with <picture> fallbacks
  3. Keep JPEG/PNG originals for compatibility workflows

Every kilobyte you save loads faster, ranks better, and costs less. The format war is over. AVIF won.

This Concurrency Bug Stayed Hidden for a Year

We had a background job that processed thousands of records in parallel.
Each batch ran concurrently, and we kept track of total successful and failed records.

Everything worked perfectly.

For almost a year.

Then one day, the totals started becoming… wrong.

No exceptions.
No crashes.
Just incorrect numbers.

The Setup

  • Records processed in chunks
  • Multiple chunks running concurrently
  • Shared counters tracking totals
  • Periodic database updates with progress

All standard parallel batch processing.

And yet — totals drifted.

The Symptom

  • Some runs showed fewer successful records than expected
  • Re-running the same data produced different counts
  • The issue appeared only in one environment

Classic signs of a concurrency issue.

But the tricky part?

We were already using thread-safe collections.

What Was Actually Happening

Imagine two workers updating the same counter:

Initial total = 10

Worker A reads total (10)
Worker B reads total (10)

Worker A increments → 11
Worker B increments → 11  (overwrites A)

Final total = 11  ❌ (should be 12)

No exception.
No crash.
Just a lost update.

This is a race condition.
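
The interleaving can be replayed deterministically. The sketch below is in Python purely for illustration (the article's code is C#), and it performs the two reads and two writes by hand in the order shown, so no real threads are involved.

```python
def interleaved_increments(total: int) -> int:
    """Replay the bad schedule: both workers read before either writes."""
    a_read = total        # Worker A reads 10
    b_read = total        # Worker B reads 10
    total = a_read + 1    # Worker A writes 11
    total = b_read + 1    # Worker B writes 11, overwriting A's update
    return total


def serialized_increments(total: int) -> int:
    """The schedule we wanted: each read-modify-write finishes before the next."""
    total = total + 1     # Worker A: read 10, write 11
    total = total + 1     # Worker B: read 11, write 12
    return total


print(interleaved_increments(10))  # 11: one update lost
print(serialized_increments(10))   # 12: both updates kept
```

With real threads the bad schedule only happens occasionally, which is exactly why the bug is hard to reproduce.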

The Buggy Code

A simplified version looked like this:

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        totalSuccess++; // not atomic
    }
});

++ is not atomic. It performs:

  1. Read
  2. Increment
  3. Write

Multiple threads interleaving these steps leads to lost updates.

Why volatile Alone Doesn’t Fix It

A common attempt is to use volatile:

private static volatile int totalSuccess = 0;

This ensures visibility, but not atomicity.

Two threads can still:

  • read the same value
  • increment
  • overwrite each other

So volatile alone does not solve the race.

Why It Took a Year to Appear

Concurrency bugs are timing dependent.

The race condition existed from the beginning, but it didn’t surface consistently.
In fact, it only appeared in one environment.

Subtle runtime differences — thread scheduling, CPU contention, and execution timing — made overlapping updates more likely there, eventually exposing the issue.

No code changes were required.
Just different timing.

The Fix: Atomic Counters

We replaced non-atomic updates with atomic operations:

int totalSuccess = 0;

Parallel.ForEach(records, record =>
{
    if (Process(record))
    {
        Interlocked.Increment(ref totalSuccess);
    }
});

This guarantees increments are atomic.

The Real-World Fix: Snapshot-Based Progress Reporting

We also had periodic progress updates.
Multiple workers updated counters while one periodically persisted totals.

The correct pattern was:

var finished = Interlocked.Increment(ref completedChunks);

if (finished % maxConcurrency == 0)
{
    var successSnapshot = Volatile.Read(ref totalSuccess);
    var failureSnapshot = Volatile.Read(ref totalFailed);

    job.TotalSuccessfulRecords = successSnapshot;
    job.TotalFailedRecords = failureSnapshot;

    await UpdateJobProgress(job);
}

Why This Works

  • Interlocked → atomic updates
  • Volatile.Read → latest visible value
  • Snapshot → consistent progress reporting
  • Batched DB updates → reduced contention

This eliminates inconsistent totals.

Additional Improvement: Local Aggregation

To reduce contention further:

Parallel.ForEach(chunks, chunk =>
{
    int localSuccess = 0;
    int localFailure = 0;

    foreach (var record in chunk)
    {
        if (Process(record))
            localSuccess++;
        else
            localFailure++;
    }

    Interlocked.Add(ref totalSuccess, localSuccess);
    Interlocked.Add(ref totalFailed, localFailure);
});

This minimizes shared writes.
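
For readers outside .NET, here is a rough Python analogue of the same local-aggregation pattern. Python has no Interlocked.Add, so a threading.Lock stands in for the atomic add; the function and variable names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_results(chunks, process, max_workers=4):
    """Process chunks in parallel; each worker touches shared state once per chunk."""
    totals = {"success": 0, "failed": 0}
    lock = threading.Lock()

    def run_chunk(chunk):
        local_success = 0
        local_failed = 0
        for record in chunk:
            if process(record):
                local_success += 1   # thread-local, no contention
            else:
                local_failed += 1
        # One guarded update per chunk instead of one per record.
        with lock:
            totals["success"] += local_success
            totals["failed"] += local_failed

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() drains the iterator so worker exceptions are re-raised here.
        list(pool.map(run_chunk, chunks))
    return totals["success"], totals["failed"]


chunks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
print(count_results(chunks, lambda r: r % 3 != 0))
```

With ten chunks of 100 records, the lock is taken ten times rather than a thousand, which is the same contention saving the C# version gets from Interlocked.Add.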

Lessons Learned

  • Thread-safe collections ≠ thread-safe logic
  • ++ is not atomic
  • volatile ensures visibility, not correctness
  • Use Interlocked for counters
  • Snapshot values using Volatile.Read
  • Reduce shared mutable state
  • Batch progress updates
  • Concurrency bugs are timing dependent

Takeaway

If you’re running parallel batch jobs and tracking totals:

  • Use atomic counters
  • Take snapshot reads for reporting
  • Avoid frequent shared writes

Otherwise, everything may look fine…

Until it doesn’t.

I Shrunk My Docker Image From 1.58GB to 186MB. Then I Had to Explain What I Actually Broke.

Most Docker tutorials end at the win.

“Look, smaller image! Ship it!” And then you’re left alone at 11pm wondering why your perfectly optimized container is crashing in production doing something it did fine before.

This article doesn’t do that. We’re going through both sides: how I got from 1.58GB to 186MB, every error I hit along the way, and the honest conversation about what Alpine actually takes away from you. Because the shrink is real. But so are the trade-offs.

First, What Even Is a Docker Image?

Your app works on your machine because your machine has Node installed, the right OS, the right dependencies. Someone else’s server has none of that. Docker fixes this by packaging your app together with everything it needs to run — the runtime, the OS slice, the dependencies — into a sealed portable unit called an image.

A Dockerfile is the recipe. docker build executes it and produces the image. That image can now run anywhere Docker is installed, identically.

The problem is most beginners write that recipe without thinking about what goes into the package. I learned this the hard way — and I want to save you the 11pm production surprise. So let’s do this properly: the win, the errors, and everything the win quietly broke.

The Fat Build

Here’s the Dockerfile I started with:

FROM node:18
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
CMD ["node", "app.js"]

Clean. Readable. Standard tutorial stuff.

When you build this and check the image size, the number that comes back stops you cold. 1.58 gigabytes. For a Node.js app that runs a simple HTTP server.

Every layer bakes into that image permanently. RUN npm install alone contributes megabytes of frozen layer. COPY . . adds more on top. Every one of those is locked inside the image forever.

The problem is not the app. The app is tiny. The problem is node:18. That base image is built on Debian Linux — a full operating system — and ships with compilers, build tools, package managers, debugging utilities, and about 400MB of things you will never use in production. When your npm install runs on top of that, all of it bakes into the final image together.

You are shipping the construction site instead of the finished building.

The .dockerignore vs .gitignore Mistake

Before we go further: this caught me early and it will catch you too.

.dockerignore and .gitignore are completely separate files.

  • .dockerignore tells Docker what not to copy into the build context.
  • .gitignore tells Git what not to track.

I had a .dockerignore but no .gitignore. When I pushed to GitHub, my entire node_modules folder went with it — hundreds of files committed to the repo. I had to go back and clean the git history.

Always create both. They often contain the same entries but they serve different tools entirely. Get this right before you build anything else.

Enter Multi-Stage Builds

The fix is separating your build environment from your runtime environment.

  • Build environment needs everything: the full OS, npm, build tools, all of it.
  • Runtime environment needs almost nothing: just Node and your app files.

Multi-stage builds let you use both in one Dockerfile, but only ship the second one.

# Stage 1: builder (does the work, never ships)
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

# Stage 2: runtime (only this becomes your image)
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/app.js ./app.js
COPY --from=builder /app/package.json ./package.json
CMD ["node", "app.js"]

The COPY --from=builder line is the bridge. It reaches back into Stage 1 and pulls only what you specify. Everything else in Stage 1 — the full Debian OS, the compiler tools, the cache — gets discarded and never touches the final image.

Simple idea. But getting there cost me three separate errors.

Error 1: The Empty Dockerfile

ERROR: failed to build: failed to solve: the Dockerfile cannot be empty

I ran docker build before writing anything in the file. The file existed but was empty. Not a deep error — but worth including because it’s the kind of thing that makes you feel stupid for ten seconds before you realise it’s just a file issue.

Fix: write something in the file before you build it.

Error 2: The NUL Character Ambush

After the fat build succeeded I set up my .dockerignore using PowerShell’s echo command:

echo "node_modules" > .dockerignore
echo ".git" >> .dockerignore
echo "*.log" >> .dockerignore
echo ".env" >> .dockerignore

Built again. Got this:

<input>:1:1: invalid character NUL
<input>:1:3: invalid character NUL
<input>:1:5: invalid character NUL

Sixteen lines of it.

In Windows PowerShell 5.1, the > and >> redirection operators write files in UTF-16 LE with a BOM by default (PowerShell 7 and later default to UTF-8). Docker’s parser expects UTF-8. The invisible encoding header and the null bytes between every character made the entire file unreadable to Docker.
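
To see where the NUL characters come from, the bytes can be reproduced in a few lines of Python. This is an illustration of the encoding only; the function name is hypothetical, and the CRLF line ending is an assumption about what the redirect appends.

```python
import codecs


def utf16_le_with_bom(line: str) -> bytes:
    """Bytes for one echoed line as UTF-16 LE with a BOM (CRLF ending assumed)."""
    return codecs.BOM_UTF16_LE + (line + "\r\n").encode("utf-16-le")


data = utf16_le_with_bom("node_modules")
print(data[:8])               # BOM, then each ASCII letter followed by a zero byte
print(data.count(b"\x00"))    # one NUL per ASCII character

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    # The BOM byte 0xff is never valid in UTF-8.
    print("not valid UTF-8:", exc.reason)
```

Every ASCII character becomes two bytes, the second of which is zero, so a UTF-8 parser sees a NUL after every letter.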

The build still finished because Docker warned and continued — but my .dockerignore was being completely ignored. node_modules was getting copied into the build context on every single build, silently, without telling me.

The fix — always do this on Windows:

"node_modules`n.git`n*.log`n.env" | Out-File -FilePath .dockerignore -Encoding utf8

Or create the file in VS Code and confirm it saves as UTF-8. Never trust PowerShell echo for config files that other tools will read.

Error 3: The builder Name Collision (The Sneaky One)

This is the one that will catch most beginners.

I wrote my multi-stage Dockerfile but forgot AS builder on my first FROM statement:

FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/app.js ./app.js
COPY --from=builder /app/package.json ./package.json
CMD ["node", "app.js"]

Built it. Got this:

ERROR: failed to build: failed to solve: builder: failed to resolve 
source metadata for docker.io/library/builder:latest: pull access 
denied, repository does not exist

Docker looked at --from=builder and thought I was referencing an external Docker Hub image called builder. It went to Docker Hub looking for library/builder:latest. That image does not exist.

--from=builder only works when builder is an alias defined with AS builder in an earlier FROM statement. Without it, Docker has nothing to reference locally and defaults to treating builder as an external image name.

The fix:

# AS builder here is not optional
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

# Stage 2: no alias needed, this is the final image
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/app.js ./app.js
COPY --from=builder /app/package.json ./package.json
CMD ["node", "app.js"]

AS builder on the first FROM gives Stage 1 a name. --from=builder references that name. Without it, Docker goes looking on the internet for something that doesn’t exist.

The Result

Image | Disk Usage | Content Size
myapp:fat | 1.58GB | 397MB
myapp:slim | 186MB | 45.6MB

88% reduction. Same app.

The slim image history only contains COPY node_modules, COPY app.js, COPY package.json. That’s it. The entire Debian OS, the build tools, the npm cache — none of it made it through. COPY --from=builder is surgical. You get exactly what you name and nothing else.

Now The Part Most Articles Skip

The slim image runs fine for a basic Node app. But “the app is the same” is only true if your app doesn’t touch anything Alpine removed.

Both images produce the same output. Same server on port 3000. Good so far.

Now run this:

docker run --rm myapp:slim bash

Bash does not exist in Alpine. Alpine only ships sh. Any script in your app or CI pipeline that calls bash will crash. And the error message isn’t clean: it throws a full Node.js module-not-found stack trace, because the official Node image’s entrypoint script hands any command it cannot find to node, and Node then tries to load bash as a script. That’s a deeply confusing error if you don’t know what you’re looking at.

Here’s what else is missing:

glibc: Alpine uses musl libc instead. This is the silent killer. Native npm packages like bcrypt, sharp, canvas, and sqlite3 ship or compile binaries for the libc of the machine where npm install ran. In this Dockerfile that is the Debian builder stage, so the binaries are linked against glibc and break once copied onto Alpine, with no warning during build. The error surfaces at runtime in production when a user tries to do something.

Build toolchain: node:18-alpine still ships npm, but not python3, make, or g++. Running npm install for a native package inside the slim container fails at the node-gyp step.

curl and full debugging tools: Alpine has no curl, and only BusyBox’s minimal versions of utilities such as wget and ps. When something goes wrong in a running Alpine container you have very little to work with.

apt-get: Alpine uses apk instead, which has a much smaller package registry.

So When Is Alpine Actually Safe?

Alpine is safe when:

  • Your app is pure JavaScript with no native compiled dependencies
  • You have no bash scripts in your startup or CI process
  • You don’t need to exec into running containers to debug
  • Your node_modules are all JavaScript packages — run npm install and check for node-gyp in the output. That flags a native package.

Alpine is risky when:

  • You use bcrypt for password hashing
  • You use sharp for image processing
  • You use canvas, sqlite3, puppeteer, or anything that compiles C++ bindings
  • Your Dockerfile or startup scripts reference bash anywhere

If you need native packages but still want a smaller image, use node:18-slim instead of node:18-alpine. It’s Debian-based so it keeps glibc, but strips out the heavy development tools. You’ll land around 300–400MB — not as dramatic as Alpine, but safe for production.

The Decision Framework Before You Slim Any Image

1. Do any of my npm packages use node-gyp?

npm install

Check the output for gyp. If it appears, do not use Alpine.
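
Scrolling through install output is easy to get wrong, so a file-system check can serve as a second opinion. The sketch below walks node_modules looking for binding.gyp files (the node-gyp build description) or compiled .node binaries. The function name is hypothetical, and the heuristic is an assumption: it can miss packages that download binaries in unusual ways.

```python
import os


def find_native_packages(node_modules: str) -> list:
    """Return directories under node_modules that look like native addons."""
    hits = []
    for root, _dirs, files in os.walk(node_modules):
        has_gyp = "binding.gyp" in files                    # node-gyp build file
        has_binary = any(f.endswith(".node") for f in files)  # compiled addon
        if has_gyp or has_binary:
            hits.append(os.path.relpath(root, node_modules))
    return sorted(hits)


if __name__ == "__main__":
    for path in find_native_packages("node_modules"):
        print("native code found in:", path)
```

An empty result is a good sign for Alpine, but it is still worth running the app's test suite inside the slim image before shipping.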

2. Do any of my scripts call bash?

grep -rE '^#!.*bash' --exclude-dir=node_modules .

If yes, switch to sh or do not use Alpine.

3. Do I need to exec into running containers for debugging?

If yes, use node:18-slim instead.

4. Is CI pipeline speed a priority?

Smaller images pull faster in every environment. If you’re running 50 builds a day the difference between 1.58GB and 186MB compounds significantly.

The Full Working Dockerfile

# Stage 1: build environment (discarded after build)
FROM node:18 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .

# Stage 2: runtime environment (this is what ships)
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/app.js ./app.js
COPY --from=builder /app/package.json ./package.json
CMD ["node", "app.js"]

Build and verify:

# Build the slim image
docker build -f slim/Dockerfile -t myapp:slim .

# Compare sizes
docker images myapp

# Confirm the app runs
docker run --rm myapp:slim node app.js

# Confirm what is missing
docker run --rm myapp:slim bash

Full repo with both Dockerfiles, the app, and all screenshots:
github.com/Arbythecoder/docker-optimization

What I Actually Learned

Going from 1.58GB to 186MB felt like a win. It is a win — for the right app.

But the real skill isn’t knowing how to shrink an image. It’s knowing whether to shrink it, what you’re trading away, and how to verify nothing broke before it reaches production.

Most tutorials give you the happy path. Production gives you everything else.

This article is part of my Docker for Production ebook series. Ebook 4 covers the complete pre-deployment checklist for containerized Node.js apps — including the full audit framework before you slim any production image. Follow me on DEV.to, LinkedIn and X to get notified when it drops.


Why your AI agent gets dumber over time (and how to fix memory drift)

Last week, a coding agent in a test repo did something weird: it opened the right files, referenced the wrong API version, and confidently wrote code for a migration we had already rolled back.

Nothing was “broken” in the usual sense. The prompts were fine. The tools were available. The model was good.

The problem was memory drift.

If you’ve built anything with long-running agents, you’ve probably seen it too: the agent starts strong, then gradually retrieves stale facts, outdated decisions, or half-relevant chunks from old work. Over time, its “memory” turns into a confidence amplifier for bad context.

A lot of teams try to solve this with a bigger vector store. That helps… until it doesn’t.

The real issue: vector stores decay quietly

Vector stores are great for fuzzy retrieval. If your agent needs “something similar to this design doc” or “the auth code near this endpoint,” embeddings are useful.

But agent memory is not just similarity search.

It’s often:

  • what changed
  • what supersedes what
  • who approved a decision
  • which fact is still valid
  • what depends on what
  • what should never be forgotten

That’s where vector-only memory starts to decay.

A simple example

Suppose your agent stores these facts over time:

  • JWT auth is used for internal APIs
  • Moved to mTLS for service-to-service auth
  • JWT still used for browser sessions
  • Deprecated auth middleware in v3
  • Hotfix restored old middleware for admin routes

A vector store can retrieve “similar auth-related stuff,” but it won’t naturally answer:

  • which statement is the latest truth?
  • which fact overrides another?
  • which context applies only to admin routes?
  • which decision was temporary?

That’s not an embedding problem. That’s a relationship problem.
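To make the relationship problem concrete, here is a minimal sketch of those five facts as plain JavaScript records. The ids, scope labels, and the supersedes, exceptionTo, and temporary fields are assumptions added for illustration. So is the reading that the mTLS move replaced JWT for internal APIs and that the hotfix is a temporary exception to the v3 deprecation.

```javascript
// Hypothetical fact records: ids, scopes, and link fields are illustrative only.
const facts = [
  { id: "f1", text: "JWT auth is used for internal APIs", scope: "internal-apis", ts: 1 },
  { id: "f2", text: "Moved to mTLS for service-to-service auth", scope: "internal-apis", ts: 2, supersedes: "f1" },
  { id: "f3", text: "JWT still used for browser sessions", scope: "browser-sessions", ts: 3 },
  { id: "f4", text: "Deprecated auth middleware in v3", scope: "middleware", ts: 4 },
  { id: "f5", text: "Hotfix restored old middleware for admin routes", scope: "admin-routes", ts: 5, exceptionTo: "f4", temporary: true },
];

// Which statement is the latest truth for a scope?
// A fact is current if no other fact names it in a supersedes link.
function currentFacts(all, scope) {
  const superseded = new Set(all.filter((f) => f.supersedes).map((f) => f.supersedes));
  return all.filter((f) => f.scope === scope && !superseded.has(f.id));
}

// Which facts carve out an exception to a given fact?
function exceptionsTo(all, id) {
  return all.filter((f) => f.exceptionTo === id);
}

// Which decisions were temporary?
function temporaryFacts(all) {
  return all.filter((f) => f.temporary === true);
}

console.log(currentFacts(facts, "internal-apis").map((f) => f.text));
console.log(exceptionsTo(facts, "f4").map((f) => f.scope));
console.log(temporaryFacts(facts).map((f) => f.id));
```

None of these answers come from comparing text. They come from the explicit links between records, which is the information a vector index does not keep.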

Knowledge graphs don’t replace vectors — they constrain them

The best pattern I’ve seen is:

  • vector store for recall
  • knowledge graph for truth maintenance

Think of it like this:

User query
   |
   v
[Vector Search] ---> finds possibly relevant notes/docs/chunks
   |
   v
[Knowledge Graph] ---> resolves relationships:
                      - supersedes
                      - depends_on
                      - approved_by
                      - valid_for
                      - expires_at
   |
   v
[LLM Context] ---> smaller, fresher, less contradictory

A knowledge graph gives your system structure around memory:

  • entities: services, APIs, users, incidents, tasks
  • edges: supersedes, blocked_by, owned_by, approved_by
  • timestamps: when a fact became true
  • scope: where that fact applies
  • confidence: whether it’s canonical or provisional

Instead of asking “what text looks similar?”, you can ask:

  • “What is the current auth method for internal APIs?”
  • “What decision replaced this one?”
  • “Which open task depends on this migration?”
  • “Which facts are stale after the last deploy?”

That’s how you stop memory from becoming a junk drawer.
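Two of those questions can be sketched with nothing more than an edge list and timestamps. Everything here is hypothetical: the node ids, the depends_on edges, and in particular the definition of "stale", which this sketch takes to mean a fact about a component the deploy touched that was last verified before the deploy.

```javascript
// Hypothetical graph data; ids, fields, and the staleness rule are assumptions.
const nodes = new Map([
  ["migration_42", { kind: "migration" }],
  ["task_a", { kind: "task", status: "open" }],
  ["task_b", { kind: "task", status: "done" }],
  ["task_c", { kind: "task", status: "open" }],
]);

const edges = [
  { from: "task_a", to: "migration_42", type: "depends_on" },
  { from: "task_b", to: "migration_42", type: "depends_on" },
  { from: "task_c", to: "task_a", type: "depends_on" },
];

// "Which open task depends on this migration?"
// Breadth-first walk over depends_on edges, including indirect dependents.
function openDependents(target) {
  const seen = new Set();
  const queue = [target];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const e of edges) {
      if (e.type === "depends_on" && e.to === current && !seen.has(e.from)) {
        seen.add(e.from);
        queue.push(e.from);
      }
    }
  }
  return [...seen].filter((id) => nodes.get(id).status === "open");
}

const facts = [
  { id: "fact_1", about: "auth-service", verifiedAt: 100 },
  { id: "fact_2", about: "billing", verifiedAt: 100 },
  { id: "fact_3", about: "auth-service", verifiedAt: 250 },
];

// "Which facts are stale after the last deploy?"
function staleAfter(deploy) {
  return facts
    .filter((f) => deploy.touched.includes(f.about) && f.verifiedAt < deploy.at)
    .map((f) => f.id);
}

console.log(openDependents("migration_42")); // [ 'task_a', 'task_c' ]
console.log(staleAfter({ at: 200, touched: ["auth-service"] })); // [ 'fact_1' ]
```

A graph database or a library would replace the linear edge scan in a real system. The shape of the queries is the part that carries over.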

A practical rule of thumb

Use a vector store when you need:

  • semantic search
  • fuzzy recall
  • document retrieval
  • broad context gathering

Use a knowledge graph when you need:

  • state over time
  • versioned truth
  • explicit dependencies
  • conflict resolution
  • auditable memory

If you only use vectors, your agent will eventually retrieve both the old answer and the new answer and act like they’re equally valid.
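A toy illustration of that failure mode: the sketch below uses bag-of-words cosine similarity as a stand-in for real embeddings, which is an assumption made so that it runs with no dependencies. Real embedding models score differently, but they share the limitation shown here: the score measures similarity to the query, not which fact is current.

```javascript
// Toy stand-in for an embedding model: word-count vectors plus cosine similarity.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function termCounts(text) {
  const counts = new Map();
  for (const t of tokenize(text)) counts.set(t, (counts.get(t) || 0) + 1);
  return counts;
}

function cosine(a, b) {
  let dot = 0;
  for (const [term, n] of a) dot += n * (b.get(term) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, n) => sum + n * n, 0));
  const denom = norm(a) * norm(b);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical memory contents: one outdated fact, its replacement, and noise.
const memory = [
  { id: "old", text: "JWT for internal APIs" },
  { id: "new", text: "mTLS for internal APIs" },
  { id: "other", text: "Quarterly billing report" },
];

// Return the k entries most similar to the query.
function topK(query, k) {
  const q = termCounts(query);
  return memory
    .map((m) => ({ id: m.id, score: cosine(q, termCounts(m.text)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const hits = topK("auth method for internal APIs", 2);
console.log(hits);
```

Both the old and the new fact come back with the same score, and nothing in the result says which one to trust.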

A tiny runnable example

Here’s a minimal Node example using a graph to resolve the “latest truth” for a fact.

npm install graphology

const Graph = require("graphology");

const graph = new Graph();

// Each node is one version of a fact.
graph.addNode("auth_v1", { value: "JWT for internal APIs", ts: 1 });
graph.addNode("auth_v2", { value: "mTLS for internal APIs", ts: 2 });

// Edge direction: newer fact -> the fact it supersedes.
graph.addDirectedEdge("auth_v2", "auth_v1", { type: "supersedes" });

// Every edge in this graph is a "supersedes" edge, so a node with
// in-degree 0 has not been replaced by anything and is still current.
function currentFact(nodes) {
  return nodes
    .filter((n) => graph.inDegree(n) === 0)
    .map((n) => graph.getNodeAttribute(n, "value"));
}

console.log(currentFact(["auth_v1", "auth_v2"]));
// => [ 'mTLS for internal APIs' ]

Obviously, real systems need more than this. But the core idea matters: memory should encode replacement, not just storage.

What this looks like in production

A useful pattern is:

  1. Store raw docs, chats, and artifacts in a vector index
  2. Extract durable facts into a graph
  3. Mark facts with:
    • source
    • timestamp
    • scope
    • confidence
    • supersession links
  4. Retrieve from both systems
  5. Let the graph filter or rank what the LLM actually sees
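Steps 4 and 5 might look roughly like this. The vector hits are hard-coded and the fact graph is a plain Map, both assumptions made to keep the sketch self-contained. The names selectContext, supersededBy, and confidence are hypothetical as well.

```javascript
// Hypothetical graph of extracted facts, keyed by fact id.
const factGraph = new Map([
  ["auth_v1", { text: "JWT for internal APIs", supersededBy: "auth_v2", confidence: "canonical" }],
  ["auth_v2", { text: "mTLS for internal APIs", supersededBy: null, confidence: "canonical" }],
  ["auth_note", { text: "mTLS rollout pending for batch jobs", supersededBy: null, confidence: "provisional" }],
]);

// Canonical facts outrank provisional ones.
function rank(fact) {
  return fact.confidence === "canonical" ? 1 : 0;
}

// Step 5: the graph decides what the LLM actually sees.
// Drop hits the graph does not know or has marked as superseded,
// then order by confidence first and similarity score second.
function selectContext(hits, graph, limit = 5) {
  return hits
    .filter((h) => graph.has(h.id) && graph.get(h.id).supersededBy === null)
    .sort((a, b) => rank(graph.get(b.id)) - rank(graph.get(a.id)) || b.score - a.score)
    .slice(0, limit)
    .map((h) => graph.get(h.id).text);
}

// Step 4 stand-in: pretend these came back from a vector search.
const vectorHits = [
  { id: "auth_v1", score: 0.91 },
  { id: "auth_note", score: 0.88 },
  { id: "auth_v2", score: 0.86 },
];

console.log(selectContext(vectorHits, factGraph));
// => [ 'mTLS for internal APIs', 'mTLS rollout pending for batch jobs' ]
```

The superseded JWT fact had the highest similarity score and still never reaches the prompt. Filtering before ranking is a design choice in this sketch; a softer version could keep superseded facts but label them as historical.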

If you already have a policy engine like OPA in your stack, this is also a good place to enforce rules like:

  • only approved memories can be treated as canonical
  • expired decisions should not be retrieved
  • temporary incident workarounds should not leak into normal planning

That’s usually a better answer than trying to prompt-engineer your way out of stale context.
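The three rules above can be sketched as plain predicates. This is ordinary JavaScript standing in for a policy engine, not OPA or Rego, and the field names (approved, expiresAt, incidentWorkaround) and the ctx object are hypothetical.

```javascript
// Plain-JavaScript stand-in for policy rules; not OPA or Rego.
// Field names (approved, expiresAt, incidentWorkaround) are hypothetical.

// Rule 1: only approved memories can be treated as canonical.
function trustLevel(memory) {
  return memory.approved ? "canonical" : "provisional";
}

// Rules 2 and 3: expired decisions are never retrieved, and incident
// workarounds are only visible while the agent is in incident mode.
function isRetrievable(memory, ctx) {
  if (memory.expiresAt !== undefined && memory.expiresAt <= ctx.now) return false;
  if (memory.incidentWorkaround && ctx.mode !== "incident") return false;
  return true;
}

function applyPolicy(memories, ctx) {
  return memories
    .filter((m) => isRetrievable(m, ctx))
    .map((m) => ({ id: m.id, trust: trustLevel(m) }));
}

const memories = [
  { id: "m1", approved: true },
  { id: "m2", approved: false },
  { id: "m3", approved: true, expiresAt: 50 },
  { id: "m4", approved: true, incidentWorkaround: true },
];

console.log(applyPolicy(memories, { now: 100, mode: "normal" }));
// => [ { id: 'm1', trust: 'canonical' }, { id: 'm2', trust: 'provisional' } ]
```

Keeping these checks in code or in a policy engine means they run on every retrieval, regardless of how the prompt is worded.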

The trap nobody mentions

The biggest mistake isn’t “using vectors.”

It’s treating all memory as text.

Some memory is text.
Some memory is state.
Some memory is policy.
Some memory is provenance.

If you flatten all of that into embeddings, your agent can retrieve context — but it can’t reliably reason about whether that context is still true.

That’s where drift starts.

Try it yourself

If you’re building agents and want to pressure-test the surrounding security and tooling:

  • Want to check your MCP server? Try https://tools.authora.dev
  • Run npx @authora/agent-audit to scan your codebase
  • Add a verified badge to your agent: https://passport.authora.dev
  • Check out https://github.com/authora-dev/awesome-agent-security for more resources

My take

Vector stores are still the right tool for retrieval.

But if you want long-lived agents that don’t slowly poison themselves with stale context, you need something that models truth over time.

Usually that means adding a knowledge graph, or at least graph-like relationships, on top of your retrieval layer.

How are you handling agent memory today: pure RAG, graph-backed memory, or something else? Drop your approach below.

— Authora team

This post was created with AI assistance.