Hashtag Jakarta EE #325

Welcome to issue number three hundred and twenty-five of Hashtag Jakarta EE!

I am on my way home from JavaOne 2026 with a bag full of swag and a head full of inspiration and new ideas. One of the ideas has already resulted in a brand new abstract that I have submitted to a couple of upcoming conferences. Let’s hope the program committees are as excited as I am about it. If accepted, I think the conference attendees choosing to listen to my talk are in for a treat.

Jakarta EE 12 Milestone 3 is coming up. There is some activity in various specification projects, which is good. Others, on the other hand, could benefit from a little wake-up call. There was a welcome update from Jakarta RESTful Web Services in the platform call this week. It seems like they are making some progress on the CDI replacement for @Context.

You may be using skills to augment your AI agents in some way or another. SkillsJars offers a simple solution for publishing skills as JAR files on Maven Central. It is a pretty cool project created by James Ward, and I recommend taking a look at it. It may simplify your workflow significantly if you are moving a lot of skills around by copy-pasting them on the file system. With SkillsJars, you simply add them as dependencies to your project.

A cool thing that was announced at JavaOne this year was the reopening of Project Detroit. The purpose of this project is to bring JavaScript and Python to the JVM. The project was brought to life as a result of the Foreign Function & Memory API from Project Panama.

Ivar Grimstad


Designing for Throwaway-ability in the AI Coding Era

The most impactful engineering leader I ever worked with was a guy named Bill Scott, who led UI engineering at PayPal. What I loved about Bill was his constant enthusiasm for new technology. He helped lead the charge to take PayPal’s old C++ and Java stack and modernise it into Java microservices and Node.js for the front end. And he absolutely surrounded himself with people who were looking for ways new technology could be used to create better experiences for our customers and improve the lives of developers.

If he were still with us today, I think he would absolutely love the AI revolution taking place in software development right now.

When I was interviewing at PayPal back in 2013, I spoke with him about an influential blog post he’d written called The Experimentation Layer. Reflecting on his time at Netflix, he was pushing back against the cult of reusability in software development and talked about designing for throwaway-ability instead. He was a huge proponent of Lean UX and A/B testing. In the article he mentioned that in his time at Netflix, after something like a year, only 10% of the front-end code stayed the same. It was constantly being optimised and improved to deliver better outcomes for their viewers.

He wasn’t anti-reusability. He wrote an entire book about building reusable components for UIs, one that was extremely important to me as an early engineer coming up in the jQuery UI era, through the HTML5 transition and the death of Flash.
Designing Web Interfaces book from O'Reilly

But regardless, it annoyed me a little. It put me off. I had just come from a role where, for the past two years at oDesk, I had been building reusable component libraries. And so we sparred a bit, gently, in the interview. And he helped me understand where he was coming from.

The goal is to build the best experiences for users as quickly as possible. Not being afraid to experiment and try new things, especially when the status quo isn’t working. The biggest thing in his philosophy was just constantly getting user feedback and making sure that what you’re putting out in front of users is actually working for them. And making sure that you have the infrastructure needed to iterate frequently and measure and observe your successes and failures. The other lesson that I’ve taken from that time is not to be married to my code. It’s just code. It’s an experiment. And it’s okay if we decide to throw it away.

The calculus of how “throwaway-able” your code needs to be depends on where it lives in the stack. He uses this funnel to illustrate the volatility of different parts of the software stack:
A funnel diagram to demonstrate different parts of the software stack

If you’re looking at a modern React application, you might think of it like this instead:
A modern React application designed as a building structure

In modern client-side React architecture, there are places that we need to be more careful about changes than others. But there are places in a large codebase that ultimately don’t matter so much. I think for many people, for better or worse, if a bunch of Tailwind CSS classes change in the code for a component used in only one place, you’re not really bothered about it as long as the thing looks like it’s supposed to look.

In a different context, you can also think about which features are likely to be more or less permanent and which are temporary, like landing pages or some kind of feature-flagged experiment you don’t expect to really make it.

Death of the Code Review

Let the slop flow through you, meme

There’s been a lot of talk lately about the death of the code review. I think it’s premature, at least for teams supporting real codebases with real users. If we use Bill Scott’s framework for determining which code is likely to change frequently or should be designed for throwaway-ability, I think that can help us understand where it might be OK for a little “slop” to enter our application without looking too hard.

Modified diagram of React application architecture, suggesting that the slop can live on the roof

The way forward with this deluge of AI-generated code isn’t to avoid reviewing it; it’s figuring out where human review matters the most.

Anticipating UI Engineering’s Future

Today, I asked Claude if it wanted to play a game, and it generated one and let me play it, right in the chat interface.
Playing a memory game with Claude

Vercel’s also been working on tools to allow chats to render UIs in line with all sorts of guardrails and rules in place. It’s called json-render and it’s promoting this concept of “generative UI”.

Sunil Pai at Cloudflare recently hypothesized about “code mode” sandboxes that generate custom user interfaces for individual users, while at the same time suggesting that chatbots aren’t the end game for UIs either.

It’s clear that the paradigm with which user interfaces are built for customers is changing and that future hasn’t been fully written yet. Dynamic AI-generated user interfaces are likely coming one way or another, and that UI code likely isn’t going to be reviewed by humans.

Even so, I’m pretty certain the architecture underlying those future interfaces will be reviewed and tested very carefully by human reviewers. Nobody wants to build their house on a foundation of slop. But the “experimentation layer” part, the part of the code we’re less fussed about? I’m pretty sure that part is going to get a whole lot bigger.

RIP Bill. You were a real one.

Bill Scott

We Spent a Week Evaluating a Context Compression Tool, Then Killed It

Here’s Everything We Found

An AgentAutopsy post — dissecting AI agent failures so you don’t have to

177.

That’s how many times our decision-making agent’s context got compacted in two weeks. Claude Opus, sitting at the center of our 1-human + 6-AI autonomous team, hit its context window limit 177 times. Each time that happens, the system summarizes everything and restarts.

Each time, something gets lost — a tool call result, a nuanced decision from three turns ago, the reason we ruled out option B. After 177 of these, you start making decisions with a model that’s kind of… lobotomized. It still sounds smart. It’s just missing the thread.

So we decided to build something to fix it. We called it Context Squeezer.

We killed it six days later.

Here’s the full dissection.

First — Isn’t This What Prompt Caching Is For?

Before we go further, let’s clear up the thing that confused us for longer than it should have.

Prompt Caching (Anthropic has it, OpenAI has it) caches the static prefix of your request — your system prompt, your fixed instructions, whatever you send at the top of every call. You get up to 90% discount on those repeated tokens. It’s genuinely good, and if you’re not using it, you probably should be.
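
As a concrete illustration, here is a minimal sketch using Anthropic’s Python SDK. The cache_control marker on the static system block is how you opt in; the model id here is a placeholder and the exact limits are in the provider docs.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STATIC_INSTRUCTIONS = "...your big, fixed system prompt..."
conversation_history = [{"role": "user", "content": "What should the agent do next?"}]

# The static prefix is marked cacheable, so repeated calls reuse it at a discount.
# The messages list (the part that grows every turn) gets no help from this.
response = client.messages.create(
    model="claude-opus-4-1",  # placeholder model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=conversation_history,
)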

But it does nothing for conversation history. Nothing.

Our 177 compactions were caused by dynamic history accumulation. Every turn, the conversation grows. Six agents, tool calls flying in every direction, results being passed back up the chain — by the time you’re 40 turns in, you’re hauling a 100K-token payload on every single API call. Prompt Caching only helps with the part that stays the same. Our problem was the part that keeps growing.

Short version: Prompt Caching saves money on repetition. Context compression saves memory as conversations get longer. They’re complementary tools. They do not compete. We had a context compression problem, not a caching problem.

This distinction matters and we’ll come back to it.

What We Were Going to Build

The plan was a Go single-binary local reverse proxy. Dead simple to install — change one line (BASE_URL=http://localhost:8080/v1), done. Every outbound API call gets intercepted. Message history gets compressed by a cheap model (GPT-4o-mini). Smaller payload goes out. Your main model never sees the bloat.
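
To make that concrete, here is roughly what the compression step would have done on each intercepted request. This is a hypothetical sketch, not shipped code: keep the most recent turns verbatim and have the cheap model squash everything older.

from openai import OpenAI

client = OpenAI()   # the cheap summarizer (GPT-4o-mini in the plan)
KEEP_RECENT = 8     # how many trailing messages to pass through untouched (arbitrary)

def compress_history(messages):
    """Summarize older turns with a cheap model; keep recent turns verbatim."""
    if len(messages) <= KEEP_RECENT:
        return messages
    older, recent = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize this conversation so an agent can continue it. Keep decisions, tool results, and open questions."},
            {"role": "user", "content": "\n".join(f"{m['role']}: {m['content']}" for m in older)},
        ],
    ).choices[0].message.content
    return [{"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}] + recent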

Target: 80% token reduction on dynamic history. Business model: open source core, $29 Pro tier (one-time) with dashboard, smart routing, and history archiving.

Our own pain was real, the tech was straightforward, and we could ship in a week. That was the whole thesis.

The Stress Test That Made Us Look Harder

We put the concept through a structured internal stress test before writing a single line of code. Most of it held up. But one question came back hard: did we actually need to build this, or does something already solve it?

We’d evaluated prompt caching early on and correctly ruled it out. But that question forced us to look more carefully. Not at caching — at compression tools specifically.

That search took about 30 minutes.

Headroom: The Tool We Should Have Found on Day One

github.com/chopratejas/headroom. 718 stars. Actively maintained. Python-based. Open source.

It does context compression for AI agents. It’s free.

Here’s the side-by-side:

Dimension by dimension, Headroom first, then what we planned:

  • Price: free, open source; ours: open source core + $29 Pro
  • Install: pip install headroom-ai; ours: download a Go binary
  • Compression strategy: AST parsing (code) + statistical analysis (JSON) + ModernBERT (text), multi-strategy; ours: single cheap-LLM summarization
  • Conversation history: explicitly supported; ours: the core feature
  • Frameworks: Claude Code, Codex, Cursor, Aider, LangChain, CrewAI; ours: a generic proxy
  • Community: 718 stars, Discord, active dev; ours: zero
  • Unique features: SharedContext (multi-agent), MCP integration, KV Cache alignment, Learn mode; ours: none
  • Benchmarks: SQuAD 97%, BFCL 97%, built-in eval suite; ours: none
  • Extra API cost per compression: zero (AST/stats are local); ours: one API call per compression

We’re not trying to dunk on ourselves here — but looking at that table, the honest answer is: Headroom is better than what we would have shipped, in almost every dimension that matters. Their compression uses actual structural analysis of the content. Ours would’ve called GPT-4o-mini and hoped for the best. Their multi-agent SharedContext feature is something we hadn’t even thought to spec. Their benchmarks exist; ours would have been “we tested it a few times.”

They shipped a real tool. We had a slide deck and six days of planning.

Why We Killed It

The kill decision wasn’t hard once we saw the table clearly.

The problem is real. 177 compactions is a real problem. We’re not killing it because context compression doesn’t matter — it does. We’re killing it because someone already built a better solution and gave it away for free.

Our entire pitch was: cheap model, single binary, open source core, simple enough that anyone can install it. pip install headroom-ai is already that simple. And once you’re inside Headroom, you get AST-based compression, MCP integration, multi-agent context sharing, and a test suite with published benchmarks. Our $29 Pro tier was going to offer… a dashboard.

There was no angle. We closed it.

What We Actually Learned

1. Search GitHub before you write specs.

We designed a full product, stress-tested the concept, got internal approvals — then spent 30 minutes on GitHub and found Headroom. The 30-minute search should have been the first 30 minutes of Day One, not something we did under pressure on Day Four. Embarrassing but fixable. We’re writing it down so it’s actually fixed.

2. “More simple” is not a moat against free.

We told ourselves the Go binary was a differentiator because Python dependencies can be annoying. That’s true. But pip install headroom-ai is not a painful install — it’s one command. Simplicity alone cannot justify a price tag when the free alternative is already simple. You need a moat that isn’t “slightly less friction.”

3. Before you build anything, diagnose exactly what kind of “too much” you have.

This one is the one worth slowing down on.

If your API costs are going up and you’re not sure why, the answer matters a lot before you pick a solution. If you’re sending the same long system prompt on every call, that’s a caching problem — Prompt Caching on Anthropic or OpenAI will cut that cost by up to 90% and you don’t need to build anything. If your conversation history is growing with every turn and ballooning the payload, that’s a compression problem — tools like Headroom are built specifically for that. They’re different shapes of the same symptom. We nearly made a wrong call because we’d initially conflated the two. The diagnostic question is: which part of my payload is growing? Answer that first.
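
A crude way to answer that question, sketched below with a rough four-characters-per-token estimate: log the static and dynamic parts of every call separately and watch which one climbs.

def log_payload_shape(system_prompt, messages, turn):
    """Split a rough token estimate into the static prefix vs. the growing history."""
    static_tokens = len(system_prompt) // 4
    history_tokens = sum(len(str(m.get("content", ""))) // 4 for m in messages)
    print(f"turn={turn} static~{static_tokens} history~{history_tokens}")
    # Flat static + growing history -> compression problem (tools like Headroom).
    # Large static + flat history   -> caching problem (Prompt Caching).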

4. Stress-test your own ideas with someone who wants to break them.

Our internal stress test was uncomfortable — it was supposed to be. It raised questions we hadn’t asked ourselves. Some of those were overcorrections. One of them was exactly right. We’ll take that ratio.

5. Killing early is cheap. Killing late is expensive.

We spent a week and zero dollars in development. The alternative — building for two months, shipping, then discovering Headroom during a customer support conversation — would have cost orders of magnitude more. Not just in time, in credibility. The kill at week one is the best possible outcome of a bad starting position.

6. The tool you need probably already exists.

We know this rule. Everyone knows this rule. We still violated it. The rule is: 30 minutes on GitHub before you write a single line of code. It is the highest-ROI activity in product development and it is chronically underdone.

That’s It

Context Squeezer is dead. The problem it was trying to solve is real. If you’re running multi-agent systems and hitting context limits, look at Headroom first — it’s free, it’s maintained, and it’s more technically sophisticated than what most teams would build from scratch.

If you’re confused about prompt caching vs. context compression, re-read Section 1 of this post. They’re different tools for different problems.

We’re a 1-human + 6-AI team. We build things, ship some of them, kill others, and write these autopsies in public because the failure mode we went through is not unique to us. Someone else is planning their own version of Context Squeezer right now. Maybe this saves them a week.

This is an AgentAutopsy post. More autopsies coming — github.com/AgentAutopsy.

AgentAutopsy — dissecting AI agent failures so you don’t have to

Aegis — I built an open-source secrets broker because CyberArk costs more than my salary

Let me paint you a picture.

You join a company. You ask how secrets are managed. Someone looks at their shoes. Eventually you find a .env file in a shared Google Drive folder. It has been there for three years. Nobody knows who created it. It has the production database password in it. Thirteen people have access to the folder.

This is not a horror story. This is Tuesday.

The gap nobody is filling

Secrets management has two tiers and nothing in between.

Tier 1 — Enterprise: CyberArk, HashiCorp Vault (now IBM), AWS Secrets Manager. Powerful, battle-tested, and either eye-wateringly expensive or requiring a dedicated platform team to operate. CyberArk enterprise licences start at six figures. Vault OSS is free but running it reliably in production is a full-time job.

Tier 2 — Nothing: Most teams under 200 people. They use .env files, CI/CD secret stores with no audit trail, or shared password managers never designed for machine-to-machine secrets.

And here is the real problem: most organisations accumulate secrets sprawl over time. Applications that talk directly to CyberArk. Others that hit Vault. A handful pulling from AWS SSM. Each with its own credential logic, its own rotation story, and no centralised visibility. When a safe is renamed, a token expires, or a key leaks — you find out by watching something break in production.

That is what I built Aegis to fix.

What Aegis is

Aegis is a vendor-agnostic secrets broker and PAM gateway. It sits as the only secrets endpoint your applications ever need to know about — regardless of whether those secrets live in CyberArk, HashiCorp Vault, AWS Secrets Manager, or Conjur.

Applications authenticate with a scoped API key (one key per team-registry pair) and receive exactly the secrets they are authorised to see. Every fetch, every rotation, every configuration change is written to an immutable audit log with full attribution. There is no way to touch a secret without leaving a trace.

Your Application                Aegis                    Upstream Vault
      │                            │                            │
      │  GET /secrets              │                            │
      │  X-API-Key: sk_...         │                            │
      │  X-Change-Number: CHG123   │                            │
      ├───────────────────────────►│                            │
      │                            │  1. Hash key → lookup      │
      │                            │     team + registry        │
      │                            │                            │
      │                            │  2. Enforce policy:        │
      │                            │     change number, IP,     │
      │                            │     time window, rate      │
      │                            │                            │
      │                            │  3. Fetch from upstream    │
      │                            ├───────────────────────────►│
      │                            │◄───────────────────────────┤
      │                            │                            │
      │                            │  4. Write audit log        │
      │                            │  5. Emit SIEM event        │
      │                            │                            │
      │  { secret_name: value }    │                            │
      │◄───────────────────────────│                            │
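
In application code, a fetch through Aegis might look roughly like this. The endpoint and header names follow the diagram above; everything else (URL, key handling) is an illustrative assumption, so check the README for the real API.

import os
import requests

AEGIS_URL = "http://localhost:8080"  # the only secrets endpoint the app knows about

resp = requests.get(
    f"{AEGIS_URL}/secrets",
    headers={
        "X-API-Key": os.environ["AEGIS_API_KEY"],  # scoped key for this team + registry
        "X-Change-Number": "CHG123",               # ITSM change reference, if policy requires it
    },
    timeout=5,
)
resp.raise_for_status()
secrets = resp.json()  # {"secret_name": "value", ...} as in the flow above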

What it handles

Scoped API keys per team

Each team gets one API key per registry they are assigned to. Team A and Team B can both access the same registry with different keys. If one key is compromised, only that assignment needs rotating — the other team is unaffected. Keys are stored as SHA-256 hashes. The plaintext is never persisted.
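
The lookup works roughly like this (a sketch of the idea, not Aegis’s actual code): hash the presented key and match on the digest, so the plaintext never needs to be stored.

import hashlib

def key_fingerprint(api_key: str) -> str:
    """Only this digest is persisted; the plaintext key is never written anywhere."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

# On each request, the presented key is hashed and the digest is used to look up
# the team + registry assignment it belongs to.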

Vendor-agnostic secret fetching

Aegis resolves the upstream vendor at fetch time based on the object definition. You can migrate a secret from CyberArk to HashiCorp Vault without touching application code — just update the object definition in Aegis. Supported backends: CyberArk (CCP + PVWA), HashiCorp Vault (KV v1/v2), AWS Secrets Manager / SSM, Conjur (OSS + Enterprise).

Policy enforcement

Policies are defined per team, per registry, or per team-registry pair. Enforceable controls include the following (see the sketch after the list for how a policy might look):

  • IP allowlist — only specific CIDRs can request secrets
  • Time windows — a batch job that runs at 2am can only fetch secrets at 2am
  • Change number enforcement — every request must carry a valid ITSM change reference
  • Rate limiting — per-team RPM cap backed by Redis, prevents runaway services hammering upstream vaults
  • Key expiry — maximum key lifetime configurable per policy
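
A hypothetical policy for one team-registry pair might look like this. The field names are mine, not Aegis’s actual schema; they just map onto the controls listed above.

payments_batch_policy = {
    "team": "payments",
    "registry": "prod-cyberark",
    "ip_allowlist": ["10.20.0.0/16"],                    # only these CIDRs may request secrets
    "time_window": {"start": "01:45", "end": "02:30"},   # the 2am batch job's slot
    "require_change_number": True,                       # every request carries an ITSM reference
    "rate_limit_rpm": 60,                                # per-team cap, backed by Redis
    "key_max_age_days": 90,                              # maximum key lifetime
}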

Immutable audit logging

Every access is written to audit_log with: timestamp, team identity, registry, objects fetched, source IP, user agent, change number, and outcome. Every admin action is written to change_log with structured before/after diffs. There is no off switch. For regulated environments — financial services, healthcare, public sector — this is the difference between passing and failing a security audit.

SIEM integration

Audit events are emitted as structured JSON to whichever destination you point it at: stdout, Splunk HEC, AWS S3 (gzip JSONL), or Datadog. Configurable at runtime, no code changes needed.

Team self-service model

This is the part I am most pleased with. The security team manages policy — not operations. Teams manage their own:

  • Webhook subscriptions (Slack, MS Teams, Discord, or any HTTP endpoint)
  • CI/CD rotation triggers via auto-generated inbound webhook URLs
  • Notification channels
  • Key rotation

No tickets. No waiting. The security team retains full visibility through the audit log and can override anything — they just do not need to be involved in day-to-day operations.

Designed for scale

Built to handle 100+ teams and 40,000+ secrets under a single security team. The data model is relational and explicit — teams, registries, objects, and the many-to-many assignments between them are all first-class entities with their own audit trails.

The stack

  • FastAPI (Python 3.12) — async, fast, automatic OpenAPI docs
  • PostgreSQL + SQLAlchemy + Alembic — relational, properly migrated, nothing exotic
  • Redis — rate limiting and session tokens
  • Docker + GHCR — single container, published to GitHub Container Registry on every tagged release
  • Terraform — AWS infrastructure modules included
  • GitHub Actions — CI with Bandit static analysis, Trivy CVE scanning on every release. Releases block on CRITICAL/HIGH CVEs.

Get started in five minutes

git clone https://github.com/gustav0thethird/Aegis
cd Aegis
cp config/auth.json.example config/auth.json
docker compose up

Alembic runs migrations on startup. The API is live at http://localhost:8080. Full API reference and configuration docs are in the README.

Why AGPLv3

I chose AGPLv3 deliberately. If you are a team in a regulated environment you need to be able to audit what touches your secrets. With a proprietary tool you are trusting a vendor. With Aegis you can read every line of code that handles your credentials.

AGPLv3 means: use it freely, modify it freely, self-host it freely. If you run it as a network service and make modifications, you share them back. This is the right licence for security tooling.

Who this is for

  • Platform teams at 20–500 person companies who need proper secrets governance without enterprise PAM pricing
  • Regulated industries where audit trails are mandatory — financial services, healthcare, public sector
  • Teams already running Vault or CyberArk who want a controlled, auditable access layer in front of their vault rather than every service talking to it directly
  • Anyone drowning in secrets sprawl across multiple vendors with no central visibility

Come help build it

Aegis is early-stage and actively developed. The core is stable and the architecture is solid — now it needs people who actually run secrets infrastructure at scale to push it further.

What is being worked on:

  • Web UI for policy management
  • LDAP / SSO integration
  • Kubernetes secrets injection
  • Additional vault backends

If you work in security engineering, platform engineering, or regulated infrastructure — your experience is exactly what shapes what gets built next. Open an issue, start a discussion, or send a PR.

Star it on GitHub if it looks useful — it genuinely helps with visibility and lets me know that people care about this project existing.

github.com/gustav0thethird/Aegis

Real World vs Theory Lessons

In theory, checking disk usage looks simple — just grab the percentage from df -h. But in the real world, scripts break, formats differ, and human‑readable values like 374G don’t compare cleanly. This post is about the lessons learned when theory meets reality.

Disk Usage Monitoring in Linux: Percentage vs. Actual Size

Monitoring disk usage is one of the most common tasks for system administrators and developers. But there’s often confusion between checking percentage usage (df -h) and checking actual disk space (du -sh with numfmt). Let’s break down the challenges, solutions, pros, and cons of each approach.

❓ Common Questions
Should I monitor disk usage by percentage or by actual size?
Why does my script fail with “integer expression expected” errors?
How can I compare human‑readable sizes like 192K, 374G, or 2T against thresholds?
Which method is more reliable across different environments (Linux, Git Bash, macOS)?

⚡ The Challenge
Using df -h with percentages
A typical script might look like this:

# Strip df's header row, keep the use% column, and drop the % sign
disk_usage=$(df -h | awk 'NR>1 {print $5}' | sed 's/%//')
for usage in $disk_usage; do
    if [ "$usage" -gt 70 ]; then
        echo "Warning: Disk usage is high ($usage%)"
    else
        echo "Disk usage is normal ($usage%)"
    fi
done

Problem:
Sometimes df -h outputs values like 374G in other columns.
If parsing isn’t precise, your script may grab 374G instead of 22%.
[ -gt ] only works with integers, so 374G causes integer expression expected errors.

Using du -sh with numfmt
A more robust approach is to normalize values into bytes:

# Human-readable size of the current directory (e.g. 192K, 374G)
size=$(du -sh | awk '{print $1}')
# Normalize both the measured size and the threshold to raw bytes
bytes=$(numfmt --from=iec "$size")
threshold=$(numfmt --from=iec 100G)

if [ "$bytes" -gt "$threshold" ]; then
    echo "Disk usage is high: $size"
else
    echo "Disk usage is normal: $size"
fi

Here:
du -sh → gives human‑readable size (192K, 374G, etc.).
numfmt --from=iec → converts those into raw integers (bytes).
Thresholds like 100G, 500M, 2T are also converted into bytes.
Comparisons are now reliable and portable.

✅ Solutions
For percentage checks: Use df -h | awk 'NR>1 {print $5}' | sed 's/%//' to extract only numeric percentages.

For actual size checks: Use du -sh + numfmt to convert human‑readable values into integers.

Hybrid approach: Monitor both percentage and actual size for a complete picture.

📊 Pros and Cons

🚀 Conclusion
If you just want a quick warning when usage exceeds 70%, percentage checks with df -h are fine.
If you need robust monitoring across environments, or want to enforce thresholds like “alert me if usage exceeds 100G,” then numfmt is the best choice.
In real production scripts, combining both methods gives the most reliable monitoring.

Choosing a model means measuring cost vs quality on your data

I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.

I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?

Setup

For this exploration, I used Baseten’s Model APIs. You can use whatever model provider you like.

I picked three models across the cost spectrum (priced March 2026):

Tier       Model                     Params (total / active)   ~Input $ per 1M tokens
Frontier   DeepSeek V3.1             671B / 37B                $0.50
Mid-tier   Nvidia Nemotron 3 Super   120B / 12B                $0.30
Budget     OpenAI GPT-OSS-120B       117B / 5.1B               $0.10

I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing fields, hourly rates vs. annual salaries, multiple currencies. For production, this type of data would likely come from multiple sources and be larger.

The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.
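
The call itself is nothing exotic. A rough sketch of what each extraction looked like, assuming an OpenAI-compatible endpoint; the base URL, API key, model id, and prompt wording are placeholders rather than the exact ones I used:

import json
from openai import OpenAI

client = OpenAI(base_url="https://your-provider.example/v1", api_key="...")

SYSTEM_PROMPT = """Extract the job posting into valid JSON with exactly these fields:
title, company, location, work_model, salary_min, salary_max, salary_currency,
requirements, nice_to_haves, benefits. Use null when a field is not present."""

def extract(posting_text: str, model: str) -> dict:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # as deterministic as possible for structured extraction
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": posting_text},
        ],
    )
    return json.loads(resp.choices[0].message.content)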

For scoring, scalar fields (title, company, location, and so on) are compared after normalization with exact match for strings, partial credit for substring containment, and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored using set overlap with a word-overlap threshold for fuzzy matching, then taking the minimum of recall and precision. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting-agent use case.
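
In code, the scoring boils down to something like this. It is a simplified sketch of the scheme described above, not the exact code in the repo, and the fuzzy word-overlap matching for arrays is omitted for brevity.

def score_scalar(expected, got):
    """Exact match, partial credit for containment, 5% tolerance for numbers."""
    if expected is None or got is None:
        return 1.0 if expected == got else 0.0
    if isinstance(expected, (int, float)) and isinstance(got, (int, float)):
        return 1.0 if abs(got - expected) <= 0.05 * abs(expected) else 0.0
    e, g = str(expected).strip().lower(), str(got).strip().lower()
    if e == g:
        return 1.0
    return 0.5 if e in g or g in e else 0.0

def score_array(expected, got):
    """Set overlap, scored as the minimum of recall and precision."""
    e = {str(x).strip().lower() for x in (expected or [])}
    g = {str(x).strip().lower() for x in (got or [])}
    if not e and not g:
        return 1.0
    if not e or not g:
        return 0.0
    overlap = len(e & g)
    return min(overlap / len(e), overlap / len(g))

# Per-posting accuracy is then a weighted average across fields,
# with title and requirements weighted highest.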

I purposefully included one reasoning model because when you send a prompt to a reasoning model, it will “think” first, and that output is wrapped in <think> tags. This is something to consider when building your parser.

An example reasoning response might look like this:

<think>
The posting mentions "$150k - $180k" → I should normalize this to annual integers.
The location says "SF Bay Area" → should I interpret this as San Francisco?
The posting mentions "3 days in office" → this implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}
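
A small parsing guard handles this; an illustrative sketch that assumes the think block is well formed and comes before the JSON:

import json
import re

def parse_extraction(raw: str) -> dict:
    """Drop any <think>...</think> block, then parse the remaining JSON payload."""
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)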

Reasoning models also affect cost because those thinking tokens count toward output. Nemotron averaged 702 output tokens per call compared to 142 for DeepSeek and 481 for OpenAI.

The results

Metric                  DeepSeek   Nemotron   OpenAI
Avg Accuracy            83.5%      80.8%      82.2%
JSON Valid Rate         25/25      25/25      25/25
Avg Latency             0.7s       1.6s       2.3s
Avg Cost/Posting        $0.0004    $0.0007    $0.0003
Est. Cost/100K Posts    $42.24     $66.08     $28.86

All three models produce valid JSON 100% of the time. Accuracy is within a 3-point spread. The budget model retains ~98% of frontier quality (82.2% vs. 83.5% average accuracy) at roughly 70% of the cost ($28.86 vs. $42.24 per 100K postings).

Where the models actually differ by field

The aggregate scores tell only part of the story. Here’s the per-field breakdown:

Field DeepSeek Nemotron OpenAI
title 88% 87% 87%
company 80% 76% 80%
location 65% 69% 66%
work_model 80% 76% 72%
salary_min 82% 82% 80%
salary_max 84% 82% 80%
requirements 94% 92% 93%
nice_to_have 93% 87% 93%
benefits 82% 70% 83%

A few things stand out. Location was low for everyone, 65-69% across the board. These postings include things like “SF Bay Area,” “remote (US only),” and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge on work-model extraction and nice-to-have extraction.

Nemotron’s weakest spot is benefits at 70%. The repo does not establish a single cause for that, but the result is a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.

Requirements extraction was the highest scoring area for all three models.

Human review

In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.

Human review might reveal that your automated scoring weights don’t reflect what actually matters for your use case.

In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.

A few interesting findings:

When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like “stealth startup,” all three models either left the company unresolved or returned placeholder-like values such as “Stealth Startup.” That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.

In a posting with dual currency salary bands, each model handled it differently. One took the first band, one mixed values across both bands, and one returned nothing. This could potentially be handled with different field design as I was only looking for salary min and max with no flexibility for the dual currency scenario.

In listings with a specific city that did not state remote, in-office, or hybrid, all models tended to set work_model to null. This is another example where whether or not this is acceptable is a product choice a human needs to make.

Cost at scale

At 100K postings per month:

  • Frontier (DeepSeek V3.1): ~$42/month
  • Mid-tier (Nemotron): ~$66/month
  • Budget (GPT-OSS-120B): ~$29/month

The budget model saves you about $13/month over frontier for a 1.3-point accuracy drop (not including adjustments from my human review). Nemotron costs more than both while scoring lower. The thinking tokens make it the worst value for this particular task.

If we scale this to 1M postings, the spread becomes roughly $422 vs $661 vs $289 per month, which makes the cost penalty for the reasoning model much more visible.

Making a final choice

For this use case of structured extraction from messy text at volume, I’d go with the budget model. Even with some small inaccuracies or hallucinations, the value from the budget extraction is still good enough for the proposed build.

Now, you may be thinking about the latency of the budget model (2.3s vs 0.7s), which would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.

I’d skip the reasoning model for this kind of extraction. Nemotron’s chain-of-thought was sometimes useful when handling ambiguous formatting, but for structured output the extra reasoning cost was not justified by the measured quality here.

Final thoughts

This exploration is only 25 postings. To evaluate for production, you would want a larger sample, since differences this small could easily fall within the noise.

I relied on a single system prompt throughout; changes to the prompt could affect the results and are worth exploring.

What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.

The main takeaway here is that benchmarks alone won’t tell you which model handles your messy data best. Build an evaluation around what your model actually needs to do well for your product, and see what comes back.

If you would like to run this analysis yourself, the project is hosted on GitHub. If you have questions or want to chat, please get in touch with me on LinkedIn or X.