Hashtag Jakarta EE #325

Welcome to issue number three hundred and twenty-five of Hashtag Jakarta EE!

I am on my way home from JavaOne 2026 with a bag full of swag and a head full of inspiration and new ideas. One of the ideas has already resulted in a brand new abstract that I have submitted to a couple of upcoming conferences. Let’s hope the program committees are as excited as I am about it. If accepted, I think the conference attendees choosing to listen to my talk are in for a treat.

Jakarta EE 12 Milestone 3 is coming up. There is some activity in various specification projects, which is good. Others, on the other hand, could benefit from a little wake-up call. There was a welcome update from Jakarta RESTful Web Services in the platform call this week. It seems like they are making some progress on the CDI replacement of @Context.

You may be using skills to augment your AI Agents in some way or another. SkillsJars offers a simple solution for publishing skills as JAR files on Maven Central. It is a pretty cool project created by James Ward. I recommend taking a look at it. It may simplify your workflow significantly if you are moving a lot of skills around by copy-paste on the file system. With SkillsJars, you simply add them as dependencies to your project.

A cool thing that was announced at JavaOne this year was the reopening of Project Detroit. The purpose of this project is to bring JavaScript and Python to the JVM. The project was brought to life as a result of the Foreign Function & Memory API from Project Panama.

Ivar Grimstad


Designing for Throwaway-ability in the AI Coding Era

The most impactful engineering leader I ever worked with was a guy named Bill Scott who led UI engineering at PayPal. What I loved about Bill was his constant enthusiasm for new technology. He helped lead the charge to take PayPal’s old C++ and Java stack and modernise it into Java microservices and Node.js for the front end. And just absolutely surrounded himself with people who were looking for ways new technology could be used to create better experiences for our customers and improve the lives of developers.

If he were still with us today, I think he would absolutely love the AI revolution taking place in software development right now.

When I was interviewing at PayPal back in 2013, we spoke about an influential blog post that he’d written called The Experimentation Layer. Reflecting on his time at Netflix, he was pushing back against the cult of reusability in software development and talked about designing for throwaway-ability instead. He was a huge proponent of Lean UX and A/B testing. In the article he mentioned that in his time at Netflix, after something like a year, only 10% of the front end code stayed the same. It was constantly being optimised and improved to deliver better outcomes for their viewers.

He wasn’t anti-reusability. He wrote an entire book about building reusable components for UIs. One that was extremely important to me as an early engineer coming up in the jQuery UI era through the HTML5 transition and the death of Flash.
Designing Web Interfaces book from O'Reilly

But regardless, the blog post put me off a little. I had just come from a role where, for the past two years at oDesk, I had been building reusable component libraries. And so we sparred a bit, gently, in the interview. And he helped me understand where he was coming from.

The goal is to build the best experiences for users as quickly as possible. Not being afraid to experiment and try new things, especially when the status quo isn’t working. The biggest thing in his philosophy was just constantly getting user feedback and making sure that what you’re putting out in front of users is actually working for them. And making sure that you have the infrastructure needed to iterate frequently and measure and observe your successes and failures. The other lesson that I’ve taken from that time is not to be married to my code. It’s just code. It’s an experiment. And it’s okay if we decide to throw it away.

The calculus of how “throwaway-able” your code needs to be depends on where it lives in the stack. He used this funnel to illustrate the volatility of different parts of the software stack:
A funnel diagram to demonstrate different parts of the software stack

If you’re looking at a modern React application, you might think of it like this instead:
A modern React application designed as a building structure

In modern client-side React architecture, there are places that we need to be more careful about changes than others. But there are places in a large codebase that ultimately don’t matter so much. I think for many people, for better or worse, if a bunch of Tailwind CSS classes change in the code for a component used in only one place, you’re not really bothered about it as long as the thing looks like it’s supposed to look.

In a different context, you can also think about which features are likely to be sort of permanent and which features are temporary, like landing pages, right, or some kind of feature-flagged experiment you don’t expect to really make it.

Death of the Code Review

Let the slop flow through you, meme

There’s been a lot of talk lately about the death of the code review. I think it’s premature, at least for teams supporting real codebases with real users. If we use Bill Scott’s framework for determining which code is likely to change frequently or should be designed for throwaway-ability, I think that can help us understand where it might be OK for a little “slop” to enter our application without looking too hard.

Modified diagram of React application architecture, suggesting that the slop can live on the roof

The way forward with this deluge of AI-generated code isn’t to avoid reviewing it; it’s figuring out where human review matters the most.

Anticipating UI Engineering’s Future

Today, I asked Claude if it wanted to play a game, and it generated one and let me play it, right in the chat interface.
Playing a memory game with Claude

Vercel’s also been working on tools to allow chats to render UIs in line with all sorts of guardrails and rules in place. It’s called json-render and it’s promoting this concept of “generative UI”.

Sunil Pai at Cloudflare recently hypothesized about “code mode” sandboxes that generate custom user interfaces for individual users, while at the same time suggesting that chat bots aren’t the end game for UIs either.

It’s clear that the paradigm with which user interfaces are built for customers is changing and that future hasn’t been fully written yet. Dynamic AI-generated user interfaces are likely coming one way or another, and that UI code likely isn’t going to be reviewed by humans.

Even so, I’m pretty certain the architecture underlying those future interfaces will be reviewed and tested very carefully by human reviewers. Nobody wants to build their house on a foundation of slop. But the “experimentation layer” part, the part of the code we’re less fussed about? I’m pretty sure that part is going to get a whole lot bigger.

RIP Bill. You were a real one.

Bill Scott

We Spent a Week Evaluating a Context Compression Tool, Then Killed It

Here’s Everything We Found

An AgentAutopsy post — dissecting AI agent failures so you don’t have to

177.

That’s how many times our decision-making agent’s context got compacted in two weeks. Claude Opus, sitting at the center of our 1-human + 6-AI autonomous team, hit its context window limit 177 times. Each time that happens, the system summarizes everything and restarts.

Each time, something gets lost — a tool call result, a nuanced decision from three turns ago, the reason we ruled out option B. After 177 of these, you start making decisions with a model that’s kind of… lobotomized. It still sounds smart. It’s just missing the thread.

So we decided to do something about it. We called it Context Squeezer.

We killed it six days later.

Here’s the full dissection.

First — Isn’t This What Prompt Caching Is For?

Before we go further, let’s clear up the thing that confused us for longer than it should have.

Prompt Caching (Anthropic has it, OpenAI has it) caches the static prefix of your request — your system prompt, your fixed instructions, whatever you send at the top of every call. You get up to 90% discount on those repeated tokens. It’s genuinely good, and if you’re not using it, you probably should be.

But it does nothing for conversation history. Nothing.

Our 177 compactions were caused by dynamic history accumulation. Every turn, the conversation grows. Six agents, tool calls flying in every direction, results being passed back up the chain — by the time you’re 40 turns in, you’re hauling a 100K-token payload on every single API call. Prompt Caching only helps with the part that stays the same. Our problem was the part that keeps growing.

Short version: Prompt Caching saves money on repetition. Context compression saves memory as conversations get longer. They’re complementary tools. They do not compete. We had a context compression problem, not a caching problem.
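The split is easier to see in an actual request body. Here is a minimal sketch, assuming Anthropic's Messages API shape and its `cache_control` marker (the model name is a placeholder): only the static system prefix is cacheable, while the `messages` array is the part that grows every turn.

```python
# Sketch: where prompt caching does and does not help.
# The cache_control marker covers the static prefix only; the growing
# history below it is re-billed in full on every call.
import json

def build_request(system_prompt: str, history: list[dict]) -> dict:
    return {
        "model": "claude-example-model",  # placeholder, not a real model ID
        "max_tokens": 1024,
        # Static prefix: cacheable across calls (the "up to 90% discount" part).
        "system": [
            {"type": "text", "text": system_prompt,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic part: grows every turn; caching does nothing for it.
        "messages": history,
    }

history = [{"role": "user", "content": "turn 1"},
           {"role": "assistant", "content": "reply 1"},
           {"role": "user", "content": "turn 2"}]
req = build_request("You are a decision-making agent...", history)
print(json.dumps(req, indent=2)[:200])
```

As the conversation reaches 40 turns, `history` is where the 100K tokens live, and no amount of prefix caching shrinks it.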

This distinction matters and we’ll come back to it.

What We Were Going to Build

The plan was a Go single-binary local reverse proxy. Dead simple to install — change one line (BASE_URL=http://localhost:8080/v1), done. Every outbound API call gets intercepted. Message history gets compressed by a cheap model (GPT-4o-mini). Smaller payload goes out. Your main model never sees the bloat.
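Context Squeezer was never built, but the core compression step we planned can be sketched roughly like this (the function name, the keep-recent heuristic, and the summarizer stub are mine, not the actual design; in the real proxy the `summarize` callable would have been a GPT-4o-mini call):

```python
# Hypothetical sketch of the planned compression step: keep recent turns
# verbatim, collapse older history into a single summary message.
def compress_history(messages, keep_recent=6, summarize=None):
    """summarize: callable turning old messages into a short string,
    e.g. a cheap-model call (stubbed with a placeholder here)."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old) if summarize else f"[{len(old)} earlier turns elided]"
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(40)]
compressed = compress_history(history)
print(len(compressed))  # 7: one summary message plus the 6 most recent turns
```

The proxy's job was simply to apply this transformation to every intercepted request before forwarding it upstream.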

Target: 80% token reduction on dynamic history. Business model: open source core, $29 Pro tier (one-time) with dashboard, smart routing, and history archiving.

Our own pain was real, the tech was straightforward, and we could ship in a week. That was the whole thesis.

The Stress Test That Made Us Look Harder

We put the concept through a structured internal stress test before writing a single line of code. Most of it held up. But one question came back hard: did we actually need to build this, or does something already solve it?

We’d evaluated prompt caching early on and correctly ruled it out. But that question forced us to look more carefully. Not at caching — at compression tools specifically.

That search took about 30 minutes.

Headroom: The Tool We Should Have Found on Day One

github.com/chopratejas/headroom. 718 stars. Actively maintained. Python-based. Open source.

It does context compression for AI agents. It’s free.

Here’s the side-by-side:

| Dimension | Headroom | What We Planned |
| --- | --- | --- |
| Price | Free, open source | Open source + $29 Pro |
| Install | pip install headroom-ai | Download Go binary |
| Compression strategy | AST parsing (code) + statistical analysis (JSON) + ModernBERT (text) — multi-strategy | Single cheap LLM summarization |
| Conversation history | Explicitly supported | Core feature |
| Frameworks | Claude Code, Codex, Cursor, Aider, LangChain, CrewAI | Generic proxy |
| Community | 718 stars, Discord, active dev | Zero |
| Unique features | SharedContext (multi-agent), MCP integration, KV Cache alignment, Learn mode | None |
| Benchmarks | SQuAD 97%, BFCL 97%, built-in eval suite | None |
| Extra API cost per compression | Zero (AST/stats are local) | Every compression = one API call |

We’re not trying to dunk on ourselves here — but looking at that table, the honest answer is: Headroom is better than what we would have shipped, in almost every dimension that matters. Their compression uses actual structural analysis of the content. Ours would’ve called GPT-4o-mini and hoped for the best. Their multi-agent SharedContext feature is something we hadn’t even thought to spec. Their benchmarks exist; ours would have been “we tested it a few times.”

They shipped a real tool. We had a slide deck and six days of planning.

Why We Killed It

The kill decision wasn’t hard once we saw the table clearly.

The problem is real. 177 compactions is a real problem. We’re not killing it because context compression doesn’t matter — it does. We’re killing it because someone already built a better solution and gave it away for free.

Our entire pitch was: cheap model, single binary, open source core, simple enough that anyone can install it. pip install headroom-ai is already that simple. And once you’re inside Headroom, you get AST-based compression, MCP integration, multi-agent context sharing, and a test suite with published benchmarks. Our $29 Pro tier was going to offer… a dashboard.

There was no angle. We closed it.

What We Actually Learned

1. Search GitHub before you write specs.

We designed a full product, stress-tested the concept, got internal approvals — then spent 30 minutes on GitHub and found Headroom. The 30-minute search should have been the first 30 minutes of Day One, not something we did under pressure on Day Four. Embarrassing but fixable. We’re writing it down so it’s actually fixed.

2. “More simple” is not a moat against free.

We told ourselves the Go binary was a differentiator because Python dependencies can be annoying. That’s true. But pip install headroom-ai is not a painful install — it’s one command. Simplicity alone cannot justify a price tag when the free alternative is already simple. You need a moat that isn’t “slightly less friction.”

3. Before you build anything, diagnose exactly what kind of “too much” you have.

This one is the one worth slowing down on.

If your API costs are going up and you’re not sure why, the answer matters a lot before you pick a solution. If you’re sending the same long system prompt on every call, that’s a caching problem — Prompt Caching on Anthropic or OpenAI will cut that cost by up to 90% and you don’t need to build anything. If your conversation history is growing with every turn and ballooning the payload, that’s a compression problem — tools like Headroom are built specifically for that. They’re different shapes of the same symptom. We nearly made a wrong call because we’d initially conflated the two. The diagnostic question is: which part of my payload is growing? Answer that first.

4. Stress-test your own ideas with someone who wants to break them.

Our internal stress test was uncomfortable — it was supposed to be. It raised questions we hadn’t asked ourselves. Some of those were overcorrections. One of them was exactly right. We’ll take that ratio.

5. Killing early is cheap. Killing late is expensive.

We spent a week and zero dollars in development. The alternative — building for two months, shipping, then discovering Headroom during a customer support conversation — would have cost orders of magnitude more. Not just in time, in credibility. The kill at week one is the best possible outcome of a bad starting position.

6. The tool you need probably already exists.

We know this rule. Everyone knows this rule. We still violated it. The rule is: 30 minutes on GitHub before you write a single line of code. It is the highest-ROI activity in product development and it is chronically underdone.

That’s It

Context Squeezer is dead. The problem it was trying to solve is real. If you’re running multi-agent systems and hitting context limits, look at Headroom first — it’s free, it’s maintained, and it’s more technically sophisticated than what most teams would build from scratch.

If you’re confused about prompt caching vs. context compression, re-read Section 1 of this post. They’re different tools for different problems.

We’re a 1-human + 6-AI team. We build things, ship some of them, kill others, and write these autopsies in public because the failure mode we went through is not unique to us. Someone else is planning their own version of Context Squeezer right now. Maybe this saves them a week.

This is an AgentAutopsy post. More autopsies coming — github.com/AgentAutopsy.

AgentAutopsy — dissecting AI agent failures so you don’t have to

Aegis — I built an open-source secrets broker because CyberArk costs more than my salary

Let me paint you a picture.

You join a company. You ask how secrets are managed. Someone looks at their shoes. Eventually you find a .env file in a shared Google Drive folder. It has been there for three years. Nobody knows who created it. It has the production database password in it. Thirteen people have access to the folder.

This is not a horror story. This is Tuesday.

The gap nobody is filling

Secrets management has two tiers and nothing in between.

Tier 1 — Enterprise: CyberArk, HashiCorp Vault (now IBM), AWS Secrets Manager. Powerful, battle-tested, and either eye-wateringly expensive or requiring a dedicated platform team to operate. CyberArk enterprise licences start at six figures. Vault OSS is free but running it reliably in production is a full-time job.

Tier 2 — Nothing: Most teams under 200 people. They use .env files, CI/CD secret stores with no audit trail, or shared password managers never designed for machine-to-machine secrets.

And here is the real problem: most organisations accumulate secrets sprawl over time. Applications that talk directly to CyberArk. Others that hit Vault. A handful pulling from AWS SSM. Each with its own credential logic, its own rotation story, and no centralised visibility. When a safe is renamed, a token expires, or a key leaks — you find out by watching something break in production.

That is what I built Aegis to fix.

What Aegis is

Aegis is a vendor-agnostic secrets broker and PAM gateway. It sits as the only secrets endpoint your applications ever need to know about — regardless of whether those secrets live in CyberArk, HashiCorp Vault, AWS Secrets Manager, or Conjur.

Applications authenticate with a scoped API key (one key per team-registry pair) and receive exactly the secrets they are authorised to see. Every fetch, every rotation, every configuration change is written to an immutable audit log with full attribution. There is no way to touch a secret without leaving a trace.

Your Application                Aegis                    Upstream Vault
      │                            │                            │
      │  GET /secrets              │                            │
      │  X-API-Key: sk_...         │                            │
      │  X-Change-Number: CHG123   │                            │
      ├───────────────────────────►│                            │
      │                            │  1. Hash key → lookup      │
      │                            │     team + registry        │
      │                            │                            │
      │                            │  2. Enforce policy:        │
      │                            │     change number, IP,     │
      │                            │     time window, rate      │
      │                            │                            │
      │                            │  3. Fetch from upstream    │
      │                            ├───────────────────────────►│
      │                            │◄───────────────────────────┤
      │                            │                            │
      │                            │  4. Write audit log        │
      │                            │  5. Emit SIEM event        │
      │                            │                            │
      │  { secret_name: value }    │                            │
      │◄───────────────────────────│                            │

What it handles

Scoped API keys per team

Each team gets one API key per registry they are assigned to. Team A and Team B can both access the same registry with different keys. If one key is compromised, only that assignment needs rotating — the other team is unaffected. Keys are stored as SHA-256 hashes. The plaintext is never persisted.
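The hashing scheme described above can be illustrated in a few lines (this is my sketch of the described behaviour, not Aegis's actual code; function and variable names are invented):

```python
# Sketch of scoped-key storage: only SHA-256 digests are persisted, and
# authentication hashes the presented key to look up its team + registry.
import hashlib
import secrets

def new_api_key():
    key = "sk_" + secrets.token_urlsafe(32)            # plaintext, shown once
    digest = hashlib.sha256(key.encode()).hexdigest()  # only this is stored
    return key, digest

key_store = {}  # digest -> (team, registry) assignment
key, digest = new_api_key()
key_store[digest] = ("team-a", "payments-registry")

def authenticate(presented: str):
    return key_store.get(hashlib.sha256(presented.encode()).hexdigest())

print(authenticate(key))         # ('team-a', 'payments-registry')
print(authenticate("sk_wrong"))  # None
```

Because only the digest is stored, a database dump never reveals usable keys, and rotating one team-registry assignment is a single-row operation.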

Vendor-agnostic secret fetching

Aegis resolves the upstream vendor at fetch time based on the object definition. You can migrate a secret from CyberArk to HashiCorp Vault without touching application code — just update the object definition in Aegis. Supported backends: CyberArk (CCP + PVWA), HashiCorp Vault (KV v1/v2), AWS Secrets Manager / SSM, Conjur (OSS + Enterprise).

Policy enforcement

Policies are defined per team, per registry, or per team-registry pair. Enforceable controls include:

  • IP allowlist — only specific CIDRs can request secrets
  • Time windows — a batch job that runs at 2am can only fetch secrets at 2am
  • Change number enforcement — every request must carry a valid ITSM change reference
  • Rate limiting — per-team RPM cap backed by Redis, prevents runaway services hammering upstream vaults
  • Key expiry — maximum key lifetime configurable per policy
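A rough illustration of how those controls might compose (my sketch, not Aegis's implementation; the policy and request field names are invented for the example):

```python
# Illustrative policy gate: each control either passes or rejects the
# request with a reason, evaluated before any upstream fetch happens.
import ipaddress

def check_policy(policy: dict, request: dict) -> tuple[bool, str]:
    # IP allowlist: source must fall inside a permitted CIDR.
    ip = ipaddress.ip_address(request["source_ip"])
    if not any(ip in ipaddress.ip_network(c) for c in policy["ip_allowlist"]):
        return False, "source IP not in allowlist"
    # Time window: e.g. (2, 3) allows fetches only between 02:00 and 03:00 UTC.
    start, end = policy["time_window"]
    if not (start <= request["utc_hour"] < end):
        return False, "outside permitted time window"
    # Change number enforcement: require a valid ITSM change reference.
    if policy.get("require_change_number") and not request.get("change_number"):
        return False, "missing ITSM change reference"
    return True, "ok"

policy = {"ip_allowlist": ["10.0.0.0/8"], "time_window": (2, 3),
          "require_change_number": True}
ok, reason = check_policy(policy, {"source_ip": "10.1.2.3", "utc_hour": 2,
                                   "change_number": "CHG123"})
print(ok, reason)
```

Rate limiting and key expiry would slot into the same gate, backed by Redis counters and the key's stored creation time.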

Immutable audit logging

Every access is written to audit_log with: timestamp, team identity, registry, objects fetched, source IP, user agent, change number, and outcome. Every admin action is written to change_log with structured before/after diffs. There is no off switch. For regulated environments — financial services, healthcare, public sector — this is the difference between passing and failing a security audit.

SIEM integration

Audit events are emitted as structured JSON to whichever destination you point it at: stdout, Splunk HEC, AWS S3 (gzip JSONL), or Datadog. Configurable at runtime, no code changes needed.

Team self-service model

This is the part I am most pleased with. The security team manages policy — not operations. Teams manage their own:

  • Webhook subscriptions (Slack, MS Teams, Discord, or any HTTP endpoint)
  • CI/CD rotation triggers via auto-generated inbound webhook URLs
  • Notification channels
  • Key rotation

No tickets. No waiting. The security team retains full visibility through the audit log and can override anything — they just do not need to be involved in day-to-day operations.

Designed for scale

Built to handle 100+ teams and 40,000+ secrets under a single security team. The data model is relational and explicit — teams, registries, objects, and the many-to-many assignments between them are all first-class entities with their own audit trails.

The stack

  • FastAPI (Python 3.12) — async, fast, automatic OpenAPI docs
  • PostgreSQL + SQLAlchemy + Alembic — relational, properly migrated, nothing exotic
  • Redis — rate limiting and session tokens
  • Docker + GHCR — single container, published to GitHub Container Registry on every tagged release
  • Terraform — AWS infrastructure modules included
  • GitHub Actions — CI with Bandit static analysis, Trivy CVE scanning on every release. Releases block on CRITICAL/HIGH CVEs.

Get started in five minutes

git clone https://github.com/gustav0thethird/Aegis
cd Aegis
cp config/auth.json.example config/auth.json
docker compose up

Alembic runs migrations on startup. The API is live at http://localhost:8080. Full API reference and configuration docs are in the README.

Why AGPLv3

I chose AGPLv3 deliberately. If you are a team in a regulated environment you need to be able to audit what touches your secrets. With a proprietary tool you are trusting a vendor. With Aegis you can read every line of code that handles your credentials.

AGPLv3 means: use it freely, modify it freely, self-host it freely. If you run it as a network service and make modifications, you share them back. This is the right licence for security tooling.

Who this is for

  • Platform teams at 20–500 person companies who need proper secrets governance without enterprise PAM pricing
  • Regulated industries where audit trails are mandatory — financial services, healthcare, public sector
  • Teams already running Vault or CyberArk who want a controlled, auditable access layer in front of their vault rather than every service talking to it directly
  • Anyone drowning in secrets sprawl across multiple vendors with no central visibility

Come help build it

Aegis is early-stage and actively developed. The core is stable and the architecture is solid — now it needs people who actually run secrets infrastructure at scale to push it further.

What is being worked on:

  • Web UI for policy management
  • LDAP / SSO integration
  • Kubernetes secrets injection
  • Additional vault backends

If you work in security engineering, platform engineering, or regulated infrastructure — your experience is exactly what shapes what gets built next. Open an issue, start a discussion, or send a PR.

Star it on GitHub if it looks useful — it genuinely helps with visibility and lets me know people care that this exists.

github.com/gustav0thethird/Aegis

Real World vs Theory Lessons

In theory, checking disk usage looks simple — just grab the percentage from df -h. But in the real world, scripts break, formats differ, and human‑readable values like 374G don’t compare cleanly. This post is about the lessons learned when theory meets reality.

Disk Usage Monitoring in Linux: Percentage vs. Actual Size

Monitoring disk usage is one of the most common tasks for system administrators and developers. But there’s often confusion between checking percentage usage (df -h) and checking actual disk space (du -sh with numfmt). Let’s break down the challenges, solutions, pros, and cons of each approach.

❓ Common Questions
  • Should I monitor disk usage by percentage or by actual size?
  • Why does my script fail with “integer expression expected” errors?
  • How can I compare human‑readable sizes like 192K, 374G, or 2T against thresholds?
  • Which method is more reliable across different environments (Linux, Git Bash, macOS)?

⚡ The Challenge
Using df -h with percentages
A typical script might look like this:

disk_usage=$(df -h | awk 'NR>1 {print $5}' | sed 's/%//')
for usage in $disk_usage; do
    if [ "$usage" -gt 70 ]; then
        echo "Warning: Disk usage is high ($usage%)"
    else
        echo "Disk usage is normal ($usage%)"
    fi
done

Problem:
  • Sometimes df -h outputs values like 374G in other columns.
  • If parsing isn’t precise, your script may grab 374G instead of 22%.
  • [ -gt ] only works with integers, so 374G causes “integer expression expected” errors.

Using du -sh with numfmt
A more robust approach is to normalize values into bytes:

size=$(du -sh | awk '{print $1}')
bytes=$(numfmt --from=iec "$size")
threshold=$(numfmt --from=iec 100G)

if [ "$bytes" -gt "$threshold" ]; then
    echo "Disk usage is high: $size"
else
    echo "Disk usage is normal: $size"
fi

Here:
  • du -sh → gives human‑readable size (192K, 374G, etc.).
  • numfmt --from=iec → converts those into raw integers (bytes).
  • Thresholds like 100G, 500M, 2T are also converted into bytes.
  • Comparisons are now reliable and portable.

✅ Solutions
For percentage checks: Use df -h | awk 'NR>1 {print $5}' | sed 's/%//' to extract only numeric percentages.

For actual size checks: Use du -sh + numfmt to convert human‑readable values into integers.

Hybrid approach: Monitor both percentage and actual size for a complete picture.
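If you can reach for Python instead of shell, the hybrid check sidesteps the parsing pitfalls entirely: `shutil.disk_usage` returns raw byte counts, so both the percentage and the absolute threshold compare as plain integers. A minimal sketch (thresholds are example values):

```python
# Hybrid disk check: percentage AND absolute-size thresholds, no df/du
# parsing and no numfmt conversion needed.
import shutil

def check_disk(path="/", pct_limit=70, bytes_limit=100 * 1024**3):  # 100 GiB
    total, used, free = shutil.disk_usage(path)  # all raw bytes
    pct = used * 100 // total
    warnings = []
    if pct > pct_limit:
        warnings.append(f"usage {pct}% exceeds {pct_limit}%")
    if used > bytes_limit:
        warnings.append(f"used {used} bytes exceeds {bytes_limit}")
    return pct, used, warnings

pct, used, warnings = check_disk("/")
print(f"{pct}% used, {used} bytes", warnings)
```

This works the same on Linux, macOS, and Windows, which addresses the portability question from the top of the post.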

📊 Pros and Cons

🚀 Conclusion
If you just want a quick warning when usage exceeds 70%, percentage checks with df -h are fine.
If you need robust monitoring across environments, or want to enforce thresholds like “alert me if usage exceeds 100G,” then numfmt is the best choice.
In real production scripts, combining both methods gives the most reliable monitoring.

Choosing a model means measuring cost vs quality on your data

I wanted to evaluate model-based extraction in a way that would tell me more than benchmarks alone. The scenario is building an AI recruiting agent to help match candidates to job postings. To do this, we need to ingest job postings from career pages, aggregators, social media posts, and other messy sources. Every posting needs to be parsed into structured JSON: title, company, salary range, requirements, benefits.

I set up a comparison with a small dataset of 25 job postings across three model tiers to answer a practical question: does the quality difference between a more expensive model and a budget model justify the cost over time?

Setup

For this exploration, I used Baseten’s Model APIs. You can use whatever model provider you like.

I picked three models across the cost spectrum (priced March 2026):

| Tier | Model | Active Params | ~Input $/1M tokens |
| --- | --- | --- | --- |
| Frontier | DeepSeek V3.1 | 671B / 37B active | $0.50 |
| Mid-tier | Nvidia Nemotron 3 Super | 120B / 12B active | $0.30 |
| Budget | OpenAI GPT-OSS-120B | 117B / 5.1B active | $0.10 |

I generated a dataset of 25 job postings with Claude, designed to reflect the kinds of messy variation you see in real job posting data: informal listings, non-English postings, missing or empty fields, hourly rates vs. annual, multiple currencies. For production, this type of data would likely come from multiple sources and be larger.

The extraction prompt asks for valid JSON with ten fields: title, company, location, work model, salary min/max/currency, requirements, nice-to-haves, and benefits. Temperature is set to 0. For the purpose of this exploration, the same system prompt was used for the entire evaluation.

For scoring, scalar fields (title, company, location, and so on) are compared after normalization with exact match for strings, partial credit for substring containment, and a 5% tolerance band for numbers. Array fields (requirements, nice-to-haves, benefits) are scored using set overlap with a word-overlap threshold for fuzzy matching, then taking the minimum of recall and precision. The overall accuracy per posting is a weighted average across all fields, with title and requirements weighted highest because those matter most for this recruiting-agent use case.
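The scoring rules can be sketched as follows (this is my reconstruction of the rules described above, not the repo's code; the partial-credit value and fuzzy-match details are simplified):

```python
# Sketch of the scoring rules: exact/substring match for strings, a 5%
# tolerance band for numbers, min(recall, precision) overlap for arrays.
def score_scalar(pred, truth):
    if isinstance(truth, (int, float)) and isinstance(pred, (int, float)):
        # 5% tolerance band for numeric fields like salary_min/max.
        return 1.0 if truth and abs(pred - truth) / abs(truth) <= 0.05 else 0.0
    p, t = str(pred).strip().lower(), str(truth).strip().lower()
    if p == t:
        return 1.0
    # Partial credit for substring containment, e.g. "Acme" vs "Acme Corp".
    return 0.5 if p and t and (p in t or t in p) else 0.0

def score_array(pred: list[str], truth: list[str]) -> float:
    p, t = {x.lower() for x in pred}, {x.lower() for x in truth}
    if not p or not t:
        return 1.0 if p == t else 0.0
    matched = len(p & t)
    # Penalize both missed items (recall) and hallucinated ones (precision).
    return min(matched / len(t), matched / len(p))

print(score_scalar(150_000, 155_000))  # within the 5% band
print(score_array(["python", "sql"], ["python", "sql", "aws"]))
```

The per-posting score is then a weighted average of these field scores, with title and requirements weighted highest.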

I purposefully included one reasoning model because when you send a prompt to a reasoning model it will “think” first and that output is wrapped in think tags. This is something to consider when building your parser.

Example reasoning response might look like this:

<think>
The posting mentions "$150k - $180k" → I should normalize this to annual integers.
The location says "SF Bay Area" → should I interpret this as San Francisco?
The posting mentions "3 days in office" → this implies Hybrid, not On-site...
</think>
{"title": "Senior Engineer", "company": "Acme Corp", ...}

Reasoning models also affect cost because those thinking tokens count toward output. Nemotron averaged 702 output tokens per call compared to 142 for DeepSeek and 481 for OpenAI.
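The parser guard for a reasoning model can be as small as stripping the think block before handing the remainder to the JSON parser. A minimal sketch:

```python
# Strip <think>...</think> from a reasoning model's output, then parse
# the remaining text as JSON.
import json
import re

def parse_reasoning_output(raw: str) -> dict:
    cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
    return json.loads(cleaned)

raw = ('<think>\nThe posting mentions "$150k - $180k"...\n</think>\n'
       '{"title": "Senior Engineer", "company": "Acme Corp"}')
print(parse_reasoning_output(raw)["title"])  # Senior Engineer
```

A non-reasoning model's output passes through unchanged, so the same parser works for all three models in the comparison.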

The results

| Metric | DeepSeek | Nemotron | OpenAI |
| --- | --- | --- | --- |
| Avg Accuracy | 83.5% | 80.8% | 82.2% |
| JSON Valid Rate | 25/25 | 25/25 | 25/25 |
| Avg Latency | 0.7s | 1.6s | 2.3s |
| Avg Cost/Posting | $0.0004 | $0.0007 | $0.0003 |
| Est. Cost/100K Posts | $42.24 | $66.08 | $28.86 |

All three models produce valid JSON 100% of the time. Accuracy is within a 3-point spread. The budget model retains ~98% of frontier quality at ~70% of the cost.

Where the models actually differ by field

The aggregate scores tell only part of the story. Here’s the per-field breakdown:

| Field | DeepSeek | Nemotron | OpenAI |
| --- | --- | --- | --- |
| title | 88% | 87% | 87% |
| company | 80% | 76% | 80% |
| location | 65% | 69% | 66% |
| work_model | 80% | 76% | 72% |
| salary_min | 82% | 82% | 80% |
| salary_max | 84% | 82% | 80% |
| requirements | 94% | 92% | 93% |
| nice_to_have | 93% | 87% | 93% |
| benefits | 82% | 70% | 83% |

A few things stand out. Location was low for everyone, 65-69% across the board. These postings include things like “SF Bay Area,” “remote (US only),” and locations in Portuguese, so that is not surprising. DeepSeek has a slight edge on work-model extraction and nice-to-have extraction.

Nemotron’s weakest spot is benefits at 70%. The repo does not establish a single cause for that, but the result is a useful reminder that extra reasoning tokens do not automatically translate into better structured extraction.

Requirements extraction was the highest scoring area for all three models.

Human review

In general, automated scoring is not enough to confidently choose a model for your agent. How much you validate and against which fields will vary by use case. You may want to review all fields in a subset of data, or you may have one field that must be 100% correct and choose to audit that field across everything.

Human review might reveal that your automated scoring weights don’t reflect what actually matters for your use case.

In my case, because this was a small exploratory dataset, I reviewed a subset of outputs outside the repo with extra attention on fields that scored lower, especially work_model and location. The repo is meant as a companion for readers to run themselves, not as a checked-in record of my manual review.

A few interesting findings:

When a posting did not name a real company in the main content, such as a recruiter email or something ambiguous like “stealth startup,” all three models either left the company unresolved or returned placeholder-like values such as “Stealth Startup.” That is probably the right behavior for a strict extraction pipeline, but it might not be the behavior we want.

In a posting with dual-currency salary bands, each model handled it differently: one took the first band, one mixed values across both bands, and one returned nothing. This could be handled with a different field design; I was only capturing salary_min and salary_max, with no flexibility for the dual-currency scenario.

In listings that named a specific city but did not state remote, in-office, or hybrid, all models tended to set work_model to null. Whether that is acceptable is another product choice a human needs to make.

Cost at scale

At 100K postings per month:

  • Frontier (DeepSeek V3.1): ~$42/month
  • Mid-tier (Nemotron): ~$66/month
  • Budget (GPT-OSS-120B): ~$29/month

The budget model saves you roughly $13/month over frontier for a 1.3-point accuracy drop (not including adjustments from my human review). Nemotron costs more than both while scoring lower. The thinking tokens make it the worst value for this particular task.

If we scale this to 1M postings, the spread becomes roughly $422 vs $661 vs $289 per month, which makes the cost penalty for the reasoning model much more visible.
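The arithmetic behind these projections is just average per-posting cost times volume. A minimal sketch, with per-posting costs back-derived from the cost table above (illustrative figures, not exact billing numbers):

```javascript
// Project monthly extraction cost from average per-posting cost.
// The per-posting values are back-derived from the "Est. Cost/100K Posts"
// row above, so treat them as illustrative.
function monthlyCost(avgCostPerPosting, postingsPerMonth) {
  return avgCostPerPosting * postingsPerMonth;
}

// DeepSeek, Nemotron, GPT-OSS-120B at 100K postings/month
const at100k = [0.0004224, 0.0006608, 0.0002886].map((c) =>
  monthlyCost(c, 100_000)
); // roughly [42.24, 66.08, 28.86]
```

Scaling the same numbers to 1M postings is a single multiplication away, which is why the cost spread widens linearly with volume.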

Making a final choice

For this use case of structured extraction from messy text at volume, I’d go with the budget model. Even with some small inaccuracies or hallucinations, the value from the budget extraction is still good enough for the proposed build.

Now, you may be thinking about the latency of the budget model (2.3s vs 0.7s), which would matter more if this were user-facing and synchronous. In this case, there is no reason the end user needs to trigger extraction and wait on it directly, so batching is a reasonable fit.

I’d skip the reasoning model for this kind of extraction. Nemotron’s chain-of-thought was sometimes useful when handling ambiguous formatting, but for structured output the extra reasoning cost was not justified by the measured quality here.

Final thoughts

This exploration covers 25 postings. To evaluate for production, you will want a larger sample, as differences of this size could fall within normal sampling variance.

I used the same system prompt throughout; prompt changes could affect the results and are worth exploring.

What you evaluate will depend on your final product as well. Your structured extraction problem might have different failure modes and need different scoring weights than mine.

The main takeaway here is that public benchmarks won’t tell you which model handles your messy data best. Build an evaluation around the task your model actually needs to perform well at and see what comes back.

If you would like to run this analysis yourself, the project is hosted on GitHub. If you have questions or want to chat, please get in touch with me on LinkedIn or X.

JavaOne 2026

JavaOne 2026

If I had to pick one conference that has been instrumental in defining my career, it would be JavaOne. I have attended almost every edition since my first in 1999, including the years it was branded as CodeOne. First as an attendee, later as a speaker. What makes JavaOne special is the quality of the technical content and, of course, the Community. JavaOne is the place to meet the Java Community. JavaOne 2026 was the second JavaOne since the restart back in the Bay Area. It is now a smaller, more boutique-like conference, far from what it was in its heyday at the beginning of the millennium.

I didn’t have a regular talk at this year’s JavaOne and my intention was to go there and enjoy as an attendee. But then the opportunity to host a mentoring session in the Mentoring Hub came up. Since I have done mentoring sessions at the Mentoring Hubs at Jfokus and Devnexus earlier this year, signing up for this was a no-brainer.

I had a session about how to Get Started with Open Source. This is a topic near to my heart, and one a lot of people wonder about.

The Mentorship Hub is the best place to meet new community members, so I ended up hanging around that area most of the time between the sessions I listened to.

JavaOne for me is mostly about the hallway track. And the hallway track this year was just as good as last year. There is no place on the planet where you can bump into so many luminaries in the Java Community.

On Friday, the day after the conference, we had one of our two yearly face-to-face meetings in the Java Community Executive Committee. We had a lot of great presentations about what the different members are doing with and for the community. Since the meeting was held at the Oracle campus, it was a natural choice to take the group photo and some selfies in front of the Oracle-sponsored Team USA America’s Cup boat outside one of the office buildings.

Ivar Grimstad


Dropdowns Inside Scrollable Containers: Why They Break And How To Fix Them Properly

The scenario is almost always the same: a data table inside a scrollable container. Every row has an action menu, a small dropdown with options like Edit, Duplicate, and Delete. You build it, it seems to work perfectly in isolation, and then someone puts it inside that scrollable div and things fall apart. I’ve seen this exact bug in three different codebases, each with a different container, stack, and framework. The bug, though, was identical every time.

The dropdown gets clipped at the container’s edge. Or it shows up behind content that should logically be below it. Or it works fine until the user scrolls, and then it drifts.
You reach for z-index: 9999. Sometimes it helps, but other times it does absolutely nothing. That inconsistency is the first clue that something deeper is happening.

The reason it keeps coming back is that three separate browser systems are involved, and most developers understand each one on its own but never think about what happens when all three collide: overflow, stacking contexts, and containing blocks.

Once you understand how all three interact, the failure modes stop feeling random. In fact, they become predictable.

The Three Things Actually Causing This

Let’s look at each of those items in detail.

The Overflow Problem

When you set overflow: hidden, overflow: scroll, or overflow: auto on an element, the browser will clip anything that extends beyond its bounds, including absolutely positioned descendants.

.scroll-container {
  overflow: auto;
  height: 300px;
  /* This will clip the dropdown, full stop */
}

.dropdown {
  position: absolute;
  /* Doesn't matter -- still clipped by .scroll-container */
}

That surprised me the first time I ran into it. I’d assumed position: absolute would let an element escape a container’s clipping. It doesn’t.

In practice, that means an absolutely positioned menu can be cut off by any ancestor that has a non-visible overflow value, even if that ancestor isn’t the menu’s containing block. Clipping and positioning are separate systems. They just happen to collide in ways that look completely random until you understand both.
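A small debugging helper makes the clipping side concrete: walk up the ancestor chain and return the first element whose overflow value can clip. This is a sketch, not library code; the style getter is injectable so the walk itself can be exercised outside a browser:

```javascript
// Sketch: find the nearest ancestor of `el` that can clip it.
// `getStyle` defaults to getComputedStyle in a browser, but can be
// swapped out for testing with plain objects.
function findClippingAncestor(el, getStyle = (n) => getComputedStyle(n)) {
  for (let node = el.parentElement; node; node = node.parentElement) {
    const { overflow } = getStyle(node);
    // Any non-visible overflow (auto, scroll, hidden, clip) clips
    // descendants, including absolutely positioned ones.
    if (overflow && overflow !== 'visible') return node;
  }
  return null; // nothing between `el` and the root clips it
}
```

Run it against a misbehaving dropdown in the devtools console and it points straight at the ancestor responsible.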

The fix that works regardless of the ancestor tree is to render the menu outside the clipping container entirely. Here’s a React example using createPortal:

import { createPortal } from 'react-dom';
import { useState, useEffect } from 'react';

function Dropdown({ anchorRef, isOpen, children }) {
  const [position, setPosition] = useState({ top: 0, left: 0 });

  useEffect(() => {
    if (isOpen && anchorRef.current) {
      const rect = anchorRef.current.getBoundingClientRect();
      setPosition({
        top: rect.bottom + window.scrollY,
        left: rect.left + window.scrollX,
      });
    }
  }, [isOpen, anchorRef]);

  if (!isOpen) return null;

  return createPortal(
    <div
      id="dropdown-demo"
      role="menu"
      className="dropdown-menu"
      style={{ position: 'absolute', top: position.top, left: position.left }}
    >
      {children}
    </div>,
    document.body
  );
}

And, of course, we can’t ignore accessibility. Fixed elements that appear over content must still be keyboard-reachable. If the focus order doesn’t naturally move into the fixed dropdown, you’ll need to manage it using code. It’s also worth checking that it doesn’t sit over other interactive content with no way to dismiss it. That one bites you in keyboard testing.
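A minimal sketch of that focus wiring, assuming the dropdown element and its trigger button are both in hand (illustrative, not a full menu implementation):

```javascript
// Sketch: move focus into a portal/fixed dropdown when it opens,
// and return focus to the trigger when Escape dismisses it.
function wireKeyboard(menu, trigger) {
  const first = menu.querySelector('[role="menuitem"]');
  if (first) first.focus(); // focus does not flow into a portal naturally

  menu.addEventListener('keydown', (event) => {
    if (event.key === 'Escape') {
      menu.remove();    // dismiss the dropdown...
      trigger.focus();  // ...and restore focus so the user isn't stranded
    }
  });
}
```

A production menu would also handle arrow-key navigation and click-outside dismissal, but this covers the two failures keyboard testing catches first.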

CSS Anchor Positioning: Where I Think This Is Heading

CSS Anchor Positioning is the direction I’m most interested in right now. I wasn’t sure how much of the spec was actually usable when I first looked at it. It lets you declare the relationship between a dropdown and its trigger directly in CSS, and the browser handles the coordinates.

.trigger {
  anchor-name: --my-trigger;
}

.dropdown-menu {
  position: absolute;
  position-anchor: --my-trigger;
  top: anchor(bottom);
  left: anchor(left);
  position-try-fallbacks: flip-block, flip-inline;
}

The position-try-fallbacks property is what makes this worth using over a manual calculation. The browser tries alternative placements before giving up, so a dropdown at the bottom of the viewport automatically flips upward instead of getting cut off.

Browser support is solid in Chromium-based browsers and growing in Safari. Firefox needs a polyfill. The @oddbird/css-anchor-positioning package covers the core spec. I’ve hit layout edge cases with it that required fallbacks I didn’t anticipate, so treat it as a progressive enhancement or pair it with a JavaScript fallback for Firefox.

In short, promising but not universal yet. Test in your target browsers.

And as far as accessibility is concerned, declaring a visual relationship in CSS doesn’t tell the accessibility tree anything. aria-controls, aria-expanded, aria-haspopup — that part is still on you.

Sometimes The Fix Is Just Moving The Element

Before reaching for a portal or making coordinate calculations, I always ask one question first: Does this dropdown actually need to live inside the scroll container?

If it doesn’t, moving the markup to a higher-level wrapper eliminates the problem entirely, with no JavaScript and no coordinate calculations.

This isn’t always possible. If the button and dropdown are encapsulated in the same component, moving one without the other means rethinking the whole API. But when you can do it, there’s nothing to debug. The problem just doesn’t exist.

What Modern CSS Still Doesn’t Solve

CSS has come a long way here, but there are still places it lets you down.

The position: fixed and transform interaction is still there: a transformed ancestor becomes the containing block for fixed-position descendants. That behavior is in the spec intentionally, which means no CSS workaround exists. If you’re using an animation library that wraps your layout in a transformed element, you’re back to needing portals or anchor positioning.

CSS Anchor Positioning is promising, but new. As mentioned earlier, Firefox still needs a polyfill at the time I’m writing this. If you need consistent behavior across all browsers today, you’re still reaching for JavaScript for the tricky parts.

The addition I’ve actually changed my workflow for is the HTML Popover API, now available in all modern browsers. Elements with the popover attribute render in the browser’s top layer, above everything, with no JavaScript positioning needed.

<button popovertarget="dropdown-demo">Open</button>
<div id="dropdown-demo" popover="manual" role="menu">Popover content</div>

Escape handling, dismiss-on-click-outside, and solid accessibility semantics come free for things like tooltips, disclosure widgets, and simple overlays. It’s the first tool I reach for now.

That said, it doesn’t solve positioning. It solves layering. You still need anchor positioning or JavaScript to align a popover to its trigger. The Popover API handles the layering. Anchor positioning handles the placement. Used together, they cover most of what you’d previously reach for a library to do.

A Decision Guide For Your Situation

After going through all of this the hard way, here’s how I actually think about the choice now.

  • Use a portal.
    I’d use this when the trigger lives deep in nested scroll containers. I used this pattern for table action menus and paired it with focus restoration and accessibility checks. It’s the most reliable option, but budget time for the extra wiring.
  • Use fixed positioning.
    This is for when you’re in vanilla JavaScript or a lightweight framework and can verify no ancestor applies transforms or filters. It’s simple to set up and simple to debug, as long as that one constraint holds.
  • Use CSS Anchor Positioning.
    Reach for this when your browser support allows it. If Firefox support is required, pair it with the @oddbird polyfill. This is where the platform is ultimately heading and will eventually become your go-to approach.
  • Restructure the DOM.
    Use this when the architecture permits it, and you want zero runtime complexity. I believe it’s likely the most underrated option.
  • Combine patterns.
    Do this when you want anchor positioning as your primary approach, paired with a JavaScript fallback for unsupported browsers. Or a portal for DOM placement paired with getBoundingClientRect() for coordinate accuracy.

Conclusion

I used to treat this bug as a one-off issue — something to patch and move on from. But once I sat with it long enough to understand all three systems involved — overflow clipping, stacking contexts, and containing blocks — it stopped feeling random. I could look at a broken dropdown and immediately trace which ancestor was responsible. That shift in how I read the DOM was the real takeaway.

There’s no single right answer. What I reached for depended on what I could control in the codebase: portals when the ancestor tree was unpredictable; fixed positioning when it was clean and simple; moving the element when nothing was stopping me; and anchor positioning now, where I can.

Whatever you end up choosing, don’t treat accessibility as the last step. In my experience, that’s exactly when it gets skipped. The ARIA relationships, the focus management, the keyboard behavior — those aren’t polish. They’re part of what makes the thing actually work.

Check out the full source code in my GitHub repo.

Further Reading

These are the references I kept coming back to while working through this:

  • The Stacking Context (MDN)
  • “CSS Anchor Positioning Guide”, Juan Diego Rodriguez
  • “Getting Started With The Popover API”, Godstime Aburu
  • Floating UI (floating-ui.com)
  • CSS Overflow (MDN)

Stop Paying for Slop: A Deterministic Middleware for LLM Token Optimization

Context windows are getting huge, but token budgets are tightening. Every time your agent iterates in an autonomous loop, you’re potentially sending a massive, bloated prompt filled with conversational filler, redundant whitespace, and low-entropy “slop.”

Today, I’ve merged the Prompt Token Rewriter into the Skillware registry (v0.2.1).

It’s a deterministic middleware that aggressively compresses prompts by 50-80% before they ever hit the LLM.

Why does this matter?

  • Lower Costs: Pay only for the “signal,” not the “noise.”
  • Faster Inference: Fewer tokens mean less time spent on KV-caching and long generations.
  • Deterministic Behavior: Because it uses heuristics rather than another expensive LLM call, your agent behavior stays stable and repeatable.

Three Levels of Aggression

The rewriter includes three presets depending on your use case:

  1. Low: Normalizes whitespace and line breaks (Safe for strict code).
  2. Medium: Strips conversational fillers (“please,” “could you,” “ensure that”).
  3. High: Aggressively removes stop-words and non-essential punctuation (Best for machine-to-machine context).
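The registry’s exact heuristics aren’t shown here, but the idea behind the medium preset can be sketched as a deterministic regex pass; no LLM call means the same input always yields the same output (an illustration, not the actual Skillware code):

```javascript
// Illustrative sketch only -- not the actual Prompt Token Rewriter.
// Medium preset idea: strip conversational filler, then normalize whitespace.
const FILLERS = /\b(please|could you|kindly|ensure that|i would like you to)\b/gi;

function rewriteMedium(prompt) {
  return prompt
    .replace(FILLERS, '')       // drop low-signal filler phrases
    .replace(/[ \t]+/g, ' ')    // collapse runs of spaces and tabs
    .replace(/\n{3,}/g, '\n\n') // collapse runs of blank lines
    .trim();
}
```

Because the pass is pure string manipulation, it can run on every agent iteration with negligible latency, which is exactly where the savings compound.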

Join the Registry

We are building a community-driven “App Store” for Agentic Capabilities—decoupling logic from intelligence. If you’ve built a specialized tool for LLM optimization, governance, or logic, we’d love your contribution!

Check out our Contributing Guide to get started.

NocoBase 2.0 Beginner Tutorial – Chapter 2: Data Modeling

Originally published at https://docs.nocobase.com/tutorials/v2/02-data-modeling

In the last chapter, we installed NocoBase and got familiar with the interface. Now it’s time to build the skeleton of our HelpDesk system — the data model.

This chapter creates two collections — Tickets and Categories — and configures field types (single-line text, dropdown, many-to-one relations). The data model is the foundation: once you figure out what data you need and how it’s related, building pages and setting permissions becomes straightforward.

2.1 What Are Collections and Fields

If you’ve used Excel before, this will feel familiar:

Excel Concept NocoBase Concept Description
Worksheet Collection A container for one type of data
Column header Field An attribute describing the data
Each row Record One specific piece of data

02-data-modeling-2026-03-11-08-32-41

For example, our “Tickets” collection is like an Excel spreadsheet — each column is a field (Title, Status, Priority…), and each row is one ticket record.

But NocoBase is much more powerful than Excel. It supports multiple collection types, each with different built-in capabilities:

Type Best For Examples
General Most business data Tickets, Orders, Customers
Tree Hierarchical data Category trees, Org charts
Calendar Date-based events Meetings, Schedules
File Attachment management Documents, Images

Today we’ll use General and Tree collections. We’ll cover the others when needed.

Enter Data Source Manager: Click the “Data Source Manager” icon in the bottom-left corner (the database icon next to the gear). You’ll see the “Main data source” — this is where all our tables live.

02-data-modeling-2026-03-11-08-35-08

2.2 Creating the Core Table: Tickets

Let’s jump right in and create the heart of our system — the Tickets table.

Create the Table

  1. On the Data Source Manager page, click “Main data source” to enter

02-data-modeling-2026-03-11-08-36-06

  2. Click “Create collection”, then select “General collection”

02-data-modeling-2026-03-11-08-38-52

  3. Collection name: tickets, Display name: Tickets

02-data-modeling-2026-03-11-08-40-34

When creating a table, the system pre-checks a set of system fields by default. These automatically track metadata for every record:

Field Description
ID Primary key, unique identifier
Created at When the record was created
Created by Who created the record
Last updated at When it was last modified
Last updated by Who last modified it

Keep these defaults as-is — no manual management needed. You can uncheck them if a specific scenario doesn’t need them.

Adding Basic Fields

The table is created. Now let’s add fields. Click “Configure fields” on the Tickets table, and you’ll see the default system fields already listed.

02-data-modeling-2026-03-11-08-58-48

02-data-modeling-2026-03-11-08-59-47

Click the “Add field” button in the top-right corner to expand a dropdown of field types — pick the one you want to add.

02-data-modeling-2026-03-11-09-00-22

We’ll add the ticket’s own fields first; relation fields come later.

1. Title (Single line text)

Every ticket needs a short title to summarize the issue. Click “Add field” → select “Single line text”:

02-data-modeling-2026-03-11-09-01-00

  • Field name: title, Display name: Title
  • Click “Set validation rules”, add a “Required” rule

02-data-modeling-2026-03-11-09-02-40

2. Description (Markdown(Vditor))

For detailed problem descriptions with rich formatting — images, code blocks, etc. Under “Add field” → “Media” category, you’ll find three options:

Field Type Features
Markdown Basic Markdown, simple styling
Rich Text Rich text editor with attachment uploads
Markdown(Vditor) Most feature-rich: WYSIWYG, instant rendering, and source code editing modes

We’ll go with Markdown(Vditor).

02-data-modeling-2026-03-11-09-09-58

  • Field name: description, Display name: Description

02-data-modeling-2026-03-11-09-10-50

3. Status (Single select)

02-data-modeling-2026-03-11-09-12-00

Tickets go through stages from submission to completion, so we need a status field to track progress.

  • Field name: status, Display name: Status
  • Add option values (each option needs a “Value” and “Label”; color is optional):
Value Label Color
pending Pending Orange
in_progress In Progress Blue
completed Completed Green

02-data-modeling-2026-03-11-09-17-44

Fill in the options and save first. Then click “Edit” on this field again — now you can set the “Default value” to “Pending”.

02-data-modeling-2026-03-11-09-20-28

02-data-modeling-2026-03-11-09-22-34

The first time you create the field, there are no options yet, so you can’t pick a default value — you need to save first, then come back to set it.

Why a single select? Because status is a fixed set of values. A dropdown prevents users from entering arbitrary text, keeping data clean.

4. Priority (Single select)

Helps distinguish urgency so the team can sort and tackle tickets efficiently.

  • Field name: priority, Display name: Priority
  • Add option values:
Value Label Color
low Low
medium Medium
high High Orange
urgent Urgent Red

At this point, the Tickets table has 4 basic fields. But — shouldn’t a ticket have a “category”? Like “Network Issue” or “Software Bug”?

We could make Category a dropdown, but you’d quickly run into a problem: categories can have sub-categories (“Hardware” → “Monitor”, “Keyboard”, “Printer”), and dropdowns can’t handle that.

We need a separate table for categories. And NocoBase’s Tree collection is perfect for this.

2.3 Creating the Categories Tree Table

What Is a Tree Collection

A tree collection is a special type of table with built-in parent-child relationships — every record can have a parent node. This is ideal for hierarchical data:

Hardware          ← Level 1
├── Monitor       ← Level 2
├── Keyboard & Mouse
└── Printer
Software
├── Office Apps
└── System Issues
Network
Account

With a general collection, you’d have to manually create a “Parent Category” field to build this hierarchy. A tree collection handles it automatically and supports tree views, adding child records, and more.

Create the Table

  1. Go back to Data Source Manager, click “Create collection”
  2. This time, select “Tree collection” (not General!)

02-data-modeling-2026-03-11-09-26-07

  3. Collection name: categories, Display name: Categories

02-data-modeling-2026-03-11-09-26-55

After creation, you’ll notice the table has two extra relation fields — “Parent” and “Children” — beyond the standard system fields. This is the tree collection’s special power. Use Parent to access the parent node and Children to access all child nodes, without any manual setup.

02-data-modeling-2026-03-11-09-27-40

Add Fields

Click “Configure fields” to enter the field list. You’ll see the system fields plus the auto-generated Parent and Children fields.
Click “Add field” in the top-right:

Field 1: Category Name

  1. Select “Single line text”
  2. Field name: name, Display name: Name
  3. Click “Set validation rules”, add a “Required” rule

Field 2: Color

  1. Select “Color”
  2. Field name: color, Display name: Color

02-data-modeling-2026-03-11-09-28-59

The Color field gives each category its own visual identity — it will make the interface much more intuitive later.

02-data-modeling-2026-03-11-09-29-23

With that, both tables’ basic fields are configured. Now let’s link them together.

2.4 Back to Tickets: Adding Relation Fields

Relation fields can be a bit abstract at first. If it doesn’t click right away, feel free to skip ahead to Chapter 3: Building Pages and see how data is displayed in practice, then come back here to add the relation fields.

Tickets need to be linked to a category, a submitter, and an assignee. These are called relation fields — instead of storing text directly (like “Title” does), they store the ID of a record in another table, and use that ID to look up the corresponding record.

Let’s look at a specific ticket — on the left are the ticket’s attributes. “Category” and “Submitter” don’t store text; they store an ID. The system uses that ID to find the exact matching record from the tables on the right:

02-data-modeling-2026-03-12-00-50-10

On the interface, you see names like “Network” and “Alice”, but behind the scenes it’s all connected by IDs. Multiple tickets can point to the same category or the same user — this relationship is called Many-to-one.
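Under the hood, resolving a relation field is just an ID lookup. A tiny illustration with hypothetical data (a way to picture it, not NocoBase internals):

```javascript
// Hypothetical data illustrating a many-to-one relation field.
// The ticket stores only category_id; the display name is looked up by ID.
const categories = [
  { id: 1, name: 'Network' },
  { id: 2, name: 'Hardware' },
];

const ticket = { title: 'VPN is down', category_id: 1 };

// Resolve the relation the way the UI does behind the scenes.
const category = categories.find((c) => c.id === ticket.category_id);
// category.name is 'Network'
```

Many tickets can carry the same category_id, which is exactly what makes the relationship many-to-one.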

Adding Relation Fields

Go back to Tickets → “Configure fields” → “Add field”, select “Many to one”.

02-data-modeling-2026-03-12-00-52-39

You’ll see these configuration options:

Option Description How to Fill
Source collection Current table (auto-filled) Don’t change
Target collection Which table to link to Select the target
Foreign key The linking column stored in the current table Enter a meaningful name
Target collection key field Defaults to id Keep as-is
ON DELETE What happens when the target record is deleted Keep as-is

02-data-modeling-2026-03-12-00-58-38

The foreign key defaults to a random name like f_xxxxx. We recommend changing it to something meaningful for easier maintenance. Use lowercase with underscores (e.g., category_id) instead of camelCase.

Add the following three fields:

5. Category → Categories table

  • Display name: Category
  • Target collection: Select “Categories” (if not in the list, type the name and it will be auto-created)
  • Foreign key: category_id

6. Submitter → Users table

Records who submitted this ticket. NocoBase has a built-in Users table — just link to it.

  • Display name: Submitter
  • Target collection: Select “Users”
  • Foreign key: submitter_id

02-data-modeling-2026-03-12-01-00-09

7. Assignee → Users table

Records who is responsible for handling this ticket.

  • Display name: Assignee
  • Target collection: Select “Users”
  • Foreign key: assignee_id

02-data-modeling-2026-03-12-01-00-22

2.5 The Complete Data Model

Let’s review the full data model we’ve built:

02-data-modeling-2026-03-16-00-30-35

}o--|| represents a many-to-one relationship: “many” on the left, “one” on the right.

Summary

In this chapter we completed the data modeling — the entire skeleton of our HelpDesk system:

  1. Tickets (tickets): 4 basic fields + 3 relation fields, created as a General collection
  2. Categories (categories): 2 custom fields + auto-generated Parent/Children fields, created as a Tree collection with built-in hierarchy support

Key concepts we learned:

  • Collection = A container for one type of data
  • Collection types = Different types for different scenarios (General, Tree, etc.)
  • Field = A data attribute, created via “Configure fields” → “Add field”
  • System fields = ID, Created at, Created by, etc. — auto-checked when creating a table
  • Relation field (Many-to-one) = Points to a record in another table, linking tables together

You may notice that later screenshots already contain data — we pre-loaded test data for demonstration purposes. In NocoBase, all CRUD operations are done through the frontend pages. Chapter 3 covers building tables to display data, and Chapter 4 covers forms for data entry — stay tuned.

Next Chapter Preview

The skeleton is ready, but the tables are still empty. In the next chapter, we’ll build pages to make the data visible.

See you in Chapter 3!

Related Resources

  • Data Sources Overview — Core data modeling concepts in NocoBase
  • Field Types — Complete field type reference
  • Many-to-One Relations — Relationship configuration guide