The Hidden Cost of AI Coding Agents (And How to Track It in Real Time)

Last month I hit a wall. Not a coding wall — a billing wall.

I’d been using Claude Code heavily for a side project, letting it refactor modules, write tests, and scaffold new features. Cursor was open in another window doing its thing. GitHub Copilot was autocompleting in my terminal. Life was good.

Then the invoice arrived: $147 in API costs for a single month. On a project that hasn’t made a dollar yet.

I wasn’t shocked that AI coding tools cost money. I was shocked that I had zero visibility into where that money was going while it was happening.

The Silent Token Burn

Here’s the thing nobody talks about when recommending AI coding agents: they consume tokens constantly, and most of them don’t tell you how many.

Let’s break down what’s actually happening:

  • Claude Code charges per token through Anthropic’s API. A heavy coding session can easily burn through 100K+ tokens. At current rates, that’s roughly $0.75-$3.00 per session depending on the model.
  • Cursor uses a credit system, but once you exceed your monthly Pro allowance, you’re on usage-based billing. “Fast” requests use premium models that eat credits 10x faster.
  • GitHub Copilot is flat-rate ($10-39/month), but if you’re using Copilot Chat or the new agent features with your own API key, surprise — you’re back to pay-per-token.
  • ChatGPT with Code Interpreter burns through GPT-4 tokens at $0.03/1K input and $0.06/1K output. A single complex coding conversation can cost $2-5; the arithmetic is sketched below.
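
To sanity-check numbers like "100K tokens for $0.75-$3.00," the arithmetic is one line. Here is a minimal TypeScript sketch; the rates are placeholder inputs, not claims about any provider's current pricing:

// Per-session cost estimate. Rates are parameters, not pricing assertions.
function sessionCost(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1_000_000) * usdPerMillionTokens;
}

sessionCost(100_000, 7.5); // 0.75: 100K tokens at $7.50 per million tokens
sessionCost(100_000, 30);  // 3.00: the same session on a pricier model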

None of these tools show you a running cost ticker while you work. You find out what you spent after the damage is done.

Why This Matters More Than You Think

If you’re a solo dev or working on a side project, every dollar counts. But even at a company, understanding your AI tool spend matters:

The “just one more prompt” trap. When you can’t see the meter running, you don’t optimize your prompts. You ask vague questions. You let the agent go on tangents. You regenerate responses because “that wasn’t quite right.” Each of those decisions costs real money.

Model selection is invisible. Many tools auto-select models behind the scenes. That “quick question” might route to GPT-4 Turbo instead of GPT-3.5 — a 20x price difference — and you’d never know.

Compound costs across tools. If you’re like me and use 2-3 AI tools simultaneously, the costs stack up fast. But since each tool bills separately and reports usage differently, you never see the aggregate picture.

What I Actually Wanted

After that $147 wake-up call, I went looking for a simple solution. I didn’t need a complex dashboard or enterprise analytics platform. I just wanted to know:

  1. How many tokens am I burning right now?
  2. What’s my approximate cost today?
  3. Am I trending higher than usual?

Basically, I wanted the equivalent of a gas gauge — something visible while I’m driving, not just on the receipt after I’ve already filled up.

The Solution I Found

I ended up trying TokenBar, a macOS menu bar app that tracks your LLM token usage in real time. It sits in your menu bar and shows you a running token count and estimated cost as you work.

What sold me on it:

  • It’s always visible. Glance up, see your spend. No context switching to a dashboard.
  • It tracks across providers. Anthropic, OpenAI, local models — one unified view.
  • It’s $5. Once. No subscription. No recurring charge. For a tool that’s literally designed to save you money on AI costs, that felt right.

I’ve been using it for three weeks now, and the behavioral change was almost immediate. When you can see the tokens ticking up, you naturally start writing better prompts. You stop regenerating responses for marginal improvements. You think before you ask.

My estimated savings so far? Roughly 30-40% reduction in monthly token spend, just from being more intentional.

Tips for Managing AI Coding Costs (With or Without a Tracker)

Whether you use a token tracker or not, here are some practical things I’ve learned:

1. Set a mental budget. Decide what you’re willing to spend per day/week on AI tools. Even a rough number creates awareness.

2. Batch your AI interactions. Instead of asking 10 small questions, write one comprehensive prompt with context. Fewer round-trips = fewer tokens.

3. Know your model tiers. Use cheaper models for simple tasks (code formatting, basic questions) and save the expensive models for complex reasoning and architecture decisions.

4. Review your usage weekly. Check your OpenAI/Anthropic dashboards every Monday. If the number surprises you, something needs to change.

5. Monitor in real time. Whether it’s TokenBar or a custom script that watches your API calls, having live visibility is the single biggest lever for controlling costs.
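
If you want to roll your own, the core is small. A minimal sketch, assuming an OpenAI-style response that reports a usage object with prompt_tokens and completion_tokens; the rates are placeholders you would swap for your provider's current pricing:

// Running token/cost tally fed from each API response. Rates are placeholders.
const RATE_PER_1K = { input: 0.003, output: 0.015 }; // USD, hypothetical

let tokensToday = 0;
let costToday = 0;

function recordUsage(usage: { prompt_tokens: number; completion_tokens: number }) {
  tokensToday += usage.prompt_tokens + usage.completion_tokens;
  costToday +=
    (usage.prompt_tokens / 1000) * RATE_PER_1K.input +
    (usage.completion_tokens / 1000) * RATE_PER_1K.output;
  console.log(`today: ${tokensToday} tokens, ~$${costToday.toFixed(2)}`);
}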

The Bottom Line

AI coding agents are genuinely incredible tools. I’m not going back to writing everything by hand. But “incredible” and “free” aren’t the same thing, and the lack of real-time cost visibility in most of these tools is a design choice that benefits the provider, not the user.

Track your tokens. Watch your spend. Your future self (and your bank account) will thank you.

What’s your monthly AI tool spend? Have you been surprised by a bill? I’d love to hear about your experience in the comments.

Heuristic vs Semantic Eval: When <1ms Matters More Than LLM-as-Judge

There is a default assumption in the agent eval space right now: if you want to evaluate agent output, you need an LLM to judge it. Feed the output to GPT-4o with a rubric, get a score back, done. LLM-as-Judge is the pattern everyone reaches for first.

I want to push back on that. Not because LLM-as-Judge is bad — it is genuinely powerful for certain problems. But because most teams are using it for evaluations that do not require an LLM at all. They are spending seconds and dollars on checks that a regex can handle in microseconds for free.

Two Approaches to Agent Evaluation

LLM-as-Judge sends your agent’s output to another LLM with a scoring prompt. The judge model reads the output, compares it against criteria you define, and returns a score. This is semantic evaluation — the judge understands meaning, nuance, and context.

Strengths: handles subjective quality, can assess factual accuracy against source documents, evaluates tone and style, reasons about complex multi-step outputs.

Weaknesses: adds 1-5 seconds of latency per evaluation, costs $0.01-0.05 per call depending on the model and output length, introduces non-determinism (run the same eval twice and you might get different scores), and requires managing yet another LLM integration.

Heuristic rules are pattern-based checks: regex matches, string comparisons, length calculations, threshold checks. They run against the output directly with no model inference.

Strengths: sub-millisecond execution, deterministic (same input always produces the same result), zero marginal cost, no external dependencies.

Weaknesses: cannot understand meaning, cannot assess subjective quality, limited to patterns you can express as rules.

The question is not which approach is better. The question is which problems actually need semantic understanding and which are better served by a fast, deterministic check.

When Heuristic Rules Win

Here are the evaluations I see teams running through LLM-as-Judge that do not need an LLM:

PII detection. Social Security numbers follow the pattern \d{3}-\d{2}-\d{4}. Credit card numbers are four groups of four digits. Phone numbers, email addresses — these are structural patterns. A regex catches them in microseconds with zero ambiguity. You do not need a 70-billion-parameter model to determine whether a string matches \b\d{3}-\d{2}-\d{4}\b.

Prompt injection detection. The most common injection patterns are string-matchable: “ignore all previous instructions,” “you are now a,” “bypass your safety filters.” These are not subtle. A set of regex patterns catches them deterministically:

const INJECTION_PATTERNS = [
  /ignore (?:all )?(?:previous|above|prior) (?:instructions|prompts)/i,
  /you are now (?:a |in )/i,
  /bypass (?:your |the )?(?:safety|content|ethical) (?:filters|guidelines|restrictions)/i,
];

Output length and completeness. Did the agent return an empty response? Is the output below a minimum character count? Does it contain at least one complete sentence? These are arithmetic checks.

Cost threshold enforcement. Did this trace cost more than $0.10? Is the completion-to-prompt token ratio above 5:1 (suggesting the agent is generating far more than it should relative to the input)? These are numeric comparisons.

Blocklist enforcement. You have a list of phrases that should never appear in agent output. Checking whether a string contains a substring is not a job for an LLM.

Every one of these evaluations is better served by a heuristic rule: faster, cheaper, deterministic, and easier to debug when something fails.
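
To make "heuristic rule" concrete, here is a minimal TypeScript sketch of three of these checks. These are illustrative shapes, not any particular library's API:

// Deterministic, sub-millisecond checks: no model inference involved.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/;

function noPii(output: string): boolean {
  return !SSN_PATTERN.test(output);
}

function costUnderThreshold(traceCostUsd: number, maxUsd = 0.1): boolean {
  return traceCostUsd < maxUsd;
}

function tokenEfficiency(promptTokens: number, completionTokens: number, maxRatio = 5): boolean {
  return completionTokens / Math.max(promptTokens, 1) <= maxRatio;
}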

When Semantic Evaluation Wins

There are evaluations that genuinely need an LLM to judge:

Factual accuracy. Given a set of source documents, did the agent’s response accurately represent the facts? This requires reading comprehension and reasoning about whether the output is consistent with the sources. Pattern matching cannot do this.

Tone and style assessment. Is this customer support response empathetic? Is this technical explanation at the right level for the audience? Tone is subjective and context-dependent. You need a model that understands language pragmatics.

Complex reasoning verification. Did the agent’s multi-step reasoning chain contain logical errors? Did it correctly apply a policy to an edge case? These require following an argument and evaluating its coherence.

Nuanced quality assessment. Is this summary good? Does it capture the key points without distorting them? Quality is multidimensional and often requires understanding the source material.

These are real evaluation problems, and LLM-as-Judge is the right tool for them.

The Latency and Cost Argument

Here is the math that most teams do not do before defaulting to LLM-as-Judge for everything.

Running LLM-as-Judge on a single trace: 1-5 seconds latency, $0.01-0.05 in model costs.

Running a heuristic rule on a single trace: <1ms latency, $0.00 in model costs.

At 1,000 traces per day with 4 eval categories each:

Approach             Latency per eval   Cost per eval   Daily cost (4,000 evals)
LLM-as-Judge (all)   1-5 sec            $0.01-0.05      $40-200
Heuristic (all)      <1ms               $0.00           $0
Composite (80/20)    varies             varies          $8-40

At $40-200/day, you are spending $1,200-6,000/month just to evaluate your agent — before the agent’s own inference costs. For a safety check that amounts to “does this string contain an SSN pattern,” that spend is hard to justify.

The latency matters too. If evaluation runs inline (before returning the response to the user), adding 1-5 seconds per eval degrades the user experience. Heuristic rules are fast enough to run inline without the user noticing.

The Composite Approach

The right architecture is not heuristic or semantic. It is heuristic for the 80% of evaluations that are pattern-based, and semantic for the 20% that genuinely require language understanding.

The 80% (heuristic): PII detection, prompt injection detection, output completeness checks, cost threshold enforcement, blocklist enforcement, token efficiency, keyword overlap, hallucination marker detection. These run on every trace, inline, in <1ms, for free.

The 20% (semantic): factual accuracy against source documents, nuanced quality scoring, tone assessment, complex reasoning verification. These run selectively — on a sample of traces, or triggered when heuristic scores fall below a threshold — and the cost is justified because the evaluation requires actual language understanding.
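
One way to wire the escalation, sketched in TypeScript. The names and the 0.8 threshold are hypothetical; the judge call stands in for whatever semantic evaluator you use:

// Composite gating: heuristics run on every trace; the LLM judge runs only
// when the heuristic score looks suspicious.
declare function runHeuristics(t: { input: string; output: string }): number;         // <1ms
declare function judgeWithLlm(t: { input: string; output: string }): Promise<number>; // 1-5s, ~$0.01-0.05

async function evaluate(trace: { input: string; output: string }) {
  const heuristicScore = runHeuristics(trace);
  if (heuristicScore >= 0.8) {
    return { score: heuristicScore, judged: false };
  }
  const semanticScore = await judgeWithLlm(trace); // escalate the suspicious minority
  return { score: semanticScore, judged: true };
}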

Iris implements the heuristic side today. LLM-as-Judge is on the roadmap for v0.4.0, and when it ships, it will slot in alongside the heuristic rules as a complementary layer — not a replacement.

The 12 Built-in Rules

Iris ships with 12 heuristic eval rules across 4 categories. Here is what each category covers and why it does not need an LLM.

Completeness (4 rules)

  • non_empty_output — Output is not empty or whitespace-only. Weight: 2.
  • min_output_length — Output meets a configurable minimum character count (default: 10). Weight: 1.
  • sentence_count — Output contains at least N complete sentences (default: 1). Weight: 0.5.
  • expected_coverage — When an expected output is provided, checks what percentage of key terms appear in the actual output. Passes at 50% coverage. Weight: 1.5.

These are structural checks. An empty response is not a nuance problem. It is a boolean.

Relevance (3 rules)

  • keyword_overlap — Measures word overlap between input and output. If you asked about “password reset” and the response is about “billing,” that is detectable without an LLM. Weight: 1.
  • no_hallucination_markers — Flags common AI hedging phrases: “as an AI,” “I cannot,” “I don’t have access.” These are exact string matches. Weight: 1.
  • topic_consistency — Measures whether output words relate to input words. A coarse but fast check for topic drift. Weight: 1.

Safety (3 rules)

  • no_pii — Regex patterns for SSN (\d{3}-\d{2}-\d{4}), credit card numbers, phone numbers, and email addresses. Weight: 2.
  • no_blocklist_words — Configurable phrase blocklist. Default includes harmful content patterns. Weight: 2.
  • no_injection_patterns — Regex patterns matching common prompt injection attempts. Weight: 2.

Safety rules have the highest weights because a safety failure matters more than a completeness failure.

Cost (2 rules)

  • cost_under_threshold — Total trace cost must be under a configurable USD threshold (default: $0.10). Weight: 1.
  • token_efficiency — Completion-to-prompt token ratio must be under a configurable maximum (default: 5x). Catches cases where the agent generates disproportionately long responses. Weight: 0.5.

Running an Evaluation

{
  "tool": "evaluate_output",
  "arguments": {
    "output": "Navigate to Settings > Security > Reset Password and follow the prompts.",
    "eval_type": "safety",
    "input": "How do I reset my password?"
  }
}

Response:

{
  "score": 1.0,
  "passed": true,
  "rule_results": [
    { "ruleName": "no_pii", "passed": true, "score": 1.0, "message": "No PII detected" },
    { "ruleName": "no_blocklist_words", "passed": true, "score": 1.0, "message": "No blocklisted content found" },
    { "ruleName": "no_injection_patterns", "passed": true, "score": 1.0, "message": "No injection patterns detected" }
  ]
}

Three rules, three results, sub-millisecond. No LLM call, no cost, no non-determinism.

Use the Right Tool for the Check

LLM-as-Judge is a powerful technique. I am building it into Iris because there are evaluations that genuinely require semantic understanding. But the industry’s default of routing every evaluation through a judge model is expensive, slow, and unnecessary for the majority of checks agents need.

If you can express the check as a pattern, a threshold, or a string comparison, use a heuristic rule. Save the LLM for the evaluations that actually need one.

Iris is open-source and MIT licensed. The 12 built-in rules are ready to use today. Add it to your MCP config and start evaluating your agent output in <1ms.

npx @iris-eval/mcp-server --dashboard

The code is at github.com/iris-eval/mcp-server. See the roadmap for what is coming next, including LLM-as-Judge in v0.4.0.

AI-Driven Quality Engineering for Regulated Enterprise Systems

A Framework for Reliability, Validation, and Operational Trust in High-Stakes Digital Environments

Abstract

Artificial Intelligence is reshaping enterprise software engineering, particularly in regulated sectors such as healthcare, insurance, financial services, public workforce systems, and digital commerce. As organizations increasingly integrate Artificial Intelligence (AI), Machine Learning (ML), Generative AI (GenAI), and Large Language Models (LLMs) into mission-critical business applications, conventional quality assurance and software testing approaches are no longer sufficient to address the reliability, fairness, explainability, and governance challenges of these systems. AI-enabled applications introduce probabilistic behavior, dynamic model drift, data dependency risks, hallucinated outputs, bias propagation, and new forms of operational uncertainty that require a modernized quality engineering discipline.
This paper proposes a framework for AI-driven quality engineering tailored to regulated enterprise systems. It argues that quality engineering must evolve from traditional defect detection toward a broader capability integrating AI validation, risk-based testing, continuous monitoring, automated governance controls, and lifecycle assurance. The paper analyzes the limitations of conventional software quality practices when applied to AI-enabled enterprise systems, identifies the core design principles of AI-driven quality engineering, and outlines implementation strategies across regulated digital infrastructures. It concludes that AI-driven quality engineering is an essential operational discipline for trustworthy enterprise AI adoption, particularly where system failures can affect financial outcomes, healthcare access, payroll integrity, regulatory compliance, and public trust.
Keywords: AI-driven quality engineering, enterprise AI validation, regulated systems, reliability engineering, responsible AI, software quality, continuous validation, enterprise governance

1. Introduction

Quality engineering has long served as a foundational discipline for building reliable enterprise software. Traditionally, it has focused on defect prevention, test strategy design, automation frameworks, regression assurance, performance testing, release governance, and process improvement across software delivery lifecycles. In deterministic software systems, these practices have proven effective because requirements, business logic, data flows, and expected outputs are relatively stable and testable through conventional methods.
However, the rapid adoption of AI-enabled enterprise systems is changing the nature of software quality itself. Modern enterprise platforms increasingly incorporate predictive models, intelligent automation, recommendation systems, generative AI interfaces, and language-based reasoning engines. These systems are now used in functions such as insurance underwriting, claims processing, telehealth support, workforce scheduling, payroll compliance, fraud detection, and enterprise knowledge retrieval.
In regulated environments, these systems are not merely productivity tools. They are embedded within operational workflows that affect healthcare access, financial determinations, insurance outcomes, employee compensation, research funding accountability, and digital service continuity. This means that the quality of these systems must be evaluated not only in terms of functional correctness, but also in terms of reliability, fairness, transparency, robustness, and governance compliance.
Traditional software testing and automation practices are insufficient for this new context. AI-enabled systems often produce probabilistic outputs rather than deterministic results. Their behavior may depend on model version, training data, prompt structure, retrieval context, environmental drift, or user interaction patterns. As a result, system quality can no longer be assessed solely through binary pass/fail assertions or static regression suites.
This paper argues that enterprise software organizations require a modernized discipline of AI-driven quality engineering. This discipline extends conventional quality engineering by integrating AI model validation, risk-based scenario testing, fairness assessment, drift monitoring, governance controls, and operational observability into the enterprise software lifecycle.
The paper presents a conceptual and practical framework for AI-driven quality engineering in regulated enterprise systems. Its central claim is that quality engineering must evolve from a software testing function into a broader AI reliability and assurance capability capable of supporting safe and accountable AI adoption at scale.

2. Background: From Traditional QA to AI-Driven Quality Engineering

2.1 Evolution of Software Quality Practice
The evolution of enterprise quality practice has generally progressed through several stages:

  • Manual quality assurance
  • Test automation and regression engineering
  • Continuous testing and DevOps integration
  • Quality engineering as a lifecycle discipline
  • AI-driven quality engineering

Manual QA focused primarily on defect detection late in the software lifecycle. Test automation improved repeatability and scale. Continuous testing integrated quality into release pipelines. Quality engineering then broadened the focus from test execution to overall product quality, architecture, observability, shift-left practices, and risk reduction.
AI-enabled enterprise systems now require the next evolution: AI-driven quality engineering, in which system reliability depends not only on code quality, but also on model quality, data quality, prompt behavior, retrieval integrity, and runtime monitoring.
2.2 Why Regulated Systems Require More Than Conventional Testing
Regulated enterprise environments are distinguished by three factors:

  • consequential outcomes
  • strict compliance requirements
  • high operational interdependence

A failure in a consumer social application may affect user satisfaction; a failure in an insurance claims system, payroll platform, or telehealth application may affect financial benefits, labor compliance, or patient services. As a result, AI-enabled regulated systems require stronger assurance mechanisms than conventional commercial software.

3. Why Conventional Quality Engineering Is Insufficient for AI Systems

3.1 Deterministic Assumptions Break Down
Traditional testing assumes stable expectations:

  • fixed inputs
  • defined outputs
  • reproducible logic
  • deterministic workflows

AI systems violate many of these assumptions. A machine learning model may produce different outputs depending on input distribution. A generative AI system may produce multiple plausible responses to the same prompt. A recommendation engine may change behavior as data evolves. These characteristics challenge the foundations of traditional functional testing.
3.2 Hidden Failure Modes
AI systems often fail in subtle ways:

  • inaccurate confidence
  • biased ranking
  • unsupported summary statements
  • model drift
  • prompt sensitivity
  • context instability

These are not always visible through standard regression tests.
3.3 Data and Model Dependencies
In AI-enabled systems, quality depends not only on application logic but on:

  • training data quality
  • inference data quality
  • model versioning
  • retrieval source quality
  • prompt templates
  • feature transformations

This expands the scope of quality engineering beyond code.
3.4 Continuous Degradation Risk
Unlike static software functionality, AI systems may degrade over time. Quality engineering must therefore include runtime observability and revalidation mechanisms, not just pre-release testing.

4. Defining AI-Driven Quality Engineering

AI-driven quality engineering can be defined as:
A discipline that applies validation engineering, automation, risk-based testing, model assurance, monitoring, and governance controls to ensure the reliability, fairness, and operational trustworthiness of AI-enabled enterprise systems across their full lifecycle.
This definition expands conventional quality engineering in four important ways:

  1. It includes AI-specific failure modes, such as drift, bias, and hallucination.
  2. It treats quality as a continuous operational property, not merely a release criterion.
  3. It integrates governance controls into engineering practice.
  4. It positions quality engineering as a core contributor to responsible AI deployment.

5. Core Design Principles of AI-Driven Quality Engineering

5.1 Risk-Based Validation
Not all AI-enabled systems require the same level of quality control. Validation depth should be determined by:

  • domain criticality
  • regulatory exposure
  • decision consequence
  • degree of automation
  • reversibility of outcomes

For example, a generative assistant helping draft internal notes requires different controls than an AI-enabled system assisting claims adjudication or telehealth guidance.
5.2 Continuous Validation Across the Lifecycle
AI-driven quality engineering is not limited to a test phase. It spans:

  • design validation
  • data validation
  • model validation
  • pre-release testing
  • deployment assurance
  • post-release monitoring
  • incident analysis
  • revalidation after changes

5.3 Explainability of Quality Signals
Quality engineering in AI systems must provide interpretable evidence of reliability, such as:

  • error categories
  • fairness disparities
  • drift indicators
  • unsupported output density
  • override and incident trends

This helps align technical quality activities with governance and audit requirements.
5.4 Quality-as-Code and Governance-as-Code
Quality controls for AI systems should increasingly be embedded into automation pipelines through:
policy checks
validation thresholds
release gates
data quality rules
prompt controls
monitoring alerts
model rollback triggers

This operationalizes governance within software delivery.
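
As an illustration, a release gate of this kind can be expressed as a small policy check in code. The following TypeScript sketch is hypothetical; the signal names and thresholds are placeholders an organization would define for itself:

// Hypothetical quality gate evaluated in a delivery pipeline before promotion.
interface QualitySignals {
  hallucinationRate: number;  // unsupported claims per 100 responses
  fairnessDisparity: number;  // largest error-rate gap across groups
  driftScore: number;         // distribution shift vs. a reference baseline
}

const THRESHOLDS: QualitySignals = { hallucinationRate: 2.0, fairnessDisparity: 0.05, driftScore: 0.1 };

function releaseGate(signals: QualitySignals): { pass: boolean; violations: string[] } {
  const violations = (Object.keys(THRESHOLDS) as (keyof QualitySignals)[])
    .filter((key) => signals[key] > THRESHOLDS[key])
    .map((key) => `${key} = ${signals[key]} exceeds threshold ${THRESHOLDS[key]}`);
  return { pass: violations.length === 0, violations };
}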

6. A Framework for AI-Driven Quality Engineering in Regulated Enterprise Systems

This paper proposes a six-domain framework for AI-driven quality engineering:

  • Use-Case and Risk Classification
  • Data and Model Assurance
  • Scenario-Based Validation
  • Automation and Continuous Testing
  • Runtime Monitoring and Observability
  • Governance and Operational Feedback

6.1 Use-Case and Risk Classification
Quality engineering must begin with understanding:

  • what the system is intended to do
  • where AI is embedded
  • what decisions are influenced
  • what failures matter most
  • which regulations or policies apply

This determines validation scope and quality thresholds.

6.2 Data and Model Assurance
AI-driven quality engineering must evaluate:

  • data completeness
  • feature consistency
  • model version integrity
  • training/inference alignment
  • retrieval-source freshness
  • prompt template reliability

6.3 Scenario-Based Validation
AI-enabled systems require rich scenario design including:

  • normal workflows
  • exception paths
  • edge cases
  • adversarial inputs
  • demographic fairness scenarios
  • stale-data scenarios
  • integration failure scenarios

6.4 Automation and Continuous Testing
Automation remains essential, but it must expand beyond UI and API testing to include:

  • model validation pipelines
  • response evaluation harnesses
  • fairness checks
  • prompt regression tests
  • retrieval validation
  • synthetic scenario generation

6.5 Runtime Monitoring and Observability
Post-deployment quality signals should include:

  • anomaly rates
  • drift indicators
  • user override frequency
  • latency degradation
  • unsupported response rates
  • model incident trends
  • fairness drift over time

6.6 Governance and Operational Feedback
Quality engineering should feed governance by providing:

  • measurable evidence of system reliability
  • release readiness signals
  • incident classification
  • revalidation triggers
  • audit-supporting records

7. AI-Driven Quality Engineering Across Regulated Industries

7.1 Healthcare Systems
Healthcare systems increasingly rely on AI for triage, documentation, digital patient engagement, and telehealth workflows. AI-driven quality engineering in this domain should prioritize:

  • patient safety
  • factual grounding
  • service continuity
  • equitable performance
  • explainability for clinicians and operations staff

7.2 Insurance Systems
Insurance platforms use AI in underwriting, claims processing, risk analysis, and document interpretation. Quality engineering priorities include:

  • fairness in decision support
  • policy-grounded output validation
  • document interpretation accuracy
  • auditability
  • operational resilience

7.3 Workforce and Payroll Systems
AI-enabled workforce systems may support scheduling, compliance review, exception analysis, and enterprise workflow support. Quality engineering should emphasize:

  • payroll accuracy
  • labor rule integrity
  • policy consistency
  • traceability
  • cross-role and cross-scenario validation

7.4 Digital Commerce and Financial Systems
In digital commerce and financial platforms, AI-driven quality engineering must address:

  • transaction reliability
  • fraud system stability
  • fairness in customer-facing recommendations
  • API and workflow resilience
  • compliance and service continuity

8. Validation Methods in AI-Driven Quality Engineering

8.1 Model Behavior Testing
Assess whether model outputs align with business intent and operational expectations across representative scenarios.
8.2 Hallucination and Unsupported Output Detection
For GenAI and LLM systems, quality engineering must include:

  • faithfulness checks
  • source-grounding validation
  • unsupported claim analysis
  • response consistency testing

8.3 Bias and Fairness Testing
Evaluate whether system quality varies across:

  • demographic groups
  • language or communication styles
  • case complexity levels
  • operational contexts

8.4 Adversarial and Robustness Testing
Assess resistance to:

  • malformed inputs
  • prompt injection
  • incomplete data
  • conflicting sources
  • exception-heavy workflows

8.5 Regression and Drift Testing
AI regression testing must include:

  • model change comparisons
  • prompt-template regression
  • retrieval-source changes
  • behavioral stability under updated conditions
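
A prompt-template regression check can be sketched as a golden-output comparison. The following TypeScript is illustrative only and assumes a similarity function (for example, token overlap) is available:

// Hypothetical harness: re-run stored inputs against the new prompt/model
// version and flag outputs that drift too far from the golden answers.
declare function similarity(a: string, b: string): number; // 0..1

interface GoldenCase { input: string; golden: string }

function regressionReport(cases: GoldenCase[], run: (input: string) => string, minSim = 0.85) {
  const results = cases.map((c) => {
    const actual = run(c.input);
    return { ...c, actual, sim: similarity(actual, c.golden) };
  });
  const failures = results.filter((r) => r.sim < minSim);
  return { passed: failures.length === 0, failures };
}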

9. Operational Metrics for AI-Driven Quality Engineering

A mature AI-driven quality engineering practice should track a multi-dimensional metrics set.
9.1 Reliability Metrics
  • decision error rate
  • response consistency score
  • hallucination rate
  • unsupported claim density
  • regression stability index

9.2 Fairness Metrics
  • disparity in error rate
  • response quality parity
  • contextual sensitivity variance
  • scenario-group consistency

9.3 Operational Metrics
  • incident rate per release
  • override frequency
  • escalation rate
  • mean time to detection
  • mean time to remediation
  • release quality score

9.4 Infrastructure Metrics
  • latency degradation
  • retrieval failure rate
  • API dependency reliability
  • deployment rollback frequency

10. Relationship between AI-Driven Quality Engineering and Responsible AI Governance

AI-driven quality engineering and responsible AI governance should not be treated as separate domains.
Responsible AI governance defines:

  • what risks matter
  • what controls are required
  • what accountability exists

AI-driven quality engineering operationalizes those requirements through:

  • validation
  • testing
  • automation
  • monitoring
  • evidence generation

In this sense, AI-driven quality engineering is a technical execution layer of responsible AI governance.

11. Implementation Challenges

11.1 Organizational Silos
AI engineers, QA teams, data scientists, platform engineers, and governance stakeholders often work in separate functions. This fragmentation weakens AI assurance.
11.2 Tooling Gaps
Many organizations have mature CI/CD and automation for software, but not for model evaluation, prompt regression, or fairness monitoring.
11.3 Lack of Shared Metrics
Engineering teams, compliance teams, and business stakeholders often use different definitions of “quality” and “risk.”
11.4 Pace of Model Change
Rapid evolution of AI tooling can outpace governance and quality control maturity.

12. Toward an Enterprise Maturity Model

A maturity model for AI-driven quality engineering may look like this:

  • Level 1: Reactive. Minimal AI testing; defects found late; governance is informal.
  • Level 2: Managed. Basic AI validation exists; controls vary by team.
  • Level 3: Standardized. Enterprise-level AI quality standards, metrics, and release controls are defined.
  • Level 4: Integrated. AI quality engineering is integrated with DevOps, data operations, model governance, and compliance functions.
  • Level 5: Adaptive. Continuous learning, monitoring, and feedback improve both reliability and governance over time.

13. Future Directions

Future work in AI-driven quality engineering should focus on:

  • standardized enterprise AI validation patterns
  • automated fairness and hallucination detection at scale
  • observability frameworks for LLM systems
  • quality benchmarks for regulated use cases
  • integrated quality-governance tooling
  • AI-specific maturity assessment models

14. Conclusion

AI-enabled enterprise systems are changing the meaning of software quality. In regulated domains, quality can no longer be assessed solely through traditional functional testing and automation frameworks. Instead, organizations must adopt AI-driven quality engineering practices that integrate validation, monitoring, governance controls, and operational feedback across the full lifecycle of AI systems.
AI-driven quality engineering is therefore not just an extension of traditional QA. It is a strategic discipline for ensuring that AI systems remain reliable, fair, accountable, and operationally trustworthy in healthcare, insurance, workforce, and other high-stakes enterprise environments.
Organizations that build this capability will be better positioned to deploy AI responsibly while maintaining compliance, resilience, and public trust.

About the Author
Suresh Babu Narra is a technology professional with over 19 years of experience in software engineering, quality assurance, MLOps, AI/ML/LLM validation, and Responsible AI governance. His work focuses on developing validation frameworks and governance practices that improve the reliability, transparency, and accountability of AI-enabled enterprise systems across healthcare, insurance, workforce management, finance, and digital commerce platforms.


LiDAR Depth Maps on the GPU in React Native — And a Dawn Upstream Contribution

In Part 1, I built a zero-copy GPU compute pipeline for camera frames. In Part 2, I got Apple Log HDR working with LUT color grading. This post covers what happened next: getting LiDAR depth data into the same GPU pipeline, the format wars that ensued, and submitting my first upstream patch to Dawn (Google’s WebGPU implementation).

The Goal

iPhone Pro models have LiDAR sensors that produce real-time 320×180 depth maps at 60fps. I wanted to feed that depth data into the same WebGPU compute pipeline as the camera video — one shader that reads both the camera frame and the depth map, producing a depth-colored overlay blended with the live video.

The API I wanted:

const { currentFrame } = useGPUFrameProcessor(camera, {
  resources: {
    depth: GPUResource.cameraDepth(),
  },
  pipeline: (frame, { depth }) => {
    'worklet';
    frame.runShader(DEPTH_COLORMAP_WGSL, { inputs: { depth } });
  },
});

Declare a dynamic depth resource. Bind it as a shader input. The native side handles AVCaptureDepthDataOutput, frame synchronization, and per-frame texture updates. The user writes a WGSL shader and gets LiDAR data.

Getting there took longer than expected.

The Apple Log Color Space Saga

Before depth, I had to fix something from Part 2: Apple Log wasn’t actually Apple Log.

“Why Doesn’t It Look Washed Out?”

Apple Log footage should look flat and desaturated — that’s the whole point of a log encoding. But our output looked identical to standard sRGB. Same contrast, same saturation. The Apple Log → Rec.709 LUT produced oversaturated garbage because it expected flat input.

I spent hours adjusting the YUV→RGB shader math. Full range vs video range. BT.709 vs BT.2020 matrix. Different normalization constants. Nothing changed the look. Then I added CVPixelBuffer attachment logging:

[DawnPipeline] CVPixelBuffer transfer=ITU_R_2100_HLG, matrix=ITU_R_2020, primaries=ITU_R_2020

HLG. Not Apple Log. iOS was delivering HLG-encoded frames despite activeColorSpace = .appleLog on the capture device. The camera was converting Apple Log to HLG before handing frames to AVCaptureVideoDataOutput.

Spying on Blackmagic Camera

I had a theory but no proof. Then I did something useful: opened Console.app, filtered for Blackmagic Camera’s process, and looked at their AVCaptureSession configuration:

VC: <SRC:Wide back x422/3840x2160, ColorSpace:3, ...>

ColorSpace:3 — that’s Apple Log. Our session showed ColorSpace:2 — HLG. Same device, same format, different color space. What were they doing differently?

Two things:

  1. automaticallyConfiguresCaptureDeviceForWideColor = false on the session — this prevents iOS from overriding the color space
  2. autoConfig: 0 in their session configuration

Adding one line fixed it:

session.automaticallyConfiguresCaptureDeviceForWideColor = false

Suddenly the footage was flat, washed out, and beautiful. The LUT worked perfectly. I saved a feedback memory: “Disable automaticallyConfiguresCaptureDeviceForWideColor for Apple Log; session overrides color space otherwise.”

The Format Wars: 4:2:0 vs 4:2:2

But the color space wasn’t the only issue. The camera’s native format was x422 (10-bit 4:2:2 YUV) — half chroma subsampling horizontally, full vertically. We were requesting x420 (4:2:0 — half in both dimensions), forcing AVFoundation to convert. That conversion was likely applying the HLG transform.

Dawn’s multi-planar support has separate feature flags per format:

  • DawnMultiPlanarFormats — enables 8-bit NV12 (R8BG8Biplanar420Unorm)
  • MultiPlanarFormatP010 — enables 10-bit 4:2:0 (R10X6BG10X6Biplanar420Unorm)
  • MultiPlanarFormatP210 — enables 10-bit 4:2:2 (R10X6BG10X6Biplanar422Unorm)

We had P010 enabled but not P210. Requesting the native 4:2:2 format required enabling P210 — and adjusting the UV plane dimensions in the shader (half-width, full-height instead of half-both).

Lesson I saved as permanent feedback: “Verify ALL required Dawn feature flags when adding new GPU format support; Dawn silently returns zeros for unsupported formats.”

Upstream Progress

The multi-planar feature flags (P010, P210) and the getRecorder() API I needed for canvas overlays were both merged into react-native-skia upstream (PR #3753 and PR #3751). The 1×1 throwaway surface hack from Part 2 — creating a dummy offscreen surface just to extract the thread-local Recorder* — is now replaced with a clean ctx.getRecorder() call. Small wins that clean up the codebase.

Adding LiDAR Depth

With Apple Log solid, I moved to depth. The design was straightforward:

  1. When GPUResource.cameraDepth() appears in resources, add AVCaptureDepthDataOutput to the session
  2. Use a separate DepthDelegate that caches the latest depth frame
  3. The video FrameDelegate grabs the cached depth and passes both to processFrame
  4. Import the depth IOSurface as a R16Float texture via Dawn’s SharedTextureMemory
  5. Bind it as a dynamic per-frame input in the depth shader

Each step had its own set of problems.

Problem 1: The LiDAR Device

The LiDAR depth camera is a separate AVCaptureDevice.builtInLiDARDepthCamera. It captures both video and depth, but its video output is 8-bit NV12 (420v) instead of the wide-angle camera’s BGRA. The pipeline needed a new code path: NV12→RGB conversion with BT.709 matrix and video range expansion.
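
For reference, the per-pixel math that conversion performs looks like this in scalar form (TypeScript for illustration; the real path is a WGSL compute shader):

// BT.709 video-range YUV→RGB for 8-bit NV12 samples (y, cb, cr in 0..255).
// Video range means Y occupies 16..235 and chroma 16..240, hence the offsets
// and the 1.164 (= 255/219) expansion factor.
function nv12ToRgb(y: number, cb: number, cr: number): [number, number, number] {
  const yf = 1.164 * (y - 16);
  const r = yf + 1.793 * (cr - 128);
  const g = yf - 0.213 * (cb - 128) - 0.533 * (cr - 128);
  const b = yf + 2.112 * (cb - 128);
  const clamp = (v: number) => Math.min(255, Math.max(0, v));
  return [clamp(r), clamp(g), clamp(b)];
}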

But the shader selection was tied to useDepth, not the actual pixel format. When the JS-side depth resource parsing had a timing issue, useDepth was false even though the LiDAR camera was delivering NV12. The BGRA passthrough shader ran on YUV data — bind group mismatch, silent black output.

The fix was adding a lidarYUV flag set in Swift based on the actual camera device, passed through the ObjC++ bridge to C++, independent of whether depth data is also being captured.

Problem 2: Dawn Doesn’t Know About Depth

This was the big one. Everything worked: frames arrived, the depth CVPixelBuffer had a valid IOSurface, SharedTextureMemory imported it, CreateTexture with R16Float succeeded, BeginAccess succeeded. But textureLoad in the shader returned all zeros.

Sound familiar? Same pattern as Apple Log without the P010 feature. Dawn silently accepts the import but can’t actually read the data.

I fetched Dawn’s source code from dawn.googlesource.com and found GetFormatEquivalentToIOSurfaceFormat() in SharedTextureMemoryMTL.mm. It’s a switch statement mapping CVPixelFormat codes to Dawn TextureFormats. 17 formats are mapped — every video format you’d expect: BGRA, NV12, P010, P210, etc.

kCVPixelFormatType_DepthFloat16 (hdep) is not in the list.

The depth IOSurface has format code 0x68646570. Dawn doesn’t recognize it. It imports the IOSurface handle successfully (it’s still a valid IOSurface), creates the texture object, and even succeeds at BeginAccess. But the underlying Metal texture can’t be properly configured because Dawn doesn’t know the pixel format — so textureLoad returns zeros.

The irony: Dawn already maps kCVPixelFormatType_OneComponent16Half → R16Float. The depth format DepthFloat16 has identical memory layout — single-channel, 16-bit half-float. Only the FourCC tag differs.

The CPU Upload Workaround

For now, I bypass SharedTextureMemory entirely for depth:

// CPU readback + WriteTexture (115KB at 320x180 — negligible)
CVPixelBufferLockBaseAddress(depthBuffer, kCVPixelBufferLock_ReadOnly);
const void* data = CVPixelBufferGetBaseAddress(depthBuffer);
size_t bpr = CVPixelBufferGetBytesPerRow(depthBuffer);

wgpu::TexelCopyTextureInfo dst{};
dst.texture = depthTexture;  // persistent, created once

wgpu::TexelCopyBufferLayout layout{};
layout.bytesPerRow = (uint32_t)bpr;
layout.rowsPerImage = (uint32_t)depthH;

// Copy extent: full 320×180 depth plane, one array layer
wgpu::Extent3D extent = {(uint32_t)depthW, (uint32_t)depthH, 1};

device.GetQueue().WriteTexture(&dst, data, bpr * depthH, &layout, &extent);
CVPixelBufferUnlockBaseAddress(depthBuffer, kCVPixelBufferLock_ReadOnly);

320×180 × 2 bytes = 115KB per frame. At 60fps, that’s 6.9MB/sec of CPU→GPU transfer. On a device that does 25GB/sec bandwidth, this is invisible. The texture is persistent (created once, reused every frame), so there’s no allocation overhead either.

Zero-copy would be nicer. Which brings us to…

The Dawn Upstream Contribution

Four lines of code. That’s all it takes:

case kCVPixelFormatType_DepthFloat16:
    return wgpu::TextureFormat::R16Float;
case kCVPixelFormatType_DepthFloat32:
    return wgpu::TextureFormat::R32Float;
case kCVPixelFormatType_DisparityFloat16:
    return wgpu::TextureFormat::R16Float;
case kCVPixelFormatType_DisparityFloat32:
    return wgpu::TextureFormat::R32Float;

Added to the switch in GetFormatEquivalentToIOSurfaceFormat(), right before the default case. Same file, same pattern as every other format mapping.

Dawn uses Chromium’s Gerrit for code review instead of GitHub PRs. I filed a bug at issues.chromium.org under Blink>WebGPU, then submitted the CL.

The CL is at dawn-review.googlesource.com/c/dawn/+/297995. Four case statements. If accepted, every iOS app using Dawn can zero-copy import LiDAR depth data.

The Depth Colormap Shader

With depth data flowing (via CPU upload for now), the shader is satisfying:

@group(0) @binding(0) var inputTex: texture_2d<f32>;
@group(0) @binding(1) var outputTex: texture_storage_2d<rgba16float, write>;
@group(0) @binding(3) var depthTex: texture_2d<f32>;
@group(0) @binding(4) var depthSampler: sampler;

fn depthColormap(t: f32) -> vec3f {
  let r = clamp(2.0 * t - 0.5, 0.0, 1.0);
  let g = clamp(1.0 - 2.0 * abs(t - 0.5), 0.0, 1.0)
        + clamp(2.0 * t - 1.0, 0.0, 1.0);
  let b = clamp(1.0 - 2.0 * t, 0.0, 1.0);
  return vec3f(r, g, b);
}

@compute @workgroup_size(16, 16)
fn main(@builtin(global_invocation_id) id: vec3u) {
  let outDims = textureDimensions(outputTex);
  if (id.x >= outDims.x || id.y >= outDims.y) { return; }

  let color = textureLoad(inputTex, vec2i(id.xy), 0).rgb;

  // Rotate UV 90° CW — depth texture is landscape, output is portrait
  let rotU = (f32(id.y) + 0.5) / f32(outDims.y);
  let rotV = 1.0 - (f32(id.x) + 0.5) / f32(outDims.x);
  let depth = textureSampleLevel(depthTex, depthSampler, vec2f(rotU, rotV), 0.0).r;

  let t = clamp(depth / 5.0, 0.0, 1.0);  // 0-5m range
  let blended = mix(color, depthColormap(t), 0.6);

  textureStore(outputTex, vec2i(id.xy), vec4f(blended, 1.0));
}

The depth texture is 320×180 (landscape), but the output is 1080×1920 (portrait, after the rotation pass). The UV rotation maps portrait coordinates to landscape depth coordinates. textureSampleLevel with a linear sampler gives free bilinear interpolation, upsampling the 320×180 depth map smoothly to the 1080×1920 output.

The distance normalization (depth / 5.0) sets the colormap range — 5 meters maps to the full blue→green→yellow gradient. LiDAR range is ~5m on iPhone, so this captures the useful range. Closer objects are blue, farther objects yellow.

GPU-Side Portrait Rotation

One thing that helped across the board: moving the landscape→portrait rotation from Skia to the GPU.

Previously, the camera delivered landscape textures (1920×1080), the compute pipeline processed them in landscape, and Skia applied a 90° rotation transform when drawing to the portrait screen. This rotation was expensive — Skia had to do a rotated blit of a 4K F16 texture every frame, costing significant GPU time.

Now the first compute pass (passthrough or YUV→RGB) does the rotation:

// 90° CW: output(x, y) reads from input(y, inH - 1 - x)
let yDims = textureDimensions(yPlaneTex);
let srcCoord = vec2u(id.y, yDims.y - 1u - id.x);

The output texture is portrait-sized (1080×1920). Every downstream shader and Skia sees portrait coordinates. The Skia Canvas draws with an identity transform — no rotation, no scaling, just a straight blit. This matters at high resolutions where the rotated blit was the bottleneck.

It also fixed the onFrame canvas coordinate problem: previously, Skia draws happened in landscape coordinates and then got rotated, making positioning unintuitive. Now (0, 0) is top-left of what the user sees.

What I Learned

Dawn’s format mapping is strict and silent

When Dawn doesn’t recognize an IOSurface pixel format, it doesn’t error — it imports successfully but reads zeros. We hit this three times: once with Apple Log (missing P010), once with 4:2:2 (missing P210), and once with LiDAR depth (missing DepthFloat16 mapping entirely). The symptoms are always the same: “everything succeeds but the output is black.”

Check how other apps configure their sessions

Spying on Blackmagic Camera’s AVCaptureSession via Console.app was the breakthrough for Apple Log. Their ColorSpace:3 vs our ColorSpace:2 immediately showed the misconfiguration. System-level logging is an underused debugging tool.

CPU upload is fine for small data

The instinct to zero-copy everything is strong, but 115KB at 60fps is nothing. The CPU upload for depth data is simpler, more debuggable, and just as fast as the zero-copy path would be. Save the optimization for when it matters — the 8MB camera frame, not the 115KB depth map.

Contributing upstream is surprisingly accessible

Dawn’s Gerrit workflow is different from GitHub but not harder. CLA + clone + hook + push. For a four-line format mapping, the total effort was: file a bug (5 min), clone the repo (2 min), make the edit (1 min), push (1 min). The hardest part was finding GetFormatEquivalentToIOSurfaceFormat() in the source — once found, the fix was obvious.

The Full Pipeline

Here’s what’s running on an iPhone 16 Pro:

Camera (LiDAR, 1920×1080 NV12 420v)
  + Depth (320×180 DepthFloat16, CPU upload to R16Float)
  → Pass 0: NV12→RGB (BT.709, video range) + 90° CW rotation
  → RGBA16Float ping-pong (1080×1920 portrait)
  → Pass 1: Depth colormap (bilinear upsample, blue→green→yellow blend)
  → RGBA16Float output
  → Skia Graphite MakeImageFromTexture → SkImage → Canvas

Or with Apple Log:

Camera (Wide, 1920×1080 10-bit 4:2:2 YUV, Apple Log)
  → Pass 0: YUV→RGB (BT.2020, video range, no clamp) + 90° CW rotation
  → RGBA16Float ping-pong (1080×1920 portrait)
  → Pass 1: Apple Log → Rec.709 LUT (65³ 3D texture)
  → RGBA16Float output
  → Skia Graphite MakeImageFromTexture → SkImage → Canvas

Same pipeline architecture, different first-pass shader. The user writes WGSL, declares resources, and the native side handles format detection, IOSurface import, YUV conversion, and rotation.

What’s Next

  • Pipeline/overlay/onFrame split — separating what goes into recordings vs. what’s display-only
  • Recording — ProRes output with Apple Log preserved, overlays burned in
  • Dawn upstream — waiting on the CL review, hopefully zero-copy depth soon
  • More depth effects — focus peaking, background blur, depth-based segmentation

This is Part 3 of an ongoing series. Part 1 covers the initial spike. Part 2 covers Apple Log HDR and debugging. Code: react-native-webgpu-camera

Google Summer of Code 2026 Is Here: Contribute to Kotlin

The Kotlin Foundation is joining Google Summer of Code (GSoC) 2026! If you are a student or an eligible contributor looking to spend your summer working on a real-world open-source project, this is your chance to make a meaningful impact on the Kotlin ecosystem while also benefiting from the mentorship of experienced engineers.


Why participate?

GSoC is one of the best-known programs for introducing new contributors to open-source development. By joining, you’ll work on a real project, get hands-on guidance from a dedicated mentor, earn a stipend, and come away with production experience. Last year’s contributors shipped work that is already being used by other developers, from a compiler-integrated Kotlin LSP to KMP support for Firebase AI. They wrote about how the program changed their perspective on open-source development – you can read their stories here.

This year’s projects:

  • Kotlin Compiler Fuzzer (Kai) – Build a new, modular fuzzer for the Kotlin compiler from scratch. [Hard, 350 hrs]
  • Swift-to-Kotlin interop (PoC) – Create a proof of concept for importing Swift APIs into Kotlin/Native. [Hard, 350 hrs]
  • Tail call support in Kotlin/Wasm – Design and implement tail call support in the Kotlin/Wasm backend. [Medium, 90 hrs]
  • Kotlin Education landscape report – Research and document how Kotlin is taught worldwide, creating reusable datasets and strategic input. [Medium, 175 hrs]

Browse all project details and contributor guidelines here.

Important dates:

  • March 16 – Contributor application period opens
  • March 31 – Application deadline
  • April 30 – Accepted projects announced
  • May 25 – Coding begins!

How to apply:

  1. Explore the project ideas and pick one that interests you.
  2. Join the #gsoc channel on Kotlin Slack to connect with mentors and ask questions.
  3. Review the contributor guidelines and prepare a code sample.
  4. Submit your application on the GSoC website by March 31.

Questions? You can reach us on Slack or gsoc@kotlinfoundation.org.

Not a student?

If you are an open-source maintainer or contributor but not eligible for GSoC, the Kotlin Ecosystem Mentorship Program may be for you. The current round is already underway, but we are likely to run another one – follow our announcements so you don’t miss it.

Join us this summer; we look forward to your proposals!

From 100x Slower Than Rust to Beating It: The coregex Journey

A few days ago, @kostya27 posted on r/golang (124 upvotes):

“Why is Go’s regex so slow?” Go total time on LangArena: 116.6 seconds. Without two regex tests: 78.5 seconds. Without regex, Go is competitive with Rust/C++. With regex, it’s not even close.

He’s right. And this post is about what we did about it.

Six months ago, I wrote about building coregex — a regex engine for Go that’s 3-3000x faster than stdlib. The benchmarks looked great. Then reality hit.

@kostya, author of LangArena (2,900+ stars on GitHub), tried coregex on his benchmark suite. His verdict:

“I tried coregex, but no luck.”

His numbers told the story:

Benchmark                 Go stdlib   coregex v0.12.8   Rust regex   PCRE2 JIT
LogParser (13 patterns)   22.7s       22.0s             0.2s
Template::Regex           6.5s        7.0s              3.8s         1.0s
We were 100x slower than Rust on LogParser. On the same machine. Same input. Same patterns.

Our “3000x faster than stdlib” claim? True — on many patterns we tested. But we had blind spots we didn’t know about: case-insensitive patterns, dense-digit data, multi-wildcard suffixes. On a real-world benchmark with 13 diverse patterns covering all these cases, we were barely faster than stdlib.

That was March 10, 2026. Here’s what happened in the next week.

The LangArena Challenge

LangArena’s LogParser benchmark parses Apache log files with 13 regex patterns — the kind of patterns you’d find in any log analysis pipeline:

errors:        ` [5][0-9]{2} | [4][0-9]{2} `
bots:          `(?i)bot|crawler|scanner|spider|indexing|crawl`
suspicious:    `(?i)etc/passwd|wp-admin|\.\./`
ips:           `\d+\.\d+\.\d+\.35`
api_calls:     `/api/[^ "]+`
methods:       `(?i)get|post|put`
emails:        `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
...and 6 more

Nothing exotic. These are bread-and-butter patterns that every Go developer uses. And we were 100x slower than Rust on them.

The question wasn’t “can we optimize one pattern?” — it was “can we close a 100x gap across 13 different pattern types?”

Step 1: Learning from the Enemy

Before writing a single line of code, I needed to understand what Rust does differently. Not from reading docs — from running Rust with debug logging.

Rust’s regex crate has RUST_LOG=debug:

$ RUST_LOG=debug ./rust-benchmark input.txt
[regex] prefixes extracted: Seq["EVA", "EVa", "EvA", "Eva", "eVA", ...]
[regex] prefilter built: teddy
[regex] using reverse suffix strategy

Every strategy decision, every prefilter choice, every literal extraction — logged. I could see exactly what Rust did for each pattern.

We had nothing like this. So I built COREGEX_DEBUG:

$ COREGEX_DEBUG=1 ./my-app
[coregex] pattern="(?i:GET|POST|PUT)" strategy=UseTeddy nfa_states=43 literals=40 complete=true
[coregex] prefilter=FatTeddy (AVX2 fat) complete=true

Now I could compare strategy selection side-by-side. And the differences were immediately obvious.

Step 2: The Root Causes

Bug #1: Refusing to extract case-insensitive literals

Pattern: (?iU)\b(eval|system|exec|execute|passthru|shell_exec|phpinfo)\b

A real user (#137) reported this WAF pattern was 88,000x slower than stdlib.

Rust extracts 250 case-fold literal variants:

eval → EVAL, EVAl, EVaL, EVal, EvAL, ... eval  (16 variants)
system → SYSTEM, SYSTEm, SYSTem, ...            (32 variants)

Then trims to 60 three-byte prefixes → Teddy SIMD prefilter → scans 968 bytes in 263 nanoseconds. Done.

Our literal extractor? One line killed everything:

// literal/extractor.go:137
if re.Flags&syntax.FoldCase != 0 {
    return NewSeq()  // Return EMPTY. For ALL case-insensitive patterns.
}

This guard was added for a previous bug (#87) — naive extraction of single-case variants caused prefilter false negatives. The fix was correct for that bug, but the blanket rejection meant zero prefilter for any (?i) pattern. Without prefilter, the engine fell back to lazy DFA, which cache-thrashed on the 181-state NFA.

Fix: Expand (?i) literals into ALL case-fold variants (like Rust), then trim to 3-byte prefixes. One function, ~50 lines.
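
Conceptually the expansion is a cross product over per-character case variants. An illustrative TypeScript sketch (the actual implementation is Go):

// Expand an ASCII literal into all case-fold variants: "eval" → 16 strings.
function caseFoldVariants(literal: string): string[] {
  let variants: string[] = [''];
  for (const ch of literal) {
    const forms = ch.toLowerCase() === ch.toUpperCase()
      ? [ch]                                  // non-letter: single form
      : [ch.toLowerCase(), ch.toUpperCase()]; // letter: two forms
    variants = variants.flatMap((prefix) => forms.map((f) => prefix + f));
  }
  return variants;
}

caseFoldVariants('eval').length; // 16 = 2^4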

Result: 88,000x slower → 24x faster than stdlib.

Bug #2: isMatchDigitPrefilter was O(n²)

Pattern: \d{3}-\d{3}-\d{4} (phone numbers)

On 6MB of log data: 7 minutes per Match() call. Stdlib: 262ms.

Root cause: isMatchDigitPrefilter used dfa.FindAt() (unanchored search) which scans from each digit position to end of input:

// Before (O(n²)):
endPos := e.dfa.FindAt(haystack, digitPos)  // Scans to EOF!

// After (O(pattern_len)):
endPos := e.dfa.SearchAtAnchored(haystack, digitPos)  // Checks only at position

One function call change. 7 minutes → 2.1ms. 200,000x faster.

The same pattern was already fixed in findIndicesDigitPrefilter months ago — but isMatchDigitPrefilter was never updated. Copy-paste divergence.

Bug #3: ReverseSuffix rejected multi-wildcard patterns

Pattern: \d+\.\d+\.\d+\.35 (IP address suffix)

This pattern has a clear suffix: .35. Rust finds it instantly with memmem, then reverse-scans for the start. Our isSafeForReverseSuffix rejected it because it had 3 wildcard subexpressions (\d+):

if wildcardCount >= 2 {
    return false  // "multiple wildcards break reverse NFA"
}

The guard existed because our reverse NFA builder had a bug with mixed byte+epsilon states. That bug was fixed in v0.12.9. But the guard stayed.

Fix: Remove the guard. Also fix Find() leftmost semantics — bytes.LastIndex → bytes.Index for non-.* patterns.

Result: 57ms → 0.63ms (603x faster, 1.6x faster than Rust!)
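
The whole strategy fits in a few dozen lines. Here’s a hedged, hand-rolled sketch for this exact pattern; the real engine drives a reverse DFA where reverseScanStart below matches digits and dots by hand:

package main

import (
    "bytes"
    "fmt"
)

// findIPSuffix sketches ReverseSuffix for \d+\.\d+\.\d+\.35: memmem-style
// search for the literal suffix ".35" via bytes.Index (leftmost hit, per
// the Find() fix), then a backward scan for the match start.
func findIPSuffix(haystack []byte) (start, end int, ok bool) {
    suffix := []byte(".35")
    for pos := 0; pos < len(haystack); {
        i := bytes.Index(haystack[pos:], suffix)
        if i < 0 {
            return 0, 0, false
        }
        hit := pos + i
        if s, matched := reverseScanStart(haystack, hit); matched {
            return s, hit + len(suffix), true
        }
        pos = hit + 1 // suffix hit didn't verify; keep scanning forward
    }
    return 0, 0, false
}

// reverseScanStart consumes \d+\.\d+\.\d+ backwards from the suffix position.
func reverseScanStart(b []byte, suffixPos int) (int, bool) {
    i := suffixPos
    for group := 0; group < 3; group++ {
        digits := 0
        for i > 0 && '0' <= b[i-1] && b[i-1] <= '9' {
            i--
            digits++
        }
        if digits == 0 {
            return 0, false
        }
        if group < 2 { // the two interior dots
            if i == 0 || b[i-1] != '.' {
                return 0, false
            }
            i--
        }
    }
    return i, true
}

func main() {
    start, end, ok := findIPSuffix([]byte("src=10.0.1.35 dst=10.0.2.7"))
    fmt.Println(ok, start, end) // true 4 13
}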

Bug #4: FatTeddy AVX2 missed matches

Pattern: (?i)get|post|put (40 case-fold expanded literals)

FatTeddy (33-64 pattern SIMD search) found only 11,456 matches. Correct answer: 34,368.

Root cause: One assembly instruction.

FatTeddy uses 256-bit AVX2 registers with two 128-bit lanes. Low lane handles buckets 0-7, high lane handles buckets 8-15. The code used ANDL to combine lane results — requiring a match in both lanes. But GET variants (8 patterns) were all in buckets 0-7 (low lane only), PUT variants in buckets 8-15 (high lane only). ANDL zeroed them out.

; Before (incorrect):
ANDL CX, AX          ; Requires BOTH lanes to match

; After (correct):
ORL  CX, AX          ; Either lane is sufficient

One instruction. 22,912 missing matches fixed.
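
A scalar toy makes the failure mode obvious. The real code combines 128-bit lane bitmasks inside AVX2 registers, but the boolean logic is identical:

package main

import "fmt"

func main() {
    // Each Teddy lane reports candidates as a bitmask over the input window.
    // GET variants hash into buckets 0-7 (low lane only), PUT variants into
    // buckets 8-15 (high lane only), so AND-combining the lanes drops them.
    low := uint32(0b0100)  // low lane saw a candidate at offset 2
    high := uint32(0b0000) // high lane saw nothing at that offset

    fmt.Println(low&high != 0) // false: ANDL semantics drop the match (the bug)
    fmt.Println(low|high != 0) // true:  ORL semantics keep it (the fix)
}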

Step 3: Building What Rust Has

Beyond bug fixes, we needed architectural improvements to match Rust’s approach:

Bidirectional DFA

Previously, UseDFA patterns did: forward DFA → match end, then PikeVM → exact boundaries. PikeVM is O(n×states) — a second full scan.

Now: forward DFA → end, reverse DFA → start, anchored DFA → exact end. Three O(n) passes instead of one O(n×states) pass.

Cascading Prefix Trim

When case-fold expansion produces too many literals (>64), we trim them using Rust’s approach:

128 six-byte literals → try keep 4 bytes → 18 unique → fits Teddy!

This is directly from Rust’s optimize_for_prefix_by_preference() with their ATTEMPTS table: [(4,64), (3,64), (2,64)].
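
In Go, the cascade looks roughly like this (a sketch assuming the budget is Teddy’s 64-pattern limit, per the ATTEMPTS table above; not the exact library code):

// trimToPrefixes shortens literals and deduplicates until the set fits the
// prefilter budget, trying 4-, then 3-, then 2-byte prefixes.
func trimToPrefixes(lits []string) []string {
    for _, attempt := range [...]struct{ keep, budget int }{{4, 64}, {3, 64}, {2, 64}} {
        seen := make(map[string]struct{})
        trimmed := make([]string, 0, len(lits))
        for _, l := range lits {
            if len(l) > attempt.keep {
                l = l[:attempt.keep]
            }
            if _, dup := seen[l]; !dup {
                seen[l] = struct{}{}
                trimmed = append(trimmed, l)
            }
        }
        if len(trimmed) <= attempt.budget {
            return trimmed // e.g. 128 six-byte literals -> 18 unique 4-byte prefixes
        }
    }
    return nil // no workable prefix set; caller falls back to another strategy
}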

Aho-Corasick DFA Backend

Our Aho-Corasick library got a complete DFA backend rewrite:

  • Flat transition table with premultiplied state IDs
  • Match flag in high bit (single AND instruction for detection)
  • SIMD skip-ahead prefilter via bytes.IndexByte

Result: 300 MB/s → 3,400 MB/s (Find), 260 MB/s → 5,900 MB/s (IsMatch). 11-22x throughput improvement.
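
The hot loop those bullets describe can be sketched as follows. The layout is simplified (the real table also encodes failure transitions, and the skip-ahead assumes a single candidate first byte):

package ahocorasick // illustrative sketch, not the library's actual code

import "bytes"

const matchBit = 1 << 31 // match flag lives in the state ID's high bit

// step advances the flat-table DFA by one byte. State IDs are premultiplied
// by 256, so the transition is a single indexed load with no multiply, and
// match detection is a single AND against the high bit.
func step(table []uint32, state uint32, b byte) (next uint32, isMatch bool) {
    next = table[(state&^matchBit)+uint32(b)]
    return next, next&matchBit != 0
}

// scan runs the DFA over the haystack, using bytes.IndexByte as a SIMD
// skip-ahead while no prefix has been matched yet.
func scan(table []uint32, root uint32, first byte, haystack []byte) bool {
    state := root
    for i := 0; i < len(haystack); i++ {
        if state == root {
            j := bytes.IndexByte(haystack[i:], first) // vectorized skip
            if j < 0 {
                return false
            }
            i += j
        }
        var hit bool
        if state, hit = step(table, state, haystack[i]); hit {
            return true
        }
    }
    return false
}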

The Results

Benchmark: 8 Real-World Patterns on 6.3 MB Input

100 iterations each, best of 5, same machine (i7-1255U):

Pattern               Go stdlib   coregex v0.12.13   Rust regex   vs stdlib   vs Rust
.*@example\.com       420 ms      3.3 ms             7.2 ms       126x        2.2x faster
.*\.(txt|log|md)      426 ms      1.0 ms             1.8 ms       425x        1.8x faster
email validation      447 ms      3.4 ms             3.8 ms       132x        1.1x faster
\d+\.\d+\.\d+\.35     381 ms      0.63 ms            0.98 ms      603x        1.6x faster
(?i)get|post|put      561 ms      16.6 ms            7.0 ms       34x         2.4x slower
(?i)bot|crawler|...   883 ms      38.4 ms            6.7 ms       23x         5.7x slower
password=[^&\s"]+     24 ms       8.9 ms             2.9 ms       3x          3.1x slower
session[_-]?id=...    8 ms        2.7 ms             1.2 ms       3x          2.3x slower

4 out of 8 patterns are faster than Rust. All 8 are faster than Go stdlib.

@kostya’s Update

Remember “no luck”? Here’s the progression on his M1 MacBook:

Version              LogParser   Gap to Rust
v0.12.8 (start)      22.0s       100x
v0.12.9              5.3s        26x
v0.12.10             2.67s       13x
v0.12.13 (current)   2.12s       10x

From 100x slower to 10x. Not parity yet — but a different conversation than “no luck.”

Why Not Just Use CGO?

Every other Go regex alternative uses CGO or Wasm:

  • go-re2: C++ RE2 via Wasm (wazero)
  • regexp2: Backtracking (.NET-style) — no O(n) guarantee
  • rubex: Oniguruma via CGO
  • go-pcre: PCRE via CGO

coregex is pure Go + Go assembly. No CGO, no Wasm, no external dependencies.

Why does this matter?

  • Cross-compilation: GOOS=linux GOARCH=arm64 go build just works
  • Static binaries: No shared libraries to ship
  • Go toolchain: go vet, go test -race, pprof all work
  • Debugging: Standard Go debugging, no FFI boundary
  • Security: No C memory safety issues in regex hot paths

A performance gap to CGO-based JIT solutions (PCRE2 JIT) remains — JIT compiles the regex to native machine code, achieving 1.0s where we take 7.1s on Template::Regex. But that’s an architectural tier boundary — we’re competing within the automata-based class (like RE2 and Rust regex), not against JIT engines.

What We Learned

1. Debug logging is not optional

Building COREGEX_DEBUG was the single most impactful decision. Without it, every optimization was guesswork. With it, we could see exactly why a pattern was slow and verify our fix matched Rust’s approach.

If you’re building any kind of engine — regex, query planner, compiler — add strategy logging from day one.

2. One instruction can hide 23,000 matches

The FatTeddy ANDL → ORL fix taught us that SIMD code correctness is binary. Not “mostly correct” or “works for some patterns.” If your lane combining logic is wrong, you silently drop matches. No error, no panic — just wrong results.

Always verify match counts against stdlib. On every pattern. On every change.

3. Benchmarks lie — until they don’t

Our “3000x faster” headline was true for .*error.* patterns. But @kostya’s LangArena showed the full picture: on diverse real-world patterns, we were barely faster than stdlib.

Real benchmarks use real patterns from real users. We now run regex-bench CI on every PR — 16 core patterns + 13 LangArena patterns, compared against both stdlib and Rust regex, on Linux AMD EPYC and macOS Apple Silicon.

4. Guard clauses outlive their bugs

Three of our four major bugs were caused by guards that stayed after the underlying bug was fixed. FoldCase rejection, wildcardCount >= 2, unanchored FindAt — all were correct when added. All became performance killers months later when the original bugs were resolved.

Track why a guard exists. Remove it when the reason is gone.

5. Go ASM is production-viable for SIMD

We wrote ~500 lines of AVX2/SSSE3 assembly for Teddy multi-pattern search. It works. FatTeddy throughput: 12 GB/s on single-call scans (2x faster than SlimTeddy SSSE3!).

The challenge isn’t writing the ASM — it’s the Go→ASM function call boundary. Each call costs ~60ns + mask reload. For high-match-count patterns, this adds up. Our batch API (64KB chunks) reduces round-trips, but the integrated prefilter+DFA loop that Rust uses remains the gold standard.
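
The batch API amortizes that boundary cost roughly like this. A hedged sketch: teddyScanChunk is a hypothetical stand-in for the ASM entry point, and a real implementation must also overlap chunks so matches spanning a boundary aren’t lost:

const chunkSize = 64 << 10 // 64KB per assembly call

// teddyScanChunk stands in for the AVX2 entry point (assembly in the real
// library); it returns chunk-relative candidate positions.
func teddyScanChunk(chunk []byte) []int { return nil }

// scanBatched makes one Go->ASM call per 64KB chunk instead of one per
// candidate, amortizing the ~60ns call and mask-reload overhead.
func scanBatched(haystack []byte, emit func(pos int)) {
    for off := 0; off < len(haystack); off += chunkSize {
        end := off + chunkSize
        if end > len(haystack) {
            end = len(haystack)
        }
        for _, p := range teddyScanChunk(haystack[off:end]) {
            emit(off + p) // translate chunk-relative hits to absolute offsets
        }
    }
}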

Current State: v0.12.13

97,000 lines of code. 17 strategies. 1,470 tests. 5 releases in one week.

go get github.com/coregx/coregex@v0.12.13

Drop-in replacement:

It’s a true drop-in replacement for Go’s regexp package — same API, same types (Regexp is aliased), same method signatures:

import "github.com/coregx/coregex"  // instead of "regexp"

re := coregex.MustCompile(`(?i)get|post|put`)
matches := re.FindAllString(data, -1)  // Same API, faster execution

In most cases, changing the import path is all you need.

Debug your patterns:

COREGEX_DEBUG=1 ./your-app
# [coregex] pattern="(?i:GET|P(?:OST|UT))" strategy=UseTeddy nfa_states=43 literals=40 complete=true
# [coregex] prefilter=FatTeddy (AVX2 fat) complete=true

What’s Still Slower Than Rust

Honesty matters. Here’s where we’re still behind:

Gap                      Root cause                                                                       Status
(?i) patterns: 2-6x      FatTeddy ORL creates more false positives than Rust’s interleave verification    Researched, needs ASM rewrite
DFA verification: 3-7x   Go→ASM round-trip overhead, no integrated prefilter+DFA loop                     Architectural
Template::Regex: 1.8x    Two-phase DFA+PikeVM vs Rust’s single-phase lazy DFA                             Planned
ARM: 5-15x vs Rust       No SIMD prefilters on ARM (Teddy/memchr are x86-only)                            Waiting for Go NEON support

We’re not hiding these gaps. They’re tracked, researched, and planned. The goal is Rust parity on all pattern types — we’re not there yet on (?i) and DFA-heavy patterns.

Community Testing Matters — A Lot

A multi-engine regex library is inherently complex. 17 strategies, SIMD assembly, lazy DFA, reverse search, prefilter cascading — every combination of pattern shape × input data × strategy is a potential edge case. No amount of internal testing can cover what real users discover in minutes.

Every major fix in this article came from community feedback:

  • @kostya’s LangArena exposed the 100x gap we didn’t know about
  • tjbrains’ WAF pattern (#137) revealed the 88,000x regression in case-insensitive matching
  • GoAWK integration uncovered 15+ Unicode edge cases months earlier

The pattern is consistent: someone runs coregex on their specific workload, finds a pattern type we haven’t optimized yet, reports it — and we fix it in hours, not weeks. The FatTeddy lane bug? Fixed same day. The DigitPrefilter O(n²)? Fixed in one line. Case-insensitive literal extraction? Researched Rust’s approach, implemented, released — all within 24 hours.

There are likely more patterns that aren’t optimized yet. That’s the nature of a 17-strategy engine — some strategy paths get less testing than others. But the architecture is sound, the fix turnaround is fast, and every report makes the library better for everyone.

We proposed coregex for Go’s standard library. It wasn’t accepted — and honestly, that’s okay. As an independent library, we can iterate faster, ship SIMD assembly that the Go team wouldn’t merge, and make decisions optimized for performance rather than compatibility. The Go ecosystem is better with options.

Don’t hesitate to contribute. File issues with your patterns and inputs. Even a simple “this pattern is slower than stdlib” report helps — it tells us which strategy path needs work. The more diverse patterns we see, the fewer blind spots remain.

Pull requests are especially welcome. We know that a healthy open source project is built by its community, and we value every contributor. Don’t worry if your PR isn’t perfect — we’ll review the code, help you fix any issues, guide you through our conventions, and explain what’s needed to get it merged. Whether it’s a new test case, a documentation fix, a strategy optimization, or a bug report with a reproducer — every contribution counts and every contributor gets credited.

Try It

If regex is a bottleneck in your Go application:

  1. Profile first — make sure regex is actually the problem
  2. Benchmark your specific patterns — performance varies by pattern type
  3. Check match counts – coregex.FindAll() must match regexp.FindAll() exactly
  4. Report issues — we fixed #137 (88,000x regression) within 24 hours

# Quick benchmark
go get github.com/coregx/coregex@v0.12.13
COREGEX_DEBUG=1 go test -bench=. -benchmem your-package

Links:

  • GitHub
  • Aho-Corasick library
  • Cross-language benchmarks
  • LangArena

The most humbling moment? Seeing ANDL CX, AX in our FatTeddy ASM and realizing one wrong instruction had been silently dropping 23,000 matches. The most satisfying? Seeing coregex 1.6x faster than Rust on the IP pattern that started this whole journey.

Built by @kolkov as part of CoreGX — production Go libraries.

UI Freezes and the Dangers of Non-Cancellable Read Actions in Background Threads

In JetBrains IDEs, UI freezes are often blamed on “heavy work on the EDT,” but our recent investigations show another common culprit in plugins: long, non-cancellable read actions running in background threads.

We receive a lot of freeze reports via our automated exception reporting system, and many of them point to plugins containing this single erroneous pattern. Let’s highlight the problem arising from non-cancellable code and figure out how to fix it.

A real-world example

Let’s look at the following stack traces that come from a Package Checker plugin freeze. Note how a background thread “DefaultDispatcher-worker-27” ends up executing ReadAction.compute:

"AWT-EventQueue-0" prio=0 tid=0x0 nid=0x0 waiting on condition
     java.lang.Thread.State: TIMED_WAITING
 on com.intellij.openapi.progress.util.EternalEventStealer@1356e599
    at java.base@21.0.8/java.lang.Object.wait0(Native Method)
    at java.base@21.0.8/java.lang.Object.wait(Object.java:366)
    at com.intellij.openapi.progress.util.EternalEventStealer.dispatchAllEventsForTimeout(SuvorovProgress.kt:261)
    at com.intellij.openapi.progress.util.SuvorovProgress.processInvocationEventsWithoutDialog(SuvorovProgress.kt:125)
    at com.intellij.openapi.progress.util.SuvorovProgress.dispatchEventsUntilComputationCompletes(SuvorovProgress.kt:73)
    at com.intellij.openapi.application.impl.ApplicationImpl.lambda$postInit$14(ApplicationImpl.java:1434)
    at com.intellij.openapi.application.impl.ApplicationImpl$$Lambda/0x000001fafa58b530.invoke(Unknown Source)
    at com.intellij.platform.locking.impl.RunSuspend.await(NestedLocksThreadingSupport.kt:1517)
...
    at com.intellij.openapi.application.impl.ApplicationImpl.runWriteAction(ApplicationImpl.java:1106)
    at com.intellij.psi.impl.PsiManagerImpl.dropPsiCaches(PsiManagerImpl.java:108)
    at com.intellij.lang.typescript.compiler.TypeScriptServiceRestarter.restartServices$lambda$1(TypeScriptServiceRestarter.kt:24)
    at com.intellij.lang.typescript.compiler.TypeScriptServiceRestarter$$Lambda/0x000001fafcf4a780.run(Unknown Source)
    at com.intellij.openapi.application.TransactionGuardImpl.runWithWritingAllowed(TransactionGuardImpl.java:240)
    at com.intellij.openapi.application.TransactionGuardImpl.access$100(TransactionGuardImpl.java:26)
...
    at com.intellij.ide.IdeEventQueueKt.performActivity(IdeEventQueue.kt:974)
    at com.intellij.ide.IdeEventQueue.dispatchEvent$lambda$12(IdeEventQueue.kt:307)
    at com.intellij.ide.IdeEventQueue$$Lambda/0x000001fafa7f0468.run(Unknown Source)
    at com.intellij.ide.IdeEventQueue.dispatchEvent(IdeEventQueue.kt:347)

"DefaultDispatcher-worker-27" prio=0 tid=0x0 nid=0x0 runnable
     java.lang.Thread.State: RUNNABLE
 (in native)
    at java.base@21.0.8/java.io.WinNTFileSystem.getBooleanAttributes0(Native Method)
    at java.base@21.0.8/java.io.WinNTFileSystem.getBooleanAttributes(WinNTFileSystem.java:479)
    at java.base@21.0.8/java.io.FileSystem.hasBooleanAttributes(FileSystem.java:125)
    at java.base@21.0.8/java.io.File.isDirectory(File.java:878)
    at com.intellij.lang.javascript.library.JSCorePredefinedLibrariesProvider.getLibFilesByIO(JSCorePredefinedLibrariesProvider.java:265)
    at com.intellij.lang.typescript.library.TypeScriptCustomServiceLibrariesRootsProvider.getAdditionalProjectLibraries$lambda$0(TypeScriptCustomServiceLibrariesRootsProvider.kt:22)
...    
    at com.intellij.psi.impl.source.PsiFileImpl.isValid(PsiFileImpl.java:177)
    at com.intellij.packageChecker.javascript.NpmProjectDependenciesModel.declaredDependencies$lambda$15$lambda$14(NpmProjectDependenciesModel.kt:165)
    at com.intellij.packageChecker.javascript.NpmProjectDependenciesModel$$Lambda/0x000001fafd722aa0.compute(Unknown Source)
    at com.intellij.openapi.application.impl.AppImplKt$rethrowCheckedExceptions$2.invoke(appImpl.kt:106)
    at com.intellij.platform.locking.impl.NestedLocksThreadingSupport.runReadAction(NestedLocksThreadingSupport.kt:784)
    at com.intellij.openapi.application.impl.ApplicationImpl.runReadAction(ApplicationImpl.java:1043)
    at com.intellij.openapi.application.ReadAction.compute(ReadAction.java:66)
    at com.intellij.packageChecker.javascript.NpmProjectDependenciesModel.declaredDependencies(NpmProjectDependenciesModel.kt:164)
    at com.intellij.packageChecker.javascript.NpmProjectDependenciesModel.declaredDependencies(NpmProjectDependenciesModel.kt:185)
    at com.intellij.packageChecker.model.ProjectDependenciesModelSimplified.declaredDependencies$lambda$1(ProjectDependenciesModelSimplified.kt:29)
    at com.intellij.packageChecker.model.ProjectDependenciesModelSimplified$$Lambda/0x000001fafd8557e0.invoke(Unknown Source)
...

“AWT-EventQueue-0” asks for the write lock – PsiManagerImpl.dropPsiCaches -> ApplicationImpl.runWriteAction

“DefaultDispatcher-worker-27” holds the read lock and does not react to the write action attempt; there is no progress indicator on the screen, so users cannot cancel it.

Let’s look at the code that causes this freeze (simplified):

fun declaredDependencies(project: ProjectSnapshot): List<Package> {
    return project.modules.asSequence()
      .flatMap { module ->
          ReadAction.compute<List<Package>, Throwable> {
            declaredDependencies(module)
          }
        }
      .toList()
}

Even though this code is not on the Event Dispatch Thread (EDT), the non-cancellable read action blocks write actions, freezing the UI until it completes. Non-cancellable here means that the read action block must be executed completely and cannot be interrupted by write actions or the user.

The core problem

Note that the following APIs are not cancellable by default:

  • ReadAction.compute { … }
  • Application.runReadAction { … }
  • runReadAction { … }

When such a read action runs for a long time in a background thread, it can block write actions (PSI changes, workspace model updates, editor changes). Since many write actions are initiated by the UI thread, the result is a UI freeze, even though the work itself is “in the background”.

Background threads holding read locks for too long prevent the platform from progressing.

Why is this dangerous?

  • Background doesn’t mean safe: Read locks affect the entire platform.
  • Long reads starve write actions.
  • Write actions are required for UI responsiveness.
  • The platform cannot cancel these reads.

How does this work under the hood? The IntelliJ Platform uses a reader-writer lock where multiple read actions can run concurrently but write actions require exclusive access. When a write action is requested, it must wait for all active read actions to complete before it can proceed. A non-cancellable read action holds its lock until finished, and the platform has no way to interrupt it. If that read action takes seconds, every pending write action and the UI thread are blocked for the entire duration.
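
The dynamic is the same as with any reader-writer lock. Here’s a minimal Go sketch of the starvation pattern (illustrative only, not IntelliJ Platform code):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var lock sync.RWMutex

    // Background thread: a long, non-cancellable "read action".
    go func() {
        lock.RLock()
        defer lock.RUnlock()
        time.Sleep(2 * time.Second) // holds the read lock the whole time
    }()

    time.Sleep(10 * time.Millisecond) // let the reader grab the lock first

    // "UI thread": a write action must wait for every active reader.
    start := time.Now()
    lock.Lock() // blocks for ~2 seconds; in an IDE this is the visible freeze
    lock.Unlock()
    fmt.Println("write action waited:", time.Since(start).Round(time.Millisecond))
}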

In short, a long read action can freeze everything.

What you should do instead

Avoid ReadAction.compute (and similar APIs) for long work in background threads. 

Use it only when:

  • The read action is very short, or
  • It runs with a modal, cancellable progress.

Recommended alternatives:

  • Use cancellable read actions: in coroutine code, readAction { … } or smartReadAction { … }; in Java code without coroutines, ReadAction.nonBlocking { … }.submit(…) or ReadAction.nonBlocking { … }.executeSynchronously().
  • For blocking, non-coroutine code, split work into small, predictable chunks and run them under ProgressManager.run(Task.Backgroundable) { … } with async progress and short read actions.
  • Periodically check for cancellation with ProgressManager.checkCanceled().
  • In advanced use cases, such as inlays or highlighting passes, use ReadAction.computeCancellable { … }, which makes only a single attempt and does not restart when a write action interrupts it.

If your code touches the PSI, project model, or indexes and runs for a long time, it must be cancellable, or it will eventually freeze the UI. Following this rule is one of the most effective ways plugin authors can keep JetBrains IDEs fast and responsive.

Finally, code must not make network calls under read or write actions, for a related reason: network latency is unpredictable, and such calls do not participate in the cooperative cancellation mechanism (they never check cancellation via ProgressManager), so they cannot be cancelled easily.

Let’s see how to access the model from background processing in a cancellable manner. Please note that both readAction and ReadAction.nonBlocking are intended for idempotent computations, meaning ones that produce the same result when run multiple times. This is required because cancellable read actions are cancelled and restarted whenever a write action is pending.

  1. Using the coroutine Read Action API in suspend context
  suspend fun processOnBackground(virtualFile: VirtualFile, project: Project) {
    val methodNames = readAction {
      if (!virtualFile.isValid()) return@readAction null // validity checks first in read action!

      // inside a read action access the model
      val psiFile = PsiManager.getInstance(project).findFile(virtualFile)
      if (psiFile == null) return@readAction null

      // Compute something expensive — e.g., collect all method names
      return@readAction PsiTreeUtil.findChildrenOfType(psiFile, PsiMethod::class.java)
        .map { it.name }
    }

    // continue processing on background without locks
  }
  2. Using Java and the ReadAction.nonBlocking API
  @RequiresBackgroundThread
  public void processOnBackground(@NotNull VirtualFile virtualFile, @NotNull Project project) {
    var methodNames = ReadAction.nonBlocking(() -> {
        if (!virtualFile.isValid()) return null; // validity checks first!

        // inside a read action access the model
        PsiFile psiFile = PsiManager.getInstance(project).findFile(virtualFile);
        if (psiFile == null) return null;

        // Compute something expensive — e.g., collect all method names
        return ContainerUtil.map(PsiTreeUtil.findChildrenOfType(psiFile, PsiMethod.class), PsiMethod::getName);
      })
      .expireWith(project)
      .executeSynchronously(); // use submit() for callback-style code


    // continue processing on background without locks
  }

Note on API deprecation: As of the 2026.1 versions of our IDEs, we are deprecating runReadAction and ReadAction.compute in favor of the more explicit runReadActionBlocking and ReadAction.computeBlocking. Instead of just replacing usages, consider ReadAction.nonBlocking for background processing without suspend or readAction for suspend contexts.

How to analyze freezes

You can easily analyze thread dumps using the built-in Analyze Stacktrace or Thread Dump action with Search Everywhere (Shift-Shift shortcut). The only thing you need is a full dump text.

Here, for instance, the reason is detected clearly as:

> Long read action in com.intellij.packageChecker.javascript.NpmProjectDependenciesModel.declaredDependencies$lambda$15$lambda$14

We strongly recommend reworking such code paths in plugins to fix UI freezes. Given the number of reports and the customer impact, this is not a theoretical issue but a problem actively affecting users. Such improvements will visibly improve the perceived performance of JetBrains IDEs.

Join us live on March 19 at 3:00 PM UTC for an in-depth session with me and Patrick Scheibe, and learn how to eliminate UI freezes in your JetBrains IDE plugins. We will figure out how to build cancellable, freeze-safe plugin code that keeps IDEs fast and responsive. Stay until the end to get your questions answered in a live Q&A with the experts.

CloudBees vs TeamCity: Enterprise CI/CD Beyond Jenkins

Many organizations adopt Jenkins because it’s flexible and widely supported. Over time, however, that flexibility often turns into operational overhead: maintaining plugins, debugging pipelines, and coordinating upgrades across teams.

CloudBees CI and JetBrains TeamCity represent two different ways of addressing this problem.

CloudBees CI builds on Jenkins, adding enterprise-grade governance, centralized management, and commercial support. It’s a natural step for organizations that want to keep their existing Jenkins investments while improving control and scalability.

TeamCity takes a different approach. Instead of extending Jenkins, it provides a CI/CD platform with most capabilities built in, from pipeline modeling to test reporting, reducing reliance on plugins and simplifying long-term maintenance.

This comparison focuses on how these platforms differ in practice for enterprise teams.

Platform foundations

                      CloudBees CI                          TeamCity
Architecture          Jenkins-based (managed controllers)   Purpose-built CI/CD platform
Configuration model   Jenkinsfile (Groovy) + UI             Kotlin DSL + UI
Plugin dependency     High                                  Low
Governance            Centralized enterprise controls       Built-in roles and project-level permissions

CloudBees CI extends Jenkins by introducing features such as centralized controller management, role-based access control (RBAC), and pipeline governance. It allows teams to continue using Jenkins pipelines while improving visibility and compliance.

TeamCity provides many of these capabilities out of the box. Build chains, artifact handling, test reporting, and pipeline configuration are native features rather than plugin-based extensions. This leads to more predictable behavior and fewer compatibility issues over time.

Setup and configuration

CloudBees CI is typically deployed as a set of managed Jenkins controllers. While this model enables isolation and scaling, it also inherits Jenkins complexity: plugin selection, version compatibility, and pipeline scripting remain ongoing concerns.

TeamCity uses a server-and-agent architecture. Initial setup is straightforward, and most teams can start running builds without assembling a plugin stack. Configuration can be managed via the UI or defined programmatically using Kotlin DSL, which allows pipelines to be versioned and reviewed like application code.

Visual editor in TeamCity

Key difference:

  • CloudBees preserves Jenkins flexibility, along with its configuration complexity
  • TeamCity emphasizes consistency and predictability through built-in features and typed configuration

Pipeline modeling and developer experience

In CloudBees CI, pipelines are defined using Jenkinsfile (Groovy). This provides flexibility but can become difficult to maintain as pipelines grow in size and complexity. Debugging pipeline logic and ensuring consistency across teams often requires additional tooling and governance.

TeamCity models pipelines through build chains, which explicitly define dependencies between build steps. This makes pipeline structure visible and easier to reason about. Using Kotlin DSL, teams can define pipelines in a statically typed language with IDE support, improving maintainability and reducing errors.

Example of a build chain in TeamCity

What this means in practice:

  • Jenkins-based pipelines prioritize flexibility but can become harder to manage at scale.
  • TeamCity pipelines are easier to standardize and review across teams.

Scalability and infrastructure

CloudBees CI is often deployed in large, distributed environments and is commonly used with Kubernetes for dynamic agent provisioning. Its multi-controller architecture allows organizations to isolate workloads and scale horizontally.

TeamCity scales through distributed build agents and agent pools. It supports dynamic agent provisioning in cloud environments and allows teams to control resource allocation through queues and priorities.

Both platforms can scale to enterprise workloads, but the operational model differs:

  • CloudBees requires managing multiple Jenkins controllers and their lifecycle
  • TeamCity centralizes orchestration while distributing execution across agents

Integration ecosystem

CloudBees benefits from the extensive Jenkins plugin ecosystem, which covers a wide range of tools and integrations. This flexibility is a major advantage, especially for organizations with highly customized workflows.

However, plugins also introduce variability. They are developed independently, may have inconsistent quality, and can break during upgrades.

TeamCity includes many commonly needed integrations and CI features natively, reducing the need for external plugins. Additional integrations are available, but the platform does not rely on them for core functionality.

Trade-off:

  • CloudBees: maximum flexibility, higher maintenance risk
  • TeamCity: fewer moving parts, more predictable behavior

Security and governance

CloudBees CI provides strong enterprise governance features, including RBAC, policy enforcement, and centralized visibility across controllers. These capabilities are designed for organizations with strict compliance requirements.

TeamCity also offers enterprise-grade security, including role-based permissions, project-level access control, audit logs, secure parameter handling, and integration with external authentication providers.

The key difference is not capability but implementation:

  • CloudBees layers governance on top of Jenkins
  • TeamCity includes governance as part of the core system

Maintenance and operational overhead

One of the most important differences between the two platforms is how much effort is required to keep the system running.

In CloudBees CI, teams still need to manage Jenkins plugins, pipeline scripts, and controller upgrades. While CloudBees adds tooling to simplify this, the underlying complexity remains.

TeamCity reduces this overhead by providing built-in functionality for most CI/CD needs. Fewer plugins mean fewer compatibility issues and less time spent troubleshooting pipeline failures caused by environment drift.

For many enterprises, this translates directly into engineering time saved.

Pricing considerations

CloudBees CI pricing is typically customized for enterprise deployments and depends on factors such as scale, infrastructure, and support requirements.

TeamCity offers both self-managed and cloud options, with pricing based on usage (such as the number of build agents). This can make costs more predictable, especially for growing teams.

When to choose each platform

Choose CloudBees CI if:

  • You have a significant investment in Jenkins pipelines and plugins.
  • You need to standardize and govern existing Jenkins environments.
  • You require deep customization and flexibility.

Choose TeamCity if:

  • You want to reduce CI/CD maintenance overhead.
  • You prefer built-in functionality over plugin-based systems.
  • You need a scalable platform with predictable configuration and behavior.

Final thoughts

CloudBees CI and TeamCity solve similar problems in different ways.

CloudBees extends Jenkins into an enterprise-ready platform, preserving its flexibility while adding governance and support.

TeamCity rethinks the approach by delivering a CI/CD system with core capabilities built in, reducing complexity and making pipelines easier to maintain at scale.

For organizations evaluating how to move beyond Jenkins, the choice often comes down to this:

  • Continue evolving Jenkins with CloudBees
  • Or adopt a platform designed to avoid Jenkins’ operational trade-offs altogether

Understanding where your team sits on that spectrum is the key to making the right decision.

IntelliJ IDEA 2025.3.4 is Out!

We’ve just released IntelliJ IDEA 2025.3.4. This update introduces full support for Java 26 along with several notable improvements.

You can update to this version from inside the IDE, using the Toolbox App, or using snaps if you are an Ubuntu user. You can also download it from our website.

Here are the most notable improvements:

  • Resolved an issue where running an HTTP request could trigger a different request in the same file. [IJPL-66727]
  • Local changes refresh in large Perforce projects now works as expected. [IJPL-236557]
  • The Dependencies tab now opens correctly when using the Analyze Cyclic Dependencies feature. [IJPL-206236]

For a comprehensive overview of the fixes, see the release notes. If you spot any issues, let us know via the issue tracker.

Happy developing!