Comparative Analysis of Development Cycle Speed in Java and Kotlin Based on IDE Telemetry Data

Introduction

Does the choice of programming language affect how fast developers deliver code? This question matters for engineering teams evaluating technology stacks, yet it is notoriously hard to answer. Self-reported surveys suffer from recall bias, lines-of-code comparisons conflate conciseness with productivity, and controlled experiments rarely scale beyond a handful of participants.

In 2024, Meta introduced Diff Authoring Time (DAT) – the wall-clock time from when a developer starts working on a code change to when they submit it for review – as a scalable, telemetry-based productivity metric. Inspired by that work, we adapted the concept for IntelliJ IDEA’s built-in usage telemetry (feature usage statistics) and constructed IDE-DAT: the time from first code edit to push, measured directly inside the IDE.

This post presents a large-scale observational study comparing development cycle speed in Java and Kotlin. We analyzed telemetry data from approximately 320,000 IntelliJ IDEA developers over 20 months (November 2023 – June 2025), covering roughly 28 million development cycles. 

After controlling for user, project, overall time trend, and task size, we find that development cycles in Kotlin-oriented projects are generally shorter than comparable cycles in Java-oriented projects – roughly 15–20% shorter for everyday small, medium, and large tasks. In practice, the main pattern is not a dramatic one-time speedup, but slower degradation over time: as projects mature, cycle times in unmigrated Java contexts tend to grow, while Kotlin-oriented contexts deteriorate less.

A note on transparency. JetBrains is the creator of Kotlin, and we are aware that any study comparing Kotlin favorably to Java may be perceived as biased. For this reason, we rely on a rigorous statistical framework – longitudinal difference-in-differences on log-transformed outcomes, with multiple control groups and validity checks. We present the methodology, the data, the limitations, and the open questions in full so the reader can assess the strength of the evidence independently.

In the sections that follow, we describe the metric (Section 1), present the key finding and its practical magnitude (Section 2), walk through the detailed results (Section 3), examine threats to validity and open questions (Section 4), and document the full methodology (Section 5).

1. Measuring development speed: The IDE-DAT metric

1.1 The “first edit → push” cycle

IDE-DAT (IDE diff authoring time) is an adaptation of Meta’s DAT for IntelliJ IDE telemetry.

We measure the duration of a single development cycle:

Push₁ → [first edit, …, edits, …, commits] → Push₂

  • Cycle start = the moment of the first Java/Kotlin file edit after the previous push.
  • Cycle end = the moment of the next push.
  • IDE-DAT = wall-clock time between them.

This serves as a proxy for “time spent working on a single change” – from the moment a developer starts writing code to the moment they push the result.
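
To make the definition concrete, here is a minimal sketch of how such cycles could be reconstructed from a chronologically ordered per-user, per-project event stream. The event shape and field names are simplified assumptions for illustration, not the actual telemetry pipeline.

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float            # Unix timestamp (seconds)
    kind: str            # "edit" or "push" (simplified event kinds)
    file_type: str = ""  # "JAVA" or "Kotlin" for edit events

def extract_cycles(events: list[Event]) -> list[dict]:
    """Split one user-project event stream into first-edit -> push cycles."""
    cycles, first_edit, edits = [], None, []
    for e in sorted(events, key=lambda e: e.ts):
        if e.kind == "edit":
            if first_edit is None:   # first edit after the previous push opens a cycle
                first_edit = e.ts
            edits.append(e)
        elif e.kind == "push" and first_edit is not None:
            cycles.append({
                "dat_sec": e.ts - first_edit,  # IDE-DAT: first edit -> push
                "n_edits": len(edits),         # proxy for task size
                "java_edits": sum(x.file_type == "JAVA" for x in edits),
                "kotlin_edits": sum(x.file_type == "Kotlin" for x in edits),
            })
            first_edit, edits = None, []       # the push closes the cycle
    return cycles
```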

1.2 How task size is determined

Within each cycle, we count the number of edit events – instances of file editing that the IDE reports, with a one-minute cooldown (after each report, the system remains silent for one minute, even if the developer continues typing). The number of edit events is a proxy for task size: roughly speaking, how many times the developer switched between reading and writing code during the cycle. We do acknowledge, though, that the number of edits may also depend on the seniority level of the developers studied.

Cycles are grouped into size buckets:

| Bucket | Number of edits | Typical cycle duration (median, Java) | What kind of tasks |
|---|---|---|---|
| S | 1–5 | ~10 min | Small fix, single-file change |
| M | 6–15 | ~30 min | Small feature, bug fix |
| L | 16–40 | ~1.5–2 h | Feature spanning multiple files |
| XL | 41+ | ~10 h | Large feature or refactoring |
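
For illustration, the bucketing rule can be expressed as a small helper; this sketch simply restates the thresholds from the table above.

```python
def size_bucket(n_edits: int) -> str:
    """Map the number of edit events in a cycle to a task-size bucket (per the table above)."""
    if n_edits <= 5:
        return "S"
    if n_edits <= 15:
        return "M"
    if n_edits <= 40:
        return "L"
    return "XL"
```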

1.3 How cycle language is determined

For each cycle, edit events are tallied by file type. If Java edits outnumber Kotlin edits, the cycle is classified as a Java cycle; if Kotlin edits outnumber Java edits, it is classified as a Kotlin cycle. Cycles with equal counts are excluded.
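
A minimal sketch of this classification rule (illustrative only):

```python
def cycle_language(java_edits: int, kotlin_edits: int) -> str | None:
    """Classify a cycle by the majority of edited file types; ties are excluded."""
    if java_edits > kotlin_edits:
        return "Java"
    if kotlin_edits > java_edits:
        return "Kotlin"
    return None  # equal counts -> cycle is dropped from the analysis
```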

1.4 What the metric does not measure

  • Time spent on code review, planning, discussions, or CI/CD.
  • Code quality (bugs and reverts).
  • The distinction between pushing to a feature branch vs. main (branch names are not reported).
  • Code volume in lines (an edit event ≠ number of lines).

2. Key finding: Kotlin cycles are shorter

2.1 Development cycles in Kotlin are shorter for comparable tasks

Using the primary longitudinal log-DiD estimator on user-project × task-size contexts, the “first edit → push” cycle is shorter after migration to Kotlin than in comparable unmigrated Java contexts:

| Task size | Typical cycle (Java)* | Primary estimate | 95% CI | In absolute terms** |
|---|---|---|---|---|
| S: small fix | ~10 min | −15.7% | [−24.4%, −6.0%] | ~1–2 min faster |
| M: small feature | ~30 min | −20.3% | [−31.3%, −7.6%] | ~6 min faster |
| L: multi-file feature | ~1.5–2 h | −15.1% | [−26.8%, −1.6%] | ~15–20 min faster |
| XL: large feature | ~10 h | −11.0% | [−23.5%, +3.5%] | Directionally ~1 h faster, but imprecise |

* Approximate median cycle duration in the Kotlin migrants’ Java phase for that task-size bucket. Exact bucket medians are shown in Section 3.3.

** Approximate translation of the primary percentage estimate into minutes or hours for a typical Java cycle in that bucket.

How to read the table: A “small feature” (bucket M) is a cycle in which the developer made 6–15 editing sessions before pushing. A typical such cycle in Java lasts ~30 minutes from first edit to push. In the primary estimator, the corresponding post-migration Kotlin context is 20.3% shorter – approximately 24 minutes instead of 30.

The effect is obtained using a longitudinal difference-in-differences on log(DAT): For each user-project and task-size bucket, we compare the pre→post change among Java→Kotlin migrants with the corresponding change in the unmigrated Java control group. This subtracts the overall time trend and isolates the effect associated with the transition to Kotlin.

Important: In the stricter estimator, the bulk of the effect is still explained by degradation in the control group, rather than by dramatic speedups among migrants. Full details can be found in Section 2.3.

2.2 A case for conservative estimation

We control for task size by the number of edit events. At the same time, Kotlin includes a number of language features (data classes, default arguments, properties, extension functions, smart casts, etc.) that make code more concise. Consequently, the same logical task (one ticket in a tracker) might require, say, 20 edits in Java but only 15 in Kotlin. These would fall into different buckets.

Therefore, to the extent this conciseness effect holds, a direct comparison of “the same ticket” would show an even larger gap between the Kotlin migrants and the Java control group; in that sense, the bucket-based estimate is conservative.

2.3 How the effect manifests

The stricter DiD estimate is composed of two components (with task size controlled via buckets):

  1. Java→Kotlin migrants improve modestly for small and medium tasks: In the primary log-scale model, pre→post change is about −8% for buckets S and M, and roughly flat for L and XL.
  2. Unmigrated Java contexts degrade across all buckets: In the same model, pre→post change is about +9% to +17%.

In other words, projects that migrated to Kotlin exhibit materially less cycle-time growth than projects that remain on Java.

This is echoed by a complementary comparison on absolute DAT without task-size normalization: Projects that have consistently stayed on Kotlin (never migrated) degrade by +14.5% at p90 of absolute DAT, whereas unmigrated Java projects degrade by +23.1% (details in Section 3.4). Both groups degrade (projects grow more complex over time), but Kotlin projects do so at roughly half the rate at p90.

2.4 Practical magnitude

In practical terms, the central estimate corresponds to roughly 1–2 minutes saved on a small fix, ~6 minutes on a small feature, and ~15–20 minutes on a multi-file feature. For XL tasks, the point estimate is also negative, but the interval is too wide for a firm claim. Compared with the earlier descriptive median contrast, the stricter estimator no longer supports a monotonic “bigger task → bigger effect” story; the stable conclusion is narrower: for comparable tasks, Kotlin-oriented contexts show substantially less cycle-time growth than unmigrated Java controls.

3. Evidence in detail

3.1 Java→Kotlin migrants: DAT by phase

1,501 users, 1,664 user-projects, ~76K cycles. For each month of a migrant’s activity, a phase is determined by the share of Kotlin edits: Java phase (<10%), Transition (10–50%), and Kotlin phase (>50%).

| Metric | Java phase (N=29,554) | Transition (N=11,657) | Kotlin phase (N=35,406) | Δ Java→Kotlin |
|---|---|---|---|---|
| p25 DAT | 6.3 min | 5.6 min | 5.8 min | −8.7% |
| Median DAT | 34.7 min | 32.2 min | 32.1 min | −7.5% |
| p75 DAT | 4.51 h | 4.13 h | 4.02 h | −10.9% |
| p90 DAT | 39.2 h | 36.7 h | 34.2 h | −12.7% |
| Avg DAT | 14.5 h | 14.0 h | 13.6 h | −6.4% |
| Edits/cycle | 22.5 | 19.8 | 24.3 | +8.0% |

DAT decreases monotonically across all percentiles, with a smooth transition through the Transition phase. Notably, the number of edits per cycle increases – tasks in the Kotlin phase are larger, yet the cycle is still shorter.

3.2 Control group: Unmigrated Java

320,248 users, 665,154 user-projects, ~28M cycles. Users who remained on Java (kotlin_share <10% at the start and end, ≥4 active months). Their history is divided into three equal time-based thirds.

| Metric | Early | Middle | Late | Δ Early→late |
|---|---|---|---|---|
| p25 DAT | 6.6 min | 6.2 min | 6.0 min | −8.4% |
| Median DAT | 38.9 min | 37.7 min | 35.7 min | −8.2% |
| p75 DAT | 5.25 h | 5.45 h | 5.35 h | +2.0% |
| p90 DAT | 31.7 h | 35.5 h | 39.0 h | +23.1% |
| Avg DAT | 12.7 h | 13.7 h | 14.5 h | +14.1% |
| Edits/cycle | 21.2 | 19.6 | 17.9 | −15.4% |

The unmigrated Java group exhibits degradation: the median decreases slightly, but the tails (p75, p90) and the mean increase substantially. Projects grow more complex, long cycles become even longer, and the number of edits per cycle declines.

3.3 Primary longitudinal log-DiD with task-size control

We compare the DAT of same-size cycles among migrants (Java phase vs. Kotlin phase) and the control group (early vs. late). The bucket-level median tables below are provided for descriptive context only. The primary effect size is estimated afterwards on the basis of pre/post changes in log(DAT) at the user-project × task-size-bucket level.

Java→Kotlin migrants – median DAT by bucket and phase:

| Bucket | Java phase | Kotlin phase | Δ |
|---|---|---|---|
| S: 1–5 edits | 10.4 min | 9.6 min | −7.5% |
| M: 6–15 edits | 33.9 min | 31.9 min | −5.8% |
| L: 16–40 edits | 1.82 h | 1.70 h | −6.3% |
| XL: 41+ edits | 11.4 h | 12.2 h | +6.7% |

Unmigrated Java control group – median DAT by bucket:

| Bucket | Early | Late | Δ |
|---|---|---|---|
| S: 1–5 edits | 10.5 min | 11.0 min | +4.7% |
| M: 6–15 edits | 35.2 min | 38.8 min | +10.3% |
| L: 16–40 edits | 1.93 h | 2.27 h | +17.7% |
| XL: 41+ edits | 12.1 h | 15.3 h | +26.3% |

Primary estimator: log-DAT DiD on user-project × task-size-bucket contexts
Panel size: 978 migrant contexts and 400,425 control contexts, each with ≥3 cycles in both pre and post periods. Standard errors are clustered by machine_id.

| Task size | Migrants pre→post | Unmigrated Java pre→post | Primary log-DiD effect | 95% CI |
|---|---|---|---|---|
| S: small fix (~10 min) | −8.1% | +9.0% | −15.7% | [−24.4%, −6.0%] |
| M: small feature (~30 min) | −7.3% | +16.4% | −20.3% | [−31.3%, −7.6%] |
| L: multi-file feature (~1.5–2 h) | −0.3% | +17.5% | −15.1% | [−26.8%, −1.6%] |
| XL: large feature (~10 h) | −0.1% | +12.2% | −11.0% | [−23.5%, +3.5%] |

The stricter estimator is less extreme than the earlier descriptive median contrast and does not support a monotonic increase with task size. The stable conclusion is narrower: For comparable tasks, Kotlin-oriented contexts show materially less cycle-time growth than unmigrated Java controls, with statistically supported negative estimates in S, M, and L, and the strongest precision in S and M. Pooling S/M/L contexts yields a primary estimate of about −17.1% (95% CI [−23.7%, −9.9%]).

As a robustness check, equal-weighting users rather than user-project contexts yields similar point estimates for S (−18.8%) and M (−20.3%), a weaker but still negative estimate for L (−13.7%), and again an imprecise estimate for XL (−11.0%). Thus, the sign is stable, while exact magnitudes depend on weighting, especially for larger tasks.

3.4 Complementary evidence from stable groups: Unmigrated Kotlin vs. unmigrated Java

This comparison does not involve migrants at all – it contrasts the trends of the two stable groups:

| Metric | Unmigrated Java (Δ early→late) | Unmigrated Kotlin (Δ early→late) |
|---|---|---|
| Median DAT | −8.2% | −7.7% |
| p75 DAT | +2.0% | +0.8% |
| p90 DAT | +23.1% | +14.5% |
| Avg DAT | +14.1% | +9.9% |
| Edits/cycle | −15.4% | −12.4% |

Unmigrated Kotlin projects degrade roughly half as fast at p90 and ~4 pp slower on average. This provides complementary evidence from stable groups and shows a directionally similar pattern without using the migrant cohort.

3.5 Cross-sectional comparison: Within-month

The most controlled design: same user, same project, same month, and same task size. 1,801 users and 6,908 contexts.

| Bucket | Java (median DAT) | Kotlin (median DAT) | Δ |
|---|---|---|---|
| S: 1–5 edits | 10.1 min | 9.8 min | −2.0% |
| M: 6–15 edits | 31.6 min | 31.6 min | 0% |
| L: 16–40 edits | 1.69 h | 1.62 h | −3.9% |
| XL: 41+ edits | 9.76 h | 10.91 h | +11.8% |

The cross-sectional effect is more modest (−2% to −4% for S/L, 0% for M, opposite in XL) than the primary longitudinal estimate. This suggests Kotlin’s contribution is not primarily an instantaneous within-month speedup, but rather a gradual reduction in cycle time that emerges in the longitudinal view.


4. Validity checks, limitations, and open questions

4.1 Addressing selection bias: Stepwise confounder control

At each stage of the analysis, we progressively eliminated confounders (factors that could distort the comparison – for example, if Kotlin developers are inherently more experienced or work on simpler projects):

| Design | Kotlin vs. Java difference |
|---|---|
| All users (naïve comparison) | −6% |
| Within-user (same individuals) | −3.5% |
| Within-user + within-project + within-month | +12% (!) |
| …+ task-size control | −2% to −4% (for S/M/L) |
| Longitudinal log-DiD + task-size control | ≈ −15% to −20% (S/M/L point estimates); XL directional only |

Each step of confounder control changes the picture. The naïve comparison overstates the effect (selection bias). The within-month design without task-size control yields the opposite result (+12%, i.e. Kotlin appears slower) because Kotlin cycles in mixed-language projects contain ~15% more edit events. Our hypothesis: In such projects, new functionality tends to be written in Kotlin (larger cycles), while legacy maintenance is done in Java (smaller cycles), and without normalization, this creates an artifact. Only after controlling for task size and moving to a longitudinal design does a stable negative gap emerge. In the stricter estimator, S, M, and L all remain negative, with the strongest precision in S and M, while XL is too imprecise for a firm claim.

4.2 Additional robustness checks

4.2.1 Stability across months

Cross-sectional comparison of Java vs. Kotlin in mixed-language projects (same user, same project) over six months:

| Month | Java median | Kotlin median | Δ | Kotlin faster? |
|---|---|---|---|---|
| 2025-01 | 31.8 min | 29.6 min | −6.9% | Yes |
| 2025-02 | 35.1 min | 32.5 min | −7.5% | Yes |
| 2025-03 | 27.3 min | 28.3 min | +3.7% | No |
| 2025-04 | 27.0 min | 29.0 min | +7.4% | No |
| 2025-05 | 30.0 min | 28.5 min | −5.2% | Yes |
| 2025-06 | 31.7 min | 31.1 min | −2.1% | Yes |

The direction is inconsistent (4 out of 6 months favor Kotlin). The instantaneous effect is small and noisy – a robust effect is only visible in the longitudinal design.

4.2.2 Breakdown by project size

A separate descriptive split by total project size shows a similar pattern. Because this check is based on aggregate p90/avg trends rather than the primary log-DiD estimator, it should be read as exploratory. The pattern is most pronounced in L-projects (500–2,000 edits over the entire period):

| Project size | Descriptive gap on p90 | Descriptive gap on avg |
|---|---|---|
| XL (2,000+ edits) | −13.9% | −11.7% |
| L (500–2,000 edits) | −39.7% | −23.9% |

4.2.3 Java version as a proxy for engineering culture

The “active team effect” hypothesis posits that the slower degradation of Kotlin projects is explained not by the language itself but by the characteristics of the team. If this were the case, then teams with a stronger engineering culture within the unmigrated Java group should degrade more slowly, too.

To test this, we used project JDK version as a proxy for engineering culture. The MODULE_JDK_VERSION event from IDE telemetry contains the major Java version. Unmigrated Java user-projects were segmented into:

  • old_java: maximum JDK version ≤ 11 (~308K user-projects).
  • modern_java: maximum JDK version ≥ 17 (~348K user-projects).

DAT/edit degradation (median minutes per edit, early → late) by bucket:

| Bucket | old_java Δ | modern_java Δ | All unmigrated Java Δ | Java→Kotlin migrants Δ |
|---|---|---|---|---|
| S: 1–5 edits | +0.1% | +4.0% | +4.7% | −7.5% |
| M: 6–15 edits | +9.1% | +10.5% | +10.3% | −5.8% |
| L: 16–40 edits | +17.2% | +16.6% | +17.7% | −6.3% |
| XL: 41+ edits | +28.9% | +25.1% | +26.3% | +6.7% |

The relationship between Java version and degradation rate is mixed: for large tasks (XL), modern_java degrades 3.8 pp more slowly than old_java, but for small tasks (S) it degrades 3.9 pp faster. For buckets M and L, the difference between segments is minimal (≤1.4 pp). There is no systematic advantage for modern_java.

Descriptive DAT/edit contrast recalculated with modern_java as the control group:

| Bucket | Migrants Δ | Control: all unmigrated Java | Descriptive contrast (original) | Control: modern_java | Descriptive contrast (adjusted) |
|---|---|---|---|---|---|
| S | −7.5% | +4.7% | −12.2% | +4.0% | −11.5% |
| M | −5.8% | +10.3% | −16.1% | +10.5% | −16.3% |
| L | −6.3% | +17.7% | −24.0% | +16.6% | −22.9% |
| XL | +6.7% | +26.3% | −19.6% | +25.1% | −18.4% |

When using modern_java as the control group, the descriptive contrast changes little (deviation ≤1.2 pp across buckets). This check was performed on the simpler DAT/edit view, not on the primary log-DiD estimator, so it should be read as auxiliary evidence only.

Interpretation: within this descriptive check, Java version as a proxy for engineering culture is weakly associated with DAT/edit degradation rate, and substituting the control group with modern_java has almost no effect on the descriptive contrast. This weakens the hypothesis that the difference between migrants and the control group is explained solely by team characteristics – at least to the extent that JDK version reflects those characteristics.

However, JDK version is only one possible proxy for engineering culture. Other factors (code review practices, CI/CD pipelines, refactoring habits) may differ between Kotlin migrants and the control group, even though they do not correlate with the Java version used.

4.3 Confidence level

| Aspect | Status | Comment |
|---|---|---|
| User control | ✅ | Within-user comparison (same individual) |
| Project control | ✅ | Within-project comparison (same project) |
| Time-trend control | ✅ | Primary log-DiD with unmigrated Java control; stable-group comparison gives complementary evidence |
| Estimator form | ✅ | Main result based on log-DAT changes at the user-project × task-size level |
| Project-size robustness | ✅ | Descriptive split by total project size shows a directionally similar pattern across the available project-size segments; this is supportive evidence, not part of the primary estimator |
| Task-size control | ✅ | Bucketing by number of edits per cycle |
| Secondary comparison group | ✅ | Unmigrated Kotlin provides an additional comparison and shows a directionally similar pattern, although it is not part of the primary estimator |
| Sample size | ✅ | 1,501 migrants, ~76K migrant cycles, ~28M control cycles |
| Temporal stability | ⚠️ | Cross-sectional month-by-month comparison is unstable; effect is visible in the longitudinal design |
| Weighting sensitivity | ⚠️ | Magnitudes vary across context-weighted and cycle-weighted aggregations |
| Large-task precision | ⚠️ | XL interval includes zero; L is weaker in user-level robustness checks |
| Branch information | ⚠️ | Cannot distinguish a push to a feature branch from a push to main |
| Cycle definition | ⚠️ | A single push may encompass multiple tasks |
| Causality | ⚠️ | Observational study, not an experiment |

4.4 Threats to validity

What could weaken the result?

  1. Unobserved team characteristics: Segmentation of unmigrated Java by JDK version showed that even this rough proxy for engineering culture only slightly narrows the descriptive contrast for individual buckets. Other unobserved factors that systematically differ between migrants and the control group may exist and could further reduce the gap.
  2. Unobserved confounders: Factors we cannot measure remain: concurrent refactoring, process changes, and dependency upgrades.
  3. Weighting sensitivity: Equal-weighting user-project contexts yields stronger magnitudes than cycle-weighted variants. The sign remains negative, but the exact effect size depends on weighting, especially for large tasks.
  4. Large-task precision: The XL bucket is directionally negative, but its 95% CI includes zero. The L bucket is negative in the primary model but weaker under user-level robustness checks.
  5. Push ≠ PR: A push is a proxy for delivery. A PR may be created later or through a web interface.
  6. Calendar time: DAT includes nights, weekends, and lunch breaks. This adds noise, but it affects both groups equally.
  7. Cycle definition: A cycle = the interval between pushes; a single push may encompass multiple logical tasks.

What could strengthen the result?

  1. Bucketing may work in Java’s favor. If Kotlin code is more concise, then the same logical task in Kotlin may require fewer edits than in Java. In that case, within each bucket, Kotlin tasks would be objectively larger. This is a hypothesis we cannot verify directly from the available data (we do not know which “logical task” underlies each cycle), but if it holds, the real effect is stronger than measured.

4.5 Open questions

  1. Pre-trends / event-study: Do migrant and control trends look parallel before the transition to Kotlin?
  2. Alternative controls in the primary estimator: Does the log-DiD result hold when using only modern_java or matched controls?
  3. Robustness check on thresholds: Does the effect hold under alternative migrant definitions (5/20% instead of 10%, 40/60% instead of 50%)?
  4. DAT of reverse migrants with task-size control: Do cycle durations worsen when moving away from Kotlin?
  5. Long-term dynamics: Does the effect continue to grow after 12+ months on Kotlin, or does it plateau?
  6. Other productivity metrics: Which parts of the development process — such as compilation errors, build times, and rebuild frequency — would best clarify the data we collected?
  7. Android Studio: Will the difference be the same for the segment of Android developers?
  8. Propensity score matching: Could matching each migrant with a “twin” from the control group with similar baseline characteristics (project size, initial speed, activity) yield a more precise DiD estimate?

5. Methodology

5.1 IDE telemetry events

| Event | group_id | event_id | Key field |
|---|---|---|---|
| File editing | file.types.usage | edit | file_type = "JAVA" / "Kotlin" |
| Push | actions | action.finished | action_id = "Vcs.Push" / "Git.Commit.And.Push.Executor" |

On the nature of the edit event: The edit event is reported with a 1-minute cooldown – after being sent, the system does not record new edits for one minute, even if the developer continues typing. Between edit events, a developer also reads code, navigates the project, runs tests, and discusses issues with colleagues – all of which are part of the work cycle included in DAT but do not generate edit events. Therefore, the number of edits is a proxy for task size, not a measure of time spent. DAT measures the full wall-clock time of the cycle.

5.2 Filtering

  • DAT > 36 sec and < 14 days
  • Only product_code = ‘IU’ (IntelliJ IDEA)
  • Only recorder_code = ‘FUS’
  • Non-empty machine_id and project_id
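
Expressed as a data-frame filter, the rules above look roughly like this (a sketch assuming one row per cycle with these column names; not the actual pipeline):

```python
import pandas as pd

def filter_cycles(cycles: pd.DataFrame) -> pd.DataFrame:
    """Apply the filtering rules listed above; column names are illustrative assumptions."""
    return cycles[
        (cycles["dat_sec"] > 36)
        & (cycles["dat_sec"] < 14 * 24 * 3600)   # shorter than 14 days
        & (cycles["product_code"] == "IU")       # IntelliJ IDEA
        & (cycles["recorder_code"] == "FUS")
        & cycles["machine_id"].notna() & (cycles["machine_id"] != "")
        & cycles["project_id"].notna() & (cycles["project_id"] != "")
    ]
```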

5.3 Defining migration groups

For each (machine_id, project_id) per month, kotlin_share = kotlin_edits / (java_edits + kotlin_edits) is calculated. A minimum of ≥10 Java/Kotlin edit events per month and ≥4 active months is required.

| Group | Definition |
|---|---|
| A: Java→Kotlin migrants | First month kotlin_share < 10%, last month > 50% |
| B: Unmigrated Java | First and last month kotlin_share < 10% |
| C: Unmigrated Kotlin | First and last month kotlin_share > 50% |

Migrant phases are determined monthly: Java phase (<10%), Transition (10–50%), and Kotlin phase (>50%).

Unmigrated group phases: History is divided into three equal time-based thirds (early / middle / late) – analogous to migrant phases.
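
A simplified sketch of this group assignment (illustrative only; the actual pipeline operates on monthly telemetry aggregates):

```python
def monthly_kotlin_share(java_edits: int, kotlin_edits: int) -> float | None:
    """kotlin_share for one (machine_id, project_id) month; months with <10 edits are not active."""
    total = java_edits + kotlin_edits
    if total < 10:
        return None
    return kotlin_edits / total

def migration_group(monthly_shares: list[float]) -> str | None:
    """Classify a user-project by its first and last active months (>=4 active months required)."""
    if len(monthly_shares) < 4:
        return None
    first, last = monthly_shares[0], monthly_shares[-1]
    if first < 0.10 and last > 0.50:
        return "A: Java->Kotlin migrant"
    if first < 0.10 and last < 0.10:
        return "B: unmigrated Java"
    if first > 0.50 and last > 0.50:
        return "C: unmigrated Kotlin"
    return None  # other trajectories are not assigned to a group
```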

5.4 Difference-in-differences (DiD)

The primary estimator is a longitudinal log-DiD on user-project × task-size-bucket contexts:

  1. A context is defined as (machine_id, project_id, size_bucket).
  2. Only contexts with ≥3 cycles in both pre and post periods are kept.
  3. For each context, we compute ΔlogDAT = mean(log(DAT))_post − mean(log(DAT))_pre.
  4. We then estimate the treated-control gap: Primary log-DiD effect = exp(mean(ΔlogDAT)_migrants − mean(ΔlogDAT)_control) − 1.

Standard errors are clustered by machine_id, since one user may contribute multiple projects or buckets.

For interpretability, we also show bucket-level median DAT tables and simple percentage changes. These descriptive summaries are not the primary estimator.
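
The sketch below illustrates steps 1–4 on a cycle-level data frame (column names are assumptions). It pools all buckets for brevity; the per-bucket estimates in Section 3.3 correspond to running the same computation within each size bucket, and clustered standard errors are not shown.

```python
import numpy as np
import pandas as pd

def log_did_effect(cycles: pd.DataFrame) -> float:
    """
    Longitudinal log-DiD sketch. Expects one row per cycle with columns
    machine_id, project_id, size_bucket, period in {"pre", "post"},
    group in {"migrant", "control"}, and dat_sec. Column names are assumptions.
    """
    cycles = cycles.assign(log_dat=np.log(cycles["dat_sec"]))
    ctx_cols = ["group", "machine_id", "project_id", "size_bucket"]

    # Mean log(DAT) per context and period, keeping contexts with >=3 cycles in both periods.
    agg = (cycles.groupby(ctx_cols + ["period"])["log_dat"]
                 .agg(["mean", "size"]).unstack("period").dropna())
    agg = agg[(agg[("size", "pre")] >= 3) & (agg[("size", "post")] >= 3)]

    # Per-context pre -> post change in mean log(DAT).
    delta = agg[("mean", "post")] - agg[("mean", "pre")]
    gap = (delta.xs("migrant", level="group").mean()
           - delta.xs("control", level="group").mean())
    return float(np.exp(gap) - 1)  # treated-control gap, back-transformed to a % effect
```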

5.5 Task-size control

Raw DAT is not appropriate for comparison: Kotlin cycles in mixed-language projects contain ~15% more edits (our hypothesis is that new functionality is written in Kotlin while legacy maintenance is done in Java).

The primary control method is bucketing: We compare only cycles of the same size. We also use DAT/edits as an auxiliary descriptive normalization in selected robustness checks.

Two control methods are used:

  • DAT/edits: Cycle duration divided by the number of edits.
  • Bucketing: Comparing cycles of the same size: S (1–5), M (6–15), L (16–40), XL (41+ edits).

5.6 Data

  • Period: November 2023 – June 2025 (~20 months).
  • Product: IntelliJ IDEA (product_code = ‘IU’).
  • Volumes: ~28M control-group cycles, ~76K migrant cycles, ~2.5M unmigrated Kotlin cycles.
  • Primary log-DiD panel: 978 migrant contexts and 400,425 control contexts with ≥3 cycles in both pre and post periods.

Conclusion

This study presents large-scale observational evidence that development cycles in Kotlin-oriented projects are shorter than comparable cycles in Java-oriented projects. The primary longitudinal log-DiD estimate, controlling for user, project, time trend, and task size, places the effect at roughly 15–20% for everyday tasks (S, M, and L buckets). For XL tasks, the point estimate is also negative but statistically imprecise.

The dominant pattern is not a sudden acceleration upon switching to Kotlin, but rather a difference in trajectory: Unmigrated Java projects tend to experience substantial cycle-time growth over the observation period (+9% to +17% in the primary model), while migrant projects show modest improvement or remain flat. The result is independently echoed by the unmigrated Kotlin vs. unmigrated Java comparison, where Kotlin projects degrade at roughly half the rate.

We want to be explicit about what this study does not establish. This is an observational study, not a randomized experiment, and we cannot make definitive causal claims. Teams that choose to migrate to Kotlin may differ from those that stay on Java in ways we cannot fully observe – though our validity checks (JDK-version segmentation, multiple control groups, stepwise confounder elimination) suggest these differences alone do not explain the gap. We encourage readers to examine the limitations in Section 4.4 and the open questions in Section 4.5 when forming their own assessment of the evidence.

Several directions for future work could strengthen or refine these findings: event-study analysis of pre-trends, propensity score matching for more precise control-group construction, and extension to Android Studio where Kotlin is the default language. We plan to pursue these in subsequent analyses.

We Gave LLMs 150 Tools: Here’s What Broke.

There’s a hypothesis that most people building AI agents have encountered but few have measured: the more tools you give an LLM, the worse it gets at picking the right one.

It’s intuitive. Connect a few MCP servers to your agent, and suddenly it’s choosing from 60, 80, 100+ tools. GitHub tools, GitLab tools, Kubernetes, Slack, Jira, PagerDuty, Terraform, Grafana, all loaded into the context window, all the time. The model has to read every tool definition, understand the distinctions between them, and pick the right one. That’s a lot of signal to sift through.

But intuition isn’t data. So we built Boundary, an open-source framework for finding where LLM context breaks, and ran the numbers.

The setup

We assembled 150 tool definitions based on real schemas from production agent systems across 16 services: GitHub, GitLab, Jira, Confluence, Kubernetes, AWS, Datadog, Slack, PagerDuty, Okta, Snyk, Grafana, Terraform Cloud, Docker, Linear, and Notion. The tools are synthetic (no-op for benchmarking) but the schemas, parameter structures, and descriptions mirror what you’d find in a production MCP environment.

We tested six models across three providers:

  • Claude Sonnet 4.6 and Claude Haiku 4.5 (Anthropic)
  • GPT-4o and GPT-5.4 Mini (OpenAI)
  • Grok 4 and Grok 4.1 Fast Reasoning (xAI)

Each model received 60 prompts (both direct requests and ambiguous ones) at five toolset sizes: 25, 50, 75, 100, and 150 tools. At each size, the available tools were randomly selected but always included the correct one. The question: does the model pick the right tool?
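
Schematically, the evaluation loop looks like the sketch below. This is an illustration of the setup, not Boundary’s actual API; the prompt-case shape and the call_model helper are placeholders for provider-specific tool-calling requests.

```python
import random

TOOLSET_SIZES = [25, 50, 75, 100, 150]

def run_benchmark(prompts, all_tools, call_model):
    """
    For each toolset size, sample a random subset of tool definitions that always
    contains the expected tool, ask the model to pick one, and record whether it
    chose correctly. `call_model(prompt, tools)` returns the name of the chosen tool.
    """
    results = {}
    for size in TOOLSET_SIZES:
        correct = 0
        for case in prompts:  # each case: {"prompt": ..., "expected_tool": ...}
            expected = next(t for t in all_tools if t["name"] == case["expected_tool"])
            others = [t for t in all_tools if t["name"] != case["expected_tool"]]
            tools = random.sample(others, size - 1) + [expected]
            random.shuffle(tools)
            picked = call_model(case["prompt"], tools)
            correct += picked == case["expected_tool"]
        results[size] = correct / len(prompts)
    return results
```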

The results

Every model that completed the test degraded. Two didn’t finish at all.

| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
|---|---|---|---|---|---|
| Grok 4.1 Fast | 86.7% | 83.3% | 80.0% | 83.3% | 76.7% |
| GPT-5.4 Mini | 85.0% | 85.0% | 80.0% | 83.3% | failed |
| GPT-4o | 81.7% | 78.3% | 73.3% | 76.7% | failed |
| Claude Haiku 4.5 | 81.7% | 80.0% | 78.3% | 80.0% | 76.7% |
| Grok 4 | 80.0% | 78.3% | 80.0% | 71.7% | 80.0% |
| Claude Sonnet 4.6 | 78.3% | 73.3% | 73.3% | 76.7% | 75.0% |

Accuracy vs toolset size across 6 LLMs

GPT-5.4 Mini was the most surprising result. At 85% accuracy through 50 tools, 92% on ambiguous prompts, sub-1-second latency, and $0.002 per call, it was arguably the best overall performer for small-to-medium toolsets. Then it hit the same 128-tool wall as GPT-4o and failed completely at 150.

Grok 4.1 Fast Reasoning was the only model that combined top-tier accuracy with the ability to handle 150 tools. It degraded steadily from 86.7% to 76.7%, but it never broke.

Both OpenAI models failed at 150 tools. OpenAI’s API has a hard limit of 128 tools per request. This isn’t a degradation curve. It’s a wall. If your agent connects enough MCP servers to exceed 128 tools, no OpenAI model works.

Claude Sonnet 4.6, the most expensive model in the test ($0.028/call), was the least accurate at 25 tools and never recovered. Claude Haiku outperformed it at every size while costing 3x less.

Cross-service confusion scales with tools

Cross-service confusion, where a model picks a tool from the wrong service entirely, was the most dangerous failure mode.

| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 0 | 0 | 1 | 2 | 4 |
| Grok 4.1 Fast | 0 | 0 | 0 | 2 | 3 |
| Claude Sonnet 4.6 | 0 | 1 | 2 | 3 | 2 |
| Grok 4 | 2 | 0 | 2 | 4 | 1 |
| GPT-4o | 0 | 0 | 1 | 2 | n/a |
| GPT-5.4 Mini | 0 | 0 | 2 | 1 | n/a |

Grok 4 had cross-service errors even at 25 tools. Claude Haiku was clean until 75 tools but escalated to 4 errors at 150, the worst of any model at that size.

The most common cross-service confusions across all models:

  • Datadog vs Grafana: “Check the monitoring alerts” consistently routed to the wrong observability platform
  • Notion vs Confluence: “Search for documentation” split between the two
  • Linear vs Jira: “Add a comment to the tracking issue” picked the wrong project tracker
  • GitHub vs GitLab: “Show me the open issues” confused the two at higher tool counts

Direct vs. ambiguous prompts

A “direct” prompt names the service: “List all Terraform Cloud workspaces.” An “ambiguous” prompt doesn’t: “Add a comment saying ‘Resolved’ to the tracking issue.”

| Model | 25 tools (ambig.) | 50 tools (ambig.) | 75 tools (ambig.) | 100 tools (ambig.) | 150 tools (ambig.) |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 92% | 92% | 67% | 92% | n/a |
| Grok 4.1 Fast | 83% | 83% | 83% | 75% | 67% |
| Claude Sonnet 4.6 | 83% | 75% | 75% | 83% | 75% |
| GPT-4o | 83% | 83% | 67% | 58% | n/a |
| Claude Haiku 4.5 | 75% | 75% | 83% | 67% | 67% |
| Grok 4 | 67% | 75% | 67% | 50% | 67% |

GPT-5.4 Mini dominated ambiguous prompts at 92% through 100 tools. It handled disambiguation better than any other model by a wide margin. GPT-4o collapsed to 58% at the same size. Grok 4 hit 50%, a coin flip.

Claude Sonnet was the most stable, staying between 75% and 83% regardless of toolset size. Consistent, but never great.

Where models get confused

The errors tell a story. Some patterns appeared across all six models:

Terraform is hard. All models consistently confused terraform_create_run with terraform_list_workspaces, and terraform_lock_workspace with terraform_get_workspace. The tool names are semantically close, and the models default to “list” or “get” operations when the toolset is crowded.

Snyk is a trap. snyk_get_remediation, snyk_list_container_projects, and snyk_list_projects all got misrouted to snyk_list_organizations. When Snyk tools are buried among 100+ others, the models default to the most generic-sounding option.

Confluence updates fail. All models picked confluence_search when asked to update a page. The prompt said “Update the runbook page”, but with 75+ tools in context, the model reached for search instead of the update operation.

Monitoring platform confusion. Datadog and Grafana both have alerting, dashboards, and metrics tools. The prompt “Check the monitoring alerts for the API server” got routed to Grafana instead of Datadog by every model at some toolset size. Adding two similar services to the toolset creates permanent ambiguity.

The latency story

Accuracy isn’t the only cost.

| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
|---|---|---|---|---|---|
| GPT-5.4 Mini | 739 ms | 754 ms | 849 ms | 976 ms | n/a |
| GPT-4o | 1,170 ms | 4,035 ms | 6,213 ms | 7,657 ms | n/a |
| Claude Haiku 4.5 | 2,463 ms | 6,157 ms | 8,765 ms | 11,473 ms | 16,749 ms |
| Claude Sonnet 4.6 | 4,728 ms | 10,308 ms | 14,579 ms | 19,120 ms | 27,935 ms |
| Grok 4.1 Fast | 6,448 ms | 7,042 ms | 6,930 ms | 7,349 ms | 7,533 ms |
| Grok 4 | 7,706 ms | 7,945 ms | 8,133 ms | 8,418 ms | 9,552 ms |

Latency vs toolset size

GPT-5.4 Mini was the latency champion: sub-1-second at every toolset size it completed. The Anthropic models scaled linearly, with Sonnet reaching 28 seconds at 150 tools. The xAI models barely changed, staying in the 6-10 second range regardless of tool count.

What this means

The pattern is consistent across six models from three providers: more tools means worse accuracy, and the degradation starts between 25 and 50 tools.

The implications for anyone building agents with MCP:

  1. Don’t load everything. If your agent has access to 10+ services, that’s easily 80-150 tools. Loading them all upfront is a measurable tax on accuracy, starting at 25 tools.

  2. OpenAI has a hard wall at 128 tools. Both GPT-4o and GPT-5.4 Mini failed at 150. This isn’t a model quality issue. It’s a platform constraint. If your agent might exceed 128 tools, OpenAI models are not an option.

  3. Ambiguous prompts are the danger zone. Grok 4 hit 50% accuracy on ambiguous prompts at 100 tools. GPT-4o dropped to 58%. When users don’t name the service explicitly, the model has to disambiguate, and more tools make that disambiguation substantially harder.

  4. Similar services compound the problem. Datadog and Grafana. Notion and Confluence. Linear and Jira. GitHub and GitLab. Every pair of similar services in the toolset creates a permanent source of confusion that scales with tool count.

  5. Latency compounds. Even if accuracy were flat, the latency cost matters. Claude Sonnet at 28 seconds per call is unusable for interactive workloads. GPT-5.4 Mini at sub-1-second is a different product entirely.

  6. Price does not predict performance. Claude Sonnet 4.6 costs 28x more per call than Grok 4.1 Fast and is less accurate. Claude Haiku outperforms Claude Sonnet at 3x lower cost. The most expensive model lost.

The cost equation

What you pay per call versus what you get in accuracy.

| Model | Total cost | Calls | Cost/call | Best accuracy | Worst accuracy |
|---|---|---|---|---|---|
| Grok 4.1 Fast | $0.31 | 300 | $0.0010 | 86.7% (25t) | 76.7% (150t) |
| GPT-5.4 Mini | $0.50 | 240* | $0.0021 | 85.0% (25t) | failed (150t) |
| GPT-4o | $1.57 | 240* | $0.0065 | 81.7% (25t) | failed (150t) |
| Claude Haiku 4.5 | $2.83 | 300 | $0.0094 | 81.7% (25t) | 76.7% (150t) |
| Grok 4 | $3.85 | 300 | $0.013 | 80.0% (25t) | 71.7% (100t) |
| Claude Sonnet 4.6 | $8.51 | 300 | $0.028 | 78.3% (25t) | 73.3% (50t) |

*OpenAI models completed 240 of 300 calls. All calls at 150 tools failed due to the 128-tool API limit.

Cost vs accuracy tradeoff

The two cheapest models (Grok 4.1 Fast at $0.001/call and GPT-5.4 Mini at $0.002/call) were also the two most accurate. The most expensive model (Claude Sonnet at $0.028/call) was the least accurate. The correlation between price and tool-calling performance is not just weak. It’s inverted.

This is exactly the kind of tradeoff Boundary is designed to surface. Without benchmark data, you’d likely pick Claude Sonnet or GPT-4o. The data says they’re among the worst choices for tool-calling workloads. A team running fewer than 128 tools should seriously consider GPT-5.4 Mini for its combination of accuracy, speed, and cost. A team that might exceed 128 needs Grok 4.1 Fast or an Anthropic model.

Running these benchmarks costs almost nothing. This entire run across six models cost $17. That’s less than a single hour of engineer time debugging a misrouted tool call in production.

How this shaped our architecture

This data isn’t theoretical for us. It directly informed how we built progressive disclosure in SixDegree.

The core insight: if accuracy degrades between 25 and 50 tools, then the goal isn’t to find a smarter model. It’s to never present more than 25 tools in the first place. Not by hardcoding a curated list, but by letting the agent’s context determine which tools are relevant at each step.

In SixDegree, when an agent queries the ontology and discovers a GitHub repository, only the GitHub tools become available. When a Kubernetes deployment surfaces through a relationship, the Kubernetes tools appear. The agent never sees all 150 tools at once because it never needs to. The toolset at any given turn is scoped to the entities the agent has actually encountered.
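
Conceptually, the scoping logic can be sketched like this (an illustration of the idea, not SixDegree’s actual implementation; the service and tool names are placeholders):

```python
# Progressive disclosure sketch: the toolset offered to the model at each turn is
# derived from the services behind the entities the agent has encountered so far,
# instead of every installed tool being loaded up front.

ALL_TOOLS = {
    "github": ["github_list_issues", "github_create_issue"],
    "kubernetes": ["k8s_get_pods", "k8s_restart_deployment"],
    "datadog": ["datadog_list_monitors"],
    # ... one entry per connected service, 150+ tools in total
}

def tools_for_context(discovered_entities: list[dict]) -> list[str]:
    """Expose only the tools for services linked to entities already in context."""
    services = {e["service"] for e in discovered_entities}
    return [tool for svc in services for tool in ALL_TOOLS.get(svc, [])]

# Example: after the agent discovers a GitHub repo and a Kubernetes deployment,
# it sees a handful of tools instead of 150.
scoped = tools_for_context([
    {"type": "repository", "service": "github"},
    {"type": "deployment", "service": "kubernetes"},
])
```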

The Boundary data validates this approach quantitatively. At 25 tools (roughly the size of two or three services’ worth of tools), accuracy is in the mid-to-high 80s. That’s the operating range progressive disclosure keeps you in, regardless of how many total services are connected. You can have 16 integrations and 150 tools installed, and the agent still only sees the 10-20 that matter for the current conversation.

The alternative, loading everything and hoping the model figures it out, costs you 5-10 percentage points of accuracy, up to 28x the latency, and for OpenAI models, a hard failure at 128 tools. Progressive disclosure isn’t a nice-to-have. It’s a requirement for agents that work at scale.

Limitations and what we’d like to improve

This benchmark is a starting point, not a definitive answer. There are real limitations to what it measures and how:

Single-turn only. Each prompt gets one shot at picking a tool. Real agents chain tool calls, use results from previous calls to inform the next one, and recover from mistakes. A model that picks the wrong tool on the first try might self-correct on a second turn. This benchmark doesn’t capture that.

Random tool subsets. At each toolset size, the available tools are randomly selected (with the correct one always included). In production, the tools in context aren’t random. They’re usually grouped by service or use case. Random selection may overstate or understate confusion depending on which tools end up adjacent.

No parameter validation. We check whether the model picked the right tool, but not whether it filled in the parameters correctly. A model that picks github_create_issue but hallucinates the owner field is still counted as correct. Parameter accuracy is a whole separate dimension.

Prompt quality varies. Some of the ambiguous prompts have debatable expected answers. “Check the monitoring alerts” could reasonably map to either Datadog or Grafana depending on the organization. We picked one, but reasonable people could disagree.

Single trial. Each prompt runs once per toolset size. With 60 prompts per size, the results are directional but individual percentage points could shift with more trials.

We’d like to add multi-turn evaluation, parameter accuracy checking, configurable prompt difficulty levels, and more models. If you have ideas for how to make this benchmark better, if you disagree with our methodology, or if you’ve run Boundary against a model we haven’t tested yet, open an issue or submit a PR. This is an open source project and we want the community to help shape it.

What’s next

The full interactive results from this run are available on our site. The framework is open source. Run it yourself and see how your preferred models handle tool overload.

Boundary is an open-source framework for finding where LLM context breaks. See how SixDegree solves tool overload.

PhpStorm 2026.1 is Now Out

Welcome to PhpStorm 2026.1! This release brings new PhpStorm MCP tools, new third-party agents inside your IDE, support for Git worktrees, and lots of other productivity-enhancing features for PHP and Laravel developers.

Download PhpStorm 2026.1

PhpStorm MCP tools

In PhpStorm 2025.2, we added an integrated MCP server for third-party coding agents like Claude Code, Windsurf, or Codex to access and use your IDE’s tools. 

In 2026.1, we are extending the MCP server toolset with more PhpStorm features, including:

  • Inspections and quick-fixes that enable agents to leverage PhpStorm’s powerful static analysis engine.
  • IDE search capabilities, including PhpStorm’s structural search and semantic search for code patterns.
  • Access to IDE actions so that you can delegate setup and customization of your IDE to your coding agent.

Furthermore, the PhpStorm plugin for Claude Code provides Claude Code with context and instructions for using PhpStorm MCP server tools. To add the plugin’s skills and hooks to your project, go to PhpStorm’s Settings | Tools | PHP Claude Skills.

Note: PhpStorm’s MCP server is disabled by default. To enable the server and configure integration with your coding agent, go to Settings | Tools | MCP Server.

AI

Third-party agents in PhpStorm

PhpStorm is evolving as an open platform that allows you to bring the AI tools of your choice into your professional development workflows.

In addition to Junie, Claude Agent, and most recently Codex, PhpStorm now lets you work with more AI agents directly in the AI chat. You can choose from agents such as GitHub Copilot, Cursor, and many others supported through the Agent Client Protocol.

Next edit suggestions

Next edit suggestions are now available without consuming the AI quota of your JetBrains AI Pro, Ultimate, and Enterprise subscriptions. These suggestions go beyond traditional code completion for PHP. Instead of updating only what’s at your cursor, they intelligently apply related changes across the entire file, helping you keep your code consistent and up to date with minimal effort.

This natural evolution of code completion delivers a seamless Tab Tab experience that keeps you in the flow.

Junie CLI is now in Beta

Junie CLI is JetBrains’ LLM-agnostic coding agent you can use directly from the terminal, inside any IDE, in CI/CD, and on GitHub or GitLab. Junie CLI comes with:

  • Bring Your Own Key (BYOK) pricing, allowing you to use your own keys from model providers without additional charges.
  • One-click migration from other agents such as Claude Code or Codex.
  • Flexible customization through guidelines, custom agents and agent skills, commands, MCP, and more.

Read the full announcement in our blog post.

Project indexing optimization

PhpStorm now automatically detects framework-specific directories that contain frequently changing generated, cached, or user-uploaded content and excludes such directories from project indexing. 

The IDE skips excluded folders during search, parsing, and other operations. Reducing indexing overhead helps optimize the CPU usage and performance of your IDE. 

If you want to re-enable indexing for any of the automatically excluded folders, you can do so in Settings | Directories by clicking Exclude and unselecting the checkboxes next to the directories you want to be indexed.

Generics support

The new release brings a number of improvements and bug fixes for PhpStorm’s type inference engine, including: 

  • Improved type inference for callable generic types. Now the IDE can infer both the input parameter type from a callable(T) annotation and the callable template return type.

  • Improved display for nested parameterized template types. PhpStorm 2026.1 displays parameter type (Ctrl+Shift+P) and quick documentation (F1) info with multiple layers of wrapping, such as Wrapper<Wrapper<Wrapper<stdClass>>>.

More quality-of-life improvements

Debugging non-PHP files

You can now set breakpoints in non-PHP files as soon as the file name pattern is associated with the PHP file type in the IDE settings. Together with native path mapping between templates and compiled PHP files introduced in Xdebug 3.5, this feature allows you to debug source template files of any format, including niche extensions like .ezt.

Improved Go to test navigation

In PhpStorm 2026.1, we’ve improved Go to Test navigation for PHPUnit and Pest tests with the following enhancements: 

  • Navigation between PHPUnit tests that use a #[UsesClass] or #[UsesMethod] attribute and the related class/method.
  • For Pest tests, you can now navigate from the Test Runner tab to the source test nested inside Pest describe blocks. 

Convert to pipe operator quick-fix

PhpStorm now detects code elements where the PHP 8.5 pipe operator syntax can be used and suggests a quick-fix to convert such code into easier-to-read pipe operator chains.

Laravel

  • Framework support: support for Laravel 13 and new versions of Livewire and Filament. Support for the new @hasStack and @includeIsolated Blade directives.
  • New package support: Laravel Wayfinder, PHP Native, staudenmeir/laravel-cte and staudenmeir/laravel-adjacency-list packages.
  • Eloquent enhancements: advanced #[Scope] methods support, optimized and more accurate Find Usages for scope, attribute and relation methods.
  • UI and navigation: Blade view usages UI, better controller inlays, new Route Search UI, and routes to the Endpoints tool window.
  • Productivity tweaks: a new Add Application Database action. Run Artisan commands in the Terminal tool window or via PHP interpreter.
  • Laravel Idea MCP server shipped with the PhpStorm MCP server.

For the full list of updates, see Laravel Idea’s changelog.

Frontend

PhpStorm’s TypeScript support now uses the service-powered type engine (built on the TypeScript language service) by default, delivering more accurate type inference and lower CPU usage in large projects. The TypeScript support is further improved with better auto-import handling for path aliases and project references, as well as the integration of inlay hints from the TypeScript Go-based language server. JavaScript parsing now also correctly handles string-literal import / export specifiers.

Framework and styling support have been refined across the board: 

  • The IDE now highlights React’s new use memo and use no memo directives. 
  • The Vue integration uses the updated 3.1.8 version of @vue/typescript-plugin.
  • Astro settings accept JSON-based configuration for language server integration. 
  • Modern CSS color() functions and additional color spaces are supported in swatches and previews. 
  • Angular 21.x template syntax is supported.

Databases

The AI chat integration for Codex and Claude Agent now offers full, native support for your connected databases. With that, you can now query, analyze, and modify your database state using natural language right from the IDE.

The same functionality is available for external agents via an MCP server.

Data source settings can now be stored in your JetBrains Account via data source templates. Especially nifty for All Products Pack users or anyone who uses multiple instances of JetBrains IDEs, this upgrade allows you to access data source templates and settings in every JetBrains IDE with database functionality.

Productivity-enhancing features

Editor caret and selection updates

We’re continuing to modernize our IDEs, and in this update, we’ve refreshed something you interact with constantly – the editor. Smooth caret animation and updated selection behavior provide improved comfort, a cleaner look, and a more enjoyable coding experience.

Read more

Work on multiple branches at once with Git Worktrees

With the evolution of AI agents, running multiple tasks in parallel has become a major time-saver, and this is precisely where Git worktrees are extremely handy. To support cutting-edge workflows for AI-boosted software development, PhpStorm now provides first-class support for Git worktrees. Create a separate worktree for an urgent hotfix, hand off another one to an AI agent, and keep working in your main branch – all at the same time, without interruption.

Even if you don’t use agents, worktrees will save you time on branch switching, especially in big projects.

Native Wayland support

IntelliJ-based IDEs now run natively on Wayland by default. This transition provides Linux professionals with ultimate comfort through sharper HiDPI and better input handling, and it paves the way for future enhancements like Vulkan support.

While Wayland provides benefits and serves as a foundation for future improvements, we prioritize reliability: The IDE will automatically fall back to X11 in unsupported environments to keep your workflow uninterrupted. Learn more.

Terminal completion

Stop memorizing commands. Start discovering them. In-terminal completion helps you instantly explore available subcommands and parameters as you type. Whether you’re working with complex CLI tools like Git, Docker, or kubectl or using your own custom scripts, this feature intelligently suggests valid options in real time.

Code With Me sunset

As we continue to evolve our IDEs and focus on the areas that deliver the most value to developers, we’ve decided to sunset Code With Me, our collaborative coding and pair programming service. Demand for this type of functionality has declined in recent years, and we’re prioritizing more modern workflows tailored to professional software development.

As of version 2026.1, Code With Me will be unbundled from all JetBrains IDEs. Instead, it will be available on JetBrains Marketplace as a separate plugin. 2026.1 will be the last IDE version to officially support Code With Me, as we gradually sunset the service.

Read the full announcement and sunset timeline in our blog post. 

RubyMine 2026.1: AI Chat Upgrades, New Code Insight, Stable Remote Development, and More 

RubyMine 2026.1 is here! This release brings a range of improvements aimed at making Ruby and Rails development faster and more enjoyable.

You can get the new build from our website or via the free Toolbox App.

Let’s take a look at the highlights of this release.

AI

RubyMine continues to evolve as an open platform that lets you bring your preferred AI tools directly into your development workflow. With RubyMine 2026.1, working with multiple AI agents and integrating them into your IDE experience is now easier than ever.

Use more AI agents in RubyMine

In addition to Junie and Claude Agent, you can now choose more agents in the AI chat, including Codex. Additionally, Cursor and GitHub Copilot, along with dozens of external agents, are now supported via the Agent Client Protocol (ACP). With the new ACP Registry, you can discover available agents and install them in just one click.

Install From ACP Registry option in AI chat

Work with connected databases directly in the AI chat

The AI chat integration for Codex and Claude Agent now offers full, native support for your connected databases. With that, you can now query, analyze, and modify your database state using natural language right from the IDE.

The same functionality is available for external agents via MCP server.

Accessing rails project databases from AI chat using Claude Agent

Get next edit suggestions throughout your file

Next edit suggestions are now available without consuming the AI quota of your JetBrains AI Pro, Ultimate, and Enterprise subscriptions. These suggestions go beyond what is offered by traditional code completion for your programming language. Instead of updating only what’s at your cursor, they intelligently apply related changes across the entire file, helping you keep your code consistent and up to date with minimal effort.

This natural evolution of code completion delivers a seamless Tab Tab experience that keeps you in the flow. 

Enabling next edit suggestions for AI Assistant

Code insight

Try the new code insight engine (Beta)

RubyMine 2026.1 introduces a new, currently experimental, symbol-based language modeling engine.

This engine changes how RubyMine understands classes, modules, and constants (support for methods is planned for future releases), laying the groundwork for faster and more reliable code insight.

Our internal benchmarks show significant improvements.

Qualified first-element constant completion is about 40% faster, while the overall time for constant completion improved by roughly 50%. Type-matched completion for exceptions became dramatically faster – by about 95%. In addition, the performance of Find Usages improved by around 60% in large projects and by about 15% in typical cases.

Additional areas that benefit from the new engine include:

  • Rename refactoring
  • Quick Documentation, Quick Definition, and Ctrl+Hover hints
  • Structure view
  • Navigation (Go to Declaration and Go to Type Declaration)

Because the engine is still in Beta, it is disabled by default. You can enable it in Settings | Languages & Frameworks | Ruby | Code Insight.

Give it a try and share your feedback!

Enabling experimental code insight for Ruby

Remote development

Boost your productivity with Stable remote development

Remote development officially moves out of Beta and becomes Stable in RubyMine 2026.1.

You can now connect to your development environments via SSH, Dev Containers, or WSL 2, and the IDE backend will run on the remote machine while the user interface remains fast and responsive on your local device.

This setup gives you the full RubyMine experience wherever your code lives.

Remote Development window

Rails

Work seamlessly with variables passed via render

RubyMine now correctly recognizes local variables passed via render.

Variables provided through the locals: option are no longer marked as unresolved and appear in code completion.

This behavior works consistently across views, layouts, partials, and templates (ERB and HAML), providing cleaner code insight and fewer unnecessary warnings.

Recognizing variables passed via render

Detect deprecated Rails associations instantly

Keeping Rails projects modern and maintainable is now easier with improved deprecation detection.

When a Rails association is marked as deprecated (for example, has_many :posts, deprecated: true), RubyMine highlights all its usages throughout your project and shows a clear deprecation notice in the Quick Documentation popup.

This helps you identify outdated APIs early and update your code proactively.

Highlighting deprecated Rails associations

Use Rails virtual database columns

RubyMine 2026.1 adds recognition for virtual generated columns from PostgreSQL 18 (or later versions) in Rails projects.

These non-persisted columns behave just like regular attributes in the IDE. Code completion, type hints, and navigation to the column definition in schema.rb work seamlessly.

Recognizing virtual database columns in Rails

Ruby and RBS

Use endless methods with access modifiers

RubyMine now fully supports Ruby 4.0 endless methods with access modifiers. Code such as private def hello = puts "Hello" is now parsed correctly and no longer produces errors.

Supporting endless methods with access modifiers

Use more Ruby and RBS operators in completion

You can now type Ruby and RBS operators (=, !, +, *, and others) directly in the completion popup without closing it. This keeps you in the flow and helps you finish expressions faster.

Expanded range of operators in completion popup

Rename global variables safely

RubyMine now validates global variable names during renaming.

Invalid names such as $foo!@# are no longer allowed, preventing broken code and syntax errors. The IDE ensures renamed variables follow Ruby’s syntax rules, making refactoring safer and more reliable.

Alert notification about an invalid global variable name

Let RubyMine select the Ruby interpreter automatically

RubyMine 2026.1 can automatically detect the correct Ruby interpreter by analyzing configuration files such as .ruby-version or .tool-versions.

There are three scenarios:

  • Single match found: RubyMine sets the interpreter automatically so you can start coding immediately.
  • Multiple matches or no match found: RubyMine shows a notification and helps you choose the correct interpreter.
  • No configuration file found: RubyMine selects the latest installed MRI Ruby version as a safe default.

If you prefer manual configuration, you can disable this behavior in Settings | Languages & Frameworks | Ruby. Find more details in our docs.

Updated Ruby settings page with the option of automatic Ruby interpreter selection

User experience improvements

Debug failing tests faster with the diff viewer

RubyMine 2026.1 introduces a diff viewer for failed RSpec and minitest tests.

When a test fails, simply click Click to see difference in the test results to open a side-by-side comparison of expected and actual values. This makes it much easier to identify the issue and fix failing tests quickly.

Configure linting and formatting with ease

RubyMine now features a redesigned configuration for RuboCop and the standard gem, along with a new Linting and Formatting section in Settings | Tools | RuboCop.

You can choose from mutually exclusive options:

  • Default
  • Standard gem inspections
  • Standard on save
  • RuboCop server mode
  • RuboCop on save

The updated settings simplify configuration, prevent conflicts between tools, and integrate tightly with RubyMine formatting actions.

Redesigned RuboCop settings page with new Linting and Formatting section

Other

Plan ahead for the sunsetting of Code With Me

Starting with RubyMine 2026.1, Code With Me will be unbundled from JetBrains IDEs and distributed as a separate plugin on JetBrains Marketplace.

RubyMine 2026.1 will be the last IDE version to officially support Code With Me as the service is gradually sunset.

Read the full announcement and timeline in our blog post.

Stay in touch

Follow RubyMine on X to stay up to date on all the latest features.

We invite you to share your thoughts in the comments below. You can also suggest and vote for new features in our issue tracker.

Happy developing!

The RubyMine team

AI-Assisted Java Application Development with Agent Skills

Agent-assisted development is quickly becoming a common mode of software development. New techniques are emerging to help LLMs generate code that matches your preferences and standards.

One common approach is to create an AGENTS.md, CLAUDE.md, or GEMINI.md file with project details, build instructions, and coding guidelines. The AI agent loads this file into context on every request.

This has two drawbacks:

  • It consumes tokens on every request, increasing cost.
  • Loading too much context into an LLM degrades its effectiveness.

Agent Skills is a new initiative that solves both problems by managing context progressively and extending AI agent capabilities on demand.

What are Agent Skills?

Agent Skills is an open standard introduced by Anthropic to extend AI agent capabilities with specialized knowledge and workflows.

Consider a use case where you want an AI to generate presentations using your company’s slide template and design guidelines. You can package those assets (the PPT template, font files, and design rules) into a skill. The agent then uses that skill to generate slides that match your standards automatically.

A skill is a folder containing a SKILL.md file. This file includes metadata (name and description at minimum) and instructions that tell an agent how to perform a specific task. Skills can also bundle scripts, templates, and reference materials.

skill-name/
├── SKILL.md          # Required: instructions + metadata
├── scripts/          # Optional: executable code
├── references/       # Optional: documentation
└── assets/           # Optional: templates, resources

The format of a SKILL.md file is:

---
name: name-of-the-skill
description: Skill description.
license: Apache-2.0
metadata:
  author: author/org
  version: "1.0"
compatibility: Requires git, docker, jq, and access to the internet
---

Skill Content

In a SKILL.md file, name and description are required fields, and you can add optional fields like license, metadata, compatibility, etc. You can explore more about the Skill Specification here.

How do Agent Skills manage context?

At startup, agents load only the metadata (name and description) of installed skills. When you ask the agent to perform a task, it finds the relevant skill and loads only that SKILL.md into context.

This progressive loading keeps context minimal and pulls in additional information only when needed, unlike a monolithic CLAUDE.md that loads everything upfront.
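
To make progressive loading concrete, here is a minimal Python sketch of the idea. The helper names and the .claude/skills path are illustrative assumptions, not part of the Agent Skills specification; real agents handle this internally.

from pathlib import Path

# Conceptual sketch: read only the front-matter metadata at startup,
# and pull in the full SKILL.md body only when a task matches the skill.

def load_skill_metadata(skill_dir: Path) -> dict:
    """Parse just the YAML front matter (name, description, ...)."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    front_matter = text.split("---", 2)[1]
    meta = {}
    for line in front_matter.splitlines():
        if ":" in line and not line.startswith((" ", "\t")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def load_skill_body(skill_dir: Path) -> str:
    """Load the full instructions only when the skill is actually needed."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    return text.split("---", 2)[2].strip()

# Startup: a cheap metadata scan across all installed skills.
skills_dir = Path(".claude/skills")  # hypothetical install location
skills = {d.name: load_skill_metadata(d) for d in skills_dir.iterdir() if d.is_dir()}
# Later: only the matching skill's body gets loaded into context.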

What can be a skill?

Skills extend AI capabilities across a wide range: from coding guidelines for a specific library, to step-by-step workflows with reference documents and helper scripts.

For example, you can create a skill that:

  • Specifies which library APIs to use and which anti-patterns to avoid.
  • Bundles reference documentation in a references/ directory.
  • Includes helper scripts in a scripts/ directory.

Case Study: Implementing Spring Data JPA Pagination

Suppose you ask an AI agent to implement a Spring Boot REST API endpoint that returns a paginated list of Post entities along with their Comment collections.

Without guidance, the agent is likely to produce one of these common mistakes:

  • N+1 SELECT problem — lazy-loading the comments triggers a separate query per post.
  • In-memory pagination — using JOIN FETCH with pagination loads all rows into memory, then paginates in the application layer.

You can check out the sample code from the GitHub repository https://github.com/sivaprasadreddy/agent-skills-demo 

Let us see how an AI Agent might generate code when asked to implement a REST API endpoint to return paginated posts along with comments.

Without any specific guidelines or skills, the AI Agent generated the following implementation:

@RestController
@RequestMapping("/api/posts")
class PostController {
   private final PostService postService;

   PostController(PostService postService) {
       this.postService = postService;
   }

   @GetMapping
   PagedResult<PostDto> getPosts(
           @RequestParam(name = "page", defaultValue = "1") int pageNo,
           @RequestParam(name = "size", defaultValue = "10") int pageSize) {
       return postService.getPosts(pageNo, pageSize);
   }

}


@Service
@Transactional(readOnly = true)
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   public PagedResult<PostDto> getPosts(int pageNo, int pageSize) {
       Sort sort = Sort.by(Sort.Direction.ASC, "id");
       Pageable pageable = PageRequest.of(pageNo <= 0 ? 0 : pageNo - 1, pageSize, sort);
       Page<PostDto> postPage = postRepository.findAllWithComments(pageable).map(PostDto::from);
       return PagedResult.from(postPage);
   }

}

If you run the application and invoke the GET /api/posts endpoint, you will get the results, but in the logs you will find the below WARNING:

HHH000104: firstResult/maxResults specified with collection fetch; applying in memory

This essentially means that Hibernate loads all matching entities into memory and then applies pagination. This results in poor performance and can even cause OutOfMemoryError if the posts table contains a large number of rows.

A Spring Data JPA skill prevents both issues by giving the agent explicit guidelines and a working code example.

Spring Data JPA Agent Skill

Create a spring-data-jpa/SKILL.md file with the following content:

---
name: spring-data-jpa-skill
description: Implement the persistence layer using Spring Data JPA in Spring Boot applications.
---

Follow the below principles when using Spring Data JPA:

1. Disable the Open Session in View (OSIV) filter: 
spring.jpa.open-in-view=false
2. Disable in-memory pagination: 
spring.jpa.properties.hibernate.query.fail_on_pagination_over_collection_fetch=true

3. Avoid the N+1 SELECT problem: use JOIN FETCH to load associated child collections in a single query.
4. Avoid in-memory pagination: when loading a paginated list of parent entities with child collections:
	* First, load only the parent IDs using pagination
	* Then, load the full entities with their child collections using JOIN FETCH for those IDs
	* Assemble the final Page from the paginated IDs and the loaded entities


## Pagination with child collections example:

PostRepository.java

public interface PostRepository extends JpaRepository<Post, Long> {

   @Query("select p.id from Post p order by p.id")
   Page<Long> findPostIds(Pageable pageable);

   @Query("select distinct p from Post p left join fetch p.comments where p.id in :ids")
   List<Post> findAllByIdInWithComments(@Param("ids") Collection<Long> ids);
}


PostService.java

@Service
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   @Transactional(readOnly = true)
   public Page<Post> findPosts(Pageable pageable) {
       Page<Long> idsPage = postRepository.findPostIds(pageable);
       if (idsPage.isEmpty()) {
           return Page.empty(pageable);
       }
       List<Post> posts = postRepository.findAllByIdInWithComments(idsPage.getContent());
       return new PageImpl<>(posts, pageable, idsPage.getTotalElements());
   }
}

How to use Agent Skills?

Agent Skills work with Claude Code, Codex, Gemini CLI, JetBrains Junie, and other agents. Install a skill at the project level or user level depending on your preference.

  • Junie: project-level .junie/skills/, user-level ~/.junie/skills/
  • Claude Code: project-level .claude/skills/, user-level ~/.claude/skills/
  • Codex: project-level .agents/skills/, user-level ~/.agents/skills/
  • Gemini CLI: project-level .gemini/skills/ (or .agents/skills/), user-level ~/.gemini/skills/ (or ~/.agents/skills/)

To use the Spring Data JPA skill with Claude Code:

  1. Copy the spring-data-jpa/ directory into {project-root}/.claude/skills/.
  2. Ask Claude Code to implement a paginated REST API endpoint.
  3. Claude Code discovers the skill automatically and follows the guidelines.

As you can see, Claude Code automatically discovered the Spring Data JPA skill and generated the following implementation according to the guidelines defined in the skill.

@Service
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   @Transactional(readOnly = true)
   public Page<Post> findPosts(Pageable pageable) {
       Page<Long> idPage = postRepository.findPostIds(pageable);
       if (idPage.isEmpty()) {
           return Page.empty(pageable);
       }
       List<Post> posts = postRepository.findAllByIdInWithComments(idPage.getContent());
       return new PageImpl<>(posts, pageable, idPage.getTotalElements());
   }
}

With this implementation, only the post IDs for the requested page are loaded first, and the posts along with their comments are then fetched in a separate query. This fixes the in-memory pagination issue.

Using Agent Skills with Junie

You can use the JetBrains Junie agent to generate code; it automatically loads the necessary skills from the .junie/skills directory.

The Junie agent loaded the spring-data-jpa skill based on the given task and applied its guidelines. You can also observe that Junie automatically runs the relevant tests to verify that the generated code works and iterates until they pass.

In the sample repository https://github.com/sivaprasadreddy/agent-skills-demo, you can find the following branches to try out the spring-data-jpa Agent Skill:

  • main: the starting point for implementing the described use case without any skills.
  • in-memory-pagination-issue: the AI-generated implementation that suffers from the in-memory pagination issue.
  • skills: includes the spring-data-jpa skill, so you can implement the same use case with its guidance.

Summary

If the AI agent generates code with anti-patterns or ignores your team's coding standards and conventions, consider creating a skill that captures those guidelines instead of fixing issues one by one with follow-up prompts.

To explore more on Agent Skills, please refer to the following resources:

  • https://agentskills.io/
  • https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
  • https://developers.openai.com/codex/skills/
  • https://geminicli.com/docs/cli/skills/
  • https://junie.jetbrains.com/docs/agent-skills.html 

GoLand 2026.1 Is Released

GoLand 2026.1 helps you keep your Go code modern and your workflow efficient. This release introduces guided syntax updates for Go 1.26, making it easier to adopt new language improvements across your entire codebase. It also expands AI capabilities with support for additional agents and brings several productivity improvements, including support for Git worktrees and a smoother editing experience.

Let’s take a look at the key updates in this release.

Download GoLand

Keep your codebase modern with guided Go syntax updates

Keeping your code aligned with the evolution of Go helps ensure long-term maintainability and compatibility with the ecosystem. GoLand 2026.1 introduces a unified workflow that helps you discover and apply modern Go syntax across your codebase.

When your project switches to Go 1.26, GoLand scans your code and highlights constructs that can be updated. These alerts appear directly in the editor and explain what can be improved and why, making new language features visible as you work.

In this release, GoLand supports two Go 1.26 syntax updates. Our team plans to expand this functionality in upcoming releases by covering additional important changes introduced in recent Go versions.

Identify outdated syntax directly in the editor

GoLand now includes inspections that detect outdated patterns and suggest modern alternatives. In 2026.1, the IDE introduces two syntax updates based on Go 1.26:

  • Pointer creation improvements using new()
  • Type-safe error unwrapping with errors.AsType

Each inspection provides quick-fixes so you can apply improvements directly in the editor.

Update your entire codebase in one workflow

Once you apply a syntax update, you can expand it across your entire project.

GoLand provides several entry points, so you can start where it feels most natural:

  • Right after applying a quick-fix, click Analyze code for other syntax updates.
  • Open Search Everywhere by double-pressing Shift and run the Update Syntax action.
  • Open go.mod with the go 1.26 directive and click Analyze code for syntax updates.
  • Go to the Refactor menu and select Update Syntax.

GoLand collects all findings in the Problems tool window, where you can review and apply updates across the project.

Review large changes with diff previews

You can review grouped results, apply fixes to individual occurrences or entire groups, and inspect every change using a built-in diff preview before applying it.

Work more easily with cloud and infrastructure workflows

Modern development increasingly relies on containerized environments and infrastructure tools. GoLand 2026.1 introduces several improvements that help you work with these workflows directly in the IDE.

Manage Terraform Stacks more easily

GoLand now supports working with Terraform Stacks directly in the IDE.

You can explore the infrastructure structure, navigate between components, and create new deployments from the IDE interface. Code completion and improved navigation help you stay oriented in complex infrastructure configurations.

Work faster and more comfortably in everyday development

Several improvements in GoLand 2026.1 focus on reducing friction in common workflows and making the IDE more comfortable to use throughout the day.

Work on multiple branches simultaneously with Git worktrees

GoLand now provides first-class support for Git worktrees, allowing you to work with multiple branches at the same time.

You can create a separate worktree for a hotfix, assign another one to an AI agent, and continue working in your main branch without switching contexts.

Even without AI workflows, worktrees reduce branch switching overhead and help you move faster in large repositories.

Enjoy a smoother and more responsive editing experience

The editor continues to evolve with improvements designed to make everyday coding more convenient.

This release introduces smoother caret animations and updated selection behavior, resulting in a cleaner and more responsive editing experience. For more information, refer to our blog post: Editor Improvements: Smooth Caret Animation and New Selection Behavior.

Get better Linux support with native Wayland integration

GoLand now runs on Wayland by default, improving HiDPI rendering and input handling on Linux systems.

If Wayland is not supported in your environment, the IDE automatically falls back to X11 to ensure your workflow remains stable and uninterrupted. For more information, refer to our blog post: Wayland By Default in 2026.1 EAP.

Get more done with AI directly in the IDE

GoLand continues expanding its AI capabilities to give you more flexibility and control over how you use AI during development.

Choose the best AI agent for each task

In addition to Junie, Claude Agent, and most recently Codex, GoLand now lets you work with more AI agents directly in the AI chat. You can choose from agents such as GitHub Copilot, Cursor, and many others supported through the Agent Client Protocol (ACP).

With the new ACP Agent Registry, you can discover and install supported agents with a single click.

Code With Me sunset

As we continue to evolve our IDEs and focus on the areas that deliver the most value to developers, we’ve decided to sunset Code With Me, our collaborative coding and pair programming service. Demand for this type of functionality has declined in recent years, and we’re prioritizing more modern workflows tailored to professional software development.

As of version 2026.1, Code With Me will be unbundled from all JetBrains IDEs. Instead, it will be available on JetBrains Marketplace as a separate plugin. 2026.1 will be the last IDE version to officially support Code With Me, as we gradually sunset the service.

Read the full announcement and sunset timeline in our blog post.

That wraps up the highlights of GoLand 2026.1.

We hope these changes make your workflow smoother and more enjoyable.

We would love to hear your thoughts: Feel free to tag us on X, drop into the #goland-gophers Slack channel, or create a ticket in our YouTrack issue tracker.

Happy coding,

The GoLand team

I cut Claude API costs by 90% with prompt caching. Here’s what I learned before I had to shut it down.

867 Discord servers. 1,000+ active users. $10–11 every time someone played a one-hour D&D session.

I was the only engineer. There was no revenue. And that number wasn’t going down on its own.

I want to be upfront before we go any further: Scrollbook is no longer running.

I built it because I was always the Dungeon Master. My wife, my son, and I had a standing D&D night, and I wanted to actually play for once instead of running the whole session. So I built an AI dungeon master to take my seat. It worked well enough that I shared it. I did not expect anyone else to care.

They did. 867 servers and 1,000+ users later, I was looking at $10-11 every time someone played a one-hour session with no revenue, no paywall, and no plan for either. (Scrollbook is one of three production projects I break down in my case studies. The other two are live and generating revenue. The contrast is instructive.) I shut it down because the cost of operating it solo, without a monetization model that kept pace with usage, made it unsustainable. By the time I pulled the plug, prompt caching had dropped that same session to $0.50-1.50. The technical solution worked. The business math didn’t.

Both of those things are worth talking about.

This post covers the technical side in detail: what the problem was, what I changed, and the actual production code behind it. The business lesson is at the end. I’d argue it’s the more important one.

The Cost Problem

Every message to Claude sent the entire conversation context from scratch. In a D&D session, that context grows with every exchange between the player and the AI.

Before caching, each API call looked like this:

[system prompt: ~1,800 lines of D&D rules + Cipher's personality]
[campaign context: setting, NPCs, quests, locations, active encounter]
[character context: stats, equipment, spells, conditions, companions]
[party context: all active players and their characters]
[message history: every exchange in the session so far]
[current question: "can I grapple the goblin?"]

The system prompt and campaign context alone sat at 4,000–5,000 tokens, reprocessed at full price on every single message.

A one-hour D&D session averages 15–25 back-and-forth exchanges. Context grows on each call. At Sonnet pricing ($3.00/M input, $15.00/M output): $10–11 per session. Multiply that across hundreds of active servers running concurrent sessions and it stops being a line item. It becomes a ceiling. Every new user makes the situation structurally worse.
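
To see why the curve climbs so steeply, here is a small illustrative sketch: the whole context is re-sent on every exchange, so total input tokens grow roughly quadratically with session length. The numbers are placeholders, not Scrollbook's real measurements.

# Placeholder numbers, purely illustrative.
PREFIX = 5_000        # system prompt + campaign context (stable per server)
PER_EXCHANGE = 2_000  # tokens a single player/AI exchange adds to the history

def total_input_tokens(exchanges: int) -> int:
    # Every call re-sends the prefix plus the full history so far.
    return sum(PREFIX + i * PER_EXCHANGE for i in range(exchanges))

for n in (5, 15, 25):
    print(n, total_input_tokens(n))  # 45,000 -> 285,000 -> 725,000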

The Architecture

Scrollbook runs on six services:

Service Role
bot/ Discord bot — receives player commands
api/ REST API for the companion web app
shared/services/cipher_service.py Owns all Anthropic API calls
shared/services/ai_usage_tracker.py Token counting and budget enforcement
shared/services/ai_extraction_service.py PDF/content extraction via Bedrock
infrastructure/ AWS CDK — ECS Fargate, RDS, ALB

cipher_service.py is the single point of contact with the Anthropic API. Context is assembled per-request by ContextManager.build_context(), pulling campaign data, character stats, active party, quests, encounters, and NPCs from Postgres — all scoped to the Discord guild ID.

Here is the insight that unlocked the fix: the system prompt and campaign context were structurally identical on every request for a given server. The D&D rules, Cipher’s personality, the campaign world — none of it changes message-to-message. It was being sent and fully reprocessed every single time, on every message, for every server.

What Prompt Caching Actually Is

Anthropic caches the prefix of your prompt on their infrastructure for a TTL window. Subsequent requests that match that prefix byte-for-byte skip the reprocessing cost. Instead of paying full input token price, you pay roughly 10% of that on a cache hit.

A few things that matter:

Prefix, not arbitrary sections. The cache applies to the beginning of your prompt. Everything you want cached must come before everything that changes. This means prompt order is the entire game.

Cache hits vs. misses. A hit means the prefix was already in cache; you pay about 10% of the normal input token price. A miss means the prefix gets written to cache at roughly 1.25x the normal input token price — slightly more expensive than a regular call, but a one-time cost within each TTL window. After the first message in a session, you want hits almost exclusively.

The TTL is 5 minutes for the ephemeral cache type on Anthropic’s infrastructure. For active D&D sessions this is fine — messages come fast. For a server that runs one session a week, you pay write costs every time with zero read benefit. The math only works at session density.

This is a first-class API feature, not a workaround. You opt in by passing structured content blocks with a cache_control field instead of a plain string. Two lines of code. Anthropic’s infrastructure handles everything else.

One more thing worth saying clearly: this is not client-side caching. You are not storing API responses locally. You are telling Anthropic’s infrastructure which portion of your prompt is stable so it does not need to recompute it.

The Implementation

Centralizing Prompt Assembly

With six services in play, the first structural requirement was centralizing all prompt assembly into one place. The cacheable prefix must be byte-for-byte identical across every request. That cannot happen if prompts are assembled in multiple code paths and concatenated at call time. A trailing space, a newline difference, a Unicode normalization inconsistency — any of it produces a full cache miss.

All prompt assembly in Scrollbook runs through one function: cipher_service.py:_build_conversational_prompt().

Prompt Order

The ordering decision is the whole thing:

1. System prompt (D&D rules + Cipher personality)        CACHED
2. Campaign and character context (per-guild, stable)    included in cache
3. Conversation history [0 ... N-3]                      CACHED at breakpoint
4. Conversation history [N-2, N-1]                       NOT cached
5. Current question                                      NOT cached

Static content at the top. Dynamic content at the bottom. The most expensive tokens, cached. The tokens that change on every message, not cached.

The Code

Before caching, the system prompt was passed as a plain string:

# Every call: full system text + context, reprocessed at full price every time
response = self.anthropic_client.messages.create(
    model=self.model_id,
    system=full_system_text,  # plain string, no caching
    messages=messages,
)

After caching, it becomes a structured content block:

# cipher_service.py:2070-2079
if self.enable_caching:
    system_blocks = [
        {
            "type": "text",
            "text": full_system_text,
            "cache_control": {"type": "ephemeral"}  # two lines
        }
    ]
else:
    system_blocks = [{"type": "text", "text": full_system_text}]

The conversation history gets a second cache breakpoint at the third-to-last message, capturing the entire prior session:

# cipher_service.py:2084-2098
for i, msg in enumerate(conversation_history):
    content_blocks = [{"type": "text", "text": msg["content"]}]

    if self.enable_caching:
        is_last_two = i >= len(conversation_history) - 2
        # Cache breakpoint at third-to-last message
        if not is_last_two and i == len(conversation_history) - 3:
            content_blocks[0]["cache_control"] = {"type": "ephemeral"}

    messages.append({"role": msg["role"], "content": content_blocks})

# Current question is never cached
messages.append({"role": "user", "content": [{"type": "text", "text": question}]})

Two cache breakpoints: one on the system prompt, one on the conversation history. The Anthropic API limits the number of cache control markers per request, so placement matters. You want those markers positioned to maximize the ratio of cached-to-uncached tokens on every call — that ratio is what drives your actual savings.

The API call itself barely changes. The system parameter is now a content block array instead of a string:

# cipher_service.py:2221-2228
response = self.anthropic_client.messages.create(
    model=self.model_id,
    max_tokens=self.max_tokens,
    temperature=self.temperature,
    system=system_blocks,  # content block array instead of plain string
    messages=msgs,
    tools=tools_to_use,
)

The Multi-Tenant Problem

867 servers means 867 sets of campaign state — different characters, different HP totals, different active encounters, different party compositions. Keeping per-guild context out of a polluted shared prefix requires a specific architectural decision.

In Scrollbook, guild-specific data lives inside the cached block:

# cipher_service.py:2066-2068
context_section = context.to_prompt_section()
full_system_text = f"{system_prompt_text}nn{context_section}"
# This full_system_text then receives the cache_control block

This works because campaign context is stable within a session. Cipher updates game state via tool calls when something changes — it does not receive externally updated context as new input mid-session. For the duration of an active session, the system prompt plus campaign context is genuinely identical across every message for that guild. Each guild gets its own cached prefix. No cross-contamination.

If your situation is different — if state changes externally between messages — that dynamic content needs to live below the cache breakpoint, not inside it.

The Results

A one-hour session that cost $10–11 dropped to $0.50–1.50.

To verify you are actually hitting the cache, read the usage object on the response. Do not assume. Log it explicitly:

# cipher_service.py:2268-2288
if self.enable_caching and hasattr(response, "usage"):
    usage = response.usage
    input_tokens = getattr(usage, "input_tokens", 0)
    cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0)
    cache_creation = getattr(usage, "cache_creation_input_tokens", 0)

    if cache_read_tokens > 0:
        savings_pct = (
            cache_read_tokens / (input_tokens + cache_read_tokens)
        ) * 100
        logger.info(
            f"Cache HIT: {cache_read_tokens} tokens read from cache "
            f"({savings_pct:.1f}% savings), {input_tokens} new tokens"
        )
    elif cache_creation > 0:
        logger.info(f"Cache MISS: {cache_creation} tokens written to cache")

Three fields to understand:

  • input_tokens — tokens billed at full price this call
  • cache_creation_input_tokens — tokens written to cache, billed at approximately 1.25x the base input token price (one-time cost per TTL window)
  • cache_read_input_tokens — tokens read from cache, billed at approximately 10% of normal (this is where the 90% savings comes from)

The feature flag that controlled it all:

# shared/config/settings.py:86-98
anthropic_enable_prompt_caching: bool = Field(
    default=True, description="Enable Anthropic prompt caching (90% cost savings)"
)

# Bedrock fallback has no equivalent — hardcoded off
bedrock_enable_prompt_caching: bool = Field(
    default=False, description="Enable prompt caching (not supported on AWS Bedrock)"
)

A note on Bedrock: At the time Scrollbook was built, Bedrock did not support prompt caching. That gap made it a non-starter as the primary provider and locked the architecture to the direct Anthropic API. Bedrock has since caught up — prompt caching went GA in April 2025, with 1-hour TTL support added in January 2026. If you are on Bedrock today, the same technique applies.

When optimization becomes load-bearing infrastructure, provider lock-in follows. That was true when I built this. It is less true now.

Gotchas That Will Kill Your Cache Hit Rate

Prompt order is everything. If you accidentally flip the ordering — campaign context before system prompt, for example — every call is a full miss. The cache matches from the beginning of the prompt in sequence. There is no partial matching.

Dynamic content in the cached prefix. This is the hardest mistake to catch. Timestamps, counters, random values, user-specific data — anything that changes per-message, if it bleeds into the section you are trying to cache, every call is a miss. In Scrollbook, character HP and active conditions are inside the cached block intentionally, because Cipher controls those updates via tool calls. If your state changes externally, that content belongs below the breakpoint.

The 5-minute TTL cliff. Servers with long gaps between messages cold-start on every session. Write costs get paid repeatedly with zero read benefit. The math works at session density. For sparse traffic, run the calculation before assuming caching helps.
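
As a rough sanity check, the sketch below uses the approximate multipliers quoted earlier (about 1.25x for a cache write, about 10% for a read). Exact prices vary by model and cache type, so treat this as an estimate of shape, not billing.

def caching_worth_it(calls_per_ttl_window: int) -> bool:
    # Cost of the stable prefix, in units of "one uncached send":
    # without caching you pay 1.0 per call; with caching you pay ~1.25 once
    # (the write) and ~0.10 per subsequent call within the TTL window.
    uncached = calls_per_ttl_window * 1.00
    cached = 1.25 + 0.10 * (calls_per_ttl_window - 1)
    return cached < uncached

for n in (1, 2, 5, 20):
    print(n, caching_worth_it(n))  # False for 1 call; True from the second call onward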

Whitespace and encoding. The prefix match is byte-level. A trailing space, a newline inconsistency, a Unicode normalization difference — any of it is a miss. Prompt assembly must run through a single code path. If you are concatenating in multiple places, you will have inconsistency you cannot see.

Don’t assume, verify. The logging block above takes ten minutes to add. Add it. The usage object will tell you immediately whether your cache hit rate matches your expectations. Ship it before you ship the feature.

Why I Still Had to Shut It Down

The honest math: 90% off still leaves 10% of a cost that grows with usage.

At $0.50–1.50 per session across 867 servers with no subscription revenue, the situation improved dramatically and remained unsustainable. I had bought runway. I had not fixed the underlying problem.

There was no paywall. No subscription tier. No mechanism for Scrollbook to generate revenue as usage scaled. Every new server was a new cost center with nothing offsetting it. Prompt caching made the slope of that curve shallower. It did not change the direction.

Beyond the API costs: solo maintenance at that user count meant incident response, server reliability, and the full weight of being the only person accountable to 867 active communities. That is not something you can optimize your way out of.

What I would do differently: charge earlier. I know that is a strange thing to say about something I built so my family could play D&D together. But the moment it left that context and became someone else’s tool, it became a product. I just did not treat it like one. Even a small subscription changes the entire math and the entire psychology of the product.

I built the technical foundation first, optimized costs second, and never got to monetization. The right order is the reverse: figure out how this sustains itself, then build, then optimize. I applied that lesson to the next two products I shipped. ReptiDex launched with a three-tier subscription model on day one and hit 50 paid subscribers in 9 days. Geckistry collects payment at checkout. Both are still running.

What to Take From This

Prompt caching is a real, production-grade optimization. The cache_control field is two lines of code. A 90% reduction in inference cost is achievable if your prompt has a large, stable prefix and your traffic density is high enough for cache reads to consistently outpace cache writes.

If you are building on Claude at any meaningful scale, look at your prompt structure. If you are sending the same system prompt on every request and that prompt is long, you are paying for reprocessing you do not need.

But the bigger lesson is not technical. If you are building an AI product solo, get to monetization before you get to optimization. The optimization I built here was real and it worked. The product did not survive anyway — not because the code was wrong, but because I treated cost reduction as a substitute for a business model.

It is not.

I run Built By Dusty, a software studio that builds custom apps and sales platforms for animal breeders and small businesses. The AI cost optimization techniques from Scrollbook now power features in the breeding software I deliver to clients. If you’re building on Claude at scale, or you’re a founder with a product that has real infrastructure costs to manage, I’d like to hear from you.

All code references in this article are from the actual Scrollbook production codebase. The codebase is private, but every snippet shown here ran in production.

How is DNS resolved?

User making a request:
When a user searches for something using a domain name, the browser needs to know the domain's IP address to establish communication, so it resolves the name via DNS.

How does it fetch the IP through DNS?

Before starting, let's be clear about what DNS is: essentially a directory that maps domain names to IP addresses.

Let’s say we are searching for “WIKIPEDIA”.

First, the machine checks its own caches (the browser and the operating system), asking, “Do you remember the IP of Wikipedia?” If the answer isn't there, the request is forwarded to the router/modem. If that still doesn't know, the query is sent to the resolver (typically run by the Internet Service Provider).

If the resolver also doesn’t have it cached, it queries the Root Name Server. From there, it is directed to the appropriate Top-Level Domain (TLD) server like .com, .in, or .org. Then it reaches the authoritative name server, where it finds Wikipedia’s IP from the zone file.

Finally, the IP address is returned to the user.
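
To see the end of that chain in practice, the minimal Python sketch below asks the operating system's resolver for Wikipedia's addresses. It exercises the local-cache-then-resolver path described above rather than walking the root and TLD servers yourself.

import socket

# Ask the OS resolver (which checks local caches first, then the configured
# DNS server) for the IP addresses behind a domain name. The same address may
# appear more than once because results are listed per socket type.
for family, _type, _proto, _canonname, sockaddr in socket.getaddrinfo("www.wikipedia.org", 443):
    print(family.name, sockaddr[0])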

I Built a tool to give AI coding agents persistent memory and a way smaller token footprint

I've been building with AI coding agents for a while now (Claude Code, Cursor, Antigravity), and two things kept annoying me enough that I finally built something to fix them.

The two problems

Problem 1: Your agent reads a 1000-line file and burns 8000 tokens doing it.

That’s before it’s done anything useful. Large codebases eat context fast, and once the window fills up, you’re either compressing (lossy) or starting over. Neither is great.

Problem 2: Every new session, your agent starts from zero.

It doesn’t remember that the API rate limit is 100 req/min. It doesn’t remember the weird edge case in the auth module you spent two hours debugging last week. It doesn’t remember anything. You either re-explain everything, or watch it rediscover the same gotchas.

These aren’t niche complaints — if you’re using AI agents to work on real codebases, you’ve hit both of these.

What I built

agora-code — persistent memory and context reduction for AI coding agents. Works with Claude Code, Cursor, and Gemini CLI. Survives context resets, new conversations, and agent restarts.

It’s early. It works. I want people to try it.

How it handles token bloat

Instead of letting the agent read raw source files, agora-code intercepts every file read and serves an AST summary instead.

Real example: summarizer.py is 885 lines. Raw read = 8,436 tokens. Summarized = 542 tokens. That’s a 93.6% reduction — and the agent still gets all the signal: class names, function signatures, docstrings, line numbers.

It works across languages too:

  • Python (stdlib AST): classes, functions, signatures, docstrings
  • JS, TS, Go, Rust, Java + 160 more (tree-sitter): the same, with exact line numbers and parameter types
  • JSON / YAML (structure parser): top-level keys + shape
  • Markdown (heading extractor): headings + opening paragraph

Summaries are cached in SQLite, so re-reads on the same branch are instant.
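
To make the Python row above concrete, here is a stripped-down sketch of the idea using the standard library's ast module. It illustrates the technique only; it is not agora-code's actual implementation.

import sys
from pathlib import Path
import ast

def summarize(path: str) -> str:
    """Emit class/function names, line numbers, and first docstring lines
    instead of the raw source (a sketch of the idea, not the real tool)."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))
    out = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            doc = (ast.get_docstring(node) or "").splitlines()
            out.append(f"L{node.lineno}: {kind} {node.name}  {doc[0] if doc else ''}")
    return "\n".join(out)

if __name__ == "__main__":
    print(summarize(sys.argv[1]))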

How it handles memory loss

When a session ends, agora-code parses the transcript and extracts a structured checkpoint: what was the goal, what changed, what non-obvious things did you find, what’s next.

At the start of the next session, the relevant parts are injected automatically — last checkpoint, top learnings from recent commits on the branch, git state, symbol index for dirty files.

You can also manually store findings:

agora-code learn "POST /users rejects + in emails" --tags email,validation
agora-code learn "Rate limit is 100 req/min" --confidence confirmed

And recall them later (keyword search by default, semantic search if you wire up embeddings):

agora-code recall "email validation"
agora-code recall "rate limit"

Storage is three layers: an active session file (project-local, gitignored), a global SQLite DB scoped per project via git remote URL, and search (FTS5/BM25 always on, optional vector search).
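
For the keyword layer, a minimal sketch of how learn/recall could sit on top of SQLite FTS5 might look like the following; the schema and file name are illustrative, not the tool's real ones. FTS5 must be available in your SQLite build, which most modern Python distributions include.

import sqlite3

# Illustrative FTS5-backed learn/recall (hypothetical schema).
db = sqlite3.connect("learnings.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS learnings USING fts5(text, tags)")
db.execute("INSERT INTO learnings (text, tags) VALUES (?, ?)",
           ("Rate limit is 100 req/min", "api,limits"))
db.commit()

# BM25-ranked keyword recall, roughly what a recall command would do.
for (text,) in db.execute(
    "SELECT text FROM learnings WHERE learnings MATCH ? ORDER BY rank",
    ("rate limit",),
):
    print(text)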

What happens automatically (Claude Code)

Once hooks are installed, you don’t have to think about most of this:

  • Start a session → injects the last checkpoint + relevant learnings
  • Submit a prompt → recalls relevant past findings, sets the session goal
  • Read a file > 100 lines → summarizes via AST and serves the summary instead
  • Edit a file → tracks the diff, re-indexes symbols
  • Run git commit → derives learnings from the commit
  • Context window compresses → checkpoints before, re-injects after
  • End a session → parses the transcript into a structured checkpoint in the DB

Getting started

pip install git+https://github.com/thebnbrkr/agora-code.git

Then in your project:

cd your-project
agora-code install-hooks --claude-code

For Cursor and Gemini CLI, you copy a config directory into your project root — full instructions in the README.

At the start of every Claude Code session, run /agora-code to load the skill. That’s the bit that tells the agent when to summarize, when to inject context, when to save progress.

It’s early

APIs may change. Things might break. I’m actively working on it — semantic search is in progress, automated hook setup for Cursor and Gemini is on the roadmap.

If you try it and hit something weird, open an issue. If you want to add hook support for a different editor, the pattern is consistent across .claude/hooks/ and .cursor/hooks/ — PRs welcome.

GitHub: https://github.com/thebnbrkr/agora-code

Screenshot: https://imgur.com/a/APaiNnl

Would love to hear if this solves the same pain points for others, or if you’re handling token bloat / memory loss differently. Drop a comment.

Filter Assignments

DB- TASK 2

Bonus Q/A

  1. Find all movies where the special features are not listed (i.e., special_features is NULL).

cmd:
SELECT title FROM film WHERE special_features IS NULL;

sample op:

title

Academy Dinosaur
Ace Goldfinger
Adaptation Holes
Affair Prejudice
African Egg

2) Find all movies where the rental duration is more than 7 days.

cmd:
SELECT title, rental_duration
FROM film
WHERE rental_duration > 7;

sample op:
title | rental_duration
---------------------+-----------------
Alamo Videotape | 8
Brotherhood Blanket | 9
Chicago North | 10
Dragon Squad | 8

3) Find all movies that have a rental rate of $4.99 and a replacement cost of more than $20.

cmd:
SELECT title, rental_rate, replacement_cost FROM film WHERE rental_rate = 4.99 AND replacement_cost > 20;

sample op:
title | rental_rate | replacement_cost
--------------------+-------------+------------------
Ace Goldfinger | 4.99 | 22.99
Airport Pollock | 4.99 | 24.99
Bright Encounters | 4.99 | 21.99

4) Find all movies that have a rental rate of $0.99 or a rating of ‘PG-13’.

cmd:
SELECT title, rental_rate, rating FROM film WHERE rental_rate = 0.99 OR rating = 'PG-13';

sample op:
title | rental_rate | rating
-------------------+-------------+--------
Academy Dinosaur | 0.99 | PG
Alien Center | 2.99 | PG-13
Angels Life | 0.99 | PG-13

5) Retrieve the first 5 rows of movies sorted alphabetically by title.

cmd:
SELECT title FROM film ORDER BY title ASC LIMIT 5;

sample op:

title

Academy Dinosaur
Ace Goldfinger
Adaptation Holes
Affair Prejudice
African Egg

6) Skip the first 10 rows and fetch the next 3 movies with the highest replacement cost.

cmd:
SELECT title, replacement_cost
FROM film
ORDER BY replacement_cost DESC
LIMIT 3 OFFSET 10;

sample op:
title | replacement_cost
-------------------+------------------
Anthem Luke | 24.99
Apollo Teen | 24.99
Arabia Dogma | 24.99

7) Find all movies where the rating is either ‘G’, ‘PG’, or ‘PG-13’.
cmd:
SELECT title, rating FROM film WHERE rating IN ('G', 'PG', 'PG-13');

sample op:
title | rating
-------------------+--------
Academy Dinosaur | PG
Ace Goldfinger | G
Alien Center | PG-13

8) Find all movies with a rental rate between $2 and $4.

cmd:
SELECT title, rental_rate FROM film WHERE rental_rate BETWEEN 2 AND 4;

sample op:
title | rental_rate
-------------------+-------------
Adaptation Holes | 2.99
Alien Center | 2.99
Apollo Teen | 3.99

9) Find all movies with titles that start with ‘The’.

cmd:
SELECT title FROM film WHERE title LIKE 'The%';

sample op:

title

The Matrix
The Pianist
The Others
The Truman Show

10) Find the first 10 movies with a rental rate of $2.99 or $4.99, a rating of ‘R’, and a title containing the word “Love”.

cmd:
SELECT title, rental_rate, rating
FROM film
WHERE rental_rate IN (2.99, 4.99)
AND rating = 'R'
AND title LIKE '%Love%'
LIMIT 10;

sample op:
title | rental_rate | rating
-----------------+-------------+--------
Crazy Love | 2.99 | R
Dangerous Love | 4.99 | R
Endless Love | 2.99 | R

11) Find all movies where the title contains the % symbol.

cmd:
SELECT title FROM film WHERE title LIKE '%\%%' ESCAPE '\';

sample op:

title

100% Love
50% Chance

12) Find all movies where the title contains an underscore (_).

cmd:
SELECT title FROM film WHERE title LIKE '%\_%' ESCAPE '\';

sample op:

title

Mission_Impossible
Fast_Furious

13) Find all movies where the title starts with “A” or “B” and ends with “s”.

cmd:
SELECT title FROM film WHERE (title LIKE 'A%' OR title LIKE 'B%') AND title LIKE '%s';

sample op:

title

Angels Life
Backwards Towns
Brothers Dreams

14) Find all movies where the title contains “Man”, “Men”, or “Woman”.

cmd:
SELECT title FROM film WHERE title LIKE '%Man%' OR title LIKE '%Men%' OR title LIKE '%Woman%';

sample op:

title

Spider Man
X Men United
Wonder Woman

15) Find all movies with titles that contain digits (e.g., “007”, “2”, “300”).

cmd:
SELECT title FROM film WHERE title ~ '[0-9]';

sample op:

title

007 Bond
300 Spartans
2 Fast 2 Furious

16) Find all movies with titles containing a backslash (\).

cmd:
SELECT title FROM film WHERE title LIKE '%\\%';

sample op:

title

Escape Reality
Path Finder

17) Find all movies where the title contains the words “Love” or “Hate”.

cmd:
SELECT title FROM film WHERE title LIKE '%Love%' OR title LIKE '%Hate%';

sample op:

title

Crazy Love
Endless Love
Hate Story
Love Actually

18) Find the first 5 movies with titles that end with “er”, “or”, or “ar”.

cmd:
SELECT title
FROM film
WHERE title LIKE '%er'
OR title LIKE '%or'
OR title LIKE '%ar'
LIMIT 5;

sample op:

title

Joker
Creator
Avatar
Doctor
Warrior