
We Gave LLMs 150 Tools: Here’s What Broke.

There’s a hypothesis that most people building AI agents have encountered but few have measured: the more tools you give an LLM, the worse it gets at picking the right one.

It’s intuitive. Connect a few MCP servers to your agent, and suddenly it’s choosing from 60, 80, 100+ tools. GitHub tools, GitLab tools, Kubernetes, Slack, Jira, PagerDuty, Terraform, Grafana, all loaded into the context window, all the time. The model has to read every tool definition, understand the distinctions between them, and pick the right one. That’s a lot of signal to sift through.

But intuition isn’t data. So we built Boundary, an open-source framework for finding where LLM context breaks, and ran the numbers.

The setup

We assembled 150 tool definitions based on real schemas from production agent systems across 16 services: GitHub, GitLab, Jira, Confluence, Kubernetes, AWS, Datadog, Slack, PagerDuty, Okta, Snyk, Grafana, Terraform Cloud, Docker, Linear, and Notion. The tools are synthetic (no-op for benchmarking) but the schemas, parameter structures, and descriptions mirror what you’d find in a production MCP environment.

We tested six models across three providers:

  • Claude Sonnet 4.6 and Claude Haiku 4.5 (Anthropic)
  • GPT-4o and GPT-5.4 Mini (OpenAI)
  • Grok 4 and Grok 4.1 Fast Reasoning (xAI)

Each model received 60 prompts (both direct requests and ambiguous ones) at five toolset sizes: 25, 50, 75, 100, and 150 tools. At each size, the available tools were randomly selected but always included the correct one. The question: does the model pick the right tool?
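The sampling step can be sketched as follows. This is a minimal illustration under stated assumptions, not Boundary's actual code: `sample_toolset` and the placeholder catalog names are hypothetical, and the only rule taken from the text is that each random subset always contains the correct tool.

```python
import random


def sample_toolset(all_tools, correct_tool, size, rng):
    """Return `size` tool names drawn at random, always including `correct_tool`."""
    if correct_tool not in all_tools:
        raise ValueError(f"unknown tool: {correct_tool}")
    if not 1 <= size <= len(all_tools):
        raise ValueError("size must be between 1 and the catalog size")
    distractors = [t for t in all_tools if t != correct_tool]
    subset = rng.sample(distractors, size - 1) + [correct_tool]
    rng.shuffle(subset)  # avoid always placing the correct tool last
    return subset


# Hypothetical catalog: 150 placeholder names standing in for the real schemas.
catalog = [f"tool_{i:03d}" for i in range(150)]
rng = random.Random(0)  # fixed seed so a run can be reproduced

for size in (25, 50, 75, 100, 150):
    subset = sample_toolset(catalog, "tool_042", size, rng)
    print(size, len(subset), "tool_042" in subset)
```

Seeding the generator is a design choice worth copying in any re-run: it makes a surprising result at one toolset size reproducible instead of a one-off draw.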

The results

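Five of the six models scored lower at their largest completed toolset than at 25 tools. The exception, Grok 4, dipped to 71.7% at 100 tools before returning to 80.0% at 150. Two models didn’t finish at all.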

| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
| --- | --- | --- | --- | --- | --- |
| Grok 4.1 Fast | 86.7% | 83.3% | 80.0% | 83.3% | 76.7% |
| GPT-5.4 Mini | 85.0% | 85.0% | 80.0% | 83.3% | failed |
| GPT-4o | 81.7% | 78.3% | 73.3% | 76.7% | failed |
| Claude Haiku 4.5 | 81.7% | 80.0% | 78.3% | 80.0% | 76.7% |
| Grok 4 | 80.0% | 78.3% | 80.0% | 71.7% | 80.0% |
| Claude Sonnet 4.6 | 78.3% | 73.3% | 73.3% | 76.7% | 75.0% |

Accuracy vs toolset size across 6 LLMs

GPT-5.4 Mini was the most surprising result. At 85% accuracy through 50 tools, 92% on ambiguous prompts, sub-1-second latency, and $0.002 per call, it was arguably the best overall performer for small-to-medium toolsets. Then it hit the same 128-tool wall as GPT-4o and failed completely at 150.

Grok 4.1 Fast Reasoning was the only model that combined top-tier accuracy with the ability to handle 150 tools. It slid from 86.7% at 25 tools to 76.7% at 150, with a brief rebound to 83.3% at 100, but it never broke.

Both OpenAI models failed at 150 tools. OpenAI’s API has a hard limit of 128 tools per request. This isn’t a degradation curve. It’s a wall. If your agent connects enough MCP servers to exceed 128 tools, no OpenAI model works.
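A guard like the one below can fail fast before a request is ever sent. It is a sketch that only counts tool definitions: the constant reflects the 128-tool limit reported above, and `check_tool_budget` is a hypothetical helper, not part of any SDK.

```python
OPENAI_MAX_TOOLS = 128  # per-request limit as reported in this article


def check_tool_budget(tool_names, limit=OPENAI_MAX_TOOLS):
    """Raise before sending if the request would exceed the tool limit."""
    count = len(tool_names)
    if count > limit:
        raise ValueError(f"{count} tools exceeds the {limit}-tool request limit")
    return count


print(check_tool_budget([f"tool_{i}" for i in range(100)]))  # 100

try:
    check_tool_budget([f"tool_{i}" for i in range(150)])
except ValueError as err:
    print(err)  # 150 tools exceeds the 128-tool request limit
```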

Claude Sonnet 4.6, the most expensive model in the test ($0.028/call), was the least accurate at 25 tools and never recovered. Claude Haiku outperformed it at every size at roughly a third of the cost.

Cross-service confusion scales with tools

Cross-service confusion, where a model picks a tool from the wrong service entirely, was the most dangerous failure mode.

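| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 0 | 0 | 1 | 2 | 4 |
| Grok 4.1 Fast | 0 | 0 | 0 | 2 | 3 |
| Claude Sonnet 4.6 | 0 | 1 | 2 | 3 | 2 |
| Grok 4 | 2 | 0 | 2 | 4 | 1 |
| GPT-4o | 0 | 0 | 1 | 2 | n/a |
| GPT-5.4 Mini | 0 | 0 | 2 | 1 | n/a |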

Grok 4 had cross-service errors even at 25 tools. Claude Haiku was clean through 50 tools but escalated to 4 errors at 150, the worst of any model at that size.

The most common cross-service confusions across all models:

  • Datadog vs Grafana: “Check the monitoring alerts” consistently routed to the wrong observability platform
  • Notion vs Confluence: “Search for documentation” split between the two
  • Linear vs Jira: “Add a comment to the tracking issue” picked the wrong project tracker
  • GitHub vs GitLab: “Show me the open issues” confused the two at higher tool counts
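One way to separate cross-service misses from same-service misses is to compare service prefixes. The sketch below assumes the text before the first underscore names the service, as in `terraform_create_run`; that convention is an assumption, and `datadog_list_monitors` and `grafana_list_alerts` are hypothetical tool names used only for illustration.

```python
def service_of(tool_name):
    """Assumption: the text before the first underscore names the service."""
    return tool_name.split("_", 1)[0]


def classify(expected, chosen):
    """Label a single tool choice against the expected tool."""
    if chosen == expected:
        return "correct"
    if service_of(chosen) == service_of(expected):
        return "same-service miss"
    return "cross-service miss"


print(classify("terraform_create_run", "terraform_list_workspaces"))  # same-service miss
print(classify("datadog_list_monitors", "grafana_list_alerts"))       # cross-service miss
```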

Direct vs. ambiguous prompts

A “direct” prompt names the service: “List all Terraform Cloud workspaces.” An “ambiguous” prompt doesn’t: “Add a comment saying ‘Resolved’ to the tracking issue.”

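| Model | 25 tools (ambiguous) | 50 tools (ambiguous) | 75 tools (ambiguous) | 100 tools (ambiguous) | 150 tools (ambiguous) |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 Mini | 92% | 92% | 67% | 92% | n/a |
| Grok 4.1 Fast | 83% | 83% | 83% | 75% | 67% |
| Claude Sonnet 4.6 | 83% | 75% | 75% | 83% | 75% |
| GPT-4o | 83% | 83% | 67% | 58% | n/a |
| Claude Haiku 4.5 | 75% | 75% | 83% | 67% | 67% |
| Grok 4 | 67% | 75% | 67% | 50% | 67% |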

GPT-5.4 Mini led on ambiguous prompts, scoring 92% at 25, 50, and 100 tools, with one dip to 67% at 75. At 100 tools it was ahead of every other model, while GPT-4o collapsed to 58% and Grok 4 hit 50%, a coin flip.

Claude Sonnet was the most stable, staying between 75% and 83% regardless of toolset size. Consistent, but never great.

Where models get confused

The errors tell a story. Some patterns appeared across all six models:

Terraform is hard. All models consistently confused terraform_create_run with terraform_list_workspaces, and terraform_lock_workspace with terraform_get_workspace. The tool names are semantically close, and the models default to “list” or “get” operations when the toolset is crowded.

Snyk is a trap. snyk_get_remediation, snyk_list_container_projects, and snyk_list_projects all got misrouted to snyk_list_organizations. When Snyk tools are buried among 100+ others, the models default to the most generic-sounding option.

Confluence updates fail. All models picked confluence_search when asked to update a page. The prompt said “Update the runbook page”, but with 75+ tools in context, the model reached for search instead of the update operation.

Monitoring platform confusion. Datadog and Grafana both have alerting, dashboards, and metrics tools. The prompt “Check the monitoring alerts for the API server” got routed to Grafana instead of Datadog by every model at some toolset size. Adding two similar services to the toolset creates permanent ambiguity.

The latency story

Accuracy isn’t the only cost.

| Model | 25 tools | 50 tools | 75 tools | 100 tools | 150 tools |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 Mini | 739ms | 754ms | 849ms | 976ms | n/a |
| GPT-4o | 1,170ms | 4,035ms | 6,213ms | 7,657ms | n/a |
| Claude Haiku 4.5 | 2,463ms | 6,157ms | 8,765ms | 11,473ms | 16,749ms |
| Claude Sonnet 4.6 | 4,728ms | 10,308ms | 14,579ms | 19,120ms | 27,935ms |
| Grok 4.1 Fast | 6,448ms | 7,042ms | 6,930ms | 7,349ms | 7,533ms |
| Grok 4 | 7,706ms | 7,945ms | 8,133ms | 8,418ms | 9,552ms |

Latency vs toolset size

GPT-5.4 Mini was the latency champion: sub-1-second at every toolset size it completed. The Anthropic models scaled linearly, with Sonnet reaching 28 seconds at 150 tools. The xAI models barely changed, staying in the 6-10 second range regardless of tool count.
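To put the scaling in relative terms, the sketch below divides each model's latency at its largest completed toolset by its latency at 25 tools. The figures are copied from the latency table; the `growth` helper is just illustrative arithmetic.

```python
# Latency in milliseconds at 25 tools and at the largest toolset each model completed.
latency_ms = {
    "GPT-5.4 Mini": {25: 739, 100: 976},
    "GPT-4o": {25: 1170, 100: 7657},
    "Claude Haiku 4.5": {25: 2463, 150: 16749},
    "Claude Sonnet 4.6": {25: 4728, 150: 27935},
    "Grok 4.1 Fast": {25: 6448, 150: 7533},
    "Grok 4": {25: 7706, 150: 9552},
}


def growth(points):
    """Return (largest toolset size, latency there divided by latency at the smallest size)."""
    smallest, largest = min(points), max(points)
    return largest, points[largest] / points[smallest]


for model, points in latency_ms.items():
    size, ratio = growth(points)
    print(f"{model}: {ratio:.1f}x slower at {size} tools than at 25")
```

On these numbers the multiplier is about 6.5x for GPT-4o, 6.8x for Claude Haiku 4.5, and 5.9x for Claude Sonnet 4.6, against roughly 1.2x to 1.3x for GPT-5.4 Mini and both xAI models.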

What this means

The pattern holds across six models from three providers: more tools means worse accuracy, and for five of the six the degradation starts between 25 and 50 tools.

The implications for anyone building agents with MCP:

  1. Don’t load everything. If your agent has access to 10+ services, that’s easily 80-150 tools. Loading them all upfront is a measurable tax on accuracy that sets in as soon as you go past 25 tools.

  2. OpenAI has a hard wall at 128 tools. Both GPT-4o and GPT-5.4 Mini failed at 150. This isn’t a model quality issue. It’s a platform constraint. If your agent might exceed 128 tools, OpenAI models are not an option.

  3. Ambiguous prompts are the danger zone. Grok 4 hit 50% accuracy on ambiguous prompts at 100 tools. GPT-4o dropped to 58%. When users don’t name the service explicitly, the model has to disambiguate, and a larger toolset makes that markedly harder.

  4. Similar services compound the problem. Datadog and Grafana. Notion and Confluence. Linear and Jira. GitHub and GitLab. Every pair of similar services in the toolset creates a permanent source of confusion that scales with tool count.

  5. Latency compounds. Even if accuracy were flat, the latency cost matters. Claude Sonnet at 28 seconds per call is unusable for interactive workloads. GPT-5.4 Mini at sub-1-second is a different product entirely.

  6. Price does not predict performance. Claude Sonnet 4.6 costs 28x more per call than Grok 4.1 Fast and is less accurate. Claude Haiku outperforms Claude Sonnet at 3x lower cost. The most expensive model lost.

The cost equation

What you pay per call versus what you get in accuracy.

| Model | Total cost | Calls | Cost/call | Best accuracy | Worst accuracy |
| --- | --- | --- | --- | --- | --- |
| Grok 4.1 Fast | $0.31 | 300 | $0.0010 | 86.7% (25 tools) | 76.7% (150 tools) |
| GPT-5.4 Mini | $0.50 | 240* | $0.0021 | 85.0% (25 tools) | failed (150 tools) |
| GPT-4o | $1.57 | 240* | $0.0065 | 81.7% (25 tools) | failed (150 tools) |
| Claude Haiku 4.5 | $2.83 | 300 | $0.0094 | 81.7% (25 tools) | 76.7% (150 tools) |
| Grok 4 | $3.85 | 300 | $0.013 | 80.0% (25 tools) | 71.7% (100 tools) |
| Claude Sonnet 4.6 | $8.51 | 300 | $0.028 | 78.3% (25 tools) | 73.3% (50 tools) |

*OpenAI models completed 240 of 300 calls. All calls at 150 tools failed due to the 128-tool API limit.

Cost vs accuracy tradeoff

The two cheapest models (Grok 4.1 Fast at $0.001/call and GPT-5.4 Mini at $0.002/call) were also the two most accurate. The most expensive model (Claude Sonnet at $0.028/call) was the least accurate. The correlation between price and tool-calling performance is not just weak. It’s inverted.
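The per-call figures follow directly from the totals and call counts in the table. The sketch below reproduces that arithmetic; the numbers are copied from the table and nothing else is assumed.

```python
# model: (total cost in USD, completed calls), copied from the cost table
runs = {
    "Grok 4.1 Fast": (0.31, 300),
    "GPT-5.4 Mini": (0.50, 240),
    "GPT-4o": (1.57, 240),
    "Claude Haiku 4.5": (2.83, 300),
    "Grok 4": (3.85, 300),
    "Claude Sonnet 4.6": (8.51, 300),
}

cost_per_call = {model: total / calls for model, (total, calls) in runs.items()}
total_spend = sum(total for total, _ in runs.values())

for model, cost in cost_per_call.items():
    print(f"{model}: ${cost:.4f}/call")
print(f"whole run: ${total_spend:.2f}")
```

Computed from the unrounded totals, Claude Sonnet 4.6 comes out at roughly 27x the per-call cost of Grok 4.1 Fast; the 28x figure comes from dividing the rounded per-call prices.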

This is exactly the kind of tradeoff Boundary is designed to surface. Without benchmark data, you’d likely pick Claude Sonnet or GPT-4o. The data says they’re among the worst choices for tool-calling workloads. A team running fewer than 128 tools should seriously consider GPT-5.4 Mini for its combination of accuracy, speed, and cost. A team that might exceed 128 needs Grok 4.1 Fast or an Anthropic model.

Running these benchmarks costs almost nothing. This entire run across six models cost about $17.57. That’s less than a single hour of engineer time debugging a misrouted tool call in production.

How this shaped our architecture

This data isn’t theoretical for us. It directly informed how we built progressive disclosure in SixDegree.

The core insight: if accuracy degrades between 25 and 50 tools, then the goal isn’t to find a smarter model. It’s to never present more than 25 tools in the first place. Not by hardcoding a curated list, but by letting the agent’s context determine which tools are relevant at each step.

In SixDegree, when an agent queries the ontology and discovers a GitHub repository, only the GitHub tools become available. When a Kubernetes deployment surfaces through a relationship, the Kubernetes tools appear. The agent never sees all 150 tools at once because it never needs to. The toolset at any given turn is scoped to the entities the agent has actually encountered.
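The scoping idea can be sketched in a few lines. This is an illustration only, not SixDegree's implementation: `ScopedToolbox`, its methods, and most of the tool names are hypothetical. It assumes the agent reports each service it encounters, and only tools for those services are exposed on the next turn.

```python
class ScopedToolbox:
    """Expose only the tools for services the agent has encountered so far."""

    def __init__(self, tools_by_service):
        self._tools_by_service = dict(tools_by_service)
        self._active = []  # services in the order they were discovered

    def discover(self, service):
        """Mark a service as encountered; unknown or repeated services are ignored."""
        if service in self._tools_by_service and service not in self._active:
            self._active.append(service)

    def visible_tools(self):
        """Tools to send with the next model call."""
        return [t for s in self._active for t in self._tools_by_service[s]]


toolbox = ScopedToolbox({
    "github": ["github_create_issue", "github_list_pull_requests"],
    "kubernetes": ["kubernetes_get_deployment"],
    "jira": ["jira_add_comment"],
})

toolbox.discover("github")      # e.g. the agent found a repository
print(toolbox.visible_tools())  # ['github_create_issue', 'github_list_pull_requests']

toolbox.discover("kubernetes")  # a deployment surfaced through a relationship
print(len(toolbox.visible_tools()))  # 3
```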

The Boundary data supports this approach quantitatively. At 25 tools (roughly the size of two or three services’ worth of tools), every model was at or tied for its best accuracy, between 78% and 87%. That’s the operating range progressive disclosure keeps you in, regardless of how many total services are connected. You can have 16 integrations and 150 tools installed, and the agent still only sees the 10-20 that matter for the current conversation.

The alternative, loading everything and hoping the model figures it out, costs you 5-10 percentage points of accuracy, up to roughly 7x the latency (28 seconds per call in the worst case), and for OpenAI models, a hard failure above 128 tools. Progressive disclosure isn’t a nice-to-have. It’s a requirement for agents that work at scale.

Limitations and what we’d like to improve

This benchmark is a starting point, not a definitive answer. There are real limitations to what it measures and how:

Single-turn only. Each prompt gets one shot at picking a tool. Real agents chain tool calls, use results from previous calls to inform the next one, and recover from mistakes. A model that picks the wrong tool on the first try might self-correct on a second turn. This benchmark doesn’t capture that.

Random tool subsets. At each toolset size, the available tools are randomly selected (with the correct one always included). In production, the tools in context aren’t random. They’re usually grouped by service or use case. Random selection may overstate or understate confusion depending on which tools end up adjacent.

No parameter validation. We check whether the model picked the right tool, but not whether it filled in the parameters correctly. A model that picks github_create_issue but hallucinates the owner field is still counted as correct. Parameter accuracy is a whole separate dimension.

Prompt quality varies. Some of the ambiguous prompts have debatable expected answers. “Check the monitoring alerts” could reasonably map to either Datadog or Grafana depending on the organization. We picked one, but reasonable people could disagree.

Single trial. Each prompt runs once per toolset size. With 60 prompts per size, the results are directional but individual percentage points could shift with more trials.
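A rough sense of how far a single-trial score could move comes from the standard error of a proportion. The sketch below uses the normal approximation with 60 prompts per toolset size. It is a simplification that treats prompts as independent trials, and `margin_of_error` is an illustrative helper.

```python
import math


def margin_of_error(accuracy, n, z=1.96):
    """Approximate 95% margin for a proportion, using the normal approximation."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n)


for acc in (0.867, 0.80, 0.733):
    print(f"{acc:.1%} on 60 prompts: +/- {margin_of_error(acc, 60):.1%}")
```

Under these assumptions the margin is roughly 9 to 11 percentage points, which is wider than most of the gaps between adjacent toolset sizes in the accuracy table.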

We’d like to add multi-turn evaluation, parameter accuracy checking, configurable prompt difficulty levels, and more models. If you have ideas for how to make this benchmark better, if you disagree with our methodology, or if you’ve run Boundary against a model we haven’t tested yet, open an issue or submit a PR. This is an open-source project and we want the community to help shape it.

What’s next

The full interactive results from this run are available on our site. The framework is open source. Run it yourself and see how your preferred models handle tool overload.

Boundary is an open-source framework for finding where LLM context breaks. See how SixDegree solves tool overload.

PhpStorm 2026.1 is Now Out

Welcome to PhpStorm 2026.1! This release brings new PhpStorm MCP tools, new third-party agents inside your IDE, support for Git worktrees, and lots of other productivity-enhancing features for PHP and Laravel developers.


PhpStorm MCP tools

In PhpStorm 2025.2, we added an integrated MCP server for third-party coding agents like Claude Code, Windsurf, or Codex to access and use your IDE’s tools. 

In 2026.1, we are extending the MCP server toolset with more PhpStorm features, including:

  • Inspections and quick-fixes that enable agents to leverage PhpStorm’s powerful static analysis engine.
  • IDE search capabilities, including PhpStorm’s structural search and semantic search for code patterns.
  • Access to IDE actions so that you can delegate setup and customization of your IDE to your coding agent.

Furthermore, the PhpStorm plugin for Claude Code provides Claude Code with context and instructions for using PhpStorm MCP server tools. To add the plugin’s skills and hooks to your project, go to PhpStorm’s Settings | Tools | PHP Claude Skills.

Note: PhpStorm’s MCP server is disabled by default. To enable the server and configure integration with your coding agent, go to Settings | Tools | MCP Server.

AI

Third-party agents in PhpStorm

PhpStorm is evolving as an open platform that allows you to bring the AI tools of your choice into your professional development workflows.

In addition to Junie, Claude Agent, and most recently Codex, PhpStorm now lets you work with more AI agents directly in the AI chat. You can choose from agents such as GitHub Copilot, Cursor, and many others supported through the Agent Client Protocol.

Next edit suggestions

Next edit suggestions are now available without consuming the AI quota of your JetBrains AI Pro, Ultimate, or Enterprise subscription. These suggestions go beyond traditional code completion for PHP. Instead of updating only what’s at your cursor, they intelligently apply related changes across the entire file, helping you keep your code consistent and up to date with minimal effort.

This natural evolution of code completion delivers a seamless Tab Tab experience that keeps you in the flow.

Junie CLI is now in Beta

Junie CLI is JetBrains’ LLM-agnostic coding agent you can use directly from the terminal, inside any IDE, in CI/CD, and on GitHub or GitLab. Junie CLI comes with:

  • Bring Your Own Key (BYOK) pricing, allowing you to use your own keys from model providers without additional charges.
  • One-click migration from other agents such as Claude Code or Codex.
  • Flexible customization through guidelines, custom agents and agent skills, commands, MCP, and more.

Read the full announcement in our blog post.

Project indexing optimization

PhpStorm now automatically detects framework-specific directories that contain frequently changing generated, cached, or user-uploaded content and excludes such directories from project indexing. 

The IDE skips excluded folders during search, parsing, and other operations. Reducing indexing overhead helps optimize the CPU usage and performance of your IDE. 

If you want to re-enable indexing for any of the automatically excluded folders, you can do so in Settings | Directories by clicking Exclude and unselecting the checkboxes next to the directories you want to be indexed.

Generics support

The new release brings a number of improvements and bug fixes for PhpStorm’s type inference engine, including: 

  • Improved type inference for callable generic types. Now the IDE can infer both the input parameter type from a callable(T) annotation and the callable template return type.

  • Improved display for nested parameterized template types. PhpStorm 2026.1 displays parameter type (Ctrl+Shift+P) and quick documentation (F1) info with multiple layers of wrapping, such as Wrapper<Wrapper<Wrapper<stdClass>>>.

More quality-of-life improvements

Debugging non-PHP files

You can now set breakpoints in non-PHP files as soon as the file name pattern is associated with the PHP file type in the IDE settings. Together with native path mapping between templates and compiled PHP files introduced in Xdebug 3.5, this feature allows you to debug source template files of any format, including niche extensions like .ezt.

Improved Go to Test navigation

In PhpStorm 2026.1, we’ve improved Go to Test navigation for PHPUnit and Pest tests with the following enhancements: 

  • Navigation between PHPUnit tests that use a #[UsesClass] or #[UsesMethod] attribute and the related class/method.
  • For Pest tests, you can now navigate from the Test Runner tab to the source test nested inside Pest describe blocks. 

Convert to pipe operator quick-fix

PhpStorm now detects code elements where the PHP 8.5 pipe operator syntax can be used and suggests a quick-fix to convert such code into easier-to-read pipe operator chains.

Laravel

  • Framework support: support for Laravel 13 and new versions of Livewire and Filament. Support for the new @hasStack and @includeIsolated Blade directives.
  • New package support: Laravel Wayfinder, PHP Native, staudenmeir/laravel-cte and staudenmeir/laravel-adjacency-list packages.
  • Eloquent enhancements: advanced #[Scope] methods support, optimized and more accurate Find Usages for scope, attribute and relation methods.
  • UI and navigation: Blade view usages UI, better controller inlays, new Route Search UI, and routes to the Endpoints tool window.
  • Productivity tweaks: a new Add Application Database action. Run Artisan commands in the Terminal tool window or via PHP interpreter.
  • Laravel Idea MCP server shipped with the PhpStorm MCP server.

For the full list of updates, see Laravel Idea’s changelog.

Frontend

PhpStorm’s TypeScript support now uses the service-powered type engine (built on the TypeScript language service) by default, delivering more accurate type inference and lower CPU usage in large projects. The TypeScript support is further improved with better auto-import handling for path aliases and project references, as well as the integration of inlay hints from the TypeScript Go-based language server. JavaScript parsing now also correctly handles string-literal import / export specifiers.

Framework and styling support have been refined across the board: 

  • The IDE now highlights React’s new use memo and use no memo directives. 
  • The Vue integration uses the updated 3.1.8 version of @vue/typescript-plugin.
  • Astro settings accept JSON-based configuration for language server integration. 
  • Modern CSS color() functions and additional color spaces are supported in swatches and previews. 
  • Angular 21.x template syntax is supported.

Databases

The AI chat integration for Codex and Claude Agent now offers full, native support for your connected databases. With that, you can now query, analyze, and modify your database state using natural language right from the IDE.

The same functionality is available for external agents via an MCP server.

Data source settings can now be stored in your JetBrains Account via data source templates. Especially nifty for All Products Pack users or anyone who uses multiple instances of JetBrains IDEs, this upgrade allows you to access data source templates and settings in every JetBrains IDE with database functionality.

Productivity-enhancing features

Editor caret and selection updates

We’re continuing to modernize our IDEs, and in this update, we’ve refreshed something you interact with constantly – the editor. Smooth caret animation and updated selection behavior provide improved comfort, a cleaner look, and a more enjoyable coding experience.


Work on multiple branches at once with Git Worktrees

With the evolution of AI agents, running multiple tasks in parallel has become a major time-saver, and this is precisely where Git worktrees are extremely handy. To support cutting-edge workflows for AI-boosted software development, PhpStorm now provides first-class support for Git worktrees. Create a separate worktree for an urgent hotfix, hand off another one to an AI agent, and keep working in your main branch – all at the same time, without interruption.

Even if you don’t use agents, worktrees will save you time on branch switching, especially in big projects.

Native Wayland support

IntelliJ-based IDEs now run natively on Wayland by default. This transition gives Linux users sharper HiDPI rendering and better input handling, and it paves the way for future enhancements like Vulkan support.

While Wayland provides benefits and serves as a foundation for future improvements, we prioritize reliability: the IDE will automatically fall back to X11 in unsupported environments to keep your workflow uninterrupted.

Terminal completion

Stop memorizing commands. Start discovering them. In-terminal completion helps you instantly explore available subcommands and parameters as you type. Whether you’re working with complex CLI tools like Git, Docker, or kubectl or using your own custom scripts, this feature intelligently suggests valid options in real time.

Code With Me sunset

As we continue to evolve our IDEs and focus on the areas that deliver the most value to developers, we’ve decided to sunset Code With Me, our collaborative coding and pair programming service. Demand for this type of functionality has declined in recent years, and we’re prioritizing more modern workflows tailored to professional software development.

As of version 2026.1, Code With Me will be unbundled from all JetBrains IDEs. Instead, it will be available on JetBrains Marketplace as a separate plugin. 2026.1 will be the last IDE version to officially support Code With Me, as we gradually sunset the service.

Read the full announcement and sunset timeline in our blog post. 

RubyMine 2026.1: AI Chat Upgrades, New Code Insight, Stable Remote Development, and More 

RubyMine 2026.1 is here! This release brings a range of improvements aimed at making Ruby and Rails development faster and more enjoyable.

You can get the new build from our website or via the free Toolbox App.

Let’s take a look at the highlights of this release.

AI

RubyMine continues to evolve as an open platform that lets you bring your preferred AI tools directly into your development workflow. With RubyMine 2026.1, working with multiple AI agents and integrating them into your IDE experience is now easier than ever.

Use more AI agents in RubyMine

In addition to Junie and Claude Agent, you can now choose more agents in the AI chat, including Codex. Additionally, Cursor and GitHub Copilot, along with dozens of external agents, are now supported via the Agent Client Protocol (ACP). With the new ACP Registry, you can discover available agents and install them in just one click.

Install From ACP Registry option in AI chat

Work with connected databases directly in the AI chat

The AI chat integration for Codex and Claude Agent now offers full, native support for your connected databases. With that, you can now query, analyze, and modify your database state using natural language right from the IDE.

The same functionality is available for external agents via an MCP server.

Accessing Rails project databases from AI chat using Claude Agent

Get next edit suggestions throughout your file

Next edit suggestions are now available without consuming the AI quota of your JetBrains AI Pro, Ultimate, or Enterprise subscription. These suggestions go beyond what is offered by traditional code completion for your programming language. Instead of updating only what’s at your cursor, they intelligently apply related changes across the entire file, helping you keep your code consistent and up to date with minimal effort.

This natural evolution of code completion delivers a seamless Tab Tab experience that keeps you in the flow. 

Enabling next edit suggestions for AI Assistant

Code insight

Try the new code insight engine (Beta)

RubyMine 2026.1 introduces a new, currently experimental, symbol-based language modeling engine.

This engine changes how RubyMine understands classes, modules, and constants (support for methods is planned for future releases), laying the groundwork for faster and more reliable code insight.

Our internal benchmarks show significant improvements.

Qualified first-element constant completion is about 40% faster, while the overall time for constant completion improved by roughly 50%. Type-matched completion for exceptions became dramatically faster – by about 95%. In addition, the performance of Find Usages improved by around 60% in large projects and by about 15% in typical cases.

Additional areas that benefit from the new engine include:

  • Rename refactoring
  • Quick Documentation, Quick Definition, and Ctrl+Hover hints
  • Structure view
  • Navigation (Go to Declaration and Go to Type Declaration)

Because the engine is still in Beta, it is disabled by default. You can enable it in Settings | Languages & Frameworks | Ruby | Code Insight.

Give it a try and share your feedback!

Enabling experimental code insight for Ruby

Remote development

Boost your productivity with Stable remote development

Remote development officially moves out of Beta and becomes Stable in RubyMine 2026.1.

You can now connect to your development environments via SSH, Dev Containers, or WSL 2, and the IDE backend will run on the remote machine while the user interface remains fast and responsive on your local device.

This setup gives you the full RubyMine experience wherever your code lives.

Remote Development window

Rails

Work seamlessly with variables passed via render

RubyMine now correctly recognizes local variables passed via render.

Variables provided through the locals: option are no longer marked as unresolved and appear in code completion.

This behavior works consistently across views, layouts, partials, and templates (ERB and HAML), providing cleaner code insight and fewer unnecessary warnings.

Recognizing variables passed via render

Detect deprecated Rails associations instantly

Keeping Rails projects modern and maintainable is now easier with improved deprecation detection.

When a Rails association is marked as deprecated (for example, has_many :posts, deprecated: true), RubyMine highlights all its usages throughout your project and shows a clear deprecation notice in the Quick Documentation popup.

This helps you identify outdated APIs early and update your code proactively.

Highlighting deprecated Rails associations

Use Rails virtual database columns

RubyMine 2026.1 adds recognition for virtual generated columns from PostgreSQL 18 (or later versions) in Rails projects.

These non-persisted columns behave just like regular attributes in the IDE. Code completion, type hints, and navigation to the column definition in schema.rb work seamlessly.

Recognizing virtual database columns in Rails

Ruby and RBS

Use endless methods with access modifiers

RubyMine now fully supports Ruby 4.0 endless methods with access modifiers. Code such as private def hello = puts "Hello" is now parsed correctly and no longer produces errors.

Supporting endless methods with access modifiers

Use more Ruby and RBS operators in completion

You can now type Ruby and RBS operators (=, !, +, *, and others) directly in the completion popup without closing it. This keeps you in the flow and helps you finish expressions faster.

Expanded range of operators in completion popup

Rename global variables safely

RubyMine now validates global variable names during renaming.

Invalid names such as $foo!@# are no longer allowed, preventing broken code and syntax errors. The IDE ensures renamed variables follow Ruby’s syntax rules, making refactoring safer and more reliable.

Alert notification about an invalid global variable name

Let RubyMine select the Ruby interpreter automatically

RubyMine 2026.1 can automatically detect the correct Ruby interpreter by analyzing configuration files such as .ruby-version or .tool-versions.

There are three scenarios:

  • Single match found: RubyMine sets the interpreter automatically so you can start coding immediately.
  • Multiple matches or no match found: RubyMine shows a notification and helps you choose the correct interpreter.
  • No configuration file found: RubyMine selects the latest installed MRI Ruby version as a safe default.

If you prefer manual configuration, you can disable this behavior in Settings | Languages & Frameworks | Ruby. Find more details in our docs.
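The three scenarios amount to a small decision procedure, sketched below. This is not RubyMine's implementation: `choose_interpreter`, the action labels, and the interpreter paths are hypothetical, and the behavior when no configuration file exists and no interpreter is installed is an assumption, since the text does not cover that case.

```python
def version_key(version):
    """Turn '3.4.1' into (3, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def choose_interpreter(requested, installed):
    """
    requested: version from .ruby-version or .tool-versions, or None if no file exists.
    installed: list of (version, path) pairs for installed MRI interpreters.
    Returns (action, path), where action is 'auto' or 'notify'.
    """
    if requested is None:
        if not installed:
            return ("notify", None)  # assumption: nothing to fall back to
        latest = max(installed, key=lambda item: version_key(item[0]))
        return ("auto", latest[1])   # no config file: latest installed version
    matches = [path for version, path in installed if version == requested]
    if len(matches) == 1:
        return ("auto", matches[0])  # single match: select it
    return ("notify", None)          # multiple matches or none: ask the user


# Hypothetical installation list
installed = [
    ("3.3.6", "/opt/rubies/3.3.6/bin/ruby"),
    ("3.4.1", "/opt/rubies/3.4.1/bin/ruby"),
    ("3.4.1", "/usr/local/bin/ruby"),
]

print(choose_interpreter("3.3.6", installed))  # ('auto', '/opt/rubies/3.3.6/bin/ruby')
print(choose_interpreter("3.4.1", installed))  # ('notify', None)
print(choose_interpreter("3.2.0", installed))  # ('notify', None)
```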

Updated Ruby settings page with the option of automatic Ruby interpreter selection

User experience improvements

Debug failing tests faster with the diff viewer

RubyMine 2026.1 introduces a diff viewer for failed RSpec and minitest tests.

When a test fails, click the “Click to see difference” link in the test results to open a side-by-side comparison of expected and actual values. This makes it much easier to identify the issue and fix failing tests quickly.

Configure linting and formatting with ease

RubyMine now features a redesigned configuration for RuboCop and the standard gem, along with a new Linting and Formatting section in Settings | Tools | RuboCop.

You can choose from mutually exclusive options:

  • Default
  • Standard gem inspections
  • Standard on save
  • RuboCop server mode
  • RuboCop on save

The updated settings simplify configuration, prevent conflicts between tools, and integrate tightly with RubyMine formatting actions.

Redesigned RuboCop settings page with new Linting and Formatting section

Other

Plan ahead for the sunsetting of Code With Me

Starting with RubyMine 2026.1, Code With Me will be unbundled from JetBrains IDEs and distributed as a separate plugin on JetBrains Marketplace.

RubyMine 2026.1 will be the last IDE version to officially support Code With Me as the service is gradually sunset.

Read the full announcement and timeline in our blog post.

Stay in touch

Follow RubyMine on X to stay up to date on all the latest features.

We invite you to share your thoughts in the comments below. You can also suggest and vote for new features in our issue tracker.

Happy developing!

The RubyMine team

AI-Assisted Java Application Development with Agent Skills

Agent-assisted development is quickly becoming a common mode of software development. New techniques are emerging to help LLMs generate code that matches your preferences and standards.

One common approach is to create an AGENTS.md, CLAUDE.md, or GEMINI.md file with project details, build instructions, and coding guidelines. The AI agent loads this file into context on every request.

This has two drawbacks:

  • It consumes tokens on every request, increasing cost.
  • Loading too much context into an LLM degrades its effectiveness.

Agent Skills is a new initiative that solves both problems by managing context progressively and extending AI agent capabilities on demand.

What are Agent Skills?

Agent Skills is an open standard, introduced by Anthropic, that extends AI agent capabilities with specialized knowledge and workflows.

Consider a use case where you want an AI to generate presentations using your company’s slide template and design guidelines. You can package those assets (the PPT template, font files, and design rules) into a skill. The agent then uses that skill to generate slides that match your standards automatically.

A skill is a folder containing a SKILL.md file. This file includes metadata (name and description at minimum) and instructions that tell an agent how to perform a specific task. Skills can also bundle scripts, templates, and reference materials.

skill-name/
├── SKILL.md          # Required: instructions + metadata
├── scripts/          # Optional: executable code
├── references/       # Optional: documentation
└── assets/           # Optional: templates, resources

The format of a SKILL.md file is:

---
name: name-of-the-skill
description: Skill description.
license: Apache-2.0
metadata:
  author: author/org
  version: "1.0"
compatibility: Requires git, docker, jq, and access to the internet
---

Skill Content

In a SKILL.md file, name and description are required fields; optional fields include license, metadata, and compatibility. You can explore more about the Skill Specification here.

How do Agent Skills manage context?

At startup, agents load only the metadata (name and description) of installed skills. When you ask the agent to perform a task, it finds the relevant skill and loads only that SKILL.md into context.

This progressive loading keeps context minimal and pulls in additional information only when needed, unlike a monolithic CLAUDE.md that loads everything upfront.
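The progressive loading described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent implements it: the function names read_front_matter, load_skill_index, and load_skill_body are hypothetical, and the front matter parser only handles top-level key: value lines rather than full YAML.

```python
from pathlib import Path


def read_front_matter(skill_md: Path) -> dict[str, str]:
    """Parse top-level `key: value` pairs between the leading '---' fences."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    meta: dict[str, str] = {}
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Skip nested (indented) entries; this sketch is not a YAML parser.
        if line.startswith((" ", "\t")) or ":" not in line:
            continue
        key, value = line.split(":", 1)
        meta[key.strip()] = value.strip()
    return meta


def load_skill_index(skills_dir: Path) -> dict[str, str]:
    """Startup step: collect only name -> description for each installed skill."""
    index: dict[str, str] = {}
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_front_matter(skill_md)
        if "name" in meta and "description" in meta:
            index[meta["name"]] = meta["description"]
    return index


def load_skill_body(skills_dir: Path, name: str) -> str:
    """On-demand step: return the instructions of the one matching skill."""
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        if read_front_matter(skill_md).get("name") == name:
            text = skill_md.read_text(encoding="utf-8")
            # Everything after the closing '---' fence is the skill content.
            return text.split("---", 2)[2].strip()
    raise KeyError(name)
```

Only the small index would sit in context at startup; the body of a skill is read when a task matches its description.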

What can be a skill?

Skills extend AI capabilities across a wide range: from coding guidelines for a specific library, to step-by-step workflows with reference documents and helper scripts.

For example, you can create a skill that:

  • Specifies which library APIs to use and which anti-patterns to avoid.
  • Bundles reference documentation in a references/ directory.
  • Includes helper scripts in a scripts/ directory.

Case Study: Implementing Spring Data JPA Pagination

Suppose you ask an AI agent to implement a Spring Boot REST API endpoint that returns a paginated list of Post entities along with their Comment collections.

Without guidance, the agent is likely to produce one of these common mistakes:

  • N+1 SELECT problem — lazy-loading comments trigger a separate query per post.
  • In-memory pagination — using JOIN FETCH with pagination loads all rows into memory, then paginates in the application layer.

You can check out the sample code from the GitHub repository https://github.com/sivaprasadreddy/agent-skills-demo 

Let us see how an AI Agent might generate code when asked to implement a REST API endpoint to return paginated posts along with comments.

Without any specific guidelines or skills, the AI Agent generated the following implementation:

@RestController
@RequestMapping("/api/posts")
class PostController {
   private final PostService postService;

   PostController(PostService postService) {
       this.postService = postService;
   }

   @GetMapping
   PagedResult<PostDto> getPosts(
           @RequestParam(name = "page", defaultValue = "1") int pageNo,
           @RequestParam(name = "size", defaultValue = "10") int pageSize) {
       return postService.getPosts(pageNo, pageSize);
   }

}


@Service
@Transactional(readOnly = true)
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   public PagedResult<PostDto> getPosts(int pageNo, int pageSize) {
       Sort sort = Sort.by(Sort.Direction.ASC, "id");
       Pageable pageable = PageRequest.of(pageNo <= 0 ? 0 : pageNo - 1, pageSize, sort);
       Page<PostDto> postPage = postRepository.findAllWithComments(pageable).map(PostDto::from);
       return PagedResult.from(postPage);
   }

}

If you run the application and invoke the GET /api/posts endpoint, you will get results, but the logs will contain the following warning:

HHH000104: firstResult/maxResults specified with collection fetch; applying in memory

This means Hibernate loads all matching entities into memory and then applies pagination there. The result is poor performance, and possibly an OutOfMemoryError, if the posts table has a large number of rows.

A Spring Data JPA skill prevents both issues by giving the agent explicit guidelines and a working code example.

Spring Data JPA Agent Skill

Create a spring-data-jpa/SKILL.md file with the following content:

---
name: spring-data-jpa-skill
description: Implement the persistence layer using Spring Data JPA in Spring Boot applications.
---

Follow the below principles when using Spring Data JPA:

1. Disable the Open Session in View (OSIV) filter: 
spring.jpa.open-in-view=false
2. Disable in-memory pagination: 
spring.jpa.properties.hibernate.query.fail_on_pagination_over_collection_fetch=true

3. Avoid the N+1 SELECT problem: use JOIN FETCH to load associated child collections in a single query.
4. Avoid in-memory pagination: when loading a paginated list of parent entities with child collections:
	* First, load only the parent IDs using pagination
	* Then, load the full entities with their child collections using JOIN FETCH for those IDs
	* Assemble the final Page from the paginated IDs and the loaded entities


## Pagination with child collections example:

PostRepository.java

public interface PostRepository extends JpaRepository<Post, Long> {

   @Query("select p.id from Post p order by p.id")
   Page<Long> findPostIds(Pageable pageable);

   @Query("select distinct p from Post p left join fetch p.comments where p.id in :ids")
   List<Post> findAllByIdInWithComments(@Param("ids") Collection<Long> ids);
}


PostService.java

@Service
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   @Transactional(readOnly = true)
   public Page<Post> findPosts(Pageable pageable) {
       Page<Long> idsPage = postRepository.findPostIds(pageable);
       if (idsPage.isEmpty()) {
           return Page.empty(pageable);
       }
       List<Post> posts = postRepository.findAllByIdInWithComments(idsPage.getContent());
       return new PageImpl<>(posts, pageable, idsPage.getTotalElements());
   }
}

How to use Agent Skills?

Agent Skills work with Claude Code, Codex, Gemini CLI, JetBrains Junie, and other agents. Install a skill at the project level or user level depending on your preference.

Agent | Project-Level | User-Level
Junie | .junie/skills/ | ~/.junie/skills/
Claude Code | .claude/skills/ | ~/.claude/skills/
Codex | .agents/skills/ | ~/.agents/skills/
Gemini CLI | .gemini/skills/ or .agents/skills/ | ~/.gemini/skills/ or ~/.agents/skills/

To use the Spring Data JPA skill with Claude Code:

  1. Copy the spring-data-jpa/ directory into {project-root}/.claude/skills/.
  2. Ask Claude Code to implement a paginated REST API endpoint.
  3. Claude Code discovers the skill automatically and follows the guidelines.

Claude Code discovered the Spring Data JPA skill automatically and generated the following implementation, which follows the guidelines given in the skill.

@Service
public class PostService {
   private final PostRepository postRepository;

   public PostService(PostRepository postRepository) {
       this.postRepository = postRepository;
   }

   @Transactional(readOnly = true)
   public Page<Post> findPosts(Pageable pageable) {
       Page<Long> idPage = postRepository.findPostIds(pageable);
       if (idPage.isEmpty()) {
           return Page.empty(pageable);
       }
       List<Post> posts = postRepository.findAllByIdInWithComments(idPage.getContent());
       return new PageImpl<>(posts, pageable, idPage.getTotalElements());
   }
}

With this implementation, only the Post IDs for the requested page are loaded first, and then those posts are fetched along with their comments in a separate query. This fixes the in-memory pagination issue.

Using Agent Skills with Junie

You can use the JetBrains Junie agent to generate code; it automatically loads the necessary skills from the .junie/skills directory.

The Junie agent loaded the spring-data-jpa skill based on the given task and applied the guidelines. You can also observe that Junie automatically runs the relevant tests to verify that the generated code works, and iterates until the tests pass.

In the sample repository https://github.com/sivaprasadreddy/agent-skills-demo, you can find the following branches to try out the spring-data-jpa Agent Skill:

  • main: Starting point for implementing the use case without any skills.
  • in-memory-pagination-issue: AI-generated implementation of the use case that results in the in-memory pagination issue.
  • skills: Includes the spring-data-jpa skill, for implementing the use case with it.

Summary

If the AI agent generates code with anti-patterns or does not follow your team's coding standards and conventions, consider creating a skill that provides those guidelines instead of fixing issues one by one with follow-up prompts.

To explore more on Agent Skills, please refer to the following resources:

  • https://agentskills.io/
  • https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
  • https://developers.openai.com/codex/skills/
  • https://geminicli.com/docs/cli/skills/
  • https://junie.jetbrains.com/docs/agent-skills.html 

GoLand 2026.1 Is Released

GoLand 2026.1 helps you keep your Go code modern and your workflow efficient. This release introduces guided syntax updates for Go 1.26, making it easier to adopt new language improvements across your entire codebase. It also expands AI capabilities with support for additional agents and brings several productivity improvements, including Git worktrees support and a smoother editing experience.

Let’s take a look at the key updates in this release.

Download GoLand

Keep your codebase modern with guided Go syntax updates

Keeping your code aligned with the evolution of Go helps ensure long-term maintainability and compatibility with the ecosystem. GoLand 2026.1 introduces a unified workflow that helps you discover and apply modern Go syntax across your codebase.

When your project switches to Go 1.26, GoLand scans your code and highlights constructs that can be updated. These alerts appear directly in the editor and explain what can be improved and why, making new language features visible as you work.

In this release, GoLand supports two Go 1.26 syntax updates. Our team plans to expand this functionality in upcoming releases by covering additional important changes introduced in recent Go versions.

Identify outdated syntax directly in the editor

GoLand now includes inspections that detect outdated patterns and suggest modern alternatives. In 2026.1, the IDE introduces two syntax updates based on Go 1.26:

  • Pointer creation improvements using new()
  • Type-safe error unwrapping with errors.AsType

Each inspection provides quick-fixes so you can apply improvements directly in the editor.

Update your entire codebase in one workflow

Once you apply a syntax update, you can expand it across your entire project.

GoLand provides several entry points, so you can start where it feels most natural:

  • Right after applying a quick-fix, click Analyze code for other syntax updates.
  • Open Search Everywhere by double-pressing Shift and run the Update Syntax action.
  • Open go.mod with the go 1.26 directive and click Analyze code for syntax updates.
  • Go to the Refactor menu and select Update Syntax.

GoLand collects all findings in the Problems tool window, where you can review and apply updates across the project.

Review large changes with diff previews

You can review grouped results, apply fixes to individual occurrences or entire groups, and inspect every change using a built-in diff preview before applying it.

Work more easily with cloud and infrastructure workflows

Modern development increasingly relies on containerized environments and infrastructure tools. GoLand 2026.1 introduces several improvements that help you work with these workflows directly in the IDE.

Manage Terraform Stacks more easily

GoLand now supports working with Terraform Stacks directly in the IDE.

You can explore the infrastructure structure, navigate between components, and create new deployments from the IDE interface. Code completion and improved navigation help you stay oriented in complex infrastructure configurations.

Work faster and more comfortably in everyday development

Several improvements in GoLand 2026.1 focus on reducing friction in common workflows and making the IDE more comfortable to use throughout the day.

Work on multiple branches simultaneously with Git worktrees

GoLand now provides first-class support for Git worktrees, allowing you to work with multiple branches at the same time.

You can create a separate worktree for a hotfix, assign another one to an AI agent, and continue working in your main branch without switching contexts.

Even without AI workflows, worktrees reduce branch switching overhead and help you move faster in large repositories.

Enjoy a smoother and more responsive editing experience

The editor continues to evolve with improvements designed to make everyday coding more convenient.

This release introduces smoother caret animations and updated selection behavior, resulting in a cleaner and more responsive editing experience. For more information, refer to our blog post: Editor Improvements: Smooth Caret Animation and New Selection Behavior.

Get better Linux support with native Wayland integration

GoLand now runs on Wayland by default, improving HiDPI rendering and input handling on Linux systems.

If Wayland is not supported in your environment, the IDE automatically falls back to X11 to ensure your workflow remains stable and uninterrupted. For more information, refer to our blog post: Wayland By Default in 2026.1 EAP.

Get more done with AI directly in the IDE

GoLand continues expanding its AI capabilities to give you more flexibility and control over how you use AI during development.

Choose the best AI agent for each task

In addition to Junie, Claude Agent, and most recently Codex, GoLand now lets you work with more AI agents directly in the AI chat. You can choose from agents such as GitHub Copilot, Cursor, and many others supported through the Agent Client Protocol (ACP).

With the new ACP Agent Registry, you can discover and install supported agents with a single click.

Code With Me sunset

As we continue to evolve our IDEs and focus on the areas that deliver the most value to developers, we’ve decided to sunset Code With Me, our collaborative coding and pair programming service. Demand for this type of functionality has declined in recent years, and we’re prioritizing more modern workflows tailored to professional software development.

As of version 2026.1, Code With Me will be unbundled from all JetBrains IDEs. Instead, it will be available on JetBrains Marketplace as a separate plugin. 2026.1 will be the last IDE version to officially support Code With Me, as we gradually sunset the service.

Read the full announcement and sunset timeline in our blog post.

That wraps up the highlights of GoLand 2026.1.

We hope these changes make your workflow smoother and more enjoyable.

We would love to hear your thoughts: Feel free to tag us on X, drop into the #goland-gophers Slack channel, or create a ticket in our YouTrack issue tracker.

Happy coding,

The GoLand team

I cut Claude API costs by 90% with prompt caching. Here’s what I learned before I had to shut it down.

867 Discord servers. 1,000+ active users. $10–11 every time someone played a one-hour D&D session.

I was the only engineer. There was no revenue. And that number wasn’t going down on its own.

I want to be upfront before we go any further: Scrollbook is no longer running.

I built it because I was always the Dungeon Master. My wife, my son, and I had a standing D&D night, and I wanted to actually play for once instead of running the whole session. So I built an AI dungeon master to take my seat. It worked well enough that I shared it. I did not expect anyone else to care.

They did. 867 servers and 1,000+ users later, I was looking at $10-11 every time someone played a one-hour session with no revenue, no paywall, and no plan for either. (Scrollbook is one of three production projects I break down in my case studies. The other two are live and generating revenue. The contrast is instructive.) I shut it down because the cost of operating it solo, without a monetization model that kept pace with usage, made it unsustainable. By the time I pulled the plug, prompt caching had dropped that same session to $0.50-1.50. The technical solution worked. The business math didn’t.

Both of those things are worth talking about.

This post covers the technical side in detail: what the problem was, what I changed, and the actual production code behind it. The business lesson is at the end. I’d argue it’s the more important one.

The Cost Problem

Every message to Claude sent the entire conversation context from scratch. In a D&D session, that context grows with every exchange between the player and the AI.

Before caching, each API call looked like this:

[system prompt: ~1,800 lines of D&D rules + Cipher's personality]
[campaign context: setting, NPCs, quests, locations, active encounter]
[character context: stats, equipment, spells, conditions, companions]
[party context: all active players and their characters]
[message history: every exchange in the session so far]
[current question: "can I grapple the goblin?"]

The system prompt and campaign context alone sat at 4,000–5,000 tokens, reprocessed at full price on every single message.

A one-hour D&D session averages 15–25 back-and-forth exchanges. Context grows on each call. At Sonnet pricing ($3.00/M input, $15.00/M output): $10–11 per session. Multiply that across hundreds of active servers running concurrent sessions and it stops being a line item. It becomes a ceiling. Every new user makes the situation structurally worse.

The Architecture

Scrollbook runs on six services:

Service | Role
bot/ | Discord bot; receives player commands
api/ | REST API for the companion web app
shared/services/cipher_service.py | Owns all Anthropic API calls
shared/services/ai_usage_tracker.py | Token counting and budget enforcement
shared/services/ai_extraction_service.py | PDF/content extraction via Bedrock
infrastructure/ | AWS CDK (ECS Fargate, RDS, ALB)

cipher_service.py is the single point of contact with the Anthropic API. Context is assembled per-request by ContextManager.build_context(), pulling campaign data, character stats, active party, quests, encounters, and NPCs from Postgres — all scoped to the Discord guild ID.

Here is the insight that unlocked the fix: the system prompt and campaign context were structurally identical on every request for a given server. The D&D rules, Cipher’s personality, the campaign world — none of it changes message-to-message. It was being sent and fully reprocessed every single time, on every message, for every server.

What Prompt Caching Actually Is

Anthropic caches the prefix of your prompt on their infrastructure for a TTL window. Subsequent requests that match that prefix byte-for-byte skip the reprocessing cost. Instead of paying full input token price, you pay roughly 10% of that on a cache hit.

A few things that matter:

Prefix, not arbitrary sections. The cache applies to the beginning of your prompt. Everything you want cached must come before everything that changes. This means prompt order is the entire game.

Cache hits vs. misses. A hit means the prefix was already in cache; you pay about 10% of the normal input token price. A miss means the prefix gets written to cache at roughly 1.25x the normal input token price — slightly more expensive than a regular call, but a one-time cost within each TTL window. After the first message in a session, you want hits almost exclusively.

The TTL is 5 minutes for the ephemeral cache type on Anthropic’s infrastructure. For active D&D sessions this is fine — messages come fast. For a server that runs one session a week, you pay write costs every time with zero read benefit. The math only works at session density.

This is a first-class API feature, not a workaround. You opt in by passing structured content blocks with a cache_control field instead of a plain string. Two lines of code. Anthropic’s infrastructure handles everything else.

One more thing worth saying clearly: this is not client-side caching. You are not storing API responses locally. You are telling Anthropic’s infrastructure which portion of your prompt is stable so it does not need to recompute it.
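To make the hit/miss pricing concrete, here is a rough sketch that compares the cost of a stable prefix with and without caching, using the approximate multipliers quoted above (1.25x for a write, 0.10x for a read). The function name is hypothetical, and it assumes every request after the first lands inside the TTL window, which is a simplification.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # approximate cost of writing the prefix to cache
CACHE_READ_MULTIPLIER = 0.10   # approximate cost of reading the prefix from cache


def prefix_cost_ratio(requests_in_window: int) -> float:
    """Cached prefix cost divided by uncached prefix cost for one TTL window.

    Assumes the first request writes the cache and every later request
    arrives before the TTL expires, so it is a cache hit.
    """
    if requests_in_window < 1:
        raise ValueError("need at least one request")
    uncached = float(requests_in_window)  # each request pays 1.0x for the prefix
    cached = CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (requests_in_window - 1)
    return cached / uncached


for n in (1, 2, 20):
    print(n, round(prefix_cost_ratio(n), 3))
```

Under these assumptions, a single request per window costs about 25% more than not caching at all. From the second request onward the cached prefix is cheaper, and at 20 requests it costs roughly 16% of the uncached price. The ratio covers only the cached prefix, not the uncached tail of each request or the output tokens.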

The Implementation

Centralizing Prompt Assembly

With six services in play, the first structural requirement was centralizing all prompt assembly into one place. The cacheable prefix must be byte-for-byte identical across every request. That cannot happen if prompts are assembled in multiple code paths and concatenated at call time. A trailing space, a newline difference, a Unicode normalization inconsistency — any of it produces a full cache miss.

All prompt assembly in Scrollbook runs through one function: cipher_service.py:_build_conversational_prompt().

Prompt Order

The ordering decision is the whole thing:

1. System prompt (D&D rules + Cipher personality)        CACHED
2. Campaign and character context (per-guild, stable)    included in cache
3. Conversation history [0 ... N-3]                      CACHED at breakpoint
4. Conversation history [N-2, N-1]                       NOT cached
5. Current question                                      NOT cached

Static content at the top. Dynamic content at the bottom. The most expensive tokens, cached. The tokens that change on every message, not cached.

The Code

Before caching, the system prompt was passed as a plain string:

# Every call: full system text + context, reprocessed at full price every time
response = self.anthropic_client.messages.create(
    model=self.model_id,
    system=full_system_text,  # plain string, no caching
    messages=messages,
)

After caching, it becomes a structured content block:

# cipher_service.py:2070-2079
if self.enable_caching:
    system_blocks = [
        {
            "type": "text",
            "text": full_system_text,
            "cache_control": {"type": "ephemeral"}  # two lines
        }
    ]
else:
    system_blocks = [{"type": "text", "text": full_system_text}]

The conversation history gets a second cache breakpoint at the third-to-last message, capturing the entire prior session:

# cipher_service.py:2084-2098
for i, msg in enumerate(conversation_history):
    content_blocks = [{"type": "text", "text": msg["content"]}]

    if self.enable_caching:
        is_last_two = i >= len(conversation_history) - 2
        # Cache breakpoint at third-to-last message
        if not is_last_two and i == len(conversation_history) - 3:
            content_blocks[0]["cache_control"] = {"type": "ephemeral"}

    messages.append({"role": msg["role"], "content": content_blocks})

# Current question is never cached
messages.append({"role": "user", "content": [{"type": "text", "text": question}]})

Two cache breakpoints: one on the system prompt, one on the conversation history. The Anthropic API limits the number of cache control markers per request, so placement matters. You want those markers positioned to maximize the ratio of cached-to-uncached tokens on every call — that ratio is what drives your actual savings.

The API call itself barely changes. The system parameter is now a content block array instead of a string:

# cipher_service.py:2221-2228
response = self.anthropic_client.messages.create(
    model=self.model_id,
    max_tokens=self.max_tokens,
    temperature=self.temperature,
    system=system_blocks,  # content block array instead of plain string
    messages=msgs,
    tools=tools_to_use,
)

The Multi-Tenant Problem

867 servers means 867 sets of campaign state: different characters, different HP totals, different active encounters, different party compositions. Handling that per-guild context without polluting a shared prefix requires a specific architectural decision.

In Scrollbook, guild-specific data lives inside the cached block:

# cipher_service.py:2066-2068
context_section = context.to_prompt_section()
full_system_text = f"{system_prompt_text}\n\n{context_section}"
# This full_system_text then receives the cache_control block

This works because campaign context is stable within a session. Cipher updates game state via tool calls when something changes — it does not receive externally updated context as new input mid-session. For the duration of an active session, the system prompt plus campaign context is genuinely identical across every message for that guild. Each guild gets its own cached prefix. No cross-contamination.

If your situation is different — if state changes externally between messages — that dynamic content needs to live below the cache breakpoint, not inside it.
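For the externally updated case, one possible layout is to split the system prompt into two blocks and mark only the stable one. This is a sketch under assumptions: build_system_blocks and its parameters are hypothetical names, not part of Scrollbook, and the function only builds the request payload without calling the API.

```python
def build_system_blocks(stable_text: str, dynamic_text: str = "") -> list[dict]:
    """Return system content blocks with the cache breakpoint after the stable part."""
    blocks = [
        {
            "type": "text",
            "text": stable_text,
            # The cached prefix ends at this block.
            "cache_control": {"type": "ephemeral"},
        }
    ]
    if dynamic_text:
        # Content that changes between messages stays after the breakpoint,
        # so it does not invalidate the cached stable prefix.
        blocks.append({"type": "text", "text": dynamic_text})
    return blocks


# Example with made-up content: rules are cached, live state is not.
system_blocks = build_system_blocks(
    "D&D rules and narrator personality...",
    "Externally updated state: party gold = 120",
)
```

One trade-off to keep in mind: because the system prompt precedes the messages, a breakpoint placed later in the conversation history would still miss whenever the dynamic block changes, since the prefix up to that point is no longer identical.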

The Results

A one-hour session that cost $10–11 dropped to $0.50–1.50.

To verify you are actually hitting the cache, read the usage object on the response. Do not assume. Log it explicitly:

# cipher_service.py:2268-2288
if self.enable_caching and hasattr(response, "usage"):
    usage = response.usage
    input_tokens = getattr(usage, "input_tokens", 0)
    cache_read_tokens = getattr(usage, "cache_read_input_tokens", 0)
    cache_creation = getattr(usage, "cache_creation_input_tokens", 0)

    if cache_read_tokens > 0:
        savings_pct = (
            cache_read_tokens / (input_tokens + cache_read_tokens)
        ) * 100
        logger.info(
            f"Cache HIT: {cache_read_tokens} tokens read from cache "
            f"({savings_pct:.1f}% savings), {input_tokens} new tokens"
        )
    elif cache_creation > 0:
        logger.info(f"Cache MISS: {cache_creation} tokens written to cache")

Three fields to understand:

  • input_tokens — tokens billed at full price this call
  • cache_creation_input_tokens — tokens written to cache, billed at approximately 1.25x the base input token price (one-time cost per TTL window)
  • cache_read_input_tokens — tokens read from cache, billed at approximately 10% of normal (this is where the 90% savings comes from)
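The three fields can be combined into an estimate of what a call's input cost relative to sending the same tokens uncached. The sketch below uses the approximate multipliers above; effective_input_savings is a hypothetical helper, and the token counts in the example are made up for illustration.

```python
def effective_input_savings(
    input_tokens: int,
    cache_creation_tokens: int,
    cache_read_tokens: int,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> float:
    """Fraction of input cost saved versus sending every token uncached."""
    total = input_tokens + cache_creation_tokens + cache_read_tokens
    if total == 0:
        return 0.0
    billed = (
        input_tokens
        + cache_creation_tokens * write_multiplier
        + cache_read_tokens * read_multiplier
    )
    return 1.0 - billed / total


# Cache hit: 4,500 tokens read from cache, 500 new tokens.
hit = effective_input_savings(500, 0, 4500)
# Cache miss: the same 4,500 tokens are written to cache instead.
miss = effective_input_savings(500, 4500, 0)
```

In the hit example, 90% of the input tokens come from cache but the cost saving is about 81%, because cached tokens are still billed at roughly a tenth of the price. In the miss example the value is negative (about -22%), which reflects the write premium.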

The feature flag that controlled it all:

# shared/config/settings.py:86-98
anthropic_enable_prompt_caching: bool = Field(
    default=True, description="Enable Anthropic prompt caching (90% cost savings)"
)

# Bedrock fallback has no equivalent — hardcoded off
bedrock_enable_prompt_caching: bool = Field(
    default=False, description="Enable prompt caching (not supported on AWS Bedrock)"
)

A note on Bedrock: At the time Scrollbook was built, Bedrock did not support prompt caching. That gap made it a non-starter as the primary provider and locked the architecture to the direct Anthropic API. Bedrock has since caught up — prompt caching went GA in April 2025, with 1-hour TTL support added in January 2026. If you are on Bedrock today, the same technique applies.

When optimization becomes load-bearing infrastructure, provider lock-in follows. That was true when I built this. It is less true now.

Gotchas That Will Kill Your Cache Hit Rate

Prompt order is everything. If you accidentally flip the ordering — campaign context before system prompt, for example — every call is a full miss. The cache matches from the beginning of the prompt in sequence. There is no partial matching.

Dynamic content in the cached prefix. This is the hardest mistake to catch. Timestamps, counters, random values, user-specific data — anything that changes per-message, if it bleeds into the section you are trying to cache, every call is a miss. In Scrollbook, character HP and active conditions are inside the cached block intentionally, because Cipher controls those updates via tool calls. If your state changes externally, that content belongs below the breakpoint.

The 5-minute TTL cliff. Servers with long gaps between messages cold-start on every session. Write costs get paid repeatedly with zero read benefit. The math works at session density. For sparse traffic, run the calculation before assuming caching helps.

Whitespace and encoding. The prefix match is byte-level. A trailing space, a newline inconsistency, a Unicode normalization difference — any of it is a miss. Prompt assembly must run through a single code path. If you are concatenating in multiple places, you will have inconsistency you cannot see.

Don’t assume, verify. The logging block above takes ten minutes to add. Add it. The usage object will tell you immediately whether your cache hit rate matches your expectations. Ship it before you ship the feature.
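For the whitespace and encoding gotcha, one way to make drift visible is to assemble the prefix through a single normalizing function and log a short hash of the result next to the usage fields. This is a sketch, not Scrollbook's code: canonical_prefix and prefix_fingerprint are hypothetical helpers, and the normalization choices (NFC, LF newlines, stripped edges) are assumptions you would adapt.

```python
import hashlib
import unicodedata


def canonical_prefix(*parts: str) -> str:
    """Join prompt parts in one place with a fixed normalization."""
    cleaned = []
    for part in parts:
        text = unicodedata.normalize("NFC", part)   # one Unicode form
        text = text.replace("\r\n", "\n").strip()   # one newline style, no stray edges
        cleaned.append(text)
    return "\n\n".join(cleaned)


def prefix_fingerprint(prefix: str) -> str:
    """Short hash of the exact bytes sent; log it with each request."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]


# Two inputs that look the same on screen but differ in bytes.
a = canonical_prefix("rules ", "caf\u00e9\r\n")
b = canonical_prefix("rules", "cafe\u0301\n")
```

If the fingerprint for a guild changes between two messages in the same session, the prefix changed and the next call will be a miss, whatever the prompt looks like when printed.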

Why I Still Had to Shut It Down

The honest math: 90% off still leaves 10% of a cost that grows with usage.

At $0.50–1.50 per session across 867 servers with no subscription revenue, the situation improved dramatically and remained unsustainable. I had bought runway. I had not fixed the underlying problem.

There was no paywall. No subscription tier. No mechanism for Scrollbook to generate revenue as usage scaled. Every new server was a new cost center with nothing offsetting it. Prompt caching made the slope of that curve shallower. It did not change the direction.

Beyond the API costs: solo maintenance at that user count meant incident response, server reliability, and the full weight of being the only person accountable to 867 active communities. That is not something you can optimize your way out of.

What I would do differently: charge earlier. I know that is a strange thing to say about something I built so my family could play D&D together. But the moment it left that context and became someone else’s tool, it became a product. I just did not treat it like one. Even a small subscription changes the entire math and the entire psychology of the product.

I built the technical foundation first, optimized costs second, and never got to monetization. The right order is the reverse: figure out how this sustains itself, then build, then optimize. I applied that lesson to the next two products I shipped. ReptiDex launched with a three-tier subscription model on day one and hit 50 paid subscribers in 9 days. Geckistry collects payment at checkout. Both are still running.

What to Take From This

Prompt caching is a real, production-grade optimization. The cache_control field is two lines of code. A 90% reduction in inference cost is achievable if your prompt has a large, stable prefix and your traffic density is high enough for cache reads to consistently outpace cache writes.

If you are building on Claude at any meaningful scale, look at your prompt structure. If you are sending the same system prompt on every request and that prompt is long, you are paying for reprocessing you do not need.

But the bigger lesson is not technical. If you are building an AI product solo, get to monetization before you get to optimization. The optimization I built here was real and it worked. The product did not survive anyway — not because the code was wrong, but because I treated cost reduction as a substitute for a business model.

It is not.

I run Built By Dusty, a software studio that builds custom apps and sales platforms for animal breeders and small businesses. The AI cost optimization techniques from Scrollbook now power features in the breeding software I deliver to clients. If you’re building on Claude at scale, or you’re a founder with a product that has real infrastructure costs to manage, I’d like to hear from you.

All code references in this article are from the actual Scrollbook production codebase. The codebase is private, but every snippet shown here ran in production.

How is DNS resolved?

USER MAKING A REQUEST:
When a user visits a site by its domain name, the browser needs the IP address behind that name before it can open a connection, so it performs a DNS lookup.

How does it fetch the IP through DNS?

Before starting, let’s be clear about what DNS is. DNS works like a phone book: it maps a domain name to an IP address.

Let’s say we are searching for “WIKIPEDIA”.

First, the machine checks its own caches (browser, then operating system), asking, “Do you remember the IP of Wikipedia?” If it doesn’t find it there, the request is forwarded to the router/modem. If the router doesn’t know either, the query goes to the recursive resolver, usually run by the Internet Service Provider.

If the resolver also doesn’t have it cached, it queries the Root Name Server. From there, it is directed to the appropriate Top-Level Domain (TLD) server like .com, .in, or .org. Then it reaches the authoritative name server, where it finds Wikipedia’s IP from the zone file.

Finally, the IP address is returned back to the user.
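The lookup chain above can be sketched as a toy simulation in Python. Every name here (the cache dictionaries, the server tables, and the IP address, which comes from the reserved documentation range) is made up for illustration. A real resolver speaks the DNS protocol over the network rather than reading dictionaries.

```python
# Toy model of the lookup chain: local caches first, then root -> TLD -> authoritative.
# All data below is illustrative; 192.0.2.10 is a reserved documentation address.

ROOT_SERVER = {"org": "tld-org", "com": "tld-com"}                 # TLD -> TLD server
TLD_SERVERS = {"tld-org": {"wikipedia.org": "ns-wikipedia"}}       # domain -> authoritative server
AUTHORITATIVE = {"ns-wikipedia": {"wikipedia.org": "192.0.2.10"}}  # zone file records


def resolve(domain, browser_cache, router_cache, resolver_cache):
    """Return (ip, steps), where steps lists every place that was asked."""
    steps = []
    for name, cache in (
        ("browser/OS cache", browser_cache),
        ("router cache", router_cache),
        ("resolver cache", resolver_cache),
    ):
        steps.append(name)
        if domain in cache:
            return cache[domain], steps

    # Nothing cached: the resolver walks the hierarchy.
    tld = domain.rsplit(".", 1)[-1]
    steps.append("root server")
    tld_server = ROOT_SERVER[tld]
    steps.append(f"TLD server (.{tld})")
    auth_server = TLD_SERVERS[tld_server][domain]
    steps.append("authoritative server")
    ip = AUTHORITATIVE[auth_server][domain]

    # Each layer remembers the answer on the way back to the user.
    for cache in (resolver_cache, router_cache, browser_cache):
        cache[domain] = ip
    return ip, steps


if __name__ == "__main__":
    browser, router, resolver = {}, {}, {}
    print(resolve("wikipedia.org", browser, router, resolver))  # full walk
    print(resolve("wikipedia.org", browser, router, resolver))  # answered from cache
```

The second call stops at the first cache, which is why repeat visits resolve faster. Real caches also expire entries after a time-to-live, which this sketch leaves out.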

I built a tool to give AI coding agents persistent memory and a way smaller token footprint

Been building with AI coding agents for a while now: Claude Code, Cursor, Antigravity. Two things kept annoying me enough that I finally just built something to fix them.

The two problems

Problem 1: Your agent reads a 1000-line file and burns 8000 tokens doing it.

That’s before it’s done anything useful. Large codebases eat context fast, and once the window fills up, you’re either compressing (lossy) or starting over. Neither is great.

Problem 2: Every new session, your agent starts from zero.

It doesn’t remember that the API rate limit is 100 req/min. It doesn’t remember the weird edge case in the auth module you spent two hours debugging last week. It doesn’t remember anything. You either re-explain everything, or watch it rediscover the same gotchas.

These aren’t niche complaints — if you’re using AI agents to work on real codebases, you’ve hit both of these.

What I built

agora-code — persistent memory and context reduction for AI coding agents. Works with Claude Code, Cursor, and Gemini CLI. Survives context resets, new conversations, and agent restarts.

It’s early. It works. I want people to try it.

How it handles token bloat

Instead of letting the agent read raw source files, agora-code intercepts every file read and serves an AST summary instead.

Real example: summarizer.py is 885 lines. Raw read = 8,436 tokens. Summarized = 542 tokens. That’s a 93.6% reduction — and the agent still gets all the signal: class names, function signatures, docstrings, line numbers.
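To make the idea concrete, here is a minimal sketch of an AST-based summary using only Python's standard library. It is not agora-code's implementation: the function name summarize_python and the output format are made up for illustration, and it lists parameter names only, without annotations or defaults.

```python
import ast


def summarize_python(source: str) -> str:
    """Summarize Python source as one line per class or function."""
    tree = ast.parse(source)
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            header = f"class {node.name}"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(arg.arg for arg in node.args.args)
            header = f"def {node.name}({params})"
        else:
            continue
        doc = ast.get_docstring(node)
        first_doc_line = doc.splitlines()[0] if doc else ""
        entries.append((node.lineno, header, first_doc_line))

    lines = []
    for lineno, header, doc_line in sorted(entries):
        suffix = f"  # {doc_line}" if doc_line else ""
        lines.append(f"L{lineno}: {header}{suffix}")
    return "\n".join(lines)


SAMPLE = '''
class Summarizer:
    """Builds compact summaries."""

    def summarize(self, path):
        """Return a summary for one file."""
        return path

def main():
    pass
'''

if __name__ == "__main__":
    print(summarize_python(SAMPLE))
```

The function bodies never reach the output, which is where the token savings come from. The line numbers let an agent ask for the exact range it needs later.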

It works across languages too:

File type | Method | What you get
Python | stdlib AST | Classes, functions, signatures, docstrings
JS, TS, Go, Rust, Java + 160 more | tree-sitter | Same, with exact line numbers and parameter types
JSON / YAML | Structure parser | Top-level keys + shape
Markdown | Heading extractor | Headings + opening paragraph

Summaries are cached in SQLite, so re-reads on the same branch are instant.

How it handles memory loss

When a session ends, agora-code parses the transcript and extracts a structured checkpoint: what was the goal, what changed, what non-obvious things did you find, what’s next.

At the start of the next session, the relevant parts are injected automatically — last checkpoint, top learnings from recent commits on the branch, git state, symbol index for dirty files.

You can also manually store findings:

agora-code learn "POST /users rejects + in emails" --tags email,validation
agora-code learn "Rate limit is 100 req/min" --confidence confirmed

And recall them later (keyword search by default, semantic search if you wire up embeddings):

agora-code recall "email validation"
agora-code recall "rate limit"

Storage is three layers: an active session file (project-local, gitignored), a global SQLite DB scoped per project via git remote URL, and search (FTS5/BM25 always on, optional vector search).
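For a feel of what the always-on keyword layer could look like, here is a minimal sketch of learn/recall on top of SQLite's FTS5 with BM25 ranking. It is an assumption-laden illustration, not agora-code's schema: the table name learnings, its two columns, and the helper functions are hypothetical, and it needs a Python build whose SQLite has FTS5 enabled.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a store with a full-text index (hypothetical schema)."""
    conn = sqlite3.connect(path)
    # FTS5 virtual table; requires an SQLite build with FTS5 enabled.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS learnings USING fts5(text, tags)"
    )
    return conn


def learn(conn, text, tags=()):
    """Store one finding together with its tags."""
    conn.execute(
        "INSERT INTO learnings (text, tags) VALUES (?, ?)",
        (text, " ".join(tags)),
    )
    conn.commit()


def recall(conn, query, limit=5):
    """Return stored findings matching every word in the query, best match first."""
    # Quote each term so punctuation is not parsed as FTS5 query syntax.
    terms = " ".join('"' + t.replace('"', '""') + '"' for t in query.split())
    if not terms:
        return []
    rows = conn.execute(
        "SELECT text FROM learnings WHERE learnings MATCH ? "
        "ORDER BY bm25(learnings) LIMIT ?",
        (terms, limit),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_store()
    learn(store, "POST /users rejects + in emails", tags=("email", "validation"))
    learn(store, "Rate limit is 100 req/min")
    print(recall(store, "email validation"))
    print(recall(store, "rate limit"))
```

Indexing the tags alongside the text is what lets "email validation" find a note whose body only says "emails". Keyword search has no stemming by default, which is one reason an optional semantic layer is useful.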

What happens automatically (Claude Code)

Once hooks are installed, you don’t have to think about most of this:

When you… | agora-code automatically…
Start a session | Injects last checkpoint + relevant learnings
Submit a prompt | Recalls relevant past findings, sets session goal
Read a file > 100 lines | Summarizes via AST and serves the summary instead
Edit a file | Tracks the diff, re-indexes symbols
Run git commit | Derives learnings from the commit
Context window compresses | Checkpoints before, re-injects after
End a session | Parses transcript → structured checkpoint in DB

Getting started

pip install git+https://github.com/thebnbrkr/agora-code.git

Then in your project:

cd your-project
agora-code install-hooks --claude-code

For Cursor and Gemini CLI, you copy a config directory into your project root — full instructions in the README.

At the start of every Claude Code session, run /agora-code to load the skill. That’s the bit that tells the agent when to summarize, when to inject context, when to save progress.

It’s early

APIs may change. Things might break. I’m actively working on it — semantic search is in progress, automated hook setup for Cursor and Gemini is on the roadmap.

If you try it and hit something weird, open an issue. If you want to add hook support for a different editor, the pattern is consistent across .claude/hooks/ and .cursor/hooks/ — PRs welcome.

GitHub: https://github.com/thebnbrkr/agora-code

Screenshot: https://imgur.com/a/APaiNnl

Would love to hear if this solves the same pain points for others, or if you’re handling token bloat / memory loss differently. Drop a comment.

Filter Assignments

DB- TASK 2

Bonus Q/A

1) Find all movies where the special features are not listed (i.e., special_features is NULL).

cmd:
SELECT title FROM film WHERE special_features IS NULL;

sample op:

title

Academy Dinosaur
Ace Goldfinger
Adaptation Holes
Affair Prejudice
African Egg

2) Find all movies where the rental duration is more than 7 days.

cmd:
SELECT title, rental_duration
FROM film
WHERE rental_duration > 7;

sample op:
title | rental_duration
---------------------+-----------------
Alamo Videotape | 8
Brotherhood Blanket | 9
Chicago North | 10
Dragon Squad | 8

3) Find all movies that have a rental rate of $4.99 and a replacement cost of more than $20.

cmd:
SELECT title, rental_rate, replacement_cost FROM film WHERE rental_rate = 4.99 AND replacement_cost > 20;

sample op:
title | rental_rate | replacement_cost
--------------------+-------------+------------------
Ace Goldfinger | 4.99 | 22.99
Airport Pollock | 4.99 | 24.99
Bright Encounters | 4.99 | 21.99

4) Find all movies that have a rental rate of $0.99 or a rating of ‘PG-13’.

cmd:
SELECT title, rental_rate, rating FROM film WHERE rental_rate = 0.99 OR rating = 'PG-13';

sample op:
title | rental_rate | rating
-------------------+-------------+--------
Academy Dinosaur | 0.99 | PG
Alien Center | 2.99 | PG-13
Angels Life | 0.99 | PG-13

5) Retrieve the first 5 rows of movies sorted alphabetically by title.

cmd:
SELECT title FROM film ORDER BY title ASC LIMIT 5;

sample op:

title

Academy Dinosaur
Ace Goldfinger
Adaptation Holes
Affair Prejudice
African Egg

6) Skip the first 10 rows and fetch the next 3 movies with the highest replacement cost.

cmd:
SELECT title, replacement_cost
FROM film
ORDER BY replacement_cost DESC
LIMIT 3 OFFSET 10;

sample op:
title | replacement_cost
-------------------+------------------
Anthem Luke | 24.99
Apollo Teen | 24.99
Arabia Dogma | 24.99

7) Find all movies where the rating is either ‘G’, ‘PG’, or ‘PG-13’.
cmd:
SELECT title, rating FROM film WHERE rating IN ('G', 'PG', 'PG-13');

sample op:
title | rating
-------------------+--------
Academy Dinosaur | PG
Ace Goldfinger | G
Alien Center | PG-13

8) Find all movies with a rental rate between $2 and $4.

cmd:
SELECT title, rental_rate FROM film WHERE rental_rate BETWEEN 2 AND 4;

sample op:
title | rental_rate
-------------------+-------------
Adaptation Holes | 2.99
Alien Center | 2.99
Apollo Teen | 3.99

9) Find all movies with titles that start with ‘The’.

cmd:
SELECT title FROM film WHERE title LIKE 'The%';

sample op:

title

The Matrix
The Pianist
The Others
The Truman Show

10) Find the first 10 movies with a rental rate of $2.99 or $4.99, a rating of ‘R’, and a title containing the word “Love”.

cmd:
SELECT title, rental_rate, rating
FROM film
WHERE rental_rate IN (2.99, 4.99)
AND rating = 'R'
AND title LIKE '%Love%'
LIMIT 10;

sample op:
title | rental_rate | rating
-----------------+-------------+--------
Crazy Love | 2.99 | R
Dangerous Love | 4.99 | R
Endless Love | 2.99 | R

11) Find all movies where the title contains the % symbol.

cmd:
SELECT title FROM film WHERE title LIKE '%\%%' ESCAPE '\';

sample op:

title

100% Love
50% Chance

12) Find all movies where the title contains an underscore (_).

cmd:
SELECT title FROM film WHERE title LIKE '%\_%' ESCAPE '\';

sample op:

title

Mission_Impossible
Fast_Furious
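A small runnable check of the escaping idea behind questions 11 and 12, sketched in Python against an in-memory SQLite table. The three titles are made up, and SQLite stands in for the real database here, so treat it as an illustration of LIKE ... ESCAPE rather than a replay of the assignment.

```python
import sqlite3


def make_demo_db():
    """In-memory table with a few made-up titles (not the real film table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE film (title TEXT)")
    conn.executemany(
        "INSERT INTO film (title) VALUES (?)",
        [("100% Love",), ("Mission_Impossible",), ("Academy Dinosaur",)],
    )
    return conn


def titles_containing(conn, literal):
    """Find titles containing `literal`, treating % and _ as plain characters."""
    escaped = (
        literal.replace("\\", "\\\\")  # escape the escape character first
        .replace("%", "\\%")
        .replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title FROM film WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        (f"%{escaped}%",),
    )
    return [row[0] for row in rows]


def titles_like(conn, pattern):
    """Run a raw LIKE pattern with no escaping, for comparison."""
    rows = conn.execute(
        "SELECT title FROM film WHERE title LIKE ? ORDER BY title", (pattern,)
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = make_demo_db()
    print(titles_containing(db, "%"))  # only the title with a literal %
    print(titles_containing(db, "_"))  # only the title with a literal _
    print(titles_like(db, "%_%"))      # unescaped: matches every non-empty title
```

Without the escape character, _ matches any single character and % matches any run of characters, so the unescaped pattern returns every row.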

13) Find all movies where the title starts with “A” or “B” and ends with “s”.

cmd:
SELECT title FROM film WHERE (title LIKE 'A%' OR title LIKE 'B%') AND title LIKE '%s';

sample op:

title

Angels Life
Backwards Towns
Brothers Dreams

14) Find all movies where the title contains “Man”, “Men”, or “Woman”.

cmd:
SELECT title FROM film WHERE title LIKE '%Man%' OR title LIKE '%Men%' OR title LIKE '%Woman%';

sample op:

title

Spider Man
X Men United
Wonder Woman

15) Find all movies with titles that contain digits (e.g., “007”, “2”, “300”).

cmd:
SELECT title FROM film WHERE title ~ '[0-9]';

sample op:

title

007 Bond
300 Spartans
2 Fast 2 Furious

16) Find all movies with titles containing a backslash (\).

cmd:
SELECT title FROM film WHERE title LIKE '%\\%';

sample op:

title

Escape Reality
Path Finder

17) Find all movies where the title contains the words “Love” or “Hate”.

cmd:
SELECT title FROM film WHERE title LIKE '%Love%' OR title LIKE '%Hate%';

sample op:

title

Crazy Love
Endless Love
Hate Story
Love Actually

18) Find the first 5 movies with titles that end with “er”, “or”, or “ar”.

cmd:
SELECT title
FROM film
WHERE title LIKE '%er'
OR title LIKE '%or'
OR title LIKE '%ar'
LIMIT 5;

sample op:

title

Joker
Creator
Avatar
Doctor
Warrior

Code Autopsy #1: How ~90 Lines Turned System Monitoring Into A Conversation


Part of the PC_Workman build-in-public series. Code Autopsy drops every Wednesday.

The Problem: Numbers Without Answers

You open Task Manager.

“CPU: 87%”

Cool.

But WHY 87%?

Is that normal? Should you worry? What process caused it? When did it start?

Task Manager doesn’t answer. HWMonitor doesn’t answer. MSI Afterburner doesn’t answer.

They show you WHAT is happening. Never WHY.

That’s the gap PC_Workman fills.

PC Workman 1.6.8 - hck_GPT in action. Service Setup: quick access for disabling services you don't need or won't use (Bluetooth, print, fax). Today Report: shows whether session data is being collected correctly, plus daily usage averages and alerts for suspected temperature or voltage spikes.

The Solution: EventDetector

After 800 hours building PC_Workman (most of it on a laptop that peaks at 94°C), I realized: users don’t need more data. They need context.

So I built EventDetector.

30 lines of Python that turn monitoring into a conversation.

Here’s how it works.

Step 1: Track YOUR Baseline (Not Generic Averages)

Most tools compare against hardcoded thresholds:

  • “50% CPU is normal”
  • “60% RAM is high”
  • “80°C is warm”

Problem: Your normal isn’t my normal.

A gaming PC idling at 30% CPU? Normal.

A lightweight laptop idling at 30% CPU? Something’s wrong.

EventDetector tracks YOUR baseline from the last 10 minutes:

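def _get_baseline(self, now):
    """Get recent baseline averages from minute_stats.
    Cached for 60 seconds to avoid excessive queries.
    """
    # Serve the cached baseline while it is still fresh
    if self._baseline_cache and now - self._baseline_cache_time < 60:
        return self._baseline_cache

    cutoff = now - SPIKE_BASELINE_WINDOW  # 10 minutes

    # conn: an open sqlite3 connection (how it is obtained is not shown here)
    row = conn.execute("""
        SELECT AVG(cpu_avg) as cpu_avg,
               AVG(ram_avg) as ram_avg,
               AVG(gpu_avg) as gpu_avg,
               AVG(cpu_temp) as cpu_temp,
               AVG(gpu_temp) as gpu_temp
        FROM minute_stats
        WHERE timestamp >= ?
    """, (cutoff,)).fetchone()

    # Key the averages by column name so callers can use baseline['cpu_avg']
    columns = ('cpu_avg', 'ram_avg', 'gpu_avg', 'cpu_temp', 'gpu_temp')
    self._baseline_cache = dict(zip(columns, row))
    self._baseline_cache_time = now
    return self._baseline_cache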

Key insight: The baseline is YOU. Not everyone. Just you.
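To see the rolling baseline in isolation, here is a minimal self-contained sketch against an in-memory SQLite table. It is a simplified stand-in, not the PC_Workman code: the two-column minute_stats table, the helper names, and the sample readings are assumptions for illustration.

```python
import sqlite3

SPIKE_BASELINE_WINDOW = 10 * 60  # seconds; mirrors the 10-minute window described above


def make_demo_db():
    """Create a reduced minute_stats table (only two metrics, for brevity)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE minute_stats (timestamp REAL, cpu_avg REAL, ram_avg REAL)"
    )
    return conn


def get_baseline(conn, now):
    """Average each metric over the last SPIKE_BASELINE_WINDOW seconds."""
    row = conn.execute(
        "SELECT AVG(cpu_avg), AVG(ram_avg) FROM minute_stats WHERE timestamp >= ?",
        (now - SPIKE_BASELINE_WINDOW,),
    ).fetchone()
    # AVG over zero rows yields NULL, so values are None when there is no recent data.
    return {"cpu_avg": row[0], "ram_avg": row[1]}


if __name__ == "__main__":
    now = 1_700_000_000.0  # fixed timestamp so the demo is reproducible
    db = make_demo_db()
    db.executemany(
        "INSERT INTO minute_stats VALUES (?, ?, ?)",
        [
            (now - 1200, 95.0, 80.0),  # 20 minutes old: outside the window
            (now - 300, 40.0, 50.0),
            (now - 60, 44.0, 54.0),
        ],
    )
    print(get_baseline(db, now))
```

The 95% reading from 20 minutes ago does not affect the result. A caller should handle the None case, for example by skipping spike checks until enough samples exist.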

PC Workman 1.6.8 - Events detector for hck_GPT insights. Based on long-term monitoring: CPU, GPU, RAM. EventDetector code with highlights on baseline, delta, rate limiting, severity

Step 2: Calculate Delta (Current vs YOUR Normal)

Once we have YOUR baseline, detecting spikes is simple math:

def _check_metric(self, now, metric_name, current_val, 
                  baseline_val, threshold, description):
    """Check if a metric exceeds its threshold above baseline"""

    delta = current_val - baseline_val

    if delta < threshold:
        return  # No spike - you're within YOUR normal range

Example:

  • Your CPU baseline (last 10 min): 42%
  • Current CPU: 87%
  • Delta: +45%
  • Threshold: 20%

Result: Spike detected. But we’re not done yet.

Step 3: Rate Limiting (No Alert Spam)

Early versions of EventDetector had a problem: alert spam.

Chrome spikes CPU every 30 seconds? You’d get 120 alerts per hour.

Useless.

Solution: Rate limiting.

# In __init__: rate limiting state, {metric_name: last_event_timestamp}
self._last_event_time = {}

def _check_metric(self, now, metric_name, current_val,
                  baseline_val, threshold, description):
    # ... delta calculation ...

    # Rate limiting
    last_time = self._last_event_time.get(metric_name, 0)
    if now - last_time < SPIKE_COOLDOWN:  # 5 minutes
        return  # Too soon since last alert

    # Log the event
    self._last_event_time[metric_name] = now

Result: Max 1 alert per metric per 5 minutes. No spam.

Step 4: Severity Levels (Critical vs Warning vs Info)

Not all spikes are equal.

CPU spiking 21% above baseline? Worth noting.

CPU spiking 60% above baseline? Drop everything.

EventDetector categorizes:

# Determine severity
if delta >= threshold * 2:
    severity = 'critical'  # 🔴
elif delta >= threshold * 1.5:
    severity = 'warning'   # ⚠️
else:
    severity = 'info'      # ℹ️

Example thresholds:

  • CPU threshold: 20%
  • Delta 40%+: Critical
  • Delta 30%+: Warning
  • Delta 20-29%: Info

Result: Alerts match urgency.
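Steps 2 to 4 fit together in a few lines. The sketch below is a simplified, in-memory illustration rather than the actual EventDetector: the class name SpikeChecker, its check method, and the returned dictionary are hypothetical, while the delta test, cooldown, and severity multipliers follow the snippets above.

```python
SPIKE_COOLDOWN = 5 * 60  # seconds between alerts for the same metric


class SpikeChecker:
    """Delta check + per-metric cooldown + severity, with no storage layer."""

    def __init__(self):
        # {metric_name: timestamp of the last reported event}
        self._last_event_time = {}

    def check(self, now, metric, current, baseline, threshold):
        """Return an event dict for a reportable spike, otherwise None."""
        delta = current - baseline
        if delta < threshold:
            return None  # within the normal range

        last = self._last_event_time.get(metric)
        if last is not None and now - last < SPIKE_COOLDOWN:
            return None  # too soon since the last alert for this metric
        self._last_event_time[metric] = now

        if delta >= threshold * 2:
            severity = "critical"
        elif delta >= threshold * 1.5:
            severity = "warning"
        else:
            severity = "info"
        return {"metric": metric, "severity": severity, "delta": delta}


if __name__ == "__main__":
    checker = SpikeChecker()
    print(checker.check(1000.0, "cpu", 87, 42, 20))  # first spike is reported
    print(checker.check(1060.0, "cpu", 90, 42, 20))  # 60 s later: suppressed
    print(checker.check(1300.0, "cpu", 90, 42, 20))  # cooldown over: reported again
```

With the article's example numbers (baseline 42, current 87, threshold 20) the delta of 45 is at least twice the threshold, so the sketch classifies it as critical. Storing None instead of 0 for "never alerted" is a choice made here so the demo works with small timestamps.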

The Final Output: Context, Not Just Numbers

Here’s what you see in PC_Workman when a spike happens:

Before (Task Manager):

CPU: 87%

After (PC_Workman):

🔴 CPU spike: 87% (baseline: 42%, delta: +45%)
Chrome.exe - started 3 hours ago

Same data. Different story.

One gives you anxiety. The other gives you action.

PC Workman 1.6.8 - My PC - Center of Actions.
STATS & ALERTS - Long-term monitoring of component and process usage, plus time-travel alerts for temperature and voltage spikes or other suspicious moments. Optimization & Services - Tools to optimize and improve PC performance. First Setup & Drivers - Everything for setting up a new device or a fresh OS. Stability Tests - Checks that PC Workman and its database are working correctly. Your Account-Details - Soon :)

Implementation Notes

Handles 5 Metrics With Same Logic

The beauty of this design: reusable.

Same _check_metric function handles:

  • CPU usage
  • RAM usage
  • GPU usage
  • CPU temperature
  • GPU temperature

def check_and_log_spike(self, cpu_avg, ram_avg, gpu_avg,
                        cpu_temp=None, gpu_temp=None):
    now = time.time()  # assumes `import time` at module level
    baseline = self._get_baseline(now)

    # Check each metric with the same logic
    self._check_metric(now, 'cpu', cpu_avg,
                       baseline['cpu_avg'],
                       SPIKE_THRESHOLD_CPU, 'CPU usage')

    self._check_metric(now, 'ram', ram_avg,
                       baseline['ram_avg'],
                       SPIKE_THRESHOLD_RAM, 'RAM usage')

    # ... and so on for GPU usage and both temperatures

Clean. Maintainable. Scalable.

Performance: Cached Baselines

Baseline queries hit SQLite. Could be slow.

Solution: 60-second cache.

if now - self._baseline_cache_time < 60 and self._baseline_cache:
    return self._baseline_cache  # Use cached data

Result: Query once per minute, not once per second.

Storage: SQLite Events Table

All events logged to database:

INSERT INTO events
(timestamp, event_type, severity, metric, value, 
 baseline, process_name, description)
VALUES (?, ?, ?, ?, ?, ?, ?, ?)

Benefits:

  • Historical tracking (what spiked last week?)
  • Pattern detection (Chrome spikes every Tuesday?)
  • Exportable data
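As an illustration of the historical-tracking benefit, here is a minimal sketch of an events table and a "what spiked in the last week?" query. The column names come from the INSERT statement above. The column types, the id column, and the helper functions are assumptions, not the real PC_Workman schema.

```python
import sqlite3

EVENT_COLUMNS = (
    "timestamp", "event_type", "severity", "metric",
    "value", "baseline", "process_name", "description",
)


def create_events_table(conn):
    """Create an events table with assumed column types."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id INTEGER PRIMARY KEY,
            timestamp REAL NOT NULL,
            event_type TEXT,
            severity TEXT,
            metric TEXT,
            value REAL,
            baseline REAL,
            process_name TEXT,
            description TEXT
        )
    """)


def log_event(conn, **fields):
    """Insert one event; fields that are not passed are stored as NULL."""
    values = tuple(fields.get(col) for col in EVENT_COLUMNS)
    conn.execute(
        "INSERT INTO events (timestamp, event_type, severity, metric, value, "
        "baseline, process_name, description) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        values,
    )
    conn.commit()


def spikes_since(conn, cutoff):
    """Return (metric, severity, value, baseline, process_name) rows since cutoff."""
    rows = conn.execute(
        "SELECT metric, severity, value, baseline, process_name FROM events "
        "WHERE timestamp >= ? ORDER BY timestamp",
        (cutoff,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    now = 1_700_000_000.0  # fixed timestamp so the demo is reproducible
    day = 24 * 60 * 60
    db = sqlite3.connect(":memory:")
    create_events_table(db)
    log_event(db, timestamp=now - 10 * day, event_type="spike", severity="info",
              metric="ram", value=70.0, baseline=48.0, process_name="Code.exe")
    log_event(db, timestamp=now - 1 * day, event_type="spike", severity="critical",
              metric="cpu", value=87.0, baseline=42.0, process_name="Chrome.exe")
    print(spikes_since(db, now - 7 * day))
```

Because every event carries its own baseline, a later report can show how unusual a spike was at the time without recomputing anything.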

What I Learned Building This

1. Users Don’t Need More Data

Early versions of PC_Workman showed 20+ metrics.

Users ignored them all.

Lesson: Context, not quantity.

2. Rate Limiting Is User Experience

First version: no rate limiting.

Result: 500 alerts per hour. Unusable.

Lesson: Silence is a feature.

3. Personalization.

“50% CPU is high” works for nobody.

YOUR 50% vs MY 50% = different stories.

Lesson: Baselines must be personal.

PC Workman 1.6.8 - hck_GPT Insights

The Numbers

EventDetector stats:

  • ~30 lines core logic
  • Handles 5 metrics
  • Max 1 alert per metric per 5 min
  • Baseline cached 60 sec
  • 3 severity levels

PC_Workman stats:

  • 800+ hours development
  • Built on 94°C laptop
  • v1.6.8 current (v2.0 -> Microsoft Store, Q3 2026)
  • 60+ downloads
  • 17 stars
  • Open source, MIT licensed

Try It Yourself

PC_Workman is open source.

EventDetector is in hck_stats_engine/events.py.

Download, run, break it, improve it.

GitHub: github.com/HuckleR2003/PC_Workman_HCK
The file shown in this post: PC_Workman_HCK/hck_stats_engine/events.py

Building in public. Code Autopsy every Wednesday.

Follow the journey:

  • Twitter: @hck_lab
  • LinkedIn: Marcin Firmuga
  • Everything: linktr.ee/marcin_firmuga

Next Week: Wednesday Code Autopsy #2

Topic: ProcessAggregator – how PC_Workman tracks which apps eat your CPU without destroying performance.
See you Wednesday.

Questions? Comments? Roasts? I’m building in public. Feedback welcome.

About the Author

I’m Marcin Firmuga. Solo developer and founder of HCK_Labs.

I created PC Workman, an open-source, AI-powered PC resource monitor built entirely from scratch on dying hardware during warehouse shifts in the Netherlands.

This is the first time I’ve given one of my projects a real, dedicated home.

Before this: game translations, PC technician internships, warehouse operations in multiple countries, and countless failed projects I never finished.

But this one? This one stuck.
800+ hours of code. 4 complete UI rebuilds. 16,000 lines deleted.
3 AM all-nighters. Energy drinks and toast.

And finally, an app I wouldn’t close in 5 seconds.
That’s the difference between building and shipping.

PC_Workman is the result.