Hashtag Jakarta EE #320

Welcome to issue number three hundred and twenty of Hashtag Jakarta EE!

Oops, I am a little late publishing Hashtag Jakarta EE number 320. I am currently on my way home from Johannesburg and a successful JakartaOne by Jozi-JUG. I will write more about the event in a separate post next week. Speaking of next week: immediately after returning from South Africa, I will head to San Jose, California for DeveloperWeek.

Since I have been busy traveling, I don’t have much of an update on what’s going on with Jakarta EE 12, but according to the minutes of this week’s call, the estimated release date looks like it will be pushed out to Q4 this year. Delivering it in 2026 is the important aspect, not necessarily the month. This gives the platform team and the individual specifications a little more time to get their work done.

Registration for Open Community eXperience in Brussels, April 21-23 is open. I have a 20% discount code, so DM me if you are interested in attending this conference.

Ivar Grimstad


Pandas 3.0’s PyArrow String Revolution: A Deep Dive into Memory and Performance

Introduction

Pandas 3.0 made a game-changing decision: PyArrow-backed strings are now the default. Instead of storing strings as Python objects (the old object dtype), pandas now uses Apache Arrow’s columnar format with the new string[pyarrow] dtype.

But here’s the question that matters: How much does this new string dtype actually improve performance and memory usage in real-world scenarios?

To find out, I ran comprehensive benchmarks across diverse datasets and common string operations. The results? 51.8% memory savings on average, with operations running 2-27x faster.

This isn’t a theoretical improvement; it’s a fundamental shift in how pandas handles string data.

The Results: Summary Dashboard

Let me start with the headline numbers, then we’ll dive into how I got them.

Result Summary

The Four Key Metrics

1. 51.8% Memory Savings

Across all test datasets, the new PyArrow string dtype used half the memory of the old object dtype. This isn’t a marginal improvement; it’s transformative for memory-constrained environments.

2. 6.17x Average Operation Speedup

String operations aren’t just more memory-efficient: they’re dramatically faster. On average, operations like str.lower(), str.contains(), and str.len() run 6x faster with PyArrow strings.

Some operations are even more impressive:

  • str.len(): 27x faster
  • str.startswith(): 16x faster
  • str.endswith(): 15x faster

3. 889 MB Total Memory Saved

Across our test datasets (totaling 645 MB on disk), we saved nearly 1 GB of RAM. For a real data pipeline processing dozens of datasets, this compounds quickly.

4. Memory Overhead: The Game Changer

The bottom chart reveals something crucial about how pandas handles strings:

Old string dtype (object):

  • CSV files on disk: 645 MB
  • Loaded into pandas: 1,714 MB
  • Memory overhead: 165.7% (more than doubles!)

New string dtype (PyArrow):

  • CSV files on disk: 645 MB
  • Loaded into pandas: 825 MB
  • Memory overhead: 27.9% (minimal overhead)

What does this mean?

When pandas reads a CSV file, it doesn’t just store the raw bytes: it creates in-memory data structures for fast operations. The old object dtype was incredibly inefficient, essentially duplicating string data multiple times. The new PyArrow string dtype keeps overhead minimal with a smarter memory layout.

This is the difference between pandas 2’s Python-object approach and pandas 3’s columnar Arrow approach.

The Methodology: Why 5 Different Datasets?

Now that you’ve seen the results, let me explain how I tested this. Real-world data comes in many shapes and sizes. A single benchmark on one type of data wouldn’t tell the whole story.

That’s why I created 5 distinct datasets, each representing common patterns you’ll encounter in production:

1. Low Cardinality Dataset (1M rows)

What it is: Repeated categorical values like product categories, status codes, regions, and priorities.

Why it matters: This is typical of business data – think order statuses, customer segments, or department codes. The same values repeat millions of times.

Example columns:

  • category: “Electronics”, “Clothing”, “Food” (10 unique values)
  • status: “pending”, “completed”, “failed” (4 unique values)

2. High Cardinality Dataset (1M rows)

What it is: Mostly unique strings like user IDs, email addresses, and session tokens.

Why it matters: When every row is different (like customer emails or transaction IDs), pandas can’t use simple optimizations. This tests worst-case scenarios.

Example columns:

  • user_id: “USER_00000001”, “USER_00000002”… (1M unique)
  • email: “user123@example45.com” (1M unique)

3. Mixed String Lengths Dataset (1M rows)

What it is: A combination of short codes (2-5 chars), medium names (20-50 chars), and long descriptions (100-300 chars).

Why it matters: Real data isn’t uniform. You might have product codes next to customer addresses next to order notes. This tests how pandas handles variable-length strings.

4. Dataset With Nulls (1M rows)

What it is: Data with missing values (10-33% nulls in different columns).

Why it matters: Messy data is reality. How does pandas 3.0 handle missing string data compared to pandas 2?

5. Large Dataset (10M rows)

What it is: A scaled-up version to test performance at scale.

Why it matters: Memory savings that look good at 1M rows might behave differently at 10M rows. This dataset validates that the findings hold at scale.
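
These are not the author's exact generators, but a hedged sketch of how the first two dataset shapes can be synthesized:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000

# Low cardinality: a handful of values repeated millions of times.
low = pd.DataFrame({
    "category": rng.choice(["Electronics", "Clothing", "Food"], size=n),
    "status": rng.choice(["pending", "completed", "failed"], size=n),
})

# High cardinality: a unique identifier per row.
high = pd.DataFrame({"user_id": [f"USER_{i:08d}" for i in range(n)]})

print(low["category"].nunique(), high["user_id"].is_unique)  # 3 True
```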

Memory Savings by Dataset Type

Memory Consumption Comparison

The memory savings from PyArrow strings vary significantly by dataset characteristics:

Best Case: Low Cardinality Data (-71.6%)

When data has repeated values (like categories), PyArrow strings shine:

  • Object dtype: 219 MB
  • PyArrow string dtype: 62 MB
  • Savings: 71.6%

Worst Case: Mixed String Lengths (-30.6%)

Variable-length strings see smaller (but still significant) savings:

  • Object dtype: 383 MB
  • PyArrow string dtype: 266 MB
  • Savings: 30.6%

The Pattern

Notice how savings correlate with data characteristics:

  1. Repeated values (low cardinality) → Best savings (64-72%)
  2. Unique values (high cardinality) → Good savings (53-55%)
  3. Variable length (mixed sizes) → Moderate savings (31%)

Takeaway: PyArrow strings help everywhere, but they’re especially powerful for categorical-like data.

Performance: Operation-Specific Speedups

Operation Speedup Heatmap

This heatmap shows how much faster PyArrow strings are compared to object dtype for common string operations (values > 1.0 mean PyArrow is faster).

The Fastest Operations

  1. str.len(): 10-27x faster

  2. str.startswith() and str.endswith(): 11-18x faster

  3. str.contains(): 3-5x faster

  4. str.split(): 1-8x faster

The Pattern

Read operations (like len(), startswith()) → Massive speedups (10-27x)

  • These operations just examine existing data without modification

Transform operations (like replace(), split()) → Good speedups (2-5x)

  • These operations create new data, which limits the performance gains

The Trade-off: CSV Loading Time

Load Time Comparison

There’s no such thing as a free lunch. While PyArrow strings save memory and run operations faster, loading CSV files is 9-61% slower.

Why the Slowdown?

When pandas reads a CSV with PyArrow strings enabled:

  1. It parses the text (same as before)
  2. It converts strings to PyArrow’s columnar format (extra step)
  3. This conversion involves building dictionary encodings and optimized memory structures

Pandas is doing more work upfront to enable better performance downstream.

Real-world impact: On our 10M row dataset, the difference is 1.63s vs 2.02s, an extra 0.4 seconds for 10 million rows. For many data pipelines, this upfront cost might be negligible compared to the 2-27x speedup in subsequent operations.

Pros and Cons: Should You Adopt PyArrow Strings?

Benefits of PyArrow String Dtype

  1. Massive Memory Savings (30-72%)

  2. Dramatically Faster String Operations (2-27x)

  3. Minimal Memory Overhead (28% vs 166%)

  4. Modern Data Ecosystem Integration

Trade-offs to Consider

  1. Slower CSV Loading (9-61% slower)

    • Initial data ingestion takes longer
    • May impact workflows that repeatedly load small files
    • The trade-off: slower start, much faster operations
  2. Behavioral Changes

    • String dtype behaves differently from object dtype in edge cases
    • Need to update code that explicitly checks for object dtype
    • Testing required for migration

The Recommendation

For most data workflows, PyArrow strings are a clear win. The memory and performance benefits far outweigh the trade-offs.

Consider staying with object dtype if:

  • You rarely work with string columns
  • Your datasets easily fit in memory
  • Load time is critical and you rarely perform string operations
  • You have legacy code that’s deeply coupled to object dtype behavior

Definitely adopt PyArrow strings if:

  • You process large datasets with text data
  • String operations are a significant part of your workflow
  • Memory is a constraint in your environment
  • You’re building production data pipelines
  • You work with modern data tools (Parquet, Arrow, DuckDB, etc.)

Conclusion

Our comprehensive analysis across 5 diverse datasets and 15+ string operations conclusively shows that PyArrow-backed strings deliver transformative improvements:

  • 51.8% average memory savings across all dataset types
  • 6.17x average operation speedup for string operations
  • Minimal memory overhead (28% vs 166% with Python objects)

PyArrow strings aren’t just an incremental improvement; they’re a fundamental reimagining of how pandas handles text data. By adopting Apache Arrow’s proven columnar format, pandas has joined the modern data ecosystem while delivering massive performance and memory improvements.

For most data practitioners working with text data, the question isn’t “Should I use PyArrow strings?” but rather “How quickly can I migrate?”

Questions or feedback? Feel free to open an issue or contribute to this analysis! The code we used in this analysis has been uploaded to this repo.

Reporails: Copilot adapter, built with copilot, for copilot.

This is a submission for the GitHub Copilot CLI Challenge

What I Built

Reporails is a validator for AI agent instruction files: CLAUDE.md, AGENTS.md, copilot-instructions.md. It scores your files, tells you what’s missing, and helps you fix it.

The project already supported Claude Code and Codex. For this challenge, I added GitHub Copilot CLI as a first-class supported agent – using Copilot CLI itself to build the adapter.

The architecture was already multi-agent by design. A .shared/ directory holds agent-agnostic workflows and knowledge. Each agent gets its own adapter that wires into the shared content. Claude does it through .claude/skills/, Copilot through .github/copilot-instructions.md.

Adding Copilot took 113 lines. Not because the work was trivial – but because the architecture was ready.

Repos:

  • CLI: reporails/cli (v0.3.0)
  • Rules: reporails/rules (v0.4.0)
  • Recommended: reporails/recommended (v0.2.0)

Demo

After adding Copilot support, each agent gets its own rule set with no cross-contamination:

| Agent   | Rules | Breakdown                                 |
|---------|-------|-------------------------------------------|
| Copilot | 29    | 30 CORE – 1 excluded + 0 COPILOT-specific |
| Claude  | 39    | 30 CORE – 1 excluded + 10 CLAUDE-specific |
| Codex   | 37    | 30 CORE + 7 CODEX-specific                |

Run it yourself:

npx @reporails/cli check --agent copilot

My Experience with GitHub Copilot CLI

It understood the architecture immediately

I explained the .shared/ folder — that it was created specifically so both Claude and Copilot (and other agents) can reference the same workflows and knowledge without duplication. Copilot got it on the first exchange:

Copilot understanding .shared/ architecture


The key insight it surfaced: “The .shared/ content is already agent-agnostic. Both agents reference the same workflows. No duplication is needed – just different entry points.”

That’s exactly right. Claude reaches shared workflows through /generate-rule → .claude/skills/ → .shared/workflows/rule-creation.md. Copilot reads instructions → .shared/workflows/rule-creation.md. Same destination, different front doors.

What it built

Copilot created the full adapter in three phases:

  1. Foundation – .github/copilot-instructions.md, agents/copilot/config.yml, updated backbone.yml, verified test harness supports --agent copilot
  2. Workflow Wiring – entry points in copilot-instructions.md, context-specific conditional instructions, wired to .shared/workflows/ and .shared/knowledge/
  3. Documentation – updated README and CONTRIBUTING with agent-agnostic workflow guidance

Copilot Contribution Parity Complete


The bug it found (well, helped find)

While testing the Copilot adapter, I discovered that the test harness had a cross-contamination bug. When running --agent copilot, it was testing CODEX rules too — because _scan_root() scanned ALL agents/*/rules/ directories indiscriminately.

The fix was three lines of Python:

# If agent is specified, only scan that agent's rules directory
if agent and agent_dir.name != agent:
    continue

Test Harness Agent Isolation Fix

The model selector surprise

When I opened the Copilot CLI model selector, the default model was Claude Sonnet 4.5. The irony of building a Copilot adapter using Copilot CLI running Claude was not lost on me.

What worked, honestly

Copilot CLI understood multi-agent architecture without hand-holding. It generated correct config files matching existing adapter patterns. The co-author signature was properly included in all commits. It didn’t try to duplicate content that was already shared – it just wired the entry points.

The whole experience reinforced something I’ve been thinking about: the tool matters less than the architecture underneath. If your project is structured well, any competent agent can extend it. That’s the whole point of reporails – making sure your instruction files are good enough that the agent can actually help you.

What also happened during this challenge

While building the Copilot adapter, I also rebuilt the entire rules framework from scratch. Went from 47 rules (v0.3.1) to 35 rules (v0.4.0) – fewer rules, dramatically higher quality. Every rule is now distinct, detectable, and backed by evidence. But that’s a story for another post.

Try it: npx @reporails/cli check

GitHub | Previous posts

I built WikiPilot with GitHub Copilot CLI

This is a submission for the GitHub Copilot CLI Challenge.

## What I Built

I built WikiPilot, a local-first, AI-powered CLI that generates a structured wiki for real codebases with
source-grounded evidence.

Instead of manually writing docs that drift over time, WikiPilot analyzes repositories, extracts symbols, plans
pages, generates documentation, validates quality, and outputs a static viewer-ready wiki.

### Key capabilities

  • Evidence-first docs: generated sections include source references and confidence scoring.
  • Incremental updates: processes changed files by default, with full rebuild support.
  • Multi-language analysis: TypeScript/JavaScript and C# support.
  • Machine-readable outputs: manifests, codemap, quality reports, and wiki plan artifacts.
  • Viewer experience: static docs viewer with navigation, TOC, and Mermaid support.

### Why this matters
WikiPilot makes documentation more auditable, repeatable, and CI-friendly, so teams can keep
architecture knowledge close to the code without heavy manual curation.

## Demo

  • Repository: https://github.com/HariharanS/wikipilot
  • Screenshots:

### Suggested walkthrough (60–90 seconds)

  1. Show .wikipilot.yml and explain the target repo setup.
  2. Run generation (generate) and show incremental + quality outputs.
  3. Open generated markdown and point to evidence/source grounding.
  4. Launch viewer (serve --build) and show navigation + rendered docs.
  5. Close with one practical “before/after” outcome (time saved, clearer onboarding, etc.).

## My Experience with GitHub Copilot CLI

GitHub Copilot CLI acted like a development copilot across architecture iteration, implementation, and debugging
loops while building WikiPilot.

I used it to speed up:

  • CLI command design and refactors
  • prompt/schema iteration for generation quality
  • debugging pipeline edge cases
  • improving developer UX and docs

### Example Copilot CLI workflows I used


```bash
copilot "help me design a CLI flow for generate/serve/evaluate-models commands"
copilot "review this module and suggest a safer refactor with minimal changes"
copilot "debug why this output quality check is failing and propose a fix"
copilot "draft docs for this command based on code behavior"
```

### Impact

Copilot CLI reduced context switching, accelerated iteration on tricky parts (generation + validation), and helped keep momentum from idea to working end-to-end tool.

### Ran out of Copilot credits

  • missing features to deploy to the cloud
  • use improved prompts and regenerate docs to improve the quality of the docs produced

### What I learned

  • Evidence-grounded AI output is much more trustworthy than free-form generation.
  • Incremental pipelines are critical for real-world repo scale.
  • Good DX (clear commands, predictable outputs, quality reports) matters as much as model quality.

### What’s next

  • Better cross-repo relationship visualization
  • More language analyzers
  • Richer interactive viewer exploration and traceability