Hashtag Jakarta EE #321

Welcome to issue number three hundred and twenty-one of Hashtag Jakarta EE!

As this post comes out, I have just arrived home from DeveloperWeek 2026 in San Jose, California. I will now spend a couple of days at home before going to Montreal for ConFoo 2026. I look forward to presenting at this conference for the fifth time.

When drifting just a little outside the sphere of Java-focused conferences, it is very apparent that Java is perceived as a legacy language. Most of these developers (or do they identify as vibe-prompters these days?) are not aware of the progress Java has made toward becoming the number one platform for AI workloads. The performance of the JVM alone should be convincing enough, but these days, when quality is measured by quantity (in lines of code), it is easy to forget the fundamentals of software architecture.

Bruno Borges has put together a bunch of patterns that showcase how modern Java differs from the old style, including how Enterprise Java has evolved from the old J2EE to modern Jakarta EE.

In the minutes from last week’s Jakarta EE Platform call, the content for Jakarta EE 12 Milestone 3 is outlined. All specifications are expected to update their parent pom.xml to the newly released EE4J Parent 2.0.0, which contains the configuration needed to stage artifacts before releasing to Maven Central, the same way we used to with OSSRH (which was retired last year).

By the way: if you were ever in doubt, this blog is, and will always be, 100% written by me. There is no AI involved, which you can probably tell by the spelling errors and (mostly) readable language. No generated slop here, only potentially sloppy human mistakes.

Ivar Grimstad


The Comprehensive Guide to OTel Collector Contrib

As application systems grow more complex, it becomes ever more important to understand how services interact across distributed systems. Observability sheds light on the behavior of instrumented applications and the infrastructure they run on, enabling engineering teams to better track system health and prevent critical failures.

OpenTelemetry (OTel) has standardized how we generate and transmit telemetry, and the OpenTelemetry Collector is the engine that processes and exports this data. However, when deploying the Collector, you will encounter two distinct variants: the Core and Contrib distributions.

Choosing the right distribution is a key step in setting up your observability pipeline. This article explains what the OpenTelemetry Collector Contrib distribution is, why it exists, and how to navigate its ecosystem of plug-and-play components.

What is OpenTelemetry Collector Contrib?

The Collector is a high-performance component that handles data ingestion from multiple sources, performs in-flight processing, and exports data to observability platforms. OpenTelemetry Collector Contrib is a companion repository to the main OpenTelemetry Collector that houses a vast library of community-contributed components. To put things into perspective:

  • Collector Core: the standard, “official” distribution that is designed to be lightweight and stable. It contains only the essential components maintained and distributed by the OTel maintainers.
  • Collector Contrib: the extended, “batteries-included” distribution; it ships with dozens of integrations for third-party vendors, specific technologies, and advanced processing needs. Added components are entirely community-managed and have different levels of stability, e.g., a new component might initially release in the alpha stage.

Versioning and Release Cadence

Since Contrib is a superset of Core, the two distributions are released in sync under the same version scheme. For example, version 0.142.0’s GitHub release page contains release binaries and links to the change logs for both distributions in one place.

OTel Collector Contrib vs Core

The following matrix shows the major differences between the two distributions:

| Feature | Collector Contrib (otelcol-contrib) | Collector Core (otelcol) |
| --- | --- | --- |
| Scope | Massive library of community components | Essential components only |
| Binary Size | Large (typically 120MB+) | Small (typically ~50MB) |
| Maintenance | Community-driven | Managed by core OTel maintainers |
| Stability | Mixed: contains components with varying stability | High: strict compatibility guarantees |
| Use Case | Integrating with varied stacks (AWS, K8s, Redis, etc.) | Pure OTLP environments, high-security requirements |

The OTel Collector Contrib distribution is the practical choice for most teams, as it likely supports the specific frameworks and cloud providers they need. The Core distribution is recommended only when you need a slim binary or have strict security constraints.

OTel Collector Contrib Architecture

The OTel Collector features a plug-and-play design where users configure specific components and define pipelines that control how data flows between those components. Let’s understand each component one by one.

Data moves through an OTel Collector Contrib pipeline; extensions enhance the Collector’s behavior and allow monitoring the health of the Collector itself. Components not added to a pipeline are not initialized and are ignored.

Receivers

Receivers are the entry point for telemetry data into the Collector. They either listen on ports for incoming telemetry or actively scrape data from external sources. Receivers convert incoming telemetry into the Collector’s native OTLP format before passing it down the pipeline. They are classified as push-based (e.g., listening for traces via gRPC/HTTP ports) or pull-based (e.g., scraping Prometheus metrics).

Some commonly used receivers are:

  • filelog: Follows log files on disk and ingests new entries as log records.
  • hostmetrics: Periodically collects the host machine’s metrics (CPU, memory, disk usage stats, etc.).
  • k8s_cluster: Collects cluster-level events and metadata from the Kubernetes API server.

Users requiring more specialized receivers can build their own by following the comprehensive documentation provided by OpenTelemetry.
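As an illustration, a minimal filelog receiver configuration might look like the following sketch (the log path is a placeholder; adjust it to your environment):

```yaml
receivers:
  filelog:
    include:
      - /var/log/myapp/*.log   # placeholder path to the log files to tail
    start_at: beginning        # also ingest existing content, not just new lines
```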

Processors

Processors prepare telemetry data for final storage and analysis by observability platforms. They perform operations like transformation, sampling, and aggregation on ingested data. Processors in a pipeline must be defined carefully, as they run in sequence; we will see this in action later.

Contrib includes some powerful processors:

  • transform: Uses the OpenTelemetry Transform Language (OTTL) to modify data, such as rewriting metric names or sanitizing PII.
  • resourcedetection: Detects the cloud environment (AWS, GCP, Azure) and enriches telemetry data with relevant metadata like region or instance ID.
  • tail_sampling: Buffers complete traces in memory to apply sampling rules like “keep 100% of errors and only 1% of successful requests”.
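To make the tail_sampling example concrete, here is a hedged sketch of a configuration implementing the “keep 100% of errors, sample 1% of the rest” rule (the policy names are arbitrary, and the timing is illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s           # how long to buffer a trace before deciding
    policies:
      - name: keep-all-errors    # always keep traces containing an error span
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-1-percent   # keep a small fraction of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
```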

Exporters

Exporters transmit data out of the Collector to observability backends like SigNoz or other Collector instances. They can push the data to an endpoint or expose an endpoint for scraping.

Some useful exporters included in Contrib are:

  • file: Writes data to a file on the disk in the JSON/OTLP format and supports essential features like file rotation, data compression, etc.
  • prometheusremotewrite: Pushes metrics to endpoints that support Prometheus Remote Write.
  • kafka: Pushes telemetry data to a Kafka topic, typically used in large-scale setups.

Extensions

Extensions provide additional capabilities to the Collector itself, rather than interacting with telemetry data directly.

Some commonly used extensions are:

  • health_check: An HTTP endpoint (default :13133) that reports Collector health. It is often used for liveness/readiness probes in Kubernetes.
  • pprof: Enables performance profiling of the Collector using Go’s pprof tooling (default :1777). This can be helpful to diagnose issues like high memory usage.
  • zpages: Provides debug pages (default :55679) containing real-time information about the Collector’s internal state. For example, /debug/tracez offers a tabular view of trace spans currently inside the Collector.
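If you run the Collector in Kubernetes, the health_check endpoint can back a liveness probe. A minimal sketch of the probe section of a container spec (delay and period values are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /          # health_check serves its status at the root path
    port: 13133      # the health_check extension's default port
  initialDelaySeconds: 5
  periodSeconds: 10
```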

The TraceZ page categorizes spans into latency buckets and distinctly lists error samples.

Pipelines

Pipelines are configurations that define the flow of data through the Collector, from data reception, to processing, to export to a compatible backend. Components are lazy-loaded: the Collector does not initialize components unless they are explicitly added to a pipeline. This prevents unused components from consuming resources.

The following configuration showcases a complete setup that monitors Collector health, scrapes metrics, adds host tags, and exports to an OpenTelemetry-compatible backend:

receivers:
  hostmetrics:
    collection_interval: 10s
    scrapers:
      cpu:
      memory:
      filesystem:

  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  resourcedetection:
    detectors: [system]

  resource:
    attributes:
      - key: service.name
        value: "otelcol-contrib-demo"
        action: upsert

      - key: service.namespace
        value: "infra"
        action: upsert

      - key: deployment.environment
        value: "demo"
        action: upsert

  # explicit object definitions prevent k8s validation errors
  batch: {}

exporters:
  otlp:
    # replace us with your host region
    endpoint: "https://ingest.us.signoz.cloud:443"
    tls:
      insecure: false
    headers:
      # add your ingestion key here
      signoz-ingestion-key: "<SIGNOZ-INGESTION-KEY>"
  debug:
    verbosity: normal

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection, resource, batch]
      exporters: [otlp, debug]

    traces:
      receivers: [otlp]
      processors: [resourcedetection, resource, batch]
      exporters: [otlp, debug]

  # extensions are configured separately from pipelines
  extensions: [health_check, pprof, zpages]

How to Set Up OTel Collector Contrib

While Contrib is available as a binary, it’s recommended to deploy it via Docker or Kubernetes, as these tools provide logical isolation and streamline the software lifecycle process.

For this guide, we will use SigNoz as the observability backend to store and visualize our telemetry data efficiently.

Prerequisites

We need to perform a few initial steps before deploying the Collector.

Clone the Example Repository

We have a GitHub repository containing all the configuration files and scripts used in this guide. Clone it by running:

git clone git@github.com:SigNoz/examples.git
cd examples/opentelemetry-collector/otelcol-contrib-demo

Setting up SigNoz Cloud

As discussed, we’ll be sending telemetry to SigNoz, an OpenTelemetry-native APM.

  • Sign up for a free SigNoz Cloud account (includes 30 days of unlimited access).
  • Navigate to Settings -> Account Settings -> Ingestion from the sidebar.
  • Set the deployment Region and Ingestion Key values in the otel-config.yaml file, at lines 40 and 45 respectively.

Once you’ve signed up, access Settings -> Account Settings from the sidebar. From there, select the Ingestion option on the left, and create an ingestion key.

Find your Region and Ingestion Key from the Ingestion tab.

Deploy Contrib with Docker

The Docker image for OTel Collector Contrib comes pre-packaged with all the community components discussed earlier.

The otel-config.yaml file in the repo contains the pipeline configuration we defined in the previous section. To deploy Contrib using Docker, execute:

docker run \
  -v ./otel-config.yaml:/etc/otelcol-contrib/config.yaml \
  -p 4317:4317 -p 4318:4318 -p 13133:13133 -p 1777:1777 -p 55679:55679 \
  --rm --name otelcol-contrib \
  otel/opentelemetry-collector-contrib:0.142.0

This command mounts your local config file to the container, exposes the necessary ports, and starts the Contrib Collector.

To verify the container is running correctly, query the health check endpoint in a new terminal window. Expect a Server available response:

curl localhost:13133

Deploy Contrib with Kubernetes (with Helm)

The official OpenTelemetry Helm chart is the standard way to deploy the Collector on Kubernetes.

We will run the Collector as a Deployment for convenience, though values.yaml can be modified to use it as a DaemonSet or a StatefulSet, if needed.
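The deployment mode is controlled by the chart’s mode value, so switching is a one-line change. An illustrative values.yaml fragment:

```yaml
# values.yaml for the opentelemetry-collector Helm chart
mode: deployment   # change to "daemonset" or "statefulset" if needed
```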

Run the following to add the parent Helm repo and install the chart:

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install -f values.yaml otelcol-contrib open-telemetry/opentelemetry-collector

Once installed, use port-forward to access the Collector services over localhost (run this in a separate terminal):

kubectl port-forward svc/otelcol-contrib-opentelemetry-collector \
  4317:4317 \
  4318:4318 \
  13133:13133 \
  1777:1777 \
  55679:55679

Generating Telemetry & Visualizing It in SigNoz

The GitHub repo contains a load generator script to generate sample telemetry data. It generates spans for various CRUD API calls and pushes them to your Collector at port 4318.

This allows you to see exactly how the observability backend represents data processed by your Collector pipeline.

First make the script executable, then run it:

chmod +x contrib_load_generator.sh
./contrib_load_generator.sh

The script will run for 30 seconds, generating a mix of success and error spans to simulate real-world scenarios. You can run it multiple times to generate a healthy amount of data.

This video shows you how to interact with the generated data in SigNoz:

Generating and Visualizing OpenTelemetry Data with SigNoz – YouTube


Build Your Own Collector

OpenTelemetry allows us to build custom Collector distributions to suit our needs. Let’s understand how we can do so, and then look at the steps for the build process.

What is OpenTelemetry Collector Builder?

The OpenTelemetry Collector Builder (OCB) is a CLI tool that lets you create a custom Collector distribution that includes only the components you explicitly need. Users create a manifest file listing specific Contrib components, and OCB then compiles a custom binary containing just those components.

Building custom distributions has multiple benefits:

  • Smaller size: OCB keeps the binary size to an absolute minimum, ensuring you only ship the necessary code.
  • Adheres to security requirements: By limiting the number of included third-party components, teams can reduce attack vectors and meet strict compliance requirements.

Initial Checks

OCB requires Go for compiling your desired components into a binary.

  • Download the Go binary here.
  • For easier setup, you can use apt on Linux or brew on Mac.

The rest of the prerequisites (SigNoz account, etc.) are the same as defined earlier.

Step 1: Setting Up OCB

The recommended way to use OCB is to download the pre-compiled binary for your system. First, identify your machine’s architecture using uname -m.

The uname command helps find architecture details for your machine.

cd into the builder directory in our example repo and download the binary based on your machine’s OS and architecture:

Download the binary that matches your OS and architecture:

Linux (AMD 64):

curl --proto '=https' --tlsv1.2 -fL -o ocb \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_linux_amd64

Linux (ARM 64):

curl --proto '=https' --tlsv1.2 -fL -o ocb \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_linux_arm64

Linux (ppc64le):

curl --proto '=https' --tlsv1.2 -fL -o ocb \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_linux_ppc64le

macOS (AMD 64):

curl --proto '=https' --tlsv1.2 -fL -o ocb \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_darwin_amd64

macOS (ARM 64):

curl --proto '=https' --tlsv1.2 -fL -o ocb \
  https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_darwin_arm64

Windows (AMD 64):

Invoke-WebRequest -Uri `
  "https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/cmd%2Fbuilder%2Fv0.142.0/ocb_0.142.0_windows_amd64.exe" `
  -OutFile "ocb.exe"

Unblock-File -Path "ocb.exe"

On Linux and macOS, make the binary executable with chmod +x ocb.

To verify the installation, run:

./ocb help

(On Windows, use ocb help.)

Step 2: Compiling the Contrib Binary

As discussed, OCB requires a configuration file to define which components to include. We have prepared a builder-config.yaml that includes the Go modules for all the components we used in the Docker/Kubernetes examples.
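The repository’s builder-config.yaml is the authoritative file; as a rough sketch, an OCB manifest pairs a dist section with gomod lines for each component, along these lines (the dist name is illustrative, and module versions must match the OCB release you downloaded):

```yaml
dist:
  name: custom-contrib-collector   # name of the output binary (illustrative)
  output_path: ./_build            # where the compiled distribution is written

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.142.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.142.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.142.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.142.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.142.0
```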

To compile the custom binary, run:

./ocb --config builder-config.yaml

OCB will download the defined Go components, compile them and save the output binary in the _build directory.

Now, let’s run our custom binary with our existing Collector configuration file:

./_build/custom-contrib-collector --config ../otel-config.yaml

You should start seeing the Collector’s log entries in your terminal. You can use the load generator script to send data and visualize it with SigNoz. Feel free to experiment with the configurations and see how things work under the hood.

Congratulations, you have successfully created your own optimized Collector distribution! Now, let’s look at the common use cases for Contrib.

Common Use Cases

You will almost certainly need Contrib (or a custom build using OCB) if your requirements include:

  • Kubernetes Observability: While Core handles OTLP data, it doesn’t natively understand Kubernetes objects. Contrib provides the k8sattributes processor and k8s_cluster receiver, which are essential for tagging your telemetry with Pod names, Namespaces, and Deployment IDs.
  • Data Sampling: As distributed systems scale, sampling becomes critical to control costs. The probabilistic sampling and tail sampling processors are only available in Contrib.
  • Telemetry Transformation: Often, telemetry generated by an application doesn’t match naming conventions, has verbose metadata or contains redundant data. The transform processor allows you to modify incoming telemetry to address such issues.
  • Vendor Flexibility: If you are migrating away from a vendor, but still have agents running (e.g., legacy/proprietary agents or Prometheus scrapers), Contrib provides receivers to accept those formats.

FAQs

What is OpenTelemetry Collector Contrib?

OpenTelemetry Collector Contrib is the “batteries-included” distribution of OpenTelemetry Collector. While the Core project provides the basic framework for processing telemetry data, the Contrib repository houses the vast majority of community-written integrations. It allows you to collect data from virtually any data source and send it to any backend without writing custom code.

Can I mix Core and Contrib components?

Yes. If you use the Contrib binary, it already includes all Core components. If you use OCB, you can mix and match them as you please. You can use components present in a distribution by configuring them under receivers, processors, or exporters and adding them to your pipelines.

Is Contrib less stable than Core?

The Core components inside the Contrib binary are just as stable as they are in the Core binary. However, the additional components in Contrib vary in stability. Always check the README of the specific receiver or exporter you plan to use to see if it is marked Alpha, Beta, or Stable.

Does Contrib impact performance?

Having extra code in the binary increases file size slightly, but it does not degrade runtime performance. Unused components sit dormant and do not consume CPU or memory unless you explicitly enable them in your configuration pipelines.

Ingesting Data from the Collector

We have now covered what OpenTelemetry Collector Contrib is, its advantages over the Core distribution, and how to set it up. Once the Collector is running for your observability needs, you need a reliable observability backend to interact with your telemetry data.

SigNoz is an open-source observability platform built natively on OpenTelemetry. Because SigNoz uses the OTel native format, it integrates seamlessly with the Contrib collector’s OTLP exporters.

You can choose between various deployment options in SigNoz. The easiest way to get started is SigNoz Cloud. We offer a 30-day free trial account with access to all features.

Those who have data privacy concerns and can’t send their data outside their infrastructure can sign up for either the enterprise self-hosted or the BYOC offering.

Those who have the expertise to manage SigNoz themselves or just want to start with a free self-hosted option can use our community edition.

Ten years late to the dbt party (DuckDB edition)

> Apparently, you can teach an old dog new tricks.

Last year I wrote a blog post about building a data processing pipeline using DuckDB to ingest weather sensor data from the UK’s Environment Agency. The pipeline was based around a set of SQL scripts, and whilst it used important data engineering practices like data modelling, it sidestepped the elephant in the room for code-based pipelines: dbt.

dbt is a tool created in 2016 that really exploded in popularity on the data engineering scene around 2020. This also coincided with my own journey away from hands-on data engineering and into Kafka and developer advocacy. As a result, dbt has always been one of those things I kept hearing about but never tried.

In 2022 I made a couple of attempts to learn dbt, but it never really ‘clicked’.

I’m rather delighted to say that as of today, dbt has definitely ‘clicked’. How do I know? Because not only can I explain what I’ve built, but I’ve even had the 💡 lightbulb-above-the-head moment seeing it in action and how elegant the code used to build pipelines with dbt can be.

In this blog post I’m going to show off what I built with dbt, contrasting it to my previous hand-built method.

Tip:
You can find the full dbt project on GitHub here.

If you’re new to dbt hopefully it’ll be interesting and useful. If you’re an old hand at dbt then you can let me know any glaring mistakes I’ve made 🙂

First, a little sneak peek:

Do you like DAGs?

Now, let’s look at how I did it.

The Data

Note:
I’m just going to copy and paste this from my previous article 🙂

At the heart of the data are readings, providing information about measures such as rainfall and river levels. These are reported from a variety of stations around the UK.

The data is available on a public REST API (try it out here to see the current river level at one of the stations in Sheffield).

Note:
I’ve used this same set of environment sensor data many times before, because it provides just the right balance of real-world imperfections, interesting stories to discover, data modelling potential, and enough volume to be useful but not too much to overwhelm.

  • Exploring it with DuckDB and Rill

  • Trying out the new DuckDB UI

  • Loading it into Kafka

  • Working with it in Flink SQL

  • Hand-coding a processing pipeline with DuckDB

  • Analysing it in Iceberg

  • Building a streaming ETL pipeline with Flink SQL

Ingest

What better place to start from than the beginning?

Whilst DuckDB has built-in ingest capabilities (which is COOL) it’s not necessarily the best idea to tightly couple ingest with transformation.

Previously I did it one-shot like this:

CREATE OR REPLACE TABLE readings_stg AS
  WITH src AS (
    SELECT * 
      FROM read_json('https://environment.data.gov.uk/flood-monitoring/data/readings?latest')) 
    SELECT u.* FROM (
        SELECT UNNEST(items) AS u FROM src); 
  1. Extract: read_json pulls the latest readings from the API.

  2. Transform: UNNEST explodes the nested items payload.

dbt encourages a bit more rigour with the concept of sources. By defining a source we can decouple the transformation of the data (2) from its initial extraction (1). We can also tell dbt to use a different instance of the source (for example, a static dataset if we’re on an aeroplane with no wifi and can’t keep pulling the API), as well as configure freshness alerts for the data.

The staging/sources.yml defines the data source:

[]
  - name: env_agency
    schema: main
    description: Raw data from the [Environment Agency flood monitoring API](https://environment.data.gov.uk/flood-monitoring/doc/reference)
    tables:
      - name: raw_stations
[]

Note the description – this is a Markdown-capable field that gets fed into the documentation we’ll generate later on. It’s pretty cool.

So env_agency is the logical name of the source, and raw_stations the particular table. We reference these thus when loading the data into staging:

SELECT
    u.dateTime, u.measure, u.value
FROM (
    SELECT UNNEST(items) AS u
    FROM {{ source('env_agency', 'raw_readings') }} 
)
  1. referencing the source

So if we’re not pulling from the API here, where are we doing it?

This is where we remember exactly what dbt is—and isn’t—for. Whilst DuckDB can pull data from an API directly, it doesn’t map directly to capabilities in dbt for a good reason—dbt is for transforming data.

That said, dbt is nothing if not flexible, and its ability to run Jinja-based macros gives it superpowers for bending to most wills. Here’s how we’ll pull in the readings API data:

{% macro load_raw_readings() %}
{% set endpoint = var('api_base_url') ~ '/data/readings?latest' %} 

{% do log("raw_readings ~ reading from " ~ endpoint, info=true) %}

{% set sql %}
    CREATE OR REPLACE TABLE raw_readings AS
    SELECT *,
            list_max(list_transform(items, x -> x.dateTime)) 
            AS _latest_reading_at 
    FROM read_json('{{ endpoint }}') 
{% endset %}
{% do run_query(sql) %}

{% do log("raw_readings ~  loaded", info=true) %}

{% endmacro %}
  1. Variables are defined in dbt_project.yml

  2. Disassemble the REST payload to get the most recent timestamp of the data, store it as its own column for freshness tests later

  3. As it happens, we are using DuckDB’s read_json to fetch the API data (contrary, much?)

Even though we are using DuckDB for the extract phase of our pipeline, we’re learning how to separate concerns. In a ‘real’ pipeline we’d use a separate tool to load the data into DuckDB (I discuss this a bit further later on). We’d do it that way to give us more flexibility over things like retries, timeouts, and so on.

The other two tables are ingested in a similar way, except they use CURRENT_TIMESTAMP for _latest_reading_at since the measures and stations APIs don’t return any timestamp information. If you step away from APIs and think about data from upstream transactional systems being fed into dbt, there’ll always be (or should always be) a field that shows when the data last changed. Regardless of where it comes from, the purpose of the _latest_reading_at field is to give dbt a way to understand when the source data was last updated.

In the staging/sources.yml the metadata for the source can include a freshness configuration:

[]
  - name: env_agency
    tables:
      - name: raw_stations
        loaded_at_field: _latest_reading_at
        freshness:
          warn_after: { count: 24, period: hour }
          error_after: { count: 48, period: hour }
[]

This is the kind of thing where the light started to dawn on me that dbt is popular with data engineers for a good reason; all of the stuff that bites you in the ass on day 2, they’ve thought of and elegantly incorporated into the tool. Yes I could write yet another SQL query and bung it in my pipeline somewhere that checks for this kind of thing, but in reality if the data is stale do we even want to continue the pipeline?

With dbt we can configure different levels of freshness check—“hold up, this thing’s getting stale, just letting you know” (warning), and “woah, this data source is so old it stinks worse than a student’s dorm room, I ain’t touching either of those things” (error).

Thinking clearly

When I wrote my previous blog post I did my best to structure the processing logically, but still ended up mixing pre-processing/cleansing with logical transformations.

dbt’s approach to source / staging / marts helped a lot in terms of nailing this down and reasoning through what processing should go where.

For example, the readings data is touched three times, each with its own transformations:

  1. Ingest: get the data in

macros/ingestion/load_raw_readings.sql

CREATE OR REPLACE TABLE raw_readings AS
SELECT *, 
        list_max(list_transform(items, x -> x.dateTime)) 
        AS _latest_reading_at 
FROM read_json('{{ endpoint }}')
1.  raw data, untransformed

2.  add a field for the latest timestamp
  2. Staging: clean the data up

models/staging/stg_readings.sql

SELECT
    u.dateTime,
    {{ strip_api_url('u.measure', 'measures') }} AS measure, 
    CAST( 
        CASE WHEN json_type(u.value) = 'ARRAY' THEN u.value->>0 
             ELSE CAST(u.value AS VARCHAR) 
        END AS DOUBLE 
    ) AS value 
FROM (
    SELECT UNNEST(items) AS u 
    FROM {{ source('env_agency', 'raw_readings') }}
)
1.  Drop the URL prefix from the measure name to make it more usable

2.  Handle situations where the API sends multiple values for a single reading (just take the first instance)

3.  Explode the nested array

    Except for exploding the data, the operations are where we start applying our opinions to the data (how `measure` is handled) and addressing data issues (`value` sometimes being a JSON array with multiple values)
  3. Marts: build specific tables as needed, handle incremental loads, backfill from archive, etc

models/marts/fct_readings.sql

{{
    config(
        materialized='incremental',
        unique_key=['dateTime', 'measure']
    )
}}

SELECT * FROM {{ ref('stg_readings') }}
UNION ALL
SELECT * FROM {{ ref('stg_readings_archive') }}

{% if is_incremental() %}
WHERE dateTime > (SELECT MAX(dateTime) FROM {{ this }})
{% endif %}

Each of these stages can be run in isolation, and each one is easily debugged. Sure, we could combine some of these (as I did in my original post), but it makes troubleshooting that much harder.

Incremental loading

This really is where dbt comes into its own as a tool for grown-up data engineers with better things to do than babysit brittle data pipelines.

Unlike my hand-crafted version for loading the fact table, which required manual steps such as pre-creating the table and adding constraints, dbt comes equipped with a syntax for declaring the intent (just like SQL itself), and at runtime dbt makes it so.

First we set the configuration, defining it as a table to load incrementally, and specify the unique key:

{{
    config(
        materialized='incremental',
        unique_key=['dateTime', 'measure']
    )
}}

then the source of the data:

SELECT * FROM {{ ref('stg_readings') }} ①
UNION ALL
SELECT * FROM {{ ref('stg_readings_archive') }} ②
  1. {{ }} is Jinja notation for variable substitution, with ref being a function that resolves the table name to where it got built by dbt previously

  2. The archive/backfill table. I keep skipping over this don’t I? I’ll get to it in just a moment, I promise

and finally a clause that defines how the incremental load will work:

{% if is_incremental() %}
WHERE dateTime > (SELECT MAX(dateTime) FROM {{ this }})
{% endif %}

This is more Jinja, and after a while you’ll start to see curly braces (with different permutations of other characters) in your sleep. What this block does is use a conditional, expressed with if/endif (and wrapped in Jinja code markers {% %}), to determine if it’s an incremental load. If it is then the SQL WHERE clause gets added. This is a straightforward predicate, the only difference from vanilla SQL being the {{ this }} reference, which compiles into the reference for the table being built, i.e. fct_readings. With this predicate, dbt knows where to look for the current high-water mark.
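In effect, the compiled SQL implements a classic high-water-mark load. A hypothetical Python rendering of what dbt ends up doing (data and names are illustrative):

```python
def incremental_load(target: list, new_rows: list) -> list:
    """Sketch of dbt's incremental predicate: WHERE dateTime > (SELECT MAX(dateTime) FROM target).

    ISO-8601-style timestamp strings compare correctly lexicographically.
    """
    high_water = max((r["dateTime"] for r in target), default="")
    target.extend(r for r in new_rows if r["dateTime"] > high_water)
    return target

fct = [{"dateTime": "2026-02-14 09:00", "measure": "m1", "value": 1.0}]
incoming = [
    {"dateTime": "2026-02-14 09:00", "measure": "m1", "value": 1.0},  # at the high-water mark: skipped
    {"dateTime": "2026-02-14 09:15", "measure": "m1", "value": 1.2},  # newer: loaded
]
incremental_load(fct, incoming)
print(len(fct))  # 2
```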

Backfill

I told you we’d get here eventually 🙂 Because we’ve built the pipeline logically with delineated responsibilities between stages, it’s easy to compartmentalise the process of ingesting the historical data from its daily CSV files and handling any quirks with its data from that of the rest of the pipeline.

The backfill is written as a macro. First we pull in each CSV file using DuckDB’s list comprehension to rather neatly iterate over each date in the range:

macros/ingestion/backfill_readings.sql

[…]
INSERT INTO raw_readings_archive
SELECT * FROM read_csv(
    list_transform(
        generate_series(DATE '{{ start_date }}', DATE '{{ end_date }}', INTERVAL 1 DAY),
        d -> 'https://environment.data.gov.uk/flood-monitoring/archive/readings-' || strftime(d, '%Y-%m-%d') || '.csv' ①
    ), 
[…]
  1. I guess this should be using the api_base_url variable that I mentioned above, oops!
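For clarity, the list_transform/generate_series combination simply produces one archive URL per day in the (inclusive) date range. The same expansion, sketched in Python:

```python
from datetime import date, timedelta

# URL pattern as used in the macro above
ARCHIVE_URL = "https://environment.data.gov.uk/flood-monitoring/archive/readings-"

def archive_urls(start: date, end: date) -> list:
    """One CSV URL per day, inclusive of both endpoints (matching generate_series)."""
    days = (end - start).days + 1
    return [ARCHIVE_URL + (start + timedelta(days=d)).strftime("%Y-%m-%d") + ".csv"
            for d in range(days)]

for url in archive_urls(date(2026, 2, 10), date(2026, 2, 11)):
    print(url)
```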

The macro is invoked manually like this:

dbt run-operation backfill_readings \
    --args '{"start_date": "2026-02-10", "end_date": "2026-02-11"}'

Then we take the raw data (remember, no changes at ingest time) and cleanse it for staging. This is the same processing we do for the API (except value is sometimes pipe-delimited pairs instead of JSON arrays). Different staging tables are important here, otherwise we’d end up trying to solve the different types of value data in one SQL mess.

models/staging/stg_readings_archive.sql

SELECT
    dateTime,
    {{ strip_api_url('measure', 'measures') }} AS measure,
    CAST(
        CASE
            WHEN value LIKE '%|%' THEN split_part(value, '|', 1)
            ELSE value
        END AS DOUBLE
    ) AS value
FROM {{ source('env_agency', 'raw_readings_archive') }}

This means that when we get to building the fct_readings table in the mart, all we need to do is UNION the staging tables because they’ve got the same schema with the same data cleansing logic applied to them:

SELECT * FROM {{ ref('stg_readings') }}
UNION ALL
SELECT * FROM {{ ref('stg_readings_archive') }}

Handling Slowly Changing Dimensions (SCD) the easy (but proper) way

In my original version I use SCD type 1 and throw away dimension history. Not for any sound business reason but just because it’s the easiest thing to do; drop and recreate the dimension table from the latest version of the source dimension data.

It’s kinda a sucky way to do it though because you lose the ability to analyse how dimension data might have changed over time, as well as answer questions based on the state of a dimension at a given point in time. For example, “What was the total cumulative rainfall in Sheffield in December” could give you a different answer depending on whether you include measuring stations that *were* open in December, or all those that *are* open in Sheffield today when I run the query.

dbt makes SCD an absolute doddle through the idea of snapshots. Also, in (yet another) good example of how good a fit dbt is for this kind of work, it supports dimension source data done ‘right’ and ‘wrong’. What do I mean by that, and how much heavy lifting are those ‘quotation’ ‘marks’ doing?

In an ideal world—where the source data is designed with the data engineer in mind—any time an attribute of a dimension changes, the data would indicate that with some kind of “last_updated” timestamp. dbt calls this the timestamp strategy and is the recommended approach. It’s clean, and it’s efficient. This is what I mean by ‘right’.

The other option is when the data upstream has been YOLO’d and as data engineers we’re left scrabbling around for crumbs from the table (TABLE, geddit?!). Whether by oversight, or perhaps some arguably-misguided attempt to streamline the data by excluding any ‘extraneous’ fields such as “last_updated”, the dimension data we’re working with just has the attributes and the attributes alone. In this case dbt provides the check strategy, which looks at some (or all) field values in the latest version of the dimension, compares it to what it’s seen before, and creates a new entry if any have changed.

Regardless of the strategy, the flow for building dimension tables looks the same:

(external data) raw -> staging -> snapshot -> dimension
  • Raw is literally whatever the API serves us up (plus, optionally, a timestamp to help us check freshness)

  • Staging is where we clean up and shape the data (unnest)

  • Snapshot looks at staging and existing rows in snapshot for the particular dimension instance, and creates a new entry if it’s changed (based on our strategy configuration)

  • Dimension is built from the snapshot table, taking the latest version of each instance of the dimension by checking using WHERE dbt_valid_to IS NULL. dbt_valid_to is added by dbt when it builds the snapshot table.
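The check strategy itself boils down to: compare the tracked columns of the incoming row against the currently-open snapshot row for that key; if anything differs, close the old row and open a new one. A minimal in-memory sketch (dbt does all of this in SQL; the function is mine, not dbt's):

```python
from datetime import datetime, timezone

def check_snapshot(snapshot: list, current: dict, unique_key: str, check_cols: list) -> None:
    """Mimic dbt's check strategy: open a new snapshot row only when a tracked column changed."""
    now = datetime.now(timezone.utc).isoformat()
    # Find the open row (dbt_valid_to IS NULL) for this key
    open_rows = [r for r in snapshot
                 if r[unique_key] == current[unique_key] and r["dbt_valid_to"] is None]
    if open_rows:
        prev = open_rows[0]
        if all(prev.get(c) == current.get(c) for c in check_cols):
            return  # no tracked change: keep the existing row open
        prev["dbt_valid_to"] = now  # close the superseded version
    snapshot.append({**current, "dbt_valid_from": now, "dbt_valid_to": None})

snap = []
check_snapshot(snap, {"notation": "E6619", "label": "Crowhurst GS"}, "notation", ["label"])
check_snapshot(snap, {"notation": "E6619", "label": "Crowhurst GS"}, "notation", ["label"])    # unchanged
check_snapshot(snap, {"notation": "E6619", "label": "CROWHURST WEIR"}, "notation", ["label"])  # renamed
print(len(snap))  # 2: one closed row, one open
```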

Here’s the snapshot configuration for station data:

{% snapshot snap_stations %}

{{
    config(
        target_schema='main',
        unique_key='notation', ①
        strategy='check', ②
        check_cols='all', ③
    )
}}

SELECT * FROM {{ ref('stg_stations') }}

{% endsnapshot %}
  1. This is the unique key, which for stations is notation

  2. Since there’s no “last updated” timestamp in the source data, we have to use the check strategy

  3. Check all columns to see if any attributes of the dimension have changed. This is arguably not quite the right configuration—see the note below regarding the measures field.

This builds a snapshot table that looks like this:

DESCRIBE snap_stations;
┌──────────────────┐
│   column_name    │
│     varchar      │
├──────────────────┤
│ @id              │ ①
│ RLOIid           │ ①
│ catchmentName    │ ①
│ dateOpened       │ ①
│ easting          │ ①
│ label            │ ①
│ lat              │ ①
│ long             │ ①
│ measures         │ ①
│ northing         │ ①
[…]
│ dbt_scd_id       │ ②
│ dbt_updated_at   │ ②
│ dbt_valid_from   │ ②
│ dbt_valid_to     │ ②
└──────────────────┘
  1. Columns from the source table

  2. Columns added by dbt snapshot process

So for example, here’s a station that got renamed.

The devil is in the detail data

Sometimes data is just…mucky.

Here’s why we always use keys instead of labels—the latter can be imprecise and change frequently:

SELECT notation, label, dbt_valid_from, dbt_valid_to
  FROM snap_stations
 WHERE notation = 'E6619'
 ORDER BY dbt_valid_to;
┌──────────┬──────────────────┬────────────────────────────┬────────────────────────────┐
│ notation │      label       │       dbt_valid_from       │        dbt_valid_to        │
│ varchar  │       json       │         timestamp          │         timestamp          │
├──────────┼──────────────────┼────────────────────────────┼────────────────────────────┤
│ E6619    │ "Crowhurst GS"   │ 2026-02-12 14:12:10.501256 │ 2026-02-13 20:45:44.391342 │
│ E6619    │ "CROWHURST WEIR" │ 2026-02-13 20:45:44.391342 │ 2026-02-13 21:15:48.618805 │
│ E6619    │ "Crowhurst GS"   │ 2026-02-13 21:15:48.618805 │ 2026-02-14 00:46:35.044774 │
│ E6619    │ "CROWHURST WEIR" │ 2026-02-14 00:46:35.044774 │ 2026-02-14 01:01:34.296621 │
│ E6619    │ "Crowhurst GS"   │ 2026-02-14 01:01:34.296621 │ 2026-02-14 03:15:46.92373  │
[etc etc]

Eyeballing it, we can see this is nominally the same place (Crowhurst). If we were using label as our join we’d lose the continuity of our data over time. As it is, the label surfaced in a report will keep flip-flopping 🙂

Another example of upstream data being imperfect is this:

SELECT notation, label, measures[1].parameterName, dbt_valid_from, dbt_valid_to
  FROM snap_stations
 WHERE notation = '0'
 ORDER BY dbt_valid_to;
┌──────────┬───────────────────────────┬─────────────────────────────┬────────────────────────────┬────────────────────────────┐
│ notation │           label           │ (measures[1]).parameterName │       dbt_valid_from       │        dbt_valid_to        │
│ varchar  │           json            │           varchar           │         timestamp          │         timestamp          │
├──────────┼───────────────────────────┼─────────────────────────────┼────────────────────────────┼────────────────────────────┤
│ 0        │ "HELEBRIDGE"              │ Water Level                 │ 2026-02-12 14:12:10.501256 │ 2026-02-13 17:59:01.543565 │
│ 0        │ "MEVAGISSEY FIRE STATION" │ Flow                        │ 2026-02-13 17:59:01.543565 │ 2026-02-13 18:46:55.201417 │
│ 0        │ "HELEBRIDGE"              │ Water Level                 │ 2026-02-13 18:46:55.201417 │ 2026-02-14 06:31:08.75168  │
│ 0        │ "MEVAGISSEY FIRE STATION" │ Flow                        │ 2026-02-14 06:31:08.75168  │ 2026-02-14 07:31:14.07855  │
│ 0        │ "HELEBRIDGE"              │ Water Level                 │ 2026-02-14 07:31:14.07855  │ 2026-02-14 16:16:23.465051 │
│ 0        │ "MEVAGISSEY FIRE STATION" │ Flow                        │ 2026-02-14 16:16:23.465051 │ 2026-02-14 16:31:45.420155 │
│ 0        │ "HELEBRIDGE"              │ Water Level                 │ 2026-02-14 16:31:45.420155 │ 2026-02-15 06:31:07.812398 │

Our unique key is notation, and there are apparently two stations using it! The same stations also have more correct-looking notation values, so one suspects this is an API glitch somewhere:

SELECT DISTINCT notation, label, measures[1].parameterName
  FROM snap_stations
 WHERE lcase(label) LIKE '%helebridge%'
    OR lcase(label) LIKE '%mevagissey%'
 ORDER BY 2, 3;
┌──────────┬───────────────────────────────────────┬─────────────────────────────┐
│ notation │                 label                 │ (measures[1]).parameterName │
│ varchar  │                 json                  │           varchar           │
├──────────┼───────────────────────────────────────┼─────────────────────────────┤
│ 0        │ "HELEBRIDGE"                          │ Flow                        │
│ 49168    │ "HELEBRIDGE"                          │ Flow                        │
│ 0        │ "HELEBRIDGE"                          │ Water Level                 │
│ 49111    │ "Helebridge"                          │ Water Level                 │
│ 18A10d   │ "MEVAGISSEY FIRE STATION TO BE WITSD" │ Water Level                 │
│ 0        │ "MEVAGISSEY FIRE STATION"             │ Flow                        │
│ 48191    │ "Mevagissey"                          │ Water Level                 │
└──────────┴───────────────────────────────────────┴─────────────────────────────┘

Whilst there might be upstream data issues, sometimes there are self-inflicted mistakes. Here’s one that I realised when I started digging into the data:

SELECT s.notation, s.label,
       array_length(s.measures) AS measure_count,
       string_agg(DISTINCT m.parameterName, ', ' ORDER BY m.parameterName) AS parameter_names,
       s.dbt_valid_from, s.dbt_valid_to
  FROM snap_stations AS s
  CROSS JOIN UNNEST(s.measures) AS u(m)
 WHERE s.notation = '3275'
 GROUP BY s.notation, s.label, s.measures, s.dbt_valid_from, s.dbt_valid_to
 ORDER BY s.dbt_valid_to;
┌──────────┬────────────────────┬───────────────┬───────────────────────┬────────────────────────────┬────────────────────────────┐
│ notation │       label        │ measure_count │    parameter_names    │       dbt_valid_from       │        dbt_valid_to        │
│ varchar  │        json        │     int64     │        varchar        │         timestamp          │         timestamp          │
├──────────┼────────────────────┼───────────────┼───────────────────────┼────────────────────────────┼────────────────────────────┤
│ 3275     │ "Rainfall station" │             1 │ Rainfall              │ 2026-02-12 14:12:10.501256 │ 2026-02-13 18:36:29.831889 │
│ 3275     │ "Rainfall station" │             2 │ Rainfall, Temperature │ 2026-02-13 18:36:29.831889 │ 2026-02-13 18:46:55.201417 │
│ 3275     │ "Rainfall station" │             1 │ Rainfall              │ 2026-02-13 18:46:55.201417 │ 2026-02-13 19:31:15.74447  │
│ 3275     │ "Rainfall station" │             2 │ Rainfall, Temperature │ 2026-02-13 19:31:15.74447  │ 2026-02-13 19:46:13.68915  │
│ 3275     │ "Rainfall station" │             1 │ Rainfall              │ 2026-02-13 19:46:13.68915  │ 2026-02-13 20:31:18.730487 │
│ 3275     │ "Rainfall station" │             2 │ Rainfall, Temperature │ 2026-02-13 20:31:18.730487 │ 2026-02-13 20:45:44.391342 │
[…]

Because we build the snapshot in dbt using the check strategy with check_cols set to all, any column changing triggers a new snapshot entry. What’s happening here is as follows. The station data includes measures, described in the API documentation as

> The set of measurement types available from the station

However, sometimes the API is showing one measure, and sometimes two. Is that enough of a change that we want to track and incur this flip-flopping?

Arguably, the API’s return doesn’t match the documentation (what measures a station has available is not going to change multiple times per day?). But, we are the data engineers and our job is to provide a firebreak between whatever the source data provides, and something clean and consistent for the downstream consumers.

So, perhaps we should update our snapshot configuration to specify the actual columns we want to track. Which is indeed what dbt explicitly recommends that you do:

> It is better to explicitly enumerate the columns that you want to check.
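Taking that advice, the snapshot config might look something like this. The column list here is illustrative—pick the attributes whose history you actually care about—but notably it leaves out the noisy measures field:

```
{% snapshot snap_stations %}

{{
    config(
        target_schema='main',
        unique_key='notation',
        strategy='check',
        check_cols=['label', 'catchmentName', 'lat', 'long'],
    )
}}

SELECT * FROM {{ ref('stg_stations') }}

{% endsnapshot %}
```

With this, a station flip-flopping its measures array no longer generates new snapshot rows.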

The tool that fits like a glove

The above section is a beautiful illustration of just how much sense the dbt approach makes. I’d already spent several hours analysing the source data before trying to build a pipeline. Even then, I missed some of the nuances described above.

With my clumsy self-built approach previously I would have lost a lot of the detail that makes it possible to dive into and troubleshoot the data like I just did. Crucially, dbt is strongly opinionated but ergonomically designed to help you implement a pipeline built around those opinions. By splitting out sources from staging from dimension snapshots from marts it makes it very easy to not only build the right thing, but diagnose it when it goes wrong. Sometimes it goes wrong from PEBKAC when building it, but in my experience a lot of the issues with pipelines come from upstream data issues (usually that are met with a puzzled “but it shouldn’t be sending that” reaction, or “oh yeah, it does that didn’t we mention it?”).

Date dimension

Whilst the data about measuring stations and measurements comes from the API, it’s always useful to have a dimension table that provides date information. Typically you want to be able to do things like analysis by date periods (year, month, etc) which may or may not be based on the standard calendar. Or you want to look at days of the week, or any other date-based things you can think of.

Even if your end users are themselves writing SQL, and you’ve not got a different calendar (e.g. financial year, etc), a date dimension table is useful. It saves time for the user in remembering syntax, and avoids any ambiguities on things like day of the week number (is Monday the first, or second day of the week?). More importantly though, it ensures that analytical end users building through some kind of tool (such as Superset, etc) are going to be generating the exact same queries as everyone else, and thus getting the same answers.

There were a couple of options that I looked at. The first is DuckDB-specific and uses the range() table function in the FROM clause to generate all the rows:

models/marts/dim_date.sql

SELECT CAST(range AS DATE) AS date_day,
        monthname(range) AS date_monthname,
        CAST(CASE WHEN dayofweek(range) IN (0,6) THEN 1 ELSE 0 END AS BOOLEAN) AS date_is_weekend,
        […]
FROM range(DATE '2020-01-01',
            DATE '2031-01-01',
            INTERVAL '1 day')

The second was a good opportunity to explore dbt packages. The dbt_utils package includes a bunch of useful utilities, including one for generating dates. The advantage of this is that it’s database-agnostic; I could port my pipeline to run on Postgres or BigQuery or anything else without needing to worry about whether the DuckDB range function that I used above is available in them.

Packages are added to packages.yml:

packages.yml

packages:
  - package: dbt-labs/dbt_utils
    version: ">=1.0.0"

The date dimension table then looks similar to the first, except the FROM clause is different:

models/marts/dim_date_v2.sql


SELECT CAST(date_day AS DATE) AS date_day,
    monthname(date_day) AS date_monthname,
    CAST(CASE WHEN dayofweek(date_day) IN (0,6) THEN 1 ELSE 0 END AS BOOLEAN) AS date_is_weekend,
    […]
FROM (
        {{ dbt_utils.date_spine(
            datepart="day",
            start_date="cast('2020-01-01' as date)",
            end_date="cast('2031-01-01' as date)"
        ) }}
    ) AS date_spine

The resulting tables are identical; just different ways to build them.

SELECT * FROM dim_date LIMIT 1;
┌────────────┬───────────┬────────────┬────────────────┬─────────────────┬────────────────┬─────────────────┬──────────────┬────────────────┬─────────────────┬──────────────┐
│  date_day  │ date_year │ date_month │ date_monthname │ date_dayofmonth │ date_dayofweek │ date_is_weekend │ date_dayname │ date_dayofyear │ date_weekofyear │ date_quarter │
│    date    │   int64   │   int64    │    varchar     │      int64      │     int64      │     boolean     │   varchar    │     int64      │      int64      │    int64     │
├────────────┼───────────┼────────────┼────────────────┼─────────────────┼────────────────┼─────────────────┼──────────────┼────────────────┼─────────────────┼──────────────┤
│ 2020-01-01 │   2020    │     1      │ January        │        1        │       3        │ false           │ Wednesday    │       1        │        1        │      1       │
└────────────┴───────────┴────────────┴────────────────┴─────────────────┴────────────────┴─────────────────┴──────────────┴────────────────┴─────────────────┴──────────────┘
SELECT * FROM dim_date_v2 LIMIT 1;
┌────────────┬───────────┬────────────┬────────────────┬─────────────────┬────────────────┬─────────────────┬──────────────┬────────────────┬─────────────────┬──────────────┐
│  date_day  │ date_year │ date_month │ date_monthname │ date_dayofmonth │ date_dayofweek │ date_is_weekend │ date_dayname │ date_dayofyear │ date_weekofyear │ date_quarter │
│    date    │   int64   │   int64    │    varchar     │      int64      │     int64      │     boolean     │   varchar    │     int64      │      int64      │    int64     │
├────────────┼───────────┼────────────┼────────────────┼─────────────────┼────────────────┼─────────────────┼──────────────┼────────────────┼─────────────────┼──────────────┤
│ 2020-01-01 │   2020    │     1      │ January        │        1        │       3        │ false           │ Wednesday    │       1        │        1        │      1       │
└────────────┴───────────┴────────────┴────────────────┴─────────────────┴────────────────┴─────────────────┴──────────────┴────────────────┴─────────────────┴──────────────┘

Duplication is ok, lean in

One of the aspects of the dbt way of doing things that I instinctively recoiled from at first was the amount of data duplication. The source data is duplicated into staging; staging is duplicated into the marts. There are two aspects to bear in mind here:

  1. Each layer serves a specific purpose. Being able to isolate, debug, and re-run individual elements of the pipeline as needed is important. Avoiding one big transformation from source-to-mart makes sure that transformation logic sits in the right place.

  2. There’s not necessarily as much duplication as you’d think. For example, the source layer is rebuilt at every run so only holds the current slice of data.

In addition to this…storage is cheap. It’s a small price to pay for building a flexible yet resilient data pipeline. Over-optimising is not going to be your friend here. We’re building analytics, not trying to scrape every bit of storage out of a 76KB computer being sent to the moon.

We’re going to do this thing properly: Tests and Checks and Contracts and more

This is where we really get into the guts of how dbt lies at the heart of making data engineering a more rigorous discipline, in the way its older sibling, software engineering, did a decade beforehand. Any fool can throw together some SQL to CREATE TABLE AS SELECT a one-big-table (OBT) or even a star-schema. In fact, I did just that! But as we saw above with SCD and snapshots, there’s a lot more to a successful and resilient pipeline. Making sure that the tables we’re building are actually correct, and proving so in a repeatable and automated manner, is crucial.

Of course, “correct” is up to you, the data engineer, to define. dbt gives us a litany of tools with which to encode and enforce it.

There are some features that are about the validity of the pipeline that we’ve built (does this transformation correctly result in the expected output), and others that validate the data that’s passing through it.

The configuration for all of these is done in the YAML that accompanies the SQL in the dbt project. The YAML can be in a single schema.yml, or broken up into individual YAML files. I quickly found the latter to be preferable for both source control footprint as well as simply locating the code that I wanted to work with.

Checking the data

Constraints provide a way to encode our beliefs as to the shape and behaviour of the data into the pipeline, and to cause it to flag any violation of these. For example:

  • Are keys unique? (hopefully)

  • Are keys NULL? (hopefully not)

Here’s what it looks like on dim_stations:

models:
  - name: dim_stations
    config:
      contract:
        enforced: true
    columns:
      - name: notation
        data_type: varchar
        constraints:
          - type: not_null
          - type: primary_key

You’ll notice the contract stanza in there. Constraints are part of the broader contracts functionality in dbt. Contracts also further encode the data model by requiring a name and data type for every column in a model. SELECT * might be fast and fun, but it’s also dirty af in the long run for building a pipeline that is stable and self-documenting (on which, more below).

Data tests are similar to constraints, but whilst constraints are usually defined and enforced on the target database (although this varies by database), tests are run by dbt as queries against the loaded data, separately from the actual build process (by the dbt test command). Tests can also be more flexible, and include custom SQL to check whatever conditions you want. Here’s a nice example of where a test is a better choice than a constraint:

models:
  - name: dim_measures
    columns:
      - name: notation
        tests:
          - not_null ①
          - unique ①
      - name: station
        tests:
          - not_null ②
          - relationships:
              arguments: 
                to: ref('dim_stations') ③
                field: notation ③
              config:
                severity: warn ④
                error_after: 
                  percent: 5 ④
  1. Check that the notation key is not NULL, and is unique

  2. Check that the station foreign key is not NULL

  3. Check that the station FK has a match…

  4. …but only throw an error if this is the case with more than five percent of rows

We looked at freshness of source data above. This lets us signal to the operator if data has gone stale (the period beyond which data is determined as stale being up to us). Another angle to this is that we might have fresh data from the source (i.e. the API is still providing data) but the data being provided has gone stale (e.g. it’s just feeding us readings data from a few days ago). For this we can actually build a table (station_freshness):

SELECT notation, freshness_status, last_reading_at, time_since_last_reading, "label"
  FROM station_freshness;
┌──────────┬──────────────────┬──────────────────────────┬─────────────────────────┬──────────────────────────────────────────────┐
│ notation │ freshness_status │     last_reading_at      │ time_since_last_reading │                    label                     │
│ varchar  │     varchar      │ timestamp with time zone │        interval         │                   varchar                    │
├──────────┼──────────────────┼──────────────────────────┼─────────────────────────┼──────────────────────────────────────────────┤
│ 49118    │ stale (<24hr)    │ 2026-02-18 06:00:00+00   │ 05:17:05.23269          │ "Polperro"                                   │
│ 2758TH   │ stale (<24hr)    │ 2026-02-18 08:00:00+00   │ 03:17:05.23269          │ "Jubilee River at Pococks Lane"              │
│ 712415   │ fresh (<1hr)     │ 2026-02-18 10:45:00+00   │ 00:32:05.23269          │ "Thompson Park"                              │
│ 740102   │ fresh (<1hr)     │ 2026-02-18 10:45:00+00   │ 00:32:05.23269          │ "Duddon Hall"                                │
│ E12493   │ fresh (<1hr)     │ 2026-02-18 10:45:00+00   │ 00:32:05.23269          │ "St Bedes"                                   │
│ E8266    │ fresh (<1hr)     │ 2026-02-18 10:30:00+00   │ 00:47:05.23269          │ "Ardingly"                                   │
│ E14550   │ fresh (<1hr)     │ 2026-02-18 10:30:00+00   │ 00:47:05.23269          │ "Hartford"                                   │
│ E84109   │ stale (<24hr)    │ 2026-02-18 10:00:00+00   │ 01:17:05.23269          │ "Lympstone Longbrook Lane"                   │
│ F1703    │ dead (>24hr)     │ 2025-04-23 10:15:00+01   │ 301 days 01:02:05.23269 │ "Fleet Weir"                                 │
│ 067027   │ dead (>24hr)     │ 2025-03-11 13:00:00+00   │ 343 days 22:17:05.23269 │ "Iron Bridge"                                │
│ 46108    │ dead (>24hr)     │ 2025-05-28 10:00:00+01   │ 266 days 01:17:05.23269 │ "Rainfall station"                           │
[…]

and then define a test on that table:

models:
  - name: station_freshness
    tests:
      - max_pct_failing: ①
          config:
            severity: warn
          arguments:
            column: freshness_status ②
            failing_value: "dead (>24hr)" 
            threshold_pct: 10 ②
  1. This is a custom macro

  2. Arguments to pass to the macro

So dbt builds the model, and then runs the test. It may strike you as excessive to have both a model (station_freshness) and macro (max_pct_failing). However, it makes a lot of sense because we’re building a model which can then be referred to when investigating test failures. If we shoved all this SQL into the test macro we’d not materialise the information. We’d also not be able to re-use the macro for other tables with similar test requirements.
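To make the macro’s logic concrete, here’s a hypothetical Python rendering of the same check. The real thing is a SQL macro; the column and argument names below follow the audit table output above:

```python
def max_pct_failing(rows: list, column: str, failing_value: str, threshold_pct: float):
    """Return a 'failure row' if too many rows hold the failing value, else None."""
    total = len(rows)
    failing = sum(1 for r in rows if r[column] == failing_value)
    failing_pct = round(100 * failing / total, 1) if total else 0.0
    if failing_pct > threshold_pct:
        return {
            "total": total,
            "failing": failing,
            "failing_pct": failing_pct,
            "threshold_pct": threshold_pct,
            "failure_reason": f"Failing pct {failing_pct}% exceeds threshold {threshold_pct}%",
        }
    return None  # returning no rows means the test passes

# 6 dead stations out of 100 breaches a 5% threshold
stations = [{"freshness_status": "dead (>24hr)"}] * 6 + [{"freshness_status": "fresh (<1hr)"}] * 94
print(max_pct_failing(stations, "freshness_status", "dead (>24hr)", 5)["failure_reason"])
# Failing pct 6.0% exceeds threshold 5%
```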

When the test runs as part of the build, if there are too many stations that haven’t sent new data in over a day we’ll see a warning in the run logs. We can also run the test in isolation and capture the row returned from the macro (which triggers the warning we see in the log):

❯ dbt test --select station_freshness --store-failures
[…]
14:10:53  Warning in test max_pct_failing_station_freshness_freshness_status__dead_24hr___5 (models/marts/station_freshness.yml)
14:10:53  Got 1 result, configured to warn if != 0
14:10:53
14:10:53    compiled code at target/compiled/env_agency/models/marts/station_freshness.yml/max_pct_failing_station_freshn_113478f1da33b78c269ac56f22cbec9d.sql
14:10:53
14:10:53    See test failures:
  -----------------------------------------------------------------------------------------------------------------------
  select * from "env-agency-dev"."main_dbt_test__audit"."max_pct_failing_station_freshn_113478f1da33b78c269ac56f22cbec9d"
  -----------------------------------------------------------------------------------------------------------------------
14:10:53
14:10:53  Done. PASS=1 WARN=1 ERROR=0 SKIP=0 NO-OP=0 TOTAL=2
SELECT * FROM "env-agency-dev"."main_dbt_test__audit"."max_pct_failing_station_freshn_113478f1da33b78c269ac56f22cbec9d";
┌───────┬─────────┬─────────────┬───────────────┬────────────────────────────────────────┐
│ total │ failing │ failing_pct │ threshold_pct │             failure_reason             │
│ int64 │  int64  │   double    │     int32     │                varchar                 │
├───────┼─────────┼─────────────┼───────────────┼────────────────────────────────────────┤
│ 5458  │   546   │    10.0     │       5       │ Failing pct 10.0% exceeds threshold 5% │
└───────┴─────────┴─────────────┴───────────────┴────────────────────────────────────────┘

Checking the pipeline

Even data engineers make mistakes sometimes. Unit tests are a great way to encode what each part of a pipeline is supposed to do. This is then very useful for identifying logical errors that you make in the pipeline’s SQL, or changes made to it in the future.

Here’s a unit test defined to make sure that the readings fact table correctly unions data from the API with that from backfill:

unit_tests:
  - name: test_fct_readings_union ①
    model: fct_readings ②
    overrides:
      macros:
        is_incremental: false ③
    given:
      - input: ref('stg_readings') ④
        rows: 
          - { dateTime: "2025-01-01 00:00:00", measure: "api-reading", value: 3.5, } ④
      - input: ref('stg_readings_archive') ⑤
        rows: 
          - { dateTime: "2025-01-01 01:00:00", measure: "archive-reading", value: 7.2, } ⑤
    expect: 
      rows: 
        - { dateTime: "2025-01-01 00:00:00", measure: "api-reading", value: 3.5, } ⑥
        - { dateTime: "2025-01-01 01:00:00", measure: "archive-reading", value: 7.2, } ⑥
  1. Name of the test

  2. The model with which it’s associated

  3. Since the model has incremental loading logic, we need to indicate that this unit test is simulating a full (non-incremental) load

  4. Mock source row of data from the API (stg_readings)

  5. Mock source row of data from the backfill (stg_readings_archive)

  6. Expected rows of data
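Since is_incremental() is overridden to false, the expected behaviour is nothing more than concatenating the two mocked inputs. In Python terms, what dbt asserts here is roughly:

```python
# Mocked inputs, mirroring the unit test definition above
stg_readings = [{"dateTime": "2025-01-01 00:00:00", "measure": "api-reading", "value": 3.5}]
stg_readings_archive = [{"dateTime": "2025-01-01 01:00:00", "measure": "archive-reading", "value": 7.2}]

# With is_incremental() forced to false, fct_readings is just a UNION ALL of both inputs
fct_readings = stg_readings + stg_readings_archive

expected = [
    {"dateTime": "2025-01-01 00:00:00", "measure": "api-reading", "value": 3.5},
    {"dateTime": "2025-01-01 01:00:00", "measure": "archive-reading", "value": 7.2},
]
assert fct_readings == expected
print("unit test passes")
```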

If you want them to RTFM, you gotta write the FM

This is getting boring now, isn’t it? No, not this article. But my constant praise for dbt. If you were to describe an ideal data pipeline you’d hit the obvious points—clean data, sensible granularity, efficient table design. Quickly to follow would be things like testing, composability, suitability for source control, and so on. Eventually you’d get to documentation. And dbt nails all of this.

You see, the pipeline that we’re building is self-documenting. All the YAML I’ve been citing so far has been trimmed to illustrate the point being made alone. In reality though, the YAML for the models looks like this:

models:
  - name: dim_stations
    description: >
      Dimension table of monitoring stations across England. Each station has one or
      more measures. Full rebuild each run.
      🔗 [API docs](https://environment.data.gov.uk/flood-monitoring/doc/reference#stations)
    columns:
      - name: dateOpened
        description: >
          API sometimes returns multiple dates as a JSON array; we take
          the first value.
      - name: latitude
        description: Renamed from 'lat' in source API.
        []

Every model, and every column, can have metadata associated with it in the description field. The description field supports Markdown too, so you can embed links and formatting in it, over multiple lines if you want.

dbt also understands the lineage of all of the models (because when you create them, you use the ref function thus defining dependencies).
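For illustration, here's a sketch of what the fct_readings model might look like (hypothetical SQL, trimmed like the YAML above): each ref() call both resolves to the upstream relation at compile time and registers a dependency edge that dbt uses for the lineage graph and docs.

```sql
-- Hypothetical fct_readings.sql: each ref() resolves to the upstream
-- relation and registers an edge in dbt's lineage graph.
select * from {{ ref('stg_readings') }}
union all
select * from {{ ref('stg_readings_archive') }}
```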

All of this means that you build your project and drop in bits of description as you do so, then run:

dbt docs generate && dbt docs serve

This generates the docs and then runs a local web server, giving you a browsable interface for inspecting each table's metadata and its lineage.

Since the docs are built as a set of static HTML pages they can be deployed on a server for access by your end users. No more “so where does this data come from then?” or “how is this column derived?” calls. Well, maybe some. But fewer.

Tip:
As a bonus, the same metadata is surfaced in Dagster.

So speaking of Dagster, let’s conclude this article by looking at how we run this dbt pipeline that we’ve built.

Orchestration

dbt does one thing—and one thing only—very well. It builds kick-ass transformation pipelines.

We noted briefly above that using dbt and DuckDB to pull the API data into the source tables is a slight overstep. In reality the extraction should probably be handled by a separate tool, such as dlt or Airbyte.

When it comes to putting our pipeline live and having it run automagically, we also need to look outside of dbt for this.

We could use cron, like absolute savages. It’d run on a schedule, but with absolutely nothing else to help an operator or data engineer monitor and troubleshoot.
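For the record, the savage version is a single crontab entry (hypothetical paths; `dbt build` runs the models and their tests in one go):

```shell
# Runs nightly at 02:00; a log file is all the observability you get
0 2 * * * cd /path/to/project && dbt build >> /var/log/dbt.log 2>&1
```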

I used Dagster, which integrates nicely with dbt (see the point above about how it automagically pulls in documentation). It understands the models and their dependencies, orchestrates everything, tracks executions, and shows you runtimes.

Dagster is configured using Python code, which I had Claude write for me. If I weren’t using dbt to load the sources it’d have been even more straightforward, but to get visibility of them in the lineage graph it needed a little bit extra. It also needed configuring to not run them in parallel, since DuckDB is a single-user database.

I’m sure there’s a ton of functionality in Dagster that I’ve yet to explore, but it’s definitely ticking a lot of the boxes that I’d be looking for in such a tool: ease of use, clarity of interface, functionality, etc.

Better late than never, right?

All y’all out there sighing and rolling your eyes…yes yes. I know I’m not telling you anything new. You’ve all known for years that dbt is the way to build the transformations for data pipelines these days.

But hey, I’m catching up alright, and I’m loving the journey. This thing is good, and it gives me the warm fuzzy feeling that only a good piece of technology designed really well for a particular task can do.

Created a GitHub Reusable Workflows Repository for Personal Use

I created a GitHub reusable workflows repository for my personal use.

masutaka/actions: GitHub Actions reusable workflows

What Are GitHub Reusable Workflows?

Reusable workflows are a GitHub Actions mechanism that lets workflow files be called from other repositories.

For example, you can call a workflow from another repository like this:

jobs:
  example:
    uses: masutaka/actions/.github/workflows/some-workflow.yml@main

Since you can consolidate common processes in one place, it saves you the trouble of managing the same workflow across multiple repositories.

There are some limitations and caveats to be aware of:

  1. Reusable workflows in private repositories cannot be called from public repositories
  2. Reusable workflows in private repositories can only be called from other repositories within the same org/user (different orgs are not allowed, and the Access policy setting of the called repository must be configured)
  3. The env context at the calling workflow level is not propagated to the called workflow
  4. Environment secrets cannot be passed (only regular secrets can be passed via secrets: inherit)
  5. Reusable workflows always run at the job level (they cannot be used as steps). This means a separate runner starts each time they are called, and the filesystem cannot be shared between jobs. For private repositories, this also increases Actions minutes consumption. If you want to reuse at the step level, you need to use a composite action

Since masutaka/actions is a public repository, limitations 1 and 2 do not apply.
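To illustrate limitation 5: if you want to reuse at the step level instead, you write a composite action. Here's a hedged sketch (hypothetical file, e.g. an action.yml in the repo); unlike a reusable workflow, its steps run inside the caller's job, on the caller's runner and filesystem.

```yaml
# Hypothetical composite action (action.yml) for step-level reuse.
name: setup-and-build
runs:
  using: "composite"
  steps:
    - uses: actions/checkout@v4
    - run: echo "building..."   # placeholder build step
      shell: bash               # composite run steps must declare a shell
```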

Included Workflows

Currently, I have included the following reusable workflows:

  • add_assignee_to_pr.yml (docs) – Sets the PR creator as the assignee when a PR is created
  • codeql.yml (docs) – Detects languages from changed files and runs CodeQL analysis
  • codeql_core.yml (docs) – Runs CodeQL analysis for specified languages
  • create_gh_issue.yml (docs) – Creates a GitHub Issue from a template
  • dependency_review.yml (docs) – Reviews PR dependencies
  • pushover.yml (docs) – Sends Pushover¹ notifications for workflow failures

I referenced mdn/workflows for the documentation structure, documenting each workflow’s usage in Markdown files under docs/ and linking to them from README.md.

Why I Created This

I had been using route06/actions, a repository where I had been a maintainer at my previous job, even after leaving the company.

However, there were a few things I wanted to customize for personal use, and creating a new pushover.yml workflow prompted me to copy the necessary workflow files and create my own repository.

Up until then, the same pushover.yml file was duplicated across my personal repositories, but I took this opportunity to consolidate everything into masutaka/actions.

Handling Licenses

Both repositories are under the MIT License. For workflow files copied over, I added attribution to the original repository at the top of each file like this:

Example: codeql.yml:

# Derived from https://github.com/route06/actions/blob/main/.github/workflows/codeql.yml
# Copyright (c) 2024 ROUTE06, Inc.
# Licensed under the MIT License.

I also included both copyrights in the LICENSE file. I believe this satisfies the requirements of the MIT License.

Copyright (c) Takashi Masuda
Copyright (c) 2024 ROUTE06, Inc.

Conclusion

Until now, similar workflow files were scattered across my personal repositories, and I had to update multiple repositories every time a change was needed. By consolidating them into masutaka/actions, changes can now be made in one place.

As my personal repositories continue to grow, I plan to keep consolidating shareable workflows here going forward.

References

  • Reuse workflows – GitHub Docs
  • Creating a composite action – GitHub Docs
  • mdn/workflows
  • route06/actions
  1. A push notification service for iOS/Android

Trending on GitHub: visual-explainer

visual-explainer

Agent skill + prompt templates that generate rich HTML pages for visual diff reviews, architecture overviews, plan audits, data tables, and project recaps

Project info

  • Repository: nicobailon/visual-explainer
  • Stars: 2.4K
  • Forks: 152
  • Language: HTML
  • URL: https://github.com/nicobailon/visual-explainer

Quick start

git clone https://github.com/nicobailon/visual-explainer
cd visual-explainer

Tags

github trending opensource html

Follow for more!

Git Worktrees for AI Coding: Run Multiple Agents in Parallel

Last Tuesday I had Claude Code fixing a pagination bug in my API layer. While it worked, I sat there. Waiting. Watching it think. For eleven minutes.

Meanwhile, three other tasks sat in my backlog: a Blazor component needed refactoring, a new endpoint needed tests, and the SCSS build pipeline had a caching issue. All independent. All blocked behind my single terminal.

I thought: I have 5 monitors and a machine that could run a small country. Why am I running one agent at a time?

Then I discovered that Claude Code shipped built-in worktree support, and everything changed. I went from sequential AI coding to running five agents in parallel, each on its own branch, none stepping on each other’s files. My throughput didn’t just double. It went up roughly 5x.

Here’s exactly how I set it up, the .NET-specific gotchas I hit, and why I think worktrees are the single biggest productivity unlock for AI-assisted development right now.

Table of Contents

  • What Are Git Worktrees (And Why Should You Care Now)
  • The Problem: One Repo, One Agent, One Branch
  • Setting Up Your First Worktree
  • Running Multiple AI Agents in Parallel
  • The .NET Worktree Survival Guide
  • My 5-Agent Workflow
  • Common Worktree Pain Points (And How to Fix Them)
  • When Worktrees Don’t Make Sense
  • Frequently Asked Questions
  • Stop Waiting, Start Parallelizing

What Are Git Worktrees

A git worktree is a second (or third, or fifth) working directory linked to the same repository. Each worktree checks out a different branch, but they all share the same .git history, refs, and objects.

Think of it this way: instead of cloning your repo five times (and wasting disk space on five copies of your git history), you create five lightweight checkouts that share one .git folder.

# Your main repo
C:\code\MyApp                    # on branch: master

# Your worktrees (separate folders, same repo)
C:\code\MyApp-worktrees\fix-pagination    # on branch: fix/pagination
C:\code\MyApp-worktrees\add-tests         # on branch: feature/api-tests
C:\code\MyApp-worktrees\refactor-blazor   # on branch: refactor/blazor-grid

Git introduced worktrees in version 2.5 (July 2015). They’ve been around for over a decade. Most developers have never used them because, until AI coding agents, there was rarely a reason to work on five branches simultaneously.

Now there is.

The Problem: One Repo, One Agent, One Branch

Here’s the typical AI coding workflow in 2026:

  1. Open terminal. Start Claude Code (or Cursor, or Copilot).
  2. Describe a task. Watch the agent work.
  3. Wait 5-15 minutes while it reads files, writes code, runs tests.
  4. Review the changes. Commit.
  5. Start the next task.

Steps 1-4 are sequential. You’re blocked. Your machine is doing maybe 10% of what it could.

“But I can just open another terminal and start a second agent.”

No, you can’t. Not safely. Two agents editing the same working directory is a recipe for corrupted state. Agent A writes to OrderService.cs while Agent B is reading it. Agent A runs dotnet build while Agent B is mid-refactor. Merge conflicts happen in real-time, inside your working directory, with no version control to save you.

Worktrees fix this. Each agent gets its own directory, its own branch, its own isolated workspace. They can all build, test, and modify files simultaneously without interference.

Setting Up Your First Worktree

The syntax is simple:

# Create a worktree with a new branch
git worktree add ../MyApp-worktrees/fix-pagination -b fix/pagination

# Create a worktree from an existing branch
git worktree add ../MyApp-worktrees/fix-pagination fix/pagination

# List all worktrees
git worktree list

# Remove a worktree when you're done
git worktree remove ../MyApp-worktrees/fix-pagination

I keep my worktrees in a sibling directory to avoid cluttering the main repo:

C:\code
├── MyApp                        # Main working directory
└── MyApp-worktrees              # All worktrees live here
    ├── fix-pagination
    ├── add-tests
    └── refactor-blazor

One critical rule: you cannot check out the same branch in two worktrees. Git enforces this by default. If your main directory is on master, no worktree can also be on master. You can override this with git worktree add -f, but don’t. It prevents two workspaces from stomping on each other’s state. The restriction is a feature, not a bug.
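You can watch the restriction kick in with a throwaway demo; everything here happens in a temp directory, and the branch names are made up:

```shell
# Throwaway demo: Git refuses to check out a branch that is already
# checked out in another worktree.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
main=$(git symbolic-ref --short HEAD)

# A new branch in a new worktree: allowed.
git worktree add ../wt-feature -b feature/x >/dev/null 2>&1

# The branch already checked out here: refused.
if git worktree add ../wt-dup "$main" >/dev/null 2>&1; then
    dup_allowed=yes
else
    dup_allowed=no
fi
echo "duplicate checkout allowed: $dup_allowed"
git worktree list
```

Running it shows two worktrees (the main checkout plus wt-feature) and confirms the duplicate checkout is rejected.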

Running Multiple AI Agents in Parallel

Here’s where it gets interesting. Once you have worktrees set up, you can launch an AI agent in each one.

With Claude Code

Claude Code has built-in worktree support with a --worktree (-w) CLI flag that starts a session in an isolated worktree automatically. You can also create worktrees manually and point Claude Code at them:

# Terminal 1: Main repo - fixing the pagination bug
cd C:\code\MyApp
claude "Fix the pagination bug in OrdersController where offset is off by one"

# Terminal 2: Worktree - adding API tests
cd C:\code\MyApp-worktrees\add-tests
claude "Add integration tests for all endpoints in OrdersController"

# Terminal 3: Worktree - refactoring Blazor component
cd C:\code\MyApp-worktrees\refactor-blazor
claude "Refactor the OrderGrid component to use virtualization"

# Terminal 4: Worktree - fixing SCSS
cd C:\code\MyApp-worktrees\fix-scss
claude "Fix the SCSS compilation caching issue in the build pipeline"

# Terminal 5: Worktree - documentation
cd C:\code\MyApp-worktrees\update-docs
claude "Update the API documentation for the Orders endpoint"

Five terminals. Five agents. Five branches. Zero conflicts.

Claude Code also supports spawning subagents in worktrees internally using isolation: "worktree" in agent definitions, where each subagent works in isolation and the changes get merged back. Boris Cherny, Creator and Head of Claude Code at Anthropic, called worktrees his number one productivity tip — he runs 3-5 worktrees simultaneously and described it as particularly useful for “1-shotting large batch changes like codebase-wide code migrations.”

With Other AI Tools

The same pattern works with any AI coding tool:

# Cursor - open each worktree as a separate workspace
code C:\code\MyApp-worktrees\fix-pagination

# GitHub Copilot CLI - run in each worktree directory
cd C:\code\MyApp-worktrees\add-tests && gh copilot suggest "..."

The worktree is just a directory. Any tool that operates on a directory works.

The .NET Worktree Survival Guide

This is where generic worktree guides fall short. .NET projects have specific pain points that will bite you if you’re not prepared.

Pain Point 1: NuGet Package Restore

Each worktree needs its own bin/ and obj/ directories. The good news: dotnet restore handles this automatically. The bad news: your first build in each worktree takes longer because it’s restoring packages from scratch.

# After creating a worktree, always restore first
cd C:\code\MyApp-worktrees\fix-pagination
dotnet restore

The NuGet global packages cache (%USERPROFILE%\.nuget\packages on Windows, ~/.nuget/packages on Mac/Linux) is shared across all worktrees, so the packages aren't downloaded again; each worktree just resolves them from the cache. Fast enough.

Pain Point 2: Port Conflicts in launchSettings.json

This one will get you. If all your worktrees use the same launchSettings.json, they’ll all try to bind to the same port. Two Kestrel instances on port 5001 means one of them crashes.

Fix it with environment variables or override the port at launch:

# In worktree terminal, override the port
dotnet run --urls "https://localhost:5011"

# Or set it via environment variable
ASPNETCORE_URLS=https://localhost:5011 dotnet run

One gotcha: if you have Kestrel endpoints configured explicitly in appsettings.json, those override ASPNETCORE_URLS. The --urls flag is safer because it takes highest precedence.

I usually don’t bother with any of this — most of the time the AI agent doesn’t need to run the app, just build and test it.

Pain Point 3: User Secrets and appsettings.Development.json

User secrets are stored by UserSecretsId (set in your .csproj) under %APPDATA%\Microsoft\UserSecrets\<UserSecretsId>\secrets.json on Windows (~/.microsoft/usersecrets/ on Mac/Linux). They live outside the repo entirely, so they're shared automatically across worktrees. This is usually what you want.

appsettings.Development.json is tracked in git (or should be gitignored), so it exists in every worktree. No issues here.

Pain Point 4: Database Migrations Running in Parallel

If two agents both try to run dotnet ef database update against the same database at the same time, you’ll get lock contention or worse.

My rule: only one worktree touches the database at a time. If a task involves migrations, it gets its own dedicated slot and the other agents work on code-only changes.

Or better: use a separate database per worktree for integration tests. Your docker-compose.yml can spin up isolated Postgres instances:

# docker-compose.worktree-tests.yml
services:
  db-pagination:
    image: postgres:17
    ports: ["5433:5432"]
    environment:
      POSTGRES_DB: myapp_pagination

  db-tests:
    image: postgres:17
    ports: ["5434:5432"]
    environment:
      POSTGRES_DB: myapp_tests

Pain Point 5: Shared Global Tools and SDK

The .NET SDK is machine-wide. global.json in your repo pins the version. Since all worktrees share the same repo, they all use the same SDK version. No issues here — this just works.

My 5-Agent Workflow

Here’s my actual daily workflow. I’ve been running this for a few weeks and it’s settled into a rhythm.

Morning planning (10 minutes):

  1. Check the backlog. Pick 4-5 independent tasks.
  2. “Independent” means: different files, different concerns, no shared migration paths.
  3. Create worktrees and branches:
#!/bin/bash
# Quick script I keep handy
TREES="C:/code/MyApp-worktrees"   # forward slashes work in Git Bash

for branch in "$@"; do
    # Create the worktree with a new branch, or fall back to an existing one
    git worktree add "$TREES/$branch" -b "$branch" 2>/dev/null || \
        git worktree add "$TREES/$branch" "$branch"
    echo "Created worktree: $TREES/$branch"
done
# Usage
./create-worktrees.sh fix/pagination feature/api-tests refactor/blazor fix/scss update/docs

Parallel execution (1-2 hours):

  1. Open 5 terminals (I use Windows Terminal with tabs).
  2. Launch Claude Code in each worktree with a clear, scoped prompt.
  3. Monitor. Most tasks complete in 5-15 minutes.
  4. Review each agent’s work as it finishes.

Merge back (15 minutes):

  1. Review diffs. Run tests in each worktree.
  2. Merge completed branches back to master:
git checkout master
git merge fix/pagination
git merge feature/api-tests
# ... and so on
  3. Clean up worktrees:
git worktree remove ../MyApp-worktrees/fix-pagination
git worktree remove ../MyApp-worktrees/add-tests
# Or nuke them all (skip the first line, which is the main checkout)
git worktree list | tail -n +2 | awk '{print $1}' | xargs -I{} git worktree remove {}
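One related cleanup gotcha: if you delete a worktree folder with plain rm -rf instead of git worktree remove, Git keeps a stale record of it, and git worktree prune clears that record. A throwaway demo (temp directory, made-up branch name):

```shell
# Throwaway demo: a hand-deleted worktree folder leaves stale metadata
# behind; `git worktree prune` cleans it up.
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q repo; cd repo
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "init"
git worktree add ../wt -b tmp/branch >/dev/null 2>&1
rm -rf ../wt          # delete the folder without telling git
git worktree prune    # drop the stale bookkeeping entry
count=$(git worktree list | wc -l)
echo "remaining worktrees: $count"
```

After the prune, only the main checkout remains in the list.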

Results: What used to take a full day of sequential agent sessions now takes about 2 hours including review time.

Task Selection Matters

Not every task is a good worktree candidate. The ideal task for parallel AI execution:

| Good for worktrees                 | Bad for worktrees                |
| ---------------------------------- | -------------------------------- |
| Bug fix in isolated file           | Database schema migration        |
| Adding tests for existing code     | Renaming a shared model class    |
| New endpoint (separate controller) | Refactoring shared base classes  |
| UI component work                  | Changing DI registration order   |
| Documentation updates              | Anything that touches Program.cs |

The rule of thumb: if two tasks would cause a merge conflict, don’t run them in parallel.

Common Worktree Pain Points

The criticisms are real. Let me address them honestly.

“I have to npm install in every worktree.”

True for Node projects. For .NET, dotnet restore is fast because the global package cache is shared. If you’re in a monorepo with both Node and .NET, install node_modules per worktree — it takes 30 seconds with a warm cache.

“Pre-commit hooks don’t install automatically.”

If you use Husky or similar, run the install command after creating the worktree. For .NET projects using dotnet format as a pre-commit hook, it works automatically since the tool is restored via dotnet tool restore.

“I have to copy env files.”

Write a setup script. Seriously. If you’re creating worktrees regularly, spending 20 minutes on a setup-worktree.sh script will save you hours:

#!/bin/bash
WORKTREE_DIR=$1
cp .env "$WORKTREE_DIR/.env"
cd "$WORKTREE_DIR"
dotnet restore
dotnet tool restore
echo "Worktree ready: $WORKTREE_DIR"

“Ports conflict.”

Pass --urls to override the port. For ASP.NET Core integration tests, port conflicts aren’t even an issue — WebApplicationFactory<T> uses an in-memory test server with no actual port binding. Multiple test suites can run simultaneously without stepping on each other.

These are all solvable problems. The throughput gain is worth the 30-minute setup cost.

When Worktrees Don’t Make Sense

I’m not going to pretend worktrees are always the answer. Skip them when:

  • Your task list has sequential dependencies (task B needs task A’s output)
  • You’re working on a single large feature that touches every layer
  • Your repo is small enough that the agent finishes in under 3 minutes anyway
  • You’re on a machine with less than 16GB RAM (each agent + build process eats memory)
  • The codebase has heavy shared state: a single God.cs class that everything depends on

For a focused 30-minute bug fix, just use your main directory. Worktrees shine when you have 3+ hours of independent tasks and the machine to run them.

Frequently Asked Questions

What is a git worktree?

A git worktree is an additional working directory linked to an existing repository. It lets you check out a different branch in a separate folder while sharing the same git history and objects. Created with git worktree add <path> <branch>, worktrees have been available since Git 2.5 (July 2015).

Can I use git worktrees with Visual Studio?

Yes. Visual Studio 2022 and later can open a worktree folder as a project. Solution files, project references, and NuGet packages all work normally. The only caveat is that Solution Explorer shows the worktree path, not the main repo path. JetBrains Rider also handles worktrees well.

How many git worktrees can I run at once?

Git imposes no hard limit. The practical limit is your machine’s RAM and CPU. Each worktree with an AI agent running dotnet build consumes roughly 2-4GB of RAM. On a 32GB machine, 5-6 concurrent worktrees with active builds is comfortable. On 64GB, you can push to 10+.

Do git worktrees share the NuGet cache?

Yes. The NuGet global packages folder (~/.nuget/packages) is machine-wide, not per-repository. When you run dotnet restore in a worktree, packages are resolved from the global cache. Only packages not already cached will be downloaded. This makes the first restore in a new worktree fast — usually under 10 seconds for a typical .NET solution.

Are git worktrees better than multiple git clones?

For AI-assisted parallel development, yes. Worktrees share git history, refs, and the object database. Five worktrees use a fraction of the disk space of five full clones. Commits made in any worktree are immediately visible to all others (same .git directory). The only advantage of separate clones is full isolation — useful if you need different git configs or hooks per copy.

How do I resolve merge conflicts from parallel worktree branches?

Merge each branch back to your main branch one at a time. If branches touched different files (which they should if you planned well), merges are clean. For conflicts, resolve them using your normal merge workflow. The key is task selection: if you chose truly independent tasks, merge conflicts are rare. I’ve been running 5 parallel branches daily for weeks and hit fewer than 3 conflicts total.

Stop Waiting, Start Parallelizing

The era of watching a single AI agent grind through your tasks one by one is over. Git worktrees give you isolated workspaces in seconds. AI coding tools give you agents that can fill each one.

The math is simple. If one agent takes 10 minutes per task and you have 5 tasks, that’s 50 minutes sequential. With 5 worktrees, it’s 10 minutes plus review time.

Set up a few worktrees. Pick independent tasks. Launch your agents. Go make coffee.

When you come back, five branches will be waiting for review.

Now if you’ll excuse me, I have 4 agents running and one of them just finished refactoring my Blazor grid component. Time to review.

About the Author

I’m Mashrul Haque, a Systems Architect with over 15 years of experience building enterprise applications with .NET, Blazor, ASP.NET Core, and SQL Server. I specialize in Azure cloud architecture, AI integration, and performance optimization.

When production catches fire at 2 AM, I’m the one they call.

  • LinkedIn: Connect with me
  • GitHub: mashrulhaque
  • Twitter/X: @mashrulthunder

Follow me here on dev.to for more .NET and AI coding content