structkit vs cookiecutter vs copier: Which Project Scaffolding Tool Is Right for You?

If you’ve ever needed to scaffold a new project — a microservice, a Terraform module, a Python package — you’ve likely reached for cookiecutter or copier. They’ve served the community well for years. But in 2025, the needs of platform engineering teams have evolved: remote content sources, AI assistants, and organization-wide consistency at scale are now table stakes.

In this post we compare three tools, cookiecutter, copier, and structkit, to help you pick the right one for your workflow.

TL;DR Comparison

| Feature | cookiecutter | copier | structkit |
| --- | --- | --- | --- |
| Template storage | Git repo required | Git repo required | YAML file (no repo needed) |
| Remote file inclusion | ❌ | ❌ | ✅ GitHub, S3, GCS, HTTP |
| AI / MCP integration | ❌ | ❌ | ✅ |
| Update existing projects | ❌ | ✅ | |
| Pre/post hooks | ✅ | ✅ | |
| Dry run mode | ❌ | ✅ | |
| File conflict strategies | ❌ | ✅ (skip, overwrite, patch) | ✅ (skip, backup, overwrite) |
| IDE schema validation | ❌ | ❌ | ✅ |
| Language | Python | Python | Python |
| GitHub stars | ~10k | ~7k | ~14 (early) |

cookiecutter

Best for: Simple, one-time project generation from a git template repo.

cookiecutter is the original. Define a cookiecutter.json and a directory of Jinja2-templated files, push it to GitHub, and anyone can run cookiecutter gh:your-org/your-template.
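
The same generation step is scriptable from Python if you'd rather not shell out; a minimal sketch using cookiecutter's documented API (the template URL and context keys are placeholders for your own template):

from cookiecutter.main import cookiecutter

# Generate a project non-interactively, supplying answers instead of prompting.
cookiecutter(
    "gh:your-org/your-template",
    no_input=True,
    extra_context={"project_name": "billing-service"},
)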

What it does well:

  • Huge ecosystem of community templates
  • Dead-simple mental model
  • Widely understood across teams

Where it falls short:

  • Templates must live in their own git repo
  • Remote content (“include this CI file from our shared templates repo”) means copy-pasting into the template itself
  • No update path — once generated, you’re on your own
  • No dry run, no conflict resolution

Use cookiecutter if: You need quick, one-time scaffolding and there’s already a community template for your use case.

copier

Best for: Projects that need to stay in sync with their template over time.

copier is the evolution of cookiecutter. It adds a killer feature: template updates. If the upstream template changes, copier update merges the diff into your existing project. It also adds dry run mode and file conflict strategies.
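
Both generation and updates are also available from Python; a rough sketch, assuming a recent copier release (the exact keyword arguments vary between versions):

from copier import run_copy, run_update

# First generation: copy the template into a new project directory.
run_copy(
    "https://github.com/your-org/your-template.git",
    "new-project",
    data={"project_name": "billing-service"},
)

# Later: merge upstream template changes into the existing project.
run_update("new-project", overwrite=True)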

What it does well:

  • Template update / migration path (huge win for long-lived projects)
  • Dry run mode
  • Multiple conflict strategies (skip, overwrite, patch)
  • Jinja2 templating compatible with cookiecutter knowledge

Where it falls short:

  • Templates still require a git repo
  • Remote content still means copy-pasting
  • No AI integration
  • Configuration is YAML, but you still have to manage the full template file tree

Use copier if: You manage projects that need to track upstream template changes over time — e.g. organizational standards that evolve quarterly.

structkit

Best for: Platform and DevOps teams managing project standards at scale, especially with remote content sources and AI-native workflows.

structkit takes a fundamentally different approach: your entire project structure is defined in a single YAML file — no template repo required. File content can come from anywhere: inline, local, GitHub, S3, GCS, or any HTTP URL.

files:
  - README.md:
      content: |
        # {{@ project_name @}}
        {{@ description @}}
  - .github/workflows/ci.yml:
      file: github://your-org/templates/main/ci.yml
  - terraform/main.tf:
      file: s3://your-bucket/terraform/base-module.tf

variables:
  - project_name:
      description: "Name of your project"

When your org updates the canonical CI template, every new project generated from your structkit YAML gets the update automatically. No template repo to maintain.

What makes structkit different:

1. Remote-first content

Reference your org’s canonical CI file from GitHub directly. No copy-pasting, no drift.

2. YAML-first design

The entire structure lives in one file. Commit it to your platform repo. Version it. Review it in a PR. No separate template repository overhead.

3. MCP / AI integration

structkit mcp --server

Your AI assistant (Claude, Cursor, Copilot) can generate project scaffolds from natural language, using your templates as source of truth. This is the scaffolding tool built for the AI era.

4. IDE schema validation

Get autocomplete and validation on your structkit YAML in VS Code, JetBrains, or any JSON Schema-aware editor.

Where structkit is early:

  • Smaller community and ecosystem (14 stars vs 10k+)
  • Fewer community templates available out of the box
  • Docs are still growing

Use structkit if: You’re a platform or DevEx team enforcing org-wide standards with remote content sources, or you want to integrate AI assistants into your project creation workflow.

Picking the Right Tool

| Your situation | Recommendation |
| --- | --- |
| Need a quick one-time scaffold from existing community templates | cookiecutter |
| Projects that need to stay in sync with evolving org templates | copier |
| Org-wide standards with remote content sources or AI integration | structkit |
| Want to try something new with MCP/AI-native workflows | structkit |

Getting Started with structkit

pip install structkit
structkit generate my-template ./new-project

  • GitHub: httpdss/structkit
  • Docs: structkit documentation
  • MCP setup: structkit mcp --server

Have questions or want to share how you’re using structkit? Join the GitHub Discussions.

structkit is open source (MIT). Contributions and feedback welcome.

UCP Tech Council Expands: What the Meeting Minutes Tell Us About Where the Protocol Is Heading

This past Friday, five of the largest technology companies in the world quietly joined the governing body of the Universal Commerce Protocol. No press release. No blog post. Just a commit to MAINTAINERS.md in the spec repository.

Amazon. Meta. Microsoft. Salesforce. Stripe. All now have seats on the UCP Tech Council — the body that reviews, debates, and approves every change to the protocol that AI shopping agents use to buy things.

We know this because we read the meeting minutes. Every week, the TC meets to debate spec changes, vote on PRs, and argue about how agent commerce should work. Most people in the industry don’t read these minutes. We do — and what they reveal about where UCP is heading is more interesting than any announcement.

This is what the minutes tell us.

The expansion: who joined and why it matters

The Tech Council grew from roughly 12 seats to 16 members across 8 companies:

| Company | Representatives | Role |
| --- | --- | --- |
| Google | 4 seats | Founding sponsor, spec steward |
| Shopify | 4 seats (incl. 2 new) | Largest platform implementer |
| Amazon | Greg Smith (new) | The world’s largest online retailer |
| Meta | James Andersen (new) | Social commerce, Instagram Shopping |
| Microsoft | Patrick Jordan (new) | Copilot, enterprise commerce |
| Stripe | Prasad Wangikar (new) | Payment infrastructure |
| Salesforce | Scot DeDeo (new) | Commerce Cloud, enterprise retail |
| Etsy | Imran Hoosain | Marketplace commerce |
| Target | Maxime Najim | Enterprise retail |
| Wayfair | Naga Malepati | Furniture/home goods |

This isn’t ceremonial. The TC has binding authority over spec changes — every PR that ships in a UCP release has been reviewed and voted on by this group. When Amazon and Stripe join that table, it changes what gets prioritised, what gets debated, and ultimately what the protocol becomes.

The meeting minutes from March 13 first mentioned the election process: seats rotating every six months, with growing partner interest. By March 27, six nominations had been received. The final review was scheduled for April 10. The MAINTAINERS.md update landed April 24.

The new members are already contributing. James Andersen (Meta) submitted PR #367 on April 17 — a documentation PR clarifying network token usage and PCI scope in card credentials. Patrick Jordan (Microsoft) contributed documentation accuracy fixes the same day. These aren’t advisory seats. They’re engineering seats.

What the meeting minutes actually say

We reviewed the six TC meetings from March 6 through April 17. Here’s what’s being debated, decided, and built — translated for a merchant audience.

Identity linking is the top priority — and it’s hard

The single most discussed topic across all six meetings is identity linking — how an agent knows who the customer is across sessions, stores, and platforms.

The April 17 minutes show an active debate about OAuth 2.0 scope design: nested scopes vs flat scopes vs config maps. The TC favoured flat. PR #354 implements OAuth 2.0 as the foundation for identity linking with capability-driven scopes.

Why this matters for merchants: Identity linking is the missing piece that would let an agent complete a purchase without a checkout-page handoff. Right now, agents can browse and cart — but paying requires redirecting the customer to a human checkout flow. Identity linking + payment handlers would close that loop. Until then, agents rely on the transport layer to reach the store and the manifest endpoint for discovery. Our April state-of-commerce report showed only 3 stores out of 4,024 currently declare identity linking capability. The spec work happening now is what will eventually bring that number up.

Loyalty is being trimmed to ship faster

The TC has been debating loyalty schemas since March. PR #340 implements a loyalty extension for the checkout capability. The April 10 minutes note that the extension is being “trimmed to baseline use cases” — a pragmatic decision to ship something that works for simple loyalty programs now, rather than waiting for a comprehensive solution that handles every edge case.

Why this matters: If your store has a loyalty or rewards program, the spec is building the infrastructure for agents to verify loyalty status and redeem points as part of the checkout flow. This is early — don’t build against it yet — but understand that it’s coming and it’s being shaped by people at Google, Shopify, Etsy, and Target who run real loyalty programs.

Local commerce is on the roadmap

The April 3 minutes list Q2 priorities. Among them: local commerce. PR #375 proposes store-based local inventory and fulfilment options — the infrastructure an agent would need to answer “is this product available at a store near me?”

This is Target and Wayfair territory. Both have TC seats. Both have store networks. The fact that local commerce is a Q2 priority with retail representation on the council suggests it’s not theoretical.

Returns are “incredibly complicated”

The April 17 minutes include the most honest assessment we’ve seen in any spec discussion: returns are acknowledged as an “incredibly complicated domain.” This is refreshing. Most protocol specs pretend returns are simple. UCP’s TC is saying out loud that they’re not, and that getting them right will take time.

PR #257 from the February cycle introduced a returns extension. It’s still in review. The complexity is in modelling return windows, refund methods, partial returns, and eligibility rules — all of which vary by merchant, product, and jurisdiction.

Why this matters: Don’t expect agent-managed returns in 2026. But understand that the protocol is building toward it, and the merchants who implement return policies as structured data (not just PDF links) will be ahead when it ships.

The spec itself just shipped its biggest release ever

v2026-04-08 landed with 60+ merged PRs — the largest release since the protocol launched. Key additions:

  • Cart capability — basket building for agents, a prerequisite for multi-item flows
  • Catalog search + lookup — formalised product discovery as a spec capability
  • Request/response signing — cryptographic integrity for agent-store communication
  • Error handling overhaul — first-class errors, business logic error types
  • Eligibility claims — for loyalty, membership, and verification-gated pricing
  • Discount extension to cart — discounts now apply pre-checkout, not just at checkout
  • Risk signals — authorization and abuse metadata for fraud prevention

Our crawler showed Shopify migrating its entire fleet to v2026-04-08 in four days. 99.4% of verified stores are now on the latest spec.

What this means for you

If you’re a merchant

The governance expansion doesn’t change what you need to do today. Your UCP requirements are the same: valid manifest, declared capabilities, clean variant data. Check your store, fix any common errors, compare against competitors, and set up alerts so you know if anything breaks.

What it does change is the timeline and the confidence. When Amazon, Microsoft, and Salesforce have engineering seats on the governing body, the protocol is not going away. If you’ve been waiting for a signal that UCP is “real enough” to invest in — five of the ten largest technology companies joining the TC in a single commit is that signal.

If you’re a platform

If you run Shopify, you’re covered — platform-level UCP support is mature. If you run BigCommerce, WooCommerce, Magento, or a custom stack, watch the identity linking and loyalty PRs. These are the capabilities that will differentiate agent-ready platforms from agent-compatible ones in H2 2026.

Salesforce Commerce Cloud now has a seat at the table. If you’re on SFCC, this is the clearest signal yet that platform-level UCP support is coming. Our April report noted that we’ve already seen SFCC engineering work in progress.

If you’re building agents

The Build an Agent quickstart still works — the protocol surface you’re building against is stable. But start tracking the identity linking PRs. When that capability ships, the agent flow goes from “browse + cart + redirect to checkout” to “browse + cart + pay” — end-to-end autonomous purchasing. That’s the step change.

Check the store leaderboard to find the highest-performing targets, understand how product discovery works, and test your agent against real stores in UCP Playground; use UCP Registry for production discovery. Both will surface the new capabilities as they ship.

The reading list

For anyone who wants to follow the protocol’s evolution themselves:

  • Meeting minutes: github.com/Universal-Commerce-Protocol/meeting-minutes
  • Spec repo: github.com/Universal-Commerce-Protocol/ucp
  • v2026-04-08 release notes: github.com/Universal-Commerce-Protocol/ucp/releases/tag/v2026-04-08
  • MAINTAINERS.md: github.com/Universal-Commerce-Protocol/ucp/blob/main/MAINTAINERS.md
  • Active PRs: github.com/Universal-Commerce-Protocol/ucp/pulls

We’ll continue monitoring the spec, the TC minutes, and the 4,500+ merchants building on the protocol. If any of the Q2 priorities (identity, loyalty, local commerce) ship in spec form, we’ll cover them in the May state-of-commerce report.

Check your store’s UCP status at UCPChecker.com. Browse verified stores at UCPRegistry.com. Test agent performance at UCPPlayground.com. Read the full protocol stack: MCP vs UCP vs AP2.

How Static Code Analysis Helps Reduce Software Bugs, and Money Spent!


Dealing with bugs is a natural part of software development. It can also be among the most costly parts, especially when bugs aren’t discovered until later in the development lifecycle. The daunting part is that many bugs aren’t immediately obvious.

Issues like memory leaks, default credentials, and hardcoded tokens are easily missed in manual code reviews, going relatively unnoticed until they cause problems late in development or post-launch. This can spike the cost of development through security vulnerabilities, downtime, reputational damage, and unwanted spending on rework. 

That’s where static code analysis comes into play. It changes the way you approach code defects by shifting detection earlier in the process, prioritizing fixes without running code, and minimizing the impact and cost of bugs by making changes quicker and easier.

Let’s take a closer look at the costs associated with software bugs and explore how static code analysis helps save you money on related fixes.


The true cost of software bugs

Every undetected bug comes with a price. But the scale might come as a surprise. According to a report by the Consortium for Information & Software Quality (CISQ), poor software quality cost the US economy $2.41 trillion in 2022. That includes everything from application downtime and debugging to productivity drains. 

In addition, IBM’s Cost of a Data Breach Report 2025 reports that the estimated average global cost of a data breach is $4.44 million, with many breaches traced back to software bugs.

Get A Cost-Saving Demo

How the cost grows with time

The cost of software bugs isn’t static. It escalates through the development, testing, and production stages. Essentially, the later in the process a bug is identified, the more it costs.

A 2010 study from the IBM System Sciences Institute shows the value of early bug identification. It found that a bug identified at the traditional testing stage can cost 15 times more to fix than one found during design. And the cost only climbs higher when bugs sneak through to maintenance, costing up to 100 times more.

The average cost of a bug quickly adds up, including:

  • Downtime and production outages, leading to revenue loss and expensive hotfixes to get back online.
  • Rework, which is more time-consuming and complex if bugs are discovered later in the process, when developers don’t have a fresh view of the code. If a bug is discovered during coding, the developer immediately knows why they wrote it that way and the decisions they took, so they can easily unpick it. When it’s discovered later, whoever is tasked with removing the bug needs to take time to understand what the original developer did and why before they can fix the problem.
  • Context switching, when developers constantly need to shift tasks, leading to increased errors and more time and money spent on fixes. 
  • Security breaches and non-compliance, which can cost almost three times as much as compliance, thanks to fines and reputational damage.
  • Technical debt through wasted developer time that could be spent reducing existing maintenance needs.

How static code analysis reduces bug costs

Static code analysis transforms how you handle code quality. Applying a shift-left mindset brings crucial tasks like testing and quality checks to the front of the timeline, reducing the cost of required fixes in several ways.

Early detection, easier fixes

Earlier detection means changes are actioned at the cheapest point, before bugs reach QA or production. Faster feedback loops also let your developers rework code while the context is fresh: with no unpicking of someone else’s thinking later in the process, bugs can be fixed immediately rather than days later.

Lower production risks

Undetected bugs can wreak havoc in a live environment. Static code analysis flags security issues and vulnerabilities before merge. So you can avoid incidents, rollbacks, potential financial penalties, reputational damage, and the added cost of emergency developer resources.

Freeing up your team

Static code analysis frees your team up and unblocks backlogs. Automating repeatable checks reduces the burden of manual code review and debugging, allowing your team to focus on more complex issues and new development, improving efficiency and reducing costs.

Repaying your technical debt

Enforcing code quality standards keeps codebases maintainable and reasonably inexpensive to run. That lets you focus on reducing your technical debt, rather than watching it stack up further and add to your team’s mental load.

Static code analysis in CI/CD pipelines

Static code analysis maximizes its cost-saving power when deployed as part of the CI/CD (continuous integration and continuous delivery/deployment) pipeline. As your driver of code quality, it doesn’t just help you spot bugs; it stops them from ever reaching your codebase.

When you run static code analysis on every commit, you automatically check for issues as part of your standard development lifecycle. By checking code quality and security vulnerabilities, including those highlighted by security coding standards like OWASP, you ensure errors are spotted and fixed before code is merged into the main branch.

Failed quality gates in your CI/CD pipeline stop progress when the code doesn’t meet the required standard. Your developers must fix these and rerun the full or partial CI pipeline before progressing, meaning identifiable bugs simply never reach your codebase.

This significantly reduces costs in the long term by removing the ability to bypass failed quality gates, a shortcut that otherwise leads to increased technical debt, security problems, and production bugs.
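
To illustrate the quality-gate idea in the simplest possible terms (a generic sketch, not Qodana’s actual interface), a pipeline step can parse the analyzer’s report and fail the build when blocking findings exist; the file name and JSON shape below are assumptions:

import json
import sys

# Hypothetical report produced by a static analysis step earlier in the pipeline.
with open("analysis-results.json") as report:
    findings = json.load(report).get("findings", [])

blocking = [f for f in findings if f.get("severity") in {"error", "critical"}]

if blocking:
    for finding in blocking:
        print(f"{finding['file']}:{finding['line']}  {finding['message']}")
    sys.exit(1)  # non-zero exit fails the CI job, so the code never merges

print("Quality gate passed: no blocking findings.")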

How Qodana helps your teams control bug costs

Qodana reinvents your code quality processes. It lets your development team identify issues and fix them earlier through real-time feedback, consistent code standards, and suggested fixes. 

That means you deliver secure, high-quality code without incurring the costs of refactoring and downtime. And all within the environments your team is already working in, for smooth integration.

The Qodana code quality tool is built on and uses the linters of JetBrains IDEs, trusted by more than 18 million developers since 2001, and integrates with tools developers already use. It drives code quality and reduces debugging costs through:

  • Ready-to-use pipelines for popular CI systems such as Azure Pipelines, CircleCI, Jenkins, GitHub Actions, GitLab CI/CD, and more.
  • Consistent rules and code quality checks across local development, IDE, and CI.
  • Quality gates that stop risky code before merge, minimizing potential future technical debt.
  • Clear and actionable reports that suggest automatic fixes and reduce repair time and associated costs.

Scenario: How to find and reduce bug costs in testing

How does static code analysis reduce costs in practice? Let’s say your team is working on a project to improve the design of your product and add new features. However, a code or configuration change unintentionally introduces a security weakness: secret values with embedded API keys are hardcoded into the source.

The result is that these values could be exposed to anyone with access to the codebase, including version control history. If the bug goes undetected, your systems will be vulnerable to unauthorized access and data breaches, resulting in downtime, reputational damage, and more time, resources, and money spent on fixes.
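
To make the example concrete, this is the kind of change a static analyzer flags, next to the safer pattern; the key and endpoint below are made up for illustration:

import os
import requests

# What the analyzer flags: a secret hardcoded into source (and into version control history).
api_key = "sk_live_51H_EXAMPLE_DO_NOT_SHIP"

# The fix: read the secret from the environment (or a secrets manager) at runtime.
api_key = os.environ["PAYMENTS_API_KEY"]

response = requests.get(
    "https://api.example.com/v1/charges",
    headers={"Authorization": f"Bearer {api_key}"},
)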

This can also be caused by code changes that weaken security by:

  • Logging sensitive data, such as credentials, leading to security and compliance risks. 
  • Using insecure cryptographic APIs or misusing secure ones.
  • Missing authentication checks that may result in information exposure.

Without static code analysis

Without static code analysis, this vulnerability could easily make it through manual code review. Because it’s not obvious, it might pass functional testing and not be uncovered until close to or after release, during a security audit or penetration test.

Worse still, if the issue makes it into your shipped product, it can even be found and exploited by an attacker, costing far more than just the resource needed for debugging.

Fixing the problem in a live environment can require coordinated patches and quick fixes, which may cost much more than if the error was detected and rectified during development. It can also lead to:

  • Additional resource costs for personnel and security reviews to fix the issue.
  • Large fines by regulatory bodies, such as the SEC (Securities and Exchange Commission) and FTC (Federal Trade Commission).
  • Delayed certifications, which may lead to financial, operational, and legal penalties.
  • Reputational damage and the associated costs of lost business.

With static code analysis

Static code analysis significantly reduces the cost of dealing with such a vulnerability. At commit or pull request, the vulnerability is flagged and progress stopped. The developer working on that code is automatically alerted and required to fix the issue before merging code. This stops the bug in its tracks and prevents it from becoming part of your wider codebase, meaning there’s a quicker fix and no fallout.

The process ensures security is a proactive priority, rather than a reactive expense. It helps mitigate vulnerabilities causing security problems during testing and execution. It also reduces the risk and high costs of compliance breaches.

Reduce software bugs and costs with Qodana

Static code analysis is an important part of lowering the cost of software bugs and fixes. Identify problems before they escalate into expensive issues. 

Try Qodana for free with a 60-day trial for your project.

Speak To Qodana

A Quick-ish Rundown of LLM Basics

Over the past few days, I’ve realized that there are a lot of folks out there using LLMs that haven’t had an opportunity to dig, even a little, into the basics of how LLMs really work. And I guess that makes sense; for the most part, the average person doesn’t have a lot of reason to know this. But if you’re going to be a power user, there are things that would really help you to understand.

Below are the most basic basics. Not covering everything, just some stuff that I think if you get then the rest will start to make sense for you as well. Hopefully it helps someone out there.

Tokens

When you write something to an LLM, it doesn’t break your text down character by character; it breaks it into groups of characters called “tokens”. Every LLM has its own tokenizer, so not all of them choose the same tokens.

Here’s a real-world example of what tokenization might look like using Qwen3.6 27b’s tokenizer: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json. If you open that file, you’ll see the full list of tokens that Qwen3.6 27b uses.

As for how tokens work… here’s an example:

“This is a token”
– That’s 15 characters

‘This’ ‘Ġis’ ‘Ġa’ ‘Ġtoken’
– That’s 4 tokens. You’ll notice ‘Ġ’ at the start of all but the first; that’s what
GPT-2/GPT-3/GPT-4 use as a space in tokenization

These line up to numbers, which the LLM then uses to do matrix math to determine the right output. If we go back to the link I gave you above, then you can see the following:

This   == 1919
Ġis    == 369
Ġa     == 264
Ġtoken == 3817

So Qwen3.6 27b would see your sentence as (1919, 369, 264, 3817). It then does matrix math and other cool pattern-y stuff to determine the best tokens to respond to you with.

So remember this when you hear that an LLM has a context window of 1,000,000 tokens: it’s counting those chunks, not words or characters. Sometimes whole words are tokens, sometimes not. Don’t just assume every word is a token; tokenizers try to create tokens from the most commonly used words. “This”, “is”, and “a” are all very common in the English language, and “token” is very common in text about LLMs.
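
If you want to see this for yourself, the transformers library will show you the split; a quick sketch using the GPT-2 tokenizer (any model’s tokenizer works the same way, the splits just differ):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "This is a token"
print(tokenizer.tokenize(text))  # ['This', 'Ġis', 'Ġa', 'Ġtoken']
print(tokenizer.encode(text))    # the integer IDs the model actually sees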

Context Windows

The way I usually describe context windows is to imagine the full A Song of Ice and Fire book series printed out on one really long parchment, and you have a piece of cardboard with a window cut in it that you can read text through. All you know is whatever’s currently in that window. If someone asks you about something outside the window? Tough luck, you don’t know it.

Now, the obvious thought is “well just make the window bigger”. The problem is that if you cut the window too big, you have a harder time finding any specific thing in there, and you start mixing details up. You’ve learned how to read a certain amount within that window, and pushing past that doesn’t go great. If the full book was the length of a parking lot, and someone asked you for details that could exist anywhere in that whole parking lot worth of text… well, good luck.

That’s pretty much how it works with LLMs. You’ll see models advertise huge context windows like 1,000,000 tokens, but the real-world practical use of that is a lot smaller than the marketing implies. The bigger you stuff that window, the worse the model gets at pinpointing specific information inside it. There’s a whole pile of benchmarks (needle in a haystack tests, NoLiMa, RULER, etc) showing accuracy drop as the context fills up. So a 200k token context window is not an invitation to dump 200k tokens in there and expect great results. You’ll generally get a much better answer giving the model 8k of really relevant tokens than 200k of “everything I have on the topic”.

To get a better visualization, check this benchmark out: https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87

Scroll down to the results section and you’ll see a table- the numbers in there represent how well the model pulls the right info out based on the context size it was fed. You can see that some models, like GPT-5.2 or Opus 4.6, did great all the way up to 120k (except 5.2 pro for some reason…). But look at something like minimax 2.5, for example: by the time you hit 60k tokens, you have less than a 50% chance to get all the right info you asked for.

This is a struggle a lot of us running local models deal with, and it usually means you want to account for that with a lot of great wrapper software or middleware.
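
One of the simplest things that wrapper software does is count tokens before sending anything, so you know how full the window is; a sketch using tiktoken (the encoding name is an assumption, pick whatever matches your model):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_budget(text: str, budget: int = 8_000) -> bool:
    """True if the text fits inside the token budget you've decided on."""
    return len(enc.encode(text)) <= budget

notes = open("notes.txt").read()  # whatever you were about to paste into the prompt
print(len(enc.encode(notes)), "tokens")
print("fits:", fits_in_budget(notes))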

Model Sizes (ie- parameters)

When we talk about models, we size them based on the number of parameters they have. 1M is a 1 Million parameter model. That’s itty bitty. 1b is 1 billion parameters- also itty bitty. Many modern models release in really huge sizes like 397b to 1T (1 Trillion parameters).

The easiest way to imagine parameters is as data points that each encode several pieces of information at once. One parameter doesn’t necessarily map to a single fact like “When did the first Ford car release?”; it can contribute to many different patterns at the same time.

Models are generally created in BF16 format to start with. Size wise- BF16 equates to about 2GB per 1b; so a 20b model would be 40GB. If you “quantize” the model (easiest way is to think of it is ‘compressing’ the model) to 8bpw, or ~q8_0, that becomes 1GB per 1b. If you go further to 4bpw, or ~q4_0, you get down to 0.5GB per 1b. That’s how we fit big models on smaller hardware.
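
That arithmetic is easy to sanity-check yourself; a tiny sketch of the rule of thumb (real file sizes vary a little with quantization format and metadata):

def estimated_size_gb(params_billions: float, bytes_per_weight: float) -> float:
    """Rough on-disk size in GB: billions of parameters times bytes per weight."""
    return params_billions * bytes_per_weight

for label, bpw in [("BF16", 2.0), ("~q8_0", 1.0), ("~q4_0", 0.5)]:
    print(f"20b at {label}: ~{estimated_size_gb(20, bpw)} GB")
# 20b at BF16: ~40 GB, at ~q8_0: ~20 GB, at ~q4_0: ~10 GB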

As you can imagine, the more you quantize, the more mistakes the model will likely make.

Open Weight Models

These are models that you can download and run yourself. There are a few ways to do it, and here are some examples:

  • Raw transformers – this is the original format of the models
  • GGUF – This is a model that has been converted to run in llama.cpp
  • MLX – This is converted to run in Apple’s MLX

Many applications, like Ollama or LM Studio, wrap some of these and then have their own repositories to pull models from. For best speed and the fastest updates for model support, you generally want to avoid that. You can find all models here: https://huggingface.co.
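
As a concrete example of the raw-transformers route, pulling a small open-weight model straight from Hugging Face looks roughly like this (gpt2 is just a tiny stand-in; substitute whatever model you actually want to run):

from transformers import pipeline

# Downloads the weights from huggingface.co on first run, then caches them locally.
generator = pipeline("text-generation", model="gpt2")

print(generator("Open weight models let you", max_new_tokens=30)[0]["generated_text"])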

Training

LLMs learn by being “trained”. It’s a complex process that, at the absolute highest level, involves the LLM seeing billions upon billions of tokens of information and learning patterns from it. “When I see someone say this, it usually involves someone responding with that” kind of thing. This is why people constantly harp about good data in training being the most important thing- if you have really clean examples of speech, knowledge, etc, it is easier for the LLM to find the right patterns.

Eventually, more powerful LLMs start to infer new patterns that they haven’t seen before. Remember the old math problems like if A == B and B == C, then A == C? Imagine that on a MASSIVE scale, where it creates connections between information many many many many layers deep to get from A to Z.

  • Training a commercially viable model takes ungodly amounts of money and data, and you need really smart people to do it. Companies spend millions to billions of dollars making some of the most powerful models.
  • Training data is hard to come by. If you’ve heard about how some companies scraped the internet for data? That’s why. They are looking for examples of speech, knowledge, etc. When an LLM wants to train on your data, it is less that the company wants to include your personal PII in the model (they generally don’t; they don’t want that bad publicity if someone makes the model spit it out) and more that they want nice clean interactions to give to the LLM to look at and learn more patterns.
  • This is also why AI companies are mad at each other for “distilling” their products. Distilling is the act of interacting with an LLM over and over again to get examples of the LLM’s speaking or thinking process, then creating training data to teach another LLM to act or reason the same way. A recent example: DeepSeek, Moonshot AI, and MiniMax were accused of this by Anthropic. The accusation was that they were using thousands of fraudulent accounts to interact with Claude millions of times, then using those interactions to teach their own models to think and speak similarly.
  • It’s possible to train little fun models pretty cheaply. One guy recently trained a small model from scratch on 1800s text, with nothing at all modern in it. This little model has no concept of anything past the industrial age.

Finetuning / Post-Training

When you hear a non-tech company say they are “training a model”, they most likely mean finetuning or post-training an open weight model.

Imagine an LLM as a big calculator for matrix math. Numbers go in, one number comes out. Do that over and over and you get a response. The neat thing about matrix math is something called rank factorization: an m*n matrix with rank r can be represented using two smaller matrices, m*r and r*n. Some super smart folks figured out that this allows us to have LoRAs, which you can think of as add-on components to LLMs that modify the weight distribution.

In other words- rather than retraining the entire model to try to add more information, you train an itty bitty version of that model with the info you want, and then you can load the original model + LoRA at the same time to get a post-trained model.
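
The size difference is easy to see with plain numpy; a toy sketch of why a LoRA is so much smaller than the weight matrix it modifies:

import numpy as np

m, n, r = 4096, 4096, 8     # one full weight matrix vs a rank-8 adapter for it

W = np.random.randn(m, n)   # original weights: m * n parameters
A = np.random.randn(m, r)   # LoRA "down" matrix
B = np.random.randn(r, n)   # LoRA "up" matrix

delta = A @ B               # the low-rank update applied on top of W at load time

print("full matrix params:", m * n)          # 16,777,216
print("LoRA params:       ", m * r + r * n)  # 65,536 (about 0.4% of the full matrix)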

Truthfully- I am pretty staunchly in the camp that you can’t reliably train new knowledge into a model this way. That’s a very common but not a universal view within the deeper LLM tinkering community; some companies have made post-training their bread and butter. I do believe that you CAN train styles, tones, etc really well into it (for example: training a model to handle documentation a certain way, or think a certain way), but ultimately I’ve yet to see a good example of a post-trained model outside of basic Instruct models from the same manufacturer that has actually been worth the effort. Maybe there are some out there, but I’m not familiar with them.

Anyhow, long story short- you CAN post-train a small model for $100 or less, but I wouldn’t even recommend it unless you really understand what you want to get out of it and why. There’s very little a post-trained model can do that you can’t do with a good workflow, prompt and data to RAG against.

How LLMs Respond

When you boil it down, LLMs work in a really simple loop. You give it a chunk of tokens. It processes them and spits out one new token. Then it takes all your original tokens plus that one new token it just spit out, and processes the whole thing again, and spits out the next token. Then it takes all your tokens plus the two new tokens, processes again, spits out the next. On and on, one token at a time, until it decides it is done and sends a stop token. You now have your response.

To simplify it- LLMs don’t think about the response all at once- they think 1 token at a time. Over and over and over until they are done. That’s it.
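
In pseudocode, that loop looks something like this (model.next_token is a made-up stand-in for the real forward pass plus sampling, not any library’s actual API):

def generate(model, prompt_tokens, stop_token, max_tokens=512):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = model.next_token(tokens)  # predict one token from everything so far
        if next_token == stop_token:
            break
        tokens.append(next_token)              # feed it back in and go again
    return tokens[len(prompt_tokens):]         # only the newly generated part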

This is also why “reasoning” works. If you ask a model to just answer a hard math problem cold, it can fumble it, because by the time it gets to the answer it’s already locked into early tokens it picked. But if you tell it to think out loud first- write out the problem, work through it step by step- then while it’s writing all that, it’s still just predicting one token at a time, except now each new token gets to “see” all the work it just laid out. If it makes a mistake at step 2, it can sometimes catch it at step 4 and shift the line of thinking before it commits to a final answer.

If you ever watch an LLM think, and it constantly goes “But wait…”, that’s because it was trained to do that in order to stop it from locking in. It says its response, then it challenges the response, and in doing so it gives itself a chance to realize the response was wrong.

That’s basically what chain of thought and reasoning models are: the model writing out its work so it has more to reference when generating each next token. It’s not magic, it’s just giving the model more useful context to predict from. The flip side is that more reasoning means more tokens, which means more time and more cost. And some models, like Qwen3.5/3.6 and Gemma 4, overthink badly. With those, you want to use a workflow app to manually apply CoT, if you can. Since I use Wilmer everywhere, I have workflows specifically to use Qwen/Gemma with thinking disabled, and then have a manual CoT step. That helps with overthinking massively.

RAG – Retrieval Augmented Generation

This is a $5 term for a $0.05 concept. When we talk about RAG, it boils down to a very simple concept: give the LLM the answer before it responds. Everything else, when talking about RAG, is talking about a design pattern.

  • Simplest example: The simplest form of RAG would be copying the text of an article or tutorial, putting it in your prompt, and asking the LLM to answer a question about that. The LLM will use the article to answer you.
  • Next level of simplicity: You might ask an LLM a question, the LLM uses a tool (web search, local wiki search, whatever) to pull the article, concatenates it into your prompt, and answers your question.
  • What a lot of folks think of when they think of RAG: You have a program that takes thousands, or even millions, of documents and turns them into “embeddings”- ie breaks the document into logical chunks and stores them somewhere easy to retrieve off of, such as a Vector database. Then, when you ask a question, it does some fancy stuff in the background to find the right chunks and answer your question with them. Since putting 1,000,000 files into your context all at once is impossible, this is how you go about the oft-advertised “chat with your documents” situation.

But all together, RAG comes down to a very simple concept: give the LLM the answer before it responds. That’s it. LLMs are very, very strong at this, and it’s a great way to avoid hallucinations.
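
Stripped of the fancy parts, the whole pattern fits in a few lines; a sketch where retrieve() and ask_llm() are stand-ins for whatever search step and model call you actually use:

def retrieve(question: str) -> str:
    """Stand-in for your search step: vector DB lookup, wiki search, web search, whatever."""
    return "Relevant article text goes here."

def ask_llm(prompt: str) -> str:
    """Stand-in for your model call (commercial API, local llama.cpp, etc.)."""
    return "..."

def rag_answer(question: str) -> str:
    context = retrieve(question)
    # Give the LLM the answer material *before* it responds.
    prompt = (
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)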

For the most part, RAG solutions are not an LLM problem, they’re a software problem. If you’re struggling with RAG, you probably need to revisit HOW you’re feeding the data to your LLM and whether you’re giving it too much unnecessary stuff along with the right stuff.

Hallucinations

A hallucination is when the LLM responds with something that’s flat wrong. The reason it happens comes back to that loop in the How LLMs Respond section: an LLM doesn’t actually know anything. It’s a pattern matcher predicting the most likely next token based on what came before, using the patterns it learned in training (“when I see X, I usually see a response of Y”). If the most likely next token happens to be the wrong one, well, that’s what you get. This especially happens with information that doesn’t have a lot of great data out there, because the LLM had to infer the relationships. Asking a detailed question about Excel means it has millions of example questions, articles, and documents from the internet to have learned from; asking a question about FIS’ Relius Administration has far, far fewer examples, so it likely inferred a lot of things from other patterns, and it will hallucinate like mad.

LLMs, as a technology, don’t have a built-in “I’m not sure about this” lever they can pull. It just generates whatever the patterns say to generate, and confidence isn’t really part of the equation. The answer it gave you is ‘right’ from the perspective that it generated the most likely pattern. Whether that pattern is of any use to you has nothing to do with the LLM lol.

The most common reasons you see hallucinations:

  • The training data was wrong, so the pattern the model learned is wrong.
  • The training data didn’t cover the topic well, so the model is filling in gaps with whatever sounds plausible.
  • You asked something outside what the model was really trained for, and it tries to answer anyway because that’s what it was trained to do- give an answer.
  • Your context window is huge or messy, and the model is losing track of what’s actually relevant in there.
  • The model is over-quantized and just making more mistakes generally (going back to that earlier section).

Reasoning models hallucinate a bit less on certain types of problems because they get a chance to second-guess themselves while writing things out, but they absolutely still hallucinate. The single best mitigation is to put the answer in the context for it, which is RAG.

Using That Info

Knowing all this should hopefully help you start to narrow down why some of the “pro tips” of using LLMs exist. When you want a factual answer, you don’t just ask the LLM. Right or wrong, you’re getting a confident response. Instead, make sure you are injecting the right answer in before it responds- this often means tool use such as web search or, even better, “Deep Research” features you find on commercial LLMs.

This also hopefully helps you see why jamming ALL your codebase into the LLM, or constantly asking “what model has a bigger context window?”, is the wrong approach. It’s lazy to just look for bigger context windows, and that laziness will bite you. Instead, focus on how you can break the data apart so that the LLM can work within the confines of what it handles best. That means writing or downloading some supporting software.

Anyhow, good luck folks. Hope this helps the like 4 people that might read this far.

I Failed My Azure AI-102 Exam the First Time – Here’s What I Learned

There’s something nobody tells you about the Microsoft Azure AI Engineer (AI-102) certification: the practice exam and the real exam feel like two completely different tests.

I know this firsthand – because I failed the first time.

But here’s the twist: before I even attempted the exam, I had already built a real-world Retrieval Augmented Generation (RAG) system using Azure AI services for a live demonstration to associates from multiple teams in Cognizant UK and a group of colleagues from the Department of Education. I had hands-on experience with the very technology the exam covers. And I still failed.

This article is for every developer who has studied Microsoft Learn, watched the YouTube videos, sailed through the practice exams – and then walked out of the real test wondering what just happened.

Why I Decided to Take AI-102
This was entirely my own decision. Nobody asked me to do it.

I was already working with Azure AI services in my day-to-day work as a Senior Software Engineer at Cognizant, delivering enterprise and UK government applications. I wanted to formalise my knowledge, deepen my understanding of the broader Azure AI ecosystem, and demonstrate that my expertise went beyond just the services I was using on specific projects.

The AI-102 felt like the right certification – broad enough to cover the full Azure AI landscape, but technical enough to mean something.

How I Prepared
My study approach was straightforward:

  • Microsoft Learn: the official learning paths for AI-102

  • YouTube: practical walkthroughs and service deep-dives

  • Practice exams: Microsoft’s official sample questions and third-party practice tests

I studied consistently over several weeks, working through each Azure AI service systematically. Azure Cognitive Services, Azure OpenAI, Azure AI Search, Document Intelligence, Speech, Vision – I covered them all.

And when I took the practice exams, I was passing comfortably. I felt ready.

I wasn’t.

The Gap Nobody Warns You About
The Microsoft demo and practice exams are scenario-light. They test whether you know what a service does, what its key features are, and roughly when to use it.

The real AI-102 exam is fundamentally different. It is scenario-heavy.

You are not asked “what does Azure Document Intelligence do?” You are asked something closer to: “A financial services company needs to extract structured data from thousands of handwritten forms, integrate it with their existing Azure infrastructure, and ensure compliance with GDPR. Which combination of Azure AI services and configurations would you recommend, and why?”

The real exam puts you inside a business problem and asks you to think like an architect, not a student. It tests your judgement, not just your memory.

The practice exams did not prepare me for that shift in thinking. They were too easy – close-ended, straightforward, and forgiving. I passed them confidently and mistook that confidence for readiness.

What Made It Harder: I Built Before I Studied
Here is something unusual about my journey: I actually built a RAG (Retrieval Augmented Generation) system using Azure AI before I sat the exam.

I developed and demonstrated an internal AI tool that allowed users to upload documents and query them intelligently using Azure AI Search for indexing and retrieval, combined with Azure OpenAI for generation. I presented this to associates from multiple teams in Cognizant UK and a group of colleagues from the DfE as a practical demonstration of what Azure AI could do in an enterprise context.

This was not my first time sharing Azure AI knowledge internally either. Around three years earlier, I had delivered an introduction to Azure AI services to Cognizant UKI associates – covering the practical landscape of what was available and how it could be applied in real projects. The RAG demo felt like a natural evolution of that earlier session – moving from “here is what Azure AI can do” to “here is a working system built with it.”

You might think that hands-on experience would make the exam easier. In some ways it did – I understood the architecture deeply, I knew the practical challenges, and I could reason about real-world scenarios confidently.

But the exam also exposed the gaps in my theoretical knowledge. There were services and configurations I had never needed in my specific project that appeared heavily in the exam. The breadth of AI-102 is wide – and real-world projects naturally focus on a subset of that breadth.

Building first taught me the practical. The exam demanded the theoretical. The gap between them was where I stumbled.

The Second Attempt
After failing, I approached my preparation differently.

Instead of going through Microsoft Learn linearly, I focused specifically on scenario-based thinking. For every service I studied, I asked myself: “In what business situation would I choose this over the alternatives? What are the constraints, trade-offs, and compliance considerations?”

I stopped treating the services as a list to memorise and started treating them as a toolkit to reason about.

I passed on my second attempt.

What the AI-102 Actually Tests
If you are preparing for this exam right now, here is what I wish someone had told me:

  1. Scenario thinking beats memorisation

The exam will put you in business situations. Practice thinking about why you would choose a service, not just what the service does.

  2. The practice exam is too easy – don’t be fooled

Passing the Microsoft sample questions comfortably does not mean you are ready. Seek out harder, scenario-based practice materials.

  3. Breadth matters as much as depth

Even if you work with Azure AI every day, the exam covers services you may rarely touch. Study the full ecosystem, not just your daily toolkit.

  4. Real experience helps but does not replace theory

Having built RAG systems and Azure AI integrations in production gave me invaluable context – but I still needed to understand the full theoretical landscape the exam demands.

  5. Failure is data, not defeat

My first failure told me exactly where my preparation was weak. I treated it as a diagnostic, not a verdict.

Where I Am Now
I am currently renewing my AI-102 certification, which reflects how seriously I take staying current in this field. The Azure AI ecosystem moves quickly – new services, updated capabilities, evolving best practices. Keeping the certification current is not just a box to tick. It is a commitment to remaining genuinely expert in the technology I use every day.

If you are preparing for AI-102, I hope this article saves you from the same mistake I made – assuming that passing practice exams means you are ready for the real thing.

Study the scenarios. Think like an architect. And if you fail the first time, use it.

Aromal Chulliyil Muraleedharan is a Senior Software Engineer at Cognizant UK with 8+ years of experience building enterprise and UK government applications using .NET, Azure, and AI services. He holds the Microsoft Azure AI Engineer (AI-102) and Azure Developer (AZ-204) certifications.

Read on Dev.to | Connect on LinkedIn | Follow on Medium

Tags: #AzureAI #AI102 #MicrosoftAzure #CloudComputing #MachineLearning #RAG #dotnet #SoftwareEngineering

CVMetric — Free ATS Resume Builder – Built for Modern Hiring Systems

We just shipped CVMetric, a resume builder web application designed to help job seekers create ATS-optimized resumes that actually pass applicant tracking systems and reach recruiters.

Problem We Solved

Most resumes fail not because of skills, but because they are:

  • Not structured for ATS parsing
  • Missing keyword alignment with job descriptions
  • Poorly formatted for recruiter readability

👉 As a result, a large percentage of applications never reach a human reviewer.

What CVMetric Offers

✔ ATS Resume Builder with structured form-based editing
✔ Real-time resume preview with print-ready output
✔ Resume scoring system (ATS compatibility + content quality)
✔ Job description matching with skill gap detection
✔ Professional resume templates (minimal, modern, sidebar)
✔ Export to PDF, DOCX, and JSON formats
✔ Resume dashboard for managing multiple versions
✔ PDF & JSON import system to rebuild resumes into structured data

Technical Implementation Highlights

Built with a focus on scalability and structured data design:

  • Next.js (App Router) for full-stack architecture
  • React + Zustand for state management
  • MongoDB + Mongoose for resume persistence layer
  • Modular resume schema for flexible template rendering
  • Rule-based ATS scoring engine (keyword + structure analysis; see the sketch after this list)
  • Print-first design system for A4/Letter export accuracy
  • Template engine supporting multiple layout strategies
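
As a rough illustration of what deterministic, rule-based keyword scoring means (a language-agnostic sketch in Python, not CVMetric’s actual implementation, which runs on the Next.js stack listed above), the core idea is simply measuring how many job-description keywords show up in the resume:

import re

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9+]+", text.lower()))

def ats_keyword_score(resume_text: str, job_description: str) -> float:
    """Share of job-description keywords found in the resume (0.0 to 1.0)."""
    job_words = {w for w in tokenize(job_description) if len(w) > 3}
    if not job_words:
        return 0.0
    return len(job_words & tokenize(resume_text)) / len(job_words)

print(ats_keyword_score(
    "Senior engineer with React, TypeScript and MongoDB experience",
    "Looking for a React developer with TypeScript and GraphQL skills",
))  # roughly 0.43 for this toy input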

Core Engineering Focus

We prioritized:

  • Structured resume data modeling (not just UI forms)
  • Separation of content vs presentation (template system)
  • Deterministic ATS scoring logic
  • Export consistency across PDF/DOCX/print views
  • Performance-first editor architecture

What’s Next

We’re actively improving:

  • smarter job matching system
  • advanced ATS scoring rules
  • more resume templates
  • performance optimizations for large resumes

👉 Live project: cvmetric.com

Would love feedback from developers, engineers, and product builders on architecture, scalability, or UX improvements.

Claude Code Billing Alert, Workflow Enhancements & Open-Source OCR Benchmarks

Today’s Highlights

Today’s highlights include a critical billing bug affecting Claude Code users, a comprehensive cheat sheet for optimizing Claude Code workflows, and the release of DharmaOCR, an open-source 3B SLM with strong cost-performance benchmarks.

Claude Code Billing Bug: ‘HERMES.md’ in Git Commits Triggers API Rates (r/ClaudeAI)

Source: https://reddit.com/r/ClaudeAI/comments/1svdm1w/psa_the_string_hermesmd_in_your_git_commit/

A critical bug has been discovered in Claude Code’s billing system that can silently incur unexpected costs for developers. Users are reporting that the presence of the string “HERMES.md” (case-sensitive) in their Git commit history can cause Claude Code to bypass the Max plan’s bundled usage and instead bill at standard API rates. One developer reported an unexpected $200 charge due to this issue.

Anthropic’s support has acknowledged the bug, indicating it’s an internal routing error related to an experimental feature that was inadvertently enabled for some users. This issue highlights the importance for developers to scrutinize cloud service billing and API usage patterns, especially when engaging with developer tools still under active development or integration. Developers are advised to check their Git commit histories and monitor their Claude Code billing closely to avoid similar unexpected charges.

Comment: This is a serious heads-up for anyone using Claude Code and Git. Unexpected billing bugs like this can derail project budgets fast. Always double-check your commits and monitor your spend.

Claude Code Cheat Sheet for Daily Use and Enhanced Workflows (r/ClaudeAI)

Source: https://reddit.com/r/ClaudeAI/comments/1sv852q/claude_code_cheat_sheet_after_6_months_of_daily/

Following positive community feedback on a previous post, a Claude Code power-user has compiled a comprehensive “cheat sheet” based on six months of daily use. This resource aims to help developers optimize their Claude Code workflows by outlining effective commands, configuration tips, and interaction patterns. The sheet covers strategies for better prompt engineering within the Claude Code environment, managing context efficiently, and leveraging the tool for specific coding tasks such as refactoring, debugging, and generating boilerplates.

It emphasizes practical advice for developers looking to deepen their integration of Claude Code into their daily development cycle, moving beyond basic prompts to more structured and repeatable interactions that yield superior results and productivity gains. The community contribution underscores the growing importance of shared knowledge in maximizing the utility of AI-powered developer tools, providing a valuable resource for both new and experienced users.

Comment: This cheat sheet is gold for Claude Code users. It distills months of practical experience into actionable tips, especially on structuring prompts for complex coding tasks.

DharmaOCR: Open-Source 3B SLM with Cost-Performance Benchmarks (r/MachineLearning)

Source: https://reddit.com/r/MachineLearning/comments/1sun6wt/dharmaocr_opensource_specialized_slm_3b/

DharmaOCR, a new open-source Specialized Small Language Model (SLM) with 3 billion parameters, has been released on Hugging Face, complete with public models and datasets. This release is accompanied by a research paper detailing extensive experimentation and a robust cost-performance benchmark comparing DharmaOCR against larger LLMs and other open-source models specifically for Optical Character Recognition (OCR) tasks. The benchmark demonstrates DharmaOCR’s efficiency and accuracy, positioning it as a highly competitive solution for specialized text extraction, particularly where cost and latency are critical considerations.

Developers and researchers can freely access and experiment with DharmaOCR, providing a valuable resource for integrating efficient OCR capabilities into applications without the overhead of larger, more general-purpose models. The project emphasizes the potential of specialized SLMs to outperform or match larger models in specific domains, offering a practical alternative for resource-constrained environments or applications requiring fine-tuned performance. This is an excellent example of a practical, open-source tool that can be immediately tested and integrated.

Comment: An excellent example of how specialized SLMs can deliver competitive performance with better cost-efficiency for specific tasks like OCR. This is definitely worth exploring for targeted applications.

I built an AI-powered PDF generation API — here’s how

PDF generation from code is still painful in 2026. You either wrestle with complex libraries that need 200+ lines for a simple invoice, or pay for bloated enterprise services.

So I built PDFGen AI — a simple REST API where you send HTML and get a PDF URL back. Or better — describe what you want in plain English and AI generates the template for you.

The Problem

Every developer who’s tried to generate PDFs programmatically knows the pain:

  • wkhtmltopdf — outdated, rendering issues, painful to install on servers
  • Puppeteer/Playwright — powerful but heavy, needs headless Chrome
  • jsPDF — client-side only, limited styling
  • PDFKit — low-level, you’re drawing rectangles manually
  • Paid services — $50-200/month for what should be a simple API call

All I wanted was: send HTML, get a PDF. That’s it.

The Solution: One API Call

curl -X POST https://pdfgen-api.vercel.app/api/generate \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"html": "<h1>Invoice #001</h1><p>Amount: $500</p>"}'

Response:

{
  "success": true,
  "url": "https://storage.supabase.co/pdfs/invoice-abc123.pdf"
}

That’s the entire integration. No SDKs. No config files. No dependencies.

The AI Magic

Instead of writing HTML yourself, you can use AI to do the heavy lifting.

Generate a Template from a Description

curl -X POST https://pdfgen-api.vercel.app/api/ai/template \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Professional invoice with logo, line items, tax, payment terms"}'

AI generates a complete, styled HTML template you can reuse.

Fill a Template with Data Automatically

curl -X POST https://pdfgen-api.vercel.app/api/ai/fill \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "template": "<your-html-template>",
    "data": {
      "company": "Acme Corp",
      "items": [
        {"name": "Web Development", "amount": 50000},
        {"name": "Hosting", "amount": 12000}
      ],
      "tax_rate": 0.18
    }
  }'

AI maps your JSON data to the template fields — no manual field mapping needed.

JavaScript Example

const response = await fetch(
  "https://pdfgen-api.vercel.app/api/generate",
  {
    method: "POST",
    headers: {
      Authorization: "Bearer pk_your_key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      html: '<h1>Invoice #001</h1><p>Total: $1,500</p>',
    }),
  }
);

const { url } = await response.json();
console.log("PDF ready:", url);

Python Example

import requests

response = requests.post(
    "https://pdfgen-api.vercel.app/api/generate",
    headers={
        "Authorization": "Bearer pk_your_key",
        "Content-Type": "application/json",
    },
    json={
        "html": "<h1>Receipt</h1><p>Amount: $250</p><p>Status: Paid</p>"
    },
)

print("PDF ready:", response.json()["url"])

The Stack

Here’s what powers PDFGen AI:

| Layer | Technology | Why |
| --- | --- | --- |
| Hosting | Vercel (serverless) | Zero config, auto-scaling |
| Framework | Next.js (App Router) | API routes + frontend in one |
| Auth + DB | Supabase | PostgreSQL, auth, file storage |
| PDF Rendering | Puppeteer + @sparticuz/chromium | HTML to PDF in serverless |
| AI | AWS Bedrock (Nova Micro) | Fast, cheap template generation |
| Billing | Lemon Squeezy | Merchant of Record |

The Chromium Challenge

The hardest part was getting Puppeteer to work on Vercel serverless. The standard Chromium binary is too large. Here’s what worked:

  1. Use @sparticuz/chromium — stripped-down build for serverless
  2. Add outputFileTracingIncludes in next.config.ts to bundle the binary
  3. Launch with headless: "shell" mode for faster startup
  4. Disable GPU with setGraphicsMode = false

Cold-start PDF generation: under 5 seconds.
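
For the curious, here’s roughly what that looks like in code — a minimal sketch, not the actual PDFGen AI source; exact option names vary between Puppeteer and @sparticuz/chromium versions, and the helper name is mine:

// pdf.ts — minimal sketch: HTML in, PDF bytes out, inside a serverless function.
// Assumes puppeteer-core + @sparticuz/chromium; exact option names vary by version.
import chromium from "@sparticuz/chromium";
import puppeteer from "puppeteer-core";

// No GPU in the serverless sandbox, so skip graphics entirely.
chromium.setGraphicsMode = false;

export async function renderPdf(html: string): Promise<Uint8Array> {
  const browser = await puppeteer.launch({
    args: chromium.args,                              // serverless-safe Chromium flags
    executablePath: await chromium.executablePath(),  // stripped-down binary bundled at build time
    headless: "shell",                                // lighter "shell" mode for faster cold starts
  });

  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: "networkidle0" });
    return await page.pdf({ format: "A4", printBackground: true });
  } finally {
    await browser.close();
  }
}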

Why Supabase?

  • Auth — email/password with magic links, zero config
  • Database — PostgreSQL for API keys, usage tracking
  • Storage — PDFs stored with signed URLs (see the sketch after this list)
  • Free tier — generous enough for an MVP
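
A rough sketch of that storage step — again, not the service’s real code; the bucket name, path scheme, and 24-hour expiry are assumptions:

// storage.ts — sketch: upload the rendered PDF to Supabase Storage and return a signed URL.
// Bucket ("pdfs"), path scheme, and the 24-hour expiry are illustrative assumptions.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function storePdf(id: string, pdf: Uint8Array): Promise<string> {
  const path = `${id}.pdf`;

  // Upload with the right content type so browsers render the file inline.
  const { error: uploadError } = await supabase.storage
    .from("pdfs")
    .upload(path, pdf, { contentType: "application/pdf", upsert: true });
  if (uploadError) throw uploadError;

  // A signed URL keeps the bucket private while giving the caller a shareable link.
  const { data, error } = await supabase.storage
    .from("pdfs")
    .createSignedUrl(path, 60 * 60 * 24);
  if (error || !data) throw error ?? new Error("could not sign URL");

  return data.signedUrl;
}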

Why AWS Bedrock?

I originally used the Anthropic API directly, but switched to Bedrock:

  • Pay-per-use — no monthly minimums
  • Amazon Nova Micro — fast, cheap, perfect for templates
  • Bearer token auth — simple, no complex AWS SDK needed

5 Built-in Templates

PDFGen AI comes with ready-to-use templates:

  1. Invoice — line items, tax, totals
  2. Receipt — clean payment receipt
  3. Report — business report with sections
  4. Certificate — achievement/completion
  5. Letter — formal business letter

Lessons Learned Building Solo

Start with the API, not the UI. I tested with curl for weeks before building the frontend. Forced me to get the DX right.

Free tiers are your friend. Total infra cost: ~$0/month. Vercel free, Supabase free, Bedrock pay-per-call.

Billing in India is tricky. Stripe doesn’t support Indian merchants for international payments. Lemon Squeezy acts as Merchant of Record — handles global payments, pays you out.

SEO from day one. Sitemap, robots.txt, JSON-LD, OG images — all added before launch. 10x easier than retrofitting.

Ship fast. Idea to production in 4 weeks. Not perfect, but working.

Pricing

| Plan | Price | PDFs/month | AI calls/month |
| --- | --- | --- | --- |
| Free | $0 | 50 | 10 |
| Starter | $19/mo | 2,000 | 100 |
| Pro | $49/mo | 15,000 | 500 |
| Business | $99/mo | 100,000 | 2,000 |
| Enterprise | $299/mo | Unlimited | 5,000 |

No credit card required for free tier.

What’s Next

  • More built-in templates based on user requests
  • Webhook notifications when PDF is ready
  • Batch generation — hundreds of PDFs in one call
  • Custom font uploads
  • PDF merging

Try It

The API is live and free to start:

Website: pdfgen-api.vercel.app
Docs: pdfgen-api.vercel.app/docs

Sign up, grab your API key, and generate your first PDF in under a minute.

I’d love feedback — especially on the API design and developer experience. What would you build with it?

Built solo with Next.js, Supabase, AWS Bedrock, and too much coffee.

Layers Made It Universal. Harnesses Made It Run

A continuation of Flip the Axis: A Layer-Based Approach to Multi-Service Migrations.

TL;DR

You can’t script your way across a fleet of snowflake repositories. Neither can you just ask an AI agent to “migrate this service” and hope. What worked for the eight-quarter migration was a harness — a prompt pipeline in which each layer was a sequence of ordered steps: some calling scripts for deterministic changes, some using AI to discover and adapt, and some validating the results. The harness chained their outputs, ran each layer across 21 repos at once, and landed merge requests when it was done.

Here’s what that looked like in practice — including the parts we got wrong.

From methodology to machinery

The previous article ended on a line worth repeating: the methodology enables the tooling, not the other way around. This post is where that line becomes machinery.

The project: migrating 21 services from ECS to EKS — the final wave of an eight-quarter effort. Four engineers, targeting roughly ten repos per engineer per day on a given layer. The services were snowflakes: each with its own code style, CI configuration, logging framework, naming conventions, and infrastructure setup. The layers defined what to do — add an OIDC provider, swap the logging appender, rewrite the CI pipeline, set up a piece of infra — and the action was identical across services. But how to execute each layer varied across repositories. Same change, different wiring, 21 times over — toil that doesn’t yield to a single script. The layer approach made that pace imaginable. The harness made it real — though the most important piece wasn’t what we built first.

How a layer runs

Every layer ran through the same pipeline — a workflow of ordered prompts that mixed step types.

Take logging. One service is Java and uses logback.xml, another is NodeJS with a custom logging setup, and a third hides a custom appender in a shared library. You can’t script that discovery — but you can script adding the dependency once the agent finds the right config. So the workflow does both: an AI step to discover and adapt, a script step for the deterministic edit, and a validation step to check the result.

Some steps called Go tools for deterministic changes — known target, computable edit, no reasoning required. Others used AI to discover and adapt: Terraform is a good example — we could not realistically script the changes, but we could give the LLM the modules to use for adding new configuration. Others validated what the earlier ones produced. The pipeline chained their outputs, for example:

  • What a script produced in step 3 informed what the agent analyzed in step 4.
  • What the agent implemented in step 5 was what the validation prompt checked in step 7.

Step chaining
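
To make the chaining concrete, here’s a minimal sketch of a layer as data — the step names, prompts, and the runAgent/runScript helpers are invented for illustration, not our actual tooling:

// workflow.ts — sketch of one layer as an ordered, mixed-step workflow with chained outputs.
// Step names, prompts, and the runAgent/runScript helpers are invented for illustration.
declare function runAgent(prompt: string): Promise<string>;               // AI step inside an isolated session
declare function runScript(tool: string, input: string): Promise<string>; // deterministic Go tool or script

interface Step {
  kind: "script" | "ai" | "validate";
  name: string;
  // Each step sees what earlier steps produced and contributes its own output.
  run(ctx: Record<string, string>): Promise<string>;
}

export const loggingLayer: Step[] = [
  {
    kind: "ai",
    name: "discover",
    // Discovery: logback.xml? custom NodeJS logger? appender hidden in a shared library?
    run: () => runAgent("Identify how logging is configured in this repository."),
  },
  {
    kind: "script",
    name: "add-dependency",
    // Deterministic edit: once the target config is known, no reasoning is required.
    run: (ctx) => runScript("add-logging-dependency", ctx["discover"]),
  },
  {
    kind: "validate",
    name: "check",
    // Validation: review the change against the layer's acceptance criteria.
    run: (ctx) => runAgent(`Verify the new appender is wired correctly: ${ctx["add-dependency"]}`),
  },
];

export async function runLayer(steps: Step[]): Promise<Record<string, string>> {
  const ctx: Record<string, string> = {};
  for (const step of steps) {
    ctx[step.name] = await step.run(ctx); // what step N produced informs step N+1
  }
  return ctx;
}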

Why “ask Claude to migrate this service” fails

You cannot ask an AI agent to “migrate this service to EKS.” The task is too broad. The context is too large. The agent will hallucinate a plausible-looking solution, skip steps it decided weren’t important, or produce something that looks right and isn’t.

Note: The failure mode isn’t that the AI is dumb. It’s that the task has no structure, so the agent invents its own — and its structure drifts between runs.

You get 21 repositories migrated in 21 different ways, with 21 different sets of mistakes to audit. That’s worse than having done nothing.

The fix isn’t a better prompt. The fix is everything around the prompt.

Harness engineering, named a little late

A few months ago, the term harness engineering started showing up — a zero-volume search term until early 2026. The idea: design the constraints and scaffolding around an LLM that make it reliable. The question is no longer whether to use AI, but what you do after it’s part of your toolchain.

So far, the conversation is mostly about coding agents. We got here through infrastructure migration — and structured infra work is where the pattern fits most naturally. The changes follow patterns. The validation criteria are concrete. And the same change repeats across dozens of repos — which is exactly what a harness is built for.

The answer turns out to be unsexy: a pipeline that runs prompts in a fixed order, enforces what the agent can touch at each step, and validates its own work before declaring done.

I didn’t have that term in October 2025 when we built our own version of one. It’s easier to name a thing after you’ve already built it wrong twice. We just kept running into the same failure — the agent doing too much, too fast, with not enough guardrails — and kept adding constraints until it stopped failing. Mutation boundaries. Ordered steps. Explicit reasoning gates. Validation passes.

What was new — for us — was treating the harness as the primary engineering artifact. Not the prompt. Not the model. The pipeline around them.

The shift worth taking from it is this — stop optimizing the prompt, start engineering the pipeline that runs it.

What our harness looked like

A workflow is a sequence of numbered markdown files — each one a step prompt. The agent runs them in order inside an isolated Claude Code session. Context compounds: what step 2 discovered informs what step 5 decides.

The shape is always the same: context → discovery → analysis → planning → implementation → validation → ship → report.

Not every step needs the agent to reason. Some prompts call a Go tool or script for a deterministic change — same edit, known target, no drift. Others need AI to discover, analyze, and adapt. The workflow mixes both and chains their outputs forward.

Six things make this structure trustworthy:

  • Mutation boundaries. Every step is tagged READ ONLY, MAKES CHANGES, or PLANNING ONLY. The agent knows what it’s allowed to do at each point. No surprise writes during discovery.

  • Context anchoring. Step 0 explains why, not just what. “Here is why we move this piece of infrastructure and need it to be X and Z in a new destination.” “Here is why templates can’t be applied blindly.” The agent gets the intent before it touches any code.

  • Domain knowledge in the prompts. Hard-won lessons encoded directly: “BOM manages versions — you still must declare STS explicitly.” “Check actual pipeline behavior, not just config files.” These are the footnotes a senior engineer would leave for a junior one. The agent gets them every time.

The first three give the agent the right starting position. The rest keep it from wandering.

  • Early exits. If the dependency already exists, skip to the report. If a values file for a Helm chart is already configured, skip to the report. Not every repo needs every step.

  • Explicit reasoning gates. Complex steps declare an explicit requirement to reason: the agent has to think through the problem before acting — no freestyle implementation.

  • The implement/validate split. For critical layers, two separate workflows ran in sequence: one to implement, one to validate. The validation workflow reviewed the implementation against defined criteria rather than rubber-stamping its own work. This did more for reliability than any other constraint we added.

None of these is about making the agent smarter. They’re about making it predictable. Every constraint trades a degree of freedom for a degree of reliability. That’s the whole game.
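
As one concrete example of that trade, here’s a sketch of how a runner might enforce mutation boundaries by diffing the worktree around each step — an illustration of the idea, not our implementation:

// boundaries.ts — sketch: fail any READ ONLY or PLANNING ONLY step that touched the worktree.
// Uses nothing but `git status --porcelain`; the Boundary labels mirror the tags described above.
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

type Boundary = "READ ONLY" | "PLANNING ONLY" | "MAKES CHANGES";

async function dirtyFiles(repoDir: string): Promise<string[]> {
  const { stdout } = await exec("git", ["status", "--porcelain"], { cwd: repoDir });
  return stdout.split("\n").filter(Boolean);
}

export async function enforceBoundary(
  repoDir: string,
  boundary: Boundary,
  runStep: () => Promise<void>
): Promise<void> {
  const before = await dirtyFiles(repoDir);
  await runStep();
  const after = await dirtyFiles(repoDir);

  // A new dirty file during a non-mutating step is a violation — stop before it compounds.
  if (boundary !== "MAKES CHANGES" && after.length > before.length) {
    throw new Error(`Step tagged ${boundary} modified the repository: ${after.join(", ")}`);
  }
}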

The instrument

Workflows describe what the agent should do per repo. We still needed an orchestrator to run them — and to run them across a batch of repos at once.

So we built a parallel execution wrapper on top of the Claude Agent SDK. Think of it as a prompt pipeline runner. Point it at a workflow and a list of repositories, and for each repo it clones into its own git worktree, spins up an isolated Claude Code session, runs the workflow’s step files in order, enforces the mutation boundaries from the previous section, and lands a per-repo report when it’s done. One engineer monitors the batch and reviews the results.

Parallel execution diagram showing a workflow and repo list fanning out to isolated sessions per repository, converging to engineer review
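
In spirit, the wrapper does something like the sketch below — heavily simplified and hypothetical: runWorkflow stands in for the per-repo agent session, and the real tool (built on the Claude Agent SDK) also handles isolation, boundary enforcement, and reporting:

// batch.ts — sketch: fan one workflow out across a repo list, one isolated worktree per repo.
// runWorkflow is a placeholder for the per-repo agent session; everything else is plain git.
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import * as path from "node:path";

const exec = promisify(execFile);

// Placeholder: "run the workflow's step files in order inside an isolated session, return a report".
declare function runWorkflow(workflowDir: string, worktreeDir: string): Promise<string>;

export async function runBatch(workflowDir: string, repoPaths: string[], workDir: string) {
  const reports = await Promise.all(
    repoPaths.map(async (repo) => {
      const worktree = path.join(workDir, path.basename(repo));
      // A dedicated worktree per repo keeps sessions from stepping on each other's changes.
      await exec("git", ["worktree", "add", worktree], { cwd: repo });
      return { repo, report: await runWorkflow(workflowDir, worktree) };
    })
  );
  // One engineer reviews the per-repo reports instead of hand-writing N merge requests.
  return reports;
}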

Each session adapts to its repository while following the same workflow. The agent in repo A figures out one logging setup; the agent in repo B figures out a completely different one. Both produce the same kind of report. The engineer opens ten reports instead of writing ten PRs.

This is the automated flip-the-axis model. One layer, one workflow, N repos, one human in the loop. All twelve migration layers ran through this pipeline.

Prompt pipeline + parallel sessions + per-repo reports + human review at the end.

Prompts as production code

The non-obvious part: all of this lived in a shared repository. Workflows, runbooks, Go tools, scripts, per-service metadata — all version-controlled in one place. The project headquarters.

This wasn’t just a convenience. It was a knowledge-sharing mechanism.

When an engineer discovered a prompt needed more specificity — say, the IAM workflow missed an edge case with cross-account trust policies — they updated the prompt and committed it. On the next pull, every teammate got the improvement. Same for validation scripts, runbook instructions, and service metadata.

The lesson: treat AI prompts and workflows like production code. Review changes. Iterate as a team. The prompts you write in week 1 will not be the prompts you need in week 6 — and every improvement should propagate to everyone automatically.

If your team is adopting AI for infra work and the prompts live in private gists, you’ve already lost the compounding advantage.

Honest assessment

Cross-referencing was the hardest problem we didn’t fully solve. The Helm values layer required correlating data from Terraform configs, ECS task definitions, and application codebases into a single file. AI agents working within a single repository can’t hold that full picture. Human judgment and manual cross-checking stayed essential here, and I don’t think that’s going to change for this class of problem anytime soon.

The reliability gradient was predictable. Within each workflow, the deterministic steps — the ones calling scripts — never drifted. The AI steps were reliable when isolated to a single file following a repeatable pattern. Reliability dropped when a step required cross-repository context or judgment calls about tradeoffs. Those stayed with humans.

We validated — but not enough. The implement/validate split worked wherever we applied it — it was the highest-leverage pattern in the whole setup. But we didn’t apply it to every layer, and we should have. The same workflows we used for implementation could have run in verification mode at near-zero additional cost. Our QA and PREPROD environments caught what universal validation would have. They shouldn’t have had to.

If I were starting this project again tomorrow, the first thing I’d build is a validation workflow for every implementation workflow, from day one. Not as an afterthought. As a matching pair.

What this means

Layers made the work universal. Harnesses made the AI trustworthy enough to run it — even across 21 repos, each wired differently. Together they closed out an eight-quarter migration — four engineers, not forty.

The next time someone frames it as “just write a script” versus “just use AI” — it’s the wrong question. Build the harness that runs both.

Further reading:

  • Harness Design for Long-Running Application Development — Anthropic’s engineering team on the same concept applied to agentic coding.
  • Harness Engineering — OpenAI’s take on the same discipline.
  • It’s a Skill Issue: Harness Engineering for Coding Agents — HumanLayer on harness configuration for coding agents.

Flip the Axis: A Layer-Based Approach to Multi-Service Migrations

TL;DR

When you’re migrating many services through the same steps, parallelize by step, not by service. Sweep one type of change across all services, then the next – it compounds learning, catches inconsistencies early, and makes automation viable. But recognize which services don’t fit the pattern: architecturally unique services should still be migrated serially.

The Problem

You’ve probably seen the shape of this problem before. You’re planning next quarter’s migration – could be Kubernetes, a new database engine, a cloud provider switch, a major framework version bump. You count the services. You count the engineers. The math doesn’t work.

Here’s what it looked like for us: 2025 Q4 planning, 21 services still running on ECS (Amazon’s container orchestration service) that needed to move to EKS (their managed Kubernetes platform). A headcount cut left us with 4 engineers. Each service migration had been taking 3-4 weeks of effort. The project had already been running since January 2024 – nearly two years – and the serial execution model from the previous quarter had required 8 engineers for 19 services. We had half the people, more services, and a timeline that was starting to feel permanent.

Nobody was telling us it had to be done next quarter. Our management said, “It’s okay if we can’t, don’t worry.”

But you know what happens to migrations that stretch for many months. They lose momentum. Engineers rotate off. Institutional knowledge erodes. The remaining services – always the hardest ones – sit in a permanent “next quarter” backlog.

You don’t have a staffing problem. You have an execution model problem.

Why serial breaks down

The default migration approach is serial: one engineer owns a service end-to-end and walks it through every step – networking, permissions, environment adjustments, certificates, CI/CD, DNS, cleanup. This works fine for a couple of services. This breaks down at scale.

The engineer context-switches across completely different types of work – networking, then application configuration, then debugging a permissions issue – and never builds deep fluency in any of them. Services are unique snowflakes – each with its own code style, dependency patterns, and configuration quirks. Serial migration means absorbing that uniqueness for every service, at every step.

Even worse – learning stays siloed. An engineer who figured out a networking edge case in week 2 can’t help the engineer who hits the same issue in week 6 – by then, they’ve moved on. Everyone is deep in a different service, at a different stage. The team can’t effectively pair, review, or unblock each other.

The Insight

When I looked at what we’d actually done in Q3 – service by service, step by step – the pattern was obvious in hindsight: we were doing the same work over and over again. Networking, permissions, application setup – identical across services.

It only looked unique because we were thinking one service at a time.

Serial Migration Strategy

What if we flipped the axis? Instead of completing all steps for one service before moving to the next, complete one step across all services before moving to the next step.

That’s the core of the layer-based approach. A layer is one type of infrastructure or configuration change, applied to every service in the migration scope. You sweep through all services at one layer, validate, then move to the next layer.

Layer-based Migration Strategy

Why this works

  • Repetition builds expertise. By the third service in a layer, you’ve seen the pattern. By the tenth, you’re fast.
  • Cross-service checks catch errors early. When you’re applying the same change to 20 services in a row, inconsistencies become obvious.
  • Learning compounds across the team. Everyone works the same layer simultaneously – discoveries spread instantly instead of weeks later.
  • Automation becomes viable. Identical changes across services are exactly what tooling excels at – predictable patterns with minor per-service variations.

Defining Your Layers

The number of layers depends on your migration. Ours had 14. Yours might have 8 or 20.

Here are the categories we found useful, grouped by concern:

  • Discovery: mapping downstream dependencies – services, databases, endpoints, protocols
  • Connectivity: networking between environments, firewall configurations
  • Identity: permissions, service accounts, trust policies, OIDC configuration
  • App level security: certificates, TLS termination, WAF rules
  • Application: runtime configuration, environment variables, secrets, logging adjustments
  • Delivery: CI/CD pipelines, ingress and routing, traffic management for gradual rollout

Your categories will differ. The names don’t matter – the decomposition does.

How to decompose your own migration

Start from a single service migration you’ve already done. List every change you made, in order. Group changes by type, not by when they happened. Each group is a candidate layer.

Then validate: can this layer be applied independently of the next one? Can you validate it before moving on? If yes, it’s a good layer boundary. If two changes are tightly coupled and can’t be validated separately, merge them into one layer.

One rule we learned the hard way: one layer per pull request. Early on, some PRs combined changes from multiple layers – networking and permissions in the same commit. Validation got complex, rollbacks got messy. Keep them separate.

Execution Model

A layer sweep works like this: the team takes on a layer, splits the service list among themselves, and each engineer applies that layer to their assigned services. Everyone works the same type of change simultaneously.

One engineer can realistically sweep a single layer across 6-8 services in a day. That number surprises people – until they know the tooling. We paired the layered methodology with AI-assisted automation that handled the repetitive configuration work across services. But the important thing is: the layer-based structure is what makes that automation possible.

When every service needs the same type of change with minor variations, you can build prompts, scripts, and validation checks that apply across the board. Serial, per-service work is too varied to automate effectively. The AI tooling story – what worked, what failed, and where human judgment was irreplaceable – is the subject of the next post in this series.

During execution, the team meets briefly to sync on edge cases – because the work is homogeneous, an edge case in one service is immediately relevant to every other service going through the same layer.

Progress tracking

A simple table – one column per layer, one row per service – serves as the source of truth. The team updates it in real time. Status per cell: not started, in progress, done, blocked, not applicable. This sounds basic, but it’s surprisingly effective. You can see at a glance where the project stands, which layers are complete, and where blockers are clustering.
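
For illustration only — the services, layers, and statuses below are made up — the whole tracker is little more than this:

// tracker.ts — the progress table as data: one row per service, one cell per layer.
type Status = "not started" | "in progress" | "done" | "blocked" | "n/a";
type Layer = "connectivity" | "identity" | "application" | "delivery";

const progress: Record<string, Record<Layer, Status>> = {
  "payments-service": { connectivity: "done", identity: "done", application: "in progress", delivery: "not started" },
  "search-service":   { connectivity: "done", identity: "blocked", application: "not started", delivery: "not started" },
};

// "Where are the blockers clustering?" is a one-liner over the same data.
export const blockedServices = Object.entries(progress)
  .filter(([, layers]) => Object.values(layers).includes("blocked"))
  .map(([service]) => service);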

Services that don’t fit

Not every service fits the pattern. In our case, 4 out of 21 services were architecturally complex enough that the layered approach didn’t help – they required deep, per-service analysis that negated the speed advantage.

We recognized this early and migrated them serially, with dedicated engineers working in parallel with the layer sweeps. Trying to force these into the pattern would have slowed everything down.

The lesson: the layer-based approach is a force multiplier for homogeneous work. When a service is genuinely unique, serial migration is the right tool. Budget for both.

Coordination That Matches the Work

The coordination model that works during one phase can hurt you in the next.

During layers: synchronize

When the whole team works the same layer, synchronous coordination is natural and cheap. Team syncs are short because everyone has context on the same type of work. An edge case discovered by one engineer is immediately useful to the others. Knowledge transfer happens without any deliberate mechanism – the work itself is identical.

During traffic switching: structure async handoffs

When the project moves from layer execution to per-service traffic switching, the work diverges. Each service has its own timeline, its own blockers, its own owning team with a different schedule. Synchronous coordination becomes expensive – the team is now working on different problems.

This is where a handoff log pays for itself. A shared document – not “made progress on Service X” but the actual PR link, the specific blocker, the decision to skip WAF configuration for this service, and why. What made it work: specificity over summary, explicit ownership, and early surfacing of blockers.

We heavily used this approach during the last phase of migration – the traffic switch – when two team members went to our SF hub to be on site with service owners, and two stayed in Berlin. But this isn’t a timezone trick – it works for co-located teams just as well. Fewer meetings, more focused execution, and a written record that prevents “I thought you were handling that” conversations.

The lesson: match the coordination model to the shape of the work. When work is homogeneous, synchronize. When it diverges, structure async handoffs and get out of each other’s way.

The Traffic Switch Cadence

Layer execution is predictable. Traffic switching is where the surprises live.

We used a graduated weekly cadence: Monday preflight (verify hostnames, certificates, ingress, autoscaling, dashboards – deploy one instance), Tuesday scale up and shift 1% of traffic, Wednesday observe and fix, Thursday shift to 50%, Friday observe and fix, following Monday shift to 100%.

The observation days weren’t idle – they were when most of the debugging happened. Issues that don’t surface at 1% show up at 50%. Fixes discovered for one service often apply to others in the same batch.

Batch your traffic switches. Running multiple services through this cadence simultaneously amortizes the coordination overhead – the preflight checklist, once built, applies to every service.

When This Works (and When It Doesn’t)

The layered approach is not universal. It works well under specific conditions.

Use layers when:

  • The migration is decomposable into independent, repeatable steps
  • The same type of change applies across many targets with minor per-service variations
  • Changes can be batch-validated – all services at one layer before moving on
  • The team is small relative to the workload and needs a force multiplier

Use serial when:

  • Services are architecturally unique and complex, and require deep, per-service analysis
  • The number of targets is small enough that coordination overhead outweighs the parallelization benefit

This is not an either/or decision. In our migration, the layered approach covered 17 of 21 services. The remaining 4 were migrated serially. Recognizing which services don’t fit the pattern early is just as important as the pattern itself.

What We’d Do Differently

Start the handoff log from day one. We introduced it when the team split across workstreams during traffic switching. In retrospect, the discipline of specificity and explicit ownership helps even when everyone is in the same room working the same layer.

Run validation sweeps after each layer, not at the end. We deferred some validation until the traffic switch on preproduction environments, which made fixing errors more expensive and created pressure during the most time-sensitive window.

Define service owner readiness criteria upfront. Some services reached the traffic switch phase with owners who weren’t fully briefed, dashboards that weren’t adjusted, etc. Clear criteria before the switch phase would have eliminated friction during the highest-pressure window.

Plan for the energy arc. An intensive, multi-month migration grinds people down. Build rotation points into the plan. Bring fresh perspective at deliberate moments – especially before the production switch phase.

Track decisions explicitly, separate from action items. Some decisions logged in the handoff document were missed because they were buried among task updates. A dedicated “decisions” section prevents teams from diverging without realizing it.

Key Takeaways

  1. Flip the axis. When many services go through the same steps, parallelize by step, not by service. The efficiency gain comes from repetition, shared learning, and automation – not from working harder.
  2. Define your layers by decomposing a single service migration. Group changes by type, validate that layers can be applied independently, and enforce one layer per merge request.
  3. Match coordination to the shape of the work. Synchronize when work is homogeneous. Structure async handoffs when it diverges.
  4. Recognize what doesn’t fit the pattern. Some services are genuinely unique. Budget for serial migration alongside the layer sweeps.
  5. The traffic switch is its own phase. Layer execution is predictable. Traffic switching is where surprises live. Treat it with a graduated cadence and observation days.
  6. The methodology enables the tooling, not the other way around. We paired layer-based execution with AI-assisted automation – and that’s what made one engineer sweeping 6-8 services in a day realistic. But the automation only worked because the layers created predictable, repeatable patterns. That story is next.

If this feels like a problem you’ve hit – or you’re about to – I’d like to hear your approach. Same constraint, different solution? A migration where layers didn’t work? Drop a comment.