Building AI agents with Vercel AI SDK

The Vercel AI SDK treats agents as tool-calling loops: the model generates text or invokes tools, the SDK runs those tools, and the loop continues until the model answers or a stop condition fires.

This post builds a support triage agent that looks up customers and invoices, searches an internal knowledge base, and either opens a ticket or escalates to a human. It builds on the LLM integration with Vercel AI SDK post and focuses on multiple tools, stopWhen, and stepCountIs.

For external tools exposed over MCP instead of SDK-native tool() handlers, see the MCP server with Node.js post.

Prerequisites

  • OpenAI account
  • Generated API key
  • Enabled billing
  • Node.js version 26
  • ai, @ai-sdk/openai, and zod installed (npm i ai @ai-sdk/openai zod)
  • Client setup from the Vercel AI SDK integration post

Mental model – steps and the tool loop

A step is one model generation. In that step the model either:

  • returns text (the loop ends), or
  • returns tool calls (the SDK executes them and starts another step with the results)

Typical flow for the support triage agent: user question → model calls lookup tools (getCustomer, getInvoice, searchKnowledgeBase) → model creates a ticket or escalates → final answer. stopWhen can end the loop before or after the write tools run.

stepCountIs(5) means “stop after 5 steps” (five model generations), not five individual tool calls. A single step can include multiple parallel tool calls.

When you pass tools without stopWhen, the SDK defaults to stepCountIs(20) as a safety cap.

Support triage scenario

Example prompt:

Customer cus_1042 says they were charged twice for invoice inv_8891. What should we do?

A realistic chain:

  1. getCustomer – plan tier, open ticket count
  2. getInvoice – amount, status, payment IDs
  3. searchKnowledgeBase – duplicate-charge and refund policy
  4. createSupportTicket or escalateToHuman – write action or sentinel stop

The demo uses in-memory fixtures (customers, invoices, knowledge-base articles) so scripts run without a database.
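The fixtures can be as small as a few typed arrays. The sketch below is an assumption about their shape, not the demo's actual data: the field names (plan, openTickets, amountUsd, paymentIds) and every value are hypothetical, chosen to match what the tool descriptions mention.

```typescript
// Hypothetical in-memory fixtures; field names and values are illustrative only.
interface Customer {
  id: string;
  plan: 'starter' | 'pro' | 'enterprise';
  openTickets: number;
}

interface Invoice {
  id: string;
  customerId: string;
  amountUsd: number;
  status: 'paid' | 'open' | 'refunded';
  paymentIds: string[];
}

interface Article {
  id: string;
  title: string;
  body: string;
}

const customers: Customer[] = [{ id: 'cus_1042', plan: 'pro', openTickets: 1 }];

const invoices: Invoice[] = [
  // Two payment IDs on one invoice model the duplicate-charge scenario.
  {
    id: 'inv_8891',
    customerId: 'cus_1042',
    amountUsd: 49,
    status: 'paid',
    paymentIds: ['pay_a', 'pay_b'],
  },
];

const articles: Article[] = [
  {
    id: 'kb_1',
    title: 'Duplicate charges',
    body: 'Refund the duplicate payment after confirming both payment IDs.',
  },
];
```

Keeping the fixtures typed means the tool handlers get autocomplete and the model receives consistently shaped results.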

Defining multiple tools

Register tools with tool() and Zod inputSchema. Clear description values help the model pick the right tool.

import { tool } from 'ai';
import { z } from 'zod';

const getCustomer = tool({
  description: 'Look up a customer account by ID',
  inputSchema: z.object({
    customerId: z.string().describe('Customer ID, e.g. cus_1042'),
  }),
  execute: async ({ customerId }) => {
    const customer = customers.find((item) => item.id === customerId);
    if (!customer) {
      return { found: false, customerId, error: 'Customer not found' };
    }
    return { found: true, customer };
  },
});

const getInvoice = tool({
  description: 'Look up an invoice by ID, including payment IDs and status',
  inputSchema: z.object({
    invoiceId: z.string().describe('Invoice ID, e.g. inv_8891'),
  }),
  execute: async ({ invoiceId }) => {
    const invoice = invoices.find((item) => item.id === invoiceId);
    if (!invoice) {
      return { found: false, invoiceId, error: 'Invoice not found' };
    }
    return { found: true, invoice };
  },
});

const searchKnowledgeBase = tool({
  description: 'Search internal support articles by keyword',
  inputSchema: z.object({
    query: z.string().describe('Search terms, e.g. duplicate charge refund'),
  }),
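  execute: async ({ query }) => {
    // Keyword match against mocked articles. Assumes an in-memory `articles`
    // fixture whose entries have `title` and `body` strings.
    const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
    const matches = articles.filter((article) => {
      const haystack = `${article.title} ${article.body}`.toLowerCase();
      return terms.some((term) => haystack.includes(term));
    });
    return { query, articles: matches };
  },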
});

Add write tools for outcomes:

const createSupportTicket = tool({
  description: 'Create a support ticket after gathering customer and policy context',
  inputSchema: z.object({
    customerId: z.string(),
    subject: z.string().min(3),
    priority: z.enum(['low', 'medium', 'high']),
    summary: z.string().min(10),
  }),
  execute: async (input) => {
    const ticket = createTicket(input);
    return { created: true, ticket };
  },
});

const escalateToHuman = tool({
  description: 'Escalate when policy requires manual review',
  inputSchema: z.object({
    customerId: z.string(),
    reason: z.string().min(10),
    urgency: z.enum(['normal', 'high']),
  }),
  execute: async (input) => ({
    escalated: true,
    queue: input.urgency === 'high' ? 'billing-urgent' : 'billing-standard',
    ...input,
  }),
});

Return structured objects from execute. The SDK serializes them as tool results for the next step. Return explicit errors (for example { found: false, error: '...' }) so the model can recover instead of throwing.
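One way to apply this consistently is a small wrapper around each handler. This is a sketch, not part of the SDK: withStructuredErrors and the ToolResult type are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: converts thrown errors into structured tool results.
type ToolResult<T> = { ok: true; data: T } | { ok: false; error: string };

function withStructuredErrors<I, O>(
  handler: (input: I) => Promise<O>,
): (input: I) => Promise<ToolResult<O>> {
  return async (input) => {
    try {
      return { ok: true, data: await handler(input) };
    } catch (err) {
      // Hand the message back to the model instead of aborting the loop.
      const message = err instanceof Error ? err.message : String(err);
      return { ok: false, error: message };
    }
  };
}

// Usage: a lookup that throws on a missing record.
const lookupInvoice = withStructuredErrors(
  async ({ invoiceId }: { invoiceId: string }) => {
    if (invoiceId !== 'inv_8891') throw new Error(`Invoice ${invoiceId} not found`);
    return { invoiceId, status: 'paid' };
  },
);

lookupInvoice({ invoiceId: 'inv_0000' }).then((result) => console.log(result));
```

The wrapped function could then be passed as a tool's execute handler, so an unexpected exception reaches the model as data it can act on.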

Multi-step triage with generateText

Pass all tools and a system prompt with triage rules:

import { generateText, stepCountIs } from 'ai';
import { openai } from '@ai-sdk/openai';

const { text, steps } = await generateText({
  model: openai('gpt-5.5'),
  system: `You are a billing support triage agent.
- Look up customer and invoice before recommending refunds.
- Search the knowledge base for policy guidance.
- Create a ticket when you can resolve within policy.
- Call escalateToHuman when manual review is required.`,
  tools: {
    getCustomer,
    getInvoice,
    searchKnowledgeBase,
    createSupportTicket,
    escalateToHuman,
  },
  stopWhen: stepCountIs(8),
  prompt:
    'Customer cus_1042 says they were charged twice for invoice inv_8891. What should we do?',
});

console.log('steps:', steps.length);
console.log(text);

Use a model that supports tool calling (same requirement as web search in the Vercel AI SDK post).

stopWhen – when the loop stops

stopWhen defines stopping conditions for the tool loop. Conditions are evaluated only when the last step contains tool results.

  • A single condition stops when that condition returns true
  • An array stops when any condition returns true (OR logic)
  • Without stopWhen, the SDK applies stepCountIs(20)

The loop also ends naturally when the model returns text without further tool calls.
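The OR logic is easy to see in a standalone simulation. The sketch below does not import the SDK: StepLike and StopConditionLike are local stand-ins whose shapes are assumptions, and tokenBudgetSpent is a hypothetical custom condition added for illustration.

```typescript
// Local stand-ins for the SDK's step and stop-condition shapes (assumptions, not imports).
interface StepLike {
  usage: { totalTokens: number };
}
type StopConditionLike = (options: { steps: StepLike[] }) => boolean;

// Comparable in spirit to stepCountIs(n).
const stepCapReached = (maxSteps: number): StopConditionLike =>
  ({ steps }) => steps.length >= maxSteps;

// Hypothetical custom condition: stop once the run has used too many tokens.
const tokenBudgetSpent = (maxTokens: number): StopConditionLike =>
  ({ steps }) =>
    steps.reduce((sum, step) => sum + step.usage.totalTokens, 0) >= maxTokens;

// An array of conditions stops the loop when any one returns true (OR logic).
function shouldStop(conditions: StopConditionLike[], steps: StepLike[]): boolean {
  return conditions.some((condition) => condition({ steps }));
}

const conditions = [stepCapReached(8), tokenBudgetSpent(5000)];
const cheapRun: StepLike[] = [
  { usage: { totalTokens: 1200 } },
  { usage: { totalTokens: 900 } },
];
const costlyRun: StepLike[] = [...cheapRun, { usage: { totalTokens: 3400 } }];

console.log(shouldStop(conditions, cheapRun)); // false
console.log(shouldStop(conditions, costlyRun)); // true
```

The second run stops on the token budget long before the step cap, which is the point of combining conditions.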

stepCountIs – cap the number of steps

stepCountIs(n) stops once steps.length reaches n. Use it on every production agent to prevent runaway loops and unbounded API cost.

| Use case | Suggested cap |
| --- | --- |
| Single tool, then answer | 2 (tool step + text step) |
| Chat with occasional tool use | 3-5 |
| Task agents (triage, research) | 8-15 |
| Long autonomous workflows | 15-20 (with monitoring) |

Tight vs relaxed cap on the same prompt:

import { generateText, stepCountIs } from 'ai';

// Stops after 3 steps even if the model still wants more context
const capped = await generateText({
  model: openai('gpt-5.5'),
  tools: supportTools,
  stopWhen: stepCountIs(3),
  prompt: '...',
});

// Allows a fuller investigation chain
const relaxed = await generateText({
  model: openai('gpt-5.5'),
  tools: supportTools,
  stopWhen: stepCountIs(8),
  prompt: '...',
});

Combining hasToolCall with stepCountIs

hasToolCall('toolName') stops when the model invokes a specific tool in the latest step. Pair it with stepCountIs for a hard cap plus a sentinel tool:

import { generateText, stepCountIs, hasToolCall } from 'ai';

const { text, steps } = await generateText({
  model: openai('gpt-5.5'),
  system: TRIAGE_INSTRUCTIONS,
  tools: supportTools,
  stopWhen: [stepCountIs(10), hasToolCall('escalateToHuman')],
  prompt:
    'Customer cus_2201 on the starter plan reports a duplicate $190 charge on invoice inv_9104.',
});

escalateToHuman works well as a sentinel: the loop stops as soon as the model decides the case needs a human, without waiting for a final text-only step.
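Because the loop can end without a final text-only step, text may be empty in this case, and the escalation details are best read from the recorded steps. A minimal sketch, assuming each step exposes toolCalls with toolName and input fields; the local types and the findToolCall helper are hypothetical, not SDK exports.

```typescript
// Local stand-ins for the parts of a step this helper reads (assumed shape).
interface ToolCallLike {
  toolName: string;
  input: unknown;
}
interface StepWithCalls {
  toolCalls: ToolCallLike[];
}

function findToolCall(steps: StepWithCalls[], toolName: string): ToolCallLike | undefined {
  // Search from the latest step backwards; a sentinel call is normally in the last step.
  for (let i = steps.length - 1; i >= 0; i--) {
    const match = steps[i].toolCalls.find((call) => call.toolName === toolName);
    if (match) return match;
  }
  return undefined;
}

// Illustrative data shaped like a run that ended on the sentinel tool.
const exampleSteps: StepWithCalls[] = [
  { toolCalls: [{ toolName: 'getCustomer', input: { customerId: 'cus_2201' } }] },
  {
    toolCalls: [
      {
        toolName: 'escalateToHuman',
        input: {
          customerId: 'cus_2201',
          reason: 'Duplicate charge needs manual review',
          urgency: 'high',
        },
      },
    ],
  },
];

const escalation = findToolCall(exampleSteps, 'escalateToHuman');
console.log(escalation ? escalation.input : 'no escalation');
```

A caller can then branch on whether the run ended in an escalation or in a normal text answer.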

Inspecting steps and usage

The steps array on the result contains per-step tool calls, tool results, finish reason, and usage. Use it for debugging and cost tracking:

const { text, steps, totalUsage } = await generateText({
  model: openai('gpt-5.5'),
  tools: supportTools,
  stopWhen: stepCountIs(8),
  prompt: '...',
});

for (const [index, step] of steps.entries()) {
  console.log(`step ${index + 1}`);
  console.log('  toolCalls:', step.toolCalls?.map((c) => c.toolName));
  console.log('  usage:', step.usage);
}

console.log('totalUsage:', totalUsage);

With streamText, pass onStepFinish to log each step as it completes.

ToolLoopAgent – reusable agent definition

ToolLoopAgent wraps the same loop for reuse across scripts and API routes. It accepts the same settings as generateText (tools, stopWhen, instructions).

import { ToolLoopAgent, stepCountIs } from 'ai';

const supportTriageAgent = new ToolLoopAgent({
  model: openai('gpt-5.5'),
  instructions: TRIAGE_INSTRUCTIONS,
  tools: supportTools,
  stopWhen: stepCountIs(8),
});

const result = await supportTriageAgent.generate({
  prompt:
    'Customer cus_1042 says they were charged twice for invoice inv_8891. What should we do?',
  onStepFinish: async ({ stepNumber, usage, toolCalls }) => {
    console.log(`step ${stepNumber + 1}`, {
      tokens: usage.totalTokens,
      tools: toolCalls?.map((call) => call.toolName),
    });
  },
});

console.log(result.text);

Use .stream() for streaming. For Next.js chat UIs, see createAgentUIStreamResponse in the AI SDK agents docs.

Streaming with tools

streamText supports the same tools and stopWhen settings:

import { streamText, stepCountIs } from 'ai';

const result = streamText({
  model: openai('gpt-5.5'),
  system: TRIAGE_INSTRUCTIONS,
  tools: supportTools,
  stopWhen: stepCountIs(8),
  prompt: 'Customer cus_1042 says they were charged twice for invoice inv_8891.',
  onStepFinish: async ({ stepNumber, toolCalls }) => {
    console.error(`step ${stepNumber + 1}:`, toolCalls?.map((c) => c.toolName));
  },
});

for await (const part of result.textStream) {
  process.stdout.write(part);
}

Text streams incrementally. Tool calls run between text segments as the loop progresses.

Production notes

  • Always set stopWhen – do not rely on the default stepCountIs(20) in production without monitoring
  • Cost – each step is another model call; log steps or onStepFinish usage
  • Tool errors – return structured errors from execute instead of throwing when the model should retry or escalate
  • Instructions – keep policy rules in system / instructions, not only in the user prompt
  • Same patterns elsewhere – PR review (listPRs → getChecks → submitReview) or job-fit scoring use the same loop mechanics with different tools

Demo

Runnable scripts for each section live in the vercel-ai-sdk-agents-demo folder. Get access via code demos.

The Onboarding Math: What Each New Hire Actually Costs When Your AI Stack Is Fragmented

Here is a number most finance teams have never calculated: the total cost of getting one new employee to full productivity in a fragmented digital environment.

Not salary. Not benefits. Not the recruiting fee. The cost of the time between their start date and the date they can operate independently at full output — including the time of every colleague who helps them get there.

For companies with consolidated, well-designed digital environments, this number runs 6 to 10 weeks of fully-loaded compensation. For companies with fragmented stacks of 8 to 12 tools that don’t integrate cleanly, it runs 14 to 20 weeks.

That gap, roughly 8 to 10 weeks of productive capacity per hire when the low ends and the high ends of the two ranges are compared like for like, is one of the most expensive invisible costs in enterprise operations. And it scales directly with headcount growth.

The Calculation Most Companies Skip

The standard onboarding cost model looks like this: recruiter fee plus first-month salary plus benefits plus laptop plus software licenses. A mid-market company bringing on a $120,000 salary employee calculates something in the range of $15,000 to $20,000 in onboarding costs.

The actual cost model should look like this.

Assume a fully-loaded cost of $85 per hour for the new employee. Assume a productivity ramp that reaches 50% effectiveness at week 4 and 80% effectiveness at week 10, reaching full productivity at week 14 for a company with a moderately complex tool stack.

The opportunity cost of the ramp period — the output not produced compared to a fully productive employee — runs approximately $18,000 to $24,000 per hire at this salary level, before accounting for any manager or colleague time invested.
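The ramp figure can be reproduced with a few lines of arithmetic. The sketch below rests on assumptions the text does not state: a 40-hour week, productivity starting at 0% on day one, and straight-line interpolation between the stated milestones. Under those assumptions the result lands inside the quoted range.

```typescript
// Assumptions not in the text: 40-hour weeks, 0% productivity at week 0,
// and linear interpolation between milestones.
interface Milestone {
  week: number;
  productivity: number; // 0 to 1
}

function lostWeeks(milestones: Milestone[]): number {
  let lost = 0;
  for (let i = 1; i < milestones.length; i++) {
    const prev = milestones[i - 1];
    const next = milestones[i];
    // Average shortfall over the segment times the segment length in weeks.
    const avgProductivity = (prev.productivity + next.productivity) / 2;
    lost += (1 - avgProductivity) * (next.week - prev.week);
  }
  return lost;
}

const ramp: Milestone[] = [
  { week: 0, productivity: 0 },
  { week: 4, productivity: 0.5 },
  { week: 10, productivity: 0.8 },
  { week: 14, productivity: 1 },
];

const hourlyCost = 85;
const hoursPerWeek = 40;
const rampCost = lostWeeks(ramp) * hoursPerWeek * hourlyCost;

console.log(Math.round(rampCost)); // 18700
```

That is about 5.5 weeks of lost output. A slower start or a longer working week pushes the figure toward the top of the range.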

Now add the colleague time. A new employee in a 10-tool environment asks more questions, requires more hand-holding on where things live, and pulls more time from more experienced colleagues than a new employee in a 4-tool environment. Conservative estimate: 8 hours of senior colleague time in week one, declining to 2 hours per week by month two. Total: 40 to 50 hours of experienced-employee time per new hire.

At $85 to $120 per hour for the colleagues involved, that’s $3,400 to $6,000 in real cost that shows up nowhere in the onboarding budget.

For a company hiring 30 people per year at this salary level, the aggregate fragmentation tax on onboarding runs $650,000 to $900,000 annually. Not as a line item. As a diffuse, invisible drag on productivity that never surfaces in any report.
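The aggregate follows directly from the per-hire ranges above. A quick check, using only numbers already stated in the text:

```typescript
// Per-hire ranges quoted in the text, in dollars: [low, high].
const rampCostRange = [18000, 24000];
const colleagueCostRange = [3400, 6000];
const hiresPerYear = 30;

// Add the two per-hire costs at each end of the range, then scale by hiring volume.
const annualCost = rampCostRange.map(
  (rampCostPerHire, i) => (rampCostPerHire + colleagueCostRange[i]) * hiresPerYear,
);

console.log(annualCost); // [ 642000, 900000 ]
```

The low end comes out at $642,000, which the figure above rounds to $650,000.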

What Creates the Fragmentation Tax

The root cause is straightforward. Every tool in the stack is a thing a new employee must learn. Not just the interface — the conventions, the permissions model, where different types of content live, which channel is authoritative for which decisions, how the tool integrates with the other tools they’re also learning simultaneously.

A 4-tool environment has roughly 6 integration relationships to understand (each tool to each other tool). A 10-tool environment has 45. The cognitive load of navigating those relationships, before the new employee can focus on their actual job, is substantial.
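Those counts are the number of distinct pairs among n tools, n(n - 1) / 2, which grows roughly with the square of the stack size. A small sketch (the function name is illustrative):

```typescript
// Number of distinct tool-to-tool pairs in a stack of `tools` tools.
function pairwiseRelationships(tools: number): number {
  return (tools * (tools - 1)) / 2;
}

console.log(pairwiseRelationships(4)); // 6
console.log(pairwiseRelationships(10)); // 45
```

Going from 4 to 10 tools multiplies the tool count by 2.5 but the number of relationships by 7.5.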

The fragmentation tax compounds for roles that require cross-functional visibility. A project manager who needs to pull status from 6 different systems, synthesize it, and report upward is performing coordination labor that serves no other output. That labor is invisible because it’s embedded in their job description — but it’s directly caused by the tool architecture and would be reduced or eliminated by consolidation.

The Onboarding Efficiency Benchmark

A useful benchmark: what is your time-to-independent-operation for a new hire in a standard individual contributor role?

Not time to first task completion. Time to independent operation — the point at which the employee can navigate all required systems, locate information without asking, complete their core workflows without assistance, and contribute to cross-functional projects without needing to be oriented by colleagues.

For companies that have measured this with any precision, the results correlate strongly with tool stack complexity. Sub-8-week time-to-independence is achievable with consolidated environments. 16-plus weeks is typical for highly fragmented environments.

If you don’t know your current number, it’s worth measuring. Ask managers from three departments: how long until a new hire in this role can operate fully independently? The variance in the answers will tell you something. The averages will tell you something else.

Where This Intersects with AI Tools

The emergence of AI tools in enterprise environments adds a new dimension to this calculation.

In a consolidated AI environment — where the AI is integrated into the workspace teams already use — new employees learn the AI as part of learning the environment. The AI has access to the same context the employee is learning to navigate.

In a fragmented AI environment — where the AI is a separate tool that must be connected to other tools, prompted correctly for each use case, and maintained as another system in an already complex stack — new employees face an additional learning curve layered on top of an already demanding onboarding.

Worse, AI tools in fragmented environments often underperform for new employees specifically, because the AI lacks the organizational context that makes it useful. An AI that can’t access the CRM, the project management system, and the communication history simultaneously can’t help a new employee understand the state of a customer relationship or a project the way an integrated AI can.

The onboarding math changes when the AI has full context. The time-to-independent-operation shortens not because the tools are simpler but because the AI can answer the questions that would otherwise require a colleague’s time.

The Annual Cost of Not Fixing This

Take your current annual hiring volume. Multiply by the average fully-loaded salary of new hires. Apply a ramp efficiency factor based on your estimated time-to-independence. Add colleague time costs.

That number is the annual cost of your current fragmentation — not the total cost of fragmentation, but the portion attributable to onboarding alone.

For most mid-market companies hiring 20 to 50 people per year, this calculation produces a number large enough to justify a serious consolidation investment. The tools exist. The math usually makes the case clearly once someone does it.

The question is whether anyone has been asked to do it.

I built a self-hosted log search tool for my team

The backstory

Some time ago I adopted Quickwit at my company. For anyone who hasn’t used it: Quickwit is a search engine that runs full-text search directly on object storage (S3 or anything S3-compatible). It decouples compute from storage, so you don’t pay to keep big indexes warm to search older data. That model fits logs well.

It worked well for us, but there was a gap. Quickwit is excellent at the search engine part and leaves the rest to you by design: no end-user experience around it, and little access control. We were missing what a team needs day to day, like a usable UI, authentication, and gated ingest.

I started building that layer myself. It turned into Rootprint.

What it is

Besides the basics you’d expect for working with logs (controlling the view, seeing context around a log line, a filters panel, a histogram, severity-aware views), it adds:

  • User management with Google and GitHub authentication
  • Authenticated endpoints to ingest data
  • Cluster stats so you can see what’s going on
  • A user activity overview
  • Control over sources, and more

It runs on your own infrastructure, and it’s Apache-2.0 licensed.

The stack

I wanted it light and fast, so:

  • Backend: Hono running on Bun
  • Frontend: Svelte 5 + SvelteKit, with Tailwind and DaisyUI

How it plugs in

Rootprint connects to any Quickwit instance through an environment variable. One caveat: it needs Quickwit 0.9+.

To get started, grab the Docker Compose file and run it:

curl -o docker-compose.yml https://docs.rootprint.io/files/docker-compose.full.yaml
docker compose up -d

The docs cover the rest of the installation options: docs.rootprint.io/install/docker-compose.

Where it’s going

I want to build a platform for logs and traces that holds up against anything else out there. Right now I’m focused on the log search experience; traces are on the roadmap but not built yet.

It’s still early and pre-1.0, so expect breaking changes between releases.

I’d love your feedback

If this sounds useful, I’d appreciate you trying it out and telling me what’s broken, confusing, or missing. Feedback and contributors are welcome.

  • Source code: github.com/rootprint/rootprint
  • Docs: docs.rootprint.io

Thanks for reading.

Rider 2026.2 EAP 5: Code Quality Checks for Your AI Agents, and More.

Rider 2026.2 EAP 5 is now available, bringing a faster startup flow with the new non-modal Welcome screen and quality-check hooks for AI agents.

If you’re catching up on the 2026.2 EAP cycle, be sure to check out the blog posts we’ve already published about other updates unveiled so far, including WPF Hot Reload, the finding-tests skill for AI-assisted test generation, and the earlier EAP builds.

Download Rider 2026.2 EAP 5

Quality-check hooks for Claude Code and Codex

Rider 2026.2 EAP 5 introduces bundled quality-check hooks for external AI agents, starting with Claude Code and Codex. In agent workflows, a hook is an automated step that runs at a specific point in the agent’s process. Here, Rider uses a PostToolUse hook: after an agent edits a file, Rider automatically runs IDE-level validation before the agent continues.

This means agent-generated code is no longer just accepted as-is. These checks can detect code issues identified by Rider’s built-in analysis and inspections, as well as formatting inconsistencies.

[Video: agent-generated code with and without Rider’s quality-check feedback. Rider hooks catch potential errors and redirect the agent.]

Errors can block the agent from treating the task as complete, while warnings and other findings are returned as feedback the agent can use to fix its own output. The result is a tighter AI-assisted development loop where the IDE, not the agent, sets the quality bar.

Easier access to Explain with AI

The Explain with AI action is now easier to discover when you need it most: while dealing with build errors and runtime exceptions. Instead of copying diagnostics into chat or manually describing what went wrong, you can trigger an AI explanation directly from the place where Rider surfaces the problem.

For .NET developers, this is especially useful because build output often combines Roslyn diagnostics, analyzer warnings, MSBuild issues, NuGet restore problems, and multi-targeting failures. Explain with AI helps turn noisy or context-dependent errors into a clearer explanation with likely causes and next steps, so you can move from failure to fix faster.

Share your thoughts

That’s it for Rider 2026.2 EAP 5. Download the latest EAP build, try the new features for AI-assisted development, and let us know how they work in your projects.

SoloEngine: How to Let AI Run Every Industry

As someone with three years of experience in large language model algorithms, agent development, and knowledge base construction, I’ve recently had a thought: Vibe Coding has emerged in the programming industry simply because programmers know how to write code. Other industries don’t have Cursor or Claude Code, not because they lack the need for Agentic AI, but because they don’t use LangChain or CrewAI. I wanted to build a tool that lowers the barrier to Agentic AI development to the same simplicity as workflow tools like Dify. Thus, SoloEngine was born.

SoloEngine, as the first low‑code Agentic AI development platform, fully encapsulates mechanisms such as ReAct, Tool, MCP, Skill, and SubAgent into backend services. When using it, you simply drag an agent onto the canvas, connect collaboration relationships, configure the required tools, and click Run. The backend then automatically compiles everything into your very own Claude Code — planning, execution, and delivery are all autonomously completed by the agent.

Comparison: SoloEngine vs Other Solutions

Feature Dify, n8n, Zapier LangChain, CrewAI, LangGraph SoloEngine
Agentic AI ✗ Scripted workflows only ✓ ReAct / Multi‑Agent ✓ ReAct / Multi‑Agent
No coding required ✗ Python mandatory
Visual orchestration Partial support ✓ Full canvas experience
Domain experts can build independently ✓ (but workflows are not truly Agentic)
Multi‑agent collaboration

Core Design

For compilation efficiency, all agent nodes adopt a unified ReAct architecture. The platform parses superior‑subordinate relationships through topology, enabling connections and SubAgent calls. The visual design on the canvas is directly compiled into an executable agent team.

At runtime, each agent employs progressive disclosure, loading only the MCPs and Skills it needs on demand — token consumption can be reduced by over 85%.

On the model side, SoloEngine covers commonly used AI models such as OpenAI, Anthropic, Ollama, DeepSeek, Qwen, and Zhipu — a unified interface for seamless switching.

Release Updates

After more than a dozen development iterations, the v0.2 file change tracking and rollback mechanism has been released and is relatively stable. An official release build will be available soon. v0.3’s one‑click deployment feature for Agentic AI is in its final stages; it will allow compiled agent teams to be packaged as standalone products for self‑deployment or for distribution and sale. Meanwhile, long‑term memory and autonomous evolution are also on the roadmap.

Quick Start

git clone https://github.com/Sh4r1ock/SoloEngine.git
cd SoloEngine

# Backend (Python 3.11+)
cd backend
pip install -r requirements.txt
python main.py

# Frontend (Node.js 18+) — run in another terminal
cd frontend 
npm install
npm run dev

Open http://localhost:8991 to build your first agent team.

Get Involved

The project is currently in a phase of rapid iteration. More participants are welcome to help AI drive every industry. We hope that in the future, AI will evolve from Vibe Coding into Vibe Everything.

Project repository: https://github.com/Sh4r1ock/SoloEngine

TLS Fingerprinting: How JA3 and JA4 Identify You Before You Send a Byte

Encryption hides the contents of your HTTPS connection — but the negotiation that sets up that encryption happens in the clear. The very first message your client sends, before a single byte of application data, has a distinctive shape. JA3 and JA4 turn that shape into a fingerprint that can identify your software, and sometimes route, throttle, or block you on the spot.

Every HTTPS connection starts with a TLS handshake, and the handshake starts with a message called the ClientHello. It is sent unencrypted, because the two sides have not yet agreed on a key. Inside it, your client announces everything it is willing to do: which TLS versions it supports, which cipher suites it prefers and in what order, which extensions it understands, which elliptic curves and signature algorithms it offers.

None of that is secret. None of it has to be. But taken together, the exact set and ordering of those parameters is remarkably specific to a particular piece of software at a particular version. Chrome 124 produces a different ClientHello from Firefox, which produces a different one from Python’s requests library, which differs from Go’s standard library, which differs from a curl built against a specific OpenSSL version. TLS fingerprinting is the practice of hashing that ClientHello into a short, stable identifier and looking it up.

What Goes Into the Fingerprint

The original technique, JA3, was published by three engineers at Salesforce in 2017 — John Althouse, Jeff Atkinson, and Josh Atkins, whose initials gave it the name. JA3 builds a string from five fields of the ClientHello, in order:

  • The TLS version offered
  • The list of cipher suites
  • The list of extensions
  • The list of supported elliptic curves (named groups)
  • The list of elliptic-curve point formats

Each field is rendered as its numeric values joined by hyphens, the fields are joined by commas, and the whole string is hashed with MD5 to produce a 32-character fingerprint. A companion technique, JA3S, does the same for the server’s ServerHello, so you can fingerprint both ends of a conversation. Pairing a client JA3 with a server JA3S is a common way to identify specific malware command-and-control channels, because the malware and its server both produce consistent, unusual hashes.
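The construction is short enough to sketch directly. The ClientHelloFields interface below is a hypothetical container for already-parsed values, not part of any library, and the sample values are illustrative rather than captured from a real client. Parsing the handshake bytes is out of scope here.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical container for the five ClientHello fields JA3 reads (decimal values).
interface ClientHelloFields {
  tlsVersion: number;
  cipherSuites: number[];
  extensions: number[];
  namedGroups: number[];
  pointFormats: number[];
}

function ja3String(hello: ClientHelloFields): string {
  const lists = [
    hello.cipherSuites,
    hello.extensions,
    hello.namedGroups,
    hello.pointFormats,
  ];
  // Values within a field are joined by hyphens, fields by commas.
  return [String(hello.tlsVersion), ...lists.map((list) => list.join('-'))].join(',');
}

function ja3Hash(hello: ClientHelloFields): string {
  // MD5 yields 16 bytes, printed as 32 hex characters.
  return createHash('md5').update(ja3String(hello)).digest('hex');
}

// Illustrative values only.
const sample: ClientHelloFields = {
  tlsVersion: 771, // 0x0303
  cipherSuites: [4865, 4866, 49195],
  extensions: [0, 10, 11, 13],
  namedGroups: [29, 23, 24],
  pointFormats: [0],
};

console.log(ja3String(sample)); // 771,4865-4866-49195,0-10-11-13,29-23-24,0
console.log(ja3Hash(sample).length); // 32
```

Swapping two cipher suites in the sample changes the string and therefore the hash, which is the ordering sensitivity described next.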

Why ordering matters: Two clients can support the exact same cipher suites and still fingerprint differently, because they offer them in a different preference order. That ordering is baked into the TLS library and rarely changes between builds — which is exactly what makes it a stable signal.

Why JA3 Started to Break

JA3 worked well for years, but two developments eroded it. The first was GREASE (RFC 8701), a mechanism Google introduced to keep the TLS ecosystem flexible. GREASE makes clients insert random reserved values into their cipher and extension lists, so that middleboxes don’t hard-code assumptions about what they see. The side effect is that a naive JA3 implementation produces a different hash on every connection unless it explicitly strips the GREASE values out.
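Stripping GREASE is a simple filter once you know the pattern: RFC 8701 reserves the sixteen values 0x0A0A, 0x1A1A, and so on up to 0xFAFA. A minimal sketch (the function names are illustrative):

```typescript
// GREASE values from RFC 8701: both bytes are equal and the low nibble is 0xA.
function isGrease(value: number): boolean {
  const high = (value >> 8) & 0xff;
  const low = value & 0xff;
  return high === low && (low & 0x0f) === 0x0a;
}

function stripGrease(values: number[]): number[] {
  return values.filter((value) => !isGrease(value));
}

// Two connections from the same client, each with a different random GREASE value.
const firstConnection = [0x0a0a, 0x1301, 0x1302];
const secondConnection = [0xdada, 0x1301, 0x1302];

console.log(stripGrease(firstConnection).join('-')); // 4865-4866
console.log(stripGrease(secondConnection).join('-')); // 4865-4866
```

With the reserved values removed, both connections contribute the same cipher list to the fingerprint string.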

The second was TLS 1.3 and the rise of extension shuffling. Chrome began randomizing the order of some ClientHello extensions on each connection specifically to discourage fingerprinting and ossification. Against a technique that depends on extension ordering, that is fatal: the same browser now yields many different JA3 hashes.

JA4: The Redesign

In 2023, John Althouse — one of the original JA3 authors, now at FoxIO — released JA4, the centerpiece of a broader suite called JA4+ that fingerprints not just TLS but HTTP, TCP, SSH, and more. JA4 was designed to survive the things that broke JA3.

The biggest structural change is that JA4 is partly human-readable. Instead of one opaque MD5, a JA4 fingerprint is divided into sections you can read at a glance:

  • A prefix describing the transport and TLS version, whether SNI is present, the count of cipher suites, the count of extensions, and the first ALPN value — for example, whether the client is speaking HTTP/2 or HTTP/1.1
  • A truncated hash of the cipher suites, sorted numerically so that order-shuffling no longer changes the result
  • A truncated hash of the extensions and signature algorithms, also handled so that cosmetic reordering doesn’t matter

GREASE values are stripped by definition. Because the cipher and extension lists are sorted before hashing, Chrome’s randomization no longer produces a moving target. The result is a fingerprint that is both more stable than JA3 and more informative, because a human analyst can read meaningful structure out of the prefix without consulting a lookup table.
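A simplified sketch shows how the sections fit together and why sorting makes the result order-independent. It follows my reading of the public JA4 description and is not the FoxIO reference implementation: it assumes TCP transport, GREASE already removed, SHA-256 truncated to 12 hex characters, and SNI and ALPN excluded from the extension hash. The Ja4Input interface and ja4Sketch function are hypothetical names.

```typescript
import { createHash } from 'node:crypto';

// Hypothetical pre-parsed input; GREASE values are assumed to be removed already.
interface Ja4Input {
  tlsVersion: '12' | '13'; // two-character version label
  hasSni: boolean;
  alpn: string; // first ALPN value, e.g. 'h2'; empty if none
  cipherSuites: number[];
  extensions: number[];
  signatureAlgorithms: number[];
}

const hex4 = (value: number): string => value.toString(16).padStart(4, '0');
const count2 = (n: number): string => String(Math.min(n, 99)).padStart(2, '0');
const hash12 = (text: string): string =>
  createHash('sha256').update(text).digest('hex').slice(0, 12);

function ja4Sketch(input: Ja4Input): string {
  const alpnTag = input.alpn ? input.alpn[0] + input.alpn[input.alpn.length - 1] : '00';
  // Readable prefix: transport, version, SNI flag, counts, ALPN tag.
  const prefix =
    't' + input.tlsVersion + (input.hasSni ? 'd' : 'i') +
    count2(input.cipherSuites.length) + count2(input.extensions.length) + alpnTag;

  // Sorting before hashing makes the result independent of offer order.
  const ciphers = input.cipherSuites.map(hex4).sort().join(',');
  const extensions = input.extensions
    .filter((ext) => ext !== 0x0000 && ext !== 0x0010) // skip SNI and ALPN
    .map(hex4)
    .sort()
    .join(',');
  const sigAlgs = input.signatureAlgorithms.map(hex4).join(',');

  return [prefix, hash12(ciphers), hash12(`${extensions}_${sigAlgs}`)].join('_');
}

const base: Ja4Input = {
  tlsVersion: '13',
  hasSni: true,
  alpn: 'h2',
  cipherSuites: [0x1301, 0x1302],
  extensions: [0x0000, 0x000a, 0x0010, 0x002b],
  signatureAlgorithms: [0x0403, 0x0804],
};
const shuffled: Ja4Input = { ...base, extensions: [0x002b, 0x0010, 0x0000, 0x000a] };

console.log(ja4Sketch(base).startsWith('t13d0204h2_')); // true
console.log(ja4Sketch(base) === ja4Sketch(shuffled)); // true
```

Shuffling the extension list leaves the output unchanged, which is the behavior that defeats per-connection randomization.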

| Property | JA3 (2017) | JA4 (2023) |
| --- | --- | --- |
| Output | Single MD5 hash | Structured, partly human-readable |
| Handles GREASE | Only if implementation strips it | Yes, by design |
| Survives extension shuffling | No, order-dependent | Yes, lists are sorted |
| Scope | TLS ClientHello / ServerHello | TLS, HTTP, TCP, SSH and more (JA4+) |

Who Uses This, and For What

TLS fingerprinting is genuinely dual-use. On the defensive side, it is one of the more useful tools a network operator has. A client that claims to be Chrome in its User-Agent header but whose ClientHello matches Python’s requests is almost certainly a bot lying about itself. Security teams use JA3/JA4 to spot malware beaconing, to cluster automated traffic, and to flag scrapers that don’t match any real browser. Because the fingerprint is computed from bytes the client cannot easily fake without rebuilding its TLS stack, it is harder to spoof than a header.

That same strength is what makes it a censorship and tracking tool. A national firewall or a corporate middlebox can fingerprint every outbound connection and treat traffic differently based on what software produced it — throttling or blocking a circumvention tool whose handshake doesn’t look like a mainstream browser, even though it cannot read the encrypted payload. Anti-bot vendors and CDNs fingerprint connections to decide who gets served and who gets a challenge. The fingerprint becomes a passive selector applied before you have proven anything about who you are.

The encryption is doing its job perfectly. The leak is in the envelope, not the letter — and the envelope is, by necessity, written in the clear.

Can You Defend Against It?

Not cleanly, and that is the uncomfortable part. Because the fingerprint is derived from how your TLS library behaves, the only thorough defense is to make your traffic produce a common, unremarkable fingerprint — to look like everyone else. Circumvention tools increasingly do exactly this through uTLS, a Go library that lets a client mimic the precise ClientHello of a mainstream browser, GREASE and ordering included, so its JA3/JA4 blends into the crowd.

For an ordinary user, the practical reality is simpler: using a current, mainstream browser is itself a form of crowd-blending, because millions of others produce a near-identical handshake. The danger zone is unusual software — a custom client, an old library, a niche tool — that produces a rare fingerprint precisely because few others share it. This is the same logic that governs browser fingerprinting at the application layer: distinctiveness is the vulnerability, and the anonymity set is the defense.

The Broader Lesson

TLS fingerprinting is a clean illustration of a pattern that runs through nearly all privacy engineering: encrypting the contents of a channel does not hide the channel’s metadata, and the metadata is often enough. The handshake has to be in the clear so two strangers can agree on a key. The shape of that handshake leaks the identity of the software making it. No amount of payload encryption closes that gap, because the gap exists before encryption begins.

The honest takeaway is not that TLS is broken — it isn’t — but that “the connection is encrypted” answers a narrower question than most people think. Knowing what your tools reveal in the clear, and choosing tools whose visible behavior is common rather than distinctive, is the part of the threat model that fingerprinting forces you to take seriously.

Originally published at havenmessenger.com

RAG with Postgres pgvector in 2026: the full TypeScript pipeline.

I spent a week evaluating dedicated vector databases before deciding to just use the Postgres instance I already had. The pgvector extension handles similarity search well enough for most production workloads, and it collapses three infrastructure components into one. This walkthrough covers everything from schema to answer: chunk your docs, embed them, store in pgvector, retrieve by cosine similarity, and wire the results into an LLM call.

TL;DR

| Step | Tool | Why |
| --- | --- | --- |
| Enable vector store | pgvector 0.8.x, HNSW index | Runs in your existing Postgres, no extra infra |
| Embed | text-embedding-3-small (1,536 dims) | $0.02 per million tokens, fast |
| Query | <=> cosine distance, top-k | Works with both OpenAI and Voyage models |
| Augment | Claude or GPT-4o with retrieved docs | Context window stuffed, hallucination rate drops |

1. Why pgvector instead of a dedicated vector database

Pinecone and Weaviate are good products. If you need multi-tenant isolation, sub-millisecond p99 at 100M+ vectors, or native hybrid search with BM25, they earn their place. For most teams, those are future problems.

The cost calculus changes when you consider ops burden. A dedicated vector DB means a new billing line, a new set of credentials to rotate, a new failure mode to track, and a new SDK to keep current in your application. pgvector runs as a Postgres extension: one connection string, one backup strategy, one source of truth. At 10M documents with 1,536-dimensional embeddings, an HNSW index on a reasonably sized Postgres instance returns top-10 results in under 10ms. That covers the overwhelming share of RAG use cases.

pgvector 0.8.0 added iterative HNSW scans. That release made filtered similarity search practical without falling back to sequential scans every time a WHERE clause got specific. The 0.8.0 release was what tipped my team from “maybe later” to “ship it.”

2. Schema setup

Enable the extension once per database, then create your table.

-- enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;

-- documents table
CREATE TABLE documents (
  id         BIGSERIAL PRIMARY KEY,
  source     TEXT NOT NULL,          -- filename, URL, or ID of source doc
  chunk_idx  INT NOT NULL,           -- chunk number within the source
  content    TEXT NOT NULL,          -- raw text of the chunk
  embedding  vector(1536) NOT NULL,  -- OpenAI text-embedding-3-small
  created_at TIMESTAMPTZ DEFAULT NOW()
);

Choosing between HNSW and IVFFlat

HNSW builds a navigable small-world graph. Queries scan the graph instead of comparing all rows. Build once, query immediately. The tradeoff is that the index takes more memory: roughly 8 bytes per dimension per row for a 1,536-dim column at default settings.

IVFFlat partitions the embedding space into centroid clusters. Faster to build, smaller memory footprint, but you must load rows before building the index or the centroid assignment is useless. If you are starting from zero rows, build HNSW.

-- HNSW index (recommended default)
-- m = connections per layer (default 16), higher = better recall at higher memory cost
-- ef_construction = candidate list during build (default 64), higher = better recall at slower build
CREATE INDEX ON documents
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- IVFFlat alternative (only after loading rows)
-- lists = sqrt(row_count) is a good starting point for large tables
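-- use the same operator class as your queries (vector_cosine_ops for <=>), or the index is not used
-- CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);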

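Use vector_cosine_ops with the <=> operator for cosine distance; it works whether or not the vectors are normalized. Use vector_l2_ops with <-> for raw Euclidean distance. Use vector_ip_ops with <#> for inner product when your embedding model returns unit-length vectors (OpenAI and Voyage both do): on normalized vectors the inner product equals cosine similarity, and it saves one normalization step. Note that <#> returns the negative inner product, so an ascending ORDER BY still puts the best match first.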
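
To make the relationship between the three operators concrete, here is a small sketch that reimplements them in plain TypeScript. It is for intuition only and is not how pgvector computes distances internally; the function names are my own.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Scale a vector to unit length.
function normalize(a: number[]): number[] {
  const n = norm(a);
  return a.map((x) => x / n);
}

// Equivalent of <=> : cosine distance, 0 = same direction, 2 = opposite.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (norm(a) * norm(b));
}

// Equivalent of <-> : Euclidean (L2) distance.
function l2Distance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0));
}

// Equivalent of <#> : negative inner product, so smaller still means closer.
function negativeInnerProduct(a: number[], b: number[]): number {
  return -dot(a, b);
}

// On unit-length vectors the three metrics rank results identically:
//   cosineDistance = 1 + negativeInnerProduct
//   l2Distance^2   = 2 * cosineDistance
const a = normalize([3, 4]);
const b = normalize([4, 3]);
const metrics = {
  cosine: cosineDistance(a, b),
  negativeInner: negativeInnerProduct(a, b),
  l2: l2Distance(a, b),
};
```

Because the rankings agree on normalized vectors, the choice between the operator classes is about index and compute cost, not result quality.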

3. Ingest pipeline in TypeScript

The ingest function chunks a document, calls the embedding API, and bulk inserts rows. Use postgres (the npm package, not pg) for its tagged-template SQL and native array support.

import postgres from "postgres";
import OpenAI from "openai";

const sql = postgres(process.env.DATABASE_URL!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });

const CHUNK_SIZE = 512;   // words per chunk (this naive chunker counts words, not tokens)
const CHUNK_OVERLAP = 64; // words of overlap between adjacent chunks

function chunkText(text: string, size: number, overlap: number): string[] {
  // naive word-boundary chunker; swap for tiktoken in production
  const words = text.split(/\s+/).filter(Boolean); // drop empty strings from leading/trailing whitespace
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + size, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end === words.length) break; // last chunk reached; avoids a trailing chunk made only of overlap
    start += size - overlap;
  }
  return chunks;
}

async function embedBatch(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });
  return response.data.map((d) => d.embedding);
}

export async function ingestDocument(source: string, text: string): Promise<void> {
  const chunks = chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP);

  // embed in batches of 100, a conservative size well under OpenAI's per-request input limit
  const BATCH = 100;
  for (let i = 0; i < chunks.length; i += BATCH) {
    const batch = chunks.slice(i, i + BATCH);
    const embeddings = await embedBatch(batch);

    const rows = batch.map((content, j) => ({
      source,
      chunk_idx: i + j,
      content,
      embedding: JSON.stringify(embeddings[j]),
    }));

    await sql`
      INSERT INTO documents (source, chunk_idx, content, embedding)
      SELECT
        r.source,
        r.chunk_idx::int,
        r.content,
        r.embedding::vector
      FROM jsonb_to_recordset(${JSON.stringify(rows)}::jsonb)
        AS r(source text, chunk_idx text, content text, embedding text)
    `;
  }

  console.log(`[ingest] ${source}: ${chunks.length} chunks stored`);
}

A note on chunk size: 512 words is a starting point. The right size depends on your source material. Legal documents with dense paragraphs do better at 256 words. Code files need at least 300 lines or you lose function context. The overlap prevents the embedding from missing a sentence that straddles a chunk boundary.

4. Query pipeline in TypeScript

Embed the user’s question, run a top-k cosine similarity search, return the matching chunks.

export async function queryDocuments(
  question: string,
  topK = 5,
): Promise<Array<{ source: string; content: string; distance: number }>> {
  // embed the question with the same model used at ingest time
  const [embedding] = await embedBatch([question]);
  const embeddingStr = JSON.stringify(embedding);

  const rows = await sql<{ source: string; content: string; distance: number }[]>`
    SELECT
      source,
      content,
      (embedding <=> ${embeddingStr}::vector) AS distance
    FROM documents
    ORDER BY embedding <=> ${embeddingStr}::vector
    LIMIT ${topK}
  `;

  return rows;
}

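The <=> operator returns cosine distance (0 = identical, 2 = opposite). Lower numbers win. If you add metadata filters, put them in the WHERE clause of the same query so the planner can combine them with the index scan. The HNSW iterative scan introduced in 0.8.0 is off by default: enable it per session with SET hnsw.iterative_scan = strict_order (or relaxed_order) so that a selective filter keeps walking the index until it has enough rows, instead of returning fewer than LIMIT.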

// filtered query example; assumes embeddingStr, filterSource, and topK are in scope,
// and that the rows for filterSource were embedded with the same model as the question
const rows = await sql<{ source: string; content: string; distance: number }[]>`
  SELECT source, content, (embedding <=> ${embeddingStr}::vector) AS distance
  FROM documents
  WHERE source = ${filterSource}
  ORDER BY embedding <=> ${embeddingStr}::vector
  LIMIT ${topK}
`;

5. Wiring retrieved docs into an LLM call

Concatenate the retrieved chunks into a context block, then call your model of choice. Claude Sonnet and GPT-4o both handle long contexts well. Keep the context block under 80,000 tokens for cost reasons.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });

export async function answerWithRAG(question: string): Promise<string> {
  const docs = await queryDocuments(question, 5);

  if (docs.length === 0) {
    return "No relevant documents found.";
  }

  const context = docs
    .map((d, i) => `[${i + 1}] (${d.source})\n${d.content}`)
    .join("\n\n---\n\n");

  const prompt = `You are a helpful assistant. Answer the question using only the provided context.
If the context does not contain the answer, say so.

Context:
${context}

Question: ${question}`;

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-5-20250929", // verify against Anthropic's current model list
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}

The “answer using only the provided context” instruction is load-bearing. Without it, the model mixes retrieval with parametric memory and you cannot tell which is which. If the answer comes from the context, citations work. If it comes from training data, they do not. Force the distinction at the prompt level.
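
The 80,000-token ceiling on the context block mentioned earlier can be enforced before the prompt is assembled. The sketch below is one way to do it under a loud assumption: it estimates tokens at roughly four characters each, which is a rule of thumb for English text and not a real tokenizer. `fitToBudget` and `estimateTokens` are illustrative names.

```typescript
interface RetrievedChunk {
  source: string;
  content: string;
  distance: number;
}

// ASSUMPTION: about 4 characters per token. Swap in a real tokenizer
// when the budget needs to be exact.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps chunks in their incoming order (closest first) until the next one
// would push the estimated total over the budget.
function fitToBudget(chunks: RetrievedChunk[], maxTokens: number): RetrievedChunk[] {
  const kept: RetrievedChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Calling `fitToBudget(docs, 80_000)` on the retrieval results before building the context string keeps cost bounded even if `topK` is raised later.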

One more thing worth noting: rerank before you send to the LLM. A fast cosine search returns the 5 closest chunks by vector distance, but distance does not always equal usefulness. A cross-encoder reranker (Cohere Rerank costs about $1 per 1,000 queries) takes your top-20 candidates and scores them for actual relevance before you trim to 5. The quality jump is noticeable. Skip the reranker while prototyping, add it before you hit production.
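
The retrieve-wide-then-trim shape of reranking can be sketched without committing to a vendor. In the example below the scorer is a pluggable function; `wordOverlap` is a toy stand-in I made up so the example runs on its own, not a cross-encoder and not Cohere's API. A hosted reranker would be an async call in the same position.

```typescript
interface Candidate {
  source: string;
  content: string;
  distance: number;
}

// Any function that scores how well a passage answers a question; higher is better.
type RelevanceScorer = (question: string, passage: string) => number;

// Scores every candidate, sorts by relevance, and keeps the best `keep`.
function rerank(
  question: string,
  candidates: Candidate[],
  score: RelevanceScorer,
  keep = 5,
): Candidate[] {
  return candidates
    .map((candidate) => ({ candidate, relevance: score(question, candidate.content) }))
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, keep)
    .map((scored) => scored.candidate);
}

// Toy scorer for illustration only: counts question words that appear in the passage.
const wordOverlap: RelevanceScorer = (question, passage) => {
  const passageWords = new Set(passage.toLowerCase().split(/\W+/));
  return question
    .toLowerCase()
    .split(/\W+/)
    .filter((word) => word.length > 0 && passageWords.has(word)).length;
};
```

In the pipeline above that would mean calling `queryDocuments(question, 20)` and passing the result through `rerank(..., 5)` before building the context.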

6. Two gotchas that bite everyone

Chunk size drives recall more than index parameters

Most teams spend hours tuning HNSW m and ef_construction and see marginal gains. The actual lever is chunk size and overlap. A chunk that is too short loses context (the model cannot answer a cross-sentence question). A chunk that is too long pulls in noise, dilutes the embedding, and wastes context window in the LLM call. Run a quick eval: take 20 representative questions, retrieve top-5, then manually score whether the answer appeared in the returned chunks. Adjust chunk size in 100-word steps until recall tops 85%. Then tune the index.
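
The quick eval described above boils down to a recall number. A minimal sketch, with one substitution named plainly: instead of manual scoring it counts a hit when an expected snippet appears verbatim in any retrieved chunk, which is cruder than a human judgment but easy to rerun after each chunk-size change. The `EvalCase` shape is hypothetical.

```typescript
interface EvalCase {
  question: string;
  expectedSnippet: string; // text that must appear in a retrieved chunk to count as a hit
  retrieved: string[];     // top-k chunk contents returned for this question
}

// Fraction of questions whose expected snippet shows up in the retrieved chunks.
function recallAtK(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrieved.some((chunk) => chunk.includes(c.expectedSnippet)),
  ).length;
  return hits / cases.length;
}
```

Fill `retrieved` from `queryDocuments(question, 5)` for each of the 20 questions, then compare `recallAtK` across chunk sizes against the 85% target.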

Build the index after bulk loading, not before

HNSW indexing at insert time is slow. If you load 500,000 documents and the HNSW index exists, every INSERT pays the graph update cost. The fast path: load all rows with the index dropped, then build it once with CREATE INDEX. On a table of 500,000 rows with 1,536-dim embeddings, a cold HNSW build takes roughly 8 to 12 minutes on 4 vCPUs. That is far cheaper than the cumulative insert overhead.

-- drop the index before bulk load
DROP INDEX IF EXISTS documents_embedding_idx;

-- ... run your ingest pipeline ...

-- rebuild once after load
CREATE INDEX documents_embedding_idx
  ON documents USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

The bottom line

The full pipeline is about 120 lines of TypeScript and three SQL statements. pgvector 0.8.x is stable enough for production, HNSW is the right default index for most teams, and the two things that matter most for answer quality are chunk size and staying consistent between embed-at-ingest and embed-at-query time (same model, same preprocessing). Dedicated vector DBs are not wrong, they are just a layer you do not need until your row count passes 50M or your recall requirements get strict enough to warrant a tuning team.

What chunk size worked best for your use case? Drop it in the comments.

GDS K S · thegdsks.com · follow on X @thegdsks

Good retrieval beats a better model every time.

Same code, three clocks — letting a quant agent trade on its own without losing the audit trail


In the last post I argued that an LLM should never hold the approval token on a trade. A human approves. The model only proposes. That works as long as a human is in the loop on every order.

Then a user does the obvious thing. They take a strategy the agent wrote, like the backtest, and say “put it on the paper account.”

They expect it to trade: follow the market in, follow it out, update positions while they sleep.

The honest truth at that point: status = 'promoted' was a database flag. Nobody was ticking the strategy’s on_bar. The account didn’t move. That gap was the whole feature.

Closing it means the machine now places orders on live bars with no human clicking approve each time. Which sounds like exactly the thing the last post said not to do.

This post is how you close the gap without throwing away the audit trail. And the four places the trust boundary has to be redesigned the moment no human is in the chair.

The easy half: same code, three clocks

Inalpha holds one invariant tight: the Python file you backtest is the file you paper-trade. No fork for production. You swap two things underneath the strategy — the Clock and the Gateway — and the business logic doesn’t move.

The invariant itself isn’t rare. What’s rare is the thing standing on top of it here.

The author of that file is an LLM. It was vetted by a human. And it’s now running itself on live bars.

Quant engines hold the invariant, but don’t assume an agent wrote the strategy. Agent frameworks assume the LLM, but have nowhere to put a trading harness. Inalpha sits in that seam. And the same-code invariant is exactly what makes the audit chain mean anything: there’s precisely one file to point a signature at.

How it runs — three deployment modes, two clocks, one file:

  • Backtest: a TestClock driven by historical bars; fills simulated against a reference price.
  • Paper (live runner): a LiveClock on real wall-clock time, bars pulled fresh on the strategy’s timeframe, the same matching engine, the order routed out through the real plan/exec path — the only simulated part is that fills are matched locally instead of sent to a broker.
  • Live (real capital): architecturally the same seam — LiveClock, same kernel, same plan/exec path, only the Gateway swapping to a real broker. But real-money trading is deliberately out of scope for this project; holding the invariant isn’t about chasing it. The payoff is narrower and real: backtest and paper are literally one code path, so the audit chain has exactly one file to point a signature at.

So “three clocks” is shorthand: two clock implementations (TestClock / LiveClock), the third mode (real capital) a seam the architecture leaves open but the project doesn’t pursue — and the strategy file never notices which one it’s running under.

The live runner (services/paper/.../live_runner.py) is one long-lived task per running strategy. Each tick it does three things:

  1. pull the latest closed bar;
  2. feed it to a session that reuses the exact backtest kernel, firing the strategy’s on_bar;
  3. intercept the order the strategy emits and hand it to the guarded order path — it does not match locally.

When the fill comes back, it’s replayed into the session. So the strategy’s view of its own position stays consistent with what actually filled.

Why this matters for audit-grade, not just convenience: if your backtest and live code are two different files, no signature chain will tell you which one ran when the $93k order happened. Same code, three clocks is the precondition. It’s also the boring half. Here’s the half that kept me up.

The hard half: who approves the order?

Last post’s thesis was a three-step state machine. The LLM drives step one. A human drives the approval:

trade.create_plan       → plan: pending_approval
trade.approve_plan      → mints a single-use token
trade.execute_plan(tok) → places the order

A runner that trades while you sleep can’t stop and wait for a click on every bar. So the naive fix is to delete the approval step for the automated path. That’s the fix that quietly turns “audit-grade” back into “trust me.”

We did the opposite. The automated path goes through the same plan/exec state machine. The approval is just stamped approved_by = "system:live_runner".

Machine approval. The order still creates a plan. Still mints and consumes a single-use token. Still writes the same signed audit line. Nothing on the order path got a shortcut.
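
As a rough illustration of "same state machine, different approver", here is a toy version of the plan/approve/execute flow. It is written in TypeScript for this post's examples and is not the Inalpha code: the class, field names, and audit line format are hypothetical, the token is not cryptographically secure, and the audit line is not signed.

```typescript
type PlanStatus = "pending_approval" | "approved" | "executed";

interface Plan {
  id: number;
  order: string;
  status: PlanStatus;
  approvedBy?: string;
  token?: string;
}

class PlanBook {
  private readonly plans = new Map<number, Plan>();
  private nextId = 1;
  readonly auditLog: string[] = [];

  // Step 1: anyone (including the model) can propose.
  createPlan(order: string): Plan {
    const plan: Plan = { id: this.nextId++, order, status: "pending_approval" };
    this.plans.set(plan.id, plan);
    return plan;
  }

  // Step 2: approval mints a single-use token and records who approved.
  approvePlan(planId: number, approver: string): string {
    const plan = this.mustGet(planId);
    if (plan.status !== "pending_approval") throw new Error("plan is not pending approval");
    const token = `tok-${planId}-${Math.random().toString(36).slice(2)}`; // illustration only
    plan.status = "approved";
    plan.approvedBy = approver;
    plan.token = token;
    return token;
  }

  // Step 3: execution consumes the token and writes the audit line.
  executePlan(planId: number, token: string): void {
    const plan = this.mustGet(planId);
    if (plan.status !== "approved" || plan.token !== token) {
      throw new Error("invalid or already used token");
    }
    plan.status = "executed";
    delete plan.token;
    this.auditLog.push(`plan=${plan.id} order=${plan.order} approved_by=${plan.approvedBy}`);
  }

  private mustGet(planId: number): Plan {
    const plan = this.plans.get(planId);
    if (plan === undefined) throw new Error(`unknown plan ${planId}`);
    return plan;
  }
}
```

The automated path calls the same three methods a human-driven path would; the only difference is the approver string passed to `approvePlan`, for example `"system:live_runner"`.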

Machine approval is only honest if it’s earned. Ours rests on two human gates upstream, and the agent can’t route around either:

  1. A human promotes the candidate. promote is a deliberate human action, with permission: ask on the agent side. The model can’t self-promote a strategy into the runnable set.
  2. A human starts the run. paper.start_strategy is an explicit call a person makes for a specific market and timeframe.

So the chain reads: a person vetted this strategy, a person chose to run it here. Given those two signatures, having the machine approve each later order on live bars is the expected behavior, not a bypass. The audit line records system:live_runner as the approver for exactly this reason — a replay shows where the human gates were and where the machine took over.

Every order the runner places also writes a decision record (strategy_run_decisions): the bar context, the order intent, and the outcome (filled, rejected, or risk_rejected), cross-referenced to the plan and the trade.

The point of the autonomous path isn’t just that it trades. It’s that the next morning you can read, line by line, every bar where it wanted to act and what the harness did about it.

The trust boundary moves when the human leaves the chair

This isn’t a bug list. It’s four faces of one architectural question.

With a human in the loop, a lot of guarantees are propped up implicitly by “someone is at the screen.” Designing the unattended path means asking that again, on purpose: which of those props has to become something the system holds up on its own?

Four answers.

1. Identity has to become explicit.
When a human starts each run, ownership is implicit — whoever clicked owns it. Automate it, and ownership has to live in the data model, or there’s no boundary at all.

Concretely: the start path checked that a candidate was promoted, not that the caller owned it. So you could run someone else’s strategy on your own account.

The trap in fixing it was real. The candidate’s author_id is only set for UUID identities, while the account id falls back to uuid5 for everyone else. A naive author_id == account_id would lock out every non-UUID user. The fix derives an owner_account_id through the same function as the account id (migration 0013), so ownership is comparable for everyone.

2. Resource bounds are part of the trust boundary, not an ops detail.
A human starting runs self-limits. An API doesn’t. Each run is a long-lived task polling the data service on a timer, and the only limit was one instance per candidate — but a user can promote arbitrarily many. So the boundary grows a per-account cap (default 10) that returns 429, instead of letting one account quietly melt the event loop.
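
A per-account cap of this kind is small enough to sketch. The version below is an illustration in TypeScript, not the project's code: `RunRegistry` is a hypothetical in-memory structure, and the 409 returned for an already-running candidate is my assumption (the post only names the 429).

```typescript
const MAX_RUNS_PER_ACCOUNT = 10; // the default named above

interface StartResult {
  status: number;
  message: string;
}

class RunRegistry {
  private readonly runsByAccount = new Map<string, Set<string>>();
  private readonly maxPerAccount: number;

  constructor(maxPerAccount: number = MAX_RUNS_PER_ACCOUNT) {
    this.maxPerAccount = maxPerAccount;
  }

  start(accountId: string, candidateId: string): StartResult {
    const runs = this.runsByAccount.get(accountId) ?? new Set<string>();
    // Existing limit: one running instance per candidate (status code assumed).
    if (runs.has(candidateId)) {
      return { status: 409, message: "candidate already running" };
    }
    // New limit: cap total runs per account so promotions cannot multiply pollers.
    if (runs.size >= this.maxPerAccount) {
      return { status: 429, message: "per-account run limit reached" };
    }
    runs.add(candidateId);
    this.runsByAccount.set(accountId, runs);
    return { status: 200, message: "started" };
  }
}
```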

3. With no human, the default has to invert.
Fail-open is a default that assumes a backstop. Letting risk checks fail open in dev is fine when a human is at the screen.

The unattended runner is not at a screen. A risk engine that’s disabled or fails to load becomes an autonomous order loop with zero risk checks — the worst possible default. So on this path the default inverts: fail closed. No risk guard, no run, unless you explicitly opt out.
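
The inverted default can be expressed as a single resolution step at run start. This is a sketch under assumed names (`RiskGuard`, `RunnerOptions`, `allowUnguarded` are all hypothetical), meant only to show the shape: a missing guard is an error unless someone explicitly opts out.

```typescript
interface Order {
  symbol: string;
  quantity: number;
}

interface RiskGuard {
  check(order: Order): boolean;
}

interface RunnerOptions {
  riskGuard?: RiskGuard | null; // null or undefined when the engine is disabled or failed to load
  allowUnguarded?: boolean;     // explicit opt-out; defaults to false
}

// Fail closed: refuse to start without a guard unless the caller opted out.
function resolveRiskGuard(options: RunnerOptions): RiskGuard {
  if (options.riskGuard) return options.riskGuard;
  if (options.allowUnguarded === true) {
    return { check: () => true }; // pass-through guard, only reachable by explicit choice
  }
  throw new Error("risk guard unavailable: refusing to start unattended run");
}
```

The fail-open behavior still exists, but it has moved from being the default to being a flag someone has to set and own.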

4. Backtest/live parity has to reach down to data shape.
A human wouldn’t trade a half-formed bar. The machine will, unless the architecture forbids it.

The latest bar each tick is often still forming — its close isn’t final. Acting on it silently diverges from the backtest, which only ever saw closed bars. So the runner decides only on closed bars, matching backtest semantics exactly.
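
Selecting only closed bars is a one-line rule: a bar counts once its full interval has elapsed. A minimal sketch, assuming bars carry their open time in milliseconds; the `Bar` shape and function name are illustrative, not the runner's actual types.

```typescript
interface Bar {
  openTimeMs: number; // when the bar's interval started
  close: number;
}

// Returns the most recent bar whose interval has fully elapsed, or undefined
// if every bar in the list is still forming.
function latestClosedBar(bars: Bar[], timeframeMs: number, nowMs: number): Bar | undefined {
  let latest: Bar | undefined;
  for (const bar of bars) {
    const isClosed = bar.openTimeMs + timeframeMs <= nowMs;
    if (isClosed && (latest === undefined || bar.openTimeMs > latest.openTimeMs)) {
      latest = bar;
    }
  }
  return latest;
}
```

With one-minute bars opened at 0, 60,000 and 120,000 ms and a current time of 150,000 ms, the third bar is still forming, so the function returns the bar opened at 60,000 ms.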

(One implementation detail rides along. The loop treated every exception as retryable with backoff, so a deterministic failure, such as a delisted symbol or a constraint violation, burned the whole retry budget before giving up. It now splits retryable from non-retryable errors and stops immediately on the latter. Plumbing, not architecture.)
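
The retryable/non-retryable split can be sketched as a pure decision function. Everything here is illustrative: the `NonRetryableError` class, the default of five attempts, and the one-second exponential backoff are assumptions, not values taken from the runner.

```typescript
// Errors that will fail the same way every time (delisted symbol, constraint violation).
class NonRetryableError extends Error {}

type RetryDecision =
  | { action: "retry"; delayMs: number }
  | { action: "stop"; reason: string };

// Decides what to do after the given attempt (1-based) failed with `error`.
function decideRetry(
  error: unknown,
  attempt: number,
  maxAttempts = 5,     // assumed retry budget
  baseDelayMs = 1000,  // assumed backoff base
): RetryDecision {
  if (error instanceof NonRetryableError) {
    return { action: "stop", reason: "non-retryable error" };
  }
  if (attempt >= maxAttempts) {
    return { action: "stop", reason: "retry budget exhausted" };
  }
  // Exponential backoff: 1s, 2s, 4s, ... for attempts 1, 2, 3, ...
  return { action: "retry", delayMs: baseDelayMs * 2 ** (attempt - 1) };
}
```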

Some of these I saw clearly only after an adversarial review of the shipped runner. But they aren’t scattered bugs. They’re four corollaries of one sentence: the trust boundary of an autonomous path is not the same boundary as one with a human in the loop.

What this still costs, and what we punted

Honesty section, same as last time.

  • The runner runs candidate code in the main event loop. The backtest path isolates strategy code in a resource-limited subprocess. The live session compiles and runs it inline. The AST audit is a static gate, not a runtime one — it won’t stop a pure-compute infinite loop from hanging the service. We lean on the two human gates to keep the code trusted. Subprocess/watchdog hardening is filed, not done.
  • A crash mid-fill can drift the in-memory position from the DB. The fill is committed to the DB first, then replayed into the session. If the process dies in between, a restart rebuilds the session from empty cash, not from the DB positions. “Resume a run from its real position” is the next robustness item, deliberately not faked in this release.
  • Single-instance only. Startup reconciliation marks every stranded running row as errored. Correct for one process, wrong the moment you run two. Multi-instance leasing is a Phase-F item, flagged in the code where it bites.

I’d rather ship the gates that are load-bearing now and name the ones that aren’t yet, than imply the autonomous path is hardened against things it isn’t.

So what

The cheap version of “let the agent trade for you” deletes the approval step and calls it autonomy.

The audit-grade version keeps the entire order path intact. It stamps the approver as the machine. It earns that stamp with two human gates the model can’t route around. Then it redesigns the trust boundary, so every guarantee the human used to backstop is one the system now holds on its own.

Autonomy isn’t the absence of the harness. It’s the harness running without you in the chair.

If this resonated:

  • 📬 Subscribe to Inalpha on Substack — one long-form post a month, ADRs and post-mortems, no algorithm between us and you
  • github.com/mirror29/inalpha — the live runner, the plan/exec path, and the four boundary changes above are all in services/paper
  • 👉 Next post: Sandboxed strategy evolution — three gates + multi-objective fitness. What happens when you actually let the LLM mutate trading code, and what catches it when it shouldn’t have. (Yes, the one I promised last time — it’s next.)

What an OpenAI-Compatible API Router Should Actually Do

An OpenAI-compatible API router should not make your stack more complicated. If it does, it has already failed.

The whole point of compatibility is boring simplicity:

One base URL.

One API key.

Same general SDK shape.

That gives you room to improve the economics without rewriting the application.

For AI coding workflows, this matters because the tool in front is often already good enough. The pain is underneath: cost, provider management, usage logs, and routing.

The minimum useful setup should look familiar:

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://incat.ai/v1",
  apiKey: process.env.OPENAI_API_KEY,
});

If a router requires a large rewrite before you can test it, most developers will not bother. They are right.

The first test should be small:

  • one workflow
  • one API key
  • one prepaid balance
  • one cost comparison

What should the router do?

Route by task

Send routine work to cheaper capable models. Keep risky work on stronger models.

Preserve logs

Developers need to know which workflow burns money.

Avoid surprise bills

Prepaid credits are useful because they turn runaway usage into a visible constraint.

Keep escape hatches

If a cheaper route is not good enough, switch back. Routing should create options, not lock-in.
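
The four responsibilities above fit in a very small amount of logic. The sketch below is a toy in-memory illustration, not inCat's implementation or API: the class, the two-way task split, and the model names passed in are placeholders I chose for the example.

```typescript
type TaskKind = "routine" | "risky";

interface UsageEntry {
  workflow: string;
  model: string;
  costUsd: number;
}

class RouterSketch {
  readonly log: UsageEntry[] = [];
  private balanceUsd: number;
  private forceStrong = false;
  private readonly models: Record<TaskKind, string>;

  constructor(prepaidUsd: number, models: Record<TaskKind, string>) {
    this.balanceUsd = prepaidUsd;
    this.models = models;
  }

  // Escape hatch: send everything to the stronger model again.
  pinStrongModel(enabled: boolean): void {
    this.forceStrong = enabled;
  }

  // Route by task: routine work goes to the cheaper model.
  route(task: TaskKind): string {
    return this.forceStrong ? this.models.risky : this.models[task];
  }

  // Prepaid balance turns runaway usage into a visible failure, and every
  // call is logged against the workflow that made it.
  record(workflow: string, model: string, costUsd: number): void {
    if (costUsd > this.balanceUsd) throw new Error("prepaid balance exhausted");
    this.balanceUsd -= costUsd;
    this.log.push({ workflow, model, costUsd });
  }

  // Answers "which workflow burns money".
  costByWorkflow(): Map<string, number> {
    const totals = new Map<string, number>();
    for (const entry of this.log) {
      totals.set(entry.workflow, (totals.get(entry.workflow) ?? 0) + entry.costUsd);
    }
    return totals;
  }
}
```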

That is the category I want inCat to live in.

Not another AI coding app.

Not a model museum.

An OpenAI-compatible API router for developers who want the same workflow to cost less.

Generate a config:

https://incat.ai/codex-config-generator.html