RAG with Postgres pgvector in 2026: the full TypeScript pipeline.
I spent a week evaluating dedicated vector databases before deciding to just use the Postgres instance I already had. The pgvector extension handles similarity search well enough for most production workloads, and it collapses three infrastructure components into one. This walkthrough covers everything from schema to answer: chunk your docs, embed them, store in pgvector, retrieve by cosine similarity, and wire the results into an LLM call.
TL;DR
| Step | Tool | Why |
|---|---|---|
| Enable vector store |
pgvector 0.8.x, HNSW index |
Runs in your existing Postgres, no extra infra |
| Embed |
text-embedding-3-small (1,536 dims) |
$0.02 per million tokens, fast |
| Query |
<=> cosine distance, top-k |
Works with both OpenAI and Voyage models |
| Augment | Claude or GPT-4o with retrieved docs | Context window stuffed, hallucination rate drops |
1. Why pgvector instead of a dedicated vector database
Pinecone and Weaviate are good products. If you need multi-tenant isolation, sub-millisecond p99 at 100M+ vectors, or native hybrid search with BM25, they earn their place. For most teams, those are future problems.
The cost calculus changes when you consider ops burden. A dedicated vector DB means a new billing line, a new set of credentials to rotate, a new failure mode to track, and a new SDK to keep current in your application. pgvector runs as a Postgres extension: one connection string, one backup strategy, one source of truth. At 10M documents with 1,536-dimensional embeddings, an HNSW index on a reasonably sized Postgres instance returns top-10 results in under 10ms. That covers the overwhelming share of RAG use cases.
pgvector 0.8.0 added iterative HNSW scans. That release made filtered similarity search practical without falling back to sequential scans every time a WHERE clause got specific. The 0.8.0 release was what tipped my team from “maybe later” to “ship it.”
2. Schema setup
Enable the extension once per database, then create your table.
-- enable pgvector (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;
-- documents table
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
source TEXT NOT NULL, -- filename, URL, or ID of source doc
chunk_idx INT NOT NULL, -- chunk number within the source
content TEXT NOT NULL, -- raw text of the chunk
embedding vector(1536) NOT NULL, -- OpenAI text-embedding-3-small
created_at TIMESTAMPTZ DEFAULT NOW()
);
Choosing between HNSW and IVFFlat
HNSW builds a navigable small-world graph. Queries scan the graph instead of comparing all rows. Build once, query immediately. The tradeoff is that the index takes more memory: roughly 8 bytes per dimension per row for a 1,536-dim column at default settings.
IVFFlat partitions the embedding space into centroid clusters. Faster to build, smaller memory footprint, but you must load rows before building the index or the centroid assignment is useless. If you are starting from zero rows, build HNSW.
-- HNSW index (recommended default)
-- m = connections per layer (default 16), higher = better recall at higher memory cost
-- ef_construction = candidate list during build (default 64), higher = better recall at slower build
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- IVFFlat alternative (only after loading rows)
-- lists = sqrt(row_count) is a good starting point for large tables
-- CREATE INDEX ON documents USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
Use vector_cosine_ops with the <=> operator when your embedding model normalizes vectors (OpenAI and Voyage both do). Use vector_l2_ops with <-> for raw Euclidean distance when vectors are not normalized. Use vector_ip_ops with <#> for inner product, which equals cosine similarity on normalized vectors and saves one normalization step.
3. Ingest pipeline in TypeScript
The ingest function chunks a document, calls the embedding API, and bulk inserts rows. Use postgres (the npm package, not pg) for its tagged-template SQL and native array support.
import postgres from "postgres";
import OpenAI from "openai";
const sql = postgres(process.env.DATABASE_URL!);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const CHUNK_SIZE = 512; // tokens, not characters
const CHUNK_OVERLAP = 64; // tokens of overlap between adjacent chunks
function chunkText(text: string, size: number, overlap: number): string[] {
// naive word-boundary chunker — swap for tiktoken in production
const words = text.split(/s+/);
const chunks: string[] = [];
let start = 0;
while (start < words.length) {
const end = Math.min(start + size, words.length);
chunks.push(words.slice(start, end).join(" "));
start += size - overlap;
}
return chunks;
}
async function embedBatch(texts: string[]): Promise<number[][]> {
const response = await openai.embeddings.create({
model: "text-embedding-3-small",
input: texts,
});
return response.data.map((d) => d.embedding);
}
export async function ingestDocument(source: string, text: string): Promise<void> {
const chunks = chunkText(text, CHUNK_SIZE, CHUNK_OVERLAP);
// embed in batches of 100 (OpenAI max batch size)
const BATCH = 100;
for (let i = 0; i < chunks.length; i += BATCH) {
const batch = chunks.slice(i, i + BATCH);
const embeddings = await embedBatch(batch);
const rows = batch.map((content, j) => ({
source,
chunk_idx: i + j,
content,
embedding: JSON.stringify(embeddings[j]),
}));
await sql`
INSERT INTO documents (source, chunk_idx, content, embedding)
SELECT
r.source,
r.chunk_idx::int,
r.content,
r.embedding::vector
FROM jsonb_to_recordset(${JSON.stringify(rows)}::jsonb)
AS r(source text, chunk_idx text, content text, embedding text)
`;
}
console.log(`[ingest] ${source}: ${chunks.length} chunks stored`);
}
A note on chunk size: 512 words is a starting point. The right size depends on your source material. Legal documents with dense paragraphs do better at 256 words. Code files need at least 300 lines or you lose function context. The overlap prevents the embedding from missing a sentence that straddles a chunk boundary.
4. Query pipeline in TypeScript
Embed the user’s question, run a top-k cosine similarity search, return the matching chunks.
export async function queryDocuments(
question: string,
topK = 5,
): Promise<Array<{ source: string; content: string; distance: number }>> {
// embed the question with the same model used at ingest time
const [embedding] = await embedBatch([question]);
const embeddingStr = JSON.stringify(embedding);
const rows = await sql<{ source: string; content: string; distance: number }[]>`
SELECT
source,
content,
(embedding <=> ${embeddingStr}::vector) AS distance
FROM documents
ORDER BY embedding <=> ${embeddingStr}::vector
LIMIT ${topK}
`;
return rows;
}
The <=> operator returns cosine distance (0 = identical, 2 = opposite). Lower numbers win. If you add metadata filters, add them in the WHERE clause before ORDER BY so the planner can use the HNSW iterative scan introduced in 0.8.0.
// filtered query example — same model must have returned results for this source
const rows = await sql<{ source: string; content: string; distance: number }[]>`
SELECT source, content, (embedding <=> ${embeddingStr}::vector) AS distance
FROM documents
WHERE source = ${filterSource}
ORDER BY embedding <=> ${embeddingStr}::vector
LIMIT ${topK}
`;
5. Wiring retrieved docs into an LLM call
Concatenate the retrieved chunks into a context block, then call your model of choice. Claude 3.5 Sonnet or GPT-4o both handle long contexts well. Keep the context block under 80,000 tokens for cost reasons.
import Anthropic from "@anthropic-ai/sdk";
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
export async function answerWithRAG(question: string): Promise<string> {
const docs = await queryDocuments(question, 5);
if (docs.length === 0) {
return "No relevant documents found.";
}
const context = docs
.map((d, i) => `[${i + 1}] (${d.source})n${d.content}`)
.join("nn---nn");
const prompt = `You are a helpful assistant. Answer the question using only the provided context.
If the context does not contain the answer, say so.
Context:
${context}
Question: ${question}`;
const response = await anthropic.messages.create({
model: "claude-sonnet-4-6-20250929",
max_tokens: 1024,
messages: [{ role: "user", content: prompt }],
});
const block = response.content[0];
return block.type === "text" ? block.text : "";
}
The “answer using only the provided context” instruction is load-bearing. Without it, the model mixes retrieval with parametric memory and you cannot tell which is which. If the answer comes from the context, citations work. If it comes from training data, they do not. Force the distinction at the prompt level.
One more thing worth noting: rerank before you send to the LLM. A fast cosine search returns the 5 closest chunks by vector distance, but distance does not always equal usefulness. A cross-encoder reranker (Cohere Rerank costs about $1 per 1,000 queries) takes your top-20 candidates and scores them for actual relevance before you trim to 5. The quality jump is noticeable. Skip the reranker while prototyping, add it before you hit production.
6. Two gotchas that bite everyone
Chunk size drives recall more than index parameters
Most teams spend hours tuning HNSW m and ef_construction and see marginal gains. The actual lever is chunk size and overlap. A chunk that is too short loses context (the model cannot answer a cross-sentence question). A chunk that is too long pulls in noise, dilutes the embedding, and wastes context window in the LLM call. Run a quick eval: take 20 representative questions, retrieve top-5, then manually score whether the answer appeared in the returned chunks. Adjust chunk size in 100-word steps until recall tops 85%. Then tune the index.
Build the index after bulk loading, not before
HNSW indexing at insert time is slow. If you load 500,000 documents and the HNSW index exists, every INSERT pays the graph update cost. The fast path: load all rows with the index dropped, then build it once with CREATE INDEX. On a table of 500,000 rows with 1,536-dim embeddings, a cold HNSW build takes roughly 8 to 12 minutes on 4 vCPUs. That is far cheaper than the cumulative insert overhead.
-- drop the index before bulk load
DROP INDEX IF EXISTS documents_embedding_idx;
-- ... run your ingest pipeline ...
-- rebuild once after load
CREATE INDEX documents_embedding_idx
ON documents USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
The bottom line
The full pipeline is about 120 lines of TypeScript and three SQL statements. pgvector 0.8.x is stable enough for production, HNSW is the right default index for most teams, and the two things that matter most for answer quality are chunk size and staying consistent between embed-at-ingest and embed-at-query time (same model, same preprocessing). Dedicated vector DBs are not wrong, they are just a layer you do not need until your row count passes 50M or your recall requirements get strict enough to warrant a tuning team.
What chunk size worked best for your use case? Drop it in the comments.
GDS K S · thegdsks.com · follow on X @thegdsks
Good retrieval beats a better model every time.
