A Quick-ish Rundown of LLM Basics

Over the past few days, I’ve realized that there are a lot of folks out there using LLMs who haven’t had an opportunity to dig, even a little, into the basics of how LLMs really work. And I guess that makes sense; for the most part, the average person doesn’t have a lot of reason to know this. But if you’re going to be a power user, there are things that would really help you to understand.

Below are the most basic basics. I’m not covering everything, just the pieces that, once you get them, should make the rest start to make sense as well. Hopefully it helps someone out there.

Tokens

When you write something to an LLM, it doesn’t break your text down by character; it breaks it down into groups of characters called “tokens”. Every LLM has its own tokenizer, so not all models choose the same tokens.

Here’s a real world example of what tokenization might look like using Qwen3.6 27b’s tokenizer: https://huggingface.co/Qwen/Qwen3.6-27B/blob/main/tokenizer.json. If you open that file, you’ll see the full list of tokens that Qwen3.6 27b utilizes.

As for how tokens work… here’s an example:

“This is a token”
– That’s 15 characters

‘This’ ‘Ġis’ ‘Ġa’ ‘Ġtoken’
– That’s 4 tokens. You’ll notice ‘Ġ’ at the start of every token after the first; that’s how GPT-2 style byte-level tokenizers, which this one follows, write a leading space

These line up to numbers, which the LLM then uses to do matrix math to determine the right output. If we go back to the link I gave you above, then you can see the following:

This   == 1919
Ġis    == 369
Ġa     == 264
Ġtoken == 3817

So Qwen3.6 27b would see your sentence as (1919, 369, 264, 3817). It then does matrix math and other cool pattern-y stuff to determine the best tokens to respond to you with.

So remember this when you hear that an LLM has a context window of 1,000,000 tokens: it’s talking about those things. Sometimes whole words are tokens, sometimes not. Don’t just assume every word is a token; tokenizers tend to create tokens for the most commonly used words. “This”, “is”, and “a” are all very common in the English language, and “token” is very common when talking about LLMs.
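To make the token-to-number step concrete, here is a minimal sketch. It is a toy lookup, not real BPE tokenization: the vocabulary holds only the four IDs quoted above, and `toy_encode` is a made-up helper that assumes every word after the first gets a ‘Ġ’ prefix.

```python
# Toy vocabulary: only the four token IDs quoted above (assumed, not the full tokenizer.json).
TOY_VOCAB = {"This": 1919, "Ġis": 369, "Ġa": 264, "Ġtoken": 3817}

def toy_encode(text, vocab):
    """Split on spaces, mark each word after the first with 'Ġ', then look up the IDs."""
    words = text.split(" ")
    pieces = [words[0]] + ["Ġ" + w for w in words[1:]]
    return pieces, [vocab[p] for p in pieces]

sentence = "This is a token"
pieces, ids = toy_encode(sentence, TOY_VOCAB)
print(len(sentence), "characters")
print(len(pieces), "tokens:", pieces)
print("what the model sees:", ids)
```

A real tokenizer also handles words that are not in the vocabulary by splitting them into smaller pieces, which is why one word can cost several tokens.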

Context Windows

The way I usually describe context windows is to imagine the full Song of Ice and Fire book series printed out on one really long parchment, and you have a piece of cardboard with a window cut in it that you can read text through. All you know is whatever’s currently in that window. If someone asks you about something outside the window? Tough luck, you don’t know it.

Now, the obvious thought is “well just make the window bigger”. The problem is that if you cut the window too big, you have a harder time finding any specific thing in there, and you start mixing details up. You’ve learned how to read a certain amount within that window, and pushing past that doesn’t go great. If the full book was the length of a parking lot, and someone asked you for details that could exist anywhere in that whole parking lot worth of text… well, good luck.

That’s pretty much how it works with LLMs. You’ll see models advertise huge context windows like 1,000,000 tokens, but the real-world practical use of that is a lot smaller than the marketing implies. The bigger you stuff that window, the worse the model gets at pinpointing specific information inside it. There’s a whole pile of benchmarks (needle in a haystack tests, NoLiMa, RULER, etc) showing accuracy drop as the context fills up. So a 200k token context window is not an invitation to dump 200k tokens in there and expect great results. You’ll generally get a much better answer giving the model 8k of really relevant tokens than 200k of “everything I have on the topic”.

To get a better visualization, check this benchmark out: https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/oQdzQvKHw8JyXbN87

Scroll down to the results section and you’ll see a table- the numbers in there represent how well the model pulls the right info out based on the context size it was fed. You can see that some models, like GPT-5.2 or Opus 4.6, did great all the way up to 120k (except 5.2 pro for some reason…). But look at something like minimax 2.5, for example: by the time you hit 60k tokens, you have less than a 50% chance to get all the right info you asked for.

This is a struggle a lot of us running local models deal with, and it usually means you want to account for that with a lot of great wrapper software or middleware.
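One thing that kind of middleware might do is trim what goes into the window. The sketch below is only an illustration: `pick_chunks`, the relevance scores, and the token counts are all made up, and it assumes something upstream has already scored each chunk.

```python
def pick_chunks(chunks, budget_tokens):
    """Greedily keep the highest-scoring chunks that still fit in the token budget.

    chunks: list of (relevance_score, token_count, text) tuples (hypothetical values).
    """
    chosen, used = [], 0
    for score, n_tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + n_tokens <= budget_tokens:
            chosen.append(text)
            used += n_tokens
    return chosen, used

chunks = [
    (0.9, 3000, "chapter that answers the question"),
    (0.2, 6000, "loosely related appendix"),
    (0.8, 4000, "section with supporting detail"),
    (0.5, 2000, "background notes"),
]
chosen, used = pick_chunks(chunks, budget_tokens=8000)
print(chosen, used)
```

With an 8k budget, only the two most relevant chunks make it in (7,000 tokens); the rest stay out of the window instead of diluting it.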

Model Sizes (ie- parameters)

When we talk about models, we size them based on the number of parameters they have. 1M is a 1 Million parameter model. That’s itty bitty. 1b is 1 billion parameters- also itty bitty. Many modern models release in really huge sizes like 397b to 1T (1 Trillion parameters).

The easiest way to imagine parameters is as data points that can correspond to several pieces of data at once. So 1 datapoint doesn’t necessarily equate to something like “When did the first Ford car release?” It could also correspond to several other pieces of info at once.

Models are generally created in BF16 format to start with. Size-wise, BF16 equates to about 2GB per 1b parameters, so a 20b model would be 40GB. If you “quantize” the model (the easiest way to think of it is ‘compressing’ the model) to 8bpw, or ~q8_0, that becomes 1GB per 1b. If you go further to 4bpw, or ~q4_0, you get down to 0.5GB per 1b. That’s how we fit big models on smaller hardware.

As you can imagine, the more you quantize, the more mistakes the model will likely make.
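The size math above is just parameters times bits per weight, divided by 8 bits per byte. A rough sketch, assuming exactly 16, 8, and 4 bits per weight (real quant formats carry a little extra overhead, and this ignores the memory the context itself needs):

```python
def model_size_gb(params_billions, bits_per_weight):
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8

for label, bpw in [("BF16", 16), ("~q8_0", 8), ("~q4_0", 4)]:
    print(f"20b at {label}: about {model_size_gb(20, bpw):.0f} GB")
```

Swap in your own parameter count to get a first guess at whether a model will fit on your hardware.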

Open Weight Models

These are models that you can download and run yourself. There are a few ways to do it, and here are some examples:

  • Raw transformers – this is the original format of the models
  • GGUF – This is a model that has been converted to run in llama.cpp
  • MLX – This is converted to run in Apple’s MLX

Many applications, like Ollama or LM Studio, wrap some of these and then have their own repositories to pull models from. For best speed and the fastest updates for model support, you generally want to avoid that. You can find all models here: https://huggingface.co.

Training

LLMs learn by being “trained”. It’s a complex process that, at the absolute highest level, involves the LLM seeing billions upon billions of tokens of information and learning patterns from it. “When I see someone say this, it usually involves someone responding with that” kind of thing. This is why people constantly harp about good data in training being the most important thing- if you have really clean examples of speech, knowledge, etc, it is easier for the LLM to find the right patterns.

Eventually, more powerful LLMs start to infer new patterns that they haven’t seen before. Remember the old math problems like if A == B and B == C, then A == C? Imagine that on a MASSIVE scale, where it creates connections between information many many many many layers deep to get from A to Z.

  • Training a commercially viable model takes ungodly amounts of money and data, and you need really smart people to do it. Companies spend millions to billions of dollars making some of the most powerful models.
  • Training data is hard to come by. If you’ve heard about how some companies scraped the internet for data, that’s why. They are looking for examples of speech, knowledge, etc. When a company wants to train on your data, it is less that they want to include your PII in the model (they generally don’t; they don’t want that bad publicity if someone makes the model spit it out) and more that they want nice clean interactions to give to the LLM to look at and learn more patterns.
  • This is also why AI companies are mad at each other for “distilling” their products. Distilling is the act of interacting with an LLM over and over again to get examples of the LLM’s speaking or thinking process, then creating training data to teach another LLM to act or reason that same way. An example of this from recently was that DeepSeek, Moonshot AI, and MiniMax got accused of doing this by Anthropic. The accusation was that they were using thousands of fraudulent accounts to interact with Claude millions of times, then using those interactions to teach their own models to think and speak similarly.
  • It’s possible to train little fun models pretty cheaply. One guy recently trained a small model from scratch on 1800s text, with nothing at all modern in it. This little model has no concept of anything past the industrial age.

Finetuning / Post-Training

When you hear a non-tech company say they are “training a model”, they most likely mean finetuning or post-training an open weight model.

Imagine an LLM as a big calculator for matrix math. Numbers go in, one number comes out. Do that over and over and you get a response. The neat thing about matrix math is something called rank factorization- the idea that you can represent a matrix m*n with rank r by using smaller matrices m*r and r*n. Some super smart folks figured out that this allowed us to have LoRAs, which you can think of like add-on components to LLMs that modify the weight distribution.

In other words- rather than retraining the entire model to try to add more information, you train an itty bitty set of add-on matrices with the info you want, and then you can load the original model + LoRA at the same time to get a post-trained model.
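To see why the rank factorization trick makes this cheap, here is a small sketch in plain Python. The 4096 x 4096 layer size and rank 8 are assumptions picked for illustration, and the tiny matrices are made-up numbers, not real weights.

```python
def lora_param_counts(m, n, r):
    """Numbers to train: a full m*n matrix vs. the two small factors m*r and r*n."""
    return m * n, m * r + r * n

def matmul(a, b):
    """Plain-Python matrix multiply for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

full, low_rank = lora_param_counts(4096, 4096, 8)
print(full, low_rank, full // low_rank)  # the factors are 256x smaller here

# A 3x1 factor times a 1x3 factor rebuilds a full 3x3 update from only 6 numbers.
B = [[1], [2], [3]]
A = [[1, 0, -1]]
delta = matmul(B, A)

# Adding that update to a (made-up) original weight matrix gives the adapted weights.
W = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
adapted = [[W[i][j] + delta[i][j] for j in range(3)] for i in range(3)]
print(adapted)
```

The original weights stay untouched; only the small factors get trained, which is why they can be shipped and loaded separately.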

Truthfully- I am pretty staunchly in the camp that you can’t reliably train new knowledge into a model this way. That’s a very common but not a universal view within the deeper LLM tinkering community; some companies have made post-training their bread and butter. I do believe that you CAN train styles, tones, etc really well into it (for example: training a model to handle documentation a certain way, or think a certain way), but ultimately I’ve yet to see a good example of a post-trained model outside of basic Instruct models from the same manufacturer that has actually been worth the effort. Maybe there are some out there, but I’m not familiar with them.

Anyhow, long story short- you CAN post-train a small model for $100 or less, but I wouldn’t even recommend it unless you really understand what you want to get out of it and why. There’s very little a post-trained model can do that you can’t do with a good workflow, prompt and data to RAG against.

How LLMs Respond

When you boil it down, LLMs work in a really simple loop. You give it a chunk of tokens. It processes them and spits out one new token. Then it takes all your original tokens plus that one new token it just spit out, and processes the whole thing again, and spits out the next token. Then it takes all your tokens plus the two new tokens, processes again, spits out the next. On and on, one token at a time, until it decides it is done and sends a stop token. You now have your response.

To simplify it- LLMs don’t think about the response all at once- they think 1 token at a time. Over and over and over until they are done. That’s it.
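That loop can be sketched in a few lines. The “model” here is a made-up lookup table standing in for the real matrix math, and `<stop>` is an assumed name for the stop token; only the shape of the loop is the point.

```python
STOP = "<stop>"

# Hypothetical stand-in for a model: picks the next token from the last one seen.
TOY_MODEL = {"Hello": "there", "there": "!", "!": STOP}

def next_token(tokens):
    """A real LLM would process every token so far; this toy only peeks at the last."""
    return TOY_MODEL.get(tokens[-1], STOP)

def generate(prompt_tokens, max_new_tokens=16):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # feed in the prompt plus everything generated so far
        if tok == STOP:            # the model decided it is done
            break
        tokens.append(tok)         # the new token becomes part of the next input
    return tokens[len(prompt_tokens):]

response = generate(["Hello"])
print(response)
```

Each pass through the loop produces exactly one token, which is why longer responses take longer and cost more.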

This is also why “reasoning” works. If you ask a model to just answer a hard math problem cold, it can fumble it, because by the time it gets to the answer it’s already locked into early tokens it picked. But if you tell it to think out loud first- write out the problem, work through it step by step- then while it’s writing all that, it’s still just predicting one token at a time, except now each new token gets to “see” all the work it just laid out. If it makes a mistake at step 2, it can sometimes catch it at step 4 and shift the line of thinking before it commits to a final answer.

If you ever watch an LLM think, and it constantly goes “But wait…”, that’s because it was trained to do that so it doesn’t lock in too early. It says its response, then it challenges the response, and in doing so gives itself a chance to realize the response was wrong.

That’s basically what chain of thought and reasoning models are: the model writing out its work so it has more to reference when generating each next token. It’s not magic, it’s just giving the model more useful context to predict from. The flip side is that more reasoning means more tokens, which means more time and more cost. And some models, like Qwen3.5/3.6 and Gemma 4, overthink badly. With those, you want to use a workflow app to manually apply CoT, if you can. Since I use Wilmer everywhere, I have workflows specifically to use Qwen/Gemma with thinking disabled, and then have a manual CoT step. That helps with overthinking massively.

RAG – Retrieval Augmented Generation

This is a $5 term for a $0.05 concept. When we talk about RAG, it boils down to a very simple concept: give the LLM the answer before it responds. Everything else, when talking about RAG, is talking about a design pattern.

  • Simplest example: The simplest form of RAG would be copying the text of an article or tutorial, putting it in your prompt, and asking the LLM to answer a question about that. The LLM will use the article to answer you.
  • Next level of simplicity: You might ask an LLM a question, the LLM uses a tool (web search, local wiki search, whatever) to pull the article, concatenates it into your prompt, and answers your question.
  • What a lot of folks think of when they think of RAG: You have a program that takes thousands, or even millions, of documents, breaks each one into logical chunks, turns each chunk into an “embedding” (a list of numbers that captures its meaning), and stores them somewhere easy to search, such as a vector database. Then, when you ask a question, it does some fancy stuff in the background to find the right chunks and answer your question with them. Since putting 1,000,000 files into your context all at once is impossible, this is how you go about the oft-advertised “chat with your documents” situation.

But all together, RAG comes down to a very simple concept: give the LLM the answer before it responds. That’s it. LLMs are very, very strong at this, and it’s a great way to avoid hallucinations.

For the most part, RAG solutions are not an LLM problem, they’re a software problem. If you’re struggling with RAG, you probably need to revisit HOW you’re feeding the data to your LLM and whether you’re giving it too much unnecessary stuff along with the right stuff.
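Here is a minimal sketch of that “give the LLM the answer first” pattern. It swaps in plain word overlap for embeddings and a vector database so it stays self-contained; the documents, the question, and the prompt wording are all made up, and the resulting prompt would be sent to whatever LLM you use.

```python
import re

def words(text):
    """Lowercased set of the words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def build_prompt(question, chunks, top_k=1):
    """Pick the chunks sharing the most words with the question and put them in the prompt."""
    ranked = sorted(chunks, key=lambda ch: len(words(question) & words(ch)), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only this context.\n\nContext:\n{context}\n\nQuestion: {question}"

docs = [
    "The warranty period for the X100 is 24 months.",
    "Our office is closed on public holidays.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_prompt("How long is the warranty period for the X100?", docs)
print(prompt)
```

Notice that all of the interesting work (choosing what goes in the prompt) happens before the LLM is ever called, which is why RAG trouble is usually a software problem.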

Hallucinations

A hallucination is when the LLM responds with something that’s flat wrong. The reason it happens comes back to that loop in the How LLMs Respond section: an LLM doesn’t actually know anything. It’s a pattern matcher predicting the most likely next token based on what came before, based on the training that it did to determine “when I see X, I usually see a response of Y”. If the most likely next token happens to be the wrong one, well, that’s what you get. This can especially happen with information that there isn’t a lot of great data out there for, so the LLM had to infer the relationships. Asking a detailed question about Excel means it has millions of example questions, articles, documents, etc from the internet to have learned from; asking a question about FIS’ Relius Administration has far far fewer examples, so it likely inferred a lot of things based on other patterns, and it will hallucinate like mad.

LLMs, as a technology, don’t have a built-in “I’m not sure about this” lever they can pull. They just generate whatever the patterns say to generate, and confidence isn’t really part of the equation. The answer it gave you is ‘right’ from the perspective that it generated the most likely pattern. Whether that pattern is of any use to you has nothing to do with the LLM lol.

The most common reasons you see hallucinations:

  • The training data was wrong, so the pattern the model learned is wrong.
  • The training data didn’t cover the topic well, so the model is filling in gaps with whatever sounds plausible.
  • You asked something outside what the model was really trained for, and it tries to answer anyway because that’s what it was trained to do- give an answer.
  • Your context window is huge or messy, and the model is losing track of what’s actually relevant in there.
  • The model is over-quantized and just making more mistakes generally (going back to that earlier section).

Reasoning models hallucinate a bit less on certain types of problems because they get a chance to second-guess themselves while writing things out, but they absolutely still hallucinate. The single best mitigation is to put the answer in the context for it, which is RAG.

Using That Info

Knowing all this should hopefully help you start to narrow down why some of the “pro tips” of using LLMs exist. When you want a factual answer, you don’t just ask the LLM. Right or wrong, you’re getting a confident response. Instead, make sure you are injecting the right answer in before it responds- this often means tool use such as web search or, even better, “Deep Research” features you find on commercial LLMs.

This also hopefully will help you see why jamming ALL your codebase into the LLM is the wrong move, and why constantly asking “What model has a bigger context window?” is the wrong question. It’s lazy to just look for bigger context windows, and that laziness will bite you. Instead, focus on how you can break the data apart so that the LLM can work in the confines of what it handles best. That means writing or downloading some supporting software.

Anyhow, good luck folks. Hope this helps the like 4 people that might read this far.

I Failed My Azure AI-102 Exam the First Time – Here’s What I Learned

There’s something nobody tells you about the Microsoft Azure AI Engineer (AI-102) certification: the practice exam and the real exam feel like two completely different tests.

I know this firsthand – because I failed the first time.

But here’s the twist: before I even attempted the exam, I had already built a real-world Retrieval Augmented Generation (RAG) system using Azure AI services for a live demonstration to associates from multiple teams in Cognizant UK and a group of colleagues from the Department of Education. I had hands-on experience with the very technology the exam covers. And I still failed.

This article is for every developer who has studied Microsoft Learn, watched the YouTube videos, sailed through the practice exams – and then walked out of the real test wondering what just happened.

Why I Decided to Take AI-102
This was entirely my own decision. Nobody asked me to do it.

I was already working with Azure AI services in my day-to-day work as a Senior Software Engineer at Cognizant, delivering enterprise and UK government applications. I wanted to formalise my knowledge, deepen my understanding of the broader Azure AI ecosystem, and demonstrate that my expertise went beyond just the services I was using on specific projects.

The AI-102 felt like the right certification – broad enough to cover the full Azure AI landscape, but technical enough to mean something.

How I Prepared
My study approach was straightforward:

  • Microsoft Learn : the official learning paths for AI-102

  • YouTube : practical walkthroughs and service deep-dives

  • Practice exams : Microsoft’s official sample questions and third-party practice tests

I studied consistently over several weeks, working through each Azure AI service systematically. Azure Cognitive Services, Azure OpenAI, Azure AI Search, Document Intelligence, Speech, Vision – I covered them all.

And when I took the practice exams, I was passing comfortably. I felt ready.

I wasn’t.

The Gap Nobody Warns You About
The Microsoft demo and practice exams are scenario-light. They test whether you know what a service does, what its key features are, and roughly when to use it.

The real AI-102 exam is fundamentally different. It is scenario-heavy.

You are not asked “what does Azure Document Intelligence do?” You are asked something closer to: “A financial services company needs to extract structured data from thousands of handwritten forms, integrate it with their existing Azure infrastructure, and ensure compliance with GDPR. Which combination of Azure AI services and configurations would you recommend, and why?”

The real exam puts you inside a business problem and asks you to think like an architect, not a student. It tests your judgement, not just your memory.

The practice exams did not prepare me for that shift in thinking. They were too easy – closed-ended, straightforward, and forgiving. I passed them confidently and mistook that confidence for readiness.

What Made It Harder: I Built Before I Studied
Here is something unusual about my journey: I actually built a RAG (Retrieval Augmented Generation) system using Azure AI before I sat the exam.

I developed and demonstrated an internal AI tool that allowed users to upload documents and query them intelligently using Azure AI Search for indexing and retrieval, combined with Azure OpenAI for generation. I presented this to associates from multiple teams in Cognizant UK and a group of colleagues from the DfE as a practical demonstration of what Azure AI could do in an enterprise context.

This was not my first time sharing Azure AI knowledge internally either. Around three years earlier, I had delivered an introduction to Azure AI services to Cognizant UKI associates – covering the practical landscape of what was available and how it could be applied in real projects. The RAG demo felt like a natural evolution of that earlier session – moving from “here is what Azure AI can do” to “here is a working system built with it.”

You might think that hands-on experience would make the exam easier. In some ways it did – I understood the architecture deeply, I knew the practical challenges, and I could reason about real-world scenarios confidently.

But the exam also exposed the gaps in my theoretical knowledge. There were services and configurations I had never needed in my specific project that appeared heavily in the exam. The breadth of AI-102 is wide – and real-world projects naturally focus on a subset of that breadth.

Building first taught me the practical. The exam demanded the theoretical. The gap between them was where I stumbled.

The Second Attempt
After failing, I approached my preparation differently.

Instead of going through Microsoft Learn linearly, I focused specifically on scenario-based thinking. For every service I studied, I asked myself: “In what business situation would I choose this over the alternatives? What are the constraints, trade-offs, and compliance considerations?”

I stopped treating the services as a list to memorise and started treating them as a toolkit to reason about.

I passed on my second attempt.

What the AI-102 Actually Tests
If you are preparing for this exam right now, here is what I wish someone had told me:

  1. Scenario thinking beats memorisation

The exam will put you in business situations. Practice thinking about why you would choose a service, not just what the service does.

  2. The practice exam is too easy – don’t be fooled

Passing the Microsoft sample questions comfortably does not mean you are ready. Seek out harder, scenario-based practice materials.

  3. Breadth matters as much as depth

Even if you work with Azure AI every day, the exam covers services you may rarely touch. Study the full ecosystem, not just your daily toolkit.

  4. Real experience helps but does not replace theory

Having built RAG systems and Azure AI integrations in production gave me invaluable context – but I still needed to understand the full theoretical landscape the exam demands.

  5. Failure is data, not defeat

My first failure told me exactly where my preparation was weak. I treated it as a diagnostic, not a verdict.

Where I Am Now
I am currently renewing my AI-102 certification, which reflects how seriously I take staying current in this field. The Azure AI ecosystem moves quickly – new services, updated capabilities, evolving best practices. Keeping the certification current is not just a box to tick. It is a commitment to remaining genuinely expert in the technology I use every day.

If you are preparing for AI-102, I hope this article saves you from the same mistake I made – assuming that passing practice exams means you are ready for the real thing.

Study the scenarios. Think like an architect. And if you fail the first time, use it.

Aromal Chulliyil Muraleedharan is a Senior Software Engineer at Cognizant UK with 8+ years of experience building enterprise and UK government applications using .NET, Azure, and AI services. He holds the Microsoft Azure AI Engineer (AI-102) and Azure Developer (AZ-204) certifications.

CVMetric — Free ATS Resume Builder – Built for Modern Hiring Systems

We just shipped CVMetric, a resume builder web application designed to help job seekers create ATS-optimized resumes that actually pass applicant tracking systems and reach recruiters.

Problem We Solved

Most resumes fail not because of skills, but because they are:

  • Not structured for ATS parsing
  • Missing keyword alignment with job descriptions
  • Poorly formatted for recruiter readability

👉 As a result, a large percentage of applications never reach a human reviewer.

What CVMetric Offers

✔ ATS Resume Builder with structured form-based editing
✔ Real-time resume preview with print-ready output
✔ Resume scoring system (ATS compatibility + content quality)
✔ Job description matching with skill gap detection
✔ Professional resume templates (minimal, modern, sidebar)
✔ Export to PDF, DOCX, and JSON formats
✔ Resume dashboard for managing multiple versions
✔ PDF & JSON import system to rebuild resumes into structured data

Technical Implementation Highlights

Built with a focus on scalability and structured data design:

  • Next.js (App Router) for full-stack architecture
  • React + Zustand for state management
  • MongoDB + Mongoose for resume persistence layer
  • Modular resume schema for flexible template rendering
  • Rule-based ATS scoring engine (keyword + structure analysis)
  • Print-first design system for A4/Letter export accuracy
  • Template engine supporting multiple layout strategies

Core Engineering Focus

We prioritized:

  • Structured resume data modeling (not just UI forms)
  • Separation of content vs presentation (template system)
  • Deterministic ATS scoring logic
  • Export consistency across PDF/DOCX/print views
  • Performance-first editor architecture

What’s Next

We’re actively improving:

  • smarter job matching system
  • advanced ATS scoring rules
  • more resume templates
  • performance optimizations for large resumes

👉 Live project: cvmetric.com

Would love feedback from developers, engineers, and product builders on architecture, scalability, or UX improvements.

Claude Code Billing Alert, Workflow Enhancements & Open-Source OCR Benchmarks

Today’s Highlights

Today’s highlights include a critical billing bug affecting Claude Code users, a comprehensive cheat sheet for optimizing Claude Code workflows, and the release of DharmaOCR, an open-source 3B SLM with strong cost-performance benchmarks.

Claude Code Billing Bug: ‘HERMES.md’ in Git Commits Triggers API Rates (r/ClaudeAI)

Source: https://reddit.com/r/ClaudeAI/comments/1svdm1w/psa_the_string_hermesmd_in_your_git_commit/

A critical bug has been discovered in Claude Code’s billing system that can silently incur unexpected costs for developers. Users are reporting that the presence of the string “HERMES.md” (case-sensitive) in their Git commit history can cause Claude Code to bypass the Max plan’s bundled usage and instead bill at standard API rates. One developer reported an unexpected $200 charge due to this issue.

Anthropic’s support has acknowledged the bug, indicating it’s an internal routing error related to an experimental feature that was inadvertently enabled for some users. This issue highlights the importance for developers to scrutinize cloud service billing and API usage patterns, especially when engaging with developer tools still under active development or integration. Developers are advised to check their Git commit histories and monitor their Claude Code billing closely to avoid similar unexpected charges.

Comment: This is a serious heads-up for anyone using Claude Code and Git. Unexpected billing bugs like this can derail project budgets fast. Always double-check your commits and monitor your spend.

Claude Code Cheat Sheet for Daily Use and Enhanced Workflows (r/ClaudeAI)

Source: https://reddit.com/r/ClaudeAI/comments/1sv852q/claude_code_cheat_sheet_after_6_months_of_daily/

Following positive community feedback on a previous post, a Claude Code power-user has compiled a comprehensive “cheat sheet” based on six months of daily use. This resource aims to help developers optimize their Claude Code workflows by outlining effective commands, configuration tips, and interaction patterns. The sheet covers strategies for better prompt engineering within the Claude Code environment, managing context efficiently, and leveraging the tool for specific coding tasks such as refactoring, debugging, and generating boilerplates.

It emphasizes practical advice for developers looking to deepen their integration of Claude Code into their daily development cycle, moving beyond basic prompts to more structured and repeatable interactions that yield superior results and productivity gains. The community contribution underscores the growing importance of shared knowledge in maximizing the utility of AI-powered developer tools, providing a valuable resource for both new and experienced users.

Comment: This cheat sheet is gold for Claude Code users. It distills months of practical experience into actionable tips, especially on structuring prompts for complex coding tasks.

DharmaOCR: Open-Source 3B SLM with Cost-Performance Benchmarks (r/MachineLearning)

Source: https://reddit.com/r/MachineLearning/comments/1sun6wt/dharmaocr_opensource_specialized_slm_3b/

DharmaOCR, a new open-source Specialized Small Language Model (SLM) with 3 billion parameters, has been released on Hugging Face, complete with public models and datasets. This release is accompanied by a research paper detailing extensive experimentation and a robust cost-performance benchmark comparing DharmaOCR against larger LLMs and other open-source models specifically for Optical Character Recognition (OCR) tasks. The benchmark demonstrates DharmaOCR’s efficiency and accuracy, positioning it as a highly competitive solution for specialized text extraction, particularly where cost and latency are critical considerations.

Developers and researchers can freely access and experiment with DharmaOCR, providing a valuable resource for integrating efficient OCR capabilities into applications without the overhead of larger, more general-purpose models. The project emphasizes the potential of specialized SLMs to outperform or match larger models in specific domains, offering a practical alternative for resource-constrained environments or applications requiring fine-tuned performance. This is an excellent example of a practical, open-source tool that can be immediately tested and integrated.

Comment: An excellent example of how specialized SLMs can deliver competitive performance with better cost-efficiency for specific tasks like OCR. This is definitely worth exploring for targeted applications.

I built an AI-powered PDF generation API — here’s how

PDF generation from code is still painful in 2026. You either wrestle with complex libraries that need 200+ lines for a simple invoice, or pay for bloated enterprise services.

So I built PDFGen AI — a simple REST API where you send HTML and get a PDF URL back. Or better — describe what you want in plain English and AI generates the template for you.

The Problem

Every developer who’s tried to generate PDFs programmatically knows the pain:

  • wkhtmltopdf — outdated, rendering issues, painful to install on servers
  • Puppeteer/Playwright — powerful but heavy, needs headless Chrome
  • jsPDF — client-side only, limited styling
  • PDFKit — low-level, you’re drawing rectangles manually
  • Paid services — $50-200/month for what should be a simple API call

All I wanted was: send HTML, get a PDF. That’s it.

The Solution: One API Call

curl -X POST https://pdfgen-api.vercel.app/api/generate \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"html": "<h1>Invoice #001</h1><p>Amount: $500</p>"}'

Response:

{
  "success": true,
  "url": "https://storage.supabase.co/pdfs/invoice-abc123.pdf"
}

That’s the entire integration. No SDKs. No config files. No dependencies.

The AI Magic

Instead of writing HTML yourself, you can use AI to do the heavy lifting.

Generate a Template from a Description

curl -X POST https://pdfgen-api.vercel.app/api/ai/template \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Professional invoice with logo, line items, tax, payment terms"}'

AI generates a complete, styled HTML template you can reuse.

Fill a Template with Data Automatically

curl -X POST https://pdfgen-api.vercel.app/api/ai/fill \
  -H "Authorization: Bearer pk_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "template": "<your-html-template>",
    "data": {
      "company": "Acme Corp",
      "items": [
        {"name": "Web Development", "amount": 50000},
        {"name": "Hosting", "amount": 12000}
      ],
      "tax_rate": 0.18
    }
  }'

AI maps your JSON data to the template fields — no manual field mapping needed.
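Because the mapping is done by a model, it can be worth computing the figures you expect to see before trusting the filled document. Here is a minimal sketch using the same data payload as the request above. The expected_totals helper is hypothetical, and it assumes tax is applied once to the sum of the item amounts; the API itself does not document how totals are derived.

```python
from decimal import Decimal

# Same "data" object as in the fill request above.
data = {
    "company": "Acme Corp",
    "items": [
        {"name": "Web Development", "amount": 50000},
        {"name": "Hosting", "amount": 12000},
    ],
    "tax_rate": 0.18,
}


def expected_totals(payload: dict) -> dict:
    """Hypothetical helper: subtotal, tax, and total for a payload.

    Assumes tax is applied once to the sum of the item amounts.
    Decimal avoids float rounding noise in currency arithmetic.
    """
    subtotal = sum(Decimal(str(item["amount"])) for item in payload["items"])
    tax = subtotal * Decimal(str(payload["tax_rate"]))
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}


totals = expected_totals(data)
print(totals["subtotal"], totals["tax"], totals["total"])
```

Comparing these numbers against the rendered PDF is a cheap sanity check on the AI-filled fields.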

JavaScript Example

const response = await fetch(
  "https://pdfgen-api.vercel.app/api/generate",
  {
    method: "POST",
    headers: {
      Authorization: "Bearer pk_your_key",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      html: '<h1>Invoice #001</h1><p>Total: $1,500</p>',
    }),
  }
);

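// Fail loudly on HTTP errors instead of reading `url` from an error body.
if (!response.ok) {
  throw new Error(`PDF generation failed: HTTP ${response.status}`);
}

const { url } = await response.json();
console.log("PDF ready:", url);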

Python Example

import requests

response = requests.post(
    "https://pdfgen-api.vercel.app/api/generate",
    headers={
        "Authorization": "Bearer pk_your_key",
        "Content-Type": "application/json",
    },
    json={
        "html": "<h1>Receipt</h1><p>Amount: $250</p><p>Status: Paid</p>"
    },
)

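# Raise on 4xx/5xx so a failed request doesn't surface as a KeyError below.
response.raise_for_status()
print("PDF ready:", response.json()["url"])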
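
If you'd rather not install requests, the same call works with the standard library alone. This is a rough sketch, not an official client: the function names and the 30-second timeout are my own choices, and the only response fields it relies on are the "success" and "url" keys shown in the response example earlier.

```python
import json
import urllib.request

API_URL = "https://pdfgen-api.vercel.app/api/generate"


def build_generate_request(html: str, api_key: str) -> urllib.request.Request:
    """Build the POST request for /api/generate without sending it."""
    body = json.dumps({"html": html}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def generate_pdf(html: str, api_key: str, timeout: float = 30.0) -> str:
    """Send the request and return the PDF URL from the JSON response."""
    request = build_generate_request(html, api_key)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # "success" and "url" are the fields shown in the response example.
    if not payload.get("success"):
        raise RuntimeError(f"PDF generation failed: {payload}")
    return payload["url"]


# Usage (performs a real network call, so it is left commented out):
# print("PDF ready:", generate_pdf("<h1>Receipt</h1>", "pk_your_key"))
```

Splitting request construction from sending keeps the first half testable offline, which is handy when you're iterating on templates without burning through quota.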

The Stack

Here’s what powers PDFGen AI:

Layer         | Technology                      | Why
--------------|---------------------------------|--------------------------------
Hosting       | Vercel (serverless)             | Zero config, auto-scaling
Framework     | Next.js (App Router)            | API routes + frontend in one
Auth + DB     | Supabase                        | PostgreSQL, auth, file storage
PDF Rendering | Puppeteer + @sparticuz/chromium | HTML to PDF in serverless
AI            | AWS Bedrock (Nova Micro)        | Fast, cheap template generation
Billing       | Lemon Squeezy                   | Merchant of Record

The Chromium Challenge

The hardest part was getting Puppeteer to work on Vercel serverless. The standard Chromium binary is too large. Here’s what worked:

  1. Use @sparticuz/chromium — stripped-down build for serverless
  2. Add outputFileTracingIncludes in next.config.ts to bundle the binary
  3. Launch with headless: "shell" mode for faster startup
  4. Disable GPU with setGraphicsMode = false

Cold-start PDF generation: under 5 seconds.

Why Supabase?

  • Auth — email/password with magic links, zero config
  • Database — PostgreSQL for API keys, usage tracking
  • Storage — PDFs stored with signed URLs
  • Free tier — generous enough for an MVP

Why AWS Bedrock?

I originally used the Anthropic API directly, but switched to Bedrock:

  • Pay-per-use — no monthly minimums
  • Amazon Nova Micro — fast, cheap, perfect for templates
  • Bearer token auth — simple, no complex AWS SDK needed

5 Built-in Templates

PDFGen AI comes with ready-to-use templates:

  1. Invoice — line items, tax, totals
  2. Receipt — clean payment receipt
  3. Report — business report with sections
  4. Certificate — achievement/completion
  5. Letter — formal business letter

Lessons Learned Building Solo

Start with the API, not the UI. I tested with curl for weeks before building the frontend. Forced me to get the DX right.

Free tiers are your friend. Total infra cost: ~$0/month. Vercel free, Supabase free, Bedrock pay-per-call.

Billing in India is tricky. Stripe doesn’t support Indian merchants for international payments. Lemon Squeezy acts as Merchant of Record — handles global payments, pays you out.

SEO from day one. Sitemap, robots.txt, JSON-LD, OG images — all added before launch. 10x easier than retrofitting.

Ship fast. Idea to production in 4 weeks. Not perfect, but working.

Pricing

Plan       | Price   | PDFs/month | AI calls/month
-----------|---------|------------|---------------
Free       | $0      | 50         | 10
Starter    | $19/mo  | 2,000      | 100
Pro        | $49/mo  | 15,000     | 500
Business   | $99/mo  | 100,000    | 2,000
Enterprise | $299/mo | Unlimited  | 5,000

No credit card required for free tier.
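
To see which tier fits a given workload, here is a small sketch that encodes the table above and picks the cheapest plan covering both quotas. It assumes the quotas are hard monthly caps with no overage pricing, which the table doesn't actually spell out, and the helper names are mine.

```python
import math

# (plan, USD per month, PDFs per month, AI calls per month), from the pricing table.
# "Unlimited" PDFs is modelled as infinity.
PLANS = [
    ("Free", 0, 50, 10),
    ("Starter", 19, 2_000, 100),
    ("Pro", 49, 15_000, 500),
    ("Business", 99, 100_000, 2_000),
    ("Enterprise", 299, math.inf, 5_000),
]


def cheapest_plan(pdfs_per_month: int, ai_calls_per_month: int = 0):
    """Return the lowest-priced plan covering both quotas, or None if none does."""
    for name, _price, pdf_quota, ai_quota in PLANS:  # PLANS is sorted by price
        if pdfs_per_month <= pdf_quota and ai_calls_per_month <= ai_quota:
            return name
    return None


def price_per_thousand_pdfs(name: str) -> float:
    """Effective USD price per 1,000 PDFs when the plan's PDF quota is fully used."""
    for plan, price, pdf_quota, _ai_quota in PLANS:
        if plan == name:
            return 1000 * price / pdf_quota
    raise KeyError(name)


print(cheapest_plan(1_000, 50))
print(cheapest_plan(20_000, 100))
print(round(price_per_thousand_pdfs("Starter"), 2))
print(round(price_per_thousand_pdfs("Business"), 2))
```

Note that the AI-call quota can push you up a tier on its own: a workload with few PDFs but thousands of AI calls lands on Enterprise under these assumptions.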

What’s Next

  • More built-in templates based on user requests
  • Webhook notifications when PDF is ready
  • Batch generation — hundreds of PDFs in one call
  • Custom font uploads
  • PDF merging

Try It

The API is live and free to start:

Website: pdfgen-api.vercel.app
Docs: pdfgen-api.vercel.app/docs

Sign up, grab your API key, and generate your first PDF in under a minute.

I’d love feedback — especially on the API design and developer experience. What would you build with it?

Built solo with Next.js, Supabase, AWS Bedrock, and too much coffee.