Private & Powerful: Parsing Sensitive Medical Records Locally with WebLLM and WebGPU

Handling sensitive data like Electronic Health Records (EHR) is a nightmare for privacy compliance. Whether it’s HIPAA in the US or GDPR in Europe, sending a patient’s medical history to a cloud-based LLM often triggers a cascade of security audits and potential liabilities.

But what if the data never left the user’s computer?

In this tutorial, we are diving deep into Edge AI and Privacy-preserving AI by building a local EHR parser. Using WebLLM, WebGPU acceleration, and React, we will transform raw medical text into structured JSON entirely within the browser sandbox. No servers, no APIs, and zero data leakage.

The Architecture: Why WebLLM?

Traditionally, running LLMs locally meant installing a heavy native stack outside the browser (Ollama, LocalAI). With the advent of WebGPU, the browser can now tap the local GPU directly. WebLLM (built on the MLC LLM compiler stack and its TVM-based WebGPU runtime) lets us run models like Llama 3 or Mistral entirely in the browser’s memory.

Data Flow Overview

graph TD
    A[User: Upload Medical PDF/Text] --> B[Browser Sandbox]
    B --> C{WebGPU Available?}
    C -- Yes --> D[Initialize WebLLM Engine]
    C -- No --> E[Fallback: CPU/Wasm]
    D --> F[Load Quantized Model - e.g., Llama-3-8B-Instruct-q4f16_1]
    F --> G[Process EHR Text via Prompt Template]
    G --> H[Output Structured JSON]
    H --> I[React UI Display]
    subgraph Privacy Zone
    B
    D
    G
    end

Prerequisites

To follow along, ensure you have:

  • A browser with WebGPU support (Chrome or Edge 113+). A quick feature check follows below.
  • Node.js and a React environment.
  • The core packages: @mlc-ai/web-llm, react, and pdfjs-dist (for PDF text extraction).
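
Before downloading multi-gigabyte weights, it’s worth probing for WebGPU so you can route users down the fallback branch from the diagram. A minimal check via the standard navigator.gpu entry point (the cast is only needed if you haven’t installed @webgpu/types):

// webgpuCheck.ts
export async function hasWebGPU(): Promise<boolean> {
  // navigator.gpu is undefined in browsers without WebGPU
  const gpu = (navigator as any).gpu;
  if (!gpu) return false;
  // An adapter can still be null (e.g., blocklisted drivers)
  const adapter = await gpu.requestAdapter();
  return adapter !== null;
}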

Step 1: Setting Up the WebLLM Engine

First, we need to initialize the engine. This is the “brain” of the app. The hook below loads it on the main thread; WebLLM also ships a web-worker variant (sketched after the hook) so inference doesn’t block your UI.

// useWebLLM.ts
import { useState } from 'react';
import * as webllm from "@mlc-ai/web-llm";

export function useWebLLM() {
  const [engine, setEngine] = useState<webllm.MLCEngine | null>(null);
  const [progress, setProgress] = useState(0);

  const initEngine = async () => {
    // Model ID from WebLLM's prebuilt list; 4-bit quantized to fit browser VRAM
    const modelId = "Llama-3-8B-Instruct-q4f16_1-MLC";

    const loadedEngine = await webllm.CreateMLCEngine(modelId, {
      initProgressCallback: (report) => {
        // report.progress goes 0..1 while weights download and shaders compile
        setProgress(Math.round(report.progress * 100));
        console.log(report.text);
      },
    });

    setEngine(loadedEngine);
  };

  return { engine, progress, initEngine };
}
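
For production you’ll likely want inference off the main thread entirely. Recent @mlc-ai/web-llm releases expose a worker-based engine with the same API surface; a minimal sketch, assuming the CreateWebWorkerMLCEngine / WebWorkerMLCEngineHandler pair from current versions:

// worker.ts (runs inside the Web Worker)
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => handler.onmessage(msg);

// main thread: same chat.completions API as CreateMLCEngine
import * as webllm from "@mlc-ai/web-llm";

const engine = await webllm.CreateWebWorkerMLCEngine(
  new Worker(new URL("./worker.ts", import.meta.url), { type: "module" }),
  "Llama-3-8B-Instruct-q4f16_1-MLC",
  { initProgressCallback: (r) => console.log(r.text) }
);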

Step 2: Extracting Text and Prompt Engineering

Medical records are messy. We need to feed the LLM a clean prompt so it reliably returns valid JSON. This matters doubly for Edge AI: prompt tokens are “free” (no API cost), but the context window is bounded by local VRAM, so keep the template lean.

const EHR_PROMPT_TEMPLATE = (rawText: string) => `
  You are a medical data extraction assistant. 
  Extract the following fields from the medical record provided:
  - Patient Name
  - Primary Diagnosis
  - Prescribed Medications (List)
  - Recommended Follow-up

  Format the output strictly as JSON.

  Record:
  """
  ${rawText}
  """
`;

export const parseMedicalRecord = async (
  engine: webllm.MLCEngine,
  text: string
) => {
  const messages = [
    { role: "system" as const, content: "You are a helpful assistant that outputs only JSON." },
    { role: "user" as const, content: EHR_PROMPT_TEMPLATE(text) }
  ];

  const reply = await engine.chat.completions.create({
    messages,
    temperature: 0.0, // Deterministic output for extraction tasks
  });

  // Models sometimes wrap JSON in prose or markdown fences; grab the first {...} block
  const raw = reply.choices[0].message.content ?? "";
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) throw new Error("Model did not return JSON");
  return JSON.parse(match[0]);
};
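
Recent WebLLM versions also support OpenAI-style JSON mode, which constrains decoding to valid JSON instead of relying on the prompt alone. Treat the option below as version-dependent and check your release notes before swapping it into the create call:

const reply = await engine.chat.completions.create({
  messages,
  temperature: 0.0,
  // Ask the engine to emit only syntactically valid JSON (if supported)
  response_format: { type: "json_object" },
});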

Step 3: The React UI

We want a clean interface where users can paste text (or upload a document; see the PDF helper after the component) and see at a glance that everything is processed locally.

import React, { useState } from 'react';
import { useWebLLM } from './hooks/useWebLLM';
// The Step 2 parser (module name assumed; adjust to your file layout)
import { parseMedicalRecord } from './parseMedicalRecord';

const EHRParser = () => {
  const { engine, progress, initEngine } = useWebLLM();
  const [input, setInput] = useState("");
  const [result, setResult] = useState<object | null>(null);

  return (
    <div className="p-8 max-w-2xl mx-auto">
      <h2 className="text-2xl font-bold mb-4">Local EHR Parser 🩺</h2>

      {!engine ? (
        <button 
          onClick={initEngine}
          className="bg-blue-600 text-white px-4 py-2 rounded"
        >
          Load Local AI Model ({progress}%)
        </button>
      ) : (
        <div className="space-y-4">
          <textarea 
            className="w-full h-40 border p-2"
            placeholder="Paste medical notes here..."
            onChange={(e) => setInput(e.target.value)}
          />
          <button 
            onClick={async () => {
              const data = await parseMedicalRecord(engine, input);
              setResult(data);
            }}
            className="bg-green-600 text-white px-4 py-2 rounded"
          >
            Parse Locally
          </button>
        </div>
      )}

      {result && (
        <pre className="mt-8 bg-gray-100 p-4 rounded text-sm">
          {JSON.stringify(result, null, 2)}
        </pre>
      )}
    </div>
  );
};

export default EHRParser;
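
For the upload path, pdfjs-dist can extract text entirely client-side, so PDFs never leave the browser either. A rough sketch; the workerSrc wiring is bundler-specific (a Vite-style URL is assumed here):

// pdfText.ts
import * as pdfjsLib from "pdfjs-dist";

// Bundler-specific: point pdf.js at its worker script (Vite-style URL assumed)
pdfjsLib.GlobalWorkerOptions.workerSrc = new URL(
  "pdfjs-dist/build/pdf.worker.min.mjs",
  import.meta.url
).toString();

export async function extractPdfText(file: File): Promise<string> {
  const data = await file.arrayBuffer();
  const pdf = await pdfjsLib.getDocument({ data }).promise;

  const pages: string[] = [];
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const content = await page.getTextContent();
    // Each item is a positioned text fragment; join them naively
    pages.push(content.items.map((it: any) => it.str ?? "").join(" "));
  }
  return pages.join("\n");
}

Wire this to an <input type="file"> handler and pipe the result into setInput, and the same parse button covers both flows.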

The “Official” Way: Leveling Up Your AI Architecture

While running LLMs in the browser is a game-changer for privacy, orchestrating these models in a production environment requires a deeper understanding of memory management and model sharding.

For more advanced patterns on Edge AI deployment, optimizing WebGPU kernels, and building production-ready Local-first AI applications, I highly recommend exploring the deep-dive articles at the WellAlly Tech Blog. It’s a goldmine for developers who want to move beyond “Hello World” and into scalable, high-performance engineering.

Why This Matters

  1. No Network Latency: Once the model is loaded (weights are cached in browser storage, e.g. the Cache API/IndexedDB), inference never makes a network round-trip.
  2. Cost Efficiency: You aren’t paying $0.01 per 1k tokens to OpenAI. The user provides the compute.
  3. Ultimate Privacy: In the context of EHR, this is the gold standard. The data never exists on a server disk or in a log file.
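
You can even detect the warm-cache case up front. WebLLM exports a cache probe (hasModelInCache in recent versions; verify against your release) that distinguishes a first visit’s multi-gigabyte download from an instant reload:

import { hasModelInCache } from "@mlc-ai/web-llm";

const modelId = "Llama-3-8B-Instruct-q4f16_1-MLC";
// true if the weights are already in browser storage, so init will be fast
const cached = await hasModelInCache(modelId);
console.log(cached ? "Instant reload" : "First visit: large download ahead");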

Challenges to Consider

  • Initial Load: The first time a user visits, they might need to download 2-5GB of model weights.
  • VRAM Constraints: Low-end devices might struggle with Llama-3-8B. Always provide a “Small Model” fallback like Phi-3 or TinyLlama (a selection heuristic is sketched below).
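
One pragmatic (and admittedly crude) heuristic: use the Chrome-only navigator.deviceMemory hint to pick a model tier. The model IDs below come from WebLLM’s prebuilt list at the time of writing; verify them against your version:

// modelSelect.ts (heuristic sketch, not an official WebLLM API)
const LARGE_MODEL = "Llama-3-8B-Instruct-q4f16_1-MLC";
const SMALL_MODEL = "Phi-3-mini-4k-instruct-q4f16_1-MLC";

export function pickModel(): string {
  // navigator.deviceMemory reports approximate RAM in GB (Chrome-only);
  // fall back to assuming a low-end device where it's unavailable
  const memGb = (navigator as any).deviceMemory ?? 4;
  return memGb >= 8 ? LARGE_MODEL : SMALL_MODEL;
}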

Conclusion

The web is no longer just for displaying data; it’s for processing it intelligently. By combining WebLLM and WebGPU, we can build tools that respect user privacy while offering the power of modern Generative AI.

What are you building with Edge AI? Let me know in the comments! 👇