FairLens AI: An Intelligent Dashboard for Automated Bias Auditing

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

FairLens AI is a premium, high-end SaaS platform designed for AI-powered bias auditing. I built this tool to help data scientists and researchers easily identify, quantify, and mitigate hidden biases within their datasets before those datasets are used to train machine learning models.

My vision as a developer has always been to create meaningful impact in society through technology. Monitoring and detection systems are crucial for accountability in tech, and I realized that while many people talk about AI fairness, there are very few accessible, beautifully designed tools to actually measure it. FairLens AI bridges that gap. By simply uploading a CSV dataset, users receive instant insights into fairness metrics across protected attributes, visualized through an interactive, glassmorphism-styled dashboard. It calculates complex metrics like Demographic Parity Ratio and Disparate Impact, assigns an overall fairness score, and provides actionable mitigation recommendations.
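
To make the two metrics named above concrete, here is a minimal Python sketch of how they are commonly defined. It is an illustration only, not the code FairLens runs: the column names (`gender`, `hired`), the toy rows, and the helper names are all made up, and the 0.8 threshold mentioned in the comment is the widely cited "four-fifths" convention rather than anything taken from the project.

```python
from collections import defaultdict


def selection_rates(rows, group_key, outcome_key):
    """Return the share of favourable outcomes (outcome == 1) per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        positives[group] += int(row[outcome_key])
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


def disparate_impact(rates, unprivileged, privileged):
    """Selection rate of the unprivileged group relative to the privileged one."""
    return rates[unprivileged] / rates[privileged]


# Hypothetical toy data: 1 = favourable outcome.
rows = [
    {"gender": "F", "hired": 1}, {"gender": "F", "hired": 0},
    {"gender": "F", "hired": 0}, {"gender": "F", "hired": 0},
    {"gender": "M", "hired": 1}, {"gender": "M", "hired": 1},
    {"gender": "M", "hired": 0}, {"gender": "M", "hired": 0},
]

rates = selection_rates(rows, "gender", "hired")
print(rates)                              # {'F': 0.25, 'M': 0.5}
print(demographic_parity_ratio(rates))    # 0.5
print(disparate_impact(rates, "F", "M"))  # 0.5, below the common 0.8 threshold
```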

Demo

Live Project Link: FairLens AI Platform
GitHub Repository: bibhupradhanofficial/fairlens-ai

Video Demo:

Screenshots:
Fairness score:

AI-generated executive summary and intersectional analysis:

The Comeback Story

This project originally started as an ambitious idea for a data visualization dashboard, but I hit a massive roadblock when it came to the actual data science and backend engineering. The Finish-Up-A-Thon gave me the exact push I needed to rethink my architecture and finally complete it.

Where the project was before:
Previously, FairLens AI was essentially a beautiful, static mockup. I had built out the frontend architecture using React 18, Vite, and Tailwind CSS, and perfected the UI using Framer Motion and Recharts to give it a premium feel. However, the project stalled completely at the backend. Writing a manual, hardcoded statistical engine capable of parsing diverse datasets, calculating edge cases for Disparate Impact, and figuring out “feature importance” was overwhelming. The dashboard was full of dummy data, and the repository sat untouched.

What I added and fixed to finish it up (The “After”):
To bring the project across the finish line, I completely abandoned the idea of hardcoding the statistical logic and pivoted to an AI-agentic architecture. I added the following major features:

  • Supabase Edge Functions: I implemented a robust, serverless backend using Deno (audit-bias/index.ts) to securely handle the dataset statistics over an API without bogging down the client.
  • Google Gemini 3 Integration: I connected the Edge Function to the Google Gemini 3 Flash Preview model via an AI gateway. I engineered a highly specific system prompt that feeds the CSV cross-tabulations to the LLM and forces it to act as a “Fairness Expert.”
  • Structured JSON Insights: Instead of returning plain text, I configured the AI to return strictly typed JSON tool calls containing the exact fairness metrics, an overall 0-100 fairness score, and concrete mitigation steps.
  • Dynamic Frontend Wiring: I updated the AuditDashboard to dynamically map this live AI data into my Recharts visualizations and metric gauges, turning the UI into a fully functional, intelligent auditing tool.
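
The project itself is written in TypeScript, but the idea of treating the model's structured output as untrusted input can be sketched in a few lines of Python. The field names (`fairness_score`, `metrics`, `recommendations`) and the sample payload are assumptions for illustration; the real schema may differ.

```python
import json
from dataclasses import dataclass


@dataclass
class AuditResult:
    fairness_score: float
    metrics: dict
    recommendations: list


def parse_audit_result(raw_json: str) -> AuditResult:
    """Parse the model's JSON output and reject anything off-schema."""
    data = json.loads(raw_json)

    score = data.get("fairness_score")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("fairness_score must be a number")
    if not 0 <= score <= 100:
        raise ValueError("fairness_score must be between 0 and 100")

    metrics = data.get("metrics")
    if not isinstance(metrics, dict):
        raise ValueError("metrics must be an object")

    recommendations = data.get("recommendations")
    if not isinstance(recommendations, list):
        raise ValueError("recommendations must be a list")

    return AuditResult(score, metrics, recommendations)


# Hypothetical payload shaped like the description above.
sample = (
    '{"fairness_score": 62, '
    '"metrics": {"demographic_parity_ratio": 0.71}, '
    '"recommendations": ["Rebalance the training sample"]}'
)
result = parse_audit_result(sample)
print(result.fairness_score)  # 62
```

Validating before rendering means a malformed or out-of-range response fails loudly instead of silently producing a misleading gauge.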

My Experience with GitHub Copilot

GitHub Copilot was an absolute game-changer for pushing this project to completion, particularly when navigating the complex typing requirements between the frontend and the Supabase Edge Functions.

  • Type Safety & Boilerplate: Copilot anticipated the Zod schemas and TypeScript interfaces required for my AuditResult objects, saving me hours of manual typing.
  • Component Generation: When building the AuditDashboard.tsx and the MetricGauge components, Copilot suggested the repetitive Tailwind classes needed for the glassmorphism effects and conditional rendering (e.g., automatically suggesting the success/warning/destructive color mappings based on the metric status).
  • Data Parsing: Copilot was incredibly helpful in suggesting the logic for processing the CSV outputs and formatting the cross-tabulations accurately before sending them off to the Edge Function payload.
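
For readers who have not built one before, a cross-tabulation is just a count of outcomes per group. The sketch below shows the idea in Python using only the standard library; the column names and the sample CSV are hypothetical, and the project's actual parsing code is not shown in this post.

```python
import csv
import io
from collections import Counter


def cross_tabulate(csv_text, attribute, outcome):
    """Count how often each outcome value occurs for each attribute value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter((row[attribute], row[outcome]) for row in reader)
    table = {}
    for (group, result), n in counts.items():
        table.setdefault(group, {})[result] = n
    return table


# Hypothetical four-row dataset.
sample = "gender,approved\nF,yes\nF,no\nM,yes\nM,yes\n"
print(cross_tabulate(sample, "gender", "approved"))
# {'F': {'yes': 1, 'no': 1}, 'M': {'yes': 2}}
```

Sending a compact table like this, rather than the raw rows, keeps the payload small regardless of dataset size.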

It acted as a constant pair programmer, allowing me to focus on the high-level architecture and the user experience rather than getting bogged down in syntax.


I made git merge finish itself — in VS Code, in my terminal, and in CI


# Merge Magic: Resolving the Merge Conflicts That Shouldn’t Need a Human

I built **Merge Magic** because I got tired of resolving the same merge-conflict pattern over and over again.

Same conflict shape.  
Same “keep both” outcome.  
Same wasted time every time I rebased onto `main`.

At first, it was a small utility to remove that friction. Over a few weeks, it became something I now use daily.

Merge Magic automatically resolves merge conflicts that are clearly additive, while surfacing the ones that actually require human judgment.

It is free, bring-your-own-AI, and it never auto-commits.

You stay in control.

---

## What Merge Magic Does

Merge Magic is designed around a simple idea:

> Most merge conflicts are not real disagreements.  
> They are just two useful changes landing in the same place.

For example:

```js
<<<<<<< HEAD
export function getUser(id) {
  console.log('[users] fetch', id);
  return db.users.findById(id);
}
=======
export function getUser(id) {
  if (!id) throw new Error('id required');
  return db.users.findById(id);
}
>>>>>>> feature/validation
```

One branch added logging.
Another added validation.

The correct resolution is obvious: keep both.

A developer can resolve this in 30 seconds. But multiplied across every rebase, every PR, and every team member, those 30 seconds become a tax.

Merge Magic removes that tax where it can — and refuses to guess where it should not.

How It Works

Merge Magic resolves conflicts in three layers.

1. Mechanical Pre-Pass

Some conflicts can be resolved safely from text alone.

Examples include:

  • Identical edits on both sides
  • One-sided edits where the other side matches the base
  • Clearly additive changes in different parts of the same region

These require no AI call.

They are resolved instantly because the answer is structurally obvious.
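
The first two rules above are the classic three-way merge cases. Here is a minimal Python sketch of them, under the assumption that each conflict region is available as three strings (base, ours, theirs). The function name is hypothetical and this is not Merge Magic's actual code; the third rule (additive changes in different parts of a region) needs a line-level diff and is left out.

```python
from typing import Optional


def resolve_mechanically(base: str, ours: str, theirs: str) -> Optional[str]:
    """Return the resolved text, or None when the region needs more context."""
    if ours == theirs:
        return ours      # identical edits on both sides
    if ours == base:
        return theirs    # only their side changed
    if theirs == base:
        return ours      # only our side changed
    return None          # both sides changed differently: defer


print(resolve_mechanically("a = 1", "a = 2", "a = 2"))  # a = 2
print(resolve_mechanically("a = 1", "a = 1", "a = 3"))  # a = 3
print(resolve_mechanically("a = 1", "a = 2", "a = 3"))  # None
```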

2. AI-Assisted Resolution

For conflicts that need more context, Merge Magic dispatches the conflict to whichever AI tool you already use.

Supported backends include:

  • VS Code language model API
  • Copilot
  • Claude Code CLI
  • Ollama
  • Anthropic
  • OpenAI
  • Gemini

There is no forced subscription layer.

You bring the model. Merge Magic brings the workflow.

3. Verification Floor

Every auto-resolved file is checked against build diagnostics.

Merge Magic captures the baseline error set first, then checks the merged result.

If the resolution introduces new errors, it reverts the file back to conflict markers and shows the actual diagnostic.

Pre-existing errors do not cause false failures.

This is the safety floor.
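
The baseline comparison boils down to a set difference. The Python sketch below illustrates that idea only; representing diagnostics as plain strings and the function name are assumptions, not Merge Magic's real data model.

```python
def verify_resolution(baseline_errors, merged_errors):
    """Accept a resolution only if it introduces no diagnostics beyond the baseline."""
    new_errors = sorted(set(merged_errors) - set(baseline_errors))
    return len(new_errors) == 0, new_errors


# Hypothetical diagnostics captured before and after the merge.
baseline = ["utils.ts:10 unused variable 'x'"]
after_merge = [
    "utils.ts:10 unused variable 'x'",
    "users.ts:4 cannot find name 'db'",
]

ok, introduced = verify_resolution(baseline, after_merge)
print(ok)          # False
print(introduced)  # ["users.ts:4 cannot find name 'db'"]
```

Because only the difference is checked, the pre-existing warning in `utils.ts` does not cause a failure, while the newly introduced error would trigger a revert.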

The Line I Refused to Cross

The most important design decision was not what Merge Magic resolves.

It was what it refuses to resolve.

When two branches genuinely disagree, Merge Magic does not guess.

For example:

  • Both branches rename a function differently
  • One branch deletes code while another modifies it
  • Both branches change the same constant to different values
  • Two changes appear semantically incompatible

In those cases, Merge Magic opens a decision card with context:

This conflict is between two commits:

🔴 HEAD        a1b2c3d   perf: bigger page size, shorter session timeout
                          Alice Chen · 2 days ago

🟢 MERGE_HEAD  9b79e0a   scale: max page size, longer session for enterprise
                          Bob Kumar · 1 day ago

The goal is not to hide hard decisions.

The goal is to make the easy ones disappear and make the hard ones clearer.

Three Surfaces, One Engine

Merge Magic runs in three places.

VS Code

A VS Code extension with an auto-mode dashboard.

When git merge produces conflicts, files resolve in parallel with a live progress view.

Terminal

A CLI that can register as a global Git merge driver:

```bash
npm install -g merge-magic
mergemagic setup
echo "* merge=mergemagic" >> .gitattributes
```

After that, git merge and git rebase can invoke the resolver inline.

This is especially useful for recurring conflicts during rebase replay.

CI

A GitHub Action can resolve PR conflicts server-side before a human review:

```yaml
- run: npm install -g merge-magic

- env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: mergemagic ci --base "${{ github.base_ref }}"
```

The CI check posts a Markdown report to the Actions summary.

If a conflict requires a real human decision, the check fails loudly.

It does not silently pick a side.

What I Deliberately Do Not Claim

A lot of AI developer tools overclaim.

I tried hard not to.

It is not an AST merge

The mechanical pre-pass is a careful three-way line merge.

It is not tree-sitter.
It is not structural merging.
It is not a true AST-aware resolver.

That is a harder problem, and it is still on the roadmap.

Semantic warnings are heuristics

When the model says two changes may interact, that is not static analysis.

It is a heuristic.

Useful, but not authoritative.

The benchmark is not “better than Copilot”

Copilot’s smart-action resolver is not scriptable, so a clean automated head-to-head benchmark is not really possible.

Merge Magic’s benchmark reports match rate against known human resolutions on a corpus you provide.

That is useful.

It is also honest.

Why This Matters

The promise of AI in developer workflows should not be “trust the model blindly.”

It should be:

Remove the repetitive work.
Preserve human judgment where it matters.
Make the review surface clearer.

That is the philosophy behind Merge Magic.

It is not trying to replace code review.

It is trying to remove the part of conflict resolution that developers already know is mechanical.

Try It

VS Code

Search for Merge Magic in the Extensions Marketplace.

Or install it here:

Merge Magic on VS Code Marketplace

Terminal

```bash
npm install -g merge-magic
mergemagic demo
```

The demo runs in a temporary repository and will not touch your code.

Feedback I’m Looking For

I would especially value feedback on three things:

  1. Does the verification floor catch enough?
    It catches type and lint failures, but not behavioral regressions.

  2. Is the mechanical pre-pass too conservative?
    It currently defers anything ambiguous, even when an LLM might handle it well.

  3. Is the CI mode too aggressive or too cautious?
    Auto-commit is opt-in, and Merge Magic refuses to push directly to main.

The most useful feedback is where it gets the resolution wrong.

Those cases are what improve the resolver, the prompt, and the pre-pass.

Merge conflicts are not going away.

But the boring ones should.

AI Metrics Decoded: From Parameters to TOPS

AI Metrics Decoded: The Numbers That Actually Matter in Production

Why You Need to Know This (Before Your First Production Incident)

Picture this: your team picks a 70B parameter model for a new feature. It runs great on your MacBook. You push to production. The GPU bill arrives. Your manager is not happy.

Or this: your AI API costs explode halfway through the month and nobody knows why.

These are not horror stories. They happen to real engineers — usually the ones who skipped learning the core units of measurement behind AI systems.

As a junior engineer, you’re going to face questions like:

  • “Can our GPU handle this model?”
  • “Why is the response so slow?”
  • “How many tokens are we burning per user per day?”
  • “Should we use a 7B or 70B model for this use case?”

Understanding the eight core metrics below gives you the language, and the instincts, to answer confidently.

Let’s break them down.

🧠 Category 1: Model Size — Parameters & Tokens

Parameters

What it is: The learned weights inside a neural network. Think of them as the “memory” of the model — numbers that get adjusted during training to capture patterns in data.

The unit: Just a raw count. We usually express it in:

  • M = millions (e.g., BERT = 110M)
  • B = billions (e.g., LLaMA 3 8B)
  • T = trillions (e.g., GPT-4, estimated at ~1.8T)

Why it matters to you:

| Parameter Count | Approx. VRAM Needed (fp16) | Typical Use Case |
|---|---|---|
| 1B–3B | ~2–6 GB | Mobile / edge apps |
| 7B–8B | ~16 GB | Single consumer GPU |
| 13B–14B | ~28 GB | Single pro GPU (A100 40GB) |
| 70B | ~140 GB | Multi-GPU setup |
| 405B+ | ~800 GB+ | Cluster of H100s |

Rule of thumb: 1 billion parameters ≈ 2 GB of VRAM in half-precision (fp16). Double it for full precision (fp32).
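
The rule of thumb is simply parameter count times bytes per parameter. Here is a small sketch of that arithmetic; the helper name is made up, and the estimate covers model weights only. Real deployments need extra headroom for activations and the KV cache, which this deliberately ignores.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billions, precision="fp16"):
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


print(weight_memory_gb(8))           # 16
print(weight_memory_gb(70))          # 140
print(weight_memory_gb(8, "fp32"))   # 32
print(weight_memory_gb(70, "int4"))  # 35.0
```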

More parameters usually means a more capable model, and it always means a more expensive one to run.

Tokens

What it is: The unit of text that a model reads and generates. Not words — fragments.

Quick visual:

Input text:  "Learning AI is fun!"
             ↓ Tokenizer
Tokens:      ["Learn"] ["ing"] [" AI"] [" is"] [" fun"] ["!"]
Token count: 6 tokens

Why it matters to you:

  • API cost is billed per token (input + output separately).
  • Context window is measured in tokens — the model can only “see” so much at once.
  • Speed (TPS, covered below) is measured in tokens per second.

# Quick check: how many tokens is your prompt?
# Using tiktoken (OpenAI's tokenizer, also used by many OSS models)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Learning AI is fun!"
tokens = enc.encode(text)

print(f"Token count: {len(tokens)}")   # a handful of tokens; the exact count depends on the encoding
print(f"Tokens: {tokens}")             # a list of integer token IDs

Quick cheat sheet:

  • 1 token ≈ 0.75 English words
  • 1,000 tokens ≈ 750 words ≈ ~1.5 pages
  • Non-English text (Hindi, Mandarin, Arabic) uses 30–70% more tokens for the same content

⚡ Category 2: Hardware Power — FLOPS vs. TOPS

This is where a lot of junior engineers get confused. FLOPS and TOPS sound similar. They are not the same thing.

FLOPS (Floating Point Operations Per Second)

What it is: A measure of raw compute power for floating point arithmetic — the kind of math needed for training and running neural networks.

The scale:

| Unit | Value | Context |
|---|---|---|
| GFLOPS | 10⁹ FLOPS | Your laptop GPU |
| TFLOPS | 10¹² FLOPS | Cloud GPUs (A100: ~312 TFLOPS) |
| PFLOPS | 10¹⁵ FLOPS | Entire GPU clusters |

Used for: Server-scale training and inference. When someone says “the H100 delivers 989 TFLOPS of FP16 performance”, this is what they mean.

Common GPUs you’ll actually use:

| GPU | FP16 TFLOPS | Best For |
|---|---|---|
| RTX 4090 | ~165 | Local dev / fine-tuning |
| A100 40GB | ~312 | Production inference |
| H100 SXM | ~989 | Large-scale training |

TOPS (Tera Operations Per Second)

What it is: Similar idea, but used for integer or mixed-precision operations on edge hardware and NPUs (Neural Processing Units).

The key difference:

FLOPS  →  Floating point math  →  GPUs / server chips  →  Training & inference at scale
TOPS   →  Integer / INT8 math  →  NPUs / edge chips    →  On-device inference

Real-world examples:

| Device | TOPS | Use Case |
|---|---|---|
| Apple M4 Neural Engine | ~38 TOPS | On-device ML on MacBook |
| Qualcomm Snapdragon X Elite | ~45 TOPS | AI PCs / laptops |
| NVIDIA Jetson Orin | ~275 TOPS | Edge AI / robotics |
| Google TPU v5e | ~393 TOPS | Cloud inference at scale |

When do you care about TOPS? When you’re deploying a model to a phone, a laptop, or an embedded device — not a data centre. If you’re picking a chip for on-device inference, TOPS is your number.

🏋️ Category 3: Training Cost — FLOPs (Cumulative)

Yes, confusingly, FLOPs (with a lowercase “s” and no “per second”) is a different metric from FLOPS.

What it is: The total number of floating point operations performed during an entire training run. It’s a measure of compute budget, not hardware speed.

The unit: Usually expressed as:

  • PetaFLOPs (10¹⁵ operations)
  • Or PetaFLOP/s-days: the work done by hardware sustaining 1 PFLOPS for a full day (about 8.64 × 10¹⁹ operations)

Real-world examples:

| Model | Estimated Training FLOPs |
|---|---|
| GPT-3 (175B) | ~3.14 × 10²³ |
| LLaMA 2 70B | ~8.4 × 10²³ (6 × 70B parameters × 2T tokens) |
| Gemini Ultra | ~5 × 10²⁴ (estimated) |
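
Figures like these are usually back-of-the-envelope estimates. A common approximation for dense transformer models is total training FLOPs ≈ 6 × parameters × training tokens. The sketch below applies it to GPT-3, assuming the publicly reported 300 billion training tokens; treat the result as an order-of-magnitude estimate, not an exact accounting.

```python
def training_flops(params, tokens):
    """Approximate total training compute: 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def to_pflops_days(flops):
    """Convert total FLOPs to PetaFLOP/s-days (1e15 FLOPs/s sustained for 86,400 s)."""
    return flops / (1e15 * 86_400)


gpt3 = training_flops(175e9, 300e9)
print(f"{gpt3:.2e}")                  # 3.15e+23
print(f"{to_pflops_days(gpt3):.0f}")  # 3646
```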

Why it matters to you: Directly as a junior engineer, probably not yet. But understanding it helps you reason about:

  • Why training a model from scratch is prohibitively expensive
  • Why fine-tuning (starting from a pre-trained model) is so much cheaper
  • Why companies like Anthropic and OpenAI have massive infrastructure teams

Quick analogy: FLOPS (the hardware rate) is your car’s horsepower. FLOPs (training cost) is the total miles driven on a road trip. One is speed, one is distance.

🚀 Category 4: Speed & Latency — TTFT, TPS, TPM

These three are the metrics you’ll track the most in production. They live in your dashboards, your SLAs, and your post-mortems.

TTFT — Time To First Token

What it is: How long (in milliseconds) from sending your request to receiving the first token of the response.

Why it matters: This is what determines if your app feels fast. Even if the full response takes 10 seconds, a 200ms TTFT makes the experience feel responsive. It’s the AI equivalent of “First Contentful Paint” in web dev.

User sends prompt
        ↓
  [ ... processing ... ]   ← this duration is TTFT
        ↓
First token arrives → streaming begins → user sees output

Good TTFT benchmarks:

| Scenario | Target TTFT |
|---|---|
| Real-time chat | < 300ms |
| Interactive coding assistant | < 500ms |
| Background document processing | < 2,000ms (acceptable) |

TPS — Tokens Per Second

What it is: How many tokens the model generates per second during the response. Also called generation speed or throughput.

Why it matters: TPS determines whether your streaming response feels smooth or painfully slow.

  • A human reads at roughly 3–5 tokens per second comfortably.
  • Models generating at < 10 TPS feel sluggish.
  • Modern API servers target 50–150+ TPS for good UX.

What affects TPS:

  • Model size (bigger = slower per request)
  • Hardware (H100 >> A100 >> consumer GPU)
  • Batch size (serving multiple requests simultaneously reduces per-request TPS)
  • Quantization (INT4/INT8 models run faster, with a small accuracy tradeoff)
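
Both TTFT and TPS can be measured from any streaming response by timestamping the first token and counting the rest. The sketch below shows one way to do that; `fake_token_stream` is a made-up stand-in for a real streaming API call, so the numbers it produces only reflect the artificial delays passed in.

```python
import time


def fake_token_stream(n_tokens=20, first_delay=0.05, per_token_delay=0.005):
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_delay)           # simulated prompt processing
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_delay)
        yield f"tok{i}"


def measure_stream(stream):
    """Return (TTFT in ms, TPS measured after the first token)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")

    ttft_ms = (first_token_at - start) * 1000
    generation_time = end - first_token_at
    tps = (count - 1) / generation_time if count > 1 and generation_time > 0 else 0.0
    return ttft_ms, tps


ttft_ms, tps = measure_stream(fake_token_stream())
print(f"TTFT: {ttft_ms:.0f} ms, TPS: {tps:.0f}")
```

Measuring TPS from the first token onward keeps the two numbers independent: a slow start shows up in TTFT only, not as a lower generation speed.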

TPM — Tokens Per Minute

What it is: Your rate limit from the API provider. The maximum number of tokens your account can process per minute.

Why it matters: Hit your TPM limit and your requests start getting throttled or rejected with 429 Too Many Requests. This is a very common production issue for junior engineers on their first real deployment.

# A common mistake: not accounting for TPM in batch jobs

prompts = load_10000_prompts()   # Each ~500 tokens

for prompt in prompts:
    response = call_llm_api(prompt)   # 🚨 You'll hit TPM limit fast
    process(response)

# Better approach: add rate limiting
import time

TPM_LIMIT = 40000   # tokens per minute (check your plan)
tokens_this_minute = 0
minute_start = time.time()

for prompt in prompts:
    estimated_tokens = len(prompt.split()) * 1.3   # rough estimate

    if tokens_this_minute + estimated_tokens > TPM_LIMIT:
        sleep_time = 60 - (time.time() - minute_start)
        if sleep_time > 0:
            time.sleep(sleep_time)
        tokens_this_minute = 0
        minute_start = time.time()

    response = call_llm_api(prompt)
    tokens_this_minute += estimated_tokens
    process(response)

🔧 Senior Engineer’s Note: How It All Connects

Let me show you a real decision you’ll face: “Should we use an 8B or 70B model?”

Here’s how the metrics interact:

                    8B Model          70B Model
─────────────────────────────────────────────────
Parameters          8 billion         70 billion
VRAM Required       ~16 GB            ~140 GB
GPU Setup           1× A100 40GB      4× A100 40GB
Est. TPS            ~80–120 TPS       ~15–30 TPS
TTFT (A100)         ~150ms            ~400ms
API Cost (est.)     ~$0.15/M tokens   ~$0.90/M tokens
Quality             Good              Excellent
─────────────────────────────────────────────────

The real-world math: Say your app handles 1,000 users/day, each generating ~2,000 tokens per session.

Daily tokens = 1,000 users × 2,000 tokens = 2,000,000 tokens

8B model cost:  2M tokens × $0.15 per 1M tokens = $0.30/day  → $9/month
70B model cost: 2M tokens × $0.90 per 1M tokens = $1.80/day  → $54/month

That’s a 6× cost difference. For a startup, that matters.
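
The same arithmetic is easy to wrap in a helper so you can plug in your own traffic numbers. The function name and the 30-day month are assumptions, and the prices are the illustrative estimates from the comparison above, not quotes from any provider.

```python
def monthly_cost(users_per_day, tokens_per_user, price_per_million, days=30):
    """Estimated monthly spend in dollars for a given per-million-token price."""
    daily_tokens = users_per_day * tokens_per_user
    return daily_tokens / 1_000_000 * price_per_million * days


small = monthly_cost(1_000, 2_000, 0.15)
large = monthly_cost(1_000, 2_000, 0.90)
print(f"8B:  ${small:.2f}/month")           # 8B:  $9.00/month
print(f"70B: ${large:.2f}/month")           # 70B: $54.00/month
print(f"Ratio: {large / small:.1f}x")       # Ratio: 6.0x
```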

The senior engineer’s question isn’t “which model is better?” It’s “which model is good enough for this use case at this scale?”

Start with the smaller model. Benchmark it against your quality requirements. Scale up only if you have to.

Quick Reference Cheat Sheet

| Metric | Full Name | Measures | Typical Unit |
|---|---|---|---|
| Parameters | - | Model size / capacity | M, B, T |
| Tokens | - | Text unit for I/O and cost | count |
| FLOPS | Floating Point Ops/sec | Hardware speed (server) | TFLOPS |
| TOPS | Tera Operations/sec | Hardware speed (edge/NPU) | TOPS |
| FLOPs | Floating Point Ops (total) | Training compute cost | PetaFLOPs |
| TTFT | Time To First Token | Latency / responsiveness | milliseconds |
| TPS | Tokens Per Second | Generation speed | tokens/sec |
| TPM | Tokens Per Minute | API rate limit | tokens/min |

Where to Go Next

You now have the vocabulary. Here’s how to build on it:

  • Experiment with tokenizers → platform.openai.com/tokenizer
  • Benchmark models on your hardware → try llama.cpp or Ollama locally
  • Track TTFT and TPS in your own apps → add timing logs around your API calls from day one
  • Read model cards → every major model release includes parameter count, training FLOPs, and benchmark scores. They’re not marketing fluff — they’re specs.

The engineers who understand these numbers don’t just write code. They make better architectural decisions, avoid expensive surprises, and earn trust faster.

That’s the real reason to care.

Got questions? Drop them in the comments.

Redis Essentials: Architecture, Caching, and Setup

Redis is often a misunderstood tool in the backend developer’s arsenal. While many view it simply as a “topic” to be covered in an hour, its role in modern system design is pivotal for building high-performance, scalable applications. This article explores what Redis is, why it is used, and how to set it up locally for development.

Understanding Redis: The In-Memory Powerhouse

At its core, Redis is an in-memory data store, often described as a “lightning-fast” hash map or key-value store. Unlike traditional databases such as MongoDB or PostgreSQL, which primarily store data on disk (SSD or HDD), Redis keeps its state in RAM (Random Access Memory).

Core Concept: The In-Memory Advantage

Because RAM access is significantly faster than reading from a mechanical or solid-state disk, Redis can serve reads and writes with very low latency.

Architecture: The Caching Layer

In a typical application, Redis acts as an intermediary between the backend application and the primary database. This setup creates two primary scenarios:

  1. Cache Hit: The backend finds the required data in Redis and returns it immediately to the user, bypassing the slower database.
  2. Cache Miss: If the data isn’t in Redis, the backend queries the primary database. It then stores a copy of this “hot record” in Redis for future requests before responding to the user.

This architecture dramatically reduces “read pressure” on the primary database, which should remain the “Source of Truth” for permanent records.
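
The hit/miss flow above is often called the cache-aside pattern. Here is a minimal Python sketch of it. To keep the example runnable without a server, `TinyCache` is a small in-memory stand-in for Redis's get and set-with-expiry behaviour, and `DATABASE`, `DB_CALLS`, and `get_record` are hypothetical names; a real application would call a Redis client and its primary database instead.

```python
import time


class TinyCache:
    """In-memory stand-in for Redis GET / SET with an expiry (illustration only)."""

    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired: behave as if the key never existed
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


DATABASE = {"user:123": {"name": "Ada"}}  # hypothetical source of truth
DB_CALLS = []                             # records every trip to the database


def query_database(key):
    DB_CALLS.append(key)
    return DATABASE.get(key)


def get_record(cache, key, ttl_seconds=90):
    cached = cache.get(key)
    if cached is not None:
        return cached, "hit"              # served from the cache
    record = query_database(key)          # cache miss: go to the database
    if record is not None:
        cache.set(key, record, ttl_seconds)
    return record, "miss"


cache = TinyCache()
print(get_record(cache, "user:123"))  # ({'name': 'Ada'}, 'miss')
print(get_record(cache, "user:123"))  # ({'name': 'Ada'}, 'hit')
print(len(DB_CALLS))                  # 1
```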

Key Features and Data Management

  • Persistence: Contrary to the myth that in-memory data is always lost on restart, Redis offers persistence features. It can write point-in-time snapshots (RDB) or an append-only log (AOF) to disk and load that data back into memory when the server restarts.
  • Key-Value Pairs: Redis stores data in simple pairs. Developers are encouraged to use human-readable, colon-separated keys (for example, user:session:123 or product:all) to avoid collisions and simplify debugging.
  • TTL (Time to Live): This is one of Redis’s most powerful features. You can set an expiration time on a key (for example, 90 seconds). Once the time expires, Redis automatically deletes the record, ensuring the memory remains uncluttered.

Advanced Use Cases

Beyond simple data caching, Redis is used for:

  • Session Management: Storing user login states (Active/Inactive) across multiple distributed servers.
  • OTP Management: Holding temporary One-Time Passwords for the few minutes they are valid.
  • Rate Limiting: Tracking IP addresses or user IDs to prevent abuse (for example, blocking a user for 10 minutes after too many failed login attempts).
  • Job Queues: Maintaining lists of background tasks. “Workers” (secondary backend applications) pull jobs from Redis to process time-consuming tasks like sending emails in batches.
  • Shared Counters: Tracking live metrics like page views or “likes” across various application instances.
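
As one example from the list, rate limiting is typically built from a counter that expires, which in Redis maps onto the INCR and EXPIRE commands. The sketch below imitates that behaviour with an in-memory dictionary so it runs without a server; the class name, the limits, and the key format are assumptions for illustration.

```python
import time


class FixedWindowLimiter:
    """Counter-with-expiry rate limiter, mimicking Redis INCR + EXPIRE in memory."""

    def __init__(self, max_attempts, window_seconds, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._clock = clock
        self._counters = {}  # key -> (count, window_expires_at)

    def allow(self, key):
        now = self._clock()
        count, expires_at = self._counters.get(key, (0, 0.0))
        if now >= expires_at:
            # Window expired (or first attempt): start a fresh counter.
            count, expires_at = 0, now + self.window_seconds
        count += 1
        self._counters[key] = (count, expires_at)
        return count <= self.max_attempts


# Hypothetical policy: 3 login attempts per 10 minutes per IP address.
limiter = FixedWindowLimiter(max_attempts=3, window_seconds=600)
results = [limiter.allow("login:203.0.113.7") for _ in range(4)]
print(results)  # [True, True, True, False]
```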


Strategic Local Setup

For development, a common approach is to use Docker and Docker Compose to spin up a local environment. A standard configuration involves:

  • Redis Image: Using redis:7-alpine for a lightweight footprint.
  • Port Mapping: Binding the default Redis port 6379 to the host machine.
  • Persistence Command: Running the server with --appendonly yes to ensure data is written to a log.

In a Node.js environment, ioredis is a widely used client library for talking to Redis. A basic connection is established by creating a new Redis client with the local URL redis://localhost:6379. Developers can test the connection using the PING command, which should return a PONG response from the server.

When to Use Redis

Redis is not a solution for every problem. Use it if your application needs to:

  • Remove read pressure from the primary DB.
  • Manage rapidly expiring temporary data.
  • Handle background job queues or shared counters.

However, it is not a replacement for a primary database.

When NOT to use Redis

Redis is not a “magic bullet”. It should not be used if you don’t have a clear bottleneck or if your data doesn’t fit the patterns described above. If you have a write-heavy application where data doesn’t need to be read frequently, or if you are trying to use it as a primary database for complex relational data, Redis may not be the right solution.

The Hidden Features of Claude

Most people use Claude like a chatbot.

Ask a question. Get an answer. Maybe generate some code or summarize a PDF.

That’s it.

But after spending more time with Claude, I realized most users are barely scratching the surface of what it can actually do.

Claude has quietly evolved into something much bigger than an AI assistant. Hidden behind the normal chat interface are features that can completely change how you work, research, write, code, and organize information.

What surprised me the most is how few people talk about these features.

Here are some of the hidden capabilities of Claude that genuinely changed the way I use AI.

One of the biggest hidden strengths of Claude is its massive context window.

Most AI tools struggle when conversations become too long or when you upload large amounts of information. Claude feels different. You can upload long research papers, entire codebases, huge PDFs, meeting transcripts, or documentation, and Claude can still maintain context surprisingly well.

This becomes incredibly useful for developers and researchers.

Instead of pasting small snippets of code, you can upload multiple files and ask Claude to explain the architecture of an entire project, identify bottlenecks, or suggest improvements. It feels less like autocomplete and more like collaborating with someone who actually understands the bigger picture.

The same thing applies to writing and research. Claude can compare ideas across large documents, summarize complex information, and help organize thoughts without constantly losing track of the conversation.

Once you experience this workflow, normal AI chats start feeling limited.

Another hidden feature that completely changes the experience is Artifacts.

Most people think AI outputs are supposed to be plain text.

Claude does something different.

Instead of simply generating code inside the chat, Claude can create interactive outputs like dashboards, mini web apps, UI layouts, diagrams, and editable documents. The first time I used Artifacts, it honestly felt like the line between AI chat and development environment started disappearing.

You can describe a landing page idea, and Claude generates a working interface. You can ask for a visualization, and it creates something interactive instead of dumping raw code into the conversation.

For frontend developers, designers, and creators, this is one of the most underrated AI features available right now.

Then there are Projects, which most casual users never even touch.

Projects basically turn Claude into a long-term workspace instead of a temporary conversation.

You can organize chats, upload files, add custom instructions, and maintain context around a specific goal or workflow. This becomes extremely powerful when working on something ongoing like a startup idea, a research topic, a coding project, or content creation.

Instead of re-explaining everything every time you open a new chat, Claude already understands the context of the project.

It sounds simple, but the productivity difference is huge.

The AI starts feeling less like a tool and more like an actual collaborator that understands what you’re trying to accomplish.

One of the most powerful but least understood parts of Claude is MCP, or Model Context Protocol.

A lot of people have never heard of it, but developers are starting to realize how important it is.

MCP allows Claude to connect with external tools, APIs, databases, local systems, and development environments. The easiest way to think about it is this:

Most AI systems can only talk.

MCP gives Claude the ability to interact with systems.

That changes everything.

Instead of just discussing workflows, Claude can become part of the workflow itself. It can retrieve information, work with connected tools, analyze external systems, and help automate complex tasks.

This is where AI starts moving beyond “assistant” territory and begins feeling more like an intelligent operating layer.

Another underrated capability is Connectors.

Most people still manually copy-paste information into AI chats. Claude can connect directly with platforms like GitHub, Google Drive, Slack, and other knowledge systems.

That means Claude can reason across your connected information instead of forcing you to constantly feed it context manually.

For example, imagine asking Claude to review documentation across multiple files, summarize GitHub issues, or identify inconsistencies in project notes.

That’s a very different experience from simply chatting with AI.

It becomes a true knowledge assistant.

Something else I noticed while using Claude is how natural its writing feels.

A lot of AI-generated content still sounds robotic or overly polished in a weird way. Claude tends to produce writing that flows more naturally, especially in long-form content.

It’s surprisingly good at:

  • restructuring articles
  • improving clarity
  • maintaining tone
  • brainstorming ideas
  • editing drafts
  • turning rough thoughts into structured writing

For writers and creators, this becomes incredibly useful because the interaction feels collaborative instead of mechanical.

You’re not just generating content.

You’re refining ideas in real time.

Claude is also exceptionally good at reasoning through complicated topics.

Instead of only giving fast answers, it performs well in deeper discussions involving:

  • architecture decisions
  • tradeoffs
  • planning
  • systems thinking
  • technical explanations
  • research analysis

One thing I’ve noticed is that Claude works best when treated like a thinking partner instead of a search engine.

The quality of the interaction changes dramatically when you ask it to:

  • compare ideas
  • challenge assumptions
  • explain reasoning
  • evaluate tradeoffs
  • simulate discussions

That’s when Claude starts showing its real strength.

The biggest realization for me was this:

Most people still think AI tools are chatbots.

But Claude increasingly feels like something else entirely.

It feels like:

  • a workspace
  • a research assistant
  • a coding partner
  • a writing collaborator
  • a reasoning engine
  • a productivity system

The hidden power of Claude isn’t one feature.

It’s the combination of all these capabilities working together:

  • large-context understanding
  • interactive artifacts
  • persistent projects
  • connectors
  • MCP integrations
  • deep reasoning
  • natural writing

Once you start using Claude this way, it stops feeling like a simple AI tool.

It starts feeling like a new way to work with information itself.

Everbench: A document management system with Local Intelligence

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Everbench

Everbench is a low-cost, efficient document research platform for those concerned about privacy.

I’ve been working on a project I call Everknown, intended as an open source Evernote replacement. Lately I’ve stalled on it, having discovered a commercial service that had most of what I wanted. But I have the bones of the app developed, so when I saw this challenge I decided to put together a small version of Everknown covering link/bookmark management and page summarization, two of the things Everknown was going to do for me.

Thus, the odd name Everbench. It’s a “workbench” for Everknown. It has a very simple architecture, but that’s good! Small, composable software that can be modified easily to fit needs.

Everbench conveniently captures web pages, efficiently converts them to Markdown for storage in an Obsidian Vault, creating a summary and tags for categorization. It uses an efficient HTML->MD conversion written in C with a Gemma 4 quality gate to check if the conversion was successful. Some pages can’t be converted (paywalls, login walls, empty SPA shells, mostly-navigation pages, etc.) but Gemma 4 can quickly determine that and characterize the failures. I’ve found that if the conversion fails, the page has serious problems.

Using a deterministic C parser (Gumbo) isn’t just about extraction quality; it’s also a small security boundary. The parser strips <script>, <style>, <noscript>, and CSS-hidden content before anything reaches Gemma 4, so the model never sees what the page is hiding from the user. Feeding raw HTML to an LLM is an open invitation for prompt injection via hidden divs, alt text, JavaScript-emitted content, or whatever the next clever trick happens to be. Prompt injection can be a significant challenge, but this stage of the pipeline is a natural place to insert heuristics that actively guard against it. Gumbo lets me reason about what crosses that boundary.

Demo

Here’s a link to the video walkthrough

Code

Everbench

How I Used Gemma 4

Gemma 4 is used for document summarization and categorization, but the novel use of it is as a quality gate for the output from the Gumbo C HTML parser.

The prompt given to Gemma 4 is currently:

You are evaluating whether a web page was successfully extracted into readable Markdown.

URL: <captured url>
Title: <captured title>

Extracted Markdown (first 2000 chars):
<extracted markdown>

Decide if this extraction is GOOD or BAD.

GOOD means: the main article content is present and readable.
BAD means: the content is mostly navigation, advertising, login walls, JavaScript placeholders, or otherwise unusable as a reference.

Respond in exactly this format:
VERDICT: GOOD|BAD
REASON: <one short sentence>

The model is not the processing pipeline; it is the judge inside the pipeline.
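To act on the judge's reply, the pipeline has to read the two-line format back out. The following is a minimal sketch of that step, not Everbench's actual code: `parseGateReply` and the `GateResult` type are hypothetical names, and treating any off-format reply as "unparsed" is an assumption about how one might handle a model that ignores the format.

```typescript
// Hypothetical helper: parses the "VERDICT: ... / REASON: ..." reply format
// requested by the prompt above. Names and error handling are assumptions.
type GateResult =
  | { ok: true; verdict: "GOOD" | "BAD"; reason: string }
  | { ok: false; raw: string };

function parseGateReply(reply: string): GateResult {
  // The "m" flag lets ^ and $ match at line boundaries; "i" tolerates casing drift.
  const verdictMatch = reply.match(/^[ \t]*VERDICT:[ \t]*(GOOD|BAD)[ \t]*$/im);
  const reasonMatch = reply.match(/^[ \t]*REASON:[ \t]*(.+?)[ \t]*$/im);

  if (!verdictMatch || !reasonMatch) {
    // Anything off-format is reported as unparsed rather than guessed at.
    return { ok: false, raw: reply };
  }

  return {
    ok: true,
    verdict: verdictMatch[1].toUpperCase() as "GOOD" | "BAD",
    reason: reasonMatch[1],
  };
}
```

A caller could then store the page only when the result is `ok` with a `GOOD` verdict, and log the `reason` otherwise.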

In Everknown, I had intended to use local and cloud models for LLM work, interchangeably, configured where it made sense. In Everbench, I just needed an LLM that could categorize (via tags) and summarize documents well. I found Gemma-4-26B-E4B to be excellent at that. The smaller models didn’t do a very good job at some of the things I needed an LLM for, and 31B was too slow and not notably better.

Locally, I can only run one model at a time and I’m hoping that Gemma-4-26B-E4B with its MoE architecture will work out as a good general-purpose local model that I might be able to get some agentic tool-using work out of as I expand projects.

I got tired of writing post-mortems — so I built RCAi for SREs

I’m an SRE at Sony Interactive Entertainment. After a week where my teammate had four incidents (and four RCAs), I built something for the blank-page problem after every outage.

What RCAi does

RCAi turns an incident timeline into a structured post-mortem / RCA:

  • Executive summary, timeline, root cause, action items
  • Usually under a minute to a first draft
  • You edit before you ship it anywhere

Free: 3 RCAs lifetime, no credit card.

Templates & export

  • Standard, executive, and deep technical templates (higher tiers unlock more)
  • PDF / Markdown on paid plans
  • Confluence / Notion = copy formatted text to clipboard (not OAuth publish)

Integrations (optional)

Paste credentials per import — we don’t store API keys:

  • PagerDuty (Team+)
  • Datadog, Grafana, Slack, xMatters (Enterprise)

Privacy

  • Claude commercial API — customer content not used for model training
  • Saved RCAs live in your account (Firebase)

Stack

React, FastAPI, Claude, Firebase, Clerk, Stripe — frontend on Vercel, API on Railway.

Ask

If you write RCAs today:

  1. What sections does your org require that generators usually miss?
  2. Would you use an AI draft as a starting point, or only for formatting?

Try it: https://www.rcaiapp.com

I’ll be in the comments — happy to talk architecture, quotas, or corporate VPN allowlists.

How We Built a Multi-Agent AI Documentation System (And What We Learned)

Last quarter at Zeppelin Labs, we shipped Orchestrator-15 — a multi-agent documentation generation platform that takes a codebase or idea spec and produces production-grade technical documentation using coordinated AI agents.

This post covers the architecture, the mistakes, and the specific patterns that made multi-agent coordination actually work in production. Not a tutorial — a war story.

Why Multi-Agent, Not Just One Big Prompt?

The naive approach to AI documentation generation is one giant prompt: “here’s my codebase, write the docs.”

It fails for the same reason you wouldn’t ask one person to simultaneously be a technical writer, an API analyst, a diagram designer, and an editor. Context windows are finite. Tasks have different optimization targets. And a single agent trying to do everything produces mediocre output across the board.

The multi-agent approach assigns specialized roles:

  • Analyzer Agent — reads the codebase structure, identifies modules, maps dependencies
  • Writer Agent — takes structured analysis output and produces prose documentation
  • Formatter Agent — applies templates, ensures consistency, handles cross-references
  • Reviewer Agent — checks completeness, flags gaps, scores output quality

Each agent is good at one thing. The orchestrator coordinates them in sequence, and sometimes in parallel.

The Architecture

Input (codebase / spec)
        │
        ▼
┌──────────────────┐
│  Orchestrator    │  ← decides task graph, manages state
└──────┬───────────┘
       │
   ┌───┴────────────────────────────┐
   │                                │
   ▼                                ▼
Analyzer Agent                 Context Builder
(GPT-4o, low temp)            (builds shared memory)
   │
   ▼
Writer Agent(s)          ← spawned per module, run in parallel
(Claude 3.5, temp 0.7)
   │
   ▼
Formatter Agent
(structured output)
   │
   ▼
Reviewer Agent           ← gates output quality
(GPT-4o, strict prompt)
   │
   ▼
Final Documentation

The key design decision: shared memory over message passing. Each agent reads from and writes to a shared context object rather than receiving inputs directly from the previous agent. This lets the Reviewer Agent access the Analyzer’s raw output without it being filtered through the Writer — which turned out to be critical for catching documentation that technically read well but missed important implementation details.
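The shared-memory idea can be sketched in a few lines. This is an illustration under stated assumptions, not Orchestrator-15's actual types: `SharedContext`, the step functions, and the string-based "analysis" are all hypothetical, and the writer step is a stand-in for an LLM call that happens to drop detail.

```typescript
// Illustrative only: every name here is an assumption, not the real codebase.
interface SharedContext {
  analysis: Map<string, string>; // moduleId -> raw Analyzer output, one item per line
  drafts: Map<string, string>;   // moduleId -> Writer draft
}

function createContext(): SharedContext {
  return { analysis: new Map(), drafts: new Map() };
}

function analyzerStep(ctx: SharedContext, moduleId: string, raw: string): void {
  ctx.analysis.set(moduleId, raw);
}

function writerStep(ctx: SharedContext, moduleId: string): void {
  const raw = ctx.analysis.get(moduleId) ?? "";
  // Stand-in for an LLM call: this "writer" only documents the first item.
  ctx.drafts.set(moduleId, `Docs: ${raw.split("\n")[0]}`);
}

// The reviewer reads the Analyzer's raw output from the shared context,
// so it can spot items the draft silently dropped.
function reviewerStep(ctx: SharedContext, moduleId: string): string[] {
  const raw = ctx.analysis.get(moduleId) ?? "";
  const draft = ctx.drafts.get(moduleId) ?? "";
  return raw
    .split("\n")
    .filter(item => item.length > 0 && !draft.includes(item));
}
```

With pure message passing, the reviewer would only receive the draft and could not see that anything was missing.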

The State Machine

Each document module moves through states:

type ModuleState =
  | 'pending'
  | 'analyzing'
  | 'writing'
  | 'formatting'
  | 'reviewing'
  | 'approved'
  | 'failed';

interface DocumentModule {
  id: string;
  name: string;
  state: ModuleState;
  analyzerOutput?: AnalysisResult;
  draft?: string;
  formattedDraft?: string;
  reviewScore?: number;
  reviewFeedback?: string;
  retryCount: number;
}

Modules that fail the Reviewer Agent’s quality gate (score < 0.75 on our rubric) get re-queued to the Writer Agent with the review feedback included in the prompt. We cap retries at 3 before flagging for human review.
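The gate described above reduces to a small decision function. The sketch below uses the 0.75 threshold and the cap of 3 from the text; the function names, the `ReviewOutcome` type, and the reading of `retryCount` as "re-queues already used" are assumptions made for illustration.

```typescript
type ReviewOutcome = "approved" | "retry" | "needs_human";

const PASS_THRESHOLD = 0.75; // from the post: score < 0.75 fails the gate
const MAX_RETRIES = 3;       // from the post: retries are capped at 3

// Assumption: retryCount is the number of re-queues already used for this module.
function decideAfterReview(reviewScore: number, retryCount: number): ReviewOutcome {
  if (reviewScore >= PASS_THRESHOLD) {
    return "approved";
  }
  return retryCount < MAX_RETRIES ? "retry" : "needs_human";
}

// Hypothetical prompt assembly: the reviewer's feedback is appended so the
// Writer Agent sees what was wrong with its previous attempt.
function buildRetryPrompt(basePrompt: string, reviewFeedback: string): string {
  return `${basePrompt}\n\nReviewer feedback from the previous attempt:\n${reviewFeedback}`;
}
```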

This retry loop was the single biggest quality improvement we made. When first-pass writer output was approved directly, the documentation was grammatically fine but structurally shallow. With the reviewer feedback loop, output quality jumped substantially, especially for complex modules.

Parallelism: Where It Works and Where It Breaks

Writer Agents can run in parallel — each module is independent. We spawn up to 8 concurrent Writer Agents using Promise.allSettled:

async function writeModulesInParallel(
  modules: DocumentModule[],
  context: SharedContext
): Promise<DocumentModule[]> {
  const chunks = chunkArray(modules, 8); // max 8 concurrent
  const results: DocumentModule[] = [];

  for (const chunk of chunks) {
    const settled = await Promise.allSettled(
      chunk.map(module => writerAgent.process(module, context))
    );

    // Promise.allSettled preserves input order, so index i maps back to chunk[i]
    settled.forEach((result, i) => {
      if (result.status === 'fulfilled') {
        results.push(result.value);
      } else {
        // mark failed, will retry with orchestrator
        results.push(markFailed(chunk[i]));
      }
    });
  }

  return results;
}

What doesn’t parallelize well: anything that needs global consistency. The Formatter Agent must run sequentially because it maintains a cross-reference map — if two formatter instances run concurrently they produce conflicting internal link structures. We tried distributed locking on the reference map. It was brittle. Sequential formatting was the right call.

Prompt Architecture: The Part Nobody Talks About

The agents are only as good as their prompts. Our production prompts have four sections:

1. Role definition — what this agent is, what it optimizes for, what it explicitly ignores

2. Input schema — structured description of what the agent receives

3. Output schema — strict JSON format the agent must produce

4. Failure modes — explicit instructions for what to do when input is ambiguous, incomplete, or contradictory

The failure mode section was added after we shipped to production. Without it, agents hallucinated confidently when given ambiguous input. With explicit failure mode instructions, they instead returned structured { "status": "needs_clarification", "question": "..." } responses that the orchestrator could handle gracefully.
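One way an orchestrator might consume that response shape is a discriminated union plus a small router. This is a sketch under assumptions: only the `needs_clarification` shape comes from the text, while the `"ok"` variant, `parseAgentResponse`, and `route` are hypothetical.

```typescript
// Only the needs_clarification shape appears in the post; the "ok" variant is assumed.
type AgentResponse =
  | { status: "ok"; output: string }
  | { status: "needs_clarification"; question: string };

function parseAgentResponse(raw: string): AgentResponse | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const record = data as Record<string, unknown>;
  const status = record.status;
  const question = record.question;
  const output = record.output;

  if (status === "needs_clarification" && typeof question === "string") {
    return { status, question };
  }
  if (status === "ok" && typeof output === "string") {
    return { status, output };
  }
  return null; // JSON, but not a shape we recognize
}

// Hypothetical routing: continue the pipeline, ask a human, or fail the module.
function route(response: AgentResponse | null): "continue" | "ask_human" | "fail" {
  if (response === null) return "fail";
  return response.status === "ok" ? "continue" : "ask_human";
}
```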

The GitHub Copilot SDK Integration

Orchestrator-15 uses the GitHub Copilot SDK for the Analyzer Agent specifically — the SDK’s code-understanding capabilities are significantly stronger than general LLM prompting for structural code analysis. It can identify:

  • Public API surfaces vs. internal implementation details
  • Dependency graphs between modules
  • Comment density and existing documentation coverage
  • Test coverage as a proxy for module stability

The Analyzer feeds this structured analysis to the Writer Agent, which dramatically reduces hallucinated API signatures, one of the most common failures in pure-LLM documentation generation.

What We’d Do Differently

Use structured outputs from the start. We started with free-form text outputs and added JSON schemas later. Every agent refactor was painful because downstream agents had built implicit assumptions about output format. Define your schemas before writing a single agent prompt.

Build the reviewer first. We built it last. If we’d built the quality rubric and reviewer first, we would have caught bad writer prompt patterns on day 1 instead of in week 4.

Token budgets per agent. Without explicit token limits per agent, the Writer Agent would occasionally produce exhaustive output for simple modules and thin output for complex ones. Calibrating per-module token budgets based on the Analyzer’s complexity score (lines of code, dependency count) significantly improved consistency.
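A per-module budget could be derived from the two complexity inputs mentioned above. The sketch below is purely illustrative: the linear formula and every constant in it are placeholders, not the values used in Orchestrator-15.

```typescript
interface Complexity {
  linesOfCode: number;
  dependencyCount: number;
}

// Placeholder constants chosen for illustration only.
const MIN_TOKENS = 400;
const MAX_TOKENS = 4000;
const TOKENS_PER_LINE = 2;
const TOKENS_PER_DEPENDENCY = 150;

// Linear budget, clamped so trivial modules still get a floor
// and very large modules cannot exceed the ceiling.
function tokenBudget(c: Complexity): number {
  const raw =
    MIN_TOKENS +
    c.linesOfCode * TOKENS_PER_LINE +
    c.dependencyCount * TOKENS_PER_DEPENDENCY;
  return Math.min(MAX_TOKENS, Math.max(MIN_TOKENS, Math.round(raw)));
}
```

With these placeholder values, a 100-line module with 2 dependencies gets 900 tokens, while a 5,000-line module is clamped to 4,000.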

The Repo

Orchestrator-15 is open source. You can find it on the Zeppelin Labs GitHub. We’re actively developing it — issues and PRs welcome.

If you’re building multi-agent systems and want to compare notes, drop a comment below or reach out through zeppelinlabs.digital.

Built at Zeppelin Labs — a software development studio building SaaS products, AI systems, and automation platforms from Islamabad, Pakistan.

How I Built an AI News Brief with Next.js, Supabase, Vercel, and GPT-4o-mini

Over the past few months, I have been building a small AI news brief called DeepSignal.

The idea started from a simple personal frustration:

I was reading X, Hacker News, arXiv, OpenAI and Anthropic blogs, product launch pages, newsletters, and company updates every day, but still felt like I was either missing important AI news or wasting time on low-signal updates.

So I built a small system that does three things:

  1. Collects AI-related updates from multiple sources
  2. Scores each story with a transparent 0–100 signal score
  3. Publishes a daily and weekly brief

The product is not technically complex, but the workflow taught me a lot about building AI-assisted content products, SEO for dynamic sites, and the difference between summarizing information and filtering information.

This is a breakdown of the stack, architecture, and lessons learned.

The stack

The current stack is intentionally simple:

Frontend: Next.js 15
Database: Supabase
Hosting: Vercel
AI processing: GPT-4o-mini
Content model: Articles, sources, tags, guides, weekly briefs
SEO: sitemap, canonical URLs, RSS, structured pages

I wanted to keep the system cheap and easy to maintain because this is a solo project.

The rough monthly cost is still low. Vercel handles deployment and hosting, Supabase handles the database, and GPT-4o-mini is used for scoring and classification rather than heavy generation.

The main goal was not to build a complicated AI pipeline.

The goal was to build a reliable workflow that could turn noisy inputs into useful outputs.

The basic architecture

The system has a simple flow:

Sources
   ↓
Fetch / import
   ↓
Normalize article data
   ↓
AI relevance check
   ↓
Signal scoring
   ↓
Tagging and categorization
   ↓
Publish article pages
   ↓
Generate daily / weekly briefs
   ↓
Expose guides, RSS, sitemap

At a high level, each story becomes a structured object:

type Article = {
  id: string;
  title: string;
  url: string;
  source: string;
  summary: string;
  publishedAt: string;
  aiRelevanceScore: number;
  signalScore: number;
  tags: string[];
  category: string;
  canonicalUrl: string;
  isIndexable: boolean;
};

The most important field is not the summary.

It is isIndexable.

That one field ended up being more important than I expected.

Why filtering matters more than summarizing

At first, I thought the main problem was summarization.

Take a long article, summarize it, and users save time.

But after building the first version, I realized summarization alone does not solve the real problem.

A summary tells you:

What does this article say?

But users usually need to know:

Should I care?
Why does this matter?
Is this actually about AI?
Is this a durable signal or just a temporary headline?
Is this more important than the other 50 updates today?

That changed the product direction.

Instead of only generating summaries, the system needed to decide what should be included, ranked, grouped, and excluded.

For an AI news product, filtering is not a minor feature.

Filtering is the product.

The signal score

Each story gets a 0–100 signal score.

The score is not meant to be perfect. It is a transparent ranking system that helps explain why a story may matter.

A story can score higher based on signals like:

- source quality
- AI relevance
- novelty
- technical depth
- business impact
- research importance
- company importance
- cross-source confirmation
- relevance to builders, researchers, or operators

A simplified scoring idea looks like this:

type ScoreInput = {
  sourceWeight: number;
  aiRelevance: number;
  novelty: number;
  technicalDepth: number;
  marketImpact: number;
  researchValue: number;
  companyImportance: number;
};

function calculateSignalScore(input: ScoreInput) {
  const score =
    input.sourceWeight * 0.15 +
    input.aiRelevance * 0.25 +
    input.novelty * 0.15 +
    input.technicalDepth * 0.15 +
    input.marketImpact * 0.1 +
    input.researchValue * 0.1 +
    input.companyImportance * 0.1;

  return Math.round(Math.min(100, Math.max(0, score)));
}

The exact formula can change, but the principle matters:

I wanted users to feel that the ranking had a visible logic, not just a black-box AI label.

That was one of the biggest lessons:

A simple transparent scoring system can be more trustworthy than a more complex but invisible AI ranking.
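To make that visible logic concrete, the same weights can return a per-factor breakdown alongside the total. This is a sketch, not the site's code: `explainScore` is a hypothetical helper, and it assumes each input is already on a 0–100 scale.

```typescript
// Same weights as the simplified formula above; they sum to 1.0.
const WEIGHTS = {
  sourceWeight: 0.15,
  aiRelevance: 0.25,
  novelty: 0.15,
  technicalDepth: 0.15,
  marketImpact: 0.1,
  researchValue: 0.1,
  companyImportance: 0.1,
} as const;

type Factor = keyof typeof WEIGHTS;
type FactorValues = Record<Factor, number>; // assumed: each value is 0-100

function explainScore(input: FactorValues): { score: number; contributions: FactorValues } {
  const contributions = {} as FactorValues;
  let total = 0;
  for (const factor of Object.keys(WEIGHTS) as Factor[]) {
    const points = input[factor] * WEIGHTS[factor]; // points this factor adds
    contributions[factor] = points;
    total += points;
  }
  return { score: Math.round(Math.min(100, Math.max(0, total))), contributions };
}

// Made-up inputs: contributions are 12 + 25 + 9 + 6 + 5 + 2 + 9 = 68.
const example = explainScore({
  sourceWeight: 80,
  aiRelevance: 100,
  novelty: 60,
  technicalDepth: 40,
  marketImpact: 50,
  researchValue: 20,
  companyImportance: 90,
});
```

Showing the `contributions` object next to the score lets a reader see, for instance, that AI relevance accounts for 25 of the 68 points.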

Using GPT-4o-mini

I use GPT-4o-mini mostly for classification, scoring support, and short summaries.

The AI tasks are intentionally narrow:

- Is this article actually AI-related?
- What category does it belong to?
- What are the key takeaways?
- Is the story relevant to models, agents, research, hardware, infrastructure, regulation, or adoption?
- What tags should it receive?
- What score explanation should be shown?

I try not to use AI as a generic content generator.

Instead, I use it as a structured processing layer.

A simplified prompt pattern looks like this:

You are classifying an AI industry news article.

Return JSON only.

Evaluate:
1. AI relevance from 0 to 100
2. Signal strength from 0 to 100
3. Primary category
4. 3 to 5 tags
5. One-sentence reason why this story matters
6. Whether this story should be indexable for search

Article:
Title: ...
Source: ...
Excerpt: ...
URL: ...

The important part is forcing structured output.

For this kind of workflow, predictable JSON is more useful than beautifully written prose.
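Structured output is only predictable if it is validated before use. The sketch below checks a reply against the six items in the prompt; the JSON field names (`aiRelevance`, `signalStrength`, and so on) are assumptions, since the prompt above does not fix them.

```typescript
// Field names are hypothetical; the prompt only lists what to evaluate.
interface Classification {
  aiRelevance: number;    // 0-100
  signalStrength: number; // 0-100
  category: string;
  tags: string[];         // 3 to 5 entries
  reason: string;
  indexable: boolean;
}

function isScore(value: unknown): value is number {
  return typeof value === "number" && Number.isFinite(value) && value >= 0 && value <= 100;
}

function parseClassification(raw: string): Classification | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    return null; // the model did not return JSON
  }
  if (typeof data !== "object" || data === null) return null;

  const r = data as Record<string, unknown>;
  const { aiRelevance, signalStrength, category, tags, reason, indexable } = r;

  if (!isScore(aiRelevance) || !isScore(signalStrength)) return null;
  if (typeof category !== "string" || typeof reason !== "string") return null;
  if (typeof indexable !== "boolean") return null;
  if (!Array.isArray(tags) || tags.length < 3 || tags.length > 5) return null;
  if (!tags.every(tag => typeof tag === "string")) return null;

  return { aiRelevance, signalStrength, category, tags: tags as string[], reason, indexable };
}
```

Returning `null` on any mismatch keeps malformed replies out of the database; a caller might retry the request or skip the article.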

Supabase data model

The database is simple.

Core tables:

articles
sources
tags
article_tags
daily_briefs
weekly_briefs
guides
guide_articles

The articles table stores the normalized content.

The sources table stores source metadata and source quality.

The tags table keeps topic structure clean.

The guides table is for evergreen topic pages, such as:

AI agents
AI coding tools
AI research papers
OpenAI updates
Anthropic Claude updates
NVIDIA AI chips
AI hardware

This guide layer became important later for SEO.

A chronological feed is useful for freshness, but guide pages are better for long-term search and topic authority.

Next.js page structure

The site uses a few main page types:

/
Homepage

/articles/[slug]
Individual article pages

/guides
Guide index

/guides/[slug]
Evergreen topic pages

/weekly
Weekly AI brief

/tags/[slug]
Core topic pages

/sources/[slug]
Selected source pages

Not every page deserves to be indexed.

That became one of the most important SEO decisions.

SEO lesson: not every page should be in the sitemap

Early on, I made the mistake of thinking more indexed pages would be better.

It was not.

When a site has too many low-quality, thin, duplicate, or off-topic pages, search engines can get confused about what the site is actually about.

For an AI news site, this matters a lot because source feeds can easily include AI-adjacent but irrelevant content.

So I added stricter sitemap rules.

The sitemap should include:

- homepage
- about page
- guide index
- high-quality guide pages
- weekly brief
- selected high-quality article pages
- selected core tag pages

The sitemap should not include:

- saved pages
- subscribe pages
- internal API routes
- search result pages
- parameter URLs
- low-quality tag pages
- non-AI articles
- thin source pages
- duplicate daily feed pages

The rule I use now is simple:

Only put a URL in the sitemap if it is:

- canonical
- indexable
- useful as a search landing page
- relevant to the core AI topic
- not thin or duplicated

This helped clean up the site’s search profile.
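The rule above can be written as a single predicate. This is a rough sketch: the `Page` fields other than `canonicalUrl` and `isIndexable`, and the word-count threshold used as a proxy for "thin", are assumptions. Whether a page is "useful as a search landing page" is an editorial judgment that this code does not capture.

```typescript
interface Page {
  url: string;
  canonicalUrl: string;
  isIndexable: boolean;
  isAiRelevant: boolean; // assumed to come from the AI relevance check
  wordCount: number;     // assumed crude proxy for thin content
  duplicateOf?: string;  // assumed: set when the page duplicates another URL
}

const MIN_WORDS = 150; // placeholder threshold

function belongsInSitemap(page: Page): boolean {
  const isCanonical = page.url === page.canonicalUrl;
  const isThin = page.wordCount < MIN_WORDS;
  const isDuplicate = page.duplicateOf !== undefined;
  return isCanonical && page.isIndexable && page.isAiRelevant && !isThin && !isDuplicate;
}

function buildSitemapUrls(pages: Page[]): string[] {
  return pages.filter(belongsInSitemap).map(page => page.canonicalUrl);
}
```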

Canonical URLs and UTM links

For promotion, I use UTM links like:

https://ai-deep-signal.com/weekly?utm_source=x&utm_medium=social&utm_campaign=weekly

or:

https://ai-deep-signal.com/?utm_source=reddit&utm_medium=social&utm_campaign=launch

But the canonical URL must always point to the clean version:

https://ai-deep-signal.com/weekly
https://ai-deep-signal.com/

That avoids turning campaign URLs into duplicate SEO pages.

For a dynamic site, this is easy to overlook.

Tracking URLs are for analytics.

Canonical URLs are for search engines.

They should not be mixed.
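Deriving the clean URL from a campaign URL is a small transformation. The sketch below uses the standard `URL` API and strips every `utm_` parameter while keeping other query parameters; `toCanonicalUrl` is a hypothetical helper, not the site's actual implementation.

```typescript
function toCanonicalUrl(rawUrl: string): string {
  const url = new URL(rawUrl);

  // Collect the keys first, then delete: mutating while iterating can skip entries.
  const trackingKeys: string[] = [];
  url.searchParams.forEach((_value, key) => {
    if (key.toLowerCase().startsWith("utm_")) {
      trackingKeys.push(key);
    }
  });
  for (const key of trackingKeys) {
    url.searchParams.delete(key);
  }

  return url.toString();
}

const canonicalWeekly = toCanonicalUrl(
  "https://ai-deep-signal.com/weekly?utm_source=x&utm_medium=social&utm_campaign=weekly"
);
// canonicalWeekly is "https://ai-deep-signal.com/weekly"
```

The result is what would go into the page's canonical link, while the original campaign URL stays in analytics.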

Why I added guides and weekly briefs

The first version of the site was mostly a feed.

That worked, but it had a problem:

Feeds are good for browsing.

Guides are better for understanding.

So I added topic-based guides and a weekly brief.

The weekly page is for people who want a quick summary of what mattered this week.

The guide pages are for evergreen themes that should grow over time.

For example:

/guides/what-are-ai-agents
/guides/best-ai-coding-agents
/guides/ai-research-papers-this-week
/guides/nvidia-ai-chip-news
/guides/openai-news

This gives the site a more stable structure:

Homepage
  ↓
Guides
  ↓
Topic pages
  ↓
Related articles

That structure is much better than only having a reverse-chronological feed.

Deployment on Vercel

Vercel is a good fit for this kind of project because most of the site is content-oriented.

The project benefits from:

- fast deployments
- preview deployments
- automatic HTTPS
- good Next.js support
- serverless functions for lightweight API work
- ISR / caching options

But I avoid using Vercel for heavy background work.

If the project grows, I would move heavier jobs to a separate worker or queue system.

For now, Vercel + Supabase is enough.

What I would improve next

There are still many things I would improve.

Better deduplication

AI news often appears in multiple places. The same story can show up as a company blog post, a tweet thread, a newsletter item, and a Hacker News discussion.

Better clustering would make the brief cleaner.
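As a starting point, near-duplicate headlines can be grouped by word overlap. The following is a deliberately naive sketch, not something the site does today: it uses Jaccard similarity on title tokens with a placeholder threshold of 0.5, and it would miss stories whose headlines share few words.

```typescript
// Lowercased word set, ignoring very short words.
function titleTokens(title: string): Set<string> {
  return new Set(
    title.toLowerCase().split(/[^a-z0-9]+/).filter(word => word.length > 2)
  );
}

// Jaccard similarity: shared words divided by total distinct words.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  a.forEach(word => {
    if (b.has(word)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Greedy clustering: each title joins the first cluster whose
// representative (its first member) is similar enough.
function clusterByTitle(titles: string[], threshold = 0.5): string[][] {
  const clusters: { rep: Set<string>; members: string[] }[] = [];
  for (const title of titles) {
    const tokens = titleTokens(title);
    const match = clusters.find(c => jaccard(c.rep, tokens) >= threshold);
    if (match) {
      match.members.push(title);
    } else {
      clusters.push({ rep: tokens, members: [title] });
    }
  }
  return clusters.map(c => c.members);
}
```

Clustering on embeddings or shared outbound links would likely work better for cases like a blog post and a discussion thread about it, which rarely share a headline.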

Better source weighting

Not all sources should have equal authority. A research paper, company announcement, social post, and rewritten news article should be weighted differently.

Better guide pages

The guide pages should become more like living topic trackers, not just lists of related articles.

Each guide should eventually include:

- topic explanation
- latest updates
- important companies
- relevant research
- key risks
- related stories
- last updated date

Better scoring explanations

A score is only useful if users understand it.

I want each article to explain not just the score, but the reason behind the score.

What I learned

A few lessons stood out.

1. Filtering is harder than summarizing

Summarization is relatively easy now. Deciding what deserves attention is much harder.

2. SEO quality matters more than SEO volume

More pages are not always better. Cleaner, more relevant pages are better.

3. Topic pages are more durable than feeds

Feeds create freshness. Guides create long-term value.

4. Transparent AI systems feel more trustworthy

Users do not need a perfect score, but they do need to understand why a score exists.

5. The workflow around the model is the real product

The AI model is only one part. The source selection, scoring rules, publishing flow, SEO structure, and user experience matter just as much.

Final thoughts

This project started as a small personal tool because I was tired of reading too many AI sources every morning.

But it turned into a useful lesson:

AI products do not always need to generate more content.

Sometimes the better product is the one that helps people ignore more content.

That is what I am trying to build with DeepSignal: a cleaner way to follow AI news, research, agents, models, and infrastructure without the daily noise.

The site is here:

https://ai-deep-signal.com/?utm_source=devto&utm_medium=article&utm_campaign=build_log

The weekly brief is here:

https://ai-deep-signal.com/weekly?utm_source=devto&utm_medium=article&utm_campaign=build_log

I would love feedback from other developers:

Would you trust a transparent signal score for news ranking?
Or would you rather see a purely editorial brief without scoring?