What do those CVSS letters mean? A guide to understanding them once and for all

Every time an important CVE comes out, someone pastes the CVSS vector into the team chat and everyone pretends to understand it. Spoiler: most people only look at the number (9.1 CRITICAL) and ignore the rest.

The problem is that the number only tells you how severe it is. The vector tells you why, and that completely changes how you respond.

First: what is CVSS?

CVSS (Common Vulnerability Scoring System) is a scoring system for describing security vulnerabilities. It does not just give you a number: it gives you a vector, which is essentially a compressed description of how the attack works.

You will run into two versions regularly: v3.1 (the most common today) and v4.0 (newer and more detailed). I will cover both.

💡 The scoring scale
0.1–3.9 LOW
4.0–6.9 MEDIUM
7.0–8.9 HIGH
9.0–10.0 CRITICAL
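The scale above can be expressed as a small lookup. This is a minimal sketch; the function name `severity` is just an illustrative choice, and the `NONE` label for a score of exactly 0.0 comes from the CVSS specification rather than from the list above.

```python
def severity(score: float) -> str:
    """Map a CVSS base score to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS scores range from 0.0 to 10.0, got {score}")
    if score == 0.0:
        return "NONE"      # defined by the spec, not shown in the list above
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


for s in (3.9, 6.9, 8.9, 9.1):
    print(s, severity(s))
```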

The burglar analogy

To understand the CVSS vector, imagine a vulnerability as a way of breaking into a house. The CVSS vector answers these questions about the "burglary":

  🏠 Analogy
  **Where can the burglar attack from?** From the street, or does he have to be in the garden? *(Attack Vector)*
  **Is it hard to get in?** An open door or a high-security lock? *(Attack Complexity)*
  **Does he need a key?** Or does he get in with nothing? *(Privileges Required)*
  **Does someone have to open the door from the inside?** *(User Interaction)*
  **What can be stolen?** Documents, furniture, or can he break things too? *(Impacts: C/I/A)*

CVSS v3.1, letter by letter

Let's take this real vector from CVE-2024-9465 (Palo Alto Expedition):

  CVSS v3.1
  CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N





| Code | Name | Rating in this CVE | What it means in plain words |
|------|------|--------------------|------------------------------|
| AV:N | Attack Vector: Network | 🔴 Dangerous | The attacker does not need to be physically nearby. The attack can come from anywhere in the world over the internet. *(N=Network, A=Adjacent, L=Local, P=Physical)* |
| AC:L | Attack Complexity: Low | 🔴 Dangerous | The attack is easy to carry out. It needs no special conditions, exact timing, or advanced knowledge. Anyone with the exploit can do it. *(L=Low, H=High)* |
| PR:N | Privileges Required: None | 🔴 Dangerous | The attacker needs no prior account or password. Show up, attack, done. *(N=None, L=Low, H=High)* |
| UI:N | User Interaction: None | 🔴 Dangerous | No user has to click anything, open a file, or visit a link. The attack works on its own. *(N=None, R=Required)* |
| S:U | Scope: Unchanged | ⚪ Neutral | The impact stays on the attacked system. It does not automatically "jump" to other systems. *(U=Unchanged, C=Changed)* |
| C:H | Confidentiality: High | 🔴 Critical | All confidential information is exposed: passwords, API keys, configurations. The attacker can read everything. *(N=None, L=Low, H=High)* |
| I:H | Integrity: High | 🔴 Critical | The attacker can modify or create data. In this case, arbitrary files can be written to the system. *(N=None, L=Low, H=High)* |
| A:N | Availability: None | 🟢 No impact | The attacker cannot take the system down. The service stays available while it is quietly exploited. *(N=None, L=Low, H=High)* |

⚠️ How to read the result
AV:N + AC:L + PR:N + UI:N in the same vector = "anyone on the internet, with no effort, no account, and no help from anyone" can carry out the attack. Combined with C:H, that is the worst possible combination for confidential data.
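To see how those letters turn into a number, here is a minimal sketch of the CVSS v3.1 base score calculation. The weights and formulas are the ones published in the v3.1 specification; the function names (`parse_vector`, `roundup`, `base_score`) are illustrative choices, and the sketch ignores temporal and environmental metrics.

```python
import math

# Metric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.2}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}
# Privileges Required weighs more when Scope is Changed: (unchanged, changed).
PR = {"N": (0.85, 0.85), "L": (0.62, 0.68), "H": (0.27, 0.5)}


def parse_vector(vector: str) -> dict:
    """Split 'CVSS:3.1/AV:N/...' into {'AV': 'N', ...}."""
    prefix, *metrics = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError(f"expected a CVSS:3.1 vector, got {prefix!r}")
    return dict(m.split(":", 1) for m in metrics)


def roundup(x: float) -> float:
    """Round up to one decimal, as defined in the v3.1 spec."""
    i = round(x * 100000)
    if i % 10000 == 0:
        return i / 100000.0
    return (math.floor(i / 10000) + 1) / 10.0


def base_score(vector: str) -> float:
    m = parse_vector(vector)
    changed = m["S"] == "C"
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if changed:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    else:
        impact = 6.42 * iss
    pr_weight = PR[m["PR"]][1 if changed else 0]
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * pr_weight * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    total = impact + exploitability
    if changed:
        total *= 1.08
    return roundup(min(total, 10.0))


print(base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N"))  # 9.1
```

Changing `A:N` to `A:H` in the same vector raises the result to 9.8, which shows how much of the score the four exploitability letters already account for.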

CVSS v4.0: what changes?

Version 4.0 is newer and more detailed. It splits the impact into two parts: the system that is attacked directly (Vulnerable System) and other systems that could be affected (Subsequent System).

  CVSS v4.0
  CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA:N/SC:H/SI:N/SA:N





| Code | Name | Rating | What it means |
|------|------|--------|---------------|
| AV:N | Attack Vector: Network | 🔴 | Same as in v3.1: attack over the internet, no proximity needed. |
| AC:L | Attack Complexity: Low | 🔴 | Easy to carry out, no special conditions. |
| AT:N | Attack Requirements: None | 🔴 | **New in v4.0.** The attack does not depend on any external condition outside the attacker's control (such as active sessions or specific configurations). *(N=None, P=Present)* |
| PR:N | Privileges Required: None | 🔴 | No account, no authentication. |
| UI:N | User Interaction: None | 🔴 | Nobody has to do anything for the attack to work. |
| VC:H | Vulnerable System Confidentiality: High | 🔴 | **The attacked system (Expedition):** all of its information is exposed. Hashes, configs, API keys. |
| VI:L | Vulnerable System Integrity: Low | 🟡 | **The attacked system:** the attacker can modify some data but does not have full write control. Partial integrity impact. |
| VA:N | Vulnerable System Availability: None | 🟢 | **The attacked system:** it keeps running. There is no denial of service. |
| SC:H | Subsequent System Confidentiality: High | 🔴 | **Other systems (the PAN-OS firewalls):** because the API keys are exposed, the firewalls' confidentiality is compromised too. The damage spreads. |
| SI:N | Subsequent System Integrity: None | 🟢 | **Other systems:** the attacker cannot modify data on the firewalls directly through this vector. |
| SA:N | Subsequent System Availability: None | 🟢 | **Other systems:** this attack cannot take the firewalls down. |

💡 The big improvement in v4.0
v4.0 separates the impact into VC/VI/VA (the system attacked directly) and SC/SI/SA (systems affected afterwards). In this CVE that is key: Expedition gets SC:H because the exposed API keys compromise the firewalls. v3.1 did not capture that chain effect well.
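The VC/VI/VA versus SC/SI/SA split is easy to pull out of a vector programmatically. The sketch below only parses the string and groups the impact metrics; it does not attempt v4.0 scoring, which relies on lookup tables from the specification. The helper names `parse_v4`, `impact_split`, and `spreads` are hypothetical.

```python
def parse_v4(vector: str) -> dict:
    """Split 'CVSS:4.0/AV:N/...' into {'AV': 'N', ...}."""
    prefix, *metrics = vector.split("/")
    if prefix != "CVSS:4.0":
        raise ValueError(f"expected a CVSS:4.0 vector, got {prefix!r}")
    return dict(m.split(":", 1) for m in metrics)


def impact_split(vector: str):
    """Return (vulnerable_system_impacts, subsequent_system_impacts)."""
    m = parse_v4(vector)
    vulnerable = {k: m[k] for k in ("VC", "VI", "VA")}
    subsequent = {k: m[k] for k in ("SC", "SI", "SA")}
    return vulnerable, subsequent


def spreads(vector: str) -> bool:
    """True when any Subsequent System impact is above None."""
    _, subsequent = impact_split(vector)
    return any(value != "N" for value in subsequent.values())


VECTOR = "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA:N/SC:H/SI:N/SA:N"
vulnerable, subsequent = impact_split(VECTOR)
print("Vulnerable system:", vulnerable)
print("Subsequent systems:", subsequent)
print("Damage spreads:", spreads(VECTOR))
```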

Visual summary: how to read a vector quickly

CVSS:3.1 / AV:? / AC:? / PR:? / UI:? / S:? / C:? / I:? / A:?
            │      │      │      │      │     │     │     └─ Does the service go down?
            │      │      │      │      │     │     └─────── Can data be modified?
            │      │      │      │      │     └───────────── Can private data be read?
            │      │      │      │      └─────────────────── Does the damage spread to other systems?
            │      │      │      └────────────────────────── Does someone have to click?
            │      │      └───────────────────────────────── Is an account or password needed?
            │      └──────────────────────────────────────── Is it hard to carry out?
            └─────────────────────────────────────────────── Where can the attack come from?

Risk values (worst to best):
  AV:N, AC:L, PR:N, UI:N, or H (High) on C/I/A = 🔴  →  Worst case
  L (Low) on C/I/A                              = 🟡  →  Partial impact
  N (None) on C/I/A                             = 🟢  →  No effect in that category

The sentence that sums it all up

When you see a vector like the one for CVE-2024-9465, translate it into a single sentence before sending it to the team:

  📝 Plain-language translation
  **"Anyone on the internet can attack this system without credentials or help from anyone, and gain full access to all confidential data, including the keys to your firewalls."**

That is what the vector says. Now you know why it scores 9.2.

Conclusion

The CVSS number tells you whether to worry. The vector tells you how to worry. AV:N/AC:L/PR:N/UI:N together is the most dangerous exploitability profile there is: easy, remote, and dependent on no one. When you see it, act first and analyze afterwards.

✅ Rule of thumb
If the first 4 fields are AV:N / AC:L / PR:N / UI:N, the attacker can be anyone on the internet, attacking with no effort, no account, and no help. Patch today.
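The rule of thumb is simple enough to automate in a triage script. This is a sketch under the assumption that vectors arrive as plain strings; `patch_today` is a hypothetical helper name, and it looks the four metrics up by name so it works for both v3.1 and v4.0 vectors (where AT sits between AC and PR).

```python
# The four exploitability values that make up the worst-case profile.
WORST_CASE = {"AV": "N", "AC": "L", "PR": "N", "UI": "N"}


def patch_today(vector: str) -> bool:
    """True when the vector is remote, easy, unauthenticated, and needs no user."""
    metrics = dict(part.split(":", 1) for part in vector.split("/")[1:])
    return all(metrics.get(name) == value for name, value in WORST_CASE.items())


print(patch_today("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N"))  # True
print(patch_today("CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:N"))  # False
```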









byron.lainez
© 2026 · Guatemala 🇬🇹

SimGemma: Democratizing STEM Education with Offline-First AI Simulations


Introduction

Imagine a classroom in a remote village. There’s a blackboard, a few passionate teachers, and curious students. What’s missing? A high-end physics lab. Even more challenging? A stable internet connection.

Physics is a subject that demands exploration. It’s hard to grasp the beauty of gravity or the silence of a vacuum from a two-dimensional drawing. This is why I built SimGemma—an offline-first, AI-powered platform designed to bring high-fidelity 3D science simulations to every classroom, regardless of connectivity.

I’m Damodharan, a Tech Lead, and I spend my weekends teaching math and science to kids through an NGO. I’ve always felt that teaching topics like pendulum motion or trigonometry on a blackboard didn’t do justice to the science. These concepts, along with things like molecular structures (methane, for instance), are simply better understood in 3D.

SimGemma was created for the Google Gemma Challenge to demonstrate how open-weights models like Gemma can solve real-world problems in resource-constrained environments.

The Problem: The “Blackboard Gap”

Traditional STEM education often suffers from two major hurdles:

  1. Static Learning: Concepts like “pendulum motion in a vacuum” are taught theoretically because recreating a vacuum in a classroom is expensive and difficult.
  2. The Connectivity Divide: Most modern educational tools require high-speed internet, leaving students in remote areas behind.

I used to hand-code these simulations in Three.js, but it was time-consuming and hard to scale. I needed a way to generate these artifacts on demand.

The Solution: SimGemma

SimGemma is a “Lab in a Box.” It allows educators to generate interactive 3D simulations using simple natural language.

Key Features:

  • On-Demand 3D Artifacts: Want to see how a pendulum behaves on the Moon? Just ask. Need to visualize a Methane molecule? Gemma’s got it.
  • Vibecoding for Teachers: Teachers don’t need to be coders. They can describe the “vibe” of the lesson, and SimGemma generates the simulation logic and 3D assets.
  • True Offline Architecture: Everything runs locally. From the AI model to the 3D rendering engine.
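To make the "pendulum on the Moon" example concrete, here is a minimal sketch of the physics such a simulation has to encode. It is not SimGemma's generated code: it uses the small-angle formula for a simple pendulum, T = 2π√(L/g), with approximate surface-gravity values, and the names are illustrative.

```python
import math

# Approximate surface gravity in m/s^2.
GRAVITY = {"Earth": 9.81, "Mars": 3.71, "Moon": 1.62}


def pendulum_period(length_m: float, g: float) -> float:
    """Period of a simple pendulum for small swing angles, in seconds."""
    if length_m <= 0 or g <= 0:
        raise ValueError("length and gravity must be positive")
    return 2 * math.pi * math.sqrt(length_m / g)


for body, g in GRAVITY.items():
    print(f"{body}: {pendulum_period(1.0, g):.2f} s for a 1 m pendulum")
```

With these values the same pendulum swings roughly 2.5 times more slowly on the Moon than on Earth, which is the kind of difference a student can see immediately in a 3D scene.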

Simgemma - carbon

Simgemma - pendulum

Simgemma - Trigonometry

Simgemma – Product link

Technical Architecture: Powered by Gemma 4

The heart of SimGemma is the Gemma 4 model. We chose Gemma for its exceptional performance-to-size ratio, making it perfect for local deployment.

1. Hybrid Offline Inference

We implemented a two-tier offline approach:

  • Server-Side (Local): For complex simulation generation, we run Gemma 4 via Ollama or llama.cpp on a local machine (e.g., a teacher’s laptop).
  • Client-Side (In-Browser): Using an ONNX build of gemma4-e2b that runs in the browser, we enable zero-server editing. This allows teachers to tweak simulation logic directly in the browser without needing any backend sandbox: everything is emulated in a local shell.
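As a rough illustration of the server-side tier, the sketch below sends a prompt to a locally running Ollama instance over its HTTP API. The endpoint and request fields are Ollama's standard `/api/generate` interface; the model tag `gemma4` and the function names are assumptions, so substitute whatever tag `ollama list` reports on your machine. Nothing here leaves the laptop.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4"  # hypothetical tag; check `ollama list` for the real one


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_simulation_code(prompt: str, timeout_s: float = 300.0) -> str:
    """Send the prompt and return the model's text response."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]


# Usage (requires Ollama to be running locally):
# code = generate_simulation_code(
#     "Write a React Three Fiber component for a pendulum in lunar gravity."
# )
```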

2. Programmatic Video & 3D

  • Remotion: We use Remotion to programmatically create educational videos and presentations of these simulations.
  • React Three Fiber / Three.js: The simulations themselves are high-fidelity 3D artifacts that students can interact with.

The “Vibecoding” Experience

One of the most exciting aspects of SimGemma is what we call “Vibecoding.” In our NGO workshops, we’ve seen that the biggest barrier to using technology in the classroom isn’t lack of interest—it’s the complexity of the tools.

With Gemma 4, we’ve turned the creation process into a conversation. A teacher can say: “Show me a double pendulum where the second arm is twice as heavy, and let’s see it in Mars’ gravity.”

Gemma understands the physics constraints, generates the necessary React/Three.js code, and renders it instantly. It turns educators into creators.

Breaking the Language Barrier

Living in India, where the Constitution recognizes 22 scheduled languages, I’ve seen how language can be a barrier to quality STEM content. Gemma 4’s translation capabilities are a game-changer. SimGemma can generate these artifacts and translate them into regional languages like Tamil. This means a teacher can create a simulation in English and have it ready for a Tamil-medium classroom in seconds, ensuring no student is left behind because of a language gap.

Impact: Bringing the Lab to the NGO

As a STEM volunteer, I’ve seen firsthand how an interactive simulation can light up a student’s eyes. SimGemma isn’t just about code; it’s about equity. It ensures that a child in a rural NGO workshop has access to the same quality of scientific exploration as a student in a tech-hub city.

Conclusion & Future Work

SimGemma proves that “Offline AI” isn’t a compromise—it’s a superpower. By leveraging the open-weights of Gemma 4, we’ve built a tool that is resilient, private, and accessible.

We are currently looking into:

  • Expanding the library of physics primitives.
  • Improving the browser-native ONNX performance for even smaller devices.
  • Collaborating with more NGOs to deploy SimGemma “Lab-in-a-box” kits.


scrcpy Integration in a Tauri App — Android Screen Mirroring on Mac

HiyokoKit includes Android remote control via scrcpy. Launching and managing scrcpy from a Tauri app has specific challenges.
Here’s how I handle it.

What scrcpy is

scrcpy is an open-source tool that mirrors and controls an Android device screen over ADB. It’s the best free option for Android screen mirroring on Mac — fast, low latency, no app required on the device.

Launching scrcpy from Rust

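use std::process::{Child, Command};

// `AppError` is the app's own error enum (defined elsewhere), with a
// `Scrcpy(String)` variant for launch failures.

/// Owns the scrcpy child process so it can be polled or stopped later.
pub struct ScrcpyProcess {
    child: Option<Child>,
}

impl ScrcpyProcess {
    pub fn start(
        &mut self,
        device_serial: &str,
        max_size: u32,
        bit_rate: &str,
    ) -> Result<(), AppError> {
        // Stop any previous instance first; overwriting `self.child`
        // would otherwise orphan a still-running scrcpy.
        self.stop();

        let child = Command::new("scrcpy")
            .args([
                "--serial", device_serial,
                "--max-size", &max_size.to_string(),
                "--video-bit-rate", bit_rate,
                "--window-title", "Android Mirror",
                "--no-audio",
            ])
            .spawn()
            .map_err(|e| AppError::Scrcpy(e.to_string()))?;

        self.child = Some(child);
        Ok(())
    }

    pub fn stop(&mut self) {
        if let Some(mut child) = self.child.take() {
            child.kill().ok();
            // Reap the process so it does not linger as a zombie.
            child.wait().ok();
        }
    }

    pub fn is_running(&mut self) -> bool {
        match self.child.as_mut() {
            // `try_wait` returns Ok(None) while the child is still alive.
            Some(child) => matches!(child.try_wait(), Ok(None)),
            None => false,
        }
    }
}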

Bundling scrcpy

scrcpy needs to be available on the user’s machine or bundled with your app. I bundle it in app resources as a universal binary:

{
  "bundle": {
    "resources": [
      "bin/scrcpy",
      "bin/adb"
    ]
  }
}

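At runtime, resolve the resource path and pass it to Command::new in place of the bare "scrcpy" name, which only works when scrcpy is on the user's PATH: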

let scrcpy_path = app_handle
    .path()
    .resource_dir()
    .unwrap()
    .join("bin/scrcpy");

Detecting when scrcpy exits

scrcpy exits when the user closes the mirror window. Detect this to update your UI:

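use std::time::Duration;
use tauri::Emitter; // Tauri v2: brings `emit` into scope

// `scrcpy_state` is assumed to be an Arc<Mutex<ScrcpyProcess>> shared with
// the commands that start and stop the mirror.
// Poll in the background until the child exits.
tokio::spawn(async move {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;

        // Keep the lock scope tight so the guard is dropped before the next `.await`.
        let running = {
            let mut proc = scrcpy_state.lock().unwrap();
            proc.is_running()
        };

        if !running {
            app_handle.emit("scrcpy-stopped", ()).ok();
            break;
        }
    }
});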

Multiple device support

scrcpy’s --serial flag selects a specific device when multiple are connected. Get the serial from adb devices and pass it explicitly:

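// Async variant of Command; std's `output()` is blocking and cannot be awaited.
use tokio::process::Command;

async fn get_device_serial() -> Result<String, AppError> {
    let output = Command::new("adb")
        .args(["devices"])
        .output()
        .await
        .map_err(|e| AppError::Device(e.to_string()))?;

    let stdout = String::from_utf8_lossy(&output.stdout);
    stdout
        .lines()
        .skip(1) // header line: "List of devices attached"
        .find_map(|line| {
            let mut fields = line.split_whitespace();
            match (fields.next(), fields.next()) {
                // Accept only entries whose state column is exactly "device",
                // which skips "unauthorized" and "offline" entries.
                (Some(serial), Some("device")) => Some(serial.to_string()),
                _ => None,
            }
        })
        .ok_or_else(|| AppError::Device("No device found".into()))
}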
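When several devices are connected, the same parsing logic can return every ready device instead of only the first, so the UI can offer a picker. The sketch below shows that logic in Python for brevity (the app itself is Rust); `ready_serials` and the sample output are illustrative, with made-up serial numbers.

```python
def ready_serials(adb_devices_output: str) -> list:
    """Return serials of all devices whose state column is exactly 'device'."""
    serials = []
    for line in adb_devices_output.splitlines():
        fields = line.split()
        # The header ("List of devices attached") and blank lines fail this
        # check, as do "unauthorized" and "offline" entries.
        if len(fields) >= 2 and fields[1] == "device":
            serials.append(fields[0])
    return serials


# Illustrative output; the serials are invented.
SAMPLE = (
    "List of devices attached\n"
    "emulator-5554\tdevice\n"
    "R58M123ABC\tunauthorized\n"
    "192.168.1.20:5555\toffline\n"
    "ZY22device9\tdevice\n"
    "\n"
)

print(ready_serials(SAMPLE))
```

Matching on the state column rather than searching the whole line for "device" also avoids false positives when a serial number happens to contain that word.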

If this was useful, a ❤️ helps more than you’d think — thanks!

HiyokoKit (includes scrcpy-based Android remote control) → https://hiyokomtp.lemonsqueezy.com/checkout/buy/2c94dd0f-e28a-4a17-8efc-7bd93087d46d

X → @hiyoyok

Diffusion Language Models Are Here: Deep Dive into NVIDIA’s Nemotron-Labs DLM Architecture

NVIDIA just open-sourced Nemotron-Labs Diffusion, a family of 3B, 8B, and 14B diffusion language models that merge autoregressive and diffusion generation for up to 6.4× more tokens per forward pass. Here’s the complete technical deep dive into the architecture, training methodology, three generation modes, and how to run it today with SGLang.


Table of Contents

  1. The Speed Wall Autoregressive LLMs Hit
  2. What Are Diffusion Language Models?
  3. Why DLMs Struggled — Until Now
  4. NVIDIA’s AR-to-DLM Breakthrough: Efficient-DLM
  5. Nemotron-Labs Diffusion: The Model Family
  6. Three Generation Modes: AR, Diffusion, Self-Speculation
  7. Performance Deep Dive: The Numbers That Matter
  8. Under the Hood: Block-Wise Attention & KV Caching
  9. Getting Started: Running with SGLang
  10. What This Means for Production LLM Infrastructure
  11. Conclusion & The Road Ahead

1. The Speed Wall Autoregressive LLMs Hit

Every language model you’ve ever used — GPT-4, Claude, Llama, Qwen — generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. It’s called autoregressive (AR) generation, and it’s been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. It’s not a compute-bound problem. It’s a memory-bandwidth-bound problem.

Here’s why that matters: each new token requires a full model forward pass. That means loading all the model’s weights (about 14 GB for a 7B model in FP16, and far more for larger models) from HBM (High Bandwidth Memory) into the GPU’s compute cores on every single decoding step. On modern GPUs the arithmetic throughput is enormous, so memory bandwidth becomes the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14GB, so reading all the weights takes at least ~7ms per step. That caps single-stream decoding at roughly 140 tokens/second no matter how much arithmetic throughput the GPU has, and the compute units sit mostly idle while the weights stream in. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.
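The bandwidth arithmetic above is easy to reproduce. This is a back-of-the-envelope sketch that only counts weight traffic (it ignores KV-cache reads, activations, and kernel overheads), so the result is a lower bound on latency, not a benchmark. The function names are illustrative.

```python
def min_step_latency_ms(params_billion: float, bytes_per_param: float,
                        bandwidth_tb_s: float) -> float:
    """Lower bound on one decode step: time to stream all weights from HBM once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3


def max_tokens_per_second(params_billion: float, bytes_per_param: float,
                          bandwidth_tb_s: float) -> float:
    """Upper bound on batch-size-1 decode speed implied by that latency."""
    return 1000.0 / min_step_latency_ms(params_billion, bytes_per_param, bandwidth_tb_s)


# 7B parameters, FP16 (2 bytes each), ~2 TB/s of HBM bandwidth.
print(f"{min_step_latency_ms(7, 2, 2.0):.1f} ms per step")
print(f"{max_tokens_per_second(7, 2, 2.0):.0f} tokens/s at best")
```

Halving `bytes_per_param` (FP8) halves the floor, which is why quantization helps, but the loop is still one token per weight sweep. Producing several tokens per sweep is the lever diffusion decoding pulls.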

The community has attacked this problem from many angles: speculative decoding (using a small draft model to propose tokens verified by the large model), quantization (FP8, INT4 to shrink weight footprint), and FlashAttention (restructuring the attention computation to cut memory traffic). These are all incremental improvements on the same fundamental loop.

NVIDIA’s Nemotron-Labs Diffusion — released on HuggingFace on May 23, 2026 — is taking a fundamentally different approach. Instead of optimizing the autoregressive loop, it breaks the loop entirely.

2. What Are Diffusion Language Models?

If you’ve worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion. The idea is to start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output.

Diffusion Language Models (DLMs) apply this same paradigm to text. Instead of generating tokens left-to-right, a DLM:

  1. Starts with a sequence of masked or noisy tokens (analogous to Gaussian noise in image diffusion)
  2. Runs multiple denoising refinement steps, predicting the clean token distribution at each step
  3. After several iterations, the entire sequence — or a large block of it — converges to the final output

Autoregressive vs Diffusion Language Model Architecture

The key theoretical advantage is parallelism. In a standard AR model, token t can only be generated after token t-1 exists. In a DLM, all positions in a block are refined simultaneously in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to Masked Diffusion Language Models (MDLMs) — work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023) — that framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings when compared to the state-of-the-art AR models of the day. NVIDIA’s work specifically addresses why, and more importantly, how to fix it.

3. Why DLMs Struggled — Until Now

The community has known about the theoretical appeal of diffusion language models for years. The reason they haven’t taken over is a cluster of practical barriers that made them non-competitive with AR models in production:

1. Accuracy Gap: DLMs trained from scratch consistently underperformed comparably-sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B — a smaller AR model — on reasoning and knowledge tasks.

2. Training Instability: Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.

3. No KV Cache Compatibility: This was the killer for inference efficiency. KV caching — where you store key/value activations from previous tokens to avoid recomputing them — is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you can’t cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.

4. Fill-in-the-Middle Mismatch: During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a training-test distribution mismatch that degrades quality.

Each of these problems has a specific technical solution in NVIDIA’s Efficient-DLM framework. Let’s dig in.

4. NVIDIA’s AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, arXiv:2512.14067) is deceptively simple: don’t train DLMs from scratch — convert pretrained AR models into DLMs.

This avoids the accuracy gap problem entirely. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to also generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem. A standard AR model uses causal (lower-triangular) attention — each token attends only to itself and all previous tokens. A standard DLM uses bidirectional (full) attention — every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you’ve broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they “expect” not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces block-wise causal attention as the solution:

  • The sequence is divided into non-overlapping blocks of size B (e.g., 32 tokens)
  • Within each block: full bidirectional attention (every token attends to every other token in the block)
  • Across blocks: standard left-to-right causal attention (block i can attend to blocks 0 through i-1)

Block-wise Attention Pattern with KV Caching

This hybrid pattern does something clever: it’s structurally similar enough to causal attention that pretrained weight distributions are preserved — the model only needs to learn bidirectionality locally within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.

Crucially, this also re-enables KV caching. Since attention is still causal across blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the current block being refined needs to be recomputed each refinement step.

4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a position-dependent masking strategy that assigns higher masking probabilities to tokens in later positions in the sequence.

The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.
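The text above does not give the exact masking schedule, so the sketch below uses a linear ramp purely as an illustration of "higher masking probability at later positions". The function names and the `p_start` and `p_end` values are assumptions, not Efficient-DLM's actual hyperparameters.

```python
import random


def mask_probabilities(seq_len: int, p_start: float = 0.2, p_end: float = 0.9) -> list:
    """Per-position masking probability, rising linearly from p_start to p_end.

    The linear shape is an assumption made for illustration only.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if seq_len == 1:
        return [p_end]
    step = (p_end - p_start) / (seq_len - 1)
    return [p_start + i * step for i in range(seq_len)]


def sample_mask(seq_len: int, rng: random.Random,
                p_start: float = 0.2, p_end: float = 0.9) -> list:
    """Draw one training mask: True means the token at that position is masked."""
    return [rng.random() < p for p in mask_probabilities(seq_len, p_start, p_end)]


rng = random.Random(0)
mask = sample_mask(16, rng)
print("".join("M" if m else "." for m in mask))
```

Compared with uniform masking, a schedule like this produces training examples that look more like inference: a mostly clean left side and a mostly masked right side.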

4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a joint AR and diffusion loss:

L_total = λ · L_AR + (1 - λ) · L_diffusion

Where L_AR is the standard cross-entropy causal language modeling loss and L_diffusion is the masked diffusion objective. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.

The pretrained base was trained on 1.3 trillion tokens from NVIDIA’s Nemotron pretraining datasets, with an additional 45 billion tokens of supervised fine-tuning data for the instruct-tuned variants.

5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
|-------|------------|------|-------------------|
| nvidia/Nemotron-Labs-Diffusion-3B | ~4B | Text, Instruct | 14.7K |
| nvidia/Nemotron-Labs-Diffusion-3B-Base | ~4B | Text, Base | 14.2K |
| nvidia/Nemotron-Labs-Diffusion-8B | 8B | Text, Instruct | 24.1K |
| nvidia/Nemotron-Labs-Diffusion-8B-Base | 8B | Text, Base | 228K |
| nvidia/Nemotron-Labs-Diffusion-14B | 14B | Text, Instruct | 3.28K |
| nvidia/Nemotron-Labs-Diffusion-14B-Base | 14B | Text, Base | 1.18K |
| nvidia/Nemotron-Labs-Diffusion-VLM-8B | ~9B | Vision-Language | 590 |

The 8B Base model being the most downloaded (228K in under 2 days) reflects developer interest in using it as a foundation for custom fine-tuning.

6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that all three generation modes are supported from a single checkpoint. You don’t need different models — just a different deployment config in SGLang.

Three Generation Modes Performance Comparison

Mode 1: Autoregressive (ar_mode=true)

Standard left-to-right token generation, identical to how you’d run any other causal LM. This mode is the correctness baseline — most useful for debugging, A/B testing against existing pipelines, or when you need strict adherence to specific decoding behaviors.

Use when: Debugging, regression testing, or exact reproduction of AR outputs.

Mode 2: Diffusion / FastDiffuser (diffusion_mode=true)

The model fills in a block of 32 tokens simultaneously, running multiple denoising refinement steps per block. A confidence threshold determines which tokens are “committed” after each refinement pass — tokens whose predicted distribution is peaked enough get locked in, reducing the number of positions that need further refinement.

The process per block:

  1. Initialize block positions with mask tokens
  2. Forward pass with block-wise attention → predict token distributions over all positions
  3. Commit tokens above confidence threshold; keep others masked
  4. Repeat steps 2–3 until all positions are committed or max steps reached
  5. Move to next block, using committed block tokens in KV cache

Achieves 2.6× higher tokens per forward pass (TPF) compared to AR.

Use when: High-throughput batch serving where speed matters more than exact AR equivalence.
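The per-block loop for diffusion mode can be sketched with a toy stand-in for the model. Everything here is illustrative: `toy_predict` fakes a forward pass with random confidences, and the two fallbacks (commit the single most confident position when nothing clears the threshold, and accept all remaining guesses on the last allowed step) are assumptions added so the sketch always terminates. They are not documented behavior of FastDiffuser.

```python
import random

MASK = None  # placeholder for a masked position


def toy_predict(block: list, rng: random.Random) -> list:
    """Stand-in for one forward pass: returns (token, confidence) per position."""
    committed = sum(tok is not MASK for tok in block)
    preds = []
    for i, tok in enumerate(block):
        if tok is not MASK:
            preds.append((tok, 1.0))
        else:
            # Confidence tends to rise as more of the block is committed.
            conf = min(1.0, 0.4 + 0.6 * rng.random() + 0.1 * committed)
            preds.append((f"tok{i}", conf))
    return preds


def decode_block(block_size: int, threshold: float, max_steps: int,
                 rng: random.Random):
    """Refine one block until every position is committed. Returns (tokens, steps)."""
    if block_size < 1 or max_steps < 1:
        raise ValueError("block_size and max_steps must be at least 1")
    block = [MASK] * block_size
    for step in range(1, max_steps + 1):
        preds = toy_predict(block, rng)
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        if step == max_steps:
            chosen = masked  # out of budget: accept the current best guesses
        else:
            chosen = [i for i in masked if preds[i][1] >= threshold]
            if not chosen:  # guarantee progress on every pass
                chosen = [max(masked, key=lambda i: preds[i][1])]
        for i in chosen:
            block[i] = preds[i][0]
        if MASK not in block:
            return block, step
    return block, max_steps


tokens, steps = decode_block(block_size=32, threshold=0.9, max_steps=8,
                             rng=random.Random(0))
print(f"{len(tokens)} tokens in {steps} forward passes "
      f"(TPF = {len(tokens) / steps:.1f})")
```

The ratio printed at the end is the tokens-per-forward-pass figure for this toy run; in a real model it depends on how quickly predictions become confident.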

Mode 3: Self-Speculation / LinearSpec (self_speculation=true)

This is the most sophisticated mode — it fuses diffusion and autoregressive decoding into a single hybrid loop:

  1. The model uses diffusion to draft a full block of k candidate tokens bidirectionally (fast, parallel)
  2. It then uses autoregressive decoding to verify the draft tokens causally from left to right
  3. Any prefix of the draft that matches what AR would have produced gets committed
  4. The process restarts from the first disagreement position

The same model plays both roles (drafter and verifier). Output is lossless vs AR at temperature=0.
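The draft-then-verify loop can be sketched with toy functions in place of the model. Two assumptions to flag: verification is shown as sequential calls for clarity (a real implementation checks all draft positions in one causal forward pass), and on the first disagreement the verifier's own token is kept, which is standard speculative-decoding practice and what guarantees progress. `toy_ar_next` and `toy_draft` are invented stand-ins; the drafter deliberately makes periodic mistakes.

```python
def toy_ar_next(seq: list) -> int:
    """Stand-in for greedy AR decoding: next token as a function of the prefix."""
    return (seq[-1] * 3 + 1) % 11 if seq else 0


def toy_draft(seq: list, k: int) -> list:
    """Stand-in for the diffusion drafter: mostly agrees with AR, sometimes errs."""
    draft, cur = [], list(seq)
    for _ in range(k):
        tok = toy_ar_next(cur)
        if len(cur) % 5 == 0:
            tok = (tok + 1) % 11  # injected drafting mistake
        draft.append(tok)
        cur.append(tok)
    return draft


def self_speculate(prompt: list, max_new_tokens: int, block_size: int):
    """Generate tokens by drafting blocks and keeping the AR-verified prefix."""
    out = list(prompt)
    target = len(prompt) + max_new_tokens
    rounds = 0
    while len(out) < target:
        k = min(block_size, target - len(out))
        draft = toy_draft(out, k)
        rounds += 1
        for tok in draft:
            expected = toy_ar_next(out)
            out.append(expected)  # equals `tok` whenever the draft was right
            if tok != expected:
                break             # discard the rest of the draft and redraft
    return out[len(prompt):], rounds


generated, rounds = self_speculate(prompt=[1], max_new_tokens=20, block_size=8)

# Reference: plain one-token-at-a-time AR decoding.
reference = [1]
for _ in range(20):
    reference.append(toy_ar_next(reference))

print(generated == reference[1:], f"in {rounds} rounds instead of 20 steps")
```

Because only AR-verified tokens are ever appended, the output matches plain greedy decoding exactly; the drafter's quality affects only how many rounds are needed.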

Key numbers: LinearSpec achieves ~6× higher TPF than AR, and ~865 tokens/second on NVIDIA B200 hardware — roughly 4× the AR baseline on identical hardware.

Use when: Production serving where you need maximum speed with no quality compromise.

7. Performance Deep Dive: The Numbers That Matter

Accuracy vs Qwen3 8B:
Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy compared to Qwen3 8B across evaluated benchmarks. The DLM conversion doesn’t hurt quality — it slightly improves it, likely because the joint AR+diffusion training objective acts as an additional regularizer.

vs Dream 7B (prior DLM SOTA):
Efficient-DLM 8B achieves +5.4% higher accuracy and 4.5× higher throughput compared to Dream 7B — a decisive improvement over the previous DLM state-of-the-art.

Throughput (Tokens Per Forward Pass — TPF):

| Mode | TPF (relative to AR) | Quality vs AR |
|------|----------------------|---------------|
| Autoregressive | 1× (baseline) | Exact match |
| Diffusion (FastDiffuser) | 2.6× | Slightly different |
| Self-Spec Linear (LinearSpec) | ~6× | Lossless at T=0 |
| Self-Spec Quadratic (QuadSpec) | ~6.4× | Lossless at T=0 |

TPF (Tokens Per Forward Pass) is a hardware-agnostic efficiency metric — it measures how many output tokens you get per model forward pass, making it useful for comparing across different GPU generations.

8. Under the Hood: Block-Wise Attention & KV Caching

Let’s look at exactly how the block-wise attention mechanism enables KV caching in a DLM setting.

In standard AR decoding, the KV cache stores the key and value projections for every previously generated token. When generating token t, the model attends to cached KV from tokens 0...(t-1) and computes new Q, K, V for position t only — O(1) cache update per step.

In a standard bidirectional DLM, this is impossible: since every token attends to every other token, and token values change with each refinement step, you’d need to recompute the entire KV matrix every step — O(n²) per refinement, no caching benefit.

Block-wise causal attention resolves this with a two-level hierarchy:

Sequence: [Block 0 | Block 1 | Block 2 | ... | Block N]

For a token in Block i:
  - Attends to ALL tokens in blocks 0...(i-1)  → cached KV (never recomputed)
  - Attends to ALL tokens within Block i        → bidirectional, recomputed each step
  - CANNOT attend to tokens in blocks (i+1)+   → causal constraint maintained

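For a 32-token block size and a 2048-token sequence, the final block attends to 2016 cached tokens and 32 live ones, so up to 98.4% of the keys and values it reads are served from cache. Earlier blocks see a smaller cached share because their prefix is shorter.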
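
The arithmetic behind that figure can be sketched in a few lines. The helper name `cached_fraction` is an assumption for this example; it counts how many of the keys and values visible to a block come from earlier, already-cached blocks.

```python
def cached_fraction(block_index: int, block_size: int) -> float:
    """Share of keys/values a block reads from cache.

    Block `block_index` (0-based) attends to block_index * block_size cached
    tokens from earlier blocks plus block_size live tokens of its own.
    """
    if block_index < 0 or block_size <= 0:
        raise ValueError("block_index must be >= 0 and block_size > 0")
    cached = block_index * block_size
    visible = cached + block_size
    return cached / visible


seq_len, block_size = 2048, 32
num_blocks = seq_len // block_size  # 64 blocks

for i in (0, 1, num_blocks // 2, num_blocks - 1):
    print(f"block {i:2d}: {cached_fraction(i, block_size):.1%} from cache")
```

For the last block the ratio is 63/64, which is where the 98.4% comes from; the first block has nothing to reuse.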

Here’s how to build the attention mask in PyTorch:

import torch

def build_block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise causal attention mask.

    Within each block: full bidirectional attention (True)
    Across blocks: causal left-to-right attention (True only for past blocks)
    Future blocks: masked out (False → -inf in softmax)

    Returns a boolean mask of shape [seq_len, seq_len],
    where True = can attend, False = masked.
    """
    # Block index of every position. A trailing partial block (seq_len not
    # divisible by block_size) simply gets the last index, so no row is
    # left fully masked.
    block_ids = torch.arange(seq_len) // block_size

    # Query i may attend to key j when j's block is the same as, or earlier
    # than, i's block: bidirectional within a block, causal across blocks.
    mask = block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

    return mask


# Example: 4 blocks of 4 tokens each = 16 token sequence
mask = build_block_causal_mask(seq_len=16, block_size=4)
print(mask.int())

# Output (each row = query token, each col = key token):
# Block 0 rows: [1111 | 0000 | 0000 | 0000]
# Block 1 rows: [1111 | 1111 | 0000 | 0000]
# Block 2 rows: [1111 | 1111 | 1111 | 0000]
# Block 3 rows: [1111 | 1111 | 1111 | 1111]

The resulting mask has fully-connected 4×4 diagonal blocks (bidirectional within blocks) with a lower-triangular structure across block boundaries (causal across blocks). It’s the AR causal mask, coarsened to block granularity — which is precisely why pretrained AR weight distributions are preserved.

9. Getting Started: Running with SGLang

SGLang is the recommended serving framework for Nemotron-Labs Diffusion, with integration via PR #25803 (merging into main imminently). Here’s a complete working example.

9.1 Installation

# Install SGLang with DLM support
pip install "sglang[all]>=0.4.5" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.5/

# If the PR hasn't merged to main yet, install from the DLM branch directly:
# git clone https://github.com/sgl-project/sglang.git
# cd sglang && git fetch origin pull/25803/head:dlm-support
# git checkout dlm-support && pip install -e ".[all]"

# Pull the model weights
pip install huggingface-hub
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B \
  --local-dir ./models/Nemotron-Labs-Diffusion-8B

9.2 Serving: Launch the SGLang Server

# Mode 1 — Autoregressive (standard baseline)
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm ar_mode

# Mode 2 — Diffusion (FastDiffuser): highest raw throughput
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm diffusion \
  --block-size 32 \
  --confidence-threshold 0.9

# Mode 3 — Self-Speculation (LinearSpec): lossless 6x speedup
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm linear_spec \
  --draft-block-size 32

9.3 Inference: Python Client (OpenAI-Compatible API)

import openai
import time

# SGLang exposes an OpenAI-compatible API endpoint
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require auth by default
)

PROMPT = """You are an expert in distributed systems.
Explain the CAP theorem and its practical implications for a microservices
architecture. Be specific with concrete trade-off examples."""

def benchmark_mode(label: str, mode_hint: str = ""):
    """Run a generation and measure wall-clock tokens/second."""
    start = time.perf_counter()

    response = client.chat.completions.create(
        model="nvidia/Nemotron-Labs-Diffusion-8B",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0,        # T=0 → LinearSpec is lossless vs AR
        extra_body={
            "mode": mode_hint  # "ar", "diffusion", or "linear_spec"
        } if mode_hint else {}
    )

    elapsed = time.perf_counter() - start
    tokens  = response.usage.completion_tokens
    tps     = tokens / elapsed

    print(f"\n{'='*60}")
    print(f"Mode        : {label}")
    print(f"Output      : {response.choices[0].message.content[:200]}...")
    print(f"Tokens      : {tokens}")
    print(f"Time (s)    : {elapsed:.2f}")
    print(f"Throughput  : {tps:.1f} tok/s")
    print(f"{'='*60}")
    return tps

# Compare all three modes
ar_tps   = benchmark_mode("Autoregressive",           mode_hint="ar")
diff_tps = benchmark_mode("Diffusion (FastDiffuser)", mode_hint="diffusion")
spec_tps = benchmark_mode("Self-Spec (LinearSpec)",   mode_hint="linear_spec")

print("\n📊 Speedup Summary:")
print(f"  Diffusion vs AR   : {diff_tps/ar_tps:.2f}×")
print(f"  LinearSpec vs AR  : {spec_tps/ar_tps:.2f}×")

9.4 Quick Start via HuggingFace Transformers (AR Mode)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Nemotron-Labs-Diffusion-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user",   "content": "Explain masked diffusion in 3 sentences."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True
    )

response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)

Note: The transformers path gives AR mode only. For diffusion and self-speculation modes, the SGLang integration is required as it implements the custom decoding loop.

10. What This Means for Production LLM Infrastructure

Latency vs Throughput Trade-off, Revisited

The classic LLM serving dilemma is that throughput optimizations (larger batch sizes, continuous batching) increase latency, and latency optimizations (small batches, low KV cache pressure) hurt throughput. Self-speculation in DLMs partially decouples this: at batch size 1, LinearSpec gives 4–6× more tokens per second than AR on the same hardware. This is the scenario where AR models are most inefficient, and where DLMs provide the biggest relative gain.

Cost Implications

A 4× throughput improvement at batch size 1 means you could serve the same number of users with 1/4 the GPU compute — or equivalently, serve 4× more users from the same GPU fleet. At current B200/H100 pricing of $4–8/hour, that’s a meaningful cost reduction for any team running a production LLM API.
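
To make the cost claim concrete, here is a back-of-the-envelope sketch. It assumes a single request stream per GPU, takes the ~865 tokens/second LinearSpec figure and the "roughly 4×" ratio quoted earlier in the article, and uses the $4–8/hour range from the paragraph above. The function name is an assumption for this example, and real deployments batch requests, so treat the output as an illustration and not a price quote.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a single GPU stream."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


spec_tps = 865.0          # LinearSpec figure quoted earlier in the article
ar_tps = spec_tps / 4     # implied by the "roughly 4x" comparison

for price in (4.0, 8.0):  # the $4-8/hour range quoted above
    ar_cost = cost_per_million_tokens(price, ar_tps)
    spec_cost = cost_per_million_tokens(price, spec_tps)
    print(f"${price:.0f}/h  AR: ${ar_cost:.2f}/M tok  LinearSpec: ${spec_cost:.2f}/M tok")
```

Whatever the hourly price, the per-token cost ratio between the two modes equals the throughput ratio, which is the whole argument in one line.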

Fill-in-the-Middle and Code Editing

DLMs have a natural advantage for fill-in-the-middle (FIM) tasks. AR models handle FIM awkwardly, requiring special training and prompt formatting to look “backwards” at the suffix. A DLM generating a block bidirectionally can natively condition on both prefix and suffix context within the block — making Nemotron-Labs Diffusion well-suited for code editing agents and inline completions.

Inference Budget Control

In diffusion mode, you can control the number of denoising steps as a runtime knob. Fewer steps = faster but potentially lower quality. More steps = slower but higher quality. This gives you a continuous quality-speed trade-off at inference time without retraining — something AR models simply can’t offer. A production system could dynamically reduce diffusion steps during traffic spikes and increase them during low-load periods.
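
A load-aware step budget could look something like the sketch below. Everything here is hypothetical: the function name, the 0-to-1 load signal, and the default step bounds are placeholders chosen for illustration, and how the resulting number is passed to the serving stack depends on the framework.

```python
def choose_denoising_steps(load: float, min_steps: int = 4, max_steps: int = 32) -> int:
    """Map current load (0.0 = idle, 1.0 = saturated) to a step budget.

    Linear interpolation: idle traffic gets max_steps, saturated traffic
    gets min_steps. The defaults are placeholders, not tuned values.
    """
    if min_steps < 1 or max_steps < min_steps:
        raise ValueError("need 1 <= min_steps <= max_steps")
    load = min(max(load, 0.0), 1.0)  # clamp out-of-range readings
    steps = max_steps - load * (max_steps - min_steps)
    return round(steps)


for load in (0.0, 0.5, 0.9, 1.0):
    print(f"load={load:.1f} -> {choose_denoising_steps(load)} steps")
```

A linear ramp is the simplest possible policy; a production system would more likely pick from a few step counts that have been validated for quality.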

When to Stick with AR

For long-context tasks (100K+ tokens) where the KV cache dominates memory, the efficiency story is less clear-cut. For streaming output where users see tokens as they’re generated, block-wise generation may feel less smooth without careful rendering logic. And for tasks requiring strict constrained decoding (grammar-constrained generation, beam search), the diffusion loop needs further tooling work.

11. Conclusion & The Road Ahead

Diffusion Language Models have been a promising idea for years, perennially held back by a cluster of practical barriers: accuracy gaps, training instability, and the loss of KV caching. NVIDIA’s Efficient-DLM work and Nemotron-Labs Diffusion have systematically addressed each of these barriers with concrete, principled solutions — block-wise causal attention, position-dependent masking, and joint AR+diffusion training objectives.

The result is a model family that is simultaneously:

  • A first-class AR model (backward compatible, lossless in LinearSpec mode)
  • A 2.6–6.4× faster inference engine (depending on mode and hardware)
  • 🔲 A better fill-in-the-middle model by architectural design
  • 🎛️ A tunable quality-speed dial at deployment time — no retraining needed

With 24K+ downloads in the first 24 hours and SGLang integration landing imminently, this is one of the most practically significant open-source releases in the LLM inference space in 2026.

The next frontier: applying the same AR-to-DLM conversion recipe to frontier-scale models (70B+), exploring multimodal DLMs beyond the 8B VLM preview, and building out constrained decoding, streaming token rendering, and fine-tuning tooling for the DLM objective.

If you’re building LLM-powered applications and care about inference cost and latency, it’s time to start experimenting with Nemotron-Labs Diffusion. The autoregressive loop had a good run — but the next chapter of language model inference looks decidedly more parallel.

🔗 Resources

  • 🤗 Nemotron-Labs Diffusion model collection on HuggingFace
  • 📄 Efficient-DLM technical paper — arXiv:2512.14067
  • 💻 NVIDIA Megatron Bridge training code (GitHub)
  • 🔧 SGLang DLM integration PR #25803

Written on May 24, 2026 — based on the HuggingFace blog post and arXiv:2512.14067 (Efficient-DLM). Performance numbers reflect published benchmarks; verify against your specific hardware and workload.

Four Levels Of Customer Understanding

Many companies think they know fairly well what their users want and need, and how they make their decisions. Yet most of the time these are merely big assumptions and big hunches — with little real evidence to support them. In practice, obvious reasons might be true, but they rarely paint the full picture.

To understand our customers, we must triangulate across four levels of customer understanding by Hannah Shamji. It’s a useful way to think about the underlying reasons for user behavior, hidden motivations, and the complex layers of messy and noisy reality that are often overlooked. Let’s see how it works.

Don’t Ask Users Your Burning Questions

To learn about customers, it might seem reasonable to ask people what they think and draw conclusions from it. But it’s rarely an effective way to get actionable answers. In fact, as it turns out, what people think, feel, say, and _do_ are often very different things.

As Erika Hall wrote, asking a question directly is the worst way to get a true and useful answer to that question. We don’t always understand our true motivations, or aren’t even aware of them. We often apply our own context and interpretations to questions.

We also exaggerate (a lot!). We focus on edge cases and unrealistic scenarios, and we favor short-term goals over long-term goals. So if users say that they absolutely need to compare products in a table, it doesn’t mean that they couldn’t get to their underlying goal without it.

“Possible” vs. “Probable”

To see how tricky it is to rely on words alone, consider that even small nuances in word choice matter. In practice, users are rarely precise in expressing their thoughts. A good example is the distinction between possible, plausible, and probable, as explored by Thomas D’hooge.

A study on Dutch verbal probability terms shows how unreliable the choice of words is. While extreme words have some agreement, terms like “possible,” “maybe,” “uncertain,” or “likely” lead to a wide spread of interpretations. So we shouldn’t rely on what people say, but rather try to go deeper.

The Levels Of Understanding

To get a more realistic and less biased view of customers’ needs, we need to understand a broader picture across 4 levels:

  • Level 1: “What they say”
    Easier to collect, but mostly opinions, and most unreliable. People often explain their behavior through the lens of how they perceive it, or how they want it to be perceived, which isn’t always accurate. We shouldn’t rely too much on CRM data, surveys, or polls.
  • Level 2: “What they think and feel”
    Gives more context, but is still heavily shaped by memory and personal preferences. Good user research and interviews help us understand expectations and experiences.
  • Level 3: “What they do”
    We study actual behavior, actions taken or skipped, usage data, and analytics. We run task analysis and workflow analysis to understand how people use the product.
  • Level 4: “Why they do it”
    We study underlying motivations and root causes, through observations of real workflows and in-depth interviews. Typically, it requires a trustworthy relationship with the user, repeat interviews, and task walkthroughs.

Personally, I wouldn’t recommend NPS. It’s worth noting that different levels might reveal conflicting or contradictory data. To get a better understanding, we need to triangulate and reconcile data with mixed-method research.

Capturing Emotions And Nuance

Emotions are always difficult to capture, but they are easier to spot once you observe people doing what they need to do without external influence or interruptions. The ability to positively impact users grows by moving from sympathy to empathy or even compassion, as articulated by Sarah Gibbons.

In the past, I used a “speak-aloud” protocol and asked users to walk me through their thought process as they completed tasks. But it turns out to be quite disruptive. Because people are focused on speaking while solving a task, many emotions remain hidden or obscured by their language.

So, when conducting usability testing, I don’t ask users to speak through their experience. Instead, I observe where they tap or hover with the mouse, where their mouse circles without an action, where they scroll, and how long. Eventually, when a user confirms that they are done or that they are stuck, I ask questions.

The Emotion Wheel by Geoffrey Roberts is a helpful little tool for better describing a range of emotions during user interviews or design sessions. It certainly needs refinement for product design needs, but it helps us get more precise about the sentiment customers or colleagues might be experiencing, moving beyond just “good” or “bad”.

One helpful trick is mirroring: repeating what a user has said, or asking the same question twice in paraphrased form. Another is navigating the Emotion Wheel (see above) to better capture and understand the emotion.

These strategies help uncover some of the issues that perhaps didn’t come up in the first answer. That’s also when a user tends to add more useful context and details as they explain their confusion.

Emotions Aren’t Everything

Some people strongly disagree:

“Our work is about others — their problems, their pain, their mess. Our job is to make sense of it and then do something about it. Not to emote or perform but to act on and solve it. There is a flawed belief that to build great things, you first need to emotionally fully absorb someone else’s experience.”

— Alin Buda

I think that Alin brings up a very strong argument, and personally, I find it difficult to disagree with. However, I do see a user’s emotional response as a signal of how well the product is working for them: how engaged or detached they are in their journey, how they react to aesthetics, how confused or confident they are.

Ultimately, these are signals. To make a difference, we must go beyond emotions and explore what people actually do. Usually, this means relentlessly observing, diagnosing, and focusing on underlying user needs.

Observe And Diagnose, Don’t Validate

Instead of asking, we need to observe. Usually, I focus on small things that make or break an experience. I see where users lose time, repeat actions, hover without clicking, or click and then go back. Pay attention to subtle cues like scratching their neck, raising eyebrows, or expressions of worry, joy, or confusion.

Many companies talk about “validation” through user testing, but often that means simply confirming existing assumptions. Instead, we should diagnose existing behavior without preconceived notions or affiliations. We don’t validate; we research.

That research means not just understanding customers’ real motivations, but also risks, doubts, concerns, worries, and perhaps even harms.

The only way to get there is by building a sincere, honest, and trustworthy relationship — one that feels right and resonates deeply. When customers truly care and want to help, getting to a real understanding becomes much, much easier.

Practical Ways To Uncover User Needs

We don’t need expensive tools to uncover user needs. David Travis provides a fantastic overview of helpful strategies to do just that. Here are some initiatives to spread the word about real users’ struggles or gain a deeper understanding of user needs:

  • Exposure hours, when every employee must be exposed to their customers for at least 2 hours every 6–12 weeks.
  • Live UX testing, where we invite everyone in the company to join and observe.
  • Co-design with users, where we show new features and ask users to rank them.
  • Helpdesk insights, where we ask for frequent complaints and questions from the support every 3–6 months.
  • Listening in, where we tune in on a customer service call, web chat, or eavesdrop where users hang out.

The core idea here is that you don’t need extensive and expensive tools to uncover user needs. You need to create spaces where customers’ struggles can be exposed and make these struggles visible across the entire company.

It can be short video clips of user sessions or a monthly newsletter with what we learned this month. Making these pain points visible can rally everyone from marketing to engineering to keep users’ struggles at the back of their minds.

Wrapping Up

To make an impact, we must go way beyond user feedback. It’s never enough to listen to surveys — we must observe customers’ actual behaviors and build relationships to truly understand their goals and their motivations.

And most importantly, we need to understand what questions we actually want to have answered. Not what “validation” we need to move on with the project, but what we don’t know and what we need to research.

Without it, everything else is merely hunches and assumptions — and often wrong and expensive ones.

Useful Resources

  • Four Levels of Customer Understanding, by Hannah Shamji
  • 60 Ways To Understand User Needs, by David Travis
  • Emotion Wheel Toolkit (PNG), by Geoffrey Roberts
  • Feelings Wheel PDF
  • Feelings Wheel Online
  • My Case Against Empathy, by Alin Buda
  • Possible vs. Probable, by Thomas D’hooge
  • Communicating probability: a multinational study of the interpretation of verbal probability terms, by Maarten C. de Vries, Marjolijn L. de Boer, and Martine Bouman.

Useful Books

  • Deploy Empathy: A practical guide to interviewing customers, by Michele Hansen
  • Humankind, by Rutger Bregman

What If AI Didn’t Need the Internet?

How Gemma 4 Could Bring Powerful AI Closer to People, Not Just Servers

For a long time, using powerful AI has felt a bit like borrowing someone else’s computer from far away. Every question, image, or request had to travel through the internet to massive cloud servers before coming back with an answer. It worked, but it also came with limits — slow connections, privacy concerns, expensive APIs, and the constant need to stay online.

But recently, that idea started to change for me.

When I explored Google’s Gemma 4 models, I realized something important: the future of AI may not live only in giant data centers. It may live on our own devices.

And honestly, that feels like a breath of fresh air.

Why Local AI Feels Different

Most conversations around AI focus on bigger models, faster responses, or smarter chatbots. But what excited me most about Gemma 4 was something simpler — accessibility.

Gemma 4 introduces powerful capabilities like multimodal understanding, advanced reasoning, and a huge 128K context window, while also supporting local deployments across different kinds of devices. That means AI is no longer only for companies with huge budgets and powerful servers. In many ways, it levels the playing field.

For students, creators, and independent developers, that matters a lot.

Sometimes the best ideas come from people who do not have unlimited resources. Gemma 4 feels like a tool that opens doors instead of building walls.

A Future Beyond Constant Connectivity

In many places, stable internet is still not guaranteed. Even today, students often struggle with weak networks, limited data plans, or shared devices. Cloud-based AI can become difficult to rely on in those situations.

Now imagine this instead:

A student sitting in a small town using an offline AI tutor on a low-cost laptop.

A medical worker accessing a private assistant without uploading sensitive information to external servers.

A creator brainstorming ideas while traveling without worrying about internet speed.

That is where local AI starts to shine.

As the saying goes, “necessity is the mother of invention.” Sometimes limitations push technology in the right direction. Local AI is not just about convenience — it is about making intelligence more available to ordinary people.

What Makes Gemma 4 Exciting

One thing I appreciate about Gemma 4 is that it does not feel like a one-size-fits-all solution. The different model sizes make it flexible enough for different needs, from lightweight experiments to larger applications.

The multimodal capability is especially interesting because it allows AI to work with more than just text. That opens the door for tools that can understand images, documents, notes, and visual information in a much more natural way.

The long context window also caught my attention. Anyone who has worked with AI knows how frustrating it can be when a model “forgets” earlier parts of a conversation or document. With 128K context support, Gemma 4 can handle much larger amounts of information at once, making interactions feel smoother and more useful.

And then there is reasoning.

We are slowly moving from AI that simply responds to AI that can genuinely assist with problem-solving and deeper thinking tasks. That shift could change how students learn, how developers build, and how small teams innovate.

Why This Matters to Me as a Student Developer

As a student developer, what inspires me most is not just the technology itself — it is the possibility behind it.

Powerful AI often feels out of reach for many learners. Either the hardware is expensive, the APIs cost too much, or the tools require constant internet access. It can feel like the cards are stacked against small creators.

But local AI changes the conversation.

It gives students the freedom to experiment, build, and learn without depending entirely on cloud platforms. Even simple projects can become meaningful. An offline study assistant, a note summarizer, a multilingual learning tool — these ideas suddenly feel possible.

That is exciting because innovation should not belong only to large companies. Sometimes a small idea in the right hands can punch above its weight.

The Bigger Picture

I do not think the future will be “cloud AI versus local AI.” Both will continue to exist and grow together.

But I do believe local AI will become increasingly important.

People care more about privacy now. Developers want flexibility. Students want affordable tools. And many communities still need technology that works even with limited connectivity.

Gemma 4 feels like a step toward that future — one where AI becomes more personal, more accessible, and more adaptable to real-world situations.

Not every technological shift changes who gets to participate.

This one just might.

Final Thoughts

The most impressive thing about Gemma 4 is not only its capabilities. It is the idea behind it.

Powerful AI is slowly moving closer to people instead of farther away behind massive infrastructure. And that could make all the difference.

We often say technology should make life easier, but the best technology also makes opportunities wider. In many ways, Gemma 4 feels like a glimpse of that future.

And for students, builders, and curious minds everywhere, that future looks incredibly promising.

Terraform with AI: Build AWS Infra (Cursor + MCP)

Why Terraform with AI Matters in Modern DevOps

Writing Terraform for anything beyond a small setup quickly becomes tedious.
Once you start dealing with multiple modules, cross-resource dependencies, and AWS-specific quirks, the workflow slows down. Most of the time isn’t spent writing code — it’s spent checking documentation, fixing edge cases, and rerunning terraform apply.
Many teams are now experimenting with Terraform with AI to speed this up.
In practice, that only works partially — unless the AI has proper context.

How Terraform workflows traditionally worked

A typical workflow looks like this:

  • Read Terraform docs
  • Write modules and resources manually
  • Run terraform plan
  • Fix errors
  • Repeat

For small setups, this is manageable.
For production infrastructure, it becomes repetitive and slow. Most engineers end up switching between Terraform registry docs, AWS docs, and their codebase constantly.

Limitations of Using Terraform with AI Without Context

The obvious idea is to use AI to generate Terraform.
In most cases, it starts like this:

“Generate Terraform for a VPC with public and private subnets”
You do get output. But:

  • It may use outdated arguments
  • It ignores your module structure
  • Dependencies are incomplete
  • It often fails during terraform apply

👉 The core issue: AI does not understand your infrastructure context

Our First Attempt: A RAG Failure (Late 2024, Before the Advent of Modern Agents)

To solve this, we built an internal tool using:

  • A vector database
  • RAG (Retrieval-Augmented Generation)

The idea was to fetch the Terraform documentation, index it in a vector database, and supply the relevant parts to an agent. It helped slightly, but failed in practice:

  • Iteration was difficult (the loop of terraform plan, terraform apply, and fixing errors)
  • Context size limitations
  • No awareness of project structure
  • Could not refine outputs

It generated code, but only for simple infrastructure. For complex setups it would fail after a few iterations.
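
For readers unfamiliar with the retrieval step, here is a minimal sketch of the general idea. It is not the internal tool described above: word counts and cosine similarity stand in for a real embedding model and vector database, and the documentation snippets are hypothetical, heavily shortened stand-ins.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, docs: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:k]


# Hypothetical, heavily shortened documentation snippets.
DOCS = {
    "aws_vpc": "aws_vpc resource creates a vpc with a cidr_block",
    "aws_subnet": "aws_subnet resource creates a subnet inside a vpc using vpc_id and cidr_block",
    "aws_lb": "aws_lb resource creates an application load balancer in subnets",
}

hits = retrieve("create a subnet inside a vpc", DOCS)
prompt_context = "\n".join(DOCS[name] for name in hits)  # text handed to the model
print(hits)
```

The sketch also hints at the limitation listed above: retrieval returns isolated snippets, with no notion of how the surrounding project is structured.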

We didn’t try to optimise it further because, while we were in the middle of it, Cursor agents became extremely powerful and pretty much solved this iteration problem.

What changed with MCP Server + Cursor

The behavior changed once we introduced Terraform MCP Server and used it with Cursor.
Instead of generating code blindly, the system now had access to:

  • Terraform module documentation
  • Input/output structures
  • Resource relationships

The difference was noticeable.
The output was not perfect — but much closer to something usable.

How MCP actually changes the workflow

At a high level, MCP acts as a bridge between the editor (Cursor) and Terraform context.
Instead of guessing, the AI can:

  • Look up module definitions
  • Understand required inputs
  • Follow dependencies across resources

This is the key difference from standard AI usage.

⚙️ What MCP Server Actually Does Internally

The improvement with MCP is not just better prompting — it’s access to structured Terraform knowledge.
The MCP server exposes tools that allow the AI to query real Terraform data:

Key MCP Capabilities:

  • Provider Documentation Lookup: fetches full documentation for resources, data sources, and functions
  • Module Discovery: finds Terraform modules from the registry with usage examples
  • Module Details: retrieves inputs, outputs, and configuration patterns
  • Policy Search: helps identify best practices and security policies

👉 In simple terms: instead of guessing, the AI can look things up like an engineer would.

Terraform with AI vs Manual vs MCP (Comparison)

In practice, most teams try AI first, then realize that without context, results are unreliable. MCP fixes that gap.

A practical example: building AWS infrastructure

Let’s take a realistic setup:

  • VPC with public and private subnets
  • NAT Gateway and Internet Gateway
  • Application Load Balancer
  • Auto Scaling group (EC2)
  • CloudFront distribution
  • Cloudflare DNS
  • Jump box for access

This is a typical production-style setup.
Writing this manually takes time — especially when wiring dependencies correctly.

How we approached it

Instead of writing everything manually, we broke the problem into smaller steps and guided the AI.

🧠 Full Prompt Used for Infrastructure Generation

Instead of vague prompts, we used a structured, step-by-step approach to guide the AI.

Start code generation. Do it step by step. Move to next step only after the current step is complete.

Step: Create VPC and Network Infrastructure - use vpc module
- Create VPC with appropriate CIDR block
- Create public and private subnets across 2 AZs
- Set up Internet Gateway
- Configure NAT Gateways in public subnets
- Configure route tables for public/private subnets

Step: Create Security Groups
- ALB security group (allow HTTP/HTTPS inbound)
- EC2 security group 
    - allow traffic from ALB
    - allow ssh from the vpc
- Allow all outbound traffic

Step: Create Auto Scaling Group - use autoscaling module
- Create launch template for EC2 instances
- Use ubuntu ami for the instances
- Configure ASG across private subnets
- Use keypair named "vikas-aws"
- Add a user data script to install nginx and create a simple html page

Step: Create a jumpbox
- Create a jumpbox in the public subnet
- Ensure it has a public IP
- Allow SSH from internet

Step: Create Application Load Balancer - use alb module
- Create ALB in public subnets
- Configure HTTP listener
- Attach to autoscaling group

Step: CloudFront Distribution
- Configure CloudFront with ALB as origin
- Set caching TTL to 0

Step: DNS Configuration
- DNS handled via Cloudflare (no route53)

In practice, breaking the problem into steps like this improves output quality significantly compared to single prompts.

What the generated Terraform looked like

A simplified example:

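module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.0.0"

  name = "demo-vpc"
  cidr = "10.0.0.0/16"

  # Two AZs, with one public and one private subnet in each
  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.3.0/24", "10.0.4.0/24"]

  # A single shared NAT gateway: cheaper, but a single point of failure
  # for outbound traffic from the private subnets
  enable_nat_gateway = true
  single_nat_gateway = true
}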
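One cheap check before running a plan is to confirm that the subnet CIDRs actually fit inside the VPC range and do not overlap. The sketch below does that with Python's standard `ipaddress` module, using the values from the module block; the function name `check_subnet_layout` is illustrative and not part of any Terraform tooling.

```python
import ipaddress

# Values copied from the module block above.
VPC_CIDR = "10.0.0.0/16"
PUBLIC_SUBNETS = ["10.0.1.0/24", "10.0.2.0/24"]
PRIVATE_SUBNETS = ["10.0.3.0/24", "10.0.4.0/24"]


def check_subnet_layout(vpc_cidr, subnet_cidrs):
    """Return a list of problems; an empty list means the layout is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    problems = []
    # Every subnet must sit inside the VPC range.
    for net in subnets:
        if not net.subnet_of(vpc):
            problems.append(f"{net} is outside {vpc}")
    # No two subnets may share addresses.
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


print(check_subnet_layout(VPC_CIDR, PUBLIC_SUBNETS + PRIVATE_SUBNETS))
```

This catches the kind of "mostly valid" input that only fails at apply time, such as a generated subnet that lands outside the VPC range.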

This wasn’t perfect out of the box, but:

Structure was correct
Inputs were mostly valid
Dependencies were aligned

That already saves a significant amount of time.

What actually improved (based on usage)

From real usage:

Before:

2–4 hours to assemble infra
Multiple documentation lookups
Several failed applies

After:

Initial setup generated in minutes
Fewer structural errors
Faster iteration

In most teams, the biggest gain is reduced context switching.

Operational considerations

This approach still requires discipline:

Always run terraform plan
Review changes carefully
Do not trust generated code blindly

IAM policies and security configurations must always be reviewed manually.
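Part of that review can be made harder to skip. The sketch below reads the JSON form of a saved plan and lists the changes that deserve a human look: anything touching IAM or security groups, and anything that deletes a resource. It relies on the `resource_changes` array in Terraform's JSON plan output; the watch list `SENSITIVE_PREFIXES` and the function name are assumptions chosen for this example.

```python
import json

# Hypothetical watch list: resource type prefixes that always get a manual review.
SENSITIVE_PREFIXES = ("aws_iam_", "aws_security_group")


def flag_for_review(plan):
    """Return (address, actions) pairs for sensitive or destructive changes."""
    flagged = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        # Skip resources that are unchanged or only read (data sources).
        if actions in (["no-op"], ["read"]):
            continue
        sensitive = change["type"].startswith(SENSITIVE_PREFIXES)
        destructive = "delete" in actions
        if sensitive or destructive:
            flagged.append((change["address"], actions))
    return flagged


def load_plan(path):
    """Load a plan file produced by `terraform show -json`."""
    with open(path) as handle:
        return json.load(handle)
```

To use it, save a plan with `terraform plan -out=tfplan`, convert it with `terraform show -json tfplan > plan.json`, and pass `load_plan("plan.json")` to `flag_for_review`. This narrows down what to read first; it does not replace reading the plan.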

When Terraform with AI Works Best

In most teams, this approach works well when:

You are building new infrastructure
You need to scaffold modules quickly
You want to reduce repetitive work

It is less effective when used blindly or without validation.

When not to use this approach

Avoid relying on it when:

Infrastructure requires strict compliance
You don’t understand the generated code
You need deterministic, audited configurations

This is not a replacement for Terraform expertise.

Where this fits in a DevOps workflow

This approach integrates naturally with:

Git-based workflows
CI/CD pipelines
Infrastructure reviews

The deployment process does not change. Only the way code is written does.

FAQ

What is Terraform with AI?

Terraform with AI refers to using AI tools to generate and manage infrastructure code more efficiently.

What is Terraform MCP Server?

It provides AI tools with Terraform context, including modules and documentation.

Is AI-generated Terraform safe for production?

Yes, but only after proper validation and review.

Conclusion

Terraform itself hasn’t changed.
What’s changing is how engineers interact with it.
Using Terraform with AI + MCP Server reduces friction in writing infrastructure — especially for repetitive setups.
It doesn’t replace engineering judgment, but it does make the workflow more efficient.


CSS :has() Selector: The Layout Trick I Wish I Knew 5 Years Ago

CSS :has() is not just a fancy :parent

When :has() started popping up in specs and tweets, I mentally filed it under “cool, but not for shipping work.” I was wrong.

Now it is in Chrome, Safari, Edge, and Firefox. I use it in real projects. It has removed entire JavaScript files and a pile of .is-active classes that I was embarrassed to maintain.

If you are a working frontend dev, the shorthand is this: :has() turns CSS from “style what is there” into “style this thing if it contains that thing”. That one capability changes layout, state, and validation flows.

I will walk through three places where it made a real difference for me:

  • Parent styling without JS
  • Sibling state UIs without wiring events
  • Form validation UI that reacts to the DOM, not a framework

All of this shipped with zero additional JavaScript.

Quick mental model of :has()

The syntax looks like a pseudo class on a selector:

.card:has(img.hero) {
  /* styles here */
}

Read it as: “select .card elements that have a descendant img.hero somewhere inside.” It is a conditional filter on the left side of the selector.

You can also scope it more tightly:

.tabs:has(> .tab.is-active) {
  /* direct children only */
}

Or use it with a combinator inside the parentheses, such as the next-sibling combinator:

.field:has(+ .field--error) {
  /* this .field is followed by an error field */
}

Once that clicks, you start seeing places to remove JS.

1. Parent styling in a content-heavy project

First real use: a content-heavy marketing site for a biohacking brand I work with. Editors can drop components in any order with a CMS. Sometimes a card has an image, sometimes it is text-only. The layout should adapt.

Previously I solved this with modifier classes from the CMS, or a hydration script that scans the DOM and adds classes like .card--with-media. Boring, fragile, and slightly gross.

With :has() I deleted that script.

Image-aware cards

The card markup is boring on purpose:

<article class="card">
  <img class="card__media" src="hero.jpg" alt="">
  <div class="card__body">
    <h2>Title</h2>
    <p>Some text...</p>
  </div>
</article>

<article class="card">
  <div class="card__body">
    <h2>Another Card</h2>
    <p>Text-only card.</p>
  </div>
</article>

Now the CSS decides layout based on presence of media.

.card {
  display: grid;
  gap: 1rem;
}

.card:has(.card__media) {
  grid-template-columns: minmax(0, 2fr) minmax(0, 3fr);
  align-items: center;
}

.card:not(:has(.card__media)) {
  padding: 2rem;
  background: #111;
  color: #eee;
}

Result: if marketing drops in an image, the card becomes a two-column layout. If not, it becomes a full-width text block. No new class. No CMS configuration. No JS.

I like this because the markup stays semantic and dumb. The layout is a true function of the content, which is what CSS was always supposed to do but rarely could at the parent level.

Auto-promoting “hero” sections

Same project. Editors could add a .section stack: some had a prominent CTA, some were just copy. If a section had a primary CTA, design wanted extra padding and a gradient background.

<section class="section">
  <h2>Get early access</h2>
  <p>Short description.</p>
  <a class="btn btn--primary" href="#">Join the beta</a>
</section>

<section class="section">
  <h2>What you get</h2>
  <p>More text...</p>
</section>

With :has() I treat any section with a primary button as a pseudo hero.

.section {
  padding: 2rem 1.5rem;
  background: #050505;
}

.section:has(.btn--primary) {
  padding: 4rem 1.5rem;
  background: radial-gradient(circle at top, #2f80ed, #050505);
  color: #fff;
}

.section:has(.btn--primary) h2 {
  font-size: 2.25rem;
}

That tiny selector replaced a custom “hero” block type in the CMS that content editors kept misusing. I stopped explaining “use the hero component for this” and just let the CSS infer intent from presence of a primary CTA.

You can do similar things with :has(video), :has(.badge--new), etc. It is a good fit for messy CMS content where you want layout to respond to what your editors actually do, not what the schema designer hoped they would do.

2. Sibling state UIs without event listeners

Second use case: stateful UIs that I used to wire up with click handlers. Tabs, disclosure panels, navigation highlights, that stuff.

Yes, you can still do it in JS. But if the state is already visible in the DOM, :has() lets CSS own more of the behavior. That means less code, fewer states to sync, and fewer bugs.

Tabs powered by :target and :has()

On a little side project for baseball drills, I built a tabbed interface where each tab is actually a link to an anchor. I wanted a sticky tab bar that changes style when any tab content is active.

<div class="tabs">
  <nav class="tabs__nav">
    <a href="#hitting">Hitting</a>
    <a href="#pitching">Pitching</a>
    <a href="#fielding">Fielding</a>
  </nav>

  <section id="hitting" class="tabs__panel">...</section>
  <section id="pitching" class="tabs__panel">...</section>
  <section id="fielding" class="tabs__panel">...</section>
</div>

The panels show / hide with a regular :target trick.

.tabs__panel {
  display: none;
}

.tabs__panel:target {
  display: block;
}

Old me would now add JS to toggle classes on the nav. Instead I lean on :has().

.tabs {
  border-bottom: 1px solid #333;
}

.tabs__nav a {
  padding: .5rem 1rem;
  text-decoration: none;
  color: #888;
}

.tabs__nav a:is(:hover, :focus-visible) {
  color: #fff;
}

/* highlight the active tab label */
.tabs__nav a[href^="#"] {
  position: relative;
}

.tabs:has(#hitting:target) .tabs__nav a[href="#hitting"],
.tabs:has(#pitching:target) .tabs__nav a[href="#pitching"],
.tabs:has(#fielding:target) .tabs__nav a[href="#fielding"] {
  color: #fff;
  font-weight: 600;
}

/* make the whole tabs block look active if any panel is targeted */
.tabs:has(.tabs__panel:target) {
  border-color: #2f80ed;
}

I am not pretending this scales to 50 tabs. For most content UIs, 3 to 5 tabs is realistic. Writing those few selectors is still cheaper than adding a tab manager, handling history state, and worrying about hydration.

The key pattern is: some child panel already has state via :target or [aria-selected="true"]. Let :has() bubble that state up to parents and siblings.

Accordion with native <details> and :has()

I use <details> a lot. It is surprisingly powerful with :has(). On a settings panel I wanted the container to visually compress when no section was open, then expand once any accordion entry was open.

<section class="settings">
  <details class="settings__item">
    <summary>Profile</summary>
    <div>...</div>
  </details>
  <details class="settings__item">
    <summary>Privacy</summary>
    <div>...</div>
  </details>
</section>

CSS:

.settings {
  padding: 1rem;
  border-radius: .75rem;
  border: 1px solid #333;
  max-height: 60vh;
  overflow: auto;
  transition: box-shadow .2s ease, border-color .2s ease;
}

.settings:has(.settings__item[open]) {
  border-color: #2f80ed;
  box-shadow: 0 16px 40px rgba(0, 0, 0, .55);
}

.settings__item + .settings__item {
  border-top: 1px solid #222;
}

.settings__item summary {
  cursor: pointer;
}

Once any <details> is open, the whole settings block feels “in focus”. No JS to listen for the toggle event, no syncing of .is-active classes. The HTML already has [open]. CSS reacts.

3. Form validation UI with zero JavaScript

The biggest win for me: form UI that uses :has() with built-in browser validation. No client-side validation library. No “touched” state juggling.

On my own site I revamped a contact form and a simple experiment log form. I wanted:

  • Parent field wrappers that highlight error or success
  • Inline messages that only show when actually invalid
  • Submit button that changes state based on form validity

Browser validation already tracks validity. The DOM knows. :has() lets CSS hook into that.

Field states from input validity

Markup:

<form class="form" novalidate>
  <div class="field">
    <label>
      Email
      <input type="email" name="email" required>
    </label>
    <p class="field__error">Please enter a valid email.</p>
  </div>

  <div class="field">
    <label>
      Message
      <textarea name="message" minlength="10" required></textarea>
    </label>
    <p class="field__error">Write at least 10 characters.</p>
  </div>

  <button type="submit">Send</button>
</form>

You can bind field styling to the input inside, purely with CSS.

.field {
  margin-bottom: 1.5rem;
}

.field input,
.field textarea {
  width: 100%;
  padding: .6rem .75rem;
  border-radius: .4rem;
  border: 1px solid #444;
  background: #050505;
  color: #eee;
}

.field__error {
  display: none;
  margin-top: .35rem;
  font-size: .8rem;
  color: #ff6b6b;
}

/* highlight when invalid and touched (using :user-invalid where supported) */
.field:has(input:user-invalid),
.field:has(textarea:user-invalid) {
  color: #ff6b6b;
}

.field:has(input:user-invalid) input,
.field:has(textarea:user-invalid) textarea {
  border-color: #ff6b6b;
  box-shadow: 0 0 0 1px rgba(255, 107, 107, .6);
}

.field:has(input:user-invalid) .field__error,
.field:has(textarea:user-invalid) .field__error {
  display: block;
}

/* success state */
.field:has(input:user-valid),
.field:has(textarea:user-valid) {
  color: #4caf50;
}

.field:has(input:user-valid) input,
.field:has(textarea:user-valid) textarea {
  border-color: #4caf50;
}

No custom event handlers. The browser decides when the input is valid or invalid. CSS uses :has() to move that state to the wrapper and the message.

If you want broader support than :user-invalid, you can fall back to :invalid and accept that some browsers show the state earlier.

Form-level feedback and submit button state

Now zoom out one level. The entire <form> element also exposes validity via :valid and :invalid. Combine that with :has() and your submit button can react.

.form button[type="submit"] {
  padding: .7rem 1.25rem;
  border-radius: .4rem;
  border: none;
  background: #333;
  color: #aaa;
  cursor: not-allowed;
  transition: background .15s ease, color .15s ease, transform .05s;
}

/* any invalid field keeps button in "disabled" style */
.form:has(:invalid) button[type="submit"] {
  background: #333;
  color: #777;
}

/* no invalid control left, button goes live */
/* (:has(:valid) would match as soon as any single control is valid) */
.form:not(:has(:invalid)) button[type="submit"] {
  background: #2f80ed;
  color: #fff;
  cursor: pointer;
}

.form:not(:has(:invalid)) button[type="submit"]:active {
  transform: translateY(1px);
}

If you want to actually disable the button, you still need a tiny bit of JS to toggle the disabled attribute. I usually do not bother for simple forms; the button just looks inactive until the browser considers the form valid.

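The nice part is that the logic lives where it belongs. The browser tracks the constraints (the novalidate attribute in the markup above only turns off the native error bubbles and submit blocking, not the :valid / :invalid state). CSS reads that state. JS, if present at all, sends the request and displays a toast.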

4. Layout tweaks based on children, not breakpoints

One more pattern that has crept into my “default toolkit”: adjusting layout based on how many items a container has.

On my baseball drills page, each drill has one or more tags. I wanted single-tag drills to show the tag inline next to the title, and multi-tag drills to move them into a separate row. Doing that in JS felt silly.

<article class="drill">
  <header class="drill__header">
    <h3 class="drill__title">Front toss</h3>
    <div class="drill__tags">
      <span class="tag">Hitting</span>
    </div>
  </header>
</article>

<article class="drill">
  <header class="drill__header">
    <h3 class="drill__title">Relay race</h3>
    <div class="drill__tags">
      <span class="tag">Fielding</span>
      <span class="tag">Conditioning</span>
    </div>
  </header>
</article>

With :has() and the :nth-child() selector you can treat the two cases differently.

.drill__header {
  display: flex;
  gap: .5rem;
  align-items: baseline;
  flex-wrap: wrap;
}

/* one tag only: keep inline */
.drill__tags:has(.tag:nth-child(1):last-child) {
  order: 0;
}

/* more than one tag: push tags to next line */
.drill__tags:has(.tag:nth-child(2)) {
  flex-basis: 100%;
  order: 1;
}

No JavaScript counting nodes. No data attributes. Just “if there is at least a second tag, change layout”. If product decides to add a third or fourth tag, the CSS keeps working.

Reality check: performance and support

I am not going to pretend :has() is free. The browser has to do more work, because selectors now depend on what is inside elements and how that changes.

My take after profiling a few real pages: do not go wild with global *:has(...) selectors. Scope them. Prefer direct children or close relationships.

/* Bad idea */
*:has(.error) { ... }

/* Reasonable */
.form:has(.field__error) { ... }

/* Even better */
.form:has(.field > .field__error) { ... }

Support is good now. Chrome, Edge, Safari, Firefox all ship :has(). Firefox was the last to ship it (version 121, December 2023), so older Firefox builds, along with Safari before 15.4, are the main risk. If you work on something critical for a weird enterprise fleet, check caniuse and add progressive enhancement.

Most of my patterns above fail gracefully. You lose a highlight or a layout tweak, not core functionality. That is a good bar to aim for.

How I think about :has() now

I used to reach for JavaScript whenever a parent needed to know about a child, or a sibling needed to react to state. That felt normal. It also created a lot of glue code that did not age well.

Now my filter is simple:

  • Is the state already visible in the DOM? (attribute, pseudo class, anchor, etc.)
  • Can that state reasonably drive styling only?

If the answer is yes, I try :has() first. JS comes later, if at all.

Five years ago I was writing tab managers and form validators by hand. I would not go back. :has() is the layout trick that finally lets CSS act on the structure we already have, instead of the utility classes we wish we had planned better.

If you have a component that keeps growing event listeners and state flags, look at the HTML for five minutes. There is a decent chance :has() can take some of that weight off.

You’re Renting Someone Else’s Compute — And It’s Costing You More Than You Think

Your Claude response comes back in 800 milliseconds. You’re on a roll. Three features shipped before lunch. And somewhere, silently, your debugging intuition is going to sleep.

I’ve been tracking a pattern across developer forums — not just V2EX, but in the back-channels of engineering team chats: developers who live in network-restricted regions are increasingly “renting” computational presence elsewhere. A computer in a data center, a VM in Singapore, a colleague’s spare workstation. They connect, they code, they use AI tools that would otherwise be unreachable. Problem solved.

Except it’s not solved. It’s deferred. And the cost is accumulating in a place most devs never check: the gap between what they can describe doing and what they can actually do.

The Compute Rental Economy

The V2EX discussion that triggered this article described one developer’s setup: they live abroad, keep a desktop computer in a rented room inside China, and want to remotely access that machine to use Claude’s web interface and write code. The comments branched into VPN recommendations, remote desktop protocols, browser-based solutions, and one or two voices asking the question nobody else wanted to answer: why?

The why matters. If you’re routing through a remote machine just to access an AI assistant, you’re not solving a network problem. You’re renting computational sovereignty. And like all rentals, you’re paying for access without building ownership.

Here’s what that looks like in practice, from a commenter’s description that stuck with me: a developer in Shanghai spends 4 hours daily on a remote desktop session to a machine in Tokyo. The latency hovers between 40 and 80 ms, annoying but workable. The AI tools load. The code ships. And every evening, the developer closes the session knowing they built something without ever touching the actual hardware that built it.

That distinction — built on versus built with — is where the skill erosion starts.

Skeleton Implementation Syndrome

I need to coin a term here, because the existing vocabulary doesn’t capture this:

Skeleton Implementation Syndrome — the tendency to ship code you could describe but couldn’t write from scratch. You understand the architecture. You can explain why the service mesh routes requests the way it does. But when the AI is gone and the remote session drops, the gap between concept and implementation becomes a chasm you didn’t notice until you had to cross it alone.

This is different from normal abstraction. Normal abstraction is healthy — you don’t need to remember register allocation when writing Python. Skeleton Implementation Syndrome is pathological: you’ve delegated so much implementation to AI assistance that your mental model of how things actually work has decayed faster than your ability to ship features.

The trade-off here is asymmetric in a way that hurts quietly: AI assistance accelerates feature delivery (what you optimized for) while accelerating capability decay (what you sacrificed). You win the sprint. You lose the skill. And the debt compounds invisibly because nobody measures “debugging intuition remaining” in your quarterly review.

The Local Environment Learning Tax

Here’s where I need to make an unpopular argument, and I want you to stay with me:

Running AI tools locally — even with degraded performance — produces better engineers than accessing them remotely on optimized infrastructure.

Before you close this tab: I’m not saying remote access doesn’t work. I’m saying the learning tax of renting compute is asymmetrically borne by the developer’s capability, not by their feature velocity.

When you run a model locally (even a quantized 7B parameter model that takes 45 seconds to warm up on your M2 Max), you’re forced to develop intuition about:

  • Token budgets and context windows — because you see the cost in real time, not abstracted away
  • Prompt sensitivity — because small changes produce observable differences without a slick web interface smoothing the edges
  • Failure modes — because local models fail in ways remote APIs don’t (OOM crashes, context truncation, hallucination patterns specific to your model and quantization level)
  • System integration — because getting a local model to talk to your IDE requires actual configuration work, not just clicking “authorize”

The V2EX developer’s setup — remote machine, AI through a browser, code in a remote session — sidesteps all of this. The AI becomes a utility, like electricity. And like electricity, you stop thinking about how it works.

The Infrastructure You’re Betting On

There’s a second-order risk nobody talks about in these remote access discussions: you’re building workflow dependencies on infrastructure you don’t control.

Your remote machine exists because someone else maintains it. The network path between you and it exists because someone else routes it. The AI service you’re accessing exists because a company decided it should, and can decide otherwise.

In my local environment (M2 Max, 32GB RAM), I’ve been running a mix of local models and API access for two years. The local models are slower. They have smaller context windows. They fail in embarrassing ways. And they have never, not once, changed their terms of service, raised their prices, or decided my use case wasn’t “enterprise enough.”

The developers routing through Tokyo data centers to access Claude? They’re one corporate decision away from rebuilding their entire workflow. That’s not paranoia — that’s operational risk with a specific name: vendor dependency disguised as infrastructure convenience.

What Actually Survives

If you’re in a network-restricted region and remote access is genuinely your only option, I’m not here to tell you to suffer. Suffering isn’t a virtue. But here’s what I’d ask you to track, because I’ve watched this pattern destroy capable engineers:

Track your AI dependency score. After every coding session, ask yourself: could I have solved this without the AI? If the answer is “no, and I couldn’t have solved it six months ago either,” that’s data. That’s the gap growing.

The developers who survive this environment — who maintain capability while using AI as a multiplier — are the ones who treat AI as a colleague who happens to be infinitely patient, not a replacement for the thinking that made them dangerous in the first place.

They ask AI for second opinions, not first drafts. They use it to explore unfamiliar territory, not to avoid territory they should have mapped already. They keep a “dumb project” — something they code without AI, where inefficiency is the point, where the slow path is the learning path.

The Question I Can’t Answer For You

Here’s what I keep coming back to: the V2EX developer asked how to access Claude from their remote setup. Nobody asked what you’ll lose by making it this easy.

I don’t know your specific context. Maybe the feature velocity matters more than the debugging intuition. Maybe you’re in a sprint that doesn’t have room for the local model learning curve. Maybe the tradeoff is genuinely worth it.

But I know this: the engineers who lasted 15 years in this industry didn’t do it by shipping faster. They did it by being the person who could debug what everyone else gave up on. That capability doesn’t come from prompt engineering courses. It comes from struggling through problems without a safety net — and your remote setup, however clever, is a very comfortable safety net.

What’s the last thing you debugged without AI assistance? Not without searching the internet — without AI generating the answer for you. Go remember what that felt like. That’s the skill you might be renting away.

What’s your take?

Has your team noticed developers becoming less capable of independent debugging without AI? What’s your experience been — are you moving faster, or just shipping more?

I’d love to hear how this plays out in your specific context. Drop a comment below — I respond to every one.

Discussion on V2EX about remote access solutions for China-based developers wanting to use Claude web interface


750,000 Chips, 140 Trillion Tokens: The Math Behind DeepSeek’s Permanent Price Cut

DeepSeek made its V4-Pro 75% price cut permanent on May 22. The conventional read: “they got cheaper hardware.” The real story is more interesting — and it’s about a gap that’s not closing fast enough.

What Happened

On May 22, 2026, DeepSeek announced that the 75% discount on its V4-Pro API would become permanent. The new pricing:

| Metric | Before | After | Cut |
| --- | --- | --- | --- |
| Input (cache miss) | ¥12 / 1M tokens | ¥3 / 1M tokens | 75% |
| Output | ¥24 / 1M tokens | ¥6 / 1M tokens | 75% |
| Input (cache hit) | ¥0.1 / 1M tokens | ¥0.025 / 1M tokens | 75% |

At current exchange rates, that’s roughly $0.44/M input and $0.87/M output — making V4-Pro one of the cheapest frontier-class models on the market, on par with DeepSeek’s own V4-Flash but with significantly more capability.
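The cut percentages and dollar figures are easy to re-derive. The sketch below does so from the price list; the exchange rate of 6.88 CNY per USD is an assumption chosen because it reproduces the rounded dollar figures quoted above, not a rate stated in the announcement.

```python
# (before, after) prices in CNY per 1M tokens, from the price list above.
PRICES = {
    "input_cache_miss": (12.0, 3.0),
    "output": (24.0, 6.0),
    "input_cache_hit": (0.1, 0.025),
}

# Assumed exchange rate (CNY per USD); not taken from the source.
CNY_PER_USD = 6.88


def cut_fraction(before, after):
    """Fraction by which the price dropped."""
    return 1 - after / before


def to_usd(price_cny, cny_per_usd=CNY_PER_USD):
    """Convert a CNY price to USD at the assumed rate."""
    return price_cny / cny_per_usd


for name, (before, after) in PRICES.items():
    print(f"{name}: cut {cut_fraction(before, after):.0%}, now ${to_usd(after):.3f} per 1M tokens")
```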

The move came exactly four weeks after V4’s launch on April 24, and coincided with growing user frustration over rate limits at Google Gemini and Anthropic Claude.

The Standard Narrative

The surface-level story has three parts:

1. Architectural efficiency. V4 uses a Mixture-of-Experts architecture with 1.6 trillion parameters, but only activates a fraction per token. This gives it a structural cost advantage over dense models of comparable capability — roughly 30% of the gap.

2. Supply chain scaling. Huawei’s Ascend 950PR entered mass production in April 2026. Huawei plans to ship ~750,000 units through the year — a 2.5x increase over 2025’s 910C output. DeepSeek specifically optimized V4 for the Ascend architecture. More chips → lower unit cost → lower API pricing.

3. Competitive positioning. Western AI providers (Google, Anthropic) have been quietly tightening rate limits as demand overwhelms their GPU supply. DeepSeek is exploiting the backlash, offering unlimited usage at a fraction of the cost to capture disgruntled developers.

All three are true. But none of them fully explains the magnitude of the cut — or why it’s permanent rather than promotional.

The Math That Changes Everything

Let’s check the numbers.

Demand Side

China’s daily token consumption hit 140 trillion in March 2026, according to the National Data Administration. The growth trajectory:

  • Early 2024: 0.1 trillion/day
  • End of 2025: 100 trillion/day
  • March 2026: 140 trillion/day

That’s a 1,000x increase in two years, and a 40% jump in just the last quarter, which compounds to roughly 12% month-over-month growth (dividing 40% by three gives the ~13% used in the projections below).
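A quick check of the growth arithmetic, as a sketch: a 40% quarterly jump compounds to a little under 12% per month, while simply dividing by three gives about 13%. The two rates lead to noticeably different six-month projections (roughly 274T versus 291T tokens per day). Function names are illustrative.

```python
def monthly_rate_from_quarterly(quarterly_growth: float) -> float:
    """Monthly rate that compounds to the given quarterly growth."""
    return (1 + quarterly_growth) ** (1 / 3) - 1


def project(current: float, monthly_rate: float, months: int) -> float:
    """Demand after `months` of compounding at `monthly_rate`."""
    return current * (1 + monthly_rate) ** months


DEMAND_MARCH_2026 = 140.0  # trillion tokens/day, from the text

compounded = monthly_rate_from_quarterly(0.40)
print(f"compounded monthly rate: {compounded:.1%}")
print(f"+6 months, compounded rate: {project(DEMAND_MARCH_2026, compounded, 6):.0f}T/day")
print(f"+6 months, 13% per month:   {project(DEMAND_MARCH_2026, 0.13, 6):.0f}T/day")
```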

Supply Side

Huawei’s mass-produced chip for 2026 is the Ascend 950PR (Prefill-optimized, 1 PFLOPS FP8), with the higher-end 950DT (2 PFLOPS FP8) coming in Q4. The numbers:

| Chip | FP8 compute | Memory | Bandwidth | Inference throughput (est.) |
| --- | --- | --- | --- | --- |
| 950PR | 1 PFLOPS | 128GB HBM | 1.6 TB/s | ~1,200 tokens/sec |
| 950DT | 2 PFLOPS | 144GB HBM | 4 TB/s | ~2,400 tokens/sec |

(Throughput derived from Huawei’s published Atlas 950 SuperNode benchmark: 19.6M tokens/sec across 8,192 cards.)

Now the arithmetic:

| Item | Value |
| --- | --- |
| Total chips (2026 target) | 750,000 (70% PR + 30% DT) |
| Raw daily throughput | 85.7 trillion tokens/day |
| Inference-allocated (60%) | 51.4 trillion tokens/day |
| vs current demand (140T) | 37% coverage |
| vs demand in 6 months (~291T) | 18% coverage |

Even in the most optimistic scenario — every single chip dedicated to inference at 100% utilization:

| Scenario | vs current | vs +6 months |
| --- | --- | --- |
| 100% inference, 100% utilization | 61% coverage | 29% coverage |

The conclusion is stark: 750,000 Ascend 950 chips can’t cover today’s demand — let alone the demand in six months.
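The supply arithmetic can be sketched as follows, assuming the per-chip throughputs from the table and the 70/30 PR/DT split. One caveat: multiplying those inputs straight through gives about 101T tokens per day, somewhat above the 85.7T in the table, so the table presumably applies some derating that is not spelled out. The coverage ratios at the end use the table's own figures; either way, supply stays below demand. Function names are illustrative.

```python
SECONDS_PER_DAY = 86_400

# Atlas 950 SuperNode benchmark quoted in the text: 19.6M tokens/sec over 8,192 cards.
per_card_tps = 19.6e6 / 8192


def fleet_tokens_per_day(pr_chips, dt_chips, pr_tps=1_200, dt_tps=2_400):
    """Raw fleet throughput in tokens/day at 100% utilization."""
    return (pr_chips * pr_tps + dt_chips * dt_tps) * SECONDS_PER_DAY


def coverage(supply, demand):
    """Share of demand that the supply can serve."""
    return supply / demand


# 750,000 chips split 70% PR / 30% DT.
raw = fleet_tokens_per_day(pr_chips=525_000, dt_chips=225_000)

print(f"per-card throughput from benchmark: {per_card_tps:.0f} tokens/sec")
print(f"straight multiplication: {raw / 1e12:.1f}T tokens/day")
print(f"table figure, all chips on inference: {coverage(85.7, 140):.0%} of current demand")
print(f"table figure, 60% on inference: {coverage(51.4, 140):.0%} of current demand")
```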

So Why Cut Prices?

If supply is still a fraction of demand, permanent price cuts don’t make sense in a normal market. But this is not a normal market.

The Real Logic: Pre-Commitment, Not Surplus

DeepSeek is not cutting prices because it has spare compute. It’s cutting prices to lock in routing commitments before the hardware arrives.

Here’s the timeline:

April 24:  V4 launched, optimized for Ascend
April 24+: ByteDance orders 350,000 Ascend 950 chips (~¥40B)
May 4:     Ascend 950PR mass production confirmed
May 22:    DeepSeek makes V4-Pro 75% cut permanent

The critical insight: DeepSeek’s price cut is not a cost pass-through. It’s a market share pre-commitment — using the promise of future Ascend supply to grab developer mindshare now, before Western competitors can resolve their own capacity issues.

The Numbers Behind the Strategy

Western providers are capacity-constrained:

| Provider | Constraint | Signal |
| --- | --- | --- |
| Google Gemini | TSMC CoWoS capacity | Rate limits tightened, user backlash |
| Anthropic Claude | H100/B200 availability | API throttling, compute-use monitoring |
| OpenAI | Inference cluster rollout | Delayed GPT-5 token limits |

DeepSeek’s bet: “Spend the next 6 months building developer dependency on V4-Pro’s API — by the time Ascend supply catches up in H2 2026, those developers won’t switch back.”

This is AWS in 2006. AWS wasn’t cheaper than running your own servers in 2006. But it would be once scale kicked in. AWS priced for the scale it planned to have, not the scale it had. DeepSeek is doing the same.

What 750,000 Chips Actually Buys

The popular framing in Chinese media is “75万颗昇腾950产能大爆发” (“750,000 Ascend 950 chips: a capacity explosion”). But as the math shows, 750,000 chips isn’t abundance; it’s barely adequacy.

Think of it this way: between the end of 2025 and March 2026, China’s token demand grew by roughly 13 trillion tokens per day every single month (the monthly increment itself is larger than the entire market 18 months ago). By year-end, demand will be 300-400+ trillion. Against that, 750K chips at the 950PR/DT mix buy roughly 50-85T/day of inference capacity.

| Timeframe | Demand (est.) | Inference supply | Gap |
| --- | --- | --- | --- |
| March 2026 | 140T | ~50T | 90T |
| June 2026 | ~200T | ~50T | 150T |
| September 2026 | ~290T | ~55T (DT ramp) | 235T |
| December 2026 | ~420T | ~65T | 355T |

The gap is growing, not shrinking. Even with 750,000 chips fully deployed, the supply-demand deficit nearly quadruples over nine months (from 90T to 355T per day).
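The table can be re-derived in a few lines. The sketch below copies the demand and supply estimates from the table, computes each gap, and checks that the December demand figure matches nine months of compounding at about 13% per month. The 13% rate is inferred from the table's numbers rather than stated next to it, so treat it as an assumption.

```python
# (timeframe, demand, inference supply) in trillion tokens/day, copied from the table.
ROWS = [
    ("March 2026", 140, 50),
    ("June 2026", 200, 50),
    ("September 2026", 290, 55),
    ("December 2026", 420, 65),
]


def gaps(rows):
    """Demand minus supply for each timeframe."""
    return [(timeframe, demand - supply) for timeframe, demand, supply in rows]


def compounded_demand(start, monthly_rate, months):
    """Demand after compounding growth for a number of months."""
    return start * (1 + monthly_rate) ** months


gap_by_month = gaps(ROWS)
gap_growth = gap_by_month[-1][1] / gap_by_month[0][1]

for timeframe, gap in gap_by_month:
    print(f"{timeframe}: gap {gap}T/day")
print(f"gap growth over nine months: {gap_growth:.1f}x")
print(f"December demand at 13%/month: {compounded_demand(140, 0.13, 9):.0f}T/day")
```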

This means DeepSeek’s price cut isn’t a sign of market saturation. It’s a sign of exactly the opposite: a market so unsaturated that the winner gets to define the default API for an entire generation of developers, if they can lock them in before the hardware arrives.

Three Counter-Arguments (And Why They’re Weak)

“But cache hits reduce the effective compute needed”

True: going by the price list, cache-hit tokens cost about 1/120th of miss tokens (¥0.025 vs ¥3 per 1M). And DeepSeek’s cache hit rates can be high for workloads with stable system prompts. But cache hits are mostly in the input direction. Output tokens, the expensive ones, still need full compute. And as agentic workloads grow (multi-turn, chain-of-thought), output-to-input ratios increase, making cache less effective.
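To see why caching helps less than it first appears, here is a rough sketch using the published prices as a stand-in for compute cost (an assumption: price and compute need not move together). The request shape, 10,000 input tokens at a 90% cache hit rate plus 2,000 output tokens, is hypothetical.

```python
# Post-cut prices in CNY per 1M tokens, from the price list earlier in the article.
MISS_PRICE = 3.0
HIT_PRICE = 0.025
OUTPUT_PRICE = 6.0


def blended_input_price(hit_rate):
    """Average input price per 1M tokens at a given cache hit rate."""
    return hit_rate * HIT_PRICE + (1 - hit_rate) * MISS_PRICE


def request_cost(input_tokens, output_tokens, hit_rate):
    """Return (input_cost, output_cost) in CNY for one request."""
    input_cost = input_tokens * blended_input_price(hit_rate) / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE / 1_000_000
    return input_cost, output_cost


# Hypothetical request: 10,000 input tokens (90% cached), 2,000 output tokens.
input_cost, output_cost = request_cost(10_000, 2_000, hit_rate=0.9)
output_share = output_cost / (input_cost + output_cost)

print(f"input: ¥{input_cost:.6f}, output: ¥{output_cost:.6f}")
print(f"output share of total cost: {output_share:.0%}")
```

Even with five times more input than output tokens, the output side dominates the bill once most of the input is served from cache, which is the point the paragraph above makes.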

“But not all 140T tokens need 950-class inference”

Also true. Many tokens are generated by smaller models (Flash variants, Qwen, etc.) that don’t need 950-level compute. But the growth is in the frontier-class tokens — longer context, more complex reasoning, higher quality requirements. That’s exactly where 950-class chips are needed.

“But they can still buy H20 / smuggled H100”

H20 is less capable than 950PR per chip (Nvidia designed it as a cut-down part to comply with US export rules). And US export controls have made H100 procurement increasingly difficult. Relying on smuggled hardware is not a supply chain strategy.

What This Means

For Developers

Your inference costs are likely going down over the next 12 months, not up — even though demand is exploding. That’s unprecedented in any computing market. The driver isn’t efficiency gains or manufacturing scale. It’s a strategic subsidy by Chinese AI firms betting that locking in your API calls today is worth negative margins for a year.

Take the subsidy. But don’t assume today’s prices reflect tomorrow’s costs — they reflect tomorrow’s hopes.

For the Industry

The AI API market has entered a phase that looks like a price war but functions like an infrastructure land-grab. The playbook is AWS 2006, Uber 2015, DoorDash 2019: lose money on every transaction to own the default routing.

When the hardware does catch up — when Ascend 960 (2027) or 970 (2028) ships with 3-5x the throughput — the providers with the largest captive developer bases will convert negative margins to positive ones. Everyone else will be competing on price against incumbents they can’t dislodge.

The Bottom Line

DeepSeek’s permanent price cut is not evidence that Chinese AI compute supply has caught up with demand. The math shows it hasn’t — and won’t for at least 12-18 months. It’s evidence that DeepSeek is playing the long game: use today’s negative margins to own tomorrow’s default inference route, and trust that Huawei’s future chips will eventually close a gap that’s currently 3-5x wider than headlines suggest.

The 75% cut isn’t a cost breakthrough. It’s a bet that developer lock-in is worth more than current margins, and that the 750,000 Ascend 950 chips shipping this year are just the beginning.

Numbers sourced from: National Data Administration (China daily token data, March 2026), Huawei Connect 2025 (Ascend 950 specs and roadmap), SCMP/DW (ByteDance order volume), DeepSeek official pricing page (May 2026). Throughput calculations based on published Atlas 950 SuperNode benchmarks. Growth projections assume continuation of 40%/quarter rate per published data.