What do those CVSS letters mean? A guide to understanding it once and for all

Every time a major CVE drops, someone pastes the CVSS vector into the team chat and everyone pretends to understand it. Spoiler: most people only look at the number (9.1 CRITICAL) and ignore the rest.

The problem is that the number only tells you how severe it is. The vector tells you why, and that completely changes how you respond.

First: what is CVSS?

CVSS (Common Vulnerability Scoring System) is a scoring system for describing security vulnerabilities. It doesn't just give you a number: it gives you a vector, which is basically a compressed description of how the attack works.

There are two versions you'll see often: v3.1 (the most common today) and v4.0 (newer and more detailed). I'll explain both.

💡 The scoring scale
0.1–3.9 LOW
4.0–6.9 MEDIUM
7.0–8.9 HIGH
9.0–10.0 CRITICAL

The burglar analogy

To understand the CVSS vector, imagine that a vulnerability is a way to break into a house. The CVSS vector answers these questions about the "burglary":

  🏠 Analogy
  **Where can the burglar attack from?** From the street, or do they have to be in the yard? *(Attack Vector)*
  **Is it hard to get in?** Open door or high-security lock? *(Attack Complexity)*
  **Do they need a key?** Or do they get in with nothing? *(Privileges Required)*
  **Does someone have to open the door from inside?** *(User Interaction)*
  **What can be stolen?** Documents, furniture, or can they break things too? *(Impacts: C/I/A)*

CVSS v3.1, letter by letter

Let's take this real vector from CVE-2024-9465 (Palo Alto Expedition):

  CVSS v3.1
  CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N





| Code | Name | Value in this CVE | What it means in plain words |
|---|---|---|---|
| AV:N | Attack Vector: Network | 🔴 Dangerous | The attacker doesn't need to be physically nearby. They can attack from anywhere in the world over the internet. *(N=Network, A=Adjacent, L=Local, P=Physical)* |
| AC:L | Attack Complexity: Low | 🔴 Dangerous | The attack is easy to carry out. It requires no special conditions, exact timing, or advanced knowledge. Anyone with the exploit can do it. *(L=Low, H=High)* |
| PR:N | Privileges Required: None | 🔴 Dangerous | The attacker needs no prior account or password. They show up, attack, done. *(N=None, L=Low, H=High)* |
| UI:N | User Interaction: None | 🔴 Dangerous | No user has to click anything, open any file, or visit any link. The attack works on its own. *(N=None, R=Required)* |
| S:U | Scope: Unchanged | ⚪ Neutral | The impact stays within the attacked system. It doesn't automatically "jump" to other systems. *(U=Unchanged, C=Changed)* |
| C:H | Confidentiality: High | 🔴 Critical | All confidential information is exposed: passwords, API keys, configurations. The attacker can read everything. *(N=None, L=Low, H=High)* |
| I:H | Integrity: High | 🔴 Critical | The attacker can modify or create data. In this case, they can write arbitrary files on the system. *(N=None, L=Low, H=High)* |
| A:N | Availability: None | 🟢 No impact | The attacker can't take the system down. The service stays available while it is quietly exploited. *(N=None, L=Low, H=High)* |

⚠️ How to read the result
AV:N + AC:L + PR:N + UI:N in the same vector = "anyone on the internet, with no effort, no account, and no help from anyone" can carry out the attack. That, combined with C:H, is the worst possible combination for confidential data.
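To see where the 9.1 for this vector comes from, here is a minimal sketch of the CVSS v3.1 base score arithmetic. The metric weights and the round-up rule follow the published v3.1 specification; the function and dictionary names are illustrative, and the sketch deliberately covers only Scope: Unchanged vectors to stay short.

```python
import math

# Metric weights from the CVSS v3.1 specification (Scope: Unchanged values).
WEIGHTS = {
    "AV": {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20},
    "AC": {"L": 0.77, "H": 0.44},
    "PR": {"N": 0.85, "L": 0.62, "H": 0.27},
    "UI": {"N": 0.85, "R": 0.62},
    "C": {"H": 0.56, "L": 0.22, "N": 0.0},
    "I": {"H": 0.56, "L": 0.22, "N": 0.0},
    "A": {"H": 0.56, "L": 0.22, "N": 0.0},
}


def roundup(value):
    """Round up to one decimal place using the integer method from the v3.1 spec."""
    scaled = round(value * 100000)
    if scaled % 10000 == 0:
        return scaled / 100000.0
    return (math.floor(scaled / 10000) + 1) / 10.0


def base_score(vector):
    """Base score for a CVSS:3.1 vector with S:U (illustrative helper)."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    metrics = dict(pair.split(":", 1) for pair in pairs)
    if metrics["S"] != "U":
        raise NotImplementedError("this sketch only covers Scope: Unchanged")

    w = {name: WEIGHTS[name][metrics[name]] for name in WEIGHTS}
    iss = 1 - (1 - w["C"]) * (1 - w["I"]) * (1 - w["A"])
    impact = 6.42 * iss
    exploitability = 8.22 * w["AV"] * w["AC"] * w["PR"] * w["UI"]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))


def severity(score):
    """Map a score onto the qualitative scale."""
    if score == 0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


score = base_score("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N")
print(score, severity(score))  # 9.1 CRITICAL
```

Swapping A:N for A:H in the same vector raises the result to 9.8, which shows how little headroom is left once the four exploitability metrics are all at their worst values.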

CVSS v4.0: what changes?

Version 4.0 is newer and more detailed. It splits the impact into two parts: the system directly attacked (Vulnerable System) and other systems that could be affected (Subsequent System).

  CVSS v4.0
  CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA:N/SC:H/SI:N/SA:N





| Code | Name | Value | What it means |
|---|---|---|---|
| AV:N | Attack Vector: Network | 🔴 | Same as in v3.1: attack over the internet, without being nearby. |
| AC:L | Attack Complexity: Low | 🔴 | Easy to carry out, no special conditions. |
| AT:N | Attack Requirements: None | 🔴 | **New in v4.0.** The attack doesn't depend on any external condition outside the attacker's control (such as active sessions or specific configurations). *(N=None, P=Present)* |
| PR:N | Privileges Required: None | 🔴 | No account, no authentication. |
| UI:N | User Interaction: None | 🔴 | Nobody has to do anything for the attack to work. |
| VC:H | Vulnerable System Confidentiality: High | 🔴 | **The attacked system (Expedition):** all of its information is exposed. Hashes, configs, API keys. |
| VI:L | Vulnerable System Integrity: Low | 🟡 | **The attacked system:** the attacker can modify some data but doesn't have full write control. Partial integrity impact. |
| VA:N | Vulnerable System Availability: None | 🟢 | **The attacked system:** keeps running. No denial of service. |
| SC:H | Subsequent System Confidentiality: High | 🔴 | **Other systems (the PAN-OS firewalls):** because the API keys are exposed, the firewalls' confidentiality is compromised too. The damage spreads. |
| SI:N | Subsequent System Integrity: None | 🟢 | **Other systems:** the attacker can't modify data on the firewalls directly through this vector. |
| SA:N | Subsequent System Availability: None | 🟢 | **Other systems:** this attack can't take the firewalls down. |

💡 The big improvement in v4.0
v4.0 separates the impact into VC/VI/VA (the system directly attacked) and SC/SI/SA (systems affected afterwards). In this CVE, that is key: Expedition gets SC:H because the exposed API keys compromise the firewalls. v3.1 didn't capture that chain effect well.
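The VC/VI/VA versus SC/SI/SA split is easy to make visible in code. The sketch below groups a v4.0 vector into its exploitability, vulnerable-system, and subsequent-system parts. The helper names and the grouping dictionary are illustrative, and it only handles the eleven base metrics shown in this post.

```python
# Illustrative grouping of the CVSS v4.0 base metrics used in this post.
V4_GROUPS = {
    "exploitability": ["AV", "AC", "AT", "PR", "UI"],
    "vulnerable_system": ["VC", "VI", "VA"],
    "subsequent_system": ["SC", "SI", "SA"],
}


def split_v4_vector(vector):
    """Return the base metrics of a CVSS:4.0 vector, grouped by what they describe."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:4.0":
        raise ValueError("expected a CVSS:4.0 vector")
    metrics = dict(pair.split(":", 1) for pair in pairs)
    return {
        group: {name: metrics[name] for name in names}
        for group, names in V4_GROUPS.items()
    }


def spreads_beyond_target(vector):
    """True when any subsequent-system impact is not None."""
    subsequent = split_v4_vector(vector)["subsequent_system"]
    return any(value != "N" for value in subsequent.values())


vector = "CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:L/VA:N/SC:H/SI:N/SA:N"
for group, values in split_v4_vector(vector).items():
    print(group, values)
print("spreads beyond target:", spreads_beyond_target(vector))
```

For the Expedition vector, the last line reports that the damage spreads, driven entirely by SC:H.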

Visual summary: how to read a vector quickly

CVSS:3.1 / AV:? / AC:? / PR:? / UI:? / S:? / C:? / I:? / A:?
           │      │      │      │      │     │     │     └─ Does the service go down?
           │      │      │      │      │     │     └─────── Can the attacker modify data?
           │      │      │      │      │     └───────────── Can the attacker read private data?
           │      │      │      │      └─────────────────── Does the damage spread to other systems?
           │      │      │      └────────────────────────── Does someone have to click?
           │      │      └───────────────────────────────── Is an account or password needed?
           │      └──────────────────────────────────────── Is it hard to carry out?
           └─────────────────────────────────────────────── Where can the attacker strike from?

Risk values (from highest to lowest):
  N (None) on PR/UI, or H (High) on C/I/A = 🔴  →  Worst case
  L (Low) on C/I/A                        = 🟡  →  Partial impact
  N (None) on C/I/A                       = 🟢  →  No effect in that category

The sentence that sums it all up

When you see a vector like the one for CVE-2024-9465, translate it into a single sentence before sending it to the team:

  📝 Plain-language translation
  **"Anyone on the internet can attack this system without credentials or help from anyone, and gain full access to all confidential data, including the keys to your firewalls."**

That is what the vector says. Now you know why it scores a 9.2.
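The same translation step can be roughed out in code. The sketch below walks a v3.1 vector and answers one plain-language question per metric. The question wording and the helper name are my own shorthand for illustration, not official CVSS text.

```python
# One question per metric, with a short answer for each possible value.
# Wording is illustrative shorthand, not taken from the CVSS specification.
QUESTIONS = {
    "AV": ("Where can the attacker strike from?",
           {"N": "anywhere on the network", "A": "an adjacent network",
            "L": "local access only", "P": "physical access only"}),
    "AC": ("Is it hard to carry out?", {"L": "no", "H": "yes"}),
    "PR": ("Is an account needed?",
           {"N": "no", "L": "yes, a low-privilege one", "H": "yes, a high-privilege one"}),
    "UI": ("Does someone have to click?", {"N": "no", "R": "yes"}),
    "S": ("Does the damage spread to other systems?", {"U": "no", "C": "yes"}),
    "C": ("Can private data be read?", {"N": "no", "L": "partially", "H": "yes, all of it"}),
    "I": ("Can data be modified?", {"N": "no", "L": "partially", "H": "yes, freely"}),
    "A": ("Can the service be taken down?", {"N": "no", "L": "partially", "H": "yes"}),
}


def explain(vector):
    """Return one 'metric  question answer' line per metric of a CVSS:3.1 vector."""
    prefix, *pairs = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError("expected a CVSS:3.1 vector")
    lines = []
    for pair in pairs:
        metric, value = pair.split(":", 1)
        question, answers = QUESTIONS[metric]
        lines.append(f"{pair}  {question} {answers[value]}")
    return lines


for line in explain("CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:N"):
    print(line)
```

Pasting that output next to the raw vector in the team chat gives everyone the "why" without anyone having to memorize the letters.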

Conclusion

The CVSS number tells you whether to worry. The vector tells you how to worry. AV:N/AC:L/PR:N/UI:N together is the most dangerous combination there is: easy, remote, and dependent on no one. When you see it, act first and analyze later.

✅ Rule of thumb
If the first 4 fields are AV:N / AC:L / PR:N / UI:N, the attacker can be anyone on the internet, attacking with no effort, no account, and no help. Patch today.


byron.lainez
© 2026 · Guatemala 🇬🇹

SimGemma: Democratizing STEM Education with Offline-First AI Simulations


Introduction

Imagine a classroom in a remote village. There’s a blackboard, a few passionate teachers, and curious students. What’s missing? A high-end physics lab. Even more challenging? A stable internet connection.

Physics is a subject that demands exploration. It’s hard to grasp the beauty of gravity or the silence of a vacuum from a two-dimensional drawing. This is why I built SimGemma—an offline-first, AI-powered platform designed to bring high-fidelity 3D science simulations to every classroom, regardless of connectivity.

I’m Damodharan, a Tech Lead, and I spend my weekends teaching math and science to kids through an NGO. I’ve always felt that teaching topics like pendulum motion or trigonometry on a blackboard didn’t do justice to the science. These concepts, along with things like molecular structures (methane, for instance), are simply better understood in 3D.

SimGemma was created for the Google Gemma Challenge to demonstrate how open-weights models like Gemma can solve real-world problems in resource-constrained environments.

The Problem: The “Blackboard Gap”

Traditional STEM education often suffers from two major hurdles:

  1. Static Learning: Concepts like “pendulum motion in a vacuum” are taught theoretically because recreating a vacuum in a classroom is expensive and difficult.
  2. The Connectivity Divide: Most modern educational tools require high-speed internet, leaving students in remote areas behind.

I used to hand-code these simulations in Three.js, but it was time-consuming and hard to scale. I needed a way to generate these artifacts on demand.

The Solution: SimGemma

SimGemma is a “Lab in a Box.” It allows educators to generate interactive 3D simulations using simple natural language.

Key Features:

  • On-Demand 3D Artifacts: Want to see how a pendulum behaves on the Moon? Just ask. Need to visualize a Methane molecule? Gemma’s got it.
  • Vibecoding for Teachers: Teachers don’t need to be coders. They can describe the “vibe” of the lesson, and SimGemma generates the simulation logic and 3D assets.
  • True Offline Architecture: Everything runs locally, from the AI model to the 3D rendering engine.

Simgemma - carbon

Simgemma - pendulum

Simgemma - Trigonometry

Simgemma – Product link

Technical Architecture: Powered by Gemma 4

The heart of SimGemma is the Gemma 4 model. We chose Gemma for its exceptional performance-to-size ratio, making it perfect for local deployment.

1. Hybrid Offline Inference

We implemented a two-tier offline approach:

  • Server-Side (Local): For complex simulation generation, we run Gemma 4 via Ollama or llama.cpp on a local machine (e.g., a teacher’s laptop).
  • Client-Side (In-Browser): Using an in-browser ONNX build of gemma4-e2b, we enable zero-server editing. This allows teachers to tweak simulation logic directly in the browser without needing any backend sandbox; everything is emulated in a local shell.
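As a rough illustration of the server-side tier, the sketch below sends a prompt to a locally running Ollama instance over its HTTP API. This is not SimGemma's actual code: the model tag `gemma4`, the example prompt, and the helper names are placeholders, and it assumes Ollama is listening on its default port (11434) with a suitable model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4"  # hypothetical tag; use whatever `ollama list` shows on your machine


def build_request(prompt, model=MODEL_TAG):
    """Build the POST request for a single, non-streaming completion."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model=MODEL_TAG, timeout=120.0):
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Needs a running Ollama server; nothing leaves the machine.
    print(generate("Write a React Three Fiber component for a pendulum in lunar gravity."))
```

Because the endpoint is `localhost`, the same call works with the network cable unplugged, which is the whole point of the offline-first design.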

2. Programmatic Video & 3D

  • Remotion: We use Remotion to programmatically create educational videos and presentations of these simulations.
  • React Three Fiber / Three.js: The simulations themselves are high-fidelity 3D artifacts that students can interact with.

The “Vibecoding” Experience

One of the most exciting aspects of SimGemma is what we call “Vibecoding.” In our NGO workshops, we’ve seen that the biggest barrier to using technology in the classroom isn’t lack of interest—it’s the complexity of the tools.

With Gemma 4, we’ve turned the creation process into a conversation. A teacher can say: “Show me a double pendulum where the second arm is twice as heavy, and let’s see it in Mars’ gravity.”

Gemma understands the physics constraints, generates the necessary React/Three.js code, and renders it instantly. It turns educators into creators.

Breaking the Language Barrier

Living in India, where the constitution recognizes 22 scheduled languages, I’ve seen how language can be a barrier to quality STEM content. Gemma 4’s translation capabilities are a game-changer. SimGemma can generate and translate these artifacts into regional languages like Tamil instantly. This means a teacher can create a simulation in English and have it ready for a Tamil-medium classroom in seconds, ensuring no student is left behind because of a language gap.

Impact: Bringing the Lab to the NGO

As a STEM volunteer, I’ve seen firsthand how an interactive simulation can light up a student’s eyes. SimGemma isn’t just about code; it’s about equity. It ensures that a child in a rural NGO workshop has access to the same quality of scientific exploration as a student in a tech-hub city.

Conclusion & Future Work

SimGemma proves that “Offline AI” isn’t a compromise; it’s a superpower. By leveraging the open weights of Gemma 4, we’ve built a tool that is resilient, private, and accessible.

We are currently looking into:

  • Expanding the library of physics primitives.
  • Improving the browser-native ONNX performance for even smaller devices.
  • Collaborating with more NGOs to deploy SimGemma “Lab-in-a-box” kits.

Links

  • GitHub Repository: Github
  • Video Demo: Youtube
  • Try Simgemma now!

#Gemma #AI #OpenSource #Education #STEM #Physics #Remotion #ThreeJS

scrcpy Integration in a Tauri App — Android Screen Mirroring on Mac

HiyokoKit includes Android remote control via scrcpy. Launching and managing scrcpy from a Tauri app has specific challenges.
Here’s how I handle it.

What scrcpy is

scrcpy is an open-source tool that mirrors and controls an Android device screen over ADB. It’s the best free option for Android screen mirroring on Mac — fast, low latency, no app required on the device.

Launching scrcpy from Rust

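use std::process::{Child, Command};

/// Handle to the scrcpy child process, if one is running.
#[derive(Default)]
pub struct ScrcpyProcess {
    child: Option<Child>,
}

impl ScrcpyProcess {
    /// Spawn scrcpy for one device. `AppError` is the app's own error type.
    pub fn start(
        &mut self,
        device_serial: &str,
        max_size: u32,
        bit_rate: &str,
    ) -> Result<(), AppError> {
        // Stop any previous mirror first so an old child is never leaked.
        self.stop();

        let child = Command::new("scrcpy")
            .args([
                "--serial", device_serial,
                "--max-size", &max_size.to_string(),
                "--video-bit-rate", bit_rate,
                "--window-title", "Android Mirror",
                "--no-audio",
            ])
            .spawn()
            .map_err(|e| AppError::Scrcpy(e.to_string()))?;

        self.child = Some(child);
        Ok(())
    }

    /// Kill the mirror window, if any.
    pub fn stop(&mut self) {
        if let Some(mut child) = self.child.take() {
            child.kill().ok();
            // Reap the process so it does not linger as a zombie.
            child.wait().ok();
        }
    }

    /// True while the child process has not exited.
    pub fn is_running(&mut self) -> bool {
        match &mut self.child {
            // try_wait() returns Ok(None) while the process is still alive.
            Some(child) => matches!(child.try_wait(), Ok(None)),
            None => false,
        }
    }
}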

Bundling scrcpy

scrcpy needs to be available on the user’s machine or bundled with your app. I bundle it in app resources as a universal binary:

{
  "bundle": {
    "resources": [
      "bin/scrcpy",
      "bin/adb"
    ]
  }
}

At runtime, get the resource path:

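use tauri::Manager; // brings `.path()` into scope on the app handle

// Pass this path to Command::new() instead of the bare "scrcpy" name.
let scrcpy_path = app_handle
    .path()
    .resource_dir()
    .expect("resource directory should be resolvable")
    .join("bin/scrcpy");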

Detecting when scrcpy exits

scrcpy exits when the user closes the mirror window. Detect this to update your UI:

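use std::time::Duration;
use tauri::Emitter; // brings `.emit()` into scope (Tauri v2)

// Poll in background. `scrcpy_state` is an Arc<Mutex<ScrcpyProcess>>.
tokio::spawn(async move {
    loop {
        tokio::time::sleep(Duration::from_secs(1)).await;

        // Hold the lock only for the check, never across an .await.
        let running = {
            let mut proc = scrcpy_state.lock().unwrap();
            proc.is_running()
        };

        if !running {
            // Tell the frontend the mirror window is gone.
            app_handle.emit("scrcpy-stopped", ()).ok();
            break;
        }
    }
});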

Multiple device support

scrcpy’s --serial flag selects a specific device when multiple are connected. Get the serial from adb devices and pass it explicitly:

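use tokio::process::Command; // async variant; std's Command::output() cannot be awaited

async fn get_device_serial() -> Result<String, AppError> {
    let output = Command::new("adb")
        .arg("devices")
        .output()
        .await
        .map_err(|e| AppError::Device(e.to_string()))?;

    let stdout = String::from_utf8_lossy(&output.stdout);
    stdout
        .lines()
        .skip(1) // skip the "List of devices attached" header
        .filter_map(|line| {
            // Each row is "<serial>\t<state>"; keep only rows whose state is
            // exactly "device" (not "offline" or "unauthorized").
            let mut cols = line.split_whitespace();
            match (cols.next(), cols.next()) {
                (Some(serial), Some("device")) => Some(serial.to_string()),
                _ => None,
            }
        })
        .next()
        .ok_or_else(|| AppError::Device("No device found".into()))
}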

If this was useful, a ❤️ helps more than you’d think — thanks!

HiyokoKit (includes scrcpy-based Android remote control) → https://hiyokomtp.lemonsqueezy.com/checkout/buy/2c94dd0f-e28a-4a17-8efc-7bd93087d46d

X → @hiyoyok

Diffusion Language Models Are Here: Deep Dive into NVIDIA’s Nemotron-Labs DLM Architecture

Meta Description: NVIDIA just open-sourced Nemotron-Labs Diffusion — a family of 3B, 8B, and 14B diffusion language models that merge autoregressive and diffusion generation for up to 6.4× faster inference. Here’s the complete technical deep dive into the architecture, training methodology, three generation modes, and how to run it today with SGLang.


Table of Contents

  1. The Speed Wall Autoregressive LLMs Hit
  2. What Are Diffusion Language Models?
  3. Why DLMs Struggled — Until Now
  4. NVIDIA’s AR-to-DLM Breakthrough: Efficient-DLM
  5. Nemotron-Labs Diffusion: The Model Family
  6. Three Generation Modes: AR, Diffusion, Self-Speculation
  7. Performance Deep Dive: The Numbers That Matter
  8. Under the Hood: Block-Wise Attention & KV Caching
  9. Getting Started: Running with SGLang
  10. What This Means for Production LLM Infrastructure
  11. Conclusion & The Road Ahead

1. The Speed Wall Autoregressive LLMs Hit

Every language model you’ve ever used — GPT-4, Claude, Llama, Qwen — generates text the same fundamental way: one token at a time, left to right, each new token conditioned on every previous one. It’s called autoregressive (AR) generation, and it’s been the undisputed king of language modeling since the original GPT paper in 2018.

But AR generation has a dirty secret. It’s not a compute-bound problem. It’s a memory-bandwidth-bound problem.

Here’s why that matters: each new token requires a full model forward pass. That means loading all the model’s weights (about 14 GB for a 7B model in FP16) from HBM (High Bandwidth Memory) into the GPU’s compute cores on every single decoding step. On modern GPUs, the arithmetic throughput is enormous, but the memory bandwidth is the bottleneck. This is why serving an LLM at batch size 1, a single user chatting with your model, leaves your GPU vastly underutilized.

The math is brutal. An A100 80GB GPU has ~2 TB/s of HBM bandwidth. A 7B-parameter model in FP16 takes ~14 GB. Reading all the weights once takes at least ~7 ms per step, which caps a single stream at roughly 140 tokens/second no matter how fast the arithmetic units are; the actual math for one token finishes in a small fraction of that time. Scale this to a production API endpoint handling thousands of concurrent users, and the economics become painful.
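The back-of-the-envelope numbers above can be reproduced in a few lines. This is only the bandwidth floor: it ignores KV-cache reads, kernel launch overhead, and everything else that makes real decoding slower. The function name is illustrative.

```python
def min_step_ms(params_billion, bytes_per_param, bandwidth_tb_s):
    """Lower bound on one decoding step: time to stream every weight from HBM once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3


# 7B parameters, FP16 (2 bytes each), ~2 TB/s of HBM bandwidth.
step_ms = min_step_ms(params_billion=7, bytes_per_param=2, bandwidth_tb_s=2.0)
max_tokens_per_s = 1000.0 / step_ms

print(f"weights: {7 * 2} GB")
print(f"min step time: {step_ms:.1f} ms")
print(f"single-stream ceiling: {max_tokens_per_s:.0f} tokens/s")
```

Quantizing to 1 byte per parameter halves the step time, which is why weight-only quantization helps batch-size-1 latency so directly; it shrinks the loop but does not remove it.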

The community has attacked this problem from many angles: speculative decoding (using a small draft model to propose tokens verified by the large model), quantization (FP8, INT4 to shrink weight footprint), and FlashAttention (an IO-aware attention kernel that cuts memory traffic during attention). These are all incremental improvements on the same fundamental loop.

NVIDIA’s Nemotron-Labs Diffusion — released on HuggingFace on May 23, 2026 — is taking a fundamentally different approach. Instead of optimizing the autoregressive loop, it breaks the loop entirely.

2. What Are Diffusion Language Models?

If you’ve worked with image generation models (Stable Diffusion, DALL·E, Flux), you already know the concept of denoising diffusion. The idea is to start with pure noise and iteratively denoise it, guided by a conditioning signal, until you arrive at a coherent output.

Diffusion Language Models (DLMs) apply this same paradigm to text. Instead of generating tokens left-to-right, a DLM:

  1. Starts with a sequence of masked or noisy tokens (analogous to Gaussian noise in image diffusion)
  2. Runs multiple denoising refinement steps, predicting the clean token distribution at each step
  3. After several iterations, the entire sequence — or a large block of it — converges to the final output

Autoregressive vs Diffusion Language Model Architecture

The key theoretical advantage is parallelism. In a standard AR model, token t can only be generated after token t-1 exists. In a DLM, all positions in a block are refined simultaneously in each forward pass. This changes the computational profile dramatically: instead of being memory-bandwidth-bound by sequential weight loads, the GPU can be kept busy with dense matrix multiplications across the full block.

The conceptual roots of DLMs trace back to Masked Diffusion Language Models (MDLMs) — work like MDLM (Sahoo et al., 2024) and SEDD (Lou et al., 2023) — that framed text generation as a discrete denoising process over masked token sequences. However, these models had significant practical shortcomings when compared to the state-of-the-art AR models of the day. NVIDIA’s work specifically addresses why, and more importantly, how to fix it.

3. Why DLMs Struggled — Until Now

The community has known about the theoretical appeal of diffusion language models for years. The reason they haven’t taken over is a cluster of practical barriers that made them non-competitive with AR models in production:

1. Accuracy Gap: DLMs trained from scratch consistently underperformed comparably-sized AR models on standard benchmarks. The discrete, iterative denoising process is harder to optimize than the clean causal language modeling objective. Models like Dream 7B were impressive for DLMs, but still lagged behind Qwen3 4B — a smaller AR model — on reasoning and knowledge tasks.

2. Training Instability: Jointly learning to denoise across many noise levels with a bidirectional attention mask creates a different gradient landscape than causal language modeling. Loss curves are noisier, and the model is more sensitive to hyperparameter choices.

3. No KV Cache Compatibility: This was the killer for inference efficiency. KV caching — where you store key/value activations from previous tokens to avoid recomputing them — is the single most important optimization for AR inference. Standard DLMs use fully bidirectional attention across the entire sequence, which means you can’t cache anything: every refinement step needs to attend over all positions with the updated token states. This essentially erased the theoretical throughput advantage.

4. Fill-in-the-Middle Mismatch: During DLM training, tokens are masked uniformly at random across the sequence. But at inference time, the model typically has a left-side prefix (the prompt) that is fully unmasked, and must fill in the right side. This creates a training-test distribution mismatch that degrades quality.

Each of these problems has a specific technical solution in NVIDIA’s Efficient-DLM framework. Let’s dig in.

4. NVIDIA’s AR-to-DLM Breakthrough: Efficient-DLM

The foundational insight behind Nemotron-Labs Diffusion (and the academic paper it builds on, arXiv:2512.14067) is deceptively simple: don’t train DLMs from scratch — convert pretrained AR models into DLMs.

This avoids the accuracy gap problem entirely. You start with a model that already has world-class knowledge and reasoning capabilities baked into its weights, then teach it to also generate diffusion-style. The result is a model that retains AR accuracy while gaining diffusion parallelism.

But there are two critical technical challenges to solve for this conversion to work.

4.1 Block-Wise Attention: Preserving Weights, Enabling KV Caching

The attention mechanism is the crux of the problem. A standard AR model uses causal (lower-triangular) attention — each token attends only to itself and all previous tokens. A standard DLM uses bidirectional (full) attention — every token attends to every other token.

The issue: if you convert an AR model and suddenly change to fully bidirectional attention, you’ve broken the statistical assumptions baked into all those attention weights during pretraining. The key-value projections were trained to operate in a causal setting; they “expect” not to see future tokens. Loading them into a fully bidirectional context produces degraded output and requires extensive retraining to recover.

Efficient-DLM introduces block-wise causal attention as the solution:

  • The sequence is divided into non-overlapping blocks of size B (e.g., 32 tokens)
  • Within each block: full bidirectional attention (every token attends to every other token in the block)
  • Across blocks: standard left-to-right causal attention (block i can attend to blocks 0 through i-1)

Block-wise Attention Pattern with KV Caching

This hybrid pattern does something clever: it’s structurally similar enough to causal attention that pretrained weight distributions are preserved — the model only needs to learn bidirectionality locally within blocks, not globally across the whole sequence. The result is a much smoother conversion that requires far less compute to recover quality.

Crucially, this also re-enables KV caching. Since attention is still causal across blocks, the KV activations of completed (committed) blocks can be cached and reused exactly like in a standard AR model. Only the current block being refined needs to be recomputed each refinement step.

4.2 Position-Dependent Token Masking

The second innovation addresses the training-test distribution mismatch. Instead of masking tokens uniformly at random during training, Efficient-DLM uses a position-dependent masking strategy that assigns higher masking probabilities to tokens in later positions in the sequence.

The intuition: at inference time, when filling in a response to a prompt, earlier tokens in the response have already been decided (or are more constrained by the left-side context), while later tokens remain more uncertain. By skewing the training mask distribution to match this pattern, the model learns a denoising objective that better mirrors what it actually faces at test time.
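To make the idea concrete, here is a toy version of position-dependent masking. The linear ramp from `p_first` to `p_last` is purely an assumption for illustration; the actual schedule used by Efficient-DLM may have a different shape and different parameters.

```python
import random


def mask_probabilities(length, p_first=0.2, p_last=0.9):
    """Masking probability per position, rising linearly from p_first to p_last.

    The linear shape and both endpoints are hypothetical choices for this sketch.
    """
    if length == 1:
        return [p_last]
    step = (p_last - p_first) / (length - 1)
    return [p_first + step * i for i in range(length)]


def sample_mask(length, rng, p_first=0.2, p_last=0.9):
    """Draw one training mask: True means the token at that position is masked."""
    probs = mask_probabilities(length, p_first, p_last)
    return [rng.random() < p for p in probs]


rng = random.Random(0)
length, trials = 16, 2000
counts = [0] * length
for _ in range(trials):
    for i, masked in enumerate(sample_mask(length, rng)):
        counts[i] += masked

early = sum(counts[: length // 2]) / (trials * length // 2)
late = sum(counts[length // 2:]) / (trials * length // 2)
print(f"mask rate, first half: {early:.2f}  second half: {late:.2f}")
```

With uniform masking the two halves would be masked at the same rate; the ramp makes the second half look like the "still undecided" tail the model faces at inference time.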

4.3 Joint AR + Diffusion Training Objective

Rather than optimizing purely for the diffusion objective, Nemotron-Labs Diffusion is trained with a joint AR and diffusion loss:

L_total = λ · L_AR + (1 - λ) · L_diffusion

Where L_AR is the standard cross-entropy causal language modeling loss and L_diffusion is the masked diffusion objective. This joint training ensures the model remains a first-class AR model while learning the diffusion generation capability.

The pretrained base was trained on 1.3 trillion tokens from NVIDIA’s Nemotron pretraining datasets, with an additional 45 billion tokens of supervised fine-tuning data for the instruct-tuned variants.

5. Nemotron-Labs Diffusion: The Model Family

NVIDIA released seven model checkpoints on HuggingFace under the NVIDIA Nemotron Open Model License (commercially friendly for text models):

| Model | Parameters | Type | Downloads (Day 1) |
|---|---|---|---|
| nvidia/Nemotron-Labs-Diffusion-3B | ~4B | Text, Instruct | 14.7K |
| nvidia/Nemotron-Labs-Diffusion-3B-Base | ~4B | Text, Base | 14.2K |
| nvidia/Nemotron-Labs-Diffusion-8B | 8B | Text, Instruct | 24.1K |
| nvidia/Nemotron-Labs-Diffusion-8B-Base | 8B | Text, Base | 228K |
| nvidia/Nemotron-Labs-Diffusion-14B | 14B | Text, Instruct | 3.28K |
| nvidia/Nemotron-Labs-Diffusion-14B-Base | 14B | Text, Base | 1.18K |
| nvidia/Nemotron-Labs-Diffusion-VLM-8B | ~9B | Vision-Language | 590 |

The 8B Base model being the most downloaded (228K in under 2 days) reflects developer interest in using it as a foundation for custom fine-tuning.

6. Three Generation Modes: AR, Diffusion, Self-Speculation

The standout design decision in Nemotron-Labs Diffusion is that all three generation modes are supported from a single checkpoint. You don’t need different models — just a different deployment config in SGLang.

Three Generation Modes Performance Comparison

Mode 1: Autoregressive (ar_mode=true)

Standard left-to-right token generation, identical to how you’d run any other causal LM. This mode is the correctness baseline — most useful for debugging, A/B testing against existing pipelines, or when you need strict adherence to specific decoding behaviors.

Use when: Debugging, regression testing, or exact reproduction of AR outputs.

Mode 2: Diffusion / FastDiffuser (diffusion_mode=true)

The model fills in a block of 32 tokens simultaneously, running multiple denoising refinement steps per block. A confidence threshold determines which tokens are “committed” after each refinement pass — tokens whose predicted distribution is peaked enough get locked in, reducing the number of positions that need further refinement.

The process per block:

  1. Initialize block positions with mask tokens
  2. Forward pass with block-wise attention → predict token distributions over all positions
  3. Commit tokens above confidence threshold; keep others masked
  4. Repeat steps 2–3 until all positions are committed or max steps reached
  5. Move to next block, using committed block tokens in KV cache
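The per-block loop above can be sketched with a toy stand-in for the model. Everything model-related here is made up for illustration: `toy_forward` and `BASE_CONFIDENCE` fake a forward pass whose confidence rises as more of the block is committed, and confidences are integers from 0 to 100 to keep the arithmetic exact. The "commit the single best position when nothing clears the threshold" rule is an assumption added so the loop always makes progress.

```python
MASK = None  # placeholder for a still-masked position

# Hypothetical per-position "easiness" (0-100) standing in for real model confidence.
BASE_CONFIDENCE = [90, 40, 80, 60, 95, 50, 85, 30]


def toy_forward(block, target):
    """Stand-in for one forward pass: (predicted token, confidence) per position.

    Confidence grows with the number of already-committed tokens in the block.
    """
    committed = sum(tok is not MASK for tok in block)
    bonus = 40 * committed // len(block)
    return [(target[i], min(100, BASE_CONFIDENCE[i] + bonus)) for i in range(len(block))]


def decode_block(target, threshold=80, max_steps=16):
    """Confidence-threshold decoding of one block; returns (tokens, forward passes)."""
    block = [MASK] * len(target)           # step 1: start fully masked
    steps = 0
    while MASK in block and steps < max_steps:
        preds = toy_forward(block, target)  # step 2: one parallel forward pass
        steps += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        ready = [i for i in masked if preds[i][1] >= threshold]
        if steps == max_steps:
            ready = masked                  # out of budget: commit best guesses
        elif not ready:
            # Assumption: always commit at least the most confident position.
            ready = [max(masked, key=lambda i: preds[i][1])]
        for i in ready:                     # step 3: commit, keep the rest masked
            block[i] = preds[i][0]
    return block, steps


target = ["The", "cat", "sat", "on", "the", "mat", "today", "."]
tokens, passes = decode_block(target)
print(tokens, f"in {passes} forward passes")
```

In this toy run, eight tokens are committed in five forward passes, so tokens per forward pass is above one; raising the threshold trades that speed back for caution.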

Achieves 2.6× higher tokens per forward pass (TPF) compared to AR.

Use when: High-throughput batch serving where speed matters more than exact AR equivalence.

Mode 3: Self-Speculation / LinearSpec (self_speculation=true)

This is the most sophisticated mode — it fuses diffusion and autoregressive decoding into a single hybrid loop:

  1. The model uses diffusion to draft a full block of k candidate tokens bidirectionally (fast, parallel)
  2. It then uses autoregressive decoding to verify the draft tokens causally from left to right
  3. Any prefix of the draft that matches what AR would have produced gets committed
  4. The process restarts from the first disagreement position

The same model plays both roles (drafter and verifier). Output is lossless vs AR at temperature=0.
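A toy version of the draft-then-verify loop shows why the output matches plain AR decoding. Both "models" below are invented for illustration: `ar_next` is a deterministic stand-in for greedy AR decoding, and `draft_block` imitates a drafter that is deliberately wrong at every fourth position. As in standard greedy speculative decoding, the verifier's own token is kept at the first mismatch; that detail is an assumption about how the restart is handled.

```python
def ar_next(prefix):
    """Toy greedy AR model: the next token is a fixed function of the prefix."""
    return (sum(prefix) * 7 + len(prefix)) % 10


def draft_block(prefix, k):
    """Toy drafter: agrees with ar_next except at every 4th absolute position."""
    out = list(prefix)
    for _ in range(k):
        tok = ar_next(out)
        if len(out) % 4 == 3:
            tok = (tok + 1) % 10  # inject a disagreement
        out.append(tok)
    return out[len(prefix):]


def ar_decode(prompt, n_new):
    """Baseline: one token per round."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(ar_next(seq))
    return seq[len(prompt):]


def self_speculate(prompt, n_new, k=4):
    """Draft k tokens, keep the prefix AR agrees with, fix the first mismatch."""
    seq = list(prompt)
    rounds = 0
    while len(seq) - len(prompt) < n_new:
        draft = draft_block(seq, k)
        rounds += 1
        for tok in draft:
            expected = ar_next(seq)   # what AR decoding would emit here
            seq.append(expected)      # equals tok when the draft is accepted
            if tok != expected:
                break                 # discard the rest of the draft
    return seq[len(prompt):len(prompt) + n_new], rounds


prompt = [1, 2, 3, 4]
spec_tokens, rounds = self_speculate(prompt, n_new=12)
print(spec_tokens == ar_decode(prompt, 12), f"{rounds} rounds for 12 tokens")
```

Every committed token is one the verifier would have produced itself, so the sequence is identical to the AR baseline; the drafter only changes how many rounds it takes to get there.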

Key numbers: LinearSpec achieves ~6× higher TPF than AR, and ~865 tokens/second on NVIDIA B200 hardware — roughly 4× the AR baseline on identical hardware.

Use when: Production serving where you need maximum speed with no quality compromise.

7. Performance Deep Dive: The Numbers That Matter

Accuracy vs Qwen3 8B:
Nemotron-Labs Diffusion 8B achieves +1.2% higher average accuracy compared to Qwen3 8B across evaluated benchmarks. The DLM conversion doesn’t hurt quality — it slightly improves it, likely because the joint AR+diffusion training objective acts as an additional regularizer.

vs Dream 7B (prior DLM SOTA):
Efficient-DLM 8B achieves +5.4% higher accuracy and 4.5× higher throughput compared to Dream 7B — a decisive improvement over the previous DLM state-of-the-art.

Throughput (Tokens Per Forward Pass — TPF):

| Mode | TPF (relative to AR) | Quality vs AR |
|---|---|---|
| Autoregressive | 1× (baseline) | Exact match |
| Diffusion (FastDiffuser) | 2.6× | Slightly different |
| Self-Spec Linear (LinearSpec) | ~6× | Lossless at T=0 |
| Self-Spec Quadratic (QuadSpec) | ~6.4× | Lossless at T=0 |

TPF (Tokens Per Forward Pass) is a hardware-agnostic efficiency metric — it measures how many output tokens you get per model forward pass, making it useful for comparing across different GPU generations.

8. Under the Hood: Block-Wise Attention & KV Caching

Let’s look at exactly how the block-wise attention mechanism enables KV caching in a DLM setting.

In standard AR decoding, the KV cache stores the key and value projections for every previously generated token. When generating token t, the model attends to cached KV from tokens 0...(t-1) and computes new Q, K, V for position t only — O(1) cache update per step.

In a standard bidirectional DLM, this is impossible: since every token attends to every other token, and token values change with each refinement step, you’d need to recompute the entire KV matrix every step — O(n²) per refinement, no caching benefit.

Block-wise causal attention resolves this with a two-level hierarchy:

Sequence: [Block 0 | Block 1 | Block 2 | ... | Block N]

For a token in Block i:
  - Attends to ALL tokens in blocks 0...(i-1)  → cached KV (never recomputed)
  - Attends to ALL tokens within Block i        → bidirectional, recomputed each step
  - CANNOT attend to tokens in blocks (i+1)+   → causal constraint maintained

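For a 32-token block size and a 2048-token sequence, the cached share grows as decoding moves right: while the final block is being refined, 2016 of the 2048 positions (98.4%) are served from cache, and only the 32 positions in the active block are recomputed at each refinement step.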
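The 98.4% figure can be checked with a few lines of arithmetic. This is a minimal sketch that counts KV positions only (not FLOPs), and the helper name `cached_fraction` is made up for illustration.

```python
def cached_fraction(block_index: int, block_size: int) -> float:
    """Share of visible KV positions served from cache while refining
    the block at `block_index` (0-based)."""
    cached = block_index * block_size   # all earlier blocks, frozen in the cache
    active = block_size                 # current block, recomputed every step
    return cached / (cached + active)

seq_len, block_size = 2048, 32
num_blocks = seq_len // block_size      # 64 blocks

for i in (0, 1, 7, num_blocks - 1):
    print(f"block {i:2d}: {cached_fraction(i, block_size):.1%} cached")
```

The first block has nothing to reuse, the second reuses half of what it sees, and the last block (index 63) reaches 63/64, which is the 98.4% quoted above.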

Here’s how to build the attention mask in PyTorch:

import torch

def build_block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """
    Build a block-wise causal attention mask.

    Within each block: full bidirectional attention (True)
    Across blocks: causal left-to-right attention (True only for past blocks)
    Future blocks: masked out (False → -inf in softmax)

    Returns a boolean mask of shape [seq_len, seq_len],
    where True = can attend, False = masked.
    """
    # Assign each position to its block. Integer division also covers a
    # trailing partial block when seq_len is not a multiple of block_size.
    block_ids = torch.arange(seq_len) // block_size

    # Query i may attend to key j iff j's block is not after i's block:
    # same block → bidirectional, earlier block → causal, later block → masked.
    mask = block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

    return mask


# Example: 4 blocks of 4 tokens each = 16 token sequence
mask = build_block_causal_mask(seq_len=16, block_size=4)
print(mask.int())

# Output (each row = query token, each col = key token):
# Block 0 rows: [1111 | 0000 | 0000 | 0000]
# Block 1 rows: [1111 | 1111 | 0000 | 0000]
# Block 2 rows: [1111 | 1111 | 1111 | 0000]
# Block 3 rows: [1111 | 1111 | 1111 | 1111]

The resulting mask has fully connected 4×4 diagonal blocks (bidirectional within blocks) and a lower-triangular structure across block boundaries (causal across blocks). It is the AR causal mask coarsened to block granularity, which keeps the converted model close to the attention pattern its pretrained AR weights were trained with.
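To see why this mask makes caching safe, the sketch below feeds a block-causal mask to PyTorch's `scaled_dot_product_attention` and checks that changing the second block's keys and values leaves the first block's outputs untouched. The tensor sizes are arbitrary, and `block_causal_mask` is a compact stand-in for the mask builder above, not code from the model's implementation.

```python
import torch
import torch.nn.functional as F

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean [seq_len, seq_len] mask, True = query may attend to key."""
    block_ids = torch.arange(seq_len) // block_size
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

torch.manual_seed(0)
seq_len, block_size, head_dim = 8, 4, 16
q = torch.randn(1, 1, seq_len, head_dim)   # [batch, heads, seq, dim]
k = torch.randn(1, 1, seq_len, head_dim)
v = torch.randn(1, 1, seq_len, head_dim)
mask = block_causal_mask(seq_len, block_size)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Perturb keys/values of the second block only, as a refinement step would.
k2, v2 = k.clone(), v.clone()
k2[:, :, block_size:] += 1.0
v2[:, :, block_size:] += 1.0
out2 = F.scaled_dot_product_attention(q, k2, v2, attn_mask=mask)

first_block_unchanged = torch.allclose(out[:, :, :block_size], out2[:, :, :block_size])
second_block_changed = not torch.allclose(out[:, :, block_size:], out2[:, :, block_size:])
print(first_block_unchanged, second_block_changed)
```

Earlier blocks never see later ones, so their keys and values can be frozen once they are finalized, while the active block keeps changing.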

9. Getting Started: Running with SGLang

SGLang is the recommended serving framework for Nemotron-Labs Diffusion, with integration via PR #25803 (merging into main imminently). Here’s a complete working example.

9.1 Installation

# Install SGLang with DLM support
pip install "sglang[all]>=0.4.5" --extra-index-url https://flashinfer.ai/whl/cu124/torch2.5/

# If the PR hasn't merged to main yet, install from the DLM branch directly:
# git clone https://github.com/sgl-project/sglang.git
# cd sglang && git fetch origin pull/25803/head:dlm-support
# git checkout dlm-support && pip install -e ".[all]"

# Pull the model weights
pip install huggingface-hub
huggingface-cli download nvidia/Nemotron-Labs-Diffusion-8B \
  --local-dir ./models/Nemotron-Labs-Diffusion-8B

9.2 Serving: Launch the SGLang Server

# Mode 1: Autoregressive (standard baseline)
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm ar_mode

# Mode 2: Diffusion (FastDiffuser), highest raw throughput
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm diffusion \
  --block-size 32 \
  --confidence-threshold 0.9

# Mode 3: Self-Speculation (LinearSpec), lossless 6x speedup
python -m sglang.launch_server \
  --model-path ./models/Nemotron-Labs-Diffusion-8B \
  --port 30000 --tp 1 --dtype bfloat16 \
  --algorithm linear_spec \
  --draft-block-size 32
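Loading an 8B checkpoint takes a while, so a client script may want to wait until the server answers before sending requests. The sketch below polls the `/v1/models` route of the OpenAI-compatible API using only the standard library. The helper name `wait_for_server` and the timeout values are assumptions, and it presumes the server exposes that route on the port used above.

```python
import http.client
import time
import urllib.request

def wait_for_server(base_url: str, timeout_s: float = 300.0, poll_s: float = 2.0) -> bool:
    """Poll GET {base_url}/models until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=poll_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or HTTP error: not ready yet.
            pass
        time.sleep(poll_s)
    return False

# Typical use before running a client:
#   if not wait_for_server("http://localhost:30000/v1"):
#       raise SystemExit("SGLang server did not come up in time")
```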

9.3 Inference: Python Client (OpenAI-Compatible API)

import openai
import time

# SGLang exposes an OpenAI-compatible API endpoint
client = openai.OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"  # SGLang doesn't require auth by default
)

PROMPT = """You are an expert in distributed systems.
Explain the CAP theorem and its practical implications for a microservices
architecture. Be specific with concrete trade-off examples."""

def benchmark_mode(label: str, mode_hint: str = ""):
    """Run a generation and measure wall-clock tokens/second."""
    start = time.perf_counter()

    response = client.chat.completions.create(
        model="nvidia/Nemotron-Labs-Diffusion-8B",
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0,        # T=0 → LinearSpec is lossless vs AR
        extra_body={
            "mode": mode_hint  # "ar", "diffusion", or "linear_spec"
        } if mode_hint else {}
    )

    elapsed = time.perf_counter() - start
    tokens  = response.usage.completion_tokens
    tps     = tokens / elapsed

    print(f"\n{'='*60}")
    print(f"Mode        : {label}")
    print(f"Output      : {response.choices[0].message.content[:200]}...")
    print(f"Tokens      : {tokens}")
    print(f"Time (s)    : {elapsed:.2f}")
    print(f"Throughput  : {tps:.1f} tok/s")
    print(f"{'='*60}")
    return tps

# Compare all three modes
ar_tps   = benchmark_mode("Autoregressive",           mode_hint="ar")
diff_tps = benchmark_mode("Diffusion (FastDiffuser)", mode_hint="diffusion")
spec_tps = benchmark_mode("Self-Spec (LinearSpec)",   mode_hint="linear_spec")

print(f"\n📊 Speedup Summary:")
print(f"  Diffusion vs AR   : {diff_tps/ar_tps:.2f}×")
print(f"  LinearSpec vs AR  : {spec_tps/ar_tps:.2f}×")
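A single timed request per mode is noisy, and the first request often includes one-off warmup cost. One way to steady the comparison is to discard a warmup call and take the median of several trials. The helper below is a sketch: `median_throughput` is a made-up name, and the stand-in measurement lets it run without a server.

```python
import statistics
from typing import Callable

def median_throughput(run_once: Callable[[], float], warmup: int = 1, trials: int = 5) -> float:
    """Call `run_once` (returns tokens/second) repeatedly and return the median."""
    for _ in range(warmup):
        run_once()  # discarded: early requests may include setup cost
    return statistics.median(run_once() for _ in range(trials))

# Stand-in measurement so the sketch runs offline. In practice this could be
# `lambda: benchmark_mode("Autoregressive", mode_hint="ar")`.
fake_samples = iter([50.0, 210.0, 190.0, 400.0, 205.0, 200.0])
result = median_throughput(lambda: next(fake_samples))
print(result)  # → 205.0
```

The median is less sensitive than the mean to an occasional slow or unusually fast request, as with the 400.0 outlier in the fake samples.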

9.4 Quick Start via HuggingFace Transformers (AR Mode)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "nvidia/Nemotron-Labs-Diffusion-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user",   "content": "Explain masked diffusion in 3 sentences."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=False,
        use_cache=True
    )

response = tokenizer.decode(
    output_ids[0][input_ids.shape[-1]:],
    skip_special_tokens=True
)
print(response)

Note: The transformers path gives AR mode only. Diffusion and self-speculation modes require the SGLang integration, which implements the custom decoding loop.

10. What This Means for Production LLM Infrastructure

Latency vs Throughput Trade-off, Revisited

The classic LLM serving dilemma is that throughput optimizations (larger batch sizes, continuous batching) increase latency, and latency optimizations (small batches, low KV cache pressure) hurt throughput. Self-speculation in DLMs partially decouples this: at batch size 1, LinearSpec delivers roughly 4× the tokens per second of AR on the same hardware (about 6× when measured in TPF). This is the scenario where AR models are most inefficient, and where DLMs provide the biggest relative gain.

Cost Implications

A 4× throughput improvement at batch size 1 means you could serve the same number of users with 1/4 the GPU compute — or equivalently, serve 4× more users from the same GPU fleet. At current B200/H100 pricing of $4–8/hour, that’s a meaningful cost reduction for any team running a production LLM API.
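The cost argument can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a $6/hour GPU price (the midpoint of the range above), takes the ~865 tokens/second LinearSpec figure quoted earlier, and derives the AR baseline from the "roughly 4×" comparison. It models one unbatched stream per GPU, so real serving costs with batching will differ.

```python
def cost_per_million_tokens(gpu_price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost of generating one million tokens on a single GPU stream."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000

GPU_PRICE = 6.0          # assumed $/hour, midpoint of the $4–8 range
SPEC_TPS = 865.0         # LinearSpec tokens/second quoted for B200
AR_TPS = SPEC_TPS / 4    # implied by the "roughly 4x" comparison

ar_cost = cost_per_million_tokens(GPU_PRICE, AR_TPS)
spec_cost = cost_per_million_tokens(GPU_PRICE, SPEC_TPS)

print(f"AR         : ${ar_cost:.2f} per 1M tokens")
print(f"LinearSpec : ${spec_cost:.2f} per 1M tokens")
```

Under these assumptions the per-token cost drops from about $7.71 to about $1.93 per million tokens, the same 4× ratio as the throughput gain.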

Fill-in-the-Middle and Code Editing

DLMs have a natural advantage for fill-in-the-middle (FIM) tasks. AR models handle FIM awkwardly, requiring special training and prompt formatting to look “backwards” at the suffix. A DLM generating a block bidirectionally can natively condition on both prefix and suffix context within the block — making Nemotron-Labs Diffusion well-suited for code editing agents and inline completions.

Inference Budget Control

In diffusion mode, you can control the number of denoising steps as a runtime knob. Fewer steps = faster but potentially lower quality. More steps = slower but higher quality. This gives you a continuous quality-speed trade-off at inference time without retraining — something AR models simply can’t offer. A production system could dynamically reduce diffusion steps during traffic spikes and increase them during low-load periods.
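A load-aware step budget could look like the sketch below. Everything here is hypothetical: the function name, the 4 to 32 step range, and the linear mapping are placeholders, and how the chosen step count is passed to the serving framework depends on the integration.

```python
def denoising_steps_for_load(load: float, min_steps: int = 4, max_steps: int = 32) -> int:
    """Map utilisation in [0, 1] to a step budget: more load, fewer steps."""
    load = min(max(load, 0.0), 1.0)   # clamp out-of-range readings
    steps = max_steps - load * (max_steps - min_steps)
    return max(min_steps, round(steps))

for load in (0.0, 0.5, 1.0):
    print(f"load={load:.1f} -> {denoising_steps_for_load(load)} steps")
```

A linear mapping is only the simplest choice. A production policy would more likely be tuned against measured quality at each step count.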

When to Stick with AR

For long-context tasks (100K+ tokens) where the KV cache dominates memory, the efficiency story is less clear-cut. For streaming output where users see tokens as they’re generated, block-wise generation may feel less smooth without careful rendering logic. And for tasks requiring strict constrained decoding (grammar-constrained generation, beam search), the diffusion loop needs further tooling work.
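On the streaming point, one simple form of rendering logic is to buffer each finished block and release its tokens at a steady pace instead of all at once. The generator below is a minimal sketch with made-up names. It takes blocks as lists of token strings and does not model any particular server's streaming API.

```python
import time
from typing import Iterable, Iterator

def smooth_stream(blocks: Iterable[list[str]], tokens_per_second: float = 0.0) -> Iterator[str]:
    """Yield tokens one at a time from block-sized bursts, optionally paced."""
    delay = 1.0 / tokens_per_second if tokens_per_second > 0 else 0.0
    for block in blocks:
        for token in block:
            yield token
            if delay:
                time.sleep(delay)   # spread the burst out for the reader

bursts = [["The", " CAP", " theorem"], [" states", " that"]]
text = "".join(smooth_stream(bursts))
print(text)
```

With `tokens_per_second=0.0` the tokens are emitted immediately. Setting it near the model's average output rate trades a little latency for a steadier display.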

11. Conclusion & The Road Ahead

Diffusion Language Models have been a promising idea for years, perennially held back by a cluster of practical barriers: accuracy gaps, training instability, and the loss of KV caching. NVIDIA’s Efficient-DLM work and Nemotron-Labs Diffusion have systematically addressed each of these barriers with concrete, principled solutions — block-wise causal attention, position-dependent masking, and joint AR+diffusion training objectives.

The result is a model family that is simultaneously:

  • A first-class AR model (backward compatible, lossless in LinearSpec mode)
  • A faster inference engine, with 2.6–6.4× more tokens per forward pass than AR depending on the decoding mode
  • A better fill-in-the-middle model by architectural design
  • A tunable quality-speed dial at deployment time, with no retraining needed

With 24K+ downloads in the first 24 hours and SGLang integration landing imminently, this is one of the most practically significant open-source releases in the LLM inference space in 2026.

The next frontier: applying the same AR-to-DLM conversion recipe to frontier-scale models (70B+), exploring multimodal DLMs beyond the 8B VLM preview, and building out constrained decoding, streaming token rendering, and fine-tuning tooling for the DLM objective.

If you’re building LLM-powered applications and care about inference cost and latency, it’s time to start experimenting with Nemotron-Labs Diffusion. The autoregressive loop had a good run — but the next chapter of language model inference looks decidedly more parallel.

🔗 Resources

  • 🤗 Nemotron-Labs Diffusion model collection on HuggingFace
  • 📄 Efficient-DLM technical paper — arXiv:2512.14067
  • 💻 NVIDIA Megatron Bridge training code (GitHub)
  • 🔧 SGLang DLM integration PR #25803

Written on May 24, 2026 — based on the HuggingFace blog post and arXiv:2512.14067 (Efficient-DLM). Performance numbers reflect published benchmarks; verify against your specific hardware and workload.