Deleting the 8.4GB Python Sidecar: Pure Go + CUDA with `CGO_ENABLED=0`

TL;DR: I built gocudrv so Go services can talk directly to NVIDIA GPUs — no cgo, no CUDA toolkit, no bloated Python dependencies. One static binary.

Last month I was reviewing a production AI service. The core business logic was clean, efficient Go (15MB binary), but GPU access was routed through a Python sidecar.

The results were painful:

8.4GB Docker images — bloated with unused CUDA toolkits and PyTorch dependencies
4-minute cold starts during autoscaling
Extra serialization + network hops between Go → Python → GPU

We had accepted this because “GPUs belong to Python.” I decided to challenge that assumption.

The Impossible Build: `CGO_ENABLED=0`

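Most Go developers assume you need cgo for CUDA. Instead, I used the CUDA Driver API, which lives in libcuda and ships with the NVIDIA driver itself rather than with the CUDA toolkit, together with purego, which loads shared libraries and calls into them at runtime. No C compiler is involved at any point.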

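// internal/platform/platform_linux.go

// LibraryCandidates lists candidate names and paths for the CUDA driver
// library on Linux.
func LibraryCandidates() []string {
    return []string{
        "libcuda.so.1",                           // resolved via the loader's default search path
        "/usr/lib/x86_64-linux-gnu/libcuda.so.1", // Debian/Ubuntu x86_64 multiarch path
        "/usr/lib/wsl/lib/libcuda.so.1",          // WSL2 driver location
    }
}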

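gocudrv loads libcuda.so.1 at runtime, so nothing links against CUDA at build time. A standard `go build` with `CGO_ENABLED=0` works, including cross-compiling from a Mac to Linux (`GOOS=linux`). The resulting binary still needs an NVIDIA driver on the machine where it runs.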

The Receipts: Size & Build Comparison

Metric	Python Sidecar Approach	gocudrv (Pure Go)
Artifact Size	~8,400 MB	2.4 MB
Build Time	5–10 minutes (Docker)	< 2 seconds
External Dependencies	Python + PyTorch + CUDA Toolkit	NVIDIA Driver only
Deployment Simplicity	Multiple processes + networking	Single static binary

Low-Level Kernel Performance (10M element vector add on RTX 4070 Ti)

For a simple vector addition (~114 MB data):

H→D Copy: 19.3 ms
Kernel Launch: 3.4 ms
D→H Copy: 25.6 ms
Total GPU Pipeline: 48.3 ms

These numbers represent the raw GPU work. In the previous Python sidecar setup, we also paid extra for:

JSON/Protobuf serialization
Local network socket transfer (Go → Python)
Python interpreter + PyTorch overhead

The real win is not necessarily beating PyTorch on micro-benchmarks, but removing the entire sidecar layer and its operational complexity.
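As a sanity check on the benchmark figures above, the arithmetic can be reproduced in a few lines. This sketch assumes `float32` elements and three buffers (two inputs, one output); neither is stated explicitly in the post, but together they match the quoted ~114 MB if that figure is read as MiB. The function names are illustrative only.

```go
package main

import "fmt"

// payloadBytes returns the bytes moved for a vector add over n elements,
// assuming 4-byte float32 elements and three buffers (a, b, and the result).
func payloadBytes(n int) int {
	const bytesPerElem = 4 // float32 (assumption)
	const buffers = 3      // two inputs + one output (assumption)
	return n * bytesPerElem * buffers
}

// totalMs sums the per-phase timings in milliseconds.
func totalMs(phases ...float64) float64 {
	var sum float64
	for _, p := range phases {
		sum += p
	}
	return sum
}

func main() {
	b := payloadBytes(10_000_000)
	fmt.Printf("%d bytes = %.1f MiB\n", b, float64(b)/(1<<20))
	// H→D copy + kernel + D→H copy, as reported in the list above.
	fmt.Printf("total = %.1f ms\n", totalMs(19.3, 3.4, 25.6))
}
```

Under those assumptions the payload is 120,000,000 bytes (about 114.4 MiB), and the three phases add up to the reported 48.3 ms, so the copies account for most of the pipeline time.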

Beyond a Simple Wrapper

Pure Go doesn’t mean slow. I focused on asynchronous overlap from the beginning to hide PCIe transfer latency:

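stream, err := ctx.NewStream()
if err != nil {
    return err
}

// Enqueue the host-to-device copy on the stream; the call returns immediately.
// hostBuffer must stay alive and unmodified until Synchronize returns.
if err := buf.CopyFromHostAsync(ctx, stream, hostBuffer); err != nil {
    return err
}

// The Go side can do useful work while the transfer is in flight
// ...

// Synchronize only when the result is needed
if err := stream.Synchronize(ctx); err != nil {
    return err
}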
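The enqueue, overlap, synchronize pattern can be tried without a GPU. The sketch below is a pure Go simulation, not gocudrv code: `fakeStream` is a hypothetical stand-in that runs enqueued operations in order on a background goroutine, and the "copy" is an ordinary in-memory `copy`. It only illustrates the control flow, in which the host keeps working between enqueueing and synchronizing.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeStream is a hypothetical stand-in for a CUDA stream: operations run in
// submission order on a background goroutine.
type fakeStream struct {
	queue chan func()
	wg    sync.WaitGroup
}

func newFakeStream() *fakeStream {
	s := &fakeStream{queue: make(chan func(), 16)}
	go func() {
		for op := range s.queue {
			op()
			s.wg.Done()
		}
	}()
	return s
}

// Enqueue submits op without waiting for it to run
// (it only blocks if the 16-slot queue is full).
func (s *fakeStream) Enqueue(op func()) {
	s.wg.Add(1)
	s.queue <- op
}

// Synchronize blocks until every enqueued operation has finished.
func (s *fakeStream) Synchronize() { s.wg.Wait() }

// Close stops the background goroutine once the queue drains.
func (s *fakeStream) Close() { close(s.queue) }

// overlapDemo enqueues a simulated host-to-device copy, does unrelated host
// work while the copy may still be running, then synchronizes before
// touching the destination buffer.
func overlapDemo(host []float32) ([]float32, float32) {
	s := newFakeStream()
	defer s.Close()

	device := make([]float32, len(host))
	s.Enqueue(func() { copy(device, host) }) // simulated async copy

	// Host-side work that overlaps with the copy; it only reads host.
	var sum float32
	for _, v := range host {
		sum += v
	}

	s.Synchronize() // device is safe to read only after this returns
	return device, sum
}

func main() {
	device, sum := overlapDemo([]float32{1, 2, 3, 4})
	fmt.Println(device, sum)
}
```

The rule the simulation encodes carries over to real streams: the source buffer is only read, and the destination is not touched, until synchronization has completed.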

Why This Matters in 2026

AI is shifting from research demos to critical infrastructure. Go excels at stability, concurrency, observability, and operational predictability — exactly what production model serving demands.

Removing the Python sidecar gives you:

Dramatically smaller images and faster deploys.
Much better cold start times.
Single language and single binary (much simpler observability and debugging).
No GIL in the request path, which should help P99 tail latencies.

Current State (Honest)

gocudrv is still early and experimental. Core functionality works today (device management, memory management, PTX loading, streams, async copies), but it is not yet ready for complex high-performance inference serving.

I’m actively working on CUDA Graphs, Events & Timing, and multi-GPU support.

If you’re a Go engineer tired of carrying heavy Python AI runtimes in production, I’d love your feedback and contributions.

→ link
