TL;DR: I built gocudrv so Go services can talk directly to NVIDIA GPUs — no cgo, no CUDA toolkit, no bloated Python dependencies. One static binary.
Last month I was reviewing a production AI service. The core business logic was clean, efficient Go (15MB binary), but GPU access was routed through a Python sidecar.
The results were painful:
- 8.4GB Docker images — bloated with unused CUDA toolkits and PyTorch dependencies
- 4-minute cold starts during autoscaling
- Extra serialization + network hops between Go → Python → GPU
We had accepted this because “GPUs belong to Python.” I decided to challenge that assumption.
The Impossible Build: CGO_ENABLED=0
Most Go developers assume you need cgo for CUDA. Instead, I used the CUDA Driver API (already present wherever an NVIDIA driver is installed) together with purego to bypass the C compiler entirely.
// internal/platform/platform_linux.go
func LibraryCandidates() []string {
return []string{
"libcuda.so.1",
"/usr/lib/x86_64-linux-gnu/libcuda.so.1",
"/usr/lib/wsl/lib/libcuda.so.1", // Works seamlessly in WSL2
}
}
gocudrv loads libcuda.so at runtime. Standard go build works — even when building on a Mac targeting Linux.
The Receipts: Size & Build Comparison
| Metric | Python Sidecar Approach | gocudrv (Pure Go) |
|---|---|---|
| Artifact Size | ~8,400 MB | 2.4 MB |
| Build Time | 5–10 minutes (Docker) | < 2 seconds |
| External Dependencies | Python + PyTorch + CUDA Toolkit | NVIDIA Driver only |
| Deployment Simplicity | Multiple processes + networking | Single static binary |
Low-Level Kernel Performance (10M element vector add on RTX 4070 Ti)
For a simple vector addition (~114 MB data):
- H→D Copy: 19.3 ms
- Kernel Launch: 3.4 ms
- D→H Copy: 25.6 ms
- Total GPU Pipeline: 48.3 ms
These numbers represent the raw GPU work. In the previous Python sidecar setup, we also paid extra for:
- JSON/Protobuf serialization
- Local network socket transfer (Go → Python)
- Python interpreter + PyTorch overhead
The real win is not necessarily beating PyTorch on micro-benchmarks, but removing the entire sidecar layer and its operational complexity.
Beyond a Simple Wrapper
Pure Go doesn’t mean slow. I focused on asynchronous overlap from the beginning to hide PCIe transfer latency:
stream, _ := ctx.NewStream()
// Start DMA transfer — returns immediately
err := buf.CopyFromHostAsync(ctx, stream, hostBuffer)
// Go can do useful work while the GPU is computing
// ...
// Synchronize only when needed
err = stream.Synchronize(ctx)
Why This Matters in 2026
AI is shifting from research demos to critical infrastructure. Go excels at stability, concurrency, observability, and operational predictability — exactly what production model serving demands.
Removing the Python sidecar gives you:
- Dramatically smaller images and faster deploys.
- Much better cold start times.
- Single language and single binary (much simpler observability and debugging).
- No GIL, better P99 tail latencies.
Current State (Honest)
gocudrv is still early and experimental. Core functionality works today (device management, memory management, PTX loading, streams, async copies), but it is not yet ready for complex high-performance inference serving.
I’m actively working on CUDA Graphs, Events & Timing, and multi-GPU support.
If you’re a Go engineer tired of carrying heavy Python AI runtimes in production, I’d love your feedback and contributions.
→ link
