Overview
In Kubernetes, can GPUs only be allocated as whole devices, leaving surplus VRAM stranded? I looked into whether that leftover VRAM could be assigned to other workloads.
Examples
- GPU spec: 32GB VRAM per GPU
- Workstation spec: 2 GPUs
- Persistent inference spec: 1 GPU, using 12GB VRAM
- The inference workload occupies one whole GPU but uses only 12GB, so one GPU has 20GB of spare VRAM and the other has all 32GB free. I wanted to allocate this surplus to other tasks and maximize GPU utilization.
HOWTO
Replica
If you can cap the number of pods and pre-calculate resource requirements, replicas combined with resource requests and limits let you bound total resource allocation. E.g., requests of 1 CPU and 2GB RAM, limits of 2 CPU and 4GB RAM, with up to 5 replicas.
However, this only works for CPU and RAM — GPUs can still only be requested as whole devices.
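As a sketch, a Deployment like the following caps total usage at 5 pods × (2 CPU, 4GB RAM). The names and image here are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker                 # hypothetical name
spec:
  replicas: 5                  # at most 5 pods -> at most 10 CPU / 20GB RAM in total
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
      - name: worker
        image: worker:latest   # placeholder image
        resources:
          requests:            # minimum guaranteed per pod
            cpu: "1"
            memory: 2Gi
          limits:              # hard cap per pod, enforced by the kubelet
            cpu: "2"
            memory: 4Gi
```

Note that nothing like `nvidia.com/gpu: 0.5` is possible here: the device plugin only hands out whole GPUs.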
Extended Resources
ref: Apply Extended Resources
Kubernetes Extended Resources let you control how much VRAM each pod requests, so you can tune resources for training vs. serving pods.
The catch: extended resources are only accounting at scheduling time. If a container gobbles up all VRAM at runtime, Kubernetes has no way to detect or enforce it. You could spin up a sidecar container to monitor usage, but that just adds more overhead.
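A minimal sketch of what this looks like: the resource name `example.com/vram` and the pod below are assumptions for illustration, not a standard resource. The node must first advertise the resource (e.g. via a PATCH to `/api/v1/nodes/<node>/status` adding it under `status.capacity`); pods can then request it like any other resource:

```yaml
# Hypothetical pod requesting 12 "units" (here meant as GB) of a
# custom extended resource. The scheduler will only subtract this
# from the node's advertised capacity -- it does not limit actual
# runtime VRAM usage.
apiVersion: v1
kind: Pod
metadata:
  name: trainer            # hypothetical name
spec:
  containers:
  - name: trainer
    image: trainer:latest  # placeholder image
    resources:
      limits:
        example.com/vram: 12   # extended resources cannot be overcommitted;
      requests:                # requests and limits must be equal
        example.com/vram: 12
```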
GPU Virtualization
GPU virtualization is traditionally done through hypervisors such as VMware, enabling fine-grained resource allocation on a single GPU. But virtualization support is limited to data-center-grade GPUs, and VMware is not free.
Conclusion
Kubernetes doesn’t natively support VRAM splitting. Extended Resources can account for VRAM at scheduling time, but they don’t actually constrain a pod’s runtime VRAM usage.