If you run workloads on Kubernetes, you’ve most likely experienced “cold starts”: delays in application startup that occur when a workload is scheduled to a node that has not previously hosted it, so the Pod has to be started from scratch.
What happens during a cold start?
Deploying a containerized application on Kubernetes typically involves several steps, including pulling the container image, starting the container, and initializing the application code. All of these steps add to the time before a Pod can serve traffic, increasing the latency of the first request served by a newly started Pod. Startup on a brand-new node takes significantly longer because the node has no pre-existing container images. For subsequent requests, the Pod is already started and warmed up, so it can handle them without additional startup time.
Cold starts are common when Pods are frequently shut down and restarted, since this forces requests to be routed to new, cold Pods. A common mitigation is to keep a warm pool of Pods in a ready state to reduce cold start latency.
However, for larger workloads such as AI/ML, especially on expensive and scarce GPUs, keeping a warm pool can be very costly. As a result, Pods for AI/ML workloads are typically shut down as soon as they finish their requests, making cold starts especially common for these workloads.
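To make the warm-pool idea concrete, here is a minimal sketch: a Deployment that keeps a fixed number of replicas running so that requests always land on an already-warm Pod. The names and image below are hypothetical placeholders, not from a specific product:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: warm-pool-example            # hypothetical workload name
spec:
  replicas: 2                        # Pods kept warm at all times
  selector:
    matchLabels:
      app: warm-pool-example
  template:
    metadata:
      labels:
        app: warm-pool-example
    spec:
      containers:
      - name: server
        image: <YOUR_SERVING_IMAGE>  # placeholder for your workload image
The trade-off is exactly the one described above: each warm replica holds on to its resources (such as a GPU) even while idle.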
Google Kubernetes Engine (GKE) is Google Cloud’s managed Kubernetes service that makes it easier to deploy and maintain complex containerized workloads. In this article, we’ll discuss four different techniques to reduce cold start latency on GKE so that you can provide responsive services.
Techniques to Overcome Cold Start Challenges
1. Use ephemeral storage with local SSDs or a larger boot disk
When a node pool uses ephemeral storage backed by local SSDs, the node hosts the kubelet and container runtime (docker or containerd) root directories on the local SSD. The container layer is therefore backed by local SSD, with the IOPS and throughput documented in About local SSDs. Local SSD offers approximately 3x higher throughput than a persistent disk (PD) for the same cost, which makes image pulls faster and reduces workload startup latency. It is also often more cost-effective than increasing the PD size.
You can create a node pool that uses ephemeral storage with local SSD in an existing cluster running on GKE version 1.25.3-gke.1800 or later.
gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --ephemeral-storage-local-ssd count=<NUMBER_OF_DISKS> \
    --machine-type=MACHINE_TYPE
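Once the node pool exists, no Pod-level changes are required: the container writable layer and emptyDir volumes on those nodes are automatically backed by the local SSDs. As a quick sanity check (a sketch; the names are placeholders), you can describe the node pool and look for the local SSD ephemeral storage setting in the output:
gcloud container node-pools describe POOL_NAME \
    --cluster=CLUSTER_NAME
# Look for an ephemeralStorageLocalSsdConfig entry in the output
# (the exact field name follows the GKE API and may vary by version).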
For more information, see the GKE documentation on provisioning ephemeral storage with local SSDs.
2. Enable container image streaming
Image streaming can significantly reduce workload startup time by allowing workloads to start without waiting for the entire image to download. For example, with GKE Image streaming, the end-to-end startup time (from workload creation to the server being ready for traffic) of an NVIDIA Triton server with a 5.4GB container image drops from 191 seconds to 30 seconds.
Your container images must be stored in Artifact Registry, and you must meet the other requirements. You can enable Image streaming when creating a cluster:
gcloud container clusters create CLUSTER_NAME \
    --zone=COMPUTE_ZONE \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming
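Image streaming can also be enabled on an existing cluster (a sketch; the cluster name is a placeholder):
gcloud container clusters update CLUSTER_NAME \
    --enable-image-streaming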
To learn more, see Use Image streaming to pull container images.
3. Use Zstandard to compress container images
Zstandard compression is a feature supported by containerd. Zstandard benchmarks show that zstd decompresses more than 3x faster than gzip, the current default.
Here’s how to create and use a zstd-capable builder with docker buildx:
docker buildx create --name zstd-builder --driver docker-container \
    --driver-opt image=moby/buildkit:v0.10.3
docker buildx use zstd-builder
Here’s how to build and push an image:
IMAGE_URI=us-central1-docker.pkg.dev/teck-gke-dev/example
IMAGE_TAG=v1
<Create your Dockerfile>
docker buildx build --file Dockerfile --output type=image,name=$IMAGE_URI:$IMAGE_TAG,oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true,push=true .
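To confirm that the pushed image actually uses zstd-compressed layers, you can inspect its raw manifest using the variables defined above:
docker buildx imagetools inspect $IMAGE_URI:$IMAGE_TAG --raw
# zstd layers appear with the media type
# application/vnd.oci.image.layer.v1.tar+zstd (gzip layers end in tar+gzip).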
Note that Zstandard is not compatible with Image streaming. If your application needs to load a large portion of the container image content before starting, Zstandard is the better choice. If your application only needs to load a small portion of the image to start executing, try Image streaming instead.
4. Use a preloader DaemonSet to preload the base container on nodes
Last but not least, containerd reuses image layers across different containers that share the same base image. On GPU nodes, the preloader DaemonSet starts running even before the GPU driver is installed (driver installation takes approximately 30 seconds). This means it can start pulling the required images early and preload them before GPU workloads are scheduled to the node.
Here is an example of a preloader DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: container-preloader
  labels:
    k8s-app: container-preloader
spec:
  selector:
    matchLabels:
      k8s-app: container-preloader
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: container-preloader
        k8s-app: container-preloader
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            # Schedule the preloader only on GPU nodes
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      # Tolerate all taints so the preloader can land on not-yet-ready nodes
      tolerations:
      - operator: "Exists"
      containers:
      - image: "<CONTAINER_TO_BE_PRELOADED>"
        name: container-preloader
        # Keep the container alive so the image stays cached on the node
        command: ["sleep", "inf"]
Beyond cold starts
The cold start challenge is a common problem in container orchestration systems. With careful planning and optimization, you can mitigate its impact on applications running on GKE. By using ephemeral storage with local SSDs or a larger boot disk, enabling Image streaming or Zstandard compression, and preloading base containers with a DaemonSet, you can reduce cold start latency and make your services more responsive and efficient.