4 Ways to Reduce Cold-Start Latency on Google Kubernetes Engine (GKE)

If you run workloads on Kubernetes, you’ve most likely experienced “cold starts”: delays in application startup that occur when a workload is scheduled to a node that has not previously hosted it, so the Pod must be started from scratch.

Extended startup times can result in longer response times and a worse user experience – especially when applications automatically scale to handle traffic surges.

What happens during a cold start?

Deploying a containerized application on Kubernetes typically involves several steps, including pulling the container image, starting the container, and initializing the application code. These processes increase the time before a Pod starts serving traffic, resulting in increased latency for the first request served by a new Pod. Since new nodes do not have pre-existing container images, the initial startup time may be significantly longer. For subsequent requests, the Pod is already started and warmed up, so it can quickly handle requests without additional startup time.
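To make these phases concrete, here is a minimal sketch that models a cold start as the sum of its phases. The `ColdStart` class and every duration in it are hypothetical placeholders, not measurements; the point is only that a node with the image already cached skips the pull phase.

```python
from dataclasses import dataclass


@dataclass
class ColdStart:
    """Hypothetical per-phase durations, in seconds, for one Pod start."""
    image_pull_s: float
    container_start_s: float
    app_init_s: float

    def total(self, image_cached: bool = False) -> float:
        # A node that already holds the image skips the pull phase entirely.
        pull = 0.0 if image_cached else self.image_pull_s
        return pull + self.container_start_s + self.app_init_s


# Placeholder numbers chosen for illustration only.
example = ColdStart(image_pull_s=120.0, container_start_s=5.0, app_init_s=30.0)
print("new node:", example.total())
print("image already on node:", example.total(image_cached=True))
```

In this toy model the image pull dominates, which is why most of the techniques below target that phase.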

Cold starts are common when Pods are constantly shut down and restarted, as this forces requests to be routed to new, cold Pods. A common solution is to keep a pool of warm Pods ready to reduce cold-start latency.

However, for larger workloads such as AI/ML, especially on expensive and scarce GPUs, keeping a warm pool can be very costly. Cold starts are therefore especially common for AI/ML workloads, where Pods are typically shut down after completing requests.

Google Kubernetes Engine (GKE) is Google Cloud’s managed Kubernetes service that makes it easier to deploy and maintain complex containerized workloads. In this article, we’ll discuss four different techniques to reduce cold start latency on GKE so that you can provide responsive services.

Techniques to Overcome Cold Start Challenges

1. Use ephemeral storage with local SSDs or larger boot disks

With ephemeral storage on local SSD, the node mounts the kubelet and container runtime (Docker or containerd) root directories on the local SSD. The container layer is therefore backed by local SSD, with the IOPS and throughput documented in About local SSDs. This is often more cost-effective than increasing the persistent disk (PD) size.

Local SSD offers approximately 3x the throughput of PD at the same cost, so image pulls run faster and workload startup latency drops.
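As a rough illustration of what a 3x throughput difference means, the sketch below estimates how long it takes to write an image to disk if disk throughput were the only bottleneck. That is a simplifying assumption: real pulls also depend on network bandwidth and decompression. The image size and the PD throughput figure are hypothetical; only the 3x ratio comes from the text above.

```python
def disk_bound_seconds(image_mb: float, throughput_mb_s: float) -> float:
    """Time to write an image to disk, assuming the disk is the only bottleneck."""
    return image_mb / throughput_mb_s


IMAGE_MB = 5400                 # hypothetical image size
PD_MB_S = 200                   # hypothetical PD throughput
LOCAL_SSD_MB_S = 3 * PD_MB_S    # roughly 3x, per the comparison above

pd_seconds = disk_bound_seconds(IMAGE_MB, PD_MB_S)
ssd_seconds = disk_bound_seconds(IMAGE_MB, LOCAL_SSD_MB_S)
print(f"PD: {pd_seconds:.0f}s, local SSD: {ssd_seconds:.0f}s")
```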

You can create a node pool that uses ephemeral storage with local SSD in an existing cluster running on GKE version 1.25.3-gke.1800 or later.

gcloud container node-pools create POOL_NAME \
    --cluster=CLUSTER_NAME \
    --ephemeral-storage-local-ssd count=NUMBER_OF_DISKS \
    --machine-type=MACHINE_TYPE

For more information, see the GKE documentation on ephemeral storage backed by local SSD.

2. Enable container image streaming

Image streaming can significantly reduce workload startup time by allowing workloads to start without waiting for the entire image to download. For example, with GKE Image streaming, the end-to-end startup time (from workload creation to the server being ready for traffic) of an NVIDIA Triton server (5.4 GB container image) can be reduced from 191 seconds to 30 seconds.

Your container images must be stored in Artifact Registry, and your cluster must meet the Image streaming requirements. You can enable Image streaming when you create a cluster:

gcloud container clusters create CLUSTER_NAME \
    --image-type="COS_CONTAINERD" \
    --enable-image-streaming

To learn more, see Use Image streaming to pull container images.

3. Use Zstandard to compress container images

Zstandard compression is a feature supported by containerd. Zstandard benchmarks show that zstd decompresses more than 3 times faster than gzip (the current default).

Here’s how to create and select a zstd builder with docker buildx:

docker buildx create --name zstd-builder --driver docker-container \
    --driver-opt image=moby/buildkit:v0.10.3
docker buildx use zstd-builder

Here’s how to build and push an image:

<Create your Dockerfile>

docker buildx build --file Dockerfile \
    --output type=image,name=$IMAGE_URI:$IMAGE_TAG,oci-mediatypes=true,compression=zstd,compression-level=3,force-compression=true,push=true .

Note that Zstandard is not compatible with image streaming. If your application needs to load a large portion of the container image content before starting, Zstandard is the better choice. If your application only needs to load a small portion of the container image to start executing, try image streaming instead.
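The choice between the two can be written down as a simple rule of thumb. The sketch below is only an illustration: the `suggest_technique` helper and its 0.5 threshold are assumptions made for this example, not a GKE recommendation, so measure your own workload before deciding.

```python
def suggest_technique(startup_fraction: float, threshold: float = 0.5) -> str:
    """Suggest a technique from the share of image content read before startup.

    startup_fraction: estimated fraction (0.0 to 1.0) of the image that the
    application must load before it can start.
    threshold: hypothetical cut-off; tune it based on your own measurements.
    """
    if not 0.0 <= startup_fraction <= 1.0:
        raise ValueError("startup_fraction must be between 0 and 1")
    # Large share needed up front: faster decompression of the whole image helps.
    # Small share needed up front: streaming avoids downloading the rest first.
    return "zstd" if startup_fraction >= threshold else "image streaming"


print(suggest_technique(0.9))
print(suggest_technique(0.1))
```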

4. Use a preloader DaemonSet to preload the base container on nodes

Last but not least, containerd reuses image layers across different containers if they share the same base container. The preloader DaemonSet can start running even before the GPU driver is installed (driver installation takes approximately 30 seconds). This means it can start pulling images early and preload the required containers before GPU workloads are scheduled to the GPU node.

Here is an example of a preloader DaemonSet:

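apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: container-preloader
  labels:
    k8s-app: container-preloader
spec:
  selector:
    matchLabels:
      k8s-app: container-preloader
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: container-preloader
        k8s-app: container-preloader
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-accelerator
                operator: Exists
      tolerations:
      - operator: "Exists"
      containers:
      - image: "<CONTAINER_TO_BE_PRELOADED>"
        name: container-preloader
        command: ["sleep", "inf"]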
Beyond cold starts

The cold start challenge is a common problem in container orchestration systems. Through careful planning and optimization, you can mitigate its impact on applications running on GKE. By using ephemeral storage with local SSDs or a larger boot disk, enabling image streaming or Zstandard compression, and using a DaemonSet to preload base containers, you can reduce cold-start latency and make your system more responsive and efficient.

If you want to know more about GKE, you can leave a message to contact me.


  • Mohamed BEN HASSINE

    Mohamed BEN HASSINE is a hands-on Cloud Solution Architect based in France. He has been working with Java, Web, API, and Cloud technologies for over 12 years and is still keen to learn new things. He currently works as a Cloud / Application Architect in Paris, designing cloud-native solutions and APIs (REST, gRPC) using cutting-edge technologies (GCP / Kubernetes / APIGEE / Java / Python).
