MAKING AN APPLICATION HIGHLY AVAILABLE WITH KUBERNETES

Migrating your application to Kubernetes can streamline many processes, but it doesn’t ensure high availability. Between deployments, node recycling, and unexpected interruptions, your application might struggle to handle requests promptly.

In this article we will explain how to properly define your application so that it is highly available on a Kubernetes cluster, going over the different parts to configure and the actions to implement over time.



In today’s technology landscape, Kubernetes has become the go-to choice for container orchestration. Its widespread adoption is a testament to its ability to simplify and streamline the deployment, management, and scaling of containerized applications. Kubernetes’ meteoric rise also highlights the shift in mindset that has taken place in the infrastructure world from managing “pet” servers to managing “cattle.”

Traditional application management requires special attention to each server, with manual operations to keep it functional, much like caring for a pet: this is the “pet” mentality, where we pamper our servers.

The “cattle” mentality is about managing our resources at a larger scale, without paying special attention to each server. This method brings its own set of constraints, such as shorter server lifetimes and more frequent application restarts.

These more frequent outages must be taken into account when migrating to Kubernetes, and require configuring your application to be highly available.

Through an example, let’s explore which elements to choose and configure to ensure that our application is always available and responds to our customers’ requests within acceptable time frames.

Let’s consider our application “highly available” if it is able to respond to more than 99% of requests in less than a second, 24/7.

Choosing Application Type

On Kubernetes, a web application can be deployed in two ways:

  • with a Deployment, ideal for “stateless” applications that do not need to keep their state locally to function, such as a simple home page;
  • or with a StatefulSet, for applications that store data locally, such as a visit counter.

It is easier to transition a stateless application to high availability than a stateful one, because a stateful application has to manage additional concerns for scaling to work, such as synchronizing state across instances or ensuring that a client always reaches the same instance. Data consistency in a stateful application is an important consideration: otherwise, two identical requests from a client could return different results.

Today, a standard web application should not keep its state locally. The most widely used application models recommend separating the data part from the “compute” / “code” part, using a 2-tier / 3-tier / N-tier architecture. This type of architecture splits the application into features and isolates them to simplify their maintenance; each of these features will then have to be deployed in high availability.

The application we will deploy is a simple static “hello world” page, which does not need to store any state. We will use the “nginx” image in our examples.

So we are going to use a Deployment for our application.

A Deployment allows you to configure several elements:

  • a pod template, the element that will run an instance of your application
  • the number of pods you want to have in parallel
  • the deployment strategy: how a new rollout should be performed (replace everything at once, or replace old pods gradually).

In addition, the Deployment will continuously check that the specified number of pods is respected: if a pod is deleted, the Deployment will quickly replace it.
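
Putting these elements together, a minimal Deployment for our “hello world” page could look like the sketch below (the names example-deployment and example-container and the app: example label are simply the conventions we will use throughout this article):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  # Number of pods running in parallel
  replicas: 2
  # How a new rollout replaces the old pods
  strategy:
    type: RollingUpdate
  # The Deployment manages the pods carrying this label
  selector:
    matchLabels:
      app: example
  # Pod template: the element that runs one instance of the application
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
        - name: example-container
          image: nginx
          ports:
            - containerPort: 80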

Choosing Application Placement

To have a “high availability” deployment, we need at least two instances of the application to allow for possible interruptions on one of them.

You also need to make sure that your pods are not on the same node nor, if possible, in the same geographic zone. Indeed, if a problem occurs on the node (or on the zone), all instances of the application would be affected at once.

There are two methods to specify where an application’s pods can be placed:

  • PodAffinity and PodAntiAffinity: This method allows us to indicate to Kubernetes that our application wants to have its pods close to or isolated from the pods of another application. 

By “close” we mean based on a label placed on the node: proximity can be defined by the zone or by the physical server.

  • TopologySpreadConstraint: This newer method allows you to define how the application should be distributed across nodes or zones. The distribution must be fair based on the chosen topology key, with a maximum tolerated skew: for example, if the key is the zone and the maximum skew is 1, each zone will have a similar number of pods, and a new pod will always be placed in the zone with the fewest pods of the application.

To achieve what we want, we will need to use both a PodAntiAffinity and a TopologySpreadConstraint, to give the following instructions:

  • one application pod per node at most: PodAntiAffinity
  • an application pod on each zone: TopologySpreadConstraint

Note that limiting to a single pod per node will work best on clusters of a significant size, with at least as many nodes as there are application instances.

Here is an example of the implementation to be carried out on our deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 2 # minimum two instances
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      # (containers omitted here for brevity)
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            # Ensures two pods of the same application cannot be on the same node
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - example
              topologyKey: "kubernetes.io/hostname"
      topologySpreadConstraints:
        # Ensures zones have an equal number of pods, with a maximum skew of 1
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: example

Furthermore, to be able to make calls on the instances of this deployment, we will need to create a Service. For our application, we will choose a service of type “ClusterIP”, internal to the Kubernetes cluster.

This service will distribute traffic evenly across our application instances. Instances that are considered non-functional by a ReadinessProbe will not receive traffic through the service.

apiVersion: v1
kind: Service
metadata:
  name: example
spec:
  type: ClusterIP
  ports:
    - port: 80
  selector:
    app: example

Our application is now available from inside the cluster. There are several methods to make it available from outside the cluster, which will depend on your cluster and network configuration.

The standard today seems to be moving towards the Ingress object, so we’ll define this object to use the Service above to access our application.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
spec:
  rules:
    - host: example.com
      http:
        paths:
          - pathType: Prefix
            path: "/"
            backend:
              service:
                name: example
                port:
                  number: 80

At this point, our application is deployed with multiple instances, spread across multiple hosts and different zones. We are therefore protected from interruptions caused by a critical issue on a host or in a zone, and our application should respond to calls 24/7. It is accessible both from inside the Kubernetes cluster and from outside it.

Now let’s look at how to configure our application so that it can always respond to customer requests in an acceptable time.

Defining Pod Resources

First of all, we will need to ensure that our application has sufficient resources to function properly: we will therefore include in our deployment the CPU and memory requests that our application needs.

Additionally, we will limit the amount of memory the application is allowed to use, to protect other applications on the node if there is a memory leak on our application.

 
spec:
  # The containers must restart if they stop, for any reason
  # (restartPolicy is a pod-level field; Always is already the default for Deployments)
  restartPolicy: Always
  containers:
    - name: example-container
      resources:
        # Our application has access to at least 1 CPU and 1 GiB of memory
        requests:
          cpu: 1
          memory: 1Gi
        # We impose a limit on our application to protect the node and other applications
        limits:
          memory: 2Gi

 

With these settings, Kubernetes will now make sure to place these pods on nodes that can offer these resources. We recommend sizing the requests so that your actual usage is around 80% of what you declared, to keep some margin.
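
To check how your actual consumption compares to your requests, you can, for example, query the live metrics with kubectl (this assumes the metrics-server add-on is installed in the cluster, which is not the case everywhere):

# Current CPU/memory consumption of our pods (requires metrics-server)
kubectl top pods -l app=example

# Requests and limits declared on our deployment, for comparison
kubectl get deployment example-deployment \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'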

Monitoring Pod Lifecycle

Now, let’s look at the pod lifecycle. The pod will start, launch the main process, and keep it alive until it exits or until the pod is asked to stop. By default, Kubernetes will send traffic to this pod as soon as it starts, even though the web service can take up to several minutes to start depending on your application.

So we’ll need to tell Kubernetes when the pod is ready to receive traffic. We do this with Probes:

  • StartupProbe: runs a recurring test when the pod starts, to detect when the service has actually started, and restarts the pod if this takes too long.
  • LivenessProbe: once the startupProbe has succeeded, it checks that the service is still running correctly; if its tests fail, it restarts the pod.
  • ReadinessProbe: once the pod is started, it checks that the service is working properly; if its tests fail, it prevents Kubernetes from sending traffic to this pod by removing it from the endpoints of the services.

We can use all three probes with the same test, “verify that the /healthz call works”, to get the results we want:

containers:
  - name: example-container
    image: nginx
    ports:
      - name: http
        containerPort: 80
    # This test will run from container startup until it succeeds
    startupProbe:
      httpGet:
        path: /healthz
        port: 80
      failureThreshold: 30
      periodSeconds: 10
    # This test will restart the container on error
    livenessProbe:
      httpGet:
        path: /healthz
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 5
    # This test will stop serving traffic to the container on error
    readinessProbe:
      httpGet:
        path: /healthz
        port: 80
      periodSeconds: 5
      failureThreshold: 2

Our application is now able to detect when one of its instances has problems, divert traffic away from that instance, and restart it if it does not recover in a timely manner.

Now we need to make sure that our application works properly even if it experiences a traffic spike.

Automatically Scaling Application

We can do this with a HorizontalPodAutoscaler, which will compare the actual CPU/memory usage with the requests we have given to our instances and increase or decrease the number of replicas in the deployment based on the results.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  # At least 2 instances
  minReplicas: 2
  # At most 10 instances
  maxReplicas: 10
  # Scales the number of instances based on CPU usage; we target a usage of 80%
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

Here we choose to have a minimum of 2 instances and a maximum of 10. You have to take the capacity of your cluster into account with this dynamic provisioning of pods. In a cloud context, you can also set up dynamic node provisioning, which adds autoscaling at the node level.

For example, we can use Karpenter to ensure that we have enough nodes for our pods. This tool allows you to define one or more node types to be dynamically added to your cluster based on demand. Additionally, it is able to remove nodes when the cluster no longer needs them, which helps reduce costs. This node removal can happen while there are pods of our application on the node to be removed, which is why it is important to have our application in high availability.

> Be careful, a bad configuration of an autoscaler on the nodes can cost you dearly! Remember to always set a limit on the number of nodes to create, and to monitor the total cost of the cluster.
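
As an illustration, a Karpenter NodePool could look like the sketch below. This is only a sketch: the exact fields depend on your Karpenter version and cloud provider (here we assume the karpenter.sh/v1beta1 API and a pre-existing EC2NodeClass named “default”).

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Only provision on-demand nodes
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        kind: EC2NodeClass
        name: default # assumed pre-existing node class
  # Hard limit on the total resources Karpenter may provision, to control costs
  limits:
    cpu: "64"
  disruption:
    # Remove under-utilized nodes to reduce costs
    consolidationPolicy: WhenUnderutilized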

So now we have an application deployed on multiple nodes and multiple zones, which has enough resources and no longer serves traffic to failing instances. We have thus handled all the involuntary service interruptions; now we need to look at the voluntary ones 🙂.

Managing Voluntary Interruptions

There are two types of voluntary interruptions: those generated during a deployment, and those generated by system applications. From an application point of view, they are identical: the web server receives a signal (SIGTERM) indicating that it must stop, and will be forced to do so after a certain time (SIGKILL).

Our job here is to ensure that these interruptions are not sent to all instances of our application at the same time.

To protect our application during a deployment, our Deployment must apply the “RollingUpdate” strategy to update the application. This method replaces instances one after the other, ensuring that the newly deployed instance is ready to receive traffic before removing an old one. This is the default deployment strategy of the Deployment, but we can also make it explicit in our manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  # The RollingUpdate strategy ensures that we always have instances available during updates
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Allows 1 extra pod to be created during the update
      maxSurge: 1
      # Ensures that 0 pods are unavailable during the update
      maxUnavailable: 0

To protect our application from interruptions coming from system applications, we will need to create a PodDisruptionBudget, which defines either the number of pods our application can afford to have interrupted or the minimum number of pods that must remain available.

You can achieve this goal in two ways:

  • Either by using the “minAvailable” parameter to indicate the minimum number of pods that must be ready at any time.
  • Or by using the “maxUnavailable” parameter to specify the maximum number of pods the application can afford to lose.

For our application, we will only allow 1 pod to be interrupted at a time.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  # Only one pod should be interrupted at a time
  maxUnavailable: 1
  selector:
    matchLabels:
      app: example

Note that the PodDisruptionBudget protects your pods from voluntary interruptions only on a “best effort” basis; some applications may not respect this budget, or may be forced to work around it.

Our application is now deployed across multiple nodes and multiple zones. It will not have resource issues, and will be able to divert traffic from instances that have issues. Additionally, instance outages are controlled and do not impact application availability.

Supervising Application

Our application should now be considered “highly available”, but we need to verify this, and to do so we add monitoring.

This monitoring should be done at several levels, and should ideally include a component capable of alerting the teams when threshold values are reached.

In Kubernetes, you need to monitor these metrics:

  • CPU usage + pod throttling
  • pod memory usage
  • number of pod restarts
  • number of pods deployed
  • 99th percentile application response time

These metrics will allow us, as cluster administrators, to see that the application is working correctly and, above all, to receive alerts quickly when it encounters problems.

To implement this monitoring, the Prometheus + Grafana combo is today the most common in the Kubernetes world.
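
For example, assuming the Prometheus Operator and kube-state-metrics are installed (which depends on your cluster setup), alerting rules on some of these metrics could look like this sketch:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerts
spec:
  groups:
    - name: example-availability
      rules:
        # Alert if fewer than 2 replicas of the deployment are available
        - alert: ExampleTooFewReplicas
          expr: kube_deployment_status_replicas_available{deployment="example-deployment"} < 2
          for: 5m
          labels:
            severity: critical
        # Alert if pods of the application restart repeatedly
        - alert: ExampleRestartingTooOften
          expr: increase(kube_pod_container_status_restarts_total{pod=~"example-deployment-.*"}[15m]) > 3
          for: 10m
          labels:
            severity: warning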

The metrics specified above are metrics that are easily obtainable as a Kubernetes administrator, without modifying the application. If you can modify the application, adding metrics directly from the code will allow you to know as quickly as possible how your application is performing, and where there might be problems. The OpenTelemetry framework will allow you to quickly output metrics from your application, in a way that is compatible with most collection tools.

We also need to check that the application is reachable by our target users; the previous metrics only come from inside the cluster. There are various tools that let you run regular tests to see whether the application responds at a given URL, such as UptimeRobot or Uptime Kuma.

This type of test allows us to verify that all the components between the internet and our application are working: if, for example, a firewall rule was changed, only an access test from outside the cluster could detect the problem.

Regular monitoring is necessary to ensure that your application is working. You will also need to regularly adapt the configuration of your application to take into account the code and traffic changes that it will have during its life.

Conclusion

Modifying your application to be highly available requires configuring several elements to cover all possible cases of interruptions. Kubernetes allows us to do this relatively simply, as long as we know all the possible causes of problems we may encounter.

Moving an application to high availability will increase its cost due to the number of additional instances, and may also result in longer deployment times: this is a necessary trade-off for production environments, but may not be desirable for development environments.

Author

  • Mohamed BEN HASSINE

    Mohamed BEN HASSINE is a hands-on Cloud Solution Architect based in France. He has been working on Java, Web, API, and Cloud technologies for over 12 years and keeps learning new things. He currently works as a Cloud / Application Architect in Paris, designing cloud-native solutions and APIs (REST, gRPC) using cutting-edge technologies (GCP, Kubernetes, APIGEE, Java, Python).
