Modern applications require elasticity — the ability to automatically adjust to fluctuating traffic and workloads. Kubernetes provides powerful built-in mechanisms for autoscaling pods, ensuring applications remain responsive while optimizing resource usage. In this article, we dive into the core concepts, configurations, and best practices of Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA), helping you scale workloads effectively in dynamic environments.


Why Autoscaling in Kubernetes Matters

As applications experience variable loads, static resource allocation often leads to either over-provisioning (wasted resources) or under-provisioning (performance degradation). Kubernetes autoscaling mitigates these issues by:

  • Dynamically scaling replica counts (HPA)
  • Adjusting CPU and memory requests/limits (VPA)
  • Maintaining application performance SLAs
  • Reducing manual intervention and cost overhead

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler automatically increases or decreases the number of pod replicas in a Deployment, StatefulSet, or ReplicaSet based on observed metrics such as CPU, memory, or custom metrics.

How HPA Works
  • Periodically queries the metrics APIs (resource metrics via the Kubernetes Metrics Server, custom or external metrics via an adapter such as Prometheus Adapter)
  • Calculates the desired replica count from the observed metric value and the target threshold (see the formula below)
  • Scales the number of pods up or down accordingly
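
The core calculation is: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). For example, if 4 replicas average 90% CPU utilization against a 60% target, the controller requests ceil(4 × 90 / 60) = 6 replicas.
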
Prerequisites
  • Kubernetes Metrics Server or Custom Metrics Adapter
  • Defined CPU/Memory requests in pod specs
  • API version: autoscaling/v2 (for multi-metric support)
Example HPA Configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

This configuration keeps average CPU utilization near 70% of each pod's CPU request by scaling the webapp Deployment between 2 and 10 replicas.

Benefits of HPA
  • Ideal for stateless applications
  • Scales in/out based on real-time demand
  • Quick response to traffic surges
Limitations
  • Not suitable for apps with long startup times, which can cause flapping (the behavior tuning shown below can help)
  • Ineffective when bottlenecks are internal (not replica-count related)
  • Requires proper CPU/memory request/limit definitions
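
Some of this can be tuned with the optional behavior field that autoscaling/v2 adds under spec, controlling how aggressively the HPA scales in each direction; the windows and rates below are illustrative, not recommendations:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 0     # react to surges immediately
    policies:
      - type: Pods
        value: 4                      # add at most 4 pods per minute
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300   # wait 5 minutes before removing pods
    policies:
      - type: Percent
        value: 50                     # drop at most half the replicas per minute
        periodSeconds: 60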

Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler adjusts the resource requests and limits (CPU and memory) for individual pods, making it ideal for workloads that cannot scale horizontally.

How VPA Works
  • Monitors historical and real-time usage
  • Recommends, updates, or automatically applies new resource values
  • Evicts and recreates pods for new values to take effect (currently unavoidable with the stock updater)
VPA Modes
  • Off – Only computes and publishes recommendations
  • Initial – Applies recommendations only when pods are created
  • Recreate – Evicts running pods whose requests drift from the recommendation
  • Auto – Currently equivalent to Recreate; intended to use in-place updates once they are supported
Example VPA Configuration
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Auto"

This configuration automatically manages CPU and memory for the webapp deployment based on usage patterns.
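
In practice you usually want to bound what VPA may request. Adding a resourcePolicy section under spec (part of the same autoscaling.k8s.io/v1 API) sets per-container floors and ceilings; the values below are illustrative:

resourcePolicy:
  containerPolicies:
    - containerName: "*"          # apply to every container in the pod
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "2"
        memory: 2Gi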

Benefits of VPA
  • Useful for memory-intensive or monolithic apps
  • Reduces the need to manually tune resources
  • Prevents resource starvation or over-allocation
Limitations
  • Restarts pods, which may be disruptive
  • Less suitable for workloads that cannot tolerate evictions; stateless, high-availability services are usually better served by HPA
  • Not compatible with HPA for the same resource (e.g., both managing CPU)

Combining HPA and VPA

While HPA and VPA can work together, you must configure them carefully. The two autoscalers should never manage the same resource type (e.g., CPU) at the same time, since their adjustments conflict rather than cooperate. However, you can:

  • Use HPA for CPU (replica scaling)
  • Use VPA for memory (resource optimization)
Best Practice Configuration
  • If HPA already scales on a resource metric, run VPA in Off or Initial mode for that resource
  • Let HPA own CPU and VPA own memory (see the sketch after this list)
  • Consider custom metrics or external metrics with HPA (via Prometheus Adapter)
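
A minimal sketch of this split, reusing the webapp Deployment from earlier: the HPA shown previously keeps scaling on CPU, while a VPA restricted through controlledResources manages only memory:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-memory-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        controlledResources: ["memory"]   # leave CPU to the HPA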

Advanced Use Cases

1. Autoscaling with Custom Metrics

Use Prometheus Adapter to scale based on custom metrics like queue length, request latency, or business KPIs.
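
As a sketch, an autoscaling/v2 HPA can target a per-pod custom metric exposed through the adapter; the metric name http_requests_per_second below is hypothetical and must match whatever your Prometheus Adapter configuration exposes:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-requests-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical adapter-exposed metric
        target:
          type: AverageValue
          averageValue: "100"              # aim for ~100 requests/s per pod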

2. KEDA Integration

Kubernetes Event-Driven Autoscaling (KEDA) scales workloads, including down to zero, based on event sources such as Kafka topics, RabbitMQ queues, or Azure Service Bus, as sketched below.
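
A minimal ScaledObject sketch for a Kafka-backed consumer (KEDA must be installed, and all names and addresses here are hypothetical):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: orders-consumer-scaler
spec:
  scaleTargetRef:
    name: orders-consumer           # hypothetical Deployment to scale
  minReplicaCount: 0                # KEDA can scale to zero between events
  maxReplicaCount: 20
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.default.svc:9092
        consumerGroup: orders-group
        topic: orders
        lagThreshold: "50"          # target consumer lag per replica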

3. Cluster Autoscaler

While HPA/VPA manage pods, Cluster Autoscaler handles node scaling, ensuring sufficient compute to host newly spawned pods.


Monitoring and Observability

Use these tools to monitor autoscaling behavior:

  • Prometheus + Grafana: Visualize HPA and VPA metrics
  • Kubernetes Metrics Server: Supplies CPU and memory metrics to the autoscalers
  • VPA Recommender Logs: Understand resource tuning decisions
  • Kubernetes Events: Track scaling events for debugging

Best Practices for Kubernetes Autoscaling

  • Always define resource requests and limits (see the fragment after this list)
  • Set realistic minReplicas and maxReplicas
  • Use liveness/readiness probes to prevent premature scaling
  • Monitor scaling latency and application startup time
  • Use custom metrics for domain-specific scaling needs
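
Both autoscalers depend on accurate requests, so the first bullet is worth spelling out. A minimal container fragment from a Deployment's pod template, with illustrative values:

containers:
  - name: webapp
    image: example/webapp:1.0       # hypothetical image
    resources:
      requests:
        cpu: 250m                   # HPA utilization math is relative to this
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
    readinessProbe:                 # keeps unready pods out of service traffic
      httpGet:
        path: /healthz
        port: 8080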

Conclusion

Scaling in Kubernetes is both a science and an art. By leveraging Horizontal Pod Autoscaler and Vertical Pod Autoscaler, you can build resilient, performant, and cost-efficient systems. Whether you’re running stateless microservices or monolithic batch jobs, autoscaling enables your platform to adapt intelligently to demand — maximizing uptime and minimizing waste.