Monitoring and Logging Kubernetes Clusters with Prometheus and Grafana

Effective monitoring and logging are critical for running reliable Kubernetes clusters in production. Without proper visibility into cluster health, resource usage, and application performance, troubleshooting becomes challenging and outages costly.

This article guides you through setting up Prometheus for metrics collection and Grafana for visualization, creating a robust observability stack that empowers your team to monitor Kubernetes clusters effectively.

Why Monitoring and Logging Matter in Kubernetes

Kubernetes clusters are dynamic environments with ephemeral containers, autoscaling workloads, and complex inter-service communication. Key challenges include:

Tracking resource utilization (CPU, memory, disk I/O)
Monitoring pod and node health
Detecting performance bottlenecks
Correlating logs and metrics for root cause analysis
Configuring alerting to respond to incidents proactively

Prometheus and Grafana together provide a comprehensive, extensible solution to these challenges.

Setting Up Prometheus for Kubernetes Monitoring

Prometheus is a powerful, open-source monitoring system designed for time-series data collection. It scrapes metrics from Kubernetes components and user applications, storing them for querying and alerting.

Key Components

Prometheus Server: Scrapes and stores metrics
Alertmanager: Handles alerts based on Prometheus rules
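Exporters: Components that expose metrics in the Prometheus format, such as Node Exporter (node hardware and OS), cAdvisor (containers, built into the kubelet), and kube-state-metrics (Kubernetes object state)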

Installation

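You can install Prometheus in Kubernetes using the Prometheus Operator, which simplifies deployment and management. The upstream bundle installs the operator and its custom resource definitions (CRDs) and deploys the operator into the default namespace; the monitoring namespace created below is for the Prometheus, Alertmanager, and Grafana resources you add afterwards. Use kubectl create (or kubectl apply --server-side) instead of a plain kubectl apply, because the CRDs exceed the size limit of the annotation that client-side apply writes. The main branch tracks development, so pin the URL to a release tag for production use.

kubectl create namespace monitoring
kubectl create -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml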
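
The bundle installs only the operator and its CRDs; a Prometheus server starts once you create a Prometheus custom resource. A minimal sketch follows. It assumes a ServiceAccount named prometheus, with read access to nodes, services, endpoints, and pods, already exists in the monitoring namespace, and that an Alertmanager resource exists there too (the operator exposes it through a Service called alertmanager-operated). The instance name k8s and the team: platform label are illustrative, not values taken from the bundle.

```yaml
# Sketch of a Prometheus instance managed by the Prometheus Operator.
# Names and labels are hypothetical; adjust them to your cluster.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s                        # hypothetical instance name
  namespace: monitoring
spec:
  replicas: 2                      # two replicas for basic availability
  serviceAccountName: prometheus   # assumed to exist with RBAC for discovery
  # Pick up ServiceMonitor and PrometheusRule objects carrying this label.
  serviceMonitorSelector:
    matchLabels:
      team: platform
  ruleSelector:
    matchLabels:
      team: platform
  # Send firing alerts to Alertmanager through its operator-created Service.
  alerting:
    alertmanagers:
      - namespace: monitoring
        name: alertmanager-operated
        port: web
  retention: 15d                   # how long samples stay on local disk
  resources:
    requests:
      memory: 400Mi
```

Selecting by label keeps each team's monitors and rules separate; an empty selector ({}) would instead match every object in the selected namespaces.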

Core Metrics to Monitor

Node Exporter: CPU, memory, disk, and network metrics per node
kube-state-metrics: State of Kubernetes objects such as deployments, pods, and nodes
cAdvisor: Container resource usage metrics
Control plane: API Server, Controller Manager, and Scheduler metrics
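
To make these sources concrete, the sketch below stores one PromQL query per source as a rule in a PrometheusRule resource. The metric names come from Node Exporter, cAdvisor, and kube-state-metrics; the rule names, the resource name, and the team: platform label are assumptions chosen for illustration, and the label must match whatever ruleSelector your Prometheus resource uses.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: core-metrics               # hypothetical name
  namespace: monitoring
  labels:
    team: platform                 # hypothetical; must match the Prometheus ruleSelector
spec:
  groups:
    - name: core-metrics.rules
      rules:
        # Node Exporter: fraction of CPU time not spent idle, per node.
        - record: instance:node_cpu_utilisation:rate5m
          expr: |
            1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
        # Node Exporter: fraction of memory in use, per node.
        - record: instance:node_memory_utilisation:ratio
          expr: |
            1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
        # cAdvisor: CPU cores consumed by each pod.
        - record: namespace_pod:container_cpu_usage_seconds:sum_rate5m
          expr: |
            sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
        # kube-state-metrics: number of pods currently in Pending, per namespace.
        - record: namespace:kube_pod_status_phase_pending:sum
          expr: |
            sum by (namespace) (kube_pod_status_phase{phase="Pending"})
```

The same expressions can be pasted directly into the Prometheus expression browser or a Grafana panel to explore the data before committing them as rules.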

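Example scrape config (native Prometheus format)

Raw jobs like this one are not written inline in the Prometheus custom resource. With the Operator, store the job list (everything under scrape_configs) in a Secret and reference it from spec.additionalScrapeConfigs, or declare targets with ServiceMonitor resources instead.

scrape_configs:
  # Discover every node and scrape Node Exporter on each one.
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    metrics_path: /metrics
    relabel_configs:
      # Node discovery yields the kubelet address (port 10250);
      # rewrite it to the Node Exporter port (9100).
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__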
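
With the Prometheus Operator, targets are more commonly declared through ServiceMonitor resources than through raw scrape jobs. The sketch below is one way to express a Node Exporter target. It assumes a Service labeled app.kubernetes.io/name: node-exporter with a port named metrics exists in the monitoring namespace, and that your Prometheus resource selects ServiceMonitors labeled team: platform; all of these names are illustrative.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: node-exporter              # hypothetical name
  namespace: monitoring
  labels:
    team: platform                 # hypothetical; must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: node-exporter   # assumed Service label
  endpoints:
    - port: metrics                # named Service port (assumed), not a port number
      path: /metrics
      interval: 30s
      relabelings:
        # Replace the default instance label (pod IP and port) with the node name.
        - sourceLabels: [__meta_kubernetes_pod_node_name]
          targetLabel: instance
```

The operator translates each ServiceMonitor into scrape configuration and reloads Prometheus, so targets can be added or removed without editing the server configuration by hand.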

Visualizing Metrics with Grafana

Grafana is an open-source analytics and monitoring dashboard. It integrates seamlessly with Prometheus to visualize metrics in customizable dashboards.

Steps to Set Up Grafana

Deploy Grafana in your cluster or on a VM
Configure Prometheus as a data source
Import Kubernetes dashboards from Grafana Labs
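
The data source step can be done in the Grafana UI or declared as a provisioning file so that it survives redeployments. The following is a minimal sketch of such a file. The URL is an assumption: it points at the prometheus-operated Service that the Prometheus Operator creates, in a namespace called monitoring, and should be replaced with the address of your own Prometheus.

```yaml
# Grafana data source provisioning file, for example mounted at
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana's backend proxies the queries
    url: http://prometheus-operated.monitoring.svc:9090   # assumed in-cluster address
    isDefault: true
    jsonData:
      timeInterval: 30s            # should match the Prometheus scrape interval
```

Grafana reads this directory at startup, so in Kubernetes the file is typically delivered through a ConfigMap or Secret mounted into the Grafana pod.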

Recommended Dashboards

Kubernetes cluster monitoring (ID: 6417)
Node Exporter Full (ID: 1860)
Kube State Metrics (ID: 11074)

Alerting

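Grafana Alerting evaluates alert rules defined on data source queries (older Grafana versions attached alerts to individual dashboard panels) and can also display alert rules managed in Prometheus. Notifications are delivered through contact points such as email, Slack, and PagerDuty.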
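
For rules that live in Prometheus, a PrometheusRule resource is one way to define them when the Prometheus Operator is in use. The sketch below relies on two kube-state-metrics series. The thresholds, durations, severity labels, resource name, and the team: platform label are illustrative assumptions to tune against your own SLOs.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-health-alerts      # hypothetical name
  namespace: monitoring
  labels:
    team: platform                 # hypothetical; must match the Prometheus ruleSelector
spec:
  groups:
    - name: cluster-health.alerts
      rules:
        # Fires when a node has reported Ready=false for five minutes.
        - alert: NodeNotReady
          expr: |
            kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"
        # Fires when a container restarted more than three times in 15 minutes.
        - alert: PodRestartingFrequently
          expr: |
            increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting frequently"
```

Prometheus evaluates these rules and hands firing alerts to Alertmanager, which handles grouping, silencing, and routing to receivers.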

Centralized Logging with Complementary Tools

Prometheus specializes in metrics, so logging requires complementary tools like:

Fluentd / Fluent Bit: Collects and forwards logs from pods/nodes
Elasticsearch: Stores and indexes logs
Kibana: Visualizes and queries logs

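Correlating logs with Prometheus metrics, for example by attaching the same namespace and pod labels to both, lets you move from an alerting metric to the log lines that explain it.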
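
As an illustration of the collection step, here is a sketch of a Fluent Bit pipeline in its YAML configuration format. It assumes Fluent Bit runs as a DaemonSet with the node's /var/log directory mounted and that Elasticsearch is reachable at the hypothetical address elasticsearch.logging.svc; buffer sizes and index names are placeholders.

```yaml
# Fluent Bit pipeline sketch: tail container logs, enrich them with
# Kubernetes metadata, and ship them to Elasticsearch.
service:
  flush: 5                         # seconds between flushes to outputs
  log_level: info
pipeline:
  inputs:
    - name: tail
      path: /var/log/containers/*.log
      tag: kube.*
      multiline.parser: docker, cri   # handle both container log formats
      mem_buf_limit: 5MB
  filters:
    - name: kubernetes             # add pod, namespace, and label metadata
      match: kube.*
      merge_log: on                # parse JSON log bodies into structured fields
  outputs:
    - name: es
      match: kube.*
      host: elasticsearch.logging.svc   # hypothetical in-cluster address
      port: 9200
      logstash_format: on          # write to date-suffixed indices
      logstash_prefix: kubernetes
      suppress_type_name: on       # needed for Elasticsearch 8.x
```

The kubernetes filter is what makes correlation practical: the namespace and pod names it attaches are the same ones that appear as labels on Prometheus series.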

Best Practices for Kubernetes Monitoring

Define Service Level Objectives (SLOs) to guide alert thresholds
Use labeling conventions for effective metric filtering
Secure your monitoring stack (RBAC, network policies)
Regularly update dashboards and alert rules as clusters evolve
Use recording rules in Prometheus for expensive queries
Employ downsampling or long-term storage solutions (e.g., Thanos)
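
As one example of securing the stack, the NetworkPolicy sketched below limits inbound traffic to Prometheus so that only Grafana pods in the same namespace can reach its query port. It assumes Prometheus pods carry the app.kubernetes.io/name: prometheus label and Grafana pods carry app.kubernetes.io/name: grafana; check the labels on your own pods, and note that the policy only takes effect when the cluster's network plugin enforces NetworkPolicy.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus   # hypothetical name
  namespace: monitoring
spec:
  # Applies to the Prometheus server pods (label assumed).
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    # Allow only Grafana pods in this namespace to reach port 9090.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana   # assumed Grafana pod label
      ports:
        - protocol: TCP
          port: 9090
```

Because the policy selects the Prometheus pods for ingress, all other inbound connections to them are dropped; further sources, such as Alertmanager or a second Prometheus replica, would each need their own entry under from.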

Conclusion

Setting up monitoring for Kubernetes clusters with Prometheus and Grafana, complemented by a centralized logging pipeline, provides the visibility needed to keep applications healthy and performant. By following best practices and leveraging community dashboards, you can detect issues early, optimize resource usage, and ensure smoother operations.