Effective monitoring and logging are critical for running reliable Kubernetes clusters in production. Without proper visibility into cluster health, resource usage, and application performance, troubleshooting becomes challenging and outages costly.

This article guides you through setting up Prometheus for metrics collection and Grafana for visualization, creating a robust observability stack that empowers your team to monitor Kubernetes clusters effectively.


Why Monitoring and Logging Matter in Kubernetes

Kubernetes clusters are dynamic environments with ephemeral containers, autoscaling workloads, and complex inter-service communication. Key challenges include:

  • Tracking resource utilization (CPU, memory, disk I/O)
  • Monitoring pod and node health
  • Detecting performance bottlenecks
  • Correlating logs and metrics for root cause analysis
  • Configuring alerting to respond to incidents proactively

Prometheus and Grafana together provide a comprehensive, extensible solution to these challenges.


Setting Up Prometheus for Kubernetes Monitoring

Prometheus is a powerful, open-source monitoring system designed for time-series data collection. It scrapes metrics from Kubernetes components and user applications, storing them for querying and alerting.

Key Components
  • Prometheus Server: Scrapes and stores metrics
  • Alertmanager: Handles alerts based on Prometheus rules
  • Exporters: Components that expose metrics from Kubernetes nodes, cAdvisor, kube-state-metrics, etc.
Installation

You can install Prometheus in Kubernetes using the Prometheus Operator, which simplifies deployment and management.

kubectl create namespace monitoring
kubectl apply -f https://raw.githubusercontent.com/prometheus-operator/prometheus-operator/main/bundle.yaml
Core Metrics to Monitor
  • Node Exporter: CPU, memory, disk, and network metrics per node
  • kube-state-metrics: Kubernetes object states like deployments, pods, nodes
  • cAdvisor: Container resource usage metrics
  • API Server, Controller Manager, Scheduler metrics
Example scrape config (in Prometheus CRD)
scrapeConfigs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    metrics_path: /metrics
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__

Visualizing Metrics with Grafana

Grafana is an open-source analytics and monitoring dashboard. It integrates seamlessly with Prometheus to visualize metrics in customizable dashboards.

Steps to Set Up Grafana
  • Deploy Grafana in your cluster or on a VM
  • Configure Prometheus as a data source
  • Import Kubernetes dashboards from Grafana Labs
  • Kubernetes cluster monitoring (ID: 6417)
  • Node Exporter Full (ID: 1860)
  • Kube State Metrics (ID: 11074)
Alerting

Grafana supports alerting based on dashboard panels or Prometheus alert rules, delivering notifications through email, Slack, PagerDuty, and more.


Centralized Logging with Prometheus and Complementary Tools

Prometheus specializes in metrics, so logging requires complementary tools like:

  • Fluentd / Fluent Bit: Collects and forwards logs from pods/nodes
  • Elasticsearch: Stores and indexes logs
  • Kibana: Visualizes and queries logs

Combining logs with Prometheus metrics enables comprehensive troubleshooting.


Best Practices for Kubernetes Monitoring

  • Define Service Level Objectives (SLOs) to guide alert thresholds
  • Use labeling conventions for effective metric filtering
  • Secure your monitoring stack (RBAC, network policies)
  • Regularly update dashboards and alert rules as clusters evolve
  • Use recording rules in Prometheus for expensive queries
  • Employ downsampling or long-term storage solutions (e.g., Thanos)

Conclusion

Setting up monitoring and logging for Kubernetes clusters using Prometheus and Grafana provides visibility critical for maintaining healthy, performant applications. By following best practices and leveraging community dashboards, you can detect issues early, optimize resource usage, and ensure smoother operations.