Monitoring & Troubleshooting

This guide covers setting up monitoring for your Kepler deployment and resolving common issues. It includes OpenShift-specific monitoring configuration, Grafana integration, and systematic troubleshooting approaches.

Monitoring Setup

Enable User Workload Monitoring

Ensure User Workload Monitoring is enabled in your OpenShift cluster:

# Check current configuration
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml

If User Workload Monitoring is not enabled, create or update the configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

Apply the configuration:

oc apply -f cluster-monitoring-config.yaml
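
After applying the configuration, the user workload monitoring stack should start in the openshift-user-workload-monitoring namespace. You can confirm this before moving on:

# Confirm the user workload monitoring stack is running
oc get pods -n openshift-user-workload-monitoring

# Expect prometheus-user-workload (and related) pods in Running state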

Verify ServiceMonitor Creation

The Kepler Operator automatically creates ServiceMonitor resources for Prometheus integration:

# Check ServiceMonitor
oc get servicemonitor -n power-monitor

# View ServiceMonitor configuration
oc get servicemonitor -n power-monitor -o yaml

Expected output:

NAME             AGE
kepler-exporter  5m

Access OpenShift Metrics

Navigate to Observe → Metrics in the OpenShift console to view power consumption metrics:

Screenshot: OpenShift metrics dashboard showing the power consumption overview.

Key Metrics to Monitor

Query these metrics in the OpenShift console:

# Node power consumption
kepler_node_power_total

# Pod power consumption
kepler_pod_power_total

# Container energy consumption
kepler_container_energy_total

# Process power consumption
kepler_process_power_total

Test Metrics Endpoint

Verify that metrics are being exported correctly:

# Port forward to metrics endpoint
oc port-forward -n power-monitor svc/kepler-exporter 9102:9102

# Test metrics endpoint (in another terminal)
curl http://localhost:9102/metrics | grep kepler_node_power_total

# Check metrics availability
curl -s http://localhost:9102/metrics | grep -c "kepler_"

Expected: the grep count should be non-zero, and the output should contain many metrics whose names start with kepler_.
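
If the count is zero or the values never change, sampling the endpoint over time makes stale metrics easy to spot. A small shell loop (assuming the port-forward from the previous step is still running) is enough:

# Sample one metric every 5 seconds to confirm values are updating (Ctrl-C to stop)
while true; do
  curl -s http://localhost:9102/metrics | grep '^kepler_node_power_total' | head -1
  sleep 5
done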

Grafana Integration

Import Kepler Dashboard

For advanced visualization, import the official Kepler Grafana dashboard:

# Download the dashboard JSON
curl -O https://raw.githubusercontent.com/sustainable-computing-io/kepler-operator/v1alpha1/hack/dashboard/assets/kepler/dashboard.json

Then import into your Grafana instance:

  1. Open Grafana
  2. Go to Dashboards → Import
  3. Upload the dashboard.json file
  4. Configure data source (Prometheus)
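
If you prefer to automate the import, Grafana also accepts dashboards over its HTTP API. The sketch below assumes dashboard.json contains the bare dashboard model and uses placeholder values for the Grafana URL and API token:

# Placeholder values - replace with your Grafana URL and an API token
GRAFANA_URL=https://grafana.example.com
GRAFANA_TOKEN=<api-token>

# Wrap the dashboard model in an import payload and POST it
jq '{dashboard: ., overwrite: true, folderId: 0}' dashboard.json \
  | curl -s -X POST "$GRAFANA_URL/api/dashboards/db" \
      -H "Authorization: Bearer $GRAFANA_TOKEN" \
      -H "Content-Type: application/json" \
      --data-binary @-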

Custom Grafana Queries

Useful PromQL queries for Grafana dashboards:

# Total cluster power consumption
sum(kepler_node_power_total)

# Power consumption by node
kepler_node_power_total

# Top 10 power-consuming pods
topk(10, kepler_pod_power_total)

# Power consumption rate (watts)
rate(kepler_node_energy_total[5m])

# CPU vs Power correlation
kepler_node_power_total / on(instance) kepler_node_cpu_usage_percentage * 100
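
To report cumulative energy rather than instantaneous power, the energy counter can be converted to kilowatt-hours. This assumes kepler_node_energy_total is a counter reported in joules; check the metric's HELP text to confirm the unit before relying on the result:

# Cluster-wide energy over the last 24 hours, converted from joules to kWh
# (1 kWh = 3,600,000 J)
sum(increase(kepler_node_energy_total[24h])) / 3600000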

Troubleshooting

Common Issues and Solutions

Issue: PowerMonitor Not Creating DaemonSet

Symptoms:

  • PowerMonitor exists but no DaemonSet is created
  • Status shows conditions with errors

Diagnosis:

# Check PowerMonitor status and conditions
oc describe powermonitor power-monitor

# Check operator logs
oc logs -n openshift-operators deployment/kepler-operator-controller-manager

Solutions:

  1. Check RBAC permissions:
# Verify operator service account permissions
oc auth can-i create daemonsets --as=system:serviceaccount:openshift-operators:kepler-operator-controller-manager

# If permissions are missing, check ClusterRole
oc describe clusterrole kepler-operator-manager-role
  2. Verify resource quotas:
# Check namespace resource quotas
oc describe resourcequota -n power-monitor

# Check if quotas are blocking creation
oc get events -n power-monitor | grep -i quota

Issue: Pods Not Scheduling on Nodes

Symptoms:

  • DaemonSet created but pods remain in Pending state
  • Pods show FailedScheduling events

Diagnosis:

# Check pod status and events
oc get pods -n power-monitor
oc describe pods -n power-monitor

# Check node labels and taints
oc get nodes --show-labels
oc describe nodes | grep -i taint

Solutions:

  1. Update node selector:
# Check available node labels
oc get nodes --show-labels | grep kubernetes.io/os

# Update PowerMonitor nodeSelector
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "deployment": {
        "nodeSelector": {
          "kubernetes.io/os": "linux"
        }
      }
    }
  }
}'
  2. Add tolerations for tainted nodes:
# Add toleration for master nodes
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "deployment": {
        "tolerations": [
          {
            "key": "node-role.kubernetes.io/master",
            "operator": "Exists",
            "effect": "NoSchedule"
          }
        ]
      }
    }
  }
}'
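
After either patch, confirm that the DaemonSet rolls out and that pods land on the intended nodes. The DaemonSet name below is an example; list the DaemonSets in the namespace first if yours differs:

# Find the DaemonSet name, then watch the rollout and pod placement
oc get daemonset -n power-monitor
oc rollout status daemonset/kepler-exporter -n power-monitor
oc get pods -n power-monitor -o wide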

Issue: Missing Metrics in Monitoring

Symptoms:

  • Pods are running but metrics don't appear in OpenShift console
  • ServiceMonitor exists but no data in Prometheus

Diagnosis:

# Check ServiceMonitor configuration
oc get servicemonitor -n power-monitor -o yaml

# Check service endpoints
oc get endpoints -n power-monitor

# Test metrics endpoint directly
oc port-forward -n power-monitor svc/kepler-exporter 9102:9102
curl http://localhost:9102/metrics | head -20

Solutions:

  1. Verify User Workload Monitoring:
# Check if User Workload Monitoring is enabled
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml

# Check user workload monitoring pods
oc get pods -n openshift-user-workload-monitoring
  2. Check ServiceMonitor labels:
# Ensure ServiceMonitor has correct labels for discovery
oc patch servicemonitor kepler-exporter -n power-monitor --type='merge' -p='
{
  "metadata": {
    "labels": {
      "app.kubernetes.io/name": "kepler-exporter"
    }
  }
}'
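
To confirm that Prometheus actually discovers and scrapes the target, you can inspect the user workload Prometheus directly. This sketch assumes a typical pod name (prometheus-user-workload-0) and that a plain port-forward reaches the Prometheus API; on some OpenShift versions the port may require authentication:

# Verify the pod name, then port forward to the user workload Prometheus
oc get pods -n openshift-user-workload-monitoring
oc port-forward -n openshift-user-workload-monitoring prometheus-user-workload-0 9090:9090

# In another terminal: check whether a kepler target appears and is healthy
curl -s http://localhost:9090/api/v1/targets | grep -i kepler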

Issue: High Resource Usage

Symptoms:

  • Kepler pods consuming excessive CPU or memory
  • Cluster performance degradation

Diagnosis:

# Check resource usage
oc top pods -n power-monitor

# Check current configuration
oc get powermonitor power-monitor -o yaml | grep -A 10 config

Solutions:

  1. Reduce metric granularity:
# Reduce to node and pod metrics only
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "metricLevels": ["node", "pod"],
        "sampleRate": "10s"
      }
    }
  }
}'
  2. Limit terminated workload tracking:
# Reduce terminated workload tracking
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "maxTerminated": 100
      }
    }
  }
}'

Advanced Debugging

Enable Debug Logging

For detailed troubleshooting, enable debug logging:

# Enable debug logging
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "logLevel": "debug"
      }
    }
  }
}'

# View debug logs
oc logs -n power-monitor -l app.kubernetes.io/name=kepler-exporter -f

Remember to disable debug logging in production:

oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "logLevel": "info"
      }
    }
  }
}'
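
Debug output is verbose; when sifting through it, filtering for warnings and errors keeps the signal manageable:

# Filter recent debug output for warnings and errors
oc logs -n power-monitor -l app.kubernetes.io/name=kepler-exporter --tail=500 | grep -iE "error|warn|fail"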

Collect Diagnostic Information

Create a diagnostic script for support:

#!/bin/bash
# kepler-diagnostics.sh

echo "=== Kepler Diagnostics ==="
echo "Date: $(date)"
echo "Cluster: $(oc cluster-info | head -1)"
echo

echo "=== PowerMonitor Status ==="
oc get powermonitor power-monitor -o wide
echo

echo "=== PowerMonitor Conditions ==="
oc describe powermonitor power-monitor | grep -A 20 "Conditions:"
echo

echo "=== DaemonSet Status ==="
oc get daemonset -n power-monitor
echo

echo "=== Pod Status ==="
oc get pods -n power-monitor -o wide
echo

echo "=== Recent Events ==="
oc get events -n power-monitor --sort-by='.lastTimestamp' | tail -10
echo

echo "=== ServiceMonitor ==="
oc get servicemonitor -n power-monitor
echo

echo "=== Operator Logs (last 50 lines) ==="
oc logs -n openshift-operators deployment/kepler-operator-controller-manager --tail=50

Run with:

chmod +x kepler-diagnostics.sh
./kepler-diagnostics.sh > kepler-diagnostics-$(date +%Y%m%d-%H%M%S).log

Performance Tuning

For production environments where monitoring overhead matters, consider a configuration along these lines:

spec:
  kepler:
    config:
      logLevel: warn
      metricLevels: [node, pod]  # Avoid process/container levels
      sampleRate: 10s            # Reduce sampling frequency
      maxTerminated: 500         # Limit memory usage
    deployment:
      nodeSelector:
        node-role.kubernetes.io/worker: ""  # Avoid master nodes
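
These settings can be applied with the same merge-patch pattern used earlier, for example:

# Apply the tuning settings shown above
oc patch powermonitor power-monitor --type='merge' -p='
{
  "spec": {
    "kepler": {
      "config": {
        "logLevel": "warn",
        "metricLevels": ["node", "pod"],
        "sampleRate": "10s",
        "maxTerminated": 500
      },
      "deployment": {
        "nodeSelector": {
          "node-role.kubernetes.io/worker": ""
        }
      }
    }
  }
}'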

Resource Limits

Set resource limits for Kepler pods:

spec:
  kepler:
    deployment:
      resources:
        limits:
          cpu: 200m
          memory: 256Mi
        requests:
          cpu: 100m
          memory: 128Mi
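
After setting limits, confirm they were propagated to the running pods and compare them against actual consumption:

# Show the resources section of each Kepler pod
oc get pods -n power-monitor -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

# Compare against actual usage
oc top pods -n power-monitor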

Getting Help

If you continue to experience issues:

  1. Check the logs with the diagnostic script above
  2. Search existing issues in the Kepler Operator repository
  3. Open a new issue with diagnostic information
  4. Join the community - See Support for community channels
