Metrics Reference¶
Complete reference of all metrics emitted by the E6 Operator.
Metrics Endpoint¶
URL: http://<operator-pod>:8080/metrics
Format: Prometheus text format
Access:
# Port-forward to metrics service
kubectl port-forward -n e6-operator-system \
svc/e6-operator-metrics-service 8080:8080
# Query all metrics
curl http://localhost:8080/metrics
# Query specific metric
curl http://localhost:8080/metrics | grep controller_runtime_reconcile_total
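Once Prometheus scrapes this endpoint, the same data is available via PromQL. Assuming a scrape job named e6-operator (the job label used in the query examples below), a quick check that the target is being scraped is:
# 1 = target is up and being scraped
up{job="e6-operator"}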
Metric Categories¶
- Controller Runtime Metrics - Reconciliation performance
- Workqueue Metrics - Queue depth and processing
- Process Metrics - CPU, memory, file descriptors
- Go Runtime Metrics - Goroutines, GC, memory
- Webhook Metrics - Admission webhook performance
- Leader Election Metrics - HA leader status
- Client Metrics - Kubernetes API client performance
Controller Runtime Metrics¶
controller_runtime_reconcile_total¶
Type: Counter
Description: Total number of reconciliations per controller.
Labels:
- controller - Controller name (e.g., "metadataservices")
- result - Reconciliation result: success, error, requeue, requeue_after
Usage:
# Total reconciliations
sum(controller_runtime_reconcile_total{controller="metadataservices"})
# Success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))
* 100
# Error rate
rate(controller_runtime_reconcile_total{result="error",controller="metadataservices"}[5m])
# Requeue rate (indicates transient issues)
rate(controller_runtime_reconcile_total{result="requeue",controller="metadataservices"}[5m])
What it tells you:
- High success rate (>99%) = healthy operator (see the example alert below)
- High error rate = configuration issues, API server problems, or bugs
- High requeue rate = resources waiting for conditions (e.g., deployments not ready)
- Sudden spike = mass resource updates or operator restart
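An example alert expression based on the 99% guidance above (the threshold is illustrative):
# Fire when the success rate drops below 99%
sum(rate(controller_runtime_reconcile_total{controller="metadataservices",result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total{controller="metadataservices"}[5m]))
< 0.99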
controller_runtime_reconcile_errors_total¶
Type: Counter
Description: Total number of reconciliation errors per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Error rate
rate(controller_runtime_reconcile_errors_total{controller="metadataservices"}[5m])
# Total errors in last hour
increase(controller_runtime_reconcile_errors_total{controller="metadataservices"}[1h])
# Alert on high error rate
rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
What it tells you:
- Errors > 0 = problems that need investigation
- Check the operator logs for error details
- Common causes: RBAC issues, API server timeouts, invalid resource specs
controller_runtime_reconcile_time_seconds¶
Type: Histogram
Description: Distribution of reconciliation duration in seconds.
Labels: - controller - Controller name (e.g., "metadataservices")
Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60
Usage:
# P50 latency
histogram_quantile(0.50,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# P95 latency
histogram_quantile(0.95,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# P99 latency
histogram_quantile(0.99,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# Average latency
rate(controller_runtime_reconcile_time_seconds_sum{controller="metadataservices"}[5m])
/
rate(controller_runtime_reconcile_time_seconds_count{controller="metadataservices"}[5m])
What it tells you:
- P50 < 1s = good performance
- P95 < 5s = acceptable performance
- P99 > 30s = investigate slow reconciliations (see the example alert below)
- Slow reconciliations may indicate:
  - Complex resources with many dependencies
  - Slow Kubernetes API responses
  - Network issues
  - Heavy operator load
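An example alert expression for the P99 guidance above:
# Fire when P99 reconcile latency exceeds 30s
histogram_quantile(0.99,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
) > 30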
controller_runtime_max_concurrent_reconciles¶
Type: Gauge
Description: Maximum number of concurrent reconciles allowed per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Current max concurrent reconciles
controller_runtime_max_concurrent_reconciles{controller="metadataservices"}
What it tells you:
- Default is typically 1
- Can be increased for higher throughput
- Compare to workqueue depth to determine whether more concurrency is needed (see the comparison query below)
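A rough capacity check, assuming the queue name matches the controller name as in the examples above:
# Pending items exceed the configured concurrency
workqueue_depth{name="metadataservices"}
> scalar(controller_runtime_max_concurrent_reconciles{controller="metadataservices"})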
controller_runtime_active_workers¶
Type: Gauge
Description: Number of currently active workers per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Current active workers
controller_runtime_active_workers{controller="metadataservices"}
# Worker utilization
controller_runtime_active_workers{controller="metadataservices"}
/
controller_runtime_max_concurrent_reconciles{controller="metadataservices"}
* 100
What it tells you:
- Active workers = reconciliations currently in progress
- High utilization + high workqueue depth = need more workers
- Low utilization + high workqueue depth = slow reconciliations
Workqueue Metrics¶
workqueue_adds_total¶
Type: Counter
Description: Total number of items added to workqueue.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Add rate
rate(workqueue_adds_total{name="metadataservices"}[5m])
# Total adds in last hour
increase(workqueue_adds_total{name="metadataservices"}[1h])
What it tells you:
- Sudden spike = many resources created/updated
- Steady high rate = operator processing many reconciliations
- Add rate should roughly match reconciliation rate
workqueue_depth¶
Type: Gauge
Description: Current depth of workqueue (pending items).
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Current depth
workqueue_depth{name="metadataservices"}
# Alert on high depth
workqueue_depth{name="metadataservices"} > 100
What it tells you:
- Depth = 0: Operator keeping up with changes
- Depth > 10: Backlog building up
- Depth > 100: Operator struggling to keep up
- High depth causes:
  - Slow reconciliations
  - High resource churn
  - Insufficient concurrency
  - Operator restart (clears queue)
workqueue_queue_duration_seconds¶
Type: Histogram
Description: Time items spend waiting in queue before processing.
Labels: - name - Queue name (e.g., "metadataservices")
Buckets: 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000
Usage:
# P95 queue wait time
histogram_quantile(0.95,
rate(workqueue_queue_duration_seconds_bucket{name="metadataservices"}[5m])
)
# P99 queue wait time
histogram_quantile(0.99,
rate(workqueue_queue_duration_seconds_bucket{name="metadataservices"}[5m])
)
What it tells you:
- Low wait time (< 1s): Operator responsive
- High wait time (> 10s): Backlog growing
- Very high wait time (> 60s): Resources delayed significantly
workqueue_work_duration_seconds¶
Type: Histogram
Description: Time taken to process an item from queue.
Labels: - name - Queue name (e.g., "metadataservices")
Buckets: Same as queue_duration_seconds
Usage:
# P95 processing time
histogram_quantile(0.95,
rate(workqueue_work_duration_seconds_bucket{name="metadataservices"}[5m])
)
What it tells you: - Similar to reconcile_time_seconds but measured at queue level - Includes both reconciliation and requeueing logic
workqueue_retries_total¶
Type: Counter
Description: Total number of retries handled by workqueue.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Retry rate
rate(workqueue_retries_total{name="metadataservices"}[5m])
# Total retries in last hour
increase(workqueue_retries_total{name="metadataservices"}[1h])
What it tells you:
- Retries happen when reconciliation returns an error or a requeue
- High retry rate indicates transient issues
- Normal for resources in transition (e.g., waiting for pods to start)
- Sustained high retry rate = persistent errors
workqueue_unfinished_work_seconds¶
Type: Gauge
Description: Time since oldest unprocessed item was added.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
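For example, using the queue name from above (the 300s threshold mirrors the guidance below):
# Age of the oldest pending item
workqueue_unfinished_work_seconds{name="metadataservices"}
# Alert on a significant backlog
workqueue_unfinished_work_seconds{name="metadataservices"} > 300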
What it tells you: - Age of oldest pending reconciliation - High value (> 300s) = significant backlog
workqueue_longest_running_processor_seconds¶
Type: Gauge
Description: Duration of longest running processor.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Longest running reconciliation
workqueue_longest_running_processor_seconds{name="metadataservices"}
What it tells you: - Identifies stuck reconciliations - Very high value (> 600s) = reconciliation may be stuck
Process Metrics¶
process_cpu_seconds_total¶
Type: Counter
Description: Total user and system CPU time spent in seconds.
Labels: None
Usage:
# CPU usage rate (cores)
rate(process_cpu_seconds_total{job="e6-operator"}[5m])
# CPU usage percentage (assuming 1 CPU limit)
rate(process_cpu_seconds_total{job="e6-operator"}[5m]) * 100
What it tells you:
- CPU consumption of the operator process
- High CPU = heavy reconciliation load or inefficient code
- Compare to CPU limits to check for throttling
process_resident_memory_bytes¶
Type: Gauge
Description: Resident memory size (RSS) in bytes.
Labels: None
Usage:
# Memory usage in MB
process_resident_memory_bytes{job="e6-operator"} / 1024 / 1024
# Memory usage in GB
process_resident_memory_bytes{job="e6-operator"} / 1024 / 1024 / 1024
What it tells you:
- Actual memory consumed by the operator process
- Steady growth = memory leak (see the growth check below)
- Sudden spike = large resource processing
- Compare to memory limits
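A rough growth check; the 6-hour window and positive-slope heuristic are illustrative, not a definitive leak detector:
# Sustained upward trend in resident memory may indicate a leak
deriv(process_resident_memory_bytes{job="e6-operator"}[6h]) > 0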
process_virtual_memory_bytes¶
Type: Gauge
Description: Virtual memory size in bytes.
Labels: None
Usage:
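For example:
# Virtual memory in GB
process_virtual_memory_bytes{job="e6-operator"} / 1024 / 1024 / 1024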
What it tells you:
- Total virtual address space (includes mapped files, libraries)
- Usually much larger than resident memory
- Less useful than RSS for monitoring
process_virtual_memory_max_bytes¶
Type: Gauge
Description: Maximum virtual memory available in bytes.
Labels: None
Usage:
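For example:
# Fraction of the virtual memory limit in use
process_virtual_memory_bytes{job="e6-operator"}
/
process_virtual_memory_max_bytes{job="e6-operator"}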
What it tells you: - System limit for virtual memory - Rarely hit in practice
process_open_fds¶
Type: Gauge
Description: Number of open file descriptors.
Labels: None
Usage:
# Current open FDs
process_open_fds{job="e6-operator"}
# Alert on high FD usage
process_open_fds{job="e6-operator"} > 1000
What it tells you:
- Includes files, sockets, pipes
- High count (> 1000) = file descriptor leak
- System limit typically 1024-65536
process_max_fds¶
Type: Gauge
Description: Maximum number of open file descriptors.
Labels: None
Usage:
# FD utilization percentage
process_open_fds{job="e6-operator"}
/
process_max_fds{job="e6-operator"}
* 100
What it tells you: - System limit (ulimit -n) - Approaching limit = risk of "too many open files" errors
process_start_time_seconds¶
Type: Gauge
Description: Unix timestamp when process started.
Labels: None
Usage:
# Process uptime in seconds
time() - process_start_time_seconds{job="e6-operator"}
# Process uptime in hours
(time() - process_start_time_seconds{job="e6-operator"}) / 3600
What it tells you: - When operator last restarted - Frequent restarts indicate crashes or OOM kills
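A simple recent-restart check; the 10-minute window is illustrative:
# Instances that started within the last 10 minutes
(time() - process_start_time_seconds{job="e6-operator"}) < 600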
Go Runtime Metrics¶
go_goroutines¶
Type: Gauge
Description: Number of goroutines currently running.
Labels: None
Usage:
# Current goroutine count
go_goroutines{job="e6-operator"}
# Alert on goroutine leak
go_goroutines{job="e6-operator"} > 1000
What it tells you:
- Typical count: 50-200 for an idle operator
- High count (> 1000) = goroutine leak
- Steady growth = leak (investigate with pprof)
- Spikes during heavy reconciliation are normal
go_threads¶
Type: Gauge
Description: Number of OS threads created.
Labels: None
Usage:
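For example:
# Current OS thread count
go_threads{job="e6-operator"}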
What it tells you: - Typically matches number of CPU cores (GOMAXPROCS) - Much lower than goroutine count
go_gc_duration_seconds¶
Type: Summary
Description: Distribution of GC pause durations.
Labels: - quantile - 0, 0.25, 0.5, 0.75, 1.0
Usage:
# Median GC pause
go_gc_duration_seconds{quantile="0.5",job="e6-operator"}
# Maximum GC pause
go_gc_duration_seconds{quantile="1.0",job="e6-operator"}
What it tells you:
- GC pause = stop-the-world time
- High pauses (> 100ms) = memory pressure
- Frequent long pauses = operator unresponsive
go_memstats_alloc_bytes¶
Type: Gauge
Description: Bytes of allocated heap objects.
Labels: None
Usage:
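For example:
# Currently allocated heap memory in MB
go_memstats_alloc_bytes{job="e6-operator"} / 1024 / 1024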
What it tells you: - Currently allocated memory on heap - Includes both reachable and unreachable (garbage) objects
go_memstats_heap_alloc_bytes¶
Type: Gauge
Description: Bytes of allocated heap objects (same as alloc_bytes).
Labels: None
go_memstats_heap_inuse_bytes¶
Type: Gauge
Description: Bytes in in-use spans.
Labels: None
Usage:
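For example:
# In-use heap memory in MB
go_memstats_heap_inuse_bytes{job="e6-operator"} / 1024 / 1024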
What it tells you: - Memory actually in use by application - Lower than sys_bytes (which includes free memory held by Go)
go_memstats_heap_sys_bytes¶
Type: Gauge
Description: Bytes of heap memory obtained from OS.
Labels: None
Usage:
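For example:
# Heap memory obtained from the OS, in MB
go_memstats_heap_sys_bytes{job="e6-operator"} / 1024 / 1024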
What it tells you: - Total heap memory requested from OS - Go may not return this to OS immediately after GC
go_memstats_gc_cpu_fraction¶
Type: Gauge
Description: Fraction of CPU time used by GC.
Labels: None
Usage:
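For example (the 10% threshold mirrors the guidance below):
# Fraction of CPU spent in GC
go_memstats_gc_cpu_fraction{job="e6-operator"}
# Alert on heavy GC load
go_memstats_gc_cpu_fraction{job="e6-operator"} > 0.1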
What it tells you: - Percentage of CPU spent on garbage collection - High value (> 10%) = memory pressure, frequent GC
Webhook Metrics¶
controller_runtime_webhook_requests_total¶
Type: Counter
Description: Total number of admission webhook requests received.
Labels:
- webhook - Webhook path (e.g., "/validate-e6data-io-v1alpha1-metadataservices")
- verb - HTTP verb (POST)
- code - HTTP response code (200, 403, 500)
Usage:
# Request rate
rate(controller_runtime_webhook_requests_total[5m])
# Success rate
sum(rate(controller_runtime_webhook_requests_total{code="200"}[5m]))
/
sum(rate(controller_runtime_webhook_requests_total[5m]))
* 100
# Rejection rate (validation failures)
rate(controller_runtime_webhook_requests_total{code="403"}[5m])
What it tells you: - High rejection rate = users applying invalid resources - 500 errors = webhook crashes or timeouts
controller_runtime_webhook_latency_seconds¶
Type: Histogram
Description: Distribution of webhook request latency.
Labels: - webhook - Webhook path
Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Usage:
# P95 webhook latency
histogram_quantile(0.95,
rate(controller_runtime_webhook_latency_seconds_bucket[5m])
)
# P99 webhook latency
histogram_quantile(0.99,
rate(controller_runtime_webhook_latency_seconds_bucket[5m])
)
What it tells you:
- Latency adds to kubectl apply time
- High latency (> 1s) = user experience impact
- Very high latency (> 10s) = risk of timeout
Leader Election Metrics¶
leader_election_master_status¶
Type: Gauge
Description: Whether this instance is the leader (1 = leader, 0 = follower).
Labels: - name - Leader election name (e.g., "e6-operator-leader-election")
Usage:
# Current leader count (should be 1)
sum(leader_election_master_status)
# Which pod is leader
leader_election_master_status{name="e6-operator-leader-election"} == 1
What it tells you:
- Only one instance should have value 1
- Value 0 = standby replica
- No instances with value 1 = leader election in progress or failed (see the alert example below)
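An example alert expression, using the leader election name from above:
# Fire when there is not exactly one leader
sum(leader_election_master_status{name="e6-operator-leader-election"}) != 1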
Client Metrics¶
rest_client_requests_total¶
Type: Counter
Description: Total number of HTTP requests to Kubernetes API.
Labels:
- code - HTTP response code
- host - API server host
- method - HTTP method (GET, POST, PUT, PATCH, DELETE)
Usage:
# Request rate to API server
rate(rest_client_requests_total[5m])
# Error rate (5xx responses)
rate(rest_client_requests_total{code=~"5.."}[5m])
What it tells you:
- High request rate = operator is chatty with the API server
- 429 errors = rate limiting (too many requests; see the query below)
- 5xx errors = API server issues
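For example, to watch for the rate limiting mentioned above:
# Rate-limited requests (HTTP 429)
rate(rest_client_requests_total{code="429"}[5m])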
rest_client_request_duration_seconds¶
Type: Histogram
Description: Duration of requests to Kubernetes API.
Labels:
- verb - HTTP method
- url - Request URL path
Buckets: 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2.048, 4.096, 8.192, 16.384, 32.768
Usage:
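For example:
# P95 API request latency per verb
histogram_quantile(0.95,
sum(rate(rest_client_request_duration_seconds_bucket[5m])) by (le, verb)
)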
What it tells you: - Slow API responses = performance bottleneck - High latency = overloaded API server or network issues
Useful Metric Combinations¶
Operator Health Score¶
# Composite health score (0-100)
# Each term is aggregated with sum() so all operands are single, label-free series and the arithmetic matches
(
# Success rate (40% weight)
(sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))) * 40
+
# Low error rate (30% weight)
(1 - clamp_max(sum(rate(controller_runtime_reconcile_errors_total[5m])) / 0.1, 1)) * 30
+
# Low latency (20% weight)
(1 - clamp_max(
histogram_quantile(0.95,
sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le)
) / 30, 1)) * 20
+
# Low queue depth (10% weight)
(1 - clamp_max(sum(workqueue_depth) / 100, 1)) * 10
)
Resource Saturation¶
# CPU saturation
rate(process_cpu_seconds_total[5m]) / <cpu_limit>
# Memory saturation
process_resident_memory_bytes / <memory_limit_bytes>
# Workqueue saturation
workqueue_depth / 1000
SLI/SLO Examples¶
# SLI: Reconciliation success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))
# SLO: 99.9% success rate
# Alert if below threshold for 5 minutes
# SLI: Reconciliation latency P95
histogram_quantile(0.95,
rate(controller_runtime_reconcile_time_seconds_bucket[5m])
)
# SLO: P95 < 10s
# Alert if above threshold for 5 minutes
Additional Resources¶
- Monitoring Guide - Setup and configuration
- Grafana Dashboard - Pre-built dashboard
- Prometheus Documentation
- controller-runtime Metrics