Metrics Reference¶
Complete reference of all metrics emitted by the E6 Operator.
Metrics Endpoint¶
URL: http://<operator-pod>:8080/metrics
Format: Prometheus text format
Access:
# Port-forward to metrics service
kubectl port-forward -n e6-operator-system \
svc/e6-operator-metrics-service 8080:8080
# Query all metrics
curl http://localhost:8080/metrics
# Query specific metric
curl http://localhost:8080/metrics | grep controller_runtime_reconcile_total
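Once Prometheus scrapes this endpoint, the same data is available via PromQL. Assuming a scrape job named e6-operator (the job label used in the query examples below), a quick check that the target is being scraped is:
# 1 = target is up and being scraped
up{job="e6-operator"}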
Metric Categories¶
- Controller Runtime Metrics - Reconciliation performance
- Workqueue Metrics - Queue depth and processing
- Process Metrics - CPU, memory, file descriptors
- Go Runtime Metrics - Goroutines, GC, memory
- Webhook Metrics - Admission webhook performance
- Leader Election Metrics - HA leader status
- Client Metrics - Kubernetes API client performance
Controller Runtime Metrics¶
controller_runtime_reconcile_total¶
Type: Counter
Description: Total number of reconciliations per controller.
Labels:
- controller - Controller name (e.g., "metadataservices")
- result - Reconciliation result: success, error, requeue, requeue_after
Usage:
# Total reconciliations
sum(controller_runtime_reconcile_total{controller="metadataservices"})
# Success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))
* 100
# Error rate
rate(controller_runtime_reconcile_total{result="error",controller="metadataservices"}[5m])
# Requeue rate (indicates transient issues)
rate(controller_runtime_reconcile_total{result="requeue",controller="metadataservices"}[5m])
What it tells you:
- High success rate (>99%) = healthy operator (see the example alert below)
- High error rate = configuration issues, API server problems, or bugs
- High requeue rate = resources waiting for conditions (e.g., deployments not ready)
- Sudden spike = mass resource updates or operator restart
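An example alert expression based on the 99% guidance above (the threshold is illustrative):
# Fire when the success rate drops below 99%
sum(rate(controller_runtime_reconcile_total{controller="metadataservices",result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total{controller="metadataservices"}[5m]))
< 0.99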
controller_runtime_reconcile_errors_total¶
Type: Counter
Description: Total number of reconciliation errors per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Error rate
rate(controller_runtime_reconcile_errors_total{controller="metadataservices"}[5m])
# Total errors in last hour
increase(controller_runtime_reconcile_errors_total{controller="metadataservices"}[1h])
# Alert on high error rate
rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
What it tells you:
- Errors > 0 = problems that need investigation
- Check the operator logs for error details
- Common causes: RBAC issues, API server timeouts, invalid resource specs
controller_runtime_reconcile_time_seconds¶
Type: Histogram
Description: Distribution of reconciliation duration in seconds.
Labels: - controller - Controller name (e.g., "metadataservices")
Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 30, 60
Usage:
# P50 latency
histogram_quantile(0.50,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# P95 latency
histogram_quantile(0.95,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# P99 latency
histogram_quantile(0.99,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
)
# Average latency
rate(controller_runtime_reconcile_time_seconds_sum{controller="metadataservices"}[5m])
/
rate(controller_runtime_reconcile_time_seconds_count{controller="metadataservices"}[5m])
What it tells you:
- P50 < 1s = good performance
- P95 < 5s = acceptable performance
- P99 > 30s = investigate slow reconciliations (see the example alert below)
- Slow reconciliations may indicate:
  - Complex resources with many dependencies
  - Slow Kubernetes API responses
  - Network issues
  - Heavy operator load
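An example alert expression for the P99 guidance above:
# Fire when P99 reconcile latency exceeds 30s
histogram_quantile(0.99,
rate(controller_runtime_reconcile_time_seconds_bucket{controller="metadataservices"}[5m])
) > 30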
controller_runtime_max_concurrent_reconciles¶
Type: Gauge
Description: Maximum number of concurrent reconciles allowed per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Current max concurrent reconciles
controller_runtime_max_concurrent_reconciles{controller="metadataservices"}
What it tells you:
- Default is typically 1
- Can be increased for higher throughput
- Compare to workqueue depth to determine whether more concurrency is needed (see the comparison query below)
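A rough capacity check, assuming the queue name matches the controller name as in the examples above:
# Pending items exceed the configured concurrency
workqueue_depth{name="metadataservices"}
> scalar(controller_runtime_max_concurrent_reconciles{controller="metadataservices"})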
controller_runtime_active_workers¶
Type: Gauge
Description: Number of currently active workers per controller.
Labels: - controller - Controller name (e.g., "metadataservices")
Usage:
# Current active workers
controller_runtime_active_workers{controller="metadataservices"}
# Worker utilization
controller_runtime_active_workers{controller="metadataservices"}
/
controller_runtime_max_concurrent_reconciles{controller="metadataservices"}
* 100
What it tells you:
- Active workers = reconciliations currently in progress
- High utilization + high workqueue depth = need more workers
- Low utilization + high workqueue depth = slow reconciliations
Workqueue Metrics¶
workqueue_adds_total¶
Type: Counter
Description: Total number of items added to workqueue.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Add rate
rate(workqueue_adds_total{name="metadataservices"}[5m])
# Total adds in last hour
increase(workqueue_adds_total{name="metadataservices"}[1h])
What it tells you:
- Sudden spike = many resources created/updated
- Steady high rate = operator processing many reconciliations
- Add rate should roughly match reconciliation rate
workqueue_depth¶
Type: Gauge
Description: Current depth of workqueue (pending items).
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Current depth
workqueue_depth{name="metadataservices"}
# Alert on high depth
workqueue_depth{name="metadataservices"} > 100
What it tells you:
- Depth = 0: Operator keeping up with changes
- Depth > 10: Backlog building up
- Depth > 100: Operator struggling to keep up
- High depth causes:
  - Slow reconciliations
  - High resource churn
  - Insufficient concurrency
  - Operator restart (clears queue)
workqueue_queue_duration_seconds¶
Type: Histogram
Description: Time items spend waiting in queue before processing.
Labels: - name - Queue name (e.g., "metadataservices")
Buckets: 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000
Usage:
# P95 queue wait time
histogram_quantile(0.95,
rate(workqueue_queue_duration_seconds_bucket{name="metadataservices"}[5m])
)
# P99 queue wait time
histogram_quantile(0.99,
rate(workqueue_queue_duration_seconds_bucket{name="metadataservices"}[5m])
)
What it tells you:
- Low wait time (< 1s): Operator responsive
- High wait time (> 10s): Backlog growing
- Very high wait time (> 60s): Resources delayed significantly
workqueue_work_duration_seconds¶
Type: Histogram
Description: Time taken to process an item from queue.
Labels: - name - Queue name (e.g., "metadataservices")
Buckets: Same as queue_duration_seconds
Usage:
# P95 processing time
histogram_quantile(0.95,
rate(workqueue_work_duration_seconds_bucket{name="metadataservices"}[5m])
)
What it tells you: - Similar to reconcile_time_seconds but measured at queue level - Includes both reconciliation and requeueing logic
workqueue_retries_total¶
Type: Counter
Description: Total number of retries handled by workqueue.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Retry rate
rate(workqueue_retries_total{name="metadataservices"}[5m])
# Total retries in last hour
increase(workqueue_retries_total{name="metadataservices"}[1h])
What it tells you:
- Retries happen when reconciliation returns an error or a requeue
- High retry rate indicates transient issues
- Normal for resources in transition (e.g., waiting for pods to start)
- Sustained high retry rate = persistent errors
workqueue_unfinished_work_seconds¶
Type: Gauge
Description: Time since oldest unprocessed item was added.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
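For example, using the queue name from above (the 300s threshold mirrors the guidance below):
# Age of the oldest pending item
workqueue_unfinished_work_seconds{name="metadataservices"}
# Alert on a significant backlog
workqueue_unfinished_work_seconds{name="metadataservices"} > 300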
What it tells you: - Age of oldest pending reconciliation - High value (> 300s) = significant backlog
workqueue_longest_running_processor_seconds¶
Type: Gauge
Description: Duration of longest running processor.
Labels: - name - Queue name (e.g., "metadataservices")
Usage:
# Longest running reconciliation
workqueue_longest_running_processor_seconds{name="metadataservices"}
What it tells you: - Identifies stuck reconciliations - Very high value (> 600s) = reconciliation may be stuck
Process Metrics¶
process_cpu_seconds_total¶
Type: Counter
Description: Total user and system CPU time spent in seconds.
Labels: None
Usage:
# CPU usage rate (cores)
rate(process_cpu_seconds_total{job="e6-operator"}[5m])
# CPU usage percentage (assuming 1 CPU limit)
rate(process_cpu_seconds_total{job="e6-operator"}[5m]) * 100
What it tells you:
- CPU consumption of the operator process
- High CPU = heavy reconciliation load or inefficient code
- Compare to CPU limits to check for throttling
process_resident_memory_bytes¶
Type: Gauge
Description: Resident memory size (RSS) in bytes.
Labels: None
Usage:
# Memory usage in MB
process_resident_memory_bytes{job="e6-operator"} / 1024 / 1024
# Memory usage in GB
process_resident_memory_bytes{job="e6-operator"} / 1024 / 1024 / 1024
What it tells you:
- Actual memory consumed by the operator process
- Steady growth = memory leak (see the growth check below)
- Sudden spike = large resource processing
- Compare to memory limits
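A rough growth check; the 6-hour window and positive-slope heuristic are illustrative, not a definitive leak detector:
# Sustained upward trend in resident memory may indicate a leak
deriv(process_resident_memory_bytes{job="e6-operator"}[6h]) > 0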
process_virtual_memory_bytes¶
Type: Gauge
Description: Virtual memory size in bytes.
Labels: None
Usage:
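For example:
# Virtual memory in GB
process_virtual_memory_bytes{job="e6-operator"} / 1024 / 1024 / 1024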
What it tells you:
- Total virtual address space (includes mapped files, libraries)
- Usually much larger than resident memory
- Less useful than RSS for monitoring
process_virtual_memory_max_bytes¶
Type: Gauge
Description: Maximum virtual memory available in bytes.
Labels: None
Usage:
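For example:
# Fraction of the virtual memory limit in use
process_virtual_memory_bytes{job="e6-operator"}
/
process_virtual_memory_max_bytes{job="e6-operator"}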
What it tells you: - System limit for virtual memory - Rarely hit in practice
process_open_fds¶
Type: Gauge
Description: Number of open file descriptors.
Labels: None
Usage:
# Current open FDs
process_open_fds{job="e6-operator"}
# Alert on high FD usage
process_open_fds{job="e6-operator"} > 1000
What it tells you:
- Includes files, sockets, pipes
- High count (> 1000) = file descriptor leak
- System limit typically 1024-65536
process_max_fds¶
Type: Gauge
Description: Maximum number of open file descriptors.
Labels: None
Usage:
# FD utilization percentage
process_open_fds{job="e6-operator"}
/
process_max_fds{job="e6-operator"}
* 100
What it tells you: - System limit (ulimit -n) - Approaching limit = risk of "too many open files" errors
process_start_time_seconds¶
Type: Gauge
Description: Unix timestamp when process started.
Labels: None
Usage:
# Process uptime in seconds
time() - process_start_time_seconds{job="e6-operator"}
# Process uptime in hours
(time() - process_start_time_seconds{job="e6-operator"}) / 3600
What it tells you: - When operator last restarted - Frequent restarts indicate crashes or OOM kills
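A simple recent-restart check; the 10-minute window is illustrative:
# Instances that started within the last 10 minutes
(time() - process_start_time_seconds{job="e6-operator"}) < 600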
Go Runtime Metrics¶
go_goroutines¶
Type: Gauge
Description: Number of goroutines currently running.
Labels: None
Usage:
# Current goroutine count
go_goroutines{job="e6-operator"}
# Alert on goroutine leak
go_goroutines{job="e6-operator"} > 1000
What it tells you:
- Typical count: 50-200 for an idle operator
- High count (> 1000) = goroutine leak
- Steady growth = leak (investigate with pprof)
- Spikes during heavy reconciliation are normal
go_threads¶
Type: Gauge
Description: Number of OS threads created.
Labels: None
Usage:
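For example:
# Current OS thread count
go_threads{job="e6-operator"}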
What it tells you: - Typically matches number of CPU cores (GOMAXPROCS) - Much lower than goroutine count
go_gc_duration_seconds¶
Type: Summary
Description: Distribution of GC pause durations.
Labels: - quantile - 0, 0.25, 0.5, 0.75, 1.0
Usage:
# Median GC pause
go_gc_duration_seconds{quantile="0.5",job="e6-operator"}
# Maximum GC pause
go_gc_duration_seconds{quantile="1.0",job="e6-operator"}
What it tells you:
- GC pause = stop-the-world time
- High pauses (> 100ms) = memory pressure
- Frequent long pauses = operator unresponsive
go_memstats_alloc_bytes¶
Type: Gauge
Description: Bytes of allocated heap objects.
Labels: None
Usage:
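For example:
# Currently allocated heap memory in MB
go_memstats_alloc_bytes{job="e6-operator"} / 1024 / 1024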
What it tells you: - Currently allocated memory on heap - Includes both reachable and unreachable (garbage) objects
go_memstats_heap_alloc_bytes¶
Type: Gauge
Description: Bytes of allocated heap objects (same as alloc_bytes).
Labels: None
go_memstats_heap_inuse_bytes¶
Type: Gauge
Description: Bytes in in-use spans.
Labels: None
Usage:
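For example:
# In-use heap memory in MB
go_memstats_heap_inuse_bytes{job="e6-operator"} / 1024 / 1024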
What it tells you: - Memory actually in use by application - Lower than sys_bytes (which includes free memory held by Go)
go_memstats_heap_sys_bytes¶
Type: Gauge
Description: Bytes of heap memory obtained from OS.
Labels: None
Usage:
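For example:
# Heap memory obtained from the OS, in MB
go_memstats_heap_sys_bytes{job="e6-operator"} / 1024 / 1024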
What it tells you: - Total heap memory requested from OS - Go may not return this to OS immediately after GC
go_memstats_gc_cpu_fraction¶
Type: Gauge
Description: Fraction of CPU time used by GC.
Labels: None
Usage:
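For example (the 10% threshold mirrors the guidance below):
# Fraction of CPU spent in GC
go_memstats_gc_cpu_fraction{job="e6-operator"}
# Alert on heavy GC load
go_memstats_gc_cpu_fraction{job="e6-operator"} > 0.1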
What it tells you: - Percentage of CPU spent on garbage collection - High value (> 10%) = memory pressure, frequent GC
Webhook Metrics¶
controller_runtime_webhook_requests_total¶
Type: Counter
Description: Total number of admission webhook requests received.
Labels:
- webhook - Webhook path (e.g., "/validate-e6data-io-v1alpha1-metadataservices")
- verb - HTTP verb (POST)
- code - HTTP response code (200, 403, 500)
Usage:
# Request rate
rate(controller_runtime_webhook_requests_total[5m])
# Success rate
sum(rate(controller_runtime_webhook_requests_total{code="200"}[5m]))
/
sum(rate(controller_runtime_webhook_requests_total[5m]))
* 100
# Rejection rate (validation failures)
rate(controller_runtime_webhook_requests_total{code="403"}[5m])
What it tells you: - High rejection rate = users applying invalid resources - 500 errors = webhook crashes or timeouts
controller_runtime_webhook_latency_seconds¶
Type: Histogram
Description: Distribution of webhook request latency.
Labels: - webhook - Webhook path
Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
Usage:
# P95 webhook latency
histogram_quantile(0.95,
rate(controller_runtime_webhook_latency_seconds_bucket[5m])
)
# P99 webhook latency
histogram_quantile(0.99,
rate(controller_runtime_webhook_latency_seconds_bucket[5m])
)
What it tells you:
- Latency adds to kubectl apply time
- High latency (> 1s) = user experience impact
- Very high latency (> 10s) = risk of timeout
Leader Election Metrics¶
leader_election_master_status¶
Type: Gauge
Description: Whether this instance is the leader (1 = leader, 0 = follower).
Labels: - name - Leader election name (e.g., "e6-operator-leader-election")
Usage:
# Current leader count (should be 1)
sum(leader_election_master_status)
# Which pod is leader
leader_election_master_status{name="e6-operator-leader-election"} == 1
What it tells you:
- Only one instance should have value 1
- Value 0 = standby replica
- No instances with value 1 = leader election in progress or failed (see the alert example below)
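An example alert expression, using the leader election name from above:
# Fire when there is not exactly one leader
sum(leader_election_master_status{name="e6-operator-leader-election"}) != 1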
Client Metrics¶
rest_client_requests_total¶
Type: Counter
Description: Total number of HTTP requests to Kubernetes API.
Labels:
- code - HTTP response code
- host - API server host
- method - HTTP method (GET, POST, PUT, PATCH, DELETE)
Usage:
# Request rate to API server
rate(rest_client_requests_total[5m])
# Error rate (5xx responses)
rate(rest_client_requests_total{code=~"5.."}[5m])
What it tells you:
- High request rate = operator is chatty with the API server
- 429 errors = rate limiting (too many requests; see the query below)
- 5xx errors = API server issues
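For example, to watch for the rate limiting mentioned above:
# Rate-limited requests (HTTP 429)
rate(rest_client_requests_total{code="429"}[5m])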
rest_client_request_duration_seconds¶
Type: Histogram
Description: Duration of requests to Kubernetes API.
Labels:
- verb - HTTP method
- url - Request URL path
Buckets: 0.001, 0.002, 0.004, 0.008, 0.016, 0.032, 0.064, 0.128, 0.256, 0.512, 1.024, 2.048, 4.096, 8.192, 16.384, 32.768
Usage:
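For example:
# P95 API request latency per verb
histogram_quantile(0.95,
sum(rate(rest_client_request_duration_seconds_bucket[5m])) by (le, verb)
)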
What it tells you: - Slow API responses = performance bottleneck - High latency = overloaded API server or network issues
Useful Metric Combinations¶
Operator Health Score¶
# Composite health score (0-100)
# Each term is aggregated with sum() so all operands are single, label-free series and the arithmetic matches
(
# Success rate (40% weight)
(sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))) * 40
+
# Low error rate (30% weight)
(1 - clamp_max(sum(rate(controller_runtime_reconcile_errors_total[5m])) / 0.1, 1)) * 30
+
# Low latency (20% weight)
(1 - clamp_max(
histogram_quantile(0.95,
sum(rate(controller_runtime_reconcile_time_seconds_bucket[5m])) by (le)
) / 30, 1)) * 20
+
# Low queue depth (10% weight)
(1 - clamp_max(sum(workqueue_depth) / 100, 1)) * 10
)
Resource Saturation¶
# CPU saturation
rate(process_cpu_seconds_total[5m]) / <cpu_limit>
# Memory saturation
process_resident_memory_bytes / <memory_limit_bytes>
# Workqueue saturation
workqueue_depth / 1000
SLI/SLO Examples¶
# SLI: Reconciliation success rate
sum(rate(controller_runtime_reconcile_total{result="success"}[5m]))
/
sum(rate(controller_runtime_reconcile_total[5m]))
# SLO: 99.9% success rate
# Alert if below threshold for 5 minutes
# SLI: Reconciliation latency P95
histogram_quantile(0.95,
rate(controller_runtime_reconcile_time_seconds_bucket[5m])
)
# SLO: P95 < 10s
# Alert if above threshold for 5 minutes
Additional Resources¶
- Monitoring Guide - Setup and configuration
- Grafana Dashboard - Pre-built dashboard
- Prometheus Documentation
- controller-runtime Metrics