MonitoringServices

API Version: e6data.io/v1alpha2 Kind: MonitoringServices Short Names: ms


1. Purpose

MonitoringServices deploys Vector-based log and metrics collection for e6data workspaces. It provides:

  • Log Collection: Collects container logs from pods and stores them in S3
  • Metrics Collection: Scrapes Prometheus metrics from pods and stores them in S3
  • GreptimeDB Integration: Dual-writes logs and metrics to GreptimeDB for real-time queries
  • Namespace Filtering: Controls which namespaces and pods are monitored

Architecture

                    ┌─────────────────────────────────────────┐
                    │          MonitoringServices             │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐   │
│  Pod Logs    │────┼──│     Vector Logs DaemonSet       │   │
│  (stdout)    │    │  │  • kubernetes_logs source       │   │
└──────────────┘    │  │  • S3 sink (archival)           │──┼──▶ S3 Bucket
                    │  │  • GreptimeDB sink (real-time)  │──┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘   │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐   │
│  Pod Metrics │────┼──│   Vector Metrics DaemonSet      │   │
│  (/metrics)  │    │  │  • prometheus_scrape source     │   │
└──────────────┘    │  │  • S3 sink (archival)           │──┼──▶ S3 Bucket
                    │  │  • prometheus_remote_write      │──┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘   │
                    └─────────────────────────────────────────┘

2. High-level Behavior

When you create a MonitoringServices CR, the operator:

  1. Auto-detects cloud provider (AWS, GCP, Azure)
  2. Creates ServiceAccount and RBAC (if autoCreateRBAC enabled)
  3. Deploys Vector Logs DaemonSet for container log collection
  4. Deploys Vector Metrics DaemonSet for Prometheus metrics scraping
  5. Configures GreptimeDB integration (if greptimeRef specified)
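
A minimal MonitoringServices CR that triggers this flow is sketched below (bucket names and region are placeholders); complete examples appear in section 4.

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company
  vectorLogs:
    enabled: true
    s3Bucket: my-logs-bucket       # placeholder
    s3Region: us-east-1
  vectorMetrics:
    enabled: true
    s3Bucket: my-metrics-bucket    # placeholder
    s3Region: us-east-1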

Child Resources Created

| Resource Type | Name Pattern | Purpose |
|---|---|---|
| ServiceAccount | {name}-vector | Pod identity for S3 access |
| Role/ClusterRole | {name}-vector | RBAC for pod discovery |
| RoleBinding | {name}-vector | Binds role to service account |
| DaemonSet | {name}-vector-logs | Log collection on each node |
| DaemonSet | {name}-vector-metrics | Metrics collection on each node |
| ConfigMap | {name}-vector-logs-config | Vector logs configuration |
| ConfigMap | {name}-vector-metrics-config | Vector metrics configuration |

3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| workspace | string | Yes | - | Workspace name (used for labels and ServiceAccount lookup) |
| tenant | string | Yes | - | Tenant identifier |
| cloud | string | No | Auto-detected | Cloud provider (AWS/GCP/AZURE) |
| imageRepository | string | No | timberio/vector | Vector image repository |
| serviceAccount | string | No | Workspace name | ServiceAccount name |
| autoCreateRBAC | bool | No | true | Auto-create ServiceAccount and RBAC |
| useClusterRole | bool | No | false | Use ClusterRole (all namespaces) instead of Role |
| imagePullSecrets | []string | No | [] | Registry pull secrets |
| greptimeRef | GreptimeDBRef | No | - | GreptimeDB integration |
| vectorLogs | VectorLogsSpec | No | - | Log collection config |
| vectorMetrics | VectorMetricsSpec | No | - | Metrics collection config |
| tolerations | []Toleration | No | Auto-populated | Pod tolerations |
| nodeSelector | map[string]string | No | {} | Node selection |
| affinity | Affinity | No | - | Affinity rules |
| karpenterNodePool | string | No | - | Karpenter NodePool name |
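
The scheduling fields (tolerations, nodeSelector, affinity, karpenterNodePool) take standard Kubernetes scheduling primitives. A minimal sketch, assuming a dedicated monitoring node pool (the label, taint key, and NodePool name below are illustrative assumptions, not defaults):

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company
  # Scheduling controls (values below are illustrative assumptions)
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  karpenterNodePool: monitoring-pool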

3.2 RBAC Configuration

| Setting | Scope | Effect |
|---|---|---|
| useClusterRole: false | Namespace-scoped | Vector can only discover pods in its own namespace |
| useClusterRole: true | Cluster-wide | Vector can discover pods across all namespaces |

For multi-namespace monitoring, use useClusterRole: true.

3.3 VectorLogs

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable log collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for logs |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-logs-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (seconds) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |

3.4 VectorMetrics

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable metrics collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for metrics |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-metrics-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| scrapeInterval | int32 | No | 30 | Prometheus scrape interval (seconds) |
| scrapeTimeout | int32 | No | 10 | Prometheus scrape timeout (seconds) |
| prometheusPodAnnotation | string | No | prometheus.io/scrape | Annotation to identify scrape targets |
| prometheusPortAnnotation | string | No | prometheus.io/port | Annotation for metrics port |
| prometheusPathAnnotation | string | No | prometheus.io/path | Annotation for metrics path |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (seconds) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |
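
With the default annotation settings above, the Vector metrics DaemonSet treats a pod like the following as a scrape target (the port and path values are examples, not requirements):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"    # marks the pod as a scrape target
    prometheus.io/port: "9090"      # port serving metrics (example)
    prometheus.io/path: "/metrics"  # metrics path (example)
spec:
  containers:
    - name: app
      image: my-app:latest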

3.5 GreptimeDBRef

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | GreptimeDBCluster CR name |
| namespace | string | No | Same namespace | GreptimeDBCluster namespace |
| database | string | No | public | Database name in GreptimeDB |
| logsTable | string | No | logs | Table for logs |
| metricsEnabled | bool | No | true | Send metrics to GreptimeDB |
| logsEnabled | bool | No | true | Send logs to GreptimeDB |

4. Example Manifests

4.1 Basic AWS Setup (IRSA)

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Logs collection
  vectorLogs:
    enabled: true
    s3Bucket: my-logs-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-logs"
    resources:
      cpu: "200m"
      memory: "256Mi"

  # Metrics collection
  vectorMetrics:
    enabled: true
    s3Bucket: my-metrics-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-metrics"
    resources:
      cpu: "200m"
      memory: "256Mi"

4.2 S3-Compatible Storage (Linode/DigitalOcean)

For non-AWS S3-compatible storage, you need to provide the endpoint and credentials via environment variables:

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Reference SA with S3 access credentials
  serviceAccount: e6data-sa  # Must have S3 credentials mounted

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "logs"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "metrics"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"

Alternative: Use Kubernetes Secret

# First create the secret
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  namespace: workspace-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
  AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"
---
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    # Reference credentials from the Secret via env vars
    environmentVariables:
      AWS_ACCESS_KEY_ID:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_SECRET_ACCESS_KEY

4.3 With GreptimeDB Integration

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # GreptimeDB for real-time queries
  greptimeRef:
    name: greptime-prod
    namespace: greptime-system
    database: analytics
    logsTable: query_logs
    metricsEnabled: true
    logsEnabled: true

  vectorLogs:
    enabled: true
    s3Bucket: logs-archive
    s3Region: us-east-1

  vectorMetrics:
    enabled: true
    s3Bucket: metrics-archive
    s3Region: us-east-1

4.4 Namespace-Scoped Collection

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Only collect from this namespace
  useClusterRole: false

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod
    excludePodLabels:
      app: debug  # Don't collect debug pod logs

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod

4.5 Cross-Namespace Collection

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: e6-monitoring
spec:
  workspace: central-monitoring
  tenant: my-company

  # Collect from multiple namespaces
  useClusterRole: true

  vectorLogs:
    enabled: true
    s3Bucket: central-logs
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod
    excludeNamespaces:
      - kube-system

  vectorMetrics:
    enabled: true
    s3Bucket: central-metrics
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status |
| ready | bool | Overall readiness |
| vectorLogsStatus | ComponentStatus | Logs DaemonSet status |
| vectorMetricsStatus | ComponentStatus | Metrics DaemonSet status |
| greptimedbStatus | GreptimeDBIntegrationStatus | GreptimeDB connection status |
| conditions | []Condition | Detailed conditions |

5.2 Phase Values

| Phase | Description |
|---|---|
| Pending | Initial state; setup starting |
| Creating | Creating child resources |
| Running | All components healthy |
| Degraded | Some components unhealthy |
| Failed | Setup failed |

5.3 Component Status

status:
  phase: Running
  ready: true
  vectorLogsStatus:
    ready: true
    replicas: 3      # Nodes in cluster
    readyReplicas: 3
    message: "All nodes collecting logs"
  vectorMetricsStatus:
    ready: true
    replicas: 3
    readyReplicas: 3
    message: "All nodes scraping metrics"
  greptimedbStatus:
    discovered: true
    clusterName: greptime-prod
    clusterNamespace: greptime-system
    pipelineReady: true
    endpoint: "greptime-prod-frontend.greptime-system.svc:4000"
    database: analytics
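
For comparison, a Degraded phase is reported when some components are unhealthy; an illustrative (not verbatim) example:

status:
  phase: Degraded
  ready: false
  message: "vector-logs DaemonSet not fully ready"  # illustrative message
  vectorLogsStatus:
    ready: false
    replicas: 3
    readyReplicas: 2
    message: "2/3 pods ready"  # illustrative
  vectorMetricsStatus:
    ready: true
    replicas: 3
    readyReplicas: 3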

6. RBAC Requirements

6.1 Operator RBAC

The operator needs these permissions to manage MonitoringServices:

# e6data.io CRD permissions
- apiGroups: ["e6data.io"]
  resources: ["monitoringservices", "monitoringservices/status", "monitoringservices/finalizers"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# DaemonSet management
- apiGroups: ["apps"]
  resources: ["daemonsets"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# Core resources
- apiGroups: [""]
  resources: ["services", "configmaps", "secrets", "serviceaccounts"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# RBAC management (for auto-created roles)
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

6.2 Vector DaemonSet RBAC

Vector needs these permissions for pod discovery and log/metrics collection:

Namespace-Scoped (useClusterRole: false):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]

Cluster-Wide (useClusterRole: true):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]

7. Dependencies

| CRD | Relationship |
|---|---|
| GreptimeDBCluster | Optional target for real-time data |

Integration with QueryService

For query history collection, configure QueryService with queryHistory instead of MonitoringServices:

apiVersion: e6data.io/v1alpha1
kind: QueryService
spec:
  queryHistory:
    enabled: true
    s3Prefix: "query-history"
    greptimeRef:
      name: greptime-prod
      namespace: greptime-system

8. Troubleshooting

8.1 Common Issues

Vector Pods Not Running

Symptoms:

$ kubectl get ds
NAME                      DESIRED   CURRENT   READY
monitoring-vector-logs    3         3         0

Checks:

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=monitoring

# Check pod events
kubectl describe pod monitoring-vector-logs-xxxxx

# Check for image pull errors
kubectl get events --field-selector reason=Failed

S3 Permission Errors

Symptoms: Vector logs show "Access Denied" or "NoCredentialProviders".

Checks:

# Verify ServiceAccount annotations (AWS IRSA)
kubectl get sa monitoring-vector -o yaml

# Verify IAM role trust policy (AWS)
aws iam get-role --role-name <role-name>

# Test S3 access from pod
kubectl exec -it monitoring-vector-logs-xxxxx -- aws s3 ls s3://bucket/

Metrics Not Being Scraped

Symptoms: No metrics data in S3 or GreptimeDB.

Checks:

# Verify pods have prometheus annotations
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations}{"\n"}{end}'

# Check Vector config
kubectl get cm monitoring-vector-metrics-config -o yaml

# Check Vector logs
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100

8.2 Useful Commands

# Get MonitoringServices status
kubectl get ms monitoring -o yaml

# List Vector DaemonSets
kubectl get ds -l app.kubernetes.io/instance=monitoring

# Check Vector logs for errors
kubectl logs -l app.kubernetes.io/name=vector-logs --tail=100 | grep -i error

# Check Vector metrics for errors
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100 | grep -i error

# Verify S3 output
aws s3 ls s3://bucket/vector-logs-v1/

# Check GreptimeDB connection
kubectl exec -it monitoring-vector-logs-xxxxx -- curl http://greptime-frontend:4000/health

9. Best Practices

9.1 Resource Configuration

| Workload | CPU | Memory |
|---|---|---|
| Light (< 10 pods) | 100m | 128Mi |
| Medium (10-50 pods) | 200m | 256Mi |
| Heavy (50+ pods) | 500m | 512Mi |
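
Applied to the spec, the medium tier from the table maps onto both collectors like this sketch:

spec:
  vectorLogs:
    resources:
      cpu: "200m"
      memory: "256Mi"
  vectorMetrics:
    resources:
      cpu: "200m"
      memory: "256Mi"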

9.2 Filtering Strategy

  1. Start narrow: Begin with specific namespace/pod filters
  2. Exclude noisy pods: Filter out debug, test, and system pods
  3. Use annotations: Leverage the prometheus.io/scrape: "true" annotation to mark metrics scrape targets

9.3 Storage Considerations

  1. Use compression: Always enable gzip or zstd compression
  2. Set appropriate batch sizes: Larger batches reduce S3 API calls
  3. Partition by hour: Balances file count vs. query performance
  4. Enable GreptimeDB: Route real-time queries to GreptimeDB and keep S3 as the long-term archive (see the sketch below)
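
A sketch combining these recommendations (the batch values are illustrative, not defaults; the greptimeRef name is taken from the earlier example):

spec:
  greptimeRef:
    name: greptime-prod        # real-time queries go to GreptimeDB
  vectorLogs:
    compression: gzip          # or zstd
    s3Partition: hour          # hourly partitioning
    batchMaxBytes: 52428800    # larger batches (50 MiB) reduce S3 API calls
    batchTimeoutSecs: 60       # illustrative value
  vectorMetrics:
    compression: gzip
    s3Partition: hour
    batchMaxBytes: 52428800
    batchTimeoutSecs: 60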