MonitoringServices

API Version: e6data.io/v1alpha2 Kind: MonitoringServices Short Names: ms


1. Purpose

MonitoringServices deploys Vector-based log and metrics collection for e6data workspaces. It provides:

  • Log Collection: Collects container logs from pods and stores them in S3
  • Metrics Collection: Scrapes Prometheus metrics from pods and stores them in S3
  • GreptimeDB Integration: Dual-writes logs and metrics to GreptimeDB for real-time queries
  • Namespace Filtering: Controls which namespaces and pods are monitored

Architecture

                    ┌─────────────────────────────────────────┐
                    │          MonitoringServices             │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐   │
│  Pod Logs    │────┼──│     Vector Logs DaemonSet       │   │
│  (stdout)    │    │  │  • kubernetes_logs source       │   │
└──────────────┘    │  │  • S3 sink (archival)           │──┼──▶ S3 Bucket
                    │  │  • GreptimeDB sink (real-time)  │──┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘   │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐   │
│  Pod Metrics │────┼──│   Vector Metrics DaemonSet      │   │
│  (/metrics)  │    │  │  • prometheus_scrape source     │   │
└──────────────┘    │  │  • S3 sink (archival)           │──┼──▶ S3 Bucket
                    │  │  • prometheus_remote_write      │──┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘   │
                    └─────────────────────────────────────────┘

2. High-level Behavior

When you create a MonitoringServices CR, the operator:

  1. Auto-detects cloud provider (AWS, GCP, Azure)
  2. Creates ServiceAccount and RBAC (if autoCreateRBAC enabled)
  3. Deploys Vector Logs DaemonSet for container log collection
  4. Deploys Vector Metrics DaemonSet for Prometheus metrics scraping
  5. Configures GreptimeDB integration (if greptimeRef specified)
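
A minimal MonitoringServices CR that triggers this flow is sketched below (bucket names and region are placeholders); complete examples appear in section 4.

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company
  vectorLogs:
    enabled: true
    s3Bucket: my-logs-bucket       # placeholder
    s3Region: us-east-1
  vectorMetrics:
    enabled: true
    s3Bucket: my-metrics-bucket    # placeholder
    s3Region: us-east-1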

Child Resources Created

| Resource Type | Name Pattern | Purpose |
|---|---|---|
| ServiceAccount | {name}-vector | Pod identity for S3 access |
| Role/ClusterRole | {name}-vector | RBAC for pod discovery |
| RoleBinding | {name}-vector | Binds role to service account |
| DaemonSet | {name}-vector-logs | Log collection on each node |
| DaemonSet | {name}-vector-metrics | Metrics collection on each node |
| ConfigMap | {name}-vector-logs-config | Vector logs configuration |
| ConfigMap | {name}-vector-metrics-config | Vector metrics configuration |

3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| workspace | string | Yes | - | Workspace name (used for labels and ServiceAccount lookup) |
| tenant | string | Yes | - | Tenant identifier |
| cloud | string | No | Auto-detected | Cloud provider (AWS/GCP/AZURE) |
| imageRepository | string | No | timberio/vector | Vector image repository |
| serviceAccount | string | No | Workspace name | ServiceAccount name |
| autoCreateRBAC | bool | No | true | Auto-create ServiceAccount and RBAC |
| useClusterRole | bool | No | false | Use ClusterRole (all namespaces) instead of Role |
| imagePullSecrets | []string | No | [] | Registry pull secrets |
| greptimeRef | GreptimeDBRef | No | - | GreptimeDB integration |
| vectorLogs | VectorLogsSpec | No | - | Log collection config |
| vectorMetrics | VectorMetricsSpec | No | - | Metrics collection config |
| tolerations | []Toleration | No | Auto-populated | Pod tolerations |
| nodeSelector | map[string]string | No | {} | Node selection |
| affinity | Affinity | No | - | Affinity rules |
| karpenterNodePool | string | No | - | Karpenter NodePool name |
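
The scheduling fields (tolerations, nodeSelector, affinity, karpenterNodePool) take standard Kubernetes scheduling primitives. A minimal sketch, assuming a dedicated monitoring node pool (the label, taint key, and NodePool name below are illustrative assumptions, not defaults):

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company
  # Scheduling controls (values below are illustrative assumptions)
  nodeSelector:
    kubernetes.io/os: linux
  tolerations:
    - key: dedicated
      operator: Equal
      value: monitoring
      effect: NoSchedule
  karpenterNodePool: monitoring-pool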

3.2 RBAC Configuration

| Setting | Scope | Effect |
|---|---|---|
| useClusterRole: false | Namespace-scoped | Vector can only discover pods in its own namespace |
| useClusterRole: true | Cluster-wide | Vector can discover pods across all namespaces |

For multi-namespace monitoring, use useClusterRole: true.

3.3 VectorLogs

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable log collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for logs |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-logs-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (seconds) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |

3.4 VectorMetrics

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable metrics collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for metrics |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-metrics-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| scrapeInterval | int32 | No | 30 | Prometheus scrape interval (seconds) |
| scrapeTimeout | int32 | No | 10 | Prometheus scrape timeout (seconds) |
| prometheusPodAnnotation | string | No | prometheus.io/scrape | Annotation to identify scrape targets |
| prometheusPortAnnotation | string | No | prometheus.io/port | Annotation for metrics port |
| prometheusPathAnnotation | string | No | prometheus.io/path | Annotation for metrics path |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (seconds) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |
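
With the default annotation settings above, the Vector metrics DaemonSet treats a pod like the following as a scrape target (the port and path values are examples, not requirements):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"    # marks the pod as a scrape target
    prometheus.io/port: "9090"      # port serving metrics (example)
    prometheus.io/path: "/metrics"  # metrics path (example)
spec:
  containers:
    - name: app
      image: my-app:latest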

3.5 GreptimeDBRef

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | GreptimeDBCluster CR name |
| namespace | string | No | Same namespace | GreptimeDBCluster namespace |
| database | string | No | public | Database name in GreptimeDB |
| logsTable | string | No | logs | Table for logs |
| metricsEnabled | bool | No | true | Send metrics to GreptimeDB |
| logsEnabled | bool | No | true | Send logs to GreptimeDB |

4. Example Manifests

4.1 Basic AWS Setup (IRSA)

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Logs collection
  vectorLogs:
    enabled: true
    s3Bucket: my-logs-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-logs"
    resources:
      cpu: "200m"
      memory: "256Mi"

  # Metrics collection
  vectorMetrics:
    enabled: true
    s3Bucket: my-metrics-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-metrics"
    resources:
      cpu: "200m"
      memory: "256Mi"

4.2 S3-Compatible Storage (Linode/DigitalOcean)

For non-AWS S3-compatible storage, you need to provide the endpoint and credentials via environment variables:

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Reference SA with S3 access credentials
  serviceAccount: e6data-sa  # Must have S3 credentials mounted

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "logs"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "metrics"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"

Alternative: Use Kubernetes Secret

# First create the secret
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  namespace: workspace-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
  AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"
---
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    # Reference credentials from the Secret via env vars
    environmentVariables:
      AWS_ACCESS_KEY_ID:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_SECRET_ACCESS_KEY

4.3 With GreptimeDB Integration

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # GreptimeDB for real-time queries
  greptimeRef:
    name: greptime-prod
    namespace: greptime-system
    database: analytics
    logsTable: query_logs
    metricsEnabled: true
    logsEnabled: true

  vectorLogs:
    enabled: true
    s3Bucket: logs-archive
    s3Region: us-east-1

  vectorMetrics:
    enabled: true
    s3Bucket: metrics-archive
    s3Region: us-east-1

4.4 Namespace-Scoped Collection

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Only collect from this namespace
  useClusterRole: false

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod
    excludePodLabels:
      app: debug  # Don't collect debug pod logs

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod

4.5 Cross-Namespace Collection

apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: e6-monitoring
spec:
  workspace: central-monitoring
  tenant: my-company

  # Collect from multiple namespaces
  useClusterRole: true

  vectorLogs:
    enabled: true
    s3Bucket: central-logs
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod
    excludeNamespaces:
      - kube-system

  vectorMetrics:
    enabled: true
    s3Bucket: central-metrics
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status |
| ready | bool | Overall readiness |
| vectorLogsStatus | ComponentStatus | Logs DaemonSet status |
| vectorMetricsStatus | ComponentStatus | Metrics DaemonSet status |
| greptimedbStatus | GreptimeDBIntegrationStatus | GreptimeDB connection status |
| conditions | []Condition | Detailed conditions |

5.2 Phase Values

| Phase | Description |
|---|---|
| Pending | Initial state; setup starting |
| Creating | Creating child resources |
| Running | All components healthy |
| Degraded | Some components unhealthy |
| Failed | Setup failed |

5.3 Component Status

status:
  phase: Running
  ready: true
  vectorLogsStatus:
    ready: true
    replicas: 3      # Nodes in cluster
    readyReplicas: 3
    message: "All nodes collecting logs"
  vectorMetricsStatus:
    ready: true
    replicas: 3
    readyReplicas: 3
    message: "All nodes scraping metrics"
  greptimedbStatus:
    discovered: true
    clusterName: greptime-prod
    clusterNamespace: greptime-system
    pipelineReady: true
    endpoint: "greptime-prod-frontend.greptime-system.svc:4000"
    database: analytics
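
For comparison, a Degraded phase is reported when some components are unhealthy; an illustrative (not verbatim) example:

status:
  phase: Degraded
  ready: false
  message: "vector-logs DaemonSet not fully ready"  # illustrative message
  vectorLogsStatus:
    ready: false
    replicas: 3
    readyReplicas: 2
    message: "2/3 pods ready"  # illustrative
  vectorMetricsStatus:
    ready: true
    replicas: 3
    readyReplicas: 3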

6. RBAC Requirements

6.1 Operator RBAC

The operator needs these permissions to manage MonitoringServices:

# e6data.io CRD permissions
- apiGroups: ["e6data.io"]
  resources: ["monitoringservices", "monitoringservices/status", "monitoringservices/finalizers"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# DaemonSet management
- apiGroups: ["apps"]
  resources: ["daemonsets"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# Core resources
- apiGroups: [""]
  resources: ["services", "configmaps", "secrets", "serviceaccounts"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# RBAC management (for auto-created roles)
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

6.2 Vector DaemonSet RBAC

Vector needs these permissions for pod discovery and log/metrics collection:

Namespace-Scoped (useClusterRole: false):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]

Cluster-Wide (useClusterRole: true):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]

7. Dependencies

| CRD | Relationship |
|---|---|
| GreptimeDBCluster | Optional target for real-time data |

Integration with QueryService

For query history collection, configure QueryService with queryHistory instead of MonitoringServices:

apiVersion: e6data.io/v1alpha1
kind: QueryService
spec:
  queryHistory:
    enabled: true
    s3Prefix: "query-history"
    greptimeRef:
      name: greptime-prod
      namespace: greptime-system

8. Troubleshooting

8.1 Common Issues

Vector Pods Not Running

Symptoms:

$ kubectl get ds
NAME                      DESIRED   CURRENT   READY
monitoring-vector-logs    3         3         0

Checks:

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=monitoring

# Check pod events
kubectl describe pod monitoring-vector-logs-xxxxx

# Check for image pull errors
kubectl get events --field-selector reason=Failed

S3 Permission Errors

Symptoms: Vector logs show "Access Denied" or "NoCredentialProviders".

Checks:

# Verify ServiceAccount annotations (AWS IRSA)
kubectl get sa monitoring-vector -o yaml

# Verify IAM role trust policy (AWS)
aws iam get-role --role-name <role-name>

# Test S3 access from pod
kubectl exec -it monitoring-vector-logs-xxxxx -- aws s3 ls s3://bucket/

Metrics Not Being Scraped

Symptoms: No metrics data in S3 or GreptimeDB.

Checks:

# Verify pods have prometheus annotations
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations}{"\n"}{end}'

# Check Vector config
kubectl get cm monitoring-vector-metrics-config -o yaml

# Check Vector logs
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100

8.2 Useful Commands

# Get MonitoringServices status
kubectl get ms monitoring -o yaml

# List Vector DaemonSets
kubectl get ds -l app.kubernetes.io/instance=monitoring

# Check Vector logs for errors
kubectl logs -l app.kubernetes.io/name=vector-logs --tail=100 | grep -i error

# Check Vector metrics for errors
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100 | grep -i error

# Verify S3 output
aws s3 ls s3://bucket/vector-logs-v1/

# Check GreptimeDB connection
kubectl exec -it monitoring-vector-logs-xxxxx -- curl http://greptime-frontend:4000/health

9. Best Practices

9.1 Resource Configuration

| Workload | CPU | Memory |
|---|---|---|
| Light (< 10 pods) | 100m | 128Mi |
| Medium (10-50 pods) | 200m | 256Mi |
| Heavy (50+ pods) | 500m | 512Mi |
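
Applied to the spec, the medium tier from the table maps onto both collectors like this sketch:

spec:
  vectorLogs:
    resources:
      cpu: "200m"
      memory: "256Mi"
  vectorMetrics:
    resources:
      cpu: "200m"
      memory: "256Mi"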

9.2 Filtering Strategy

  1. Start narrow: Begin with specific namespace/pod filters
  2. Exclude noisy pods: Filter out debug, test, and system pods
  3. Use annotations: Leverage the prometheus.io/scrape: "true" annotation to mark metrics scrape targets

9.3 Storage Considerations

  1. Use compression: Always enable gzip or zstd compression
  2. Set appropriate batch sizes: Larger batches reduce S3 API calls
  3. Partition by hour: Balances file count vs. query performance
  4. Enable GreptimeDB: Route real-time queries to GreptimeDB and keep S3 as the long-term archive (see the sketch below)
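
A sketch combining these recommendations (the batch values are illustrative, not defaults; the greptimeRef name is taken from the earlier example):

spec:
  greptimeRef:
    name: greptime-prod        # real-time queries go to GreptimeDB
  vectorLogs:
    compression: gzip          # or zstd
    s3Partition: hour          # hourly partitioning
    batchMaxBytes: 52428800    # larger batches (50 MiB) reduce S3 API calls
    batchTimeoutSecs: 60       # illustrative value
  vectorMetrics:
    compression: gzip
    s3Partition: hour
    batchMaxBytes: 52428800
    batchTimeoutSecs: 60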