MonitoringServices¶
API Version: e6data.io/v1alpha2 Kind: MonitoringServices Short Names: ms
1. Purpose¶
MonitoringServices deploys Vector-based log and metrics collection for e6data workspaces. It provides:
- Log Collection: Collects container logs from pods and stores them in S3
- Metrics Collection: Scrapes Prometheus metrics from pods and stores them in S3
- GreptimeDB Integration: Dual-write to GreptimeDB for real-time queries
- Namespace Filtering: Control which namespaces/pods are monitored
Architecture¶
```
                    ┌─────────────────────────────────────────┐
                    │           MonitoringServices            │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐    │
│  Pod Logs    │────┼──│ Vector Logs DaemonSet           │    │
│  (stdout)    │    │  │ • kubernetes_logs source        │    │
└──────────────┘    │  │ • S3 sink (archival)            │────┼──▶ S3 Bucket
                    │  │ • GreptimeDB sink (real-time)   │────┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘    │
                    │                                         │
┌──────────────┐    │  ┌─────────────────────────────────┐    │
│ Pod Metrics  │────┼──│ Vector Metrics DaemonSet        │    │
│ (/metrics)   │    │  │ • prometheus_scrape source      │    │
└──────────────┘    │  │ • S3 sink (archival)            │────┼──▶ S3 Bucket
                    │  │ • prometheus_remote_write       │────┼──▶ GreptimeDB
                    │  └─────────────────────────────────┘    │
                    └─────────────────────────────────────────┘
```
2. High-level Behavior¶
When you create a MonitoringServices CR, the operator:
- Auto-detects cloud provider (AWS, GCP, Azure)
- Creates ServiceAccount and RBAC (if autoCreateRBAC enabled)
- Deploys Vector Logs DaemonSet for container log collection
- Deploys Vector Metrics DaemonSet for Prometheus metrics scraping
- Configures GreptimeDB integration (if greptimeRef specified)
Child Resources Created¶
| Resource Type | Name Pattern | Purpose |
|---|---|---|
| ServiceAccount | {name}-vector | Pod identity for S3 access |
| Role/ClusterRole | {name}-vector | RBAC for pod discovery |
| RoleBinding | {name}-vector | Binds role to service account |
| DaemonSet | {name}-vector-logs | Log collection on each node |
| DaemonSet | {name}-vector-metrics | Metrics collection on each node |
| ConfigMap | {name}-vector-logs-config | Vector logs configuration |
| ConfigMap | {name}-vector-metrics-config | Vector metrics configuration |
3. Spec Reference¶
3.1 Top-level Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| workspace | string | Yes | - | Workspace name (for labels, SA lookup) |
| tenant | string | Yes | - | Tenant identifier |
| cloud | string | No | Auto-detected | Cloud provider (AWS/GCP/AZURE) |
| imageRepository | string | No | timberio/vector | Vector image repository |
| serviceAccount | string | No | Workspace name | ServiceAccount name |
| autoCreateRBAC | bool | No | true | Auto-create SA and RBAC |
| useClusterRole | bool | No | false | Use ClusterRole (all namespaces) vs Role |
| imagePullSecrets | []string | No | [] | Registry pull secrets |
| greptimeRef | GreptimeDBRef | No | - | GreptimeDB integration |
| vectorLogs | VectorLogsSpec | No | - | Log collection config |
| vectorMetrics | VectorMetricsSpec | No | - | Metrics collection config |
| tolerations | []Toleration | No | Auto-populated | Pod tolerations |
| nodeSelector | map[string]string | No | {} | Node selection |
| affinity | Affinity | No | - | Affinity rules |
| karpenterNodePool | string | No | - | Karpenter NodePool name |
3.2 RBAC Configuration¶
| Setting | Scope | Effect |
|---|---|---|
| useClusterRole: false | Namespace-scoped | Vector can only discover pods in its own namespace |
| useClusterRole: true | Cluster-wide | Vector can discover pods across all namespaces |
For multi-namespace monitoring, use `useClusterRole: true`.
3.3 VectorLogs¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable log collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for logs |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-logs-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (secs) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |
3.4 VectorMetrics¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| enabled | bool | No | true | Enable metrics collection |
| image | ImageSpec | No | - | Image override |
| resources | ResourceSpec | No | - | CPU/Memory |
| s3Bucket | string | Yes | - | S3 bucket for metrics |
| s3Region | string | Yes | - | S3 bucket region |
| s3Prefix | string | No | vector-metrics-v1 | Key prefix in bucket |
| s3Endpoint | string | No | - | Custom S3 endpoint |
| s3Partition | string | No | hour | Time partitioning |
| scrapeInterval | int32 | No | 30 | Prometheus scrape interval (secs) |
| scrapeTimeout | int32 | No | 10 | Prometheus scrape timeout (secs) |
| prometheusPodAnnotation | string | No | prometheus.io/scrape | Annotation to identify scrape targets |
| prometheusPortAnnotation | string | No | prometheus.io/port | Annotation for metrics port |
| prometheusPathAnnotation | string | No | prometheus.io/path | Annotation for metrics path |
| batchMaxBytes | int64 | No | 10485760 | Max batch size (bytes) |
| batchTimeoutSecs | int32 | No | 30 | Max batch timeout (secs) |
| compression | string | No | gzip | Compression format |
| encodingCodec | string | No | json | Output encoding |
| includeNamespaces | []string | No | All | Namespaces to include |
| excludeNamespaces | []string | No | [] | Namespaces to exclude |
| includePodLabels | map | No | All | Pod labels to include |
| excludePodLabels | map | No | {} | Pod labels to exclude |
| environmentVariables | map | No | {} | Container env vars |
| configVariables | map | No | {} | Vector config vars |
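For scrape targets to be discovered, pods must carry the annotations named above. A minimal sketch of what a target pod might look like (the pod name, image, port, and path values here are illustrative; the annotation keys are the defaults from the table):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-worker            # hypothetical pod name
  annotations:
    prometheus.io/scrape: "true"  # matches the default prometheusPodAnnotation
    prometheus.io/port: "9090"    # port Vector scrapes (illustrative)
    prometheus.io/path: "/metrics"  # metrics path (illustrative)
spec:
  containers:
    - name: worker
      image: example/worker:latest  # placeholder image
```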
3.5 GreptimeDBRef¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | GreptimeDBCluster CR name |
| namespace | string | No | Same namespace | GreptimeDBCluster namespace |
| database | string | No | public | Database name in GreptimeDB |
| logsTable | string | No | logs | Table for logs |
| metricsEnabled | bool | No | true | Send metrics to GreptimeDB |
| logsEnabled | bool | No | true | Send logs to GreptimeDB |
4. Example Manifests¶
4.1 Basic AWS Setup (IRSA)¶
```yaml
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Logs collection
  vectorLogs:
    enabled: true
    s3Bucket: my-logs-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-logs"
    resources:
      cpu: "200m"
      memory: "256Mi"

  # Metrics collection
  vectorMetrics:
    enabled: true
    s3Bucket: my-metrics-bucket
    s3Region: us-east-1
    s3Prefix: "e6data-metrics"
    resources:
      cpu: "200m"
      memory: "256Mi"
```
4.2 S3-Compatible Storage (Linode/DigitalOcean)¶
For non-AWS S3-compatible storage, you need to provide the endpoint and credentials via environment variables:
```yaml
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Reference an SA with S3 access credentials
  serviceAccount: e6data-sa  # Must have S3 credentials mounted

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "logs"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Prefix: "metrics"
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    environmentVariables:
      AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
      AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"
```
Alternative: reference the credentials from a Kubernetes Secret instead of embedding them in the spec:
```yaml
# First create the secret
apiVersion: v1
kind: Secret
metadata:
  name: s3-credentials
  namespace: workspace-prod
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "YOUR_ACCESS_KEY"
  AWS_SECRET_ACCESS_KEY: "YOUR_SECRET_KEY"
---
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company
  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    s3Endpoint: "https://us-east-1.linodeobjects.com"
    # Reference the secret via env vars
    environmentVariables:
      AWS_ACCESS_KEY_ID:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_ACCESS_KEY_ID
      AWS_SECRET_ACCESS_KEY:
        valueFrom:
          secretKeyRef:
            name: s3-credentials
            key: AWS_SECRET_ACCESS_KEY
```
4.3 With GreptimeDB Integration¶
```yaml
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # GreptimeDB for real-time queries
  greptimeRef:
    name: greptime-prod
    namespace: greptime-system
    database: analytics
    logsTable: query_logs
    metricsEnabled: true
    logsEnabled: true

  vectorLogs:
    enabled: true
    s3Bucket: logs-archive
    s3Region: us-east-1

  vectorMetrics:
    enabled: true
    s3Bucket: metrics-archive
    s3Region: us-east-1
```
4.4 Namespace-Scoped Collection¶
```yaml
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: workspace-prod
spec:
  workspace: analytics-prod
  tenant: my-company

  # Only collect from this namespace
  useClusterRole: false

  vectorLogs:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod
    excludePodLabels:
      app: debug  # Don't collect debug pod logs

  vectorMetrics:
    enabled: true
    s3Bucket: my-bucket
    s3Region: us-east-1
    includeNamespaces:
      - workspace-prod
```
4.5 Cross-Namespace Collection¶
```yaml
apiVersion: e6data.io/v1alpha2
kind: MonitoringServices
metadata:
  name: monitoring
  namespace: e6-monitoring
spec:
  workspace: central-monitoring
  tenant: my-company

  # Collect from multiple namespaces
  useClusterRole: true

  vectorLogs:
    enabled: true
    s3Bucket: central-logs
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod
    excludeNamespaces:
      - kube-system

  vectorMetrics:
    enabled: true
    s3Bucket: central-metrics
    s3Region: us-east-1
    includeNamespaces:
      - workspace-dev
      - workspace-staging
      - workspace-prod
```
5. Status & Lifecycle¶
5.1 Status Fields¶
| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status |
| ready | bool | Overall readiness |
| vectorLogsStatus | ComponentStatus | Logs DaemonSet status |
| vectorMetricsStatus | ComponentStatus | Metrics DaemonSet status |
| greptimedbStatus | GreptimeDBIntegrationStatus | GreptimeDB connection status |
| conditions | []Condition | Detailed conditions |
5.2 Phase Values¶
| Phase | Description |
|---|---|
| Pending | Initial state, setup starting |
| Creating | Creating child resources |
| Running | All components healthy |
| Degraded | Some components unhealthy |
| Failed | Setup failed |
5.3 Component Status¶
```yaml
status:
  phase: Running
  ready: true
  vectorLogsStatus:
    ready: true
    replicas: 3  # Nodes in cluster
    readyReplicas: 3
    message: "All nodes collecting logs"
  vectorMetricsStatus:
    ready: true
    replicas: 3
    readyReplicas: 3
    message: "All nodes scraping metrics"
  greptimedbStatus:
    discovered: true
    clusterName: greptime-prod
    clusterNamespace: greptime-system
    pipelineReady: true
    endpoint: "greptime-prod-frontend.greptime-system.svc:4000"
    database: analytics
```
6. RBAC Requirements¶
6.1 Operator RBAC¶
The operator needs these permissions to manage MonitoringServices:
```yaml
# e6data.io CRD permissions
- apiGroups: ["e6data.io"]
  resources: ["monitoringservices", "monitoringservices/status", "monitoringservices/finalizers"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# DaemonSet management
- apiGroups: ["apps"]
  resources: ["daemonsets"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# Core resources
- apiGroups: [""]
  resources: ["services", "configmaps", "secrets", "serviceaccounts"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]

# RBAC management (for auto-created roles)
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
```
6.2 Vector DaemonSet RBAC¶
Vector needs these permissions for pod discovery and log/metrics collection:
Namespace-Scoped (`useClusterRole: false`):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]
```
Cluster-Wide (`useClusterRole: true`):
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
rules:
  - apiGroups: [""]
    resources: ["namespaces", "pods", "nodes"]
    verbs: ["get", "list", "watch"]
```
7. Related Resources¶
Dependencies¶
| CRD | Relationship |
|---|---|
| GreptimeDBCluster | Optional target for real-time data |
Integration with QueryService¶
For query history collection, configure QueryService with queryHistory instead of MonitoringServices:
```yaml
apiVersion: e6data.io/v1alpha1
kind: QueryService
spec:
  queryHistory:
    enabled: true
    s3Prefix: "query-history"
    greptimeRef:
      name: greptime-prod
      namespace: greptime-system
```
8. Troubleshooting¶
8.1 Common Issues¶
Vector Pods Not Running¶
Symptoms: Vector DaemonSet pods are missing, stuck in Pending, or in CrashLoopBackOff.
Checks:
```bash
# Check pod status
kubectl get pods -l app.kubernetes.io/instance=monitoring

# Check pod events
kubectl describe pod monitoring-vector-logs-xxxxx

# Check for image pull errors
kubectl get events --field-selector reason=Failed
```
S3 Permission Errors¶
Symptoms: Vector logs show "Access Denied" or "NoCredentialProviders".
Checks:
```bash
# Verify ServiceAccount annotations (AWS IRSA)
kubectl get sa monitoring-vector -o yaml

# Verify IAM role trust policy (AWS)
aws iam get-role --role-name <role-name>

# Test S3 access from pod
kubectl exec -it monitoring-vector-logs-xxxxx -- aws s3 ls s3://bucket/
```
Metrics Not Being Scraped¶
Symptoms: No metrics data in S3 or GreptimeDB.
Checks:
```bash
# Verify pods have prometheus annotations
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name} {.metadata.annotations}{"\n"}{end}'

# Check Vector config
kubectl get cm monitoring-vector-metrics-config -o yaml

# Check Vector logs
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100
```
8.2 Useful Commands¶
```bash
# Get MonitoringServices status
kubectl get ms monitoring -o yaml

# List Vector DaemonSets
kubectl get ds -l app.kubernetes.io/instance=monitoring

# Check Vector logs for errors
kubectl logs -l app.kubernetes.io/name=vector-logs --tail=100 | grep -i error

# Check Vector metrics for errors
kubectl logs -l app.kubernetes.io/name=vector-metrics --tail=100 | grep -i error

# Verify S3 output
aws s3 ls s3://bucket/vector-logs-v1/

# Check GreptimeDB connection
kubectl exec -it monitoring-vector-logs-xxxxx -- curl http://greptime-frontend:4000/health
```
9. Best Practices¶
9.1 Resource Configuration¶
| Workload | CPU | Memory |
|---|---|---|
| Light (< 10 pods) | 100m | 128Mi |
| Medium (10-50 pods) | 200m | 256Mi |
| Heavy (50+ pods) | 500m | 512Mi |
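Applied to the spec, the heavy-workload row above translates to a `resources` block on each collector (a sketch; adjust to your observed usage):

```yaml
vectorLogs:
  resources:
    cpu: "500m"
    memory: "512Mi"
vectorMetrics:
  resources:
    cpu: "500m"
    memory: "512Mi"
```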
9.2 Filtering Strategy¶
- Start narrow: Begin with specific namespace/pod filters
- Exclude noisy pods: Filter out debug, test, and system pods
- Use annotations: Opt pods into scraping with the `prometheus.io/scrape: "true"` annotation for metrics
9.3 Storage Considerations¶
- Use compression: Always enable gzip or zstd compression
- Set appropriate batch sizes: Larger batches reduce S3 API calls
- Partition by hour: Balances file count vs. query performance
- Enable GreptimeDB: Use it for real-time queries, and keep S3 as the archival store
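These recommendations map onto the VectorLogs/VectorMetrics fields from section 3. A sketch (the numeric values are illustrative, not prescribed defaults):

```yaml
vectorLogs:
  compression: gzip        # default; reduces S3 storage and transfer
  batchMaxBytes: 52428800  # larger batches (here 50 MiB) mean fewer S3 PUT calls
  batchTimeoutSecs: 60     # flush at least once a minute regardless of batch size
  s3Partition: hour        # default; hourly prefixes balance object count vs. query granularity
```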