Troubleshooting Guide¶
This guide covers common issues and their solutions across all e6data operator CRDs.
Quick Diagnostics¶
Check Operator Health¶
# Operator pod status
kubectl get pods -n e6-operator-system
# Operator logs (last 100 lines)
kubectl logs -n e6-operator-system -l app=e6-operator --tail=100
# Operator logs with error filter
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i error
# Watch operator logs in real-time
kubectl logs -n e6-operator-system -l app=e6-operator -f
Check Resource Status¶
# All e6data resources in a namespace
kubectl get mds,qs,e6cat,pool,gov -n workspace-prod
# Detailed status for a resource
kubectl describe mds my-metadata -n workspace-prod
# Events for a resource
kubectl get events --field-selector involvedObject.name=my-metadata -n workspace-prod
# YAML output with full status
kubectl get mds my-metadata -n workspace-prod -o yaml
Common Issues by Symptom¶
Pods Not Starting¶
Symptom: Pods stuck in Pending¶
Possible Causes:
1. Insufficient cluster resources
2. NodeSelector/tolerations don't match nodes
3. PVC not binding (storage class issues)
4. Karpenter not provisioning nodes
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Check node availability
kubectl get nodes
kubectl describe node <node-name>
# Check PVCs
kubectl get pvc -n <namespace>
# Check Karpenter (if used)
kubectl get nodepools
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
Solutions:
# Scale up node pool (if manual)
# Or check Karpenter limits in NodePool
# Verify storage class exists
kubectl get sc
# Check resource quotas
kubectl describe resourcequota -n <namespace>
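If Karpenter manages the nodes, Pending pods are often a sign the NodePool has hit its resource limits. A quick check, assuming the Karpenter v1 NodePool API (spec.limits and status.resources):
# Configured NodePool limits vs. resources already provisioned
kubectl get nodepool <name> -o jsonpath='{.spec.limits}'
kubectl get nodepool <name> -o jsonpath='{.status.resources}'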
Symptom: Pods in CrashLoopBackOff¶
Possible Causes:
1. Invalid configuration (env vars, config maps)
2. Missing secrets or credentials
3. Application error on startup
4. Insufficient memory (OOMKilled)
Diagnosis:
# Get pod logs (current crash)
kubectl logs <pod-name> -n <namespace>
# Get previous crash logs
kubectl logs <pod-name> -n <namespace> --previous
# Check container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses}'
# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
Solutions:
# If OOMKilled, increase memory
kubectl patch mds my-metadata --type=merge -p '{"spec":{"storage":{"resources":{"memory":"16Gi"}}}}'
# Check ConfigMap content
kubectl get cm <configmap-name> -o yaml
# Verify secrets exist
kubectl get secrets -n <namespace>
Symptom: Pods in ImagePullBackOff¶
Possible Causes:
1. Image doesn't exist (wrong tag)
2. Private registry without credentials
3. Registry rate limiting
Diagnosis:
# Check pod events for image pull error
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"
# Verify image exists
docker manifest inspect <image>:<tag>
Solutions:
# Add image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
# Reference in CR
# spec.imagePullSecrets: [regcred]
# Check current pull secrets
kubectl get mds my-metadata -o jsonpath='{.spec.imagePullSecrets}'
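To attach the new secret without editing the full manifest, a merge patch along these lines should work; it assumes spec.imagePullSecrets is a plain list of secret names, as shown above:
# Hedged example: add the pull secret to the CR spec
kubectl patch mds my-metadata -n <namespace> --type=merge \
  -p '{"spec":{"imagePullSecrets":["regcred"]}}'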
CRD Phase Issues¶
MetadataServices stuck in Creating or Updating¶
Possible Causes:
1. Pods not becoming ready
2. Health probes failing
3. Storage backend not accessible
Diagnosis:
# Check deployment status
kubectl get deploy -l app.kubernetes.io/instance=my-metadata -n <namespace>
# Check pod readiness
kubectl get pods -l app.kubernetes.io/instance=my-metadata -n <namespace>
# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> --tail=50
Solutions:
# Verify storage backend access
kubectl exec -it <storage-pod> -n <namespace> -- aws s3 ls s3://bucket/
# Check health endpoint
kubectl port-forward svc/my-metadata-storage 8081:8081 -n <namespace>
curl http://localhost:8081/health
QueryService stuck in Waiting¶
Cause: No MetadataServices resource in the same namespace has become ready.
Diagnosis:
# Check MetadataServices status
kubectl get mds -n <namespace>
# Verify phase is Running
kubectl get mds -n <namespace> -o jsonpath='{.items[*].status.phase}'
Solution: Wait for the MetadataServices resource to reach the Running phase, or fix the MetadataServices issues first.
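If you prefer to block until the dependency is ready, kubectl wait can poll the phase directly (this assumes the phase is exposed at .status.phase, as used elsewhere in this guide):
# Wait up to 10 minutes for MetadataServices to reach Running
kubectl wait mds/<name> -n <namespace> \
  --for=jsonpath='{.status.phase}'=Running --timeout=10m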
E6Catalog stuck in Creating¶
Possible Causes:
1. Storage service not responding
2. Catalog source (Hive/Glue) unreachable
3. Network/firewall issues
Diagnosis:
# Check operation status
kubectl get e6cat <name> -o jsonpath='{.status.operationStatus}'
# Check which storage service is being used
kubectl get e6cat <name> -o jsonpath='{.status.activeStorageService}'
# Check storage service logs for catalog operations
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> | grep -i catalog
Solutions:
# Verify Hive connectivity
kubectl run -it --rm debug --image=busybox -- nc -zv <hive-host> 9083
# Verify Glue access
kubectl exec -it <storage-pod> -- aws glue get-databases --max-results 1
# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup <hostname>
Blue-Green Deployment Issues¶
Stuck in Deploying phase¶
Possible Causes:
1. New strategy pods not becoming ready
2. Health probes failing
3. Insufficient resources for new deployment
Diagnosis:
# Check both strategies
kubectl get deploy -l e6data.io/strategy=blue -n <namespace>
kubectl get deploy -l e6data.io/strategy=green -n <namespace>
# Check pending strategy pods
kubectl get pods -l e6data.io/strategy=<pending-strategy> -n <namespace>
# Check deployment status
kubectl get mds <name> -o jsonpath='{.status.deploymentPhase}'
kubectl get mds <name> -o jsonpath='{.status.pendingStrategy}'
Solutions:
# Check if it's a resource issue
kubectl describe pod -l e6data.io/strategy=<pending-strategy>
# Force rollback (if needed)
kubectl annotate mds <name> e6data.io/rollback-to=previous --overwrite
# Manual cleanup (last resort)
kubectl delete deploy -l e6data.io/strategy=<stuck-strategy>
Automatic rollback occurred¶
Cause: The new deployment failed health checks within the timeout window (2 minutes), triggering an automatic rollback.
Diagnosis:
# Check release history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory}' | jq
# Look for Failed status in history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory[?(@.status=="Failed")]}' | jq
# Check operator logs for rollback reason
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i rollback
Solutions:
1. Fix the underlying issue (image, config, resources)
2. Update the CR with the corrected values (see the example below)
3. The operator will attempt the deployment again
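For step 2, the usual flow is to edit the CR in place and then watch the new strategy roll out, reusing the status fields and labels from the diagnosis above:
# Apply the corrected values
kubectl edit mds <name> -n <namespace>
# Watch the new rollout progress
kubectl get mds <name> -o jsonpath='{.status.deploymentPhase}'
kubectl get pods -l e6data.io/strategy=<pending-strategy> -n <namespace> -w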
Autoscaling Issues¶
Executors not scaling¶
Possible Causes:
1. Autoscaling not enabled
2. Already at min/max limits
3. Operator API not receiving requests
Diagnosis:
# Check autoscaling config
kubectl get qs <name> -o jsonpath='{.spec.executor.autoscaling}'
# Check current vs target replicas
kubectl get qs <name> -o jsonpath='{.status.executorDeployment}'
# Check scaling history
kubectl get qs <name> -o jsonpath='{.status.scalingHistory}' | jq
# Check operator API endpoint
kubectl port-forward -n e6-operator-system svc/e6-operator 8082:8082
curl http://localhost:8082/health
Solutions:
# Enable autoscaling
kubectl patch qs <name> --type=merge -p '{
"spec":{"executor":{"autoscaling":{
"enabled":true,
"minExecutors":2,
"maxExecutors":20
}}}
}'
# Manual scale (for testing)
kubectl patch qs <name> --type=merge -p '{"spec":{"executor":{"replicas":10}}}'
Pool executors not starting¶
Possible Causes:
1. Pool not active
2. QueryService not in allowed list
3. Incompatible resources
Diagnosis:
# Check pool status
kubectl get pool <name> -o yaml
# Check if QS is attached
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices}' | jq
# Check compatibility
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq
# Check allocations
kubectl get pool <name> -o jsonpath='{.status.allocations}' | jq
Solutions:
# Verify QS has pool label (if using selector)
kubectl get qs <name> -o jsonpath='{.metadata.labels}'
# Add label if missing
kubectl label qs <name> e6data.io/pool=<pool-name>
# Or add to explicit allow list in Pool spec
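If the Pool uses an explicit allow list instead of a label selector, it can be patched in place. The field name below is a placeholder; confirm the real path in your Pool CRD first:
# Hypothetical example -- "allowedQueryServices" is illustrative only,
# check the actual field name with: kubectl explain pool.spec
kubectl patch pool <pool-name> --type=merge \
  -p '{"spec":{"allowedQueryServices":["<query-service-name>"]}}'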
Storage/Networking Issues¶
Can't access object storage (S3/GCS/Azure)¶
Possible Causes:
1. IAM permissions missing
2. Workload identity not configured
3. Wrong endpoint/region
4. Network policy blocking
Diagnosis:
# Check service account annotations (IRSA/Workload Identity)
kubectl get sa <sa-name> -n <namespace> -o yaml
# Test S3 access from pod
kubectl exec -it <pod> -n <namespace> -- aws s3 ls s3://bucket/
# Check for network policies
kubectl get networkpolicy -n <namespace>
Solutions:
# For AWS IRSA
kubectl annotate sa <sa-name> -n <namespace> \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/ROLE_NAME
# For GCP Workload Identity
kubectl annotate sa <sa-name> -n <namespace> \
  iam.gke.io/gcp-service-account=SA@PROJECT.iam.gserviceaccount.com
# Verify IAM policy allows s3:GetObject, s3:ListBucket, etc.
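On AWS, two standard checks help confirm the last point: that the pod actually assumes the intended role, and that the role carries the expected policies:
# Confirm the identity the pod assumes (IRSA)
kubectl exec -it <pod> -n <namespace> -- aws sts get-caller-identity
# List managed and inline policies attached to the role
aws iam list-attached-role-policies --role-name ROLE_NAME
aws iam list-role-policies --role-name ROLE_NAME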
DNS resolution failures¶
Symptoms: Services can't discover each other; connections time out.
Diagnosis:
# Test DNS from a pod
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Recovery Procedures¶
Force Re-reconciliation¶
# Add/update annotation to trigger reconcile
kubectl annotate mds <name> e6data.io/reconcile-trigger=$(date +%s) --overwrite
Clean Restart¶
# Delete all pods (deployments will recreate)
kubectl delete pods -l app.kubernetes.io/instance=<name> -n <namespace>
# Restart operator
kubectl rollout restart deployment e6-operator -n e6-operator-system
Complete Resource Reset¶
# Delete and recreate (WARNING: may cause downtime)
kubectl delete mds <name> -n <namespace>
kubectl apply -f metadata-services.yaml
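Before deleting, it is worth snapshotting the live spec so the recreate step starts from exactly what was running:
# Back up the current CR first
kubectl get mds <name> -n <namespace> -o yaml > mds-backup.yaml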
Manual Finalizer Removal (Stuck Deletion)¶
# Only if resource won't delete normally
kubectl patch mds <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
Logging and Debugging¶
Enable Debug Logging¶
# Operator debug logs (check operator deployment for env var)
kubectl set env deployment/e6-operator -n e6-operator-system LOG_LEVEL=debug
# Application debug logging
kubectl patch mds <name> --type=merge -p '{
"spec":{"storage":{"environmentVariables":{"E6_LOGGING_LEVEL":"E6_DEBUG"}}}
}'
Collect Diagnostic Bundle¶
# Collect all related resources
kubectl get mds,qs,e6cat,pool,gov,deploy,svc,cm,secret,pvc -n <namespace> -o yaml > diagnostic-bundle.yaml
# Add events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' >> diagnostic-bundle.yaml
# Add operator logs
kubectl logs -n e6-operator-system -l app=e6-operator --tail=1000 >> diagnostic-bundle.yaml
Getting Help¶
If you can't resolve an issue:
- Collect a diagnostic bundle (see above)
- Check GitHub Issues: https://github.com/e6data/e6-operator/issues
- Open a new issue with:
    - Kubernetes version
    - Operator version
    - Cloud provider
    - CR YAML (sanitized)
    - Diagnostic bundle
    - Steps to reproduce