Troubleshooting Guide¶
This guide covers common issues and their solutions across all e6data operator CRDs.
Quick Diagnostics¶
Check Operator Health¶
# Operator pod status
kubectl get pods -n e6-operator-system
# Operator logs (last 100 lines)
kubectl logs -n e6-operator-system -l app=e6-operator --tail=100
# Operator logs with error filter
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i error
# Watch operator logs in real-time
kubectl logs -n e6-operator-system -l app=e6-operator -f
Check Resource Status¶
# All e6data resources in a namespace
kubectl get mds,qs,e6cat,pool,gov -n workspace-prod
# Detailed status for a resource
kubectl describe mds my-metadata -n workspace-prod
# Events for a resource
kubectl get events --field-selector involvedObject.name=my-metadata -n workspace-prod
# YAML output with full status
kubectl get mds my-metadata -n workspace-prod -o yaml
Common Issues by Symptom¶
Pods Not Starting¶
Symptom: Pods stuck in Pending¶
Possible Causes:
1. Insufficient cluster resources
2. NodeSelector/tolerations don't match nodes
3. PVC not binding (storage class issues)
4. Karpenter not provisioning nodes
Diagnosis:
# Check pod events
kubectl describe pod <pod-name> -n <namespace>
# Check node availability
kubectl get nodes
kubectl describe node <node-name>
# Check PVCs
kubectl get pvc -n <namespace>
# Check Karpenter (if used)
kubectl get nodepools
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
Solutions:
# Scale up node pool (if manual)
# Or check Karpenter limits in NodePool
# Verify storage class exists
kubectl get sc
# Check resource quotas
kubectl describe resourcequota -n <namespace>
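If Karpenter manages the nodes, Pending pods are often a sign the NodePool has hit its resource limits. A quick check, assuming the Karpenter v1 NodePool API (spec.limits and status.resources):
# Configured NodePool limits vs. resources already provisioned
kubectl get nodepool <name> -o jsonpath='{.spec.limits}'
kubectl get nodepool <name> -o jsonpath='{.status.resources}'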
Symptom: Pods in CrashLoopBackOff¶
Possible Causes:
1. Invalid configuration (env vars, config maps)
2. Missing secrets or credentials
3. Application error on startup
4. Insufficient memory (OOMKilled)
Diagnosis:
# Get pod logs (current crash)
kubectl logs <pod-name> -n <namespace>
# Get previous crash logs
kubectl logs <pod-name> -n <namespace> --previous
# Check container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses}'
# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
Solutions:
# If OOMKilled, increase memory
kubectl patch mds my-metadata --type=merge -p '{"spec":{"storage":{"resources":{"memory":"16Gi"}}}}'
# Check ConfigMap content
kubectl get cm <configmap-name> -o yaml
# Verify secrets exist
kubectl get secrets -n <namespace>
Symptom: Pods in ImagePullBackOff¶
Possible Causes:
1. Image doesn't exist (wrong tag)
2. Private registry without credentials
3. Registry rate limiting
Diagnosis:
# Check pod events for image pull error
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"
# Verify image exists
docker manifest inspect <image>:<tag>
Solutions:
# Add image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
# Reference in CR
# spec.imagePullSecrets: [regcred]
# Check current pull secrets
kubectl get mds my-metadata -o jsonpath='{.spec.imagePullSecrets}'
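To attach the new secret without editing the full manifest, a merge patch along these lines should work; it assumes spec.imagePullSecrets is a plain list of secret names, as shown above:
# Hedged example: add the pull secret to the CR spec
kubectl patch mds my-metadata -n <namespace> --type=merge \
  -p '{"spec":{"imagePullSecrets":["regcred"]}}'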
CRD Phase Issues¶
MetadataServices stuck in Creating or Updating¶
Possible Causes:
1. Pods not becoming ready
2. Health probes failing
3. Storage backend not accessible
Diagnosis:
# Check deployment status
kubectl get deploy -l app.kubernetes.io/instance=my-metadata -n <namespace>
# Check pod readiness
kubectl get pods -l app.kubernetes.io/instance=my-metadata -n <namespace>
# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> --tail=50
Solutions:
# Verify storage backend access
kubectl exec -it <storage-pod> -n <namespace> -- aws s3 ls s3://bucket/
# Check health endpoint
kubectl port-forward svc/my-metadata-storage 8081:8081 -n <namespace>
curl http://localhost:8081/health
QueryService stuck in Waiting¶
Cause: No MetadataServices resource in the same namespace has become ready.
Diagnosis:
# Check MetadataServices status
kubectl get mds -n <namespace>
# Verify phase is Running
kubectl get mds -n <namespace> -o jsonpath='{.items[*].status.phase}'
Solution: Wait for the MetadataServices resource to reach the Running phase, or fix the MetadataServices issues first.
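If you prefer to block until the dependency is ready, kubectl wait can poll the phase directly (this assumes the phase is exposed at .status.phase, as used elsewhere in this guide):
# Wait up to 10 minutes for MetadataServices to reach Running
kubectl wait mds/<name> -n <namespace> \
  --for=jsonpath='{.status.phase}'=Running --timeout=10m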
E6Catalog stuck in Creating¶
Possible Causes:
1. Storage service not responding
2. Catalog source (Hive/Glue) unreachable
3. Network/firewall issues
Diagnosis:
# Check operation status
kubectl get e6cat <name> -o jsonpath='{.status.operationStatus}'
# Check which storage service is being used
kubectl get e6cat <name> -o jsonpath='{.status.activeStorageService}'
# Check storage service logs for catalog operations
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> | grep -i catalog
Solutions:
# Verify Hive connectivity
kubectl run -it --rm debug --image=busybox -- nc -zv <hive-host> 9083
# Verify Glue access
kubectl exec -it <storage-pod> -- aws glue get-databases --max-results 1
# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup <hostname>
Blue-Green Deployment Issues¶
Stuck in Deploying phase¶
Possible Causes:
1. New strategy pods not becoming ready
2. Health probes failing
3. Insufficient resources for new deployment
Diagnosis:
# Check both strategies
kubectl get deploy -l e6data.io/strategy=blue -n <namespace>
kubectl get deploy -l e6data.io/strategy=green -n <namespace>
# Check pending strategy pods
kubectl get pods -l e6data.io/strategy=<pending-strategy> -n <namespace>
# Check deployment status
kubectl get mds <name> -o jsonpath='{.status.deploymentPhase}'
kubectl get mds <name> -o jsonpath='{.status.pendingStrategy}'
Solutions:
# Check if it's a resource issue
kubectl describe pod -l e6data.io/strategy=<pending-strategy>
# Force rollback (if needed)
kubectl annotate mds <name> e6data.io/rollback-to=previous --overwrite
# Manual cleanup (last resort)
kubectl delete deploy -l e6data.io/strategy=<stuck-strategy>
Automatic rollback occurred¶
Cause: The new deployment failed health checks within the timeout window (2 minutes), triggering an automatic rollback.
Diagnosis:
# Check release history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory}' | jq
# Look for Failed status in history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory[?(@.status=="Failed")]}' | jq
# Check operator logs for rollback reason
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i rollback
Solutions:
1. Fix the underlying issue (image, config, resources)
2. Update the CR with the corrected values (see the example below)
3. The operator will attempt the deployment again
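For step 2, the usual flow is to edit the CR in place and then watch the new strategy roll out, reusing the status fields and labels from the diagnosis above:
# Apply the corrected values
kubectl edit mds <name> -n <namespace>
# Watch the new rollout progress
kubectl get mds <name> -o jsonpath='{.status.deploymentPhase}'
kubectl get pods -l e6data.io/strategy=<pending-strategy> -n <namespace> -w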
Autoscaling Issues¶
Executors not scaling¶
Possible Causes:
1. Autoscaling not enabled
2. Already at min/max limits
3. Operator API not receiving requests
Diagnosis:
# Check autoscaling config
kubectl get qs <name> -o jsonpath='{.spec.executor.autoscaling}'
# Check current vs target replicas
kubectl get qs <name> -o jsonpath='{.status.executorDeployment}'
# Check scaling history
kubectl get qs <name> -o jsonpath='{.status.scalingHistory}' | jq
# Check operator API endpoint
kubectl port-forward -n e6-operator-system svc/e6-operator 8082:8082
curl http://localhost:8082/health
Solutions:
# Enable autoscaling
kubectl patch qs <name> --type=merge -p '{
"spec":{"executor":{"autoscaling":{
"enabled":true,
"minExecutors":2,
"maxExecutors":20
}}}
}'
# Manual scale (for testing)
kubectl patch qs <name> --type=merge -p '{"spec":{"executor":{"replicas":10}}}'
Pool executors not starting¶
Possible Causes:
1. Pool not active
2. QueryService not in allowed list
3. Incompatible resources
Diagnosis:
# Check pool status
kubectl get pool <name> -o yaml
# Check if QS is attached
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices}' | jq
# Check compatibility
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq
# Check allocations
kubectl get pool <name> -o jsonpath='{.status.allocations}' | jq
Solutions:
# Verify QS has pool label (if using selector)
kubectl get qs <name> -o jsonpath='{.metadata.labels}'
# Add label if missing
kubectl label qs <name> e6data.io/pool=<pool-name>
# Or add to explicit allow list in Pool spec
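If the Pool uses an explicit allow list instead of a label selector, it can be patched in place. The field name below is a placeholder; confirm the real path in your Pool CRD first:
# Hypothetical example -- "allowedQueryServices" is illustrative only,
# check the actual field name with: kubectl explain pool.spec
kubectl patch pool <pool-name> --type=merge \
  -p '{"spec":{"allowedQueryServices":["<query-service-name>"]}}'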
Storage/Networking Issues¶
Can't access object storage (S3/GCS/Azure)¶
Possible Causes:
1. IAM permissions missing
2. Workload identity not configured
3. Wrong endpoint/region
4. Network policy blocking
Diagnosis:
# Check service account annotations (IRSA/Workload Identity)
kubectl get sa <sa-name> -n <namespace> -o yaml
# Test S3 access from pod
kubectl exec -it <pod> -n <namespace> -- aws s3 ls s3://bucket/
# Check for network policies
kubectl get networkpolicy -n <namespace>
Solutions:
# For AWS IRSA
kubectl annotate sa <sa-name> -n <namespace> \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/ROLE_NAME
# For GCP Workload Identity
kubectl annotate sa <sa-name> -n <namespace> \
  iam.gke.io/gcp-service-account=SA@PROJECT.iam.gserviceaccount.com
# Verify IAM policy allows s3:GetObject, s3:ListBucket, etc.
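On AWS, two standard checks help confirm the last point: that the pod actually assumes the intended role, and that the role carries the expected policies:
# Confirm the identity the pod assumes (IRSA)
kubectl exec -it <pod> -n <namespace> -- aws sts get-caller-identity
# List managed and inline policies attached to the role
aws iam list-attached-role-policies --role-name ROLE_NAME
aws iam list-role-policies --role-name ROLE_NAME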
DNS resolution failures¶
Symptoms: Services can't discover each other; connections time out.
Diagnosis:
# Test DNS from a pod
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default
# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
Recovery Procedures¶
Force Re-reconciliation¶
# Add/update annotation to trigger reconcile
kubectl annotate mds <name> e6data.io/reconcile-trigger=$(date +%s) --overwrite
Clean Restart¶
# Delete all pods (deployments will recreate)
kubectl delete pods -l app.kubernetes.io/instance=<name> -n <namespace>
# Restart operator
kubectl rollout restart deployment e6-operator -n e6-operator-system
Complete Resource Reset¶
# Delete and recreate (WARNING: may cause downtime)
kubectl delete mds <name> -n <namespace>
kubectl apply -f metadata-services.yaml
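Before deleting, it is worth snapshotting the live spec so the recreate step starts from exactly what was running:
# Back up the current CR first
kubectl get mds <name> -n <namespace> -o yaml > mds-backup.yaml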
Manual Finalizer Removal (Stuck Deletion)¶
# Only if resource won't delete normally
kubectl patch mds <name> -p '{"metadata":{"finalizers":[]}}' --type=merge
Logging and Debugging¶
Enable Debug Logging¶
# Operator debug logs (check operator deployment for env var)
kubectl set env deployment/e6-operator -n e6-operator-system LOG_LEVEL=debug
# Application debug logging
kubectl patch mds <name> --type=merge -p '{
"spec":{"storage":{"environmentVariables":{"E6_LOGGING_LEVEL":"E6_DEBUG"}}}
}'
Collect Diagnostic Bundle¶
# Collect all related resources
kubectl get mds,qs,e6cat,pool,gov,deploy,svc,cm,secret,pvc -n <namespace> -o yaml > diagnostic-bundle.yaml
# Add events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' >> diagnostic-bundle.yaml
# Add operator logs
kubectl logs -n e6-operator-system -l app=e6-operator --tail=1000 >> diagnostic-bundle.yaml
Getting Help¶
If you can't resolve an issue:
- Collect a diagnostic bundle (see above)
- Check GitHub Issues: https://github.com/e6data/e6-operator/issues
- Open a new issue with:
    - Kubernetes version
    - Operator version
    - Cloud provider
    - CR YAML (sanitized)
    - Diagnostic bundle
    - Steps to reproduce