
Troubleshooting Guide

This guide covers common issues and their solutions across all e6data operator CRDs.


Quick Diagnostics

Check Operator Health

# Operator pod status
kubectl get pods -n e6-operator-system

# Operator logs (last 100 lines)
kubectl logs -n e6-operator-system -l app=e6-operator --tail=100

# Operator logs with error filter
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i error

# Watch operator logs in real-time
kubectl logs -n e6-operator-system -l app=e6-operator -f

Check Resource Status

# All e6data resources in a namespace
kubectl get mds,qs,e6cat,pool,gov -n workspace-prod

# Detailed status for a resource
kubectl describe mds my-metadata -n workspace-prod

# Events for a resource
kubectl get events --field-selector involvedObject.name=my-metadata -n workspace-prod

# YAML output with full status
kubectl get mds my-metadata -n workspace-prod -o yaml

Common Issues by Symptom

Pods Not Starting

Symptom: Pods stuck in Pending

Possible Causes:

  1. Insufficient cluster resources
  2. NodeSelector/tolerations don't match nodes
  3. PVC not binding (storage class issues)
  4. Karpenter not provisioning nodes

Diagnosis:

# Check pod events
kubectl describe pod <pod-name> -n <namespace>

# Check node availability
kubectl get nodes
kubectl describe node <node-name>

# Check PVCs
kubectl get pvc -n <namespace>

# Check Karpenter (if used)
kubectl get nodepools
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter

Solutions:

# Scale up node pool (if manual)
# Or check Karpenter limits in NodePool

# Verify storage class exists
kubectl get sc

# Check resource quotas
kubectl describe resourcequota -n <namespace>
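
If Karpenter provisions nodes, the NodePool's resource limits are a common ceiling. A quick check (a sketch; field paths follow recent Karpenter releases, verify with kubectl explain nodepool):

# Compare the NodePool's configured limits against the resources it has already launched
kubectl get nodepool <nodepool-name> -o jsonpath='{.spec.limits}'
kubectl get nodepool <nodepool-name> -o jsonpath='{.status.resources}'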


Symptom: Pods in CrashLoopBackOff

Possible Causes:

  1. Invalid configuration (env vars, ConfigMaps)
  2. Missing secrets or credentials
  3. Application error on startup
  4. Insufficient memory (OOMKilled)

Diagnosis:

# Get pod logs (current crash)
kubectl logs <pod-name> -n <namespace>

# Get previous crash logs
kubectl logs <pod-name> -n <namespace> --previous

# Check container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses}'

# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

Solutions:

# If OOMKilled, increase memory
kubectl patch mds my-metadata --type=merge -p '{"spec":{"storage":{"resources":{"memory":"16Gi"}}}}'

# Check ConfigMap content
kubectl get cm <configmap-name> -o yaml

# Verify secrets exist
kubectl get secrets -n <namespace>


Symptom: Pods in ImagePullBackOff

Possible Causes:

  1. Image doesn't exist (wrong tag)
  2. Private registry without credentials
  3. Registry rate limiting

Diagnosis:

# Check pod events for image pull error
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"

# Verify image exists
docker manifest inspect <image>:<tag>

Solutions:

# Add image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password>

# Reference in CR
# spec.imagePullSecrets: [regcred]

# Check current pull secrets
kubectl get mds my-metadata -o jsonpath='{.spec.imagePullSecrets}'
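
If the CR accepts imagePullSecrets as a list of secret names (as the comment and the jsonpath check above suggest), the reference can be added with a merge patch. A sketch; verify the exact field with kubectl explain mds.spec:

# Add the pull secret reference to the CR (field path assumed from the check above)
kubectl patch mds my-metadata -n <namespace> --type=merge -p '{"spec":{"imagePullSecrets":["regcred"]}}'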


CRD Phase Issues

MetadataServices stuck in Creating or Updating

Possible Causes:

  1. Pods not becoming ready
  2. Health probes failing
  3. Storage backend not accessible

Diagnosis:

# Check deployment status
kubectl get deploy -l app.kubernetes.io/instance=my-metadata -n <namespace>

# Check pod readiness
kubectl get pods -l app.kubernetes.io/instance=my-metadata -n <namespace>

# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> --tail=50

Solutions:

# Verify storage backend access
kubectl exec -it <storage-pod> -n <namespace> -- aws s3 ls s3://bucket/

# Check health endpoint
kubectl port-forward svc/my-metadata-storage 8081:8081 -n <namespace>
curl http://localhost:8081/health


QueryService stuck in Waiting

Cause: MetadataServices not ready in the same namespace.

Diagnosis:

# Check MetadataServices status
kubectl get mds -n <namespace>

# Verify phase is Running
kubectl get mds -n <namespace> -o jsonpath='{.items[*].status.phase}'

Solution: Wait for MetadataServices to reach Running phase, or fix MetadataServices issues first.
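
To block until the dependency is ready instead of polling, kubectl wait can watch the status field directly (a sketch; it assumes the phase is reported at .status.phase, as in the check above):

# Wait up to 10 minutes for MetadataServices to report Running
kubectl wait mds/<name> -n <namespace> --for=jsonpath='{.status.phase}'=Running --timeout=10m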


E6Catalog stuck in Creating

Possible Causes:

  1. Storage service not responding
  2. Catalog source (Hive/Glue) unreachable
  3. Network/firewall issues

Diagnosis:

# Check operation status
kubectl get e6cat <name> -o jsonpath='{.status.operationStatus}'

# Check which storage service is being used
kubectl get e6cat <name> -o jsonpath='{.status.activeStorageService}'

# Check storage service logs for catalog operations
kubectl logs -l app.kubernetes.io/name=storage -n <namespace> | grep -i catalog

Solutions:

# Verify Hive connectivity
kubectl run -it --rm debug --image=busybox -- nc -zv <hive-host> 9083

# Verify Glue access
kubectl exec -it <storage-pod> -- aws glue get-databases --max-results 1

# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup <hostname>


Blue-Green Deployment Issues

Stuck in Deploying phase

Possible Causes:

  1. New strategy pods not becoming ready
  2. Health probes failing
  3. Insufficient resources for the new deployment

Diagnosis:

# Check both strategies
kubectl get deploy -l e6data.io/strategy=blue -n <namespace>
kubectl get deploy -l e6data.io/strategy=green -n <namespace>

# Check pending strategy pods
kubectl get pods -l e6data.io/strategy=<pending-strategy> -n <namespace>

# Check deployment status
kubectl get mds <name> -o jsonpath='{.status.deploymentPhase}'
kubectl get mds <name> -o jsonpath='{.status.pendingStrategy}'

Solutions:

# Check if it's a resource issue
kubectl describe pod -l e6data.io/strategy=<pending-strategy>

# Force rollback (if needed)
kubectl annotate mds <name> e6data.io/rollback-to=previous --overwrite

# Manual cleanup (last resort)
kubectl delete deploy -l e6data.io/strategy=<stuck-strategy>


Automatic rollback occurred

Cause: The new deployment failed its health checks within the rollback timeout (2 minutes).

Diagnosis:

# Check release history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory}' | jq

# Look for Failed status in history
kubectl get mds <name> -o jsonpath='{.status.releaseHistory[?(@.status=="Failed")]}' | jq

# Check operator logs for rollback reason
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i rollback

Solutions:

  1. Fix the underlying issue (image, config, resources)
  2. Update the CR with the corrected values (see the example below)
  3. The operator will then attempt the deployment again
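
A typical recovery loop is to re-apply the corrected manifest and watch the new release roll out (the filename matches the reset example later in this guide):

# Re-apply the corrected manifest and watch the phase transition
kubectl apply -f metadata-services.yaml
kubectl get mds <name> -n <namespace> -w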


Autoscaling Issues

Executors not scaling

Possible Causes:

  1. Autoscaling not enabled
  2. Already at min/max limits
  3. Operator API not receiving requests

Diagnosis:

# Check autoscaling config
kubectl get qs <name> -o jsonpath='{.spec.executor.autoscaling}'

# Check current vs target replicas
kubectl get qs <name> -o jsonpath='{.status.executorDeployment}'

# Check scaling history
kubectl get qs <name> -o jsonpath='{.status.scalingHistory}' | jq

# Check operator API endpoint
kubectl port-forward -n e6-operator-system svc/e6-operator 8082:8082
curl http://localhost:8082/health

Solutions:

# Enable autoscaling
kubectl patch qs <name> --type=merge -p '{
  "spec":{"executor":{"autoscaling":{
    "enabled":true,
    "minExecutors":2,
    "maxExecutors":20
  }}}
}'

# Manual scale (for testing)
kubectl patch qs <name> --type=merge -p '{"spec":{"executor":{"replicas":10}}}'


Pool executors not starting

Possible Causes:

  1. Pool not active
  2. QueryService not in allowed list
  3. Incompatible resources

Diagnosis:

# Check pool status
kubectl get pool <name> -o yaml

# Check if QS is attached
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices}' | jq

# Check compatibility
kubectl get pool <name> -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq

# Check allocations
kubectl get pool <name> -o jsonpath='{.status.allocations}' | jq

Solutions:

# Verify QS has pool label (if using selector)
kubectl get qs <name> -o jsonpath='{.metadata.labels}'

# Add label if missing
kubectl label qs <name> e6data.io/pool=<pool-name>

# Or add to explicit allow list in Pool spec
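
The exact allow-list field depends on the Pool schema in your operator version; kubectl explain shows what the CRD actually accepts before you edit it:

# Inspect the Pool spec schema, then edit the allow list in place
kubectl explain pool.spec
kubectl edit pool <pool-name> -n <namespace>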


Storage/Networking Issues

Can't access object storage (S3/GCS/Azure)

Possible Causes:

  1. IAM permissions missing
  2. Workload identity not configured
  3. Wrong endpoint/region
  4. Network policy blocking

Diagnosis:

# Check service account annotations (IRSA/Workload Identity)
kubectl get sa <sa-name> -n <namespace> -o yaml

# Test S3 access from pod
kubectl exec -it <pod> -n <namespace> -- aws s3 ls s3://bucket/

# Check for network policies
kubectl get networkpolicy -n <namespace>

Solutions:

# For AWS IRSA
kubectl annotate sa <sa-name> \
  eks.amazonaws.com/role-arn=arn:aws:iam::ACCOUNT:role/ROLE_NAME

# For GCP Workload Identity
kubectl annotate sa <sa-name> \
  iam.gke.io/gcp-service-account=SA@PROJECT.iam.gserviceaccount.com

# Verify IAM policy allows s3:GetObject, s3:ListBucket, etc.
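
One way to sanity-check those permissions without touching data is the IAM policy simulator (requires iam:SimulatePrincipalPolicy on the caller; the ARNs and bucket below are placeholders):

# Simulate the role's access to the bucket
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::ACCOUNT:role/ROLE_NAME \
  --action-names s3:GetObject s3:ListBucket \
  --resource-arns 'arn:aws:s3:::bucket' 'arn:aws:s3:::bucket/*'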


DNS resolution failures

Symptoms: Services can't discover each other; connections time out.

Diagnosis:

# Test DNS from a pod
kubectl run -it --rm debug --image=busybox -- nslookup kubernetes.default

# Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
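
If CoreDNS itself looks unhealthy, a rolling restart often clears transient issues (this assumes CoreDNS runs as the coredns Deployment in kube-system, the default on most distributions):

# Restart CoreDNS (deployment name may differ on some distributions)
kubectl rollout restart deployment coredns -n kube-system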


Recovery Procedures

Force Re-reconciliation

# Add/update annotation to trigger reconcile
kubectl annotate mds <name> e6data.io/reconcile-trigger=$(date +%s) --overwrite

Clean Restart

# Delete all pods (deployments will recreate)
kubectl delete pods -l app.kubernetes.io/instance=<name> -n <namespace>

# Restart operator
kubectl rollout restart deployment e6-operator -n e6-operator-system

Complete Resource Reset

# Delete and recreate (WARNING: may cause downtime)
kubectl delete mds <name> -n <namespace>
kubectl apply -f metadata-services.yaml

Manual Finalizer Removal (Stuck Deletion)

# Only if resource won't delete normally
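# First confirm which finalizers are blocking deletion
kubectl get mds <name> -n <namespace> -o jsonpath='{.metadata.finalizers}'

# Then clear them (last resort):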
kubectl patch mds <name> -p '{"metadata":{"finalizers":[]}}' --type=merge

Logging and Debugging

Enable Debug Logging

# Operator debug logs (check operator deployment for env var)
kubectl set env deployment/e6-operator -n e6-operator-system LOG_LEVEL=debug

# Application debug logging
kubectl patch mds <name> --type=merge -p '{
  "spec":{"storage":{"environmentVariables":{"E6_LOGGING_LEVEL":"E6_DEBUG"}}}
}'

Collect Diagnostic Bundle

# Collect all related resources
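# Note: the output includes Secret objects; redact them or drop 'secret' from the list before sharing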
kubectl get mds,qs,e6cat,pool,gov,deploy,svc,cm,secret,pvc -n <namespace> -o yaml > diagnostic-bundle.yaml

# Add events
kubectl get events -n <namespace> --sort-by='.lastTimestamp' >> diagnostic-bundle.yaml

# Add operator logs
kubectl logs -n e6-operator-system -l app=e6-operator --tail=1000 >> diagnostic-bundle.yaml

Getting Help

If you can't resolve an issue:

  1. Collect diagnostic bundle (see above)
  2. Check GitHub Issues: https://github.com/e6data/e6-operator/issues
  3. Open a new issue with:
     - Kubernetes version
     - Operator version
     - Cloud provider
     - CR YAML (sanitized)
     - Diagnostic bundle
     - Steps to reproduce