Status Fields and Diagnostics Guide
This guide explains how to interpret status fields, phases, and diagnostics across all e6data CRDs.
Quick Reference: kubectl Commands
# List all resources with status
kubectl get mds,qs,e6cat,pool -n workspace-prod
# Detailed status for a resource
kubectl describe mds my-metadata -n workspace-prod
# Get specific status field
kubectl get mds my-metadata -n workspace-prod -o jsonpath='{.status.phase}'
# Watch status changes in real-time
kubectl get mds -n workspace-prod -w
# Get full status as YAML
kubectl get mds my-metadata -n workspace-prod -o yaml | yq '.status'
MetadataServices Status
Phase Values
| Phase | Description | Action |
|-------|-------------|--------|
| Pending | CR created, waiting to start reconciliation | Wait for operator |
| Creating | First deployment in progress | Wait ~2-5 minutes |
| Running | All components healthy and serving | Normal operation |
| Updating | Blue-green deployment in progress | Wait ~2-5 minutes |
| Failed | Deployment failed (pods not starting) | Check pod logs |
| Degraded | Partial failure (some pods unhealthy) | Check specific component |
| Terminating | Being deleted, cleanup in progress | Wait for finalizer |
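To check these phases across every MetadataServices resource at once, the status fields can be pulled into custom columns. A minimal sketch, assuming the `mds` short name used throughout this guide:
# Show phase, readiness, and message for all MetadataServices
kubectl get mds -n workspace-prod \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,MESSAGE:.status.message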
Status Fields Explained
status:
  phase: Running                        # Current lifecycle phase
  ready: true                           # Overall readiness (all components healthy)
  message: "All services running"       # Human-readable status message
  # Blue-green deployment tracking
  activeStrategy: blue                  # Currently serving traffic (blue or green)
  pendingStrategy: ""                   # Strategy being deployed (empty when stable)
  deploymentPhase: Stable               # Stable|Deploying|Switching|Draining|Cleanup
  activeReleaseVersion: "v1.0.462"      # Current active version
  # Per-component status
  storageDeployment:
    name: my-metadata-storage-blue
    ready: true
    replicas: 2
    readyReplicas: 2                    # Should equal replicas when healthy
  secondaryStorageDeployment:           # Only if HA enabled
    name: my-metadata-secondary-storage-blue
    ready: true
    replicas: 1
    readyReplicas: 1
  schemaDeployment:
    name: my-metadata-schema-blue
    ready: true
    replicas: 1
    readyReplicas: 1
  # Release history (last 10 deployments)
  releaseHistory:
    - version: "v1.0.462"
      strategy: blue
      storageTag: "1.0.462-4730d5a"
      schemaTag: "1.0.562-5a58ed2"
      timestamp: "2024-12-09T10:30:00Z"
      status: Active                    # Active|Superseded|Failed
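A readyReplicas value that lags replicas is the usual first sign of a Degraded resource. A sketch that pulls the per-component blocks side by side (resource name illustrative, requires jq):
# Compare desired vs ready replicas for each MetadataServices component
kubectl get mds my-metadata -n workspace-prod -o json | \
  jq '.status
      | {storage: .storageDeployment, secondaryStorage: .secondaryStorageDeployment, schema: .schemaDeployment}
      | with_entries(select(.value != null))
      | map_values({ready, replicas, readyReplicas})'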
Diagnosing Issues
# Check which pods are unhealthy
kubectl get pods -l app.kubernetes.io/instance=my-metadata -n workspace-prod
# Check pod events
kubectl describe pod my-metadata-storage-blue-xxx -n workspace-prod
# Check container logs
kubectl logs my-metadata-storage-blue-xxx -n workspace-prod
# Check if readiness probe failing
kubectl get pods -n workspace-prod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].ready}{"\n"}{end}'
QueryService Status
Phase Values
| Phase | Description | Action |
|-------|-------------|--------|
| Waiting | Waiting for MetadataServices to be ready | Check MDS status |
| Deploying | Initial deployment or update in progress | Wait ~3-5 minutes |
| Ready | All components healthy | Normal operation |
| Updating | Blue-green update in progress | Wait ~3-5 minutes |
| Failed | Deployment failed | Check component logs |
| Degraded | Some components unhealthy | Check specific component |
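A QueryService stuck in Waiting usually means its MetadataServices dependency is not Ready. A quick check, assuming the MDS is named my-metadata:
# Confirm the MetadataServices dependency is Running and ready
kubectl get mds my-metadata -n workspace-prod \
  -o jsonpath='{.status.phase}{"\t"}{.status.ready}{"\n"}'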
Status Fields Explained
status:
  phase: Ready
  ready: true
  message: "Query cluster ready"
  # Blue-green deployment
  activeStrategy: blue
  pendingStrategy: ""
  deploymentPhase: Stable
  activeReleaseVersion: "v1.0.1160"
  # Component statuses
  plannerDeployment:
    ready: true
    replicas: 1
    readyReplicas: 1
  queueDeployment:
    ready: true
    replicas: 1
    readyReplicas: 1
  executorDeployment:
    ready: true
    replicas: 4
    readyReplicas: 4
  # Pool executor status (if using Pool)
  poolExecutorDeployment:
    ready: true
    replicas: 2
    readyReplicas: 2
    poolName: "burst-pool"
    poolNamespace: "e6-pools"
  regularExecutorReplicas: 4            # Executors on regular nodes
  poolExecutorReplicas: 2               # Executors on pool nodes
  # Service endpoints (traffic routing handled by Envoy + xDS)
  plannerService: "my-cluster-planner-blue.workspace-prod.svc:10001"
  queueService: "my-cluster-queue-blue.workspace-prod.svc:10003"
  # Scaling history (last 20 operations)
  scalingHistory:
    - timestamp: "2024-12-09T10:30:00Z"
      component: executor
      oldReplicas: 2
      newReplicas: 4
      trigger: autoscaling-api          # autoscaling-api|kubectl|manual
      strategy: blue
  # Suspension history (last 20 operations)
  suspensionHistory:
    - timestamp: "2024-12-09T08:00:00Z"
      action: suspend                   # suspend|resume
      trigger: auto-suspension-api
      strategy: blue
      componentsSuspended: [planner, queue, executor]
      preSuspensionReplicas:
        plannerReplicas: 1
        queueReplicas: 1
        executorReplicas: 4
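The scaling and suspension histories answer "who changed the replica count, and when". A sketch that prints the most recent entry from each list (cluster name illustrative; assumes new entries are appended last; requires jq):
# Latest scaling and suspension events for a QueryService
kubectl get qs my-cluster -n workspace-prod -o json | \
  jq '{lastScaling: (.status.scalingHistory // [] | last),
       lastSuspension: (.status.suspensionHistory // [] | last)}'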
Diagnosing Issues
# Check all QueryService components
kubectl get pods -l app.kubernetes.io/instance=my-cluster -n workspace-prod
# Check Envoy proxy (traffic routing)
kubectl logs -l e6data.io/component=envoy -n workspace-prod --tail=50
# Check planner for query errors
kubectl logs -l app=planner -n workspace-prod --tail=100 | grep -i error
# Check executor health
kubectl get pods -l app=executor -n workspace-prod -o wide
E6Catalog Status
Phase Values
| Phase | Description | Action |
|-------|-------------|--------|
| Waiting | Waiting for MetadataServices | Check MDS status |
| Creating | Catalog registration in progress | Wait ~1-2 minutes |
| Ready | Catalog registered and accessible | Normal operation |
| Updating | Catalog update in progress | Wait ~1-2 minutes |
| Refreshing | Metadata refresh in progress | Wait for completion |
| Deleting | Catalog being removed | Wait for finalizer |
| Failed | Operation failed | Check operationStatus |
Status Fields Explained
status:
  phase: Ready
  # Storage service being used
  activeStorageService: "my-metadata-storage-blue"
  storageServiceEndpoint: "http://my-metadata-storage-blue.workspace-prod.svc:8081"
  # Catalog information from API
  catalogDetails:
    catalogName: "data-lake"
    catalogType: "GLUE"
    isDefault: true
    status: "ACTIVE"
    createdAt: "2024-12-09T10:00:00Z"
    updatedAt: "2024-12-09T10:30:00Z"
  # Last refresh timestamp
  lastRefreshTime: "2024-12-09T10:30:00Z"
  # Current operation status (populated during async operations)
  operationStatus:
    operation: update                   # create|update|refresh
    status: success                     # in_progress|success|partial_success|failed
    message: "Catalog updated successfully"
    startTime: "2024-12-09T10:28:00Z"
    lastUpdated: "2024-12-09T10:30:00Z"
    totalDBsRefreshed: 15
    totalTablesRefreshed: 234
    # Only populated on failure or partial success
    diagnosticsFilePath: "s3://bucket/diagnostics/catalog-update-2024-12-09.json"
    failures:
      - type: table
        name: "db1.problematic_table"
        reason: "Schema inference failed: unsupported data type"
      - type: database
        name: "restricted_db"
        reason: "Access denied"
Operation Status Values
| Status | Description | Action |
|--------|-------------|--------|
| in_progress | Operation running | Poll again in 10 seconds |
| success | All items succeeded | Operation complete |
| partial_success | Some items failed | Catalog usable, check failures |
| failed | Operation failed completely | Check error message and logs |
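Because in_progress means "poll again in 10 seconds", a small wait loop around the status field is often enough. A sketch assuming the catalog is named my-catalog:
# Poll until the operation leaves in_progress, then print the final result
while [ "$(kubectl get e6cat my-catalog -n workspace-prod \
    -o jsonpath='{.status.operationStatus.status}')" = "in_progress" ]; do
  sleep 10
done
kubectl get e6cat my-catalog -n workspace-prod -o json | jq '.status.operationStatus'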
Diagnosing Issues
# Check operation status
kubectl get e6cat my-catalog -n workspace-prod -o jsonpath='{.status.operationStatus}'
# View failures inline
kubectl get e6cat my-catalog -o jsonpath='{.status.operationStatus.failures}' | jq
# Get diagnostics file path
kubectl get e6cat my-catalog -o jsonpath='{.status.operationStatus.diagnosticsFilePath}'
# Download and view diagnostics file (AWS S3)
aws s3 cp s3://bucket/diagnostics/catalog-update-2024-12-09.json - | jq
# Check storage service logs for catalog operations
kubectl logs -l app.kubernetes.io/name=storage -n workspace-prod | grep -i catalog
Diagnostics File Structure
{
  "operation": "update",
  "catalogName": "data-lake",
  "startTime": "2024-12-09T10:28:00Z",
  "endTime": "2024-12-09T10:30:00Z",
  "summary": {
    "totalDatabases": 16,
    "successfulDatabases": 15,
    "failedDatabases": 1,
    "totalTables": 250,
    "successfulTables": 234,
    "failedTables": 16
  },
  "failures": [
    {
      "type": "database",
      "name": "restricted_db",
      "reason": "Access denied: IAM role lacks glue:GetDatabase permission",
      "timestamp": "2024-12-09T10:28:15Z"
    },
    {
      "type": "table",
      "database": "db1",
      "name": "problematic_table",
      "reason": "Schema inference failed: column 'data' has unsupported type 'struct<nested:array<map<string,int>>>'",
      "timestamp": "2024-12-09T10:28:45Z"
    }
  ],
  "successful": [
    {"type": "database", "name": "db1"},
    {"type": "database", "name": "db2"},
    // ...
  ]
}
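To summarize a downloaded diagnostics file without reading the full failure list, the failures array can be grouped by type. A sketch using the example path above (requires jq):
# Overall counts plus failures grouped by type
aws s3 cp s3://bucket/diagnostics/catalog-update-2024-12-09.json - | \
  jq '{summary,
       failuresByType: (.failures | group_by(.type) | map({(.[0].type): length}) | add)}'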
CatalogRefresh Status
Phase Values
| Phase | Description | Action |
|-------|-------------|--------|
| Pending | Waiting to start (another refresh may be running) | Wait for lock |
| Running | Refresh operation in progress | Wait for completion |
| Succeeded | All databases/tables refreshed successfully | Complete |
| PartialSuccess | Some items failed but catalog is usable | Check failures |
| Failed | Refresh failed completely | Check error message |
| TimedOut | Exceeded configured timeout | Retry with longer timeout |
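To block a script until a refresh reaches Succeeded, kubectl wait can poll the phase field directly (jsonpath conditions require kubectl v1.23+; the refresh name matches the examples below). A refresh that ends in Failed or TimedOut simply lets the wait expire:
# Wait up to 30 minutes for the refresh to report Succeeded
kubectl wait catalogrefresh/refresh-20241209 -n workspace-prod \
  --for=jsonpath='{.status.phase}'=Succeeded --timeout=30m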
Status Fields Explained
status:
  phase: Succeeded
  # Timing
  startTime: "2024-12-09T10:00:00Z"
  completionTime: "2024-12-09T10:05:32Z"
  # Results
  databasesRefreshed: 15
  tablesRefreshed: 234
  message: "Refresh completed in 5m32s"
  # Diagnostics (for partial_success or failed)
  diagnosticsFilePath: "s3://bucket/diagnostics/refresh-2024-12-09.json"
  failures:
    - type: table
      name: "db1.broken_table"
      reason: "Parquet file corrupted"
Diagnosing Issues
# Check refresh status
kubectl get catalogrefresh -n workspace-prod
# View detailed status
kubectl describe catalogrefresh refresh-20241209 -n workspace-prod
# Check for failures
kubectl get catalogrefresh refresh-20241209 -o jsonpath='{.status.failures}' | jq
# If timed out, check what was in progress
kubectl logs -l app.kubernetes.io/name=storage -n workspace-prod | grep -i refresh
CatalogRefreshSchedule Status
Status Fields Explained
status:
  # Next scheduled run
  nextScheduledTime: "2024-12-10T02:00:00Z"
  # Currently running refresh (if any)
  activeRefreshes:
    - name: "nightly-refresh-20241209-020000"
      startTime: "2024-12-09T02:00:00Z"
  # Statistics
  statistics:
    totalRuns: 45
    successfulRuns: 42
    partialSuccessRuns: 2
    failedRuns: 1
    totalDatabasesRefreshed: 675
    totalTablesRefreshed: 10530
    averageRefreshDurationSeconds: 332
  # Recent history (last 5 runs)
  recentHistory:
    - name: "nightly-refresh-20241209-020000"
      phase: Succeeded
      startTime: "2024-12-09T02:00:00Z"
      completionTime: "2024-12-09T02:05:32Z"
      databasesRefreshed: 15
      tablesRefreshed: 234
    - name: "nightly-refresh-20241208-020000"
      phase: PartialSuccess
      startTime: "2024-12-08T02:00:00Z"
      completionTime: "2024-12-08T02:06:15Z"
      databasesRefreshed: 14
      tablesRefreshed: 220
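The statistics block is enough to compute a rough success rate for the schedule. A sketch (schedule name illustrative, requires jq):
# Success rate and failure counts for a refresh schedule
kubectl get crs nightly-refresh -n workspace-prod -o json | \
  jq '.status.statistics
      | {successRatePct: ((.successfulRuns / .totalRuns * 100) | floor),
         partialSuccessRuns, failedRuns}'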
Diagnosing Issues
# List all scheduled refreshes
kubectl get catalogrefreshschedule -n workspace-prod
# Check recent runs
kubectl get catalogrefresh -n workspace-prod --sort-by=.metadata.creationTimestamp
# Check schedule statistics
kubectl get crs nightly-refresh -o jsonpath='{.status.statistics}' | jq
# View failed runs
kubectl get catalogrefresh -n workspace-prod -o json | \
jq '.items[] | select(.status.phase == "Failed") | {name: .metadata.name, message: .status.message}'
Pool Status
Phase Values
| Phase | Description | Action |
|-------|-------------|--------|
| Pending | Pool created, initializing | Wait for Karpenter |
| Creating | Creating NodePool/NodeClass | Wait ~1-2 minutes |
| Active | Pool ready for allocations | Normal operation |
| Suspended | Pool manually suspended | Resume when needed |
| Deleting | Being deleted | Wait for finalizer |
| Failed | NodePool creation failed | Check Karpenter logs |
Status Fields Explained
status:
  phase: Active
  # Node provisioning
  nodePoolName: "burst-pool-nodepool"
  nodeClassName: "burst-pool-ec2nodeclass"   # AWS
  # Capacity
  availableExecutors: 16
  allocatedExecutors: 6
  # Attached QueryServices
  attachedQueryServices:
    - name: analytics-cluster
      namespace: workspace-prod
      allocatedExecutors: 4
      compatible: true
    - name: reporting-cluster
      namespace: workspace-prod
      allocatedExecutors: 2
      compatible: true
  # Warmup status
  warmupDaemonSets:
    - name: burst-pool-warmup-executor-1-0-2123
      imageTag: "1.0.2123-abe4ff294"
      readyNodes: 4
      desiredNodes: 4
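Free capacity is availableExecutors minus allocatedExecutors. A sketch that prints the headroom and the attached QueryServices (pool name and namespace taken from the example above; requires jq):
# Free executor capacity and attached QueryServices for a pool
kubectl get pool burst-pool -n e6-pools -o json | \
  jq '.status
      | {freeExecutors: (.availableExecutors - .allocatedExecutors),
         attached: [.attachedQueryServices[] | {name, allocatedExecutors}]}'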
Diagnosing Issues
# Check pool status
kubectl get pool -A
# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml
# Check if nodes are being provisioned
kubectl get nodes -l karpenter.sh/nodepool=burst-pool-nodepool
# Check warmup DaemonSets
kubectl get daemonset -l e6data.io/pool=burst-pool
# Check Karpenter logs for provisioning issues
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter | grep burst-pool
Common Diagnostic Patterns
Check All E6 Resources
# Get all e6data resources in a namespace
kubectl get mds,qs,e6cat,catalogrefresh,crs,pool,gov -n workspace-prod
# Get all resources across all namespaces
kubectl get mds,qs,e6cat,pool -A
Collect Diagnostic Bundle
#!/bin/bash
NAMESPACE="$1"
OUTPUT="e6-diagnostic-$(date +%Y%m%d-%H%M%S).yaml"
echo "Collecting diagnostics for namespace: $NAMESPACE"
{
  echo "--- MetadataServices ---"
  kubectl get mds -n "$NAMESPACE" -o yaml
  echo "--- QueryServices ---"
  kubectl get qs -n "$NAMESPACE" -o yaml
  echo "--- E6Catalogs ---"
  kubectl get e6cat -n "$NAMESPACE" -o yaml
  echo "--- Pods ---"
  kubectl get pods -n "$NAMESPACE" -o yaml
  echo "--- Events ---"
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp'
  echo "--- Operator Logs ---"
  kubectl logs -n e6-operator-system -l app=e6-operator --tail=500
} > "$OUTPUT"
echo "Diagnostics saved to: $OUTPUT"
Watch for Status Changes
# Watch all resources
watch -n 2 "kubectl get mds,qs,e6cat -n workspace-prod"
# Watch with custom columns
kubectl get qs -n workspace-prod -w \
-o custom-columns=NAME:.metadata.name,PHASE:.status.phase,EXECUTORS:.status.executorDeployment.readyReplicas