
Status Fields and Diagnostics Guide

This guide explains how to interpret status fields, phases, and diagnostics across all e6data CRDs.


Quick Reference: kubectl Commands

# List all resources with status
kubectl get mds,qs,e6cat,pool -n workspace-prod

# Detailed status for a resource
kubectl describe mds my-metadata -n workspace-prod

# Get specific status field
kubectl get mds my-metadata -n workspace-prod -o jsonpath='{.status.phase}'

# Watch status changes in real-time
kubectl get mds -n workspace-prod -w

# Get full status as YAML
kubectl get mds my-metadata -n workspace-prod -o yaml | yq '.status'

MetadataServices Status

Phase Values

| Phase       | Description                                 | Action                    |
|-------------|---------------------------------------------|---------------------------|
| Pending     | CR created, waiting to start reconciliation | Wait for operator         |
| Creating    | First deployment in progress                | Wait ~2-5 minutes         |
| Running     | All components healthy and serving          | Normal operation          |
| Updating    | Blue-green deployment in progress           | Wait ~2-5 minutes         |
| Failed      | Deployment failed (pods not starting)       | Check pod logs            |
| Degraded    | Partial failure (some pods unhealthy)       | Check specific component  |
| Terminating | Being deleted, cleanup in progress          | Wait for finalizer        |
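
A quick way to scan where each MetadataServices resource sits in this lifecycle is a custom-columns listing of phase, readiness, and active strategy. This is a minimal sketch using the status fields documented in the next section; resource and namespace names are the placeholders used throughout this guide.

# List MetadataServices with phase, readiness, and active strategy
kubectl get mds -n workspace-prod \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,READY:.status.ready,ACTIVE:.status.activeStrategy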

Status Fields Explained

status:
  phase: Running              # Current lifecycle phase
  ready: true                 # Overall readiness (all components healthy)
  message: "All services running"  # Human-readable status message

  # Blue-green deployment tracking
  activeStrategy: blue        # Currently serving traffic (blue or green)
  pendingStrategy: ""         # Strategy being deployed (empty when stable)
  deploymentPhase: Stable     # Stable|Deploying|Switching|Draining|Cleanup
  activeReleaseVersion: "v1.0.462"  # Current active version

  # Per-component status
  storageDeployment:
    name: my-metadata-storage-blue
    ready: true
    replicas: 2
    readyReplicas: 2          # Should equal replicas when healthy

  secondaryStorageDeployment:  # Only if HA enabled
    name: my-metadata-secondary-storage-blue
    ready: true
    replicas: 1
    readyReplicas: 1

  schemaDeployment:
    name: my-metadata-schema-blue
    ready: true
    replicas: 1
    readyReplicas: 1

  # Release history (last 10 deployments)
  releaseHistory:
    - version: "v1.0.462"
      strategy: blue
      storageTag: "1.0.462-4730d5a"
      schemaTag: "1.0.562-5a58ed2"
      timestamp: "2024-12-09T10:30:00Z"
      status: Active          # Active|Superseded|Failed
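
To inspect the blue-green bookkeeping without reading the full object, the fields above can be pulled out with jsonpath or jq. A sketch, assuming the field paths shown in the example above:

# Current active strategy and deployment phase
kubectl get mds my-metadata -n workspace-prod \
  -o jsonpath='{.status.activeStrategy}{"\t"}{.status.deploymentPhase}{"\n"}'

# Release history as JSON (last 10 deployments)
kubectl get mds my-metadata -n workspace-prod -o json | jq '.status.releaseHistory'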

Diagnosing Issues

# Check which pods are unhealthy
kubectl get pods -l app.kubernetes.io/instance=my-metadata -n workspace-prod

# Check pod events
kubectl describe pod my-metadata-storage-blue-xxx -n workspace-prod

# Check container logs
kubectl logs my-metadata-storage-blue-xxx -n workspace-prod

# Check whether readiness probes are failing
kubectl get pods -n workspace-prod -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].ready}{"\n"}{end}'

QueryService Status

Phase Values

| Phase     | Description                              | Action                   |
|-----------|------------------------------------------|--------------------------|
| Waiting   | Waiting for MetadataServices to be ready | Check MDS status         |
| Deploying | Initial deployment or update in progress | Wait ~3-5 minutes        |
| Ready     | All components healthy                   | Normal operation         |
| Updating  | Blue-green update in progress            | Wait ~3-5 minutes        |
| Failed    | Deployment failed                        | Check component logs     |
| Degraded  | Some components unhealthy                | Check specific component |
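
When a QueryService sits in Waiting, the usual cause is the MetadataServices it depends on. A hedged example that reads the QueryService message alongside the MDS phases (resource names are placeholders):

# Why is the QueryService waiting?
kubectl get qs my-cluster -n workspace-prod \
  -o jsonpath='{.status.phase}{": "}{.status.message}{"\n"}'

# Is the MetadataServices it depends on ready?
kubectl get mds -n workspace-prod \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'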

Status Fields Explained

status:
  phase: Ready
  ready: true
  message: "Query cluster ready"

  # Blue-green deployment
  activeStrategy: blue
  pendingStrategy: ""
  deploymentPhase: Stable
  activeReleaseVersion: "v1.0.1160"

  # Component statuses
  plannerDeployment:
    ready: true
    replicas: 1
    readyReplicas: 1

  queueDeployment:
    ready: true
    replicas: 1
    readyReplicas: 1

  executorDeployment:
    ready: true
    replicas: 4
    readyReplicas: 4

  # Pool executor status (if using Pool)
  poolExecutorDeployment:
    ready: true
    replicas: 2
    readyReplicas: 2
  poolName: "burst-pool"
  poolNamespace: "e6-pools"
  regularExecutorReplicas: 4    # Executors on regular nodes
  poolExecutorReplicas: 2       # Executors on pool nodes

  # Service endpoints (traffic routing handled by Envoy + xDS)
  plannerService: "my-cluster-planner-blue.workspace-prod.svc:10001"
  queueService: "my-cluster-queue-blue.workspace-prod.svc:10003"

  # Scaling history (last 20 operations)
  scalingHistory:
    - timestamp: "2024-12-09T10:30:00Z"
      component: executor
      oldReplicas: 2
      newReplicas: 4
      trigger: autoscaling-api    # autoscaling-api|kubectl|manual
      strategy: blue

  # Suspension history (last 20 operations)
  suspensionHistory:
    - timestamp: "2024-12-09T08:00:00Z"
      action: suspend             # suspend|resume
      trigger: auto-suspension-api
      strategy: blue
      componentsSuspended: [planner, queue, executor]
      preSuspensionReplicas:
        plannerReplicas: 1
        queueReplicas: 1
        executorReplicas: 4
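
The scaling and suspension histories are plain lists, so jq works well for pulling out recent entries. A minimal sketch, assuming the field names shown above:

# Most recent scaling operation
kubectl get qs my-cluster -n workspace-prod -o json | jq '.status.scalingHistory | last'

# All suspend/resume events with their triggers
kubectl get qs my-cluster -n workspace-prod -o json | \
  jq '.status.suspensionHistory[] | {timestamp, action, trigger}'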

Diagnosing Issues

# Check all QueryService components
kubectl get pods -l app.kubernetes.io/instance=my-cluster -n workspace-prod

# Check Envoy proxy (traffic routing)
kubectl logs -l e6data.io/component=envoy -n workspace-prod --tail=50

# Check planner for query errors
kubectl logs -l app=planner -n workspace-prod --tail=100 | grep -i error

# Check executor health
kubectl get pods -l app=executor -n workspace-prod -o wide

E6Catalog Status

Phase Values

| Phase      | Description                       | Action                |
|------------|-----------------------------------|-----------------------|
| Waiting    | Waiting for MetadataServices      | Check MDS status      |
| Creating   | Catalog registration in progress  | Wait ~1-2 minutes     |
| Ready      | Catalog registered and accessible | Normal operation      |
| Updating   | Catalog update in progress        | Wait ~1-2 minutes     |
| Refreshing | Metadata refresh in progress      | Wait for completion   |
| Deleting   | Catalog being removed             | Wait for finalizer    |
| Failed     | Operation failed                  | Check operationStatus |

Status Fields Explained

status:
  phase: Ready

  # Storage service being used
  activeStorageService: "my-metadata-storage-blue"
  storageServiceEndpoint: "http://my-metadata-storage-blue.workspace-prod.svc:8081"

  # Catalog information from API
  catalogDetails:
    catalogName: "data-lake"
    catalogType: "GLUE"
    isDefault: true
    status: "ACTIVE"
    createdAt: "2024-12-09T10:00:00Z"
    updatedAt: "2024-12-09T10:30:00Z"

  # Last refresh timestamp
  lastRefreshTime: "2024-12-09T10:30:00Z"

  # Current operation status (populated during async operations)
  operationStatus:
    operation: update           # create|update|refresh
    status: success             # in_progress|success|partial_success|failed
    message: "Catalog updated successfully"
    startTime: "2024-12-09T10:28:00Z"
    lastUpdated: "2024-12-09T10:30:00Z"
    totalDBsRefreshed: 15
    totalTablesRefreshed: 234

    # Only populated on failure or partial success
    diagnosticsFilePath: "s3://bucket/diagnostics/catalog-update-2024-12-09.json"
    failures:
      - type: table
        name: "db1.problematic_table"
        reason: "Schema inference failed: unsupported data type"
      - type: database
        name: "restricted_db"
        reason: "Access denied"

Operation Status Values

| Status          | Description                 | Meaning                        |
|-----------------|-----------------------------|--------------------------------|
| in_progress     | Operation running           | Poll again in 10 seconds       |
| success         | All items succeeded         | Operation complete             |
| partial_success | Some items failed           | Catalog usable, check failures |
| failed          | Operation failed completely | Check error message and logs   |
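
Because catalog operations are asynchronous, it can be convenient to poll operationStatus.status until it leaves in_progress. A small bash sketch against the field paths above, using the 10-second interval suggested in the table:

# Poll until the catalog operation finishes (success, partial_success, or failed)
# Note: exits immediately if operationStatus is not populated yet
while true; do
  STATUS=$(kubectl get e6cat my-catalog -n workspace-prod \
    -o jsonpath='{.status.operationStatus.status}')
  echo "operation status: $STATUS"
  [ "$STATUS" != "in_progress" ] && break
  sleep 10
done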

Diagnosing Issues

# Check operation status
kubectl get e6cat my-catalog -n workspace-prod -o jsonpath='{.status.operationStatus}'

# View failures inline
kubectl get e6cat my-catalog -o jsonpath='{.status.operationStatus.failures}' | jq

# Get diagnostics file path
kubectl get e6cat my-catalog -o jsonpath='{.status.operationStatus.diagnosticsFilePath}'

# Download and view diagnostics file (AWS S3)
aws s3 cp s3://bucket/diagnostics/catalog-update-2024-12-09.json - | jq

# Check storage service logs for catalog operations
kubectl logs -l app.kubernetes.io/name=storage -n workspace-prod | grep -i catalog

Diagnostics File Structure

{
  "operation": "update",
  "catalogName": "data-lake",
  "startTime": "2024-12-09T10:28:00Z",
  "endTime": "2024-12-09T10:30:00Z",
  "summary": {
    "totalDatabases": 16,
    "successfulDatabases": 15,
    "failedDatabases": 1,
    "totalTables": 250,
    "successfulTables": 234,
    "failedTables": 16
  },
  "failures": [
    {
      "type": "database",
      "name": "restricted_db",
      "reason": "Access denied: IAM role lacks glue:GetDatabase permission",
      "timestamp": "2024-12-09T10:28:15Z"
    },
    {
      "type": "table",
      "database": "db1",
      "name": "problematic_table",
      "reason": "Schema inference failed: column 'data' has unsupported type 'struct<nested:array<map<string,int>>>'",
      "timestamp": "2024-12-09T10:28:45Z"
    }
  ],
  "successful": [
    {"type": "database", "name": "db1"},
    {"type": "database", "name": "db2"},
    // ...
  ]
}
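
Once downloaded, the diagnostics file can be summarized with jq rather than read end to end. A sketch, assuming the structure shown above and the S3 path from the earlier example:

# Count failures by type (database vs table)
aws s3 cp s3://bucket/diagnostics/catalog-update-2024-12-09.json - | \
  jq '.failures | group_by(.type) | map({type: .[0].type, count: length})'

# Print one line per failure with its reason
aws s3 cp s3://bucket/diagnostics/catalog-update-2024-12-09.json - | \
  jq -r '.failures[] | "\(.type) \(.name): \(.reason)"'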

CatalogRefresh Status

Phase Values

| Phase          | Description                                       | Action                    |
|----------------|---------------------------------------------------|---------------------------|
| Pending        | Waiting to start (another refresh may be running) | Wait for lock             |
| Running        | Refresh operation in progress                     | Wait for completion       |
| Succeeded      | All databases/tables refreshed successfully       | Complete                  |
| PartialSuccess | Some items failed but catalog is usable           | Check failures            |
| Failed         | Refresh failed completely                         | Check error message       |
| TimedOut       | Exceeded configured timeout                       | Retry with longer timeout |
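
A custom-columns view makes it easy to scan recent refreshes and their timing. A minimal sketch using the status fields described in the next section:

# Refresh runs with phase and timing
kubectl get catalogrefresh -n workspace-prod \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,START:.status.startTime,DONE:.status.completionTime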

Status Fields Explained

status:
  phase: Succeeded

  # Timing
  startTime: "2024-12-09T10:00:00Z"
  completionTime: "2024-12-09T10:05:32Z"

  # Results
  databasesRefreshed: 15
  tablesRefreshed: 234
  message: "Refresh completed in 5m32s"

  # Diagnostics (for partial_success or failed)
  diagnosticsFilePath: "s3://bucket/diagnostics/refresh-2024-12-09.json"
  failures:
    - type: table
      name: "db1.broken_table"
      reason: "Parquet file corrupted"

Diagnosing Issues

# Check refresh status
kubectl get catalogrefresh -n workspace-prod

# View detailed status
kubectl describe catalogrefresh refresh-20241209 -n workspace-prod

# Check for failures
kubectl get catalogrefresh refresh-20241209 -o jsonpath='{.status.failures}' | jq

# If timed out, check what was in progress
kubectl logs -l app.kubernetes.io/name=storage -n workspace-prod | grep -i refresh

CatalogRefreshSchedule Status

Status Fields Explained

status:
  # Next scheduled run
  nextScheduledTime: "2024-12-10T02:00:00Z"

  # Currently running refresh (if any)
  activeRefreshes:
    - name: "nightly-refresh-20241209-020000"
      startTime: "2024-12-09T02:00:00Z"

  # Statistics
  statistics:
    totalRuns: 45
    successfulRuns: 42
    partialSuccessRuns: 2
    failedRuns: 1
    totalDatabasesRefreshed: 675
    totalTablesRefreshed: 10530
    averageRefreshDurationSeconds: 332

  # Recent history (last 5 runs)
  recentHistory:
    - name: "nightly-refresh-20241209-020000"
      phase: Succeeded
      startTime: "2024-12-09T02:00:00Z"
      completionTime: "2024-12-09T02:05:32Z"
      databasesRefreshed: 15
      tablesRefreshed: 234
    - name: "nightly-refresh-20241208-020000"
      phase: PartialSuccess
      startTime: "2024-12-08T02:00:00Z"
      completionTime: "2024-12-08T02:06:15Z"
      databasesRefreshed: 14
      tablesRefreshed: 220
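
The statistics block lends itself to quick arithmetic with jq, for example a rough success rate across all runs. A sketch, assuming the counters shown above:

# Approximate success rate (%) for a schedule
kubectl get catalogrefreshschedule nightly-refresh -n workspace-prod -o json | \
  jq '.status.statistics | (.successfulRuns / .totalRuns * 100 | round)'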

Diagnosing Issues

# List all scheduled refreshes
kubectl get catalogrefreshschedule -n workspace-prod

# Check recent runs
kubectl get catalogrefresh -n workspace-prod --sort-by=.metadata.creationTimestamp

# Check schedule statistics
kubectl get crs nightly-refresh -o jsonpath='{.status.statistics}' | jq

# View failed runs
kubectl get catalogrefresh -n workspace-prod -o json | \
  jq '.items[] | select(.status.phase == "Failed") | {name: .metadata.name, message: .status.message}'

Pool Status

Phase Values

| Phase     | Description                 | Action               |
|-----------|-----------------------------|----------------------|
| Pending   | Pool created, initializing  | Wait for Karpenter   |
| Creating  | Creating NodePool/NodeClass | Wait ~1-2 minutes    |
| Active    | Pool ready for allocations  | Normal operation     |
| Suspended | Pool manually suspended     | Resume when needed   |
| Deleting  | Being deleted               | Wait for finalizer   |
| Failed    | NodePool creation failed    | Check Karpenter logs |

Status Fields Explained

status:
  phase: Active

  # Node provisioning
  nodePoolName: "burst-pool-nodepool"
  nodeClassName: "burst-pool-ec2nodeclass"  # AWS

  # Capacity
  availableExecutors: 16
  allocatedExecutors: 6

  # Attached QueryServices
  attachedQueryServices:
    - name: analytics-cluster
      namespace: workspace-prod
      allocatedExecutors: 4
      compatible: true
    - name: reporting-cluster
      namespace: workspace-prod
      allocatedExecutors: 2
      compatible: true

  # Warmup status
  warmupDaemonSets:
    - name: burst-pool-warmup-executor-1-0-2123
      imageTag: "1.0.2123-abe4ff294"
      readyNodes: 4
      desiredNodes: 4
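
Free capacity is simply availableExecutors minus allocatedExecutors, which jq can compute directly. A sketch against the fields above; the pool name and namespace are the placeholders used in the examples:

# Free executor capacity in the pool
kubectl get pool burst-pool -n e6-pools -o json | \
  jq '.status | {available: .availableExecutors, allocated: .allocatedExecutors, free: (.availableExecutors - .allocatedExecutors)}'

# Which QueryServices are attached, and how many executors has each taken?
kubectl get pool burst-pool -n e6-pools -o json | \
  jq '.status.attachedQueryServices[] | {name, namespace, allocatedExecutors}'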

Diagnosing Issues

# Check pool status
kubectl get pool -A

# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml

# Check if nodes are being provisioned
kubectl get nodes -l karpenter.sh/nodepool=burst-pool-nodepool

# Check warmup DaemonSets
kubectl get daemonset -l e6data.io/pool=burst-pool

# Check Karpenter logs for provisioning issues
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter | grep burst-pool

Common Diagnostic Patterns

Check All E6 Resources

# Get all e6data resources in a namespace
kubectl get mds,qs,e6cat,catalogrefresh,crs,pool,gov -n workspace-prod

# Get all resources across all namespaces
kubectl get mds,qs,e6cat,pool -A
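
To spot anything that is not in a healthy phase across the whole cluster, the per-resource phases listed earlier can be filtered with jq. A hedged sketch, assuming Running, Ready, and Active are the healthy steady states for the kinds queried here:

# Flag resources that are not in a healthy phase
kubectl get mds,qs,e6cat,pool -A -o json | \
  jq -r '.items[]
    | select(.status.phase as $p | ["Running","Ready","Active"] | index($p) | not)
    | "\(.kind)\t\(.metadata.namespace)/\(.metadata.name)\t\(.status.phase)"'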

Collect Diagnostic Bundle

#!/bin/bash
# Collect e6data CRs, pods, events, and operator logs into a single file
NAMESPACE="${1:?Usage: $0 <namespace>}"
OUTPUT="e6-diagnostic-$(date +%Y%m%d-%H%M%S).yaml"

echo "Collecting diagnostics for namespace: $NAMESPACE"

{
  echo "--- MetadataServices ---"
  kubectl get mds -n "$NAMESPACE" -o yaml

  echo "--- QueryServices ---"
  kubectl get qs -n "$NAMESPACE" -o yaml

  echo "--- E6Catalogs ---"
  kubectl get e6cat -n "$NAMESPACE" -o yaml

  echo "--- Pods ---"
  kubectl get pods -n "$NAMESPACE" -o yaml

  echo "--- Events ---"
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp'

  echo "--- Operator Logs ---"
  kubectl logs -n e6-operator-system -l app=e6-operator --tail=500
} > "$OUTPUT"

echo "Diagnostics saved to: $OUTPUT"

Watch for Status Changes

# Watch all resources
watch -n 2 "kubectl get mds,qs,e6cat -n workspace-prod"

# Watch with custom columns
kubectl get qs -n workspace-prod -w \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,EXECUTORS:.status.executorDeployment.readyReplicas