Rollback Guide

This guide explains how to roll back MetadataServices deployments using the E6 Operator's built-in rollback capabilities.

Overview

The E6 Operator provides two types of rollback mechanisms:

  1. Manual Rollback - Explicitly roll back to a specific previous release
  2. Automatic Rollback - Automatically roll back on deployment failures

Both mechanisms leverage the operator's blue-green deployment strategy to ensure zero downtime during rollback operations.

How Rollback Works

Release Tracking

The operator automatically tracks the last 10 releases for each MetadataServices resource in the status.releaseHistory field:

status:
  activeStrategy: green
  activeReleaseVersion: v20241106-143022-a7b3c4d
  releaseHistory:
  - version: v20241106-120000-abc123
    strategy: blue
    storageTag: "1.0.437-173b0ad"
    schemaTag: "1.0.547-8b066dd"
    timestamp: "2024-11-06T12:00:00Z"
    status: "Superseded"
  - version: v20241106-143022-a7b3c4d
    strategy: green
    storageTag: "1.0.450-new"
    schemaTag: "1.0.550-new"
    timestamp: "2024-11-06T14:30:22Z"
    status: "Active"

Each release record contains:

  • version - Auto-generated release identifier
  • strategy - Blue or green deployment
  • storageTag - Storage service image tag
  • schemaTag - Schema service image tag
  • timestamp - When the release was created
  • status - Active, Superseded, or Failed

Manual Rollback

Prerequisites

  1. MetadataServices resource must have at least 2 releases in history
  2. Target release version must exist in status.releaseHistory
  3. kubectl access to the cluster
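
You can confirm the first two prerequisites in one step by counting the recorded releases (a minimal pre-flight check, assuming jq is available):

# Should print 2 or more before attempting a rollback
kubectl get metadataservices <name> -n <namespace> -o json | \
  jq '.status.releaseHistory | length'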

Step-by-Step Manual Rollback

1. List Available Releases

# Get release history
kubectl get metadataservices <name> -n <namespace> -o jsonpath='{.status.releaseHistory}' | jq .

# Example output:
[
  {
    "version": "v20241106-120000-abc123",
    "strategy": "blue",
    "storageTag": "1.0.437-173b0ad",
    "schemaTag": "1.0.547-8b066dd",
    "timestamp": "2024-11-06T12:00:00Z",
    "status": "Superseded"
  },
  {
    "version": "v20241106-143022-a7b3c4d",
    "strategy": "green",
    "storageTag": "1.0.450-new",
    "schemaTag": "1.0.550-new",
    "timestamp": "2024-11-06T14:30:22Z",
    "status": "Failed"
  }
]

2. Initiate Rollback with Annotation

# Annotate with target release version
kubectl annotate metadataservices <name> -n <namespace> \
  e6data.io/rollback-to=v20241106-120000-abc123

Example:

kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-120000-abc123

3. Monitor Rollback Progress

# Watch deployment phase
kubectl get metadataservices <name> -n <namespace> -w \
  -o jsonpath='{.status.deploymentPhase}'

# Watch active strategy
kubectl get metadataservices <name> -n <namespace> -o yaml | grep -A5 "activeStrategy"

# Watch pods
kubectl get pods -n <namespace> -w

4. Verify Rollback Completion

# Check current active release
kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.status.activeReleaseVersion}'

# Check image tags
kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.spec.storage.imageTag}'

kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.spec.schema.imageTag}'

# Verify deployment is stable
kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.status.deploymentPhase}'
# Expected: Stable

Rollback Flow

1. User adds annotation: e6data.io/rollback-to=<version>
2. Operator searches release history for target version
3. Operator updates CR spec with target release's image tags
4. Operator removes annotation (to prevent retry loops)
5. Blue-green deployment triggers automatically
6. New deployment created with rolled-back version
7. Traffic switched after grace period (2 minutes)
8. Old deployment cleaned up
9. Status updated to reflect rollback
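
Put together, a manual rollback can be scripted end to end. The sketch below is a hypothetical wrapper around the commands shown above (the resource name, namespace, and polling intervals are illustrative; it is not part of the operator):

#!/usr/bin/env bash
set -euo pipefail

NAME=sample1        # illustrative resource name
NS=autoscalingv2    # illustrative namespace

# Pick the second-most-recent release as the rollback target
# (requires at least 2 entries in status.releaseHistory)
TARGET=$(kubectl get metadataservices "$NAME" -n "$NS" \
  -o jsonpath='{.status.releaseHistory[-2].version}')

# Trigger the rollback via the annotation the operator watches
kubectl annotate metadataservices "$NAME" -n "$NS" \
  "e6data.io/rollback-to=$TARGET"

# Give the operator a moment to pick up the annotation,
# then poll until the status reports Stable again
sleep 15
until [ "$(kubectl get metadataservices "$NAME" -n "$NS" \
  -o jsonpath='{.status.deploymentPhase}')" = "Stable" ]; do
  sleep 10
done

echo "Active release is now: $(kubectl get metadataservices "$NAME" -n "$NS" \
  -o jsonpath='{.status.activeReleaseVersion}')"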

Error Handling

Scenario 1: Invalid Release Version

kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-000000-invalid

Result: Operator logs error and removes annotation:

Error: release version 'v20241106-000000-invalid' not found in release history

Solution: Check available versions with:

kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.releaseHistory[*].version}'


Scenario 2: Rollback to Active Version

# Get current active version
ACTIVE=$(kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.activeReleaseVersion}')

# Try to rollback to same version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=$ACTIVE

Result: No-op (annotation removed, no deployment triggered)

Solution: This is expected behavior; rolling back to the current version is harmless.
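
If you want to skip the round trip entirely, a small guard before annotating avoids the no-op (a hypothetical convenience check, not required by the operator):

ACTIVE=$(kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.activeReleaseVersion}')
TARGET=v20241106-120000-abc123

# Only annotate when the target actually differs from the active release
if [ "$TARGET" != "$ACTIVE" ]; then
  kubectl annotate metadataservices sample1 -n autoscalingv2 \
    "e6data.io/rollback-to=$TARGET"
else
  echo "Target $TARGET is already active; nothing to do"
fi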

Automatic Rollback

The operator automatically rolls back failed deployments to maintain service availability.

When Automatic Rollback Triggers

Automatic rollback occurs when the operator detects:

  1. ImagePullBackOff - Image tag doesn't exist or registry credentials invalid
  2. CrashLoopBackOff - Container crashes on startup
  3. Pod creation failure - Resource constraints, scheduling issues
  4. Readiness probe failure - Application fails health checks after 2 minutes

Detection Window

  • Grace period: 2 minutes after deployment marked Ready
  • Timeout: 2 minutes for new deployment to become Ready
  • Total: 4 minutes maximum before automatic rollback
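
These windows mean a failing deployment can take up to roughly 4 minutes to resolve. A minimal polling sketch to observe that window from the outside (the 10-second interval and 24-iteration cap are illustrative):

# Poll the deployment phase for up to ~4 minutes
for i in $(seq 1 24); do
  PHASE=$(kubectl get metadataservices <name> -n <namespace> \
    -o jsonpath='{.status.deploymentPhase}')
  echo "$(date +%T) phase=$PHASE"
  case "$PHASE" in
    Stable|Failed) break ;;
  esac
  sleep 10
done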

Automatic Rollback Scenarios

Scenario 1: Subsequent Deployment Failure (Zero-Downtime)

Setup:

  • Current active deployment: blue (working)
  • New deployment: green (failing)

Flow:

1. User updates image tag to new version
2. Operator creates green deployment
3. Green deployment fails (ImagePullBackOff)
4. Operator waits 2 minutes
5. Failure detected
6. Operator marks green release as "Failed"
7. Operator cleans up green deployment
8. Blue deployment remains active (ZERO DOWNTIME)
9. Status returns to Stable with blue active

Example Logs:

INFO    Deployment Deploying: Waiting for pending strategy to be ready
ERROR   Pending strategy failed: ImagePullBackOff: image not found
INFO    Automatic rollback: Cleaning up failed deployment (green)
INFO    Rollback complete: Active strategy remains blue
INFO    Deployment phase: Stable

Scenario 2: First Deployment Failure (No Rollback)

Setup:

  • First deployment of MetadataServices resource
  • No previous successful deployment

Flow:

1. User creates MetadataServices resource
2. Operator creates blue deployment
3. Blue deployment fails (CrashLoopBackOff)
4. Operator waits 2 minutes
5. Failure detected
6. No previous version exists - cannot rollback
7. Operator marks deployment as "Failed"
8. Waits for manual intervention (user fixes spec)

Example Logs:

INFO    First deployment detected (no active strategy)
INFO    Creating blue deployment
ERROR   Deployment failed: CrashLoopBackOff
ERROR   First deployment failed - no previous version to rollback to
INFO    Deployment phase: Failed
INFO    Waiting for manual intervention

Recovery:

# Fix the issue (update image tag, fix configuration, etc.)
kubectl edit metadataservices <name> -n <namespace>

# Operator will automatically retry deployment

Failure Detection Details

The operator checks for the following pod conditions:

Condition             Description                        Detection
ImagePullBackOff      Image tag not found                Pod status reason
ErrImagePull          Image pull error                   Pod status reason
CrashLoopBackOff      Container crashing                 Pod status reason
CreateContainerError  Container creation failed          Pod status reason
RunContainerError     Container run failed               Pod status reason
Pending               Pod not scheduled (>2min)          Pod phase
Failed                Pod failed                         Pod phase
Not Ready             Readiness probe failing (>2min)    Container ready status
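
To see which of these reasons a failing pod is reporting, you can print each pod's container waiting reason directly (a quick diagnostic one-liner; no label selector is applied, so it lists every pod in the namespace):

kubectl get pods -n <namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'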

Rollback Best Practices

1. Always Check Release History Before Rollback

# Get formatted release history
kubectl get metadataservices <name> -n <namespace> -o json | \
  jq -r '.status.releaseHistory[] | "\(.version) - \(.status) - Storage: \(.storageTag) Schema: \(.schemaTag)"'

2. Monitor Rollback Progress

# Tail operator logs
kubectl logs -n e6-operator-system \
  deployment/e6-operator-controller-manager \
  -f | grep -E "rollback|Rollback"

# Watch resource status
watch -n 2 "kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.status.deploymentPhase}: {.status.activeStrategy}'"

3. Verify Application Health After Rollback

# Check pod status
kubectl get pods -n <namespace> -l app=<name>

# Check logs for errors
kubectl logs -n <namespace> -l app=<name>-storage --tail=100

# Check service endpoints
kubectl get endpoints -n <namespace>

# Test connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl http://<name>-storage-<strategy>:9005

4. Document Rollback Reasons

# Add annotation explaining why rollback was needed
kubectl annotate metadataservices <name> -n <namespace> \
  e6data.io/rollback-reason="ImagePullBackOff on new schema image 1.0.999"

5. Clean Up Failed Releases

Failed releases remain in history for auditing and are pruned automatically:

# The operator automatically keeps only last 10 releases
# Older releases are automatically pruned
# No manual cleanup needed

Common Rollback Scenarios

Scenario 1: Bad Image Tag

Problem:

# Applied wrong image tag
kubectl edit metadataservices sample1 -n autoscalingv2
# Changed storage.imageTag: "1.0.999-nonexistent"

Detection:

ERROR   Pending strategy failed: ImagePullBackOff

Solution:

# Automatic rollback occurs after 2 minutes
# OR manual rollback:
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=$(kubectl get metadataservices sample1 -n autoscalingv2 \
    -o jsonpath='{.status.releaseHistory[-2].version}')


Scenario 2: Configuration Error

Problem:

# Applied invalid configuration
kubectl edit metadataservices sample1 -n autoscalingv2
# Added invalid config: E6_BUCKET: "invalid://path"

Detection:

ERROR   Pending strategy failed: CrashLoopBackOff

Solution:

# Manual rollback to previous working version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-120000-abc123


Scenario 3: Resource Exhaustion

Problem:

# Increased resources beyond cluster capacity
kubectl edit metadataservices sample1 -n autoscalingv2
# Changed storage.resources.memory: "500Gi"

Detection:

ERROR   Pending strategy failed: Pod cannot be scheduled (Insufficient memory)

Solution:

# Manual rollback or fix resources
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-120000-abc123

Troubleshooting

Rollback Not Triggering

Check 1: Verify annotation syntax

# Correct syntax
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-120000-abc123

# WRONG (will be ignored)
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  rollback-to=v20241106-120000-abc123

Check 2: Verify release exists

kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.releaseHistory[*].version}' | \
  grep v20241106-120000-abc123

Check 3: Check operator logs

kubectl logs -n e6-operator-system \
  deployment/e6-operator-controller-manager \
  --tail=100 | grep -i rollback

Automatic Rollback Not Working

Check 1: Verify failure timeout

# Automatic rollback triggers after 2 minutes
# Wait full duration before assuming failure

Check 2: Check deployment phase

kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.deploymentPhase}'

# Should be: Deploying (waiting for failure detection)

Check 3: Verify operator is running

kubectl get pods -n e6-operator-system
kubectl logs -n e6-operator-system \
  deployment/e6-operator-controller-manager --tail=50

Rollback Stuck in Deploying

Check 1: Verify new deployment readiness

kubectl get deployments -n <namespace> | grep <strategy>
kubectl describe deployment <name>-storage-<strategy> -n <namespace>

Check 2: Check pod status

kubectl get pods -n <namespace> -l strategy=<strategy>
kubectl describe pod <pod-name> -n <namespace>

Check 3: Force cleanup (last resort)

# Delete pending deployment manually
kubectl delete deployment <name>-storage-<strategy> -n <namespace>
kubectl delete deployment <name>-schema-<strategy> -n <namespace>

# Remove annotation
kubectl annotate metadataservices <name> -n <namespace> \
  e6data.io/rollback-to-

Monitoring Rollback Operations

Prometheus Metrics

# Rollback rate
rate(controller_runtime_reconcile_total{result="rollback"}[5m])

# Failed deployments
count(kube_deployment_status_replicas_unavailable{namespace="<namespace>"} > 0)

# Deployment phase distribution
count by (phase) (kube_customresource_metadataservices_deployment_phase)

Grafana Dashboard Queries

Rollback Events Panel:

sum(increase(controller_runtime_reconcile_total{
  controller="metadataservices",
  result=~"rollback.*"
}[24h])) by (result)

Active vs Failed Releases:

count by (status) (kube_customresource_metadataservices_release_history_info)

Alerts

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: e6-operator-rollback-alerts
spec:
  groups:
  - name: rollback
    rules:
    - alert: FrequentRollbacks
      expr: increase(controller_runtime_reconcile_total{result="rollback"}[1h]) > 3
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Frequent rollbacks detected"
        description: "More than 3 rollbacks per hour"

    - alert: RollbackFailed
      expr: rate(controller_runtime_reconcile_errors_total{controller="metadataservices"}[5m]) > 0
      for: 10m
      labels:
        severity: critical
      annotations:
        summary: "Rollback operation failed"
        description: "Rollback failed with errors"

FAQ

Q: Can I rollback to any previous release?

A: Yes, as long as the release exists in status.releaseHistory (last 10 releases are kept).


Q: What happens if I rollback to a release that also fails?

A: The operator will detect the failure and maintain the currently active (working) deployment. You'll need to manually fix the issue or rollback to a different version.


Q: Is there downtime during rollback?

A: No. Rollback uses the blue-green deployment strategy, so the active deployment remains running until the rollback is complete and verified.


Q: Can I disable automatic rollback?

A: Not currently. Automatic rollback is a safety feature to prevent service disruption. However, it only triggers on clear failures (ImagePullBackOff, CrashLoopBackOff, etc.).


Q: How long does a rollback take?

A: Typically 4-6 minutes:

  • 2 minutes: Deploy rolled-back version
  • 2 minutes: Grace period for stability
  • 1-2 minutes: Traffic switch and cleanup


Q: Can I rollback multiple times in a row?

A: Yes. Each rollback creates a new release in the history, so you can roll forward and backward as needed.
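
For example, after rolling back you can roll forward again by annotating with the newer version, as long as it still appears in status.releaseHistory:

kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=v20241106-143022-a7b3c4d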


Q: What if the operator crashes during rollback?

A: The operator is stateless and reconciles based on the CR spec and status. When it restarts, it will continue the rollback from where it left off.