Rollback Guide¶
This guide explains how to roll back MetadataServices deployments using the E6 Operator's built-in rollback capabilities.
Overview¶
The E6 Operator provides two types of rollback mechanisms:
- Manual Rollback - Explicitly roll back to a specific previous release
- Automatic Rollback - Automatically roll back on deployment failures
Both mechanisms leverage the operator's blue-green deployment strategy to ensure zero downtime during rollback operations.
How Rollback Works¶
Release Tracking¶
The operator automatically tracks the last 10 releases for each MetadataServices resource in the status.releaseHistory field:
status:
  activeStrategy: green
  activeReleaseVersion: v20241106-143022-a7b3c4d
  releaseHistory:
    - version: v20241106-120000-abc123
      strategy: blue
      storageTag: "1.0.437-173b0ad"
      schemaTag: "1.0.547-8b066dd"
      timestamp: "2024-11-06T12:00:00Z"
      status: "Superseded"
    - version: v20241106-143022-a7b3c4d
      strategy: green
      storageTag: "1.0.450-new"
      schemaTag: "1.0.550-new"
      timestamp: "2024-11-06T14:30:22Z"
      status: "Active"
Each release record contains:
- version - Auto-generated release identifier
- strategy - Blue or green deployment
- storageTag - Storage service image tag
- schemaTag - Schema service image tag
- timestamp - When the release was created
- status - Active, Superseded, or Failed
Manual Rollback¶
Prerequisites¶
- MetadataServices resource must have at least 2 releases in history (a quick check follows this list)
- Target release version must exist in status.releaseHistory
- kubectl access to the cluster
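You can confirm the first two prerequisites before starting. The following is a minimal sketch that assumes jq is installed; replace <name>, <namespace>, and <target-version> with your values:
# Count releases in history; rollback needs at least 2
kubectl get metadataservices <name> -n <namespace> -o json | \
  jq '.status.releaseHistory | length'
# Confirm the target version is present in the history
kubectl get metadataservices <name> -n <namespace> -o json | \
  jq -r '.status.releaseHistory[].version' | grep <target-version>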
Step-by-Step Manual Rollback¶
1. List Available Releases¶
# Get release history
kubectl get metadataservices <name> -n <namespace> -o jsonpath='{.status.releaseHistory}' | jq .
# Example output:
[
  {
    "version": "v20241106-120000-abc123",
    "strategy": "blue",
    "storageTag": "1.0.437-173b0ad",
    "schemaTag": "1.0.547-8b066dd",
    "timestamp": "2024-11-06T12:00:00Z",
    "status": "Superseded"
  },
  {
    "version": "v20241106-143022-a7b3c4d",
    "strategy": "green",
    "storageTag": "1.0.450-new",
    "schemaTag": "1.0.550-new",
    "timestamp": "2024-11-06T14:30:22Z",
    "status": "Failed"
  }
]
2. Initiate Rollback with Annotation¶
# Annotate with target release version
kubectl annotate metadataservices <name> -n <namespace> \
e6data.io/rollback-to=v20241106-120000-abc123
Example:
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=v20241106-120000-abc123
3. Monitor Rollback Progress¶
# Watch deployment phase
kubectl get metadataservices <name> -n <namespace> -w \
-o jsonpath='{.status.deploymentPhase}'
# Watch active strategy
kubectl get metadataservices <name> -n <namespace> -o yaml | grep -A5 "activeStrategy"
# Watch pods
kubectl get pods -n <namespace> -w
4. Verify Rollback Completion¶
# Check current active release
kubectl get metadataservices <name> -n <namespace> \
-o jsonpath='{.status.activeReleaseVersion}'
# Check image tags
kubectl get metadataservices <name> -n <namespace> \
-o jsonpath='{.spec.storage.imageTag}'
kubectl get metadataservices <name> -n <namespace> \
-o jsonpath='{.spec.schema.imageTag}'
# Verify deployment is stable
kubectl get metadataservices <name> -n <namespace> \
-o jsonpath='{.status.deploymentPhase}'
# Expected: Stable
Rollback Flow¶
1. User adds annotation: e6data.io/rollback-to=<version>
2. Operator searches release history for target version
3. Operator updates CR spec with target release's image tags
4. Operator removes annotation (to prevent retry loops)
5. Blue-green deployment triggers automatically
6. New deployment created with rolled-back version
7. Traffic switched after grace period (2 minutes)
8. Old deployment cleaned up
9. Status updated to reflect rollback
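The flow above can be scripted end to end. The following is a minimal sketch (not part of the operator) that reuses the sample1 resource and autoscalingv2 namespace from the examples in this guide and assumes jq is installed:
# Pick the most recent superseded release and roll back to it
TARGET=$(kubectl get metadataservices sample1 -n autoscalingv2 -o json | \
  jq -r '[.status.releaseHistory[] | select(.status=="Superseded")] | last | .version')
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=$TARGET
# Wait for the blue-green switch to finish (phase returns to Stable)
watch -n 5 "kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.deploymentPhase}: {.status.activeReleaseVersion}'"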
Error Handling¶
Scenario 1: Invalid Release Version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=v20241106-000000-invalid
Result: The operator logs an error and removes the annotation; no deployment is triggered.
Solution: Check available versions with:
kubectl get metadataservices sample1 -n autoscalingv2 \
-o jsonpath='{.status.releaseHistory[*].version}'
Scenario 2: Rollback to Active Version
# Get current active version
ACTIVE=$(kubectl get metadataservices sample1 -n autoscalingv2 \
-o jsonpath='{.status.activeReleaseVersion}')
# Try to rollback to same version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=$ACTIVE
Result: No-op (annotation removed, no deployment triggered)
Solution: This is expected behavior - rollback to current version is harmless.
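If you prefer to skip the no-op entirely, a small guard (sketch only, reusing the commands above; the target version is an example value) compares the target against the active version before annotating:
TARGET=v20241106-120000-abc123
ACTIVE=$(kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.activeReleaseVersion}')
if [ "$TARGET" != "$ACTIVE" ]; then
  # Only annotate when the rollback would actually change the release
  kubectl annotate metadataservices sample1 -n autoscalingv2 \
    e6data.io/rollback-to=$TARGET
else
  echo "Target release is already active; nothing to do"
fi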
Automatic Rollback¶
The operator automatically rolls back failed deployments to maintain service availability.
When Automatic Rollback Triggers¶
Automatic rollback occurs when the operator detects:
- ImagePullBackOff - Image tag doesn't exist or registry credentials invalid
- CrashLoopBackOff - Container crashes on startup
- Pod creation failure - Resource constraints, scheduling issues
- Readiness probe failure - Application fails health checks after 2 minutes
Detection Window¶
- Grace period: 2 minutes after deployment marked Ready
- Timeout: 2 minutes for new deployment to become Ready
- Total: 4 minutes maximum before automatic rollback
Automatic Rollback Scenarios¶
Scenario 1: Subsequent Deployment Failure (Zero-Downtime)¶
Setup:
- Current active deployment: blue (working)
- New deployment: green (failing)
Flow:
1. User updates image tag to new version
2. Operator creates green deployment
3. Green deployment fails (ImagePullBackOff)
4. Operator waits 2 minutes
5. Failure detected
6. Operator marks green release as "Failed"
7. Operator cleans up green deployment
8. Blue deployment remains active (ZERO DOWNTIME)
9. Status returns to Stable with blue active
Example Logs:
INFO Deployment Deploying: Waiting for pending strategy to be ready
ERROR Pending strategy failed: ImagePullBackOff: image not found
INFO Automatic rollback: Cleaning up failed deployment (green)
INFO Rollback complete: Active strategy remains blue
INFO Deployment phase: Stable
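To confirm that the failed release was recorded in the history, you can filter for Failed entries. This sketch uses kubectl's JSONPath filter syntax against the status fields shown earlier:
# List release versions marked Failed
kubectl get metadataservices <name> -n <namespace> \
  -o jsonpath='{.status.releaseHistory[?(@.status=="Failed")].version}'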
Scenario 2: First Deployment Failure (No Rollback)¶
Setup:
- First deployment of the MetadataServices resource
- No previous successful deployment
Flow:
1. User creates MetadataServices resource
2. Operator creates blue deployment
3. Blue deployment fails (CrashLoopBackOff)
4. Operator waits 2 minutes
5. Failure detected
6. No previous version exists - cannot rollback
7. Operator marks deployment as "Failed"
8. Waits for manual intervention (user fixes spec)
Example Logs:
INFO First deployment detected (no active strategy)
INFO Creating blue deployment
ERROR Deployment failed: CrashLoopBackOff
ERROR First deployment failed - no previous version to rollback to
INFO Deployment phase: Failed
INFO Waiting for manual intervention
Recovery:
# Fix the issue (update image tag, fix configuration, etc.)
kubectl edit metadataservices <name> -n <namespace>
# Operator will automatically retry deployment
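If you would rather apply the fix non-interactively than open an editor, a merge patch against the spec fields shown earlier (spec.storage.imageTag / spec.schema.imageTag) is one option; the tag below is only an example value:
# Point the storage service back at a known-good tag
kubectl patch metadataservices <name> -n <namespace> --type=merge \
  -p '{"spec":{"storage":{"imageTag":"1.0.437-173b0ad"}}}'
# The operator reconciles the change and retries the deployment automatically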
Failure Detection Details¶
The operator checks for the following pod conditions:
| Condition | Description | Detection |
|---|---|---|
| ImagePullBackOff | Image tag not found | Pod status reason |
| ErrImagePull | Image pull error | Pod status reason |
| CrashLoopBackOff | Container crashing | Pod status reason |
| CreateContainerError | Container creation failed | Pod status reason |
| RunContainerError | Container run failed | Pod status reason |
| Pending | Pod not scheduled (>2min) | Pod phase |
| Failed | Pod failed | Pod phase |
| Not Ready | Readiness probe failing (>2min) | Container ready status |
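To see which of these conditions your pods are currently reporting, you can print each pod's phase and container waiting reason directly (a sketch; adjust the label selector to match your resource):
# Pod name, phase, and container waiting reason (e.g. ImagePullBackOff)
kubectl get pods -n <namespace> -l app=<name> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'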
Rollback Best Practices¶
1. Always Check Release History Before Rollback¶
# Get formatted release history
kubectl get metadataservices <name> -n <namespace> -o json | \
jq -r '.status.releaseHistory[] | "\(.version) - \(.status) - Storage: \(.storageTag) Schema: \(.schemaTag)"'
2. Monitor Rollback Progress¶
# Tail operator logs
kubectl logs -n e6-operator-system \
deployment/e6-operator-controller-manager \
-f | grep -E "rollback|Rollback"
# Watch resource status
watch -n 2 "kubectl get metadataservices <name> -n <namespace> \
-o jsonpath='{.status.deploymentPhase}: {.status.activeStrategy}'"
3. Verify Application Health After Rollback¶
# Check pod status
kubectl get pods -n <namespace> -l app=<name>
# Check logs for errors
kubectl logs -n <namespace> -l app=<name>-storage --tail=100
# Check service endpoints
kubectl get endpoints -n <namespace>
# Test connectivity
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
curl http://<name>-storage-<strategy>:9005
4. Document Rollback Reasons¶
# Add annotation explaining why rollback was needed
kubectl annotate metadataservices <name> -n <namespace> \
e6data.io/rollback-reason="ImagePullBackOff on new schema image 1.0.999"
5. Clean Up Failed Releases¶
Failed releases remain in the history for auditing; no manual cleanup is needed:
# The operator automatically keeps only last 10 releases
# Older releases are automatically pruned
# No manual cleanup needed
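If you want to audit what is currently retained, a quick jq filter (sketch, assuming jq is installed) lists only the entries marked Failed:
# Show failed releases still in the history
kubectl get metadataservices <name> -n <namespace> -o json | \
  jq '[.status.releaseHistory[] | select(.status=="Failed")]'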
Common Rollback Scenarios¶
Scenario 1: Bad Image Tag¶
Problem:
# Applied wrong image tag
kubectl edit metadataservices sample1 -n autoscalingv2
# Changed storage.imageTag: "1.0.999-nonexistent"
Detection: Pods report ImagePullBackOff / ErrImagePull, and the operator marks the new release as Failed.
Solution:
# Automatic rollback occurs after 2 minutes
# OR manual rollback:
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=$(kubectl get metadataservices sample1 -n autoscalingv2 \
-o jsonpath='{.status.releaseHistory[-2].version}')
Scenario 2: Configuration Error¶
Problem:
# Applied invalid configuration
kubectl edit metadataservices sample1 -n autoscalingv2
# Added invalid config: E6_BUCKET: "invalid://path"
Detection: Containers enter CrashLoopBackOff or fail readiness probes after startup.
Solution:
# Manual rollback to previous working version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=v20241106-120000-abc123
Scenario 3: Resource Exhaustion¶
Problem:
# Increased resources beyond cluster capacity
kubectl edit metadataservices sample1 -n autoscalingv2
# Changed storage.resources.memory: "500Gi"
Detection: Pods remain Pending (unschedulable) for more than 2 minutes.
Solution:
# Manual rollback or fix resources
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=v20241106-120000-abc123
Troubleshooting¶
Rollback Not Triggering¶
Check 1: Verify annotation syntax
# Correct syntax
kubectl annotate metadataservices sample1 -n autoscalingv2 \
e6data.io/rollback-to=v20241106-120000-abc123
# WRONG (will be ignored)
kubectl annotate metadataservices sample1 -n autoscalingv2 \
rollback-to=v20241106-120000-abc123
Check 2: Verify release exists
kubectl get metadataservices sample1 -n autoscalingv2 \
-o jsonpath='{.status.releaseHistory[*].version}' | \
grep v20241106-120000-abc123
Check 3: Check operator logs
kubectl logs -n e6-operator-system \
deployment/e6-operator-controller-manager \
--tail=100 | grep -i rollback
Automatic Rollback Not Working¶
Check 1: Verify the failure detection window has elapsed (up to 4 minutes; see Detection Window above)
Check 2: Check deployment phase
kubectl get metadataservices sample1 -n autoscalingv2 \
-o jsonpath='{.status.deploymentPhase}'
# Should be: Deploying (waiting for failure detection)
Check 3: Verify operator is running
kubectl get pods -n e6-operator-system
kubectl logs -n e6-operator-system \
deployment/e6-operator-controller-manager --tail=50
Rollback Stuck in Deploying¶
Check 1: Verify new deployment readiness
kubectl get deployments -n <namespace> | grep <strategy>
kubectl describe deployment <name>-storage-<strategy> -n <namespace>
Check 2: Check pod status
kubectl get pods -n <namespace> -l strategy=<strategy>
kubectl describe pod <pod-name> -n <namespace>
Check 3: Force cleanup (last resort)
# Delete pending deployment manually
kubectl delete deployment <name>-storage-<strategy> -n <namespace>
kubectl delete deployment <name>-schema-<strategy> -n <namespace>
# Remove annotation
kubectl annotate metadataservices <name> -n <namespace> \
e6data.io/rollback-to-
Monitoring Rollback Operations¶
Prometheus Metrics¶
# Rollback rate
rate(controller_runtime_reconcile_total{result="rollback"}[5m])
# Failed deployments
count(kube_deployment_status_replicas_unavailable{namespace="<namespace>"} > 0)
# Deployment phase distribution
count by (phase) (kube_customresource_metadataservices_deployment_phase)
Grafana Dashboard Queries¶
Rollback Events Panel:
sum(increase(controller_runtime_reconcile_total{
controller="metadataservices",
result=~"rollback.*"
}[24h])) by (result)
Active vs Failed Releases:
Alerts¶
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: e6-operator-rollback-alerts
spec:
  groups:
    - name: rollback
      rules:
        - alert: FrequentRollbacks
          expr: rate(controller_runtime_reconcile_total{result="rollback"}[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Frequent rollbacks detected"
            description: "More than 3 rollbacks per hour"
        - alert: RollbackFailed
          expr: controller_runtime_reconcile_errors_total{controller="metadataservices"} > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "Rollback operation failed"
            description: "Rollback failed with errors"
Additional Resources¶
FAQ¶
Q: Can I rollback to any previous release?
A: Yes, as long as the release exists in status.releaseHistory (last 10 releases are kept).
Q: What happens if I rollback to a release that also fails?
A: The operator will detect the failure and maintain the currently active (working) deployment. You'll need to manually fix the issue or rollback to a different version.
Q: Is there downtime during rollback?
A: No. Rollback uses the blue-green deployment strategy, so the active deployment remains running until the rollback is complete and verified.
Q: Can I disable automatic rollback?
A: Not currently. Automatic rollback is a safety feature to prevent service disruption. However, it only triggers on clear failures (ImagePullBackOff, CrashLoopBackOff, etc.).
Q: How long does a rollback take?
A: Typically 4-6 minutes:
- 2 minutes: Deploy rolled-back version
- 2 minutes: Grace period for stability
- 1-2 minutes: Traffic switch and cleanup
Q: Can I rollback multiple times in a row?
A: Yes. Each rollback creates a new release in the history, so you can roll forward and backward as needed.
Q: What if the operator crashes during rollback?
A: The operator is stateless and reconciles based on the CR spec and status. When it restarts, it will continue the rollback from where it left off.