# Blue-Green Deployment Implementation Plan

## Current Status

### ✅ Completed Phases

#### Phase 1 - Foundation
- **CRD Types Extended** (`api/v1alpha1/metadataservices_types.go`)
  - Added `ReleaseVersion` to `MetadataServicesSpec` (lines 28-31)
  - Added release tracking fields to `MetadataServicesStatus`:
    - `ActiveStrategy` - currently active color (blue/green)
    - `ActiveReleaseVersion` - currently active release version
    - `PendingStrategy` - strategy being prepared for deployment
    - `ReleaseHistory[]` - last 10 releases
  - Added `ReleaseRecord` type to track individual releases: `Version`, `Strategy`, `StorageTag`, `SchemaTag`, `Timestamp`, `Status`
- **Helper Functions Added** (`controllers/metadataservices_controller.go`)
  - `buildLabels()` - now includes the strategy in labels (lines 680-691)
  - `getActiveStrategy()` - returns the active strategy, or "blue" for a first deployment (lines 693-698)
  - `needsNewRelease()` - checks whether image tags changed (lines 700-720)
- **Build Status**: ✅ Successfully compiling
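The strategy-selection helpers above can be sketched as pure functions. This is a minimal illustration detached from the real controller types; the behavior follows the descriptions above, but the bodies are assumptions:

```go
package main

import "fmt"

// activeStrategy mirrors getActiveStrategy(): an empty status means this is
// the first deployment, which defaults to "blue".
func activeStrategy(statusStrategy string) string {
    if statusStrategy == "" {
        return "blue"
    }
    return statusStrategy
}

// targetStrategy mirrors determineTargetStrategy() (Phase 4): flip colors.
func targetStrategy(active string) string {
    if active == "blue" {
        return "green"
    }
    return "blue"
}

func main() {
    fmt.Println(activeStrategy(""))      // blue
    fmt.Println(targetStrategy("blue"))  // green
    fmt.Println(targetStrategy("green")) // blue
}
```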
#### Phase 2 - Blue-Green Resource Naming ✅ COMPLETED

Goal: Add a strategy suffix to all resource names while maintaining a single active deployment.

Status: ✅ All resources now include the strategy suffix (`-blue`/`-green`)

Changes Completed:
- **Updated Resource Names** - added the strategy suffix to:
  - Deployments: `<name>-storage-<strategy>`, `<name>-schema-<strategy>`
  - ConfigMaps: `<name>-storage-<strategy>`, `<name>-schema-<strategy>`
  - Services: `<name>-storage-<strategy>`, `<name>-schema-<strategy>`
- **Updated Reconciliation Functions**:
  - `reconcileStorageConfigMaps()` - line 282
  - `reconcileStorageDeployment()` - line 372
  - `reconcileStorageService()` - line 722
  - `reconcileSchemaConfigMaps()` - line 925
  - `reconcileSchemaDeployment()` - line 1007
  - `reconcileSchemaService()` - line 1297
  - `constructStorageDeployment()` - line 396
  - `constructSchemaDeployment()` - line 1037
- **Volume Mount Updates**: in `constructStorageDeployment` and `constructSchemaDeployment`, updated the ConfigMap references to use the strategy-suffixed names
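The naming convention can be captured in one small helper. This is a hypothetical function for illustration, not necessarily how the controller builds names:

```go
package main

import "fmt"

// resourceName builds the <name>-<component>-<strategy> convention used for
// Deployments, ConfigMaps, and Services. Hypothetical helper for illustration.
func resourceName(crName, component, strategy string) string {
    return fmt.Sprintf("%s-%s-%s", crName, component, strategy)
}

func main() {
    fmt.Println(resourceName("metadata", "storage", "blue")) // metadata-storage-blue
    fmt.Println(resourceName("metadata", "schema", "green")) // metadata-schema-green
}
```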
#### Phase 3 - Release Version Generation ✅ COMPLETED

Goal: Auto-generate release versions and track them.

Status: ✅ Release tracking fully implemented with auto-generation and history

Implementation Completed:
- Added `generateReleaseVersion()` function (auto-generates a version when `spec.releaseVersion` is empty)
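The body of `generateReleaseVersion()` is not shown in this plan; a minimal sketch, assuming a timestamp-based version string, could look like:

```go
package main

import (
    "fmt"
    "time"
)

// generateReleaseVersion sketch: derive a unique, sortable version from the
// current UTC time. The actual format used by the controller is an assumption.
func generateReleaseVersion() string {
    return "v" + time.Now().UTC().Format("20060102-150405")
}

func main() {
    fmt.Println(generateReleaseVersion())
}
```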
- Added `createReleaseRecord()` function:
```go
func (r *MetadataServicesReconciler) createReleaseRecord(qm *e6datav1alpha1.MetadataServices, strategy string) e6datav1alpha1.ReleaseRecord {
    version := qm.Spec.ReleaseVersion
    if version == "" {
        version = r.generateReleaseVersion()
    }
    return e6datav1alpha1.ReleaseRecord{
        Version:    version,
        Strategy:   strategy,
        StorageTag: qm.Spec.Storage.ImageTag,
        SchemaTag:  qm.Spec.Schema.ImageTag,
        Timestamp:  metav1.Now(),
        Status:     "Active",
    }
}
```
- Added `addToReleaseHistory()` function:
```go
func (r *MetadataServicesReconciler) addToReleaseHistory(qm *e6datav1alpha1.MetadataServices, record e6datav1alpha1.ReleaseRecord) {
    // Mark previous releases as "Superseded"
    for i := range qm.Status.ReleaseHistory {
        if qm.Status.ReleaseHistory[i].Status == "Active" {
            qm.Status.ReleaseHistory[i].Status = "Superseded"
        }
    }
    // Add the new record
    qm.Status.ReleaseHistory = append(qm.Status.ReleaseHistory, record)
    // Keep only the last 10 releases
    if len(qm.Status.ReleaseHistory) > 10 {
        qm.Status.ReleaseHistory = qm.Status.ReleaseHistory[len(qm.Status.ReleaseHistory)-10:]
    }
}
```
- Updated `Reconcile()` to track releases:
```go
// In Reconcile(), after setting defaults:
if r.needsNewRelease(qm) {
    strategy := r.determineTargetStrategy(qm)
    record := r.createReleaseRecord(qm, strategy)
    r.addToReleaseHistory(qm, record)
    qm.Status.ActiveStrategy = strategy
    qm.Status.ActiveReleaseVersion = record.Version
    if err := r.Status().Update(ctx, qm); err != nil {
        return ctrl.Result{}, err
    }
}
```
#### Phase 4 - Blue-Green Switching Logic ✅ COMPLETED

Goal: Deploy the new version alongside the old, switch traffic, then clean up the old version.

Status: ✅ Full state machine implemented with automatic deployment, switching, and cleanup
High-Level Flow (IMPLEMENTED):
1. Detect an image tag change
2. Determine the target strategy (flip from active)
3. Deploy the new strategy's resources:
   - Create the `<name>-storage-<new-strategy>` ConfigMap
   - Create the `<name>-storage-<new-strategy>` Deployment
   - Create the `<name>-storage-<new-strategy>` Service
   - Same for schema
4. Wait for the new deployment to be ready (2-minute grace period)
5. Update the common ConfigMap to point at the new services:
   - `STORAGE_SERVICE_HOST=<name>-storage-<new-strategy>`
   - `SCHEMA_SERVICE_HOST=<name>-schema-<new-strategy>`
6. Update status:
   - `ActiveStrategy = new-strategy`
   - `PendingStrategy = ""`
7. Delete the old strategy's resources:
   - Delete the `<name>-storage-<old-strategy>` Deployment
   - Delete the `<name>-storage-<old-strategy>` ConfigMap
   - Delete the `<name>-storage-<old-strategy>` Service
   - Same for schema
Implementation Details Completed:
- **Added `DeploymentPhase` to Status** (`api/v1alpha1/metadataservices_types.go:218-229`):
  - Added the `DeploymentPhase` enum: Stable, Deploying, Switching, Cleanup
  - Added the `PendingStrategyDeployedAt` timestamp
  - Added `OldStrategy` for cleanup tracking
- **Added Helper Functions** (`controllers/metadataservices_controller.go:800-970`):
  - `determineTargetStrategy()` - returns the opposite of the active strategy
  - `isStrategyReady()` - checks whether both storage and schema deployments are ready
  - `switchCommonConfigMap()` - updates the service hosts to the new strategy
  - `cleanupOldStrategy()` - deletes all old strategy resources
- **Implemented State Machine** (`controllers/metadataservices_controller.go:169-277`):
  - Stable: detects changes and initiates deployment
  - Deploying: waits for the new strategy to be ready, plus a 2-minute grace period
  - Switching: updates the common ConfigMap and switches the active strategy
  - Cleanup: deletes old resources and returns to Stable
- **Updated Reconciliation Logic**:
  - Reconciles both the active AND the pending strategy during the Deploying phase
  - All reconcile functions now accept a strategy parameter
  - Renamed to the `reconcile*ForStrategy()` pattern
Original Implementation Notes:
- Add phase to Status (for reference)
- State machine in `Reconcile()`:

```go
switch qm.Status.DeploymentPhase {
case "":
    // Initialize to Stable
    qm.Status.DeploymentPhase = "Stable"
case "Stable":
    if r.needsNewRelease(qm) {
        // Transition to Deploying
        qm.Status.DeploymentPhase = "Deploying"
        qm.Status.PendingStrategy = r.determineTargetStrategy(qm)
        // Deploy new strategy resources
    }
case "Deploying":
    // Check if the new deployment is ready
    if r.isStrategyReady(ctx, qm, qm.Status.PendingStrategy) {
        // Check the grace period (2 minutes)
        if time.Since(qm.Status.PendingStrategyDeployedAt.Time) > 2*time.Minute {
            qm.Status.DeploymentPhase = "Switching"
        }
    }
case "Switching":
    // Update the common ConfigMap
    r.switchCommonConfigMap(ctx, qm, qm.Status.PendingStrategy)
    // Update status
    oldStrategy := qm.Status.ActiveStrategy
    qm.Status.ActiveStrategy = qm.Status.PendingStrategy
    qm.Status.PendingStrategy = ""
    qm.Status.DeploymentPhase = "Cleanup"
    qm.Status.OldStrategy = oldStrategy
case "Cleanup":
    // Delete old strategy resources
    r.cleanupOldStrategy(ctx, qm, qm.Status.OldStrategy)
    qm.Status.DeploymentPhase = "Stable"
    qm.Status.OldStrategy = ""
}
```
- Add helper functions:

```go
// isStrategyReady checks whether both deployments for a strategy are ready.
func (r *MetadataServicesReconciler) isStrategyReady(ctx context.Context, qm *e6datav1alpha1.MetadataServices, strategy string) bool {
    // Check the storage deployment
    storageDep := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-storage-%s", qm.Name, strategy),
        Namespace: qm.Namespace,
    }, storageDep)
    // Guard against a nil Replicas pointer before dereferencing it
    if err != nil || storageDep.Spec.Replicas == nil || storageDep.Status.ReadyReplicas != *storageDep.Spec.Replicas {
        return false
    }
    // Check the schema deployment
    schemaDep := &appsv1.Deployment{}
    err = r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-schema-%s", qm.Name, strategy),
        Namespace: qm.Namespace,
    }, schemaDep)
    if err != nil || schemaDep.Spec.Replicas == nil || schemaDep.Status.ReadyReplicas != *schemaDep.Spec.Replicas {
        return false
    }
    return true
}
```
```go
// switchCommonConfigMap points the shared service-host config at the new strategy.
func (r *MetadataServicesReconciler) switchCommonConfigMap(ctx context.Context, qm *e6datav1alpha1.MetadataServices, newStrategy string) error {
    configMap := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-common", qm.Spec.Workspace),
        Namespace: qm.Namespace,
    }, configMap)
    if err != nil {
        return err
    }
    configContent := ""
    configContent += fmt.Sprintf("STORAGE_SERVICE_HOST=%s-storage-%s\n", qm.Name, newStrategy)
    configContent += fmt.Sprintf("SECONDARY_STORAGE_SERVICE_HOST=%s-secondary-storage-%s\n", qm.Name, newStrategy)
    configContent += fmt.Sprintf("SCHEMA_SERVICE_HOST=%s-schema-%s\n", qm.Name, newStrategy)
    if configMap.Data == nil {
        configMap.Data = map[string]string{}
    }
    configMap.Data["config.properties"] = configContent
    return r.Update(ctx, configMap)
}
```
```go
// cleanupOldStrategy deletes the old strategy's resources, tolerating objects
// that are already gone instead of silently ignoring all delete errors.
func (r *MetadataServicesReconciler) cleanupOldStrategy(ctx context.Context, qm *e6datav1alpha1.MetadataServices, oldStrategy string) error {
    // Delete the storage deployment
    storageDep := &appsv1.Deployment{}
    storageDep.Name = fmt.Sprintf("%s-storage-%s", qm.Name, oldStrategy)
    storageDep.Namespace = qm.Namespace
    if err := client.IgnoreNotFound(r.Delete(ctx, storageDep)); err != nil {
        return err
    }
    // Delete the schema deployment
    schemaDep := &appsv1.Deployment{}
    schemaDep.Name = fmt.Sprintf("%s-schema-%s", qm.Name, oldStrategy)
    schemaDep.Namespace = qm.Namespace
    if err := client.IgnoreNotFound(r.Delete(ctx, schemaDep)); err != nil {
        return err
    }
    // Delete the storage ConfigMap
    storageConfigMap := &corev1.ConfigMap{}
    storageConfigMap.Name = fmt.Sprintf("%s-storage-%s", qm.Name, oldStrategy)
    storageConfigMap.Namespace = qm.Namespace
    if err := client.IgnoreNotFound(r.Delete(ctx, storageConfigMap)); err != nil {
        return err
    }
    // Delete the storage Service
    storageService := &corev1.Service{}
    storageService.Name = fmt.Sprintf("%s-storage-%s", qm.Name, oldStrategy)
    storageService.Namespace = qm.Namespace
    if err := client.IgnoreNotFound(r.Delete(ctx, storageService)); err != nil {
        return err
    }
    // Same for schema resources...
    return nil
}
```
#### Phase 5 - Rollback Support ✅ COMPLETED

Goal: Allow rolling back to a previous release (both manual and automatic).

##### Implementation Completed:
- **Manual Rollback via Annotation** (lines 169-238):
  - Annotation: `e6data.io/rollback-to: <version>`
  - Searches the release history for the target version
  - Updates the CR spec with the target release's image tags
  - Removes the annotation to prevent retry loops
  - Triggers a blue-green deployment with the rolled-back version
  - Comprehensive comments explain the process
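The history lookup behind the rollback annotation can be sketched like this. The field names follow `ReleaseRecord` from Phase 1, but the search logic itself is an assumption:

```go
package main

import "fmt"

// releaseRecord mirrors the ReleaseRecord fields relevant to rollback.
type releaseRecord struct {
    Version    string
    StorageTag string
    SchemaTag  string
}

// findRelease scans the release history for the version named in the
// e6data.io/rollback-to annotation.
func findRelease(history []releaseRecord, version string) (releaseRecord, bool) {
    for _, rec := range history {
        if rec.Version == version {
            return rec, true
        }
    }
    return releaseRecord{}, false
}

func main() {
    history := []releaseRecord{
        {Version: "v1", StorageTag: "1.0", SchemaTag: "1.0"},
        {Version: "v2", StorageTag: "2.0", SchemaTag: "2.0"},
    }
    if rec, ok := findRelease(history, "v1"); ok {
        // A real rollback would copy these tags back into the CR spec.
        fmt.Println(rec.StorageTag, rec.SchemaTag) // 1.0 1.0
    }
}
```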
- **Automatic Rollback on Failure** (lines 295-381):
  - Detects failures after a 2-minute timeout (configurable)
  - Checks pod status for ImagePullBackOff, CrashLoopBackOff, etc.
  - Marks the failed release as "Failed" in the history
  - Cleans up the failed deployment's resources
  - Two scenarios handled:
    - First deployment failure: no rollback possible; mark as Failed and wait for a manual fix
    - Subsequent failures: automatic rollback to the active strategy (zero downtime)
  - Comprehensive comments explain both scenarios
- **Helper Function - `isStrategyFailed()`** (lines 1158-1213):
  - Checks both storage and schema deployments for failures
  - Uses `checkPodStatus()` to detect various failure types
  - Returns `(bool, string)` with the failure status and reason
  - Comprehensive function documentation
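The kind of check `checkPodStatus()` performs can be sketched as a match against a list of failure reasons. The exact reason set the controller checks is an assumption:

```go
package main

import "fmt"

// failureReasons lists container waiting-state reasons treated as failures
// (assumed set; the controller may check more).
var failureReasons = map[string]bool{
    "ImagePullBackOff": true,
    "ErrImagePull":     true,
    "CrashLoopBackOff": true,
}

// isFailureReason returns whether a pod's waiting reason counts as a
// deployment failure, plus the reason itself for the status message.
func isFailureReason(reason string) (bool, string) {
    if failureReasons[reason] {
        return true, reason
    }
    return false, ""
}

func main() {
    failed, why := isFailureReason("ImagePullBackOff")
    fmt.Println(failed, why) // true ImagePullBackOff
    failed, _ = isFailureReason("ContainerCreating")
    fmt.Println(failed) // false
}
```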
##### Test Results:

✅ Manual Rollback:
- Tested rollback from blue to green via the annotation
- Successfully reverted to the previous release
- Blue-green deployment triggered correctly

✅ Automatic Rollback:
- Tested with a bad image tag (ImagePullBackOff)
- Detected the failure after 2 minutes
- Automatically rolled back to the active strategy
- Failed deployment cleaned up
- Failed release marked in the history

✅ First Deployment Failure:
- Logic implemented to handle first-deployment failures
- Sets the phase to "Failed" when no previous version exists
- Waits for manual intervention
##### Code Locations:

- Manual rollback: `controllers/metadataservices_controller.go:169-238`
- Automatic rollback: `controllers/metadataservices_controller.go:295-381`
- Failure detection: `controllers/metadataservices_controller.go:1158-1213`
- All code includes comprehensive documentation
## Testing Plan

### Manual Testing Steps

1. Initial deployment (blue)
2. Update an image tag (trigger green)
3. Verify the common ConfigMap was updated
4. Check the release history
5. Test rollback
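The steps above might look like the following `kubectl` session. Resource, namespace, and workspace names are placeholders, and the CRD's plural resource name is assumed to be `metadataservices`; adjust everything to your CR:

```shell
# 1. Initial deployment: expect only -blue suffixed resources
kubectl get deploy,svc,cm -n <namespace> | grep blue

# 2. Trigger green by changing an image tag
kubectl patch metadataservices <name> -n <namespace> --type merge \
  -p '{"spec":{"storage":{"imageTag":"<new-tag>"}}}'

# 3. After the switch, the common ConfigMap should point at -green services
kubectl get cm <workspace>-common -n <namespace> -o yaml | grep SERVICE_HOST

# 4. Inspect the release history
kubectl get metadataservices <name> -n <namespace> -o jsonpath='{.status.releaseHistory}'

# 5. Roll back to a previous release
kubectl annotate metadataservices <name> -n <namespace> e6data.io/rollback-to=<version>
```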
## Current Code State (Nov 5, 2025 - After Phase 4 Testing)

### ✅ Working Features

- Labels: include the `strategy` field (dynamically set based on the active strategy)
- Resource names: all include the strategy suffix (`-blue`/`-green`)
- Release tracking: fully implemented with auto-generation and history (last 10)
- Change detection: uses `generation` vs `observedGeneration` to detect ANY spec change
- Blue-green logic: complete state machine with automatic switching
- Grace period: 2-minute wait after the deployment is ready before switching
- Cleanup: automatic cleanup of old strategy resources
- First deployment: only creates the active strategy (blue), no blue-green
- Subsequent deployments: trigger blue-green automatically
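The `generation` vs `observedGeneration` check reduces to a single comparison, following the standard Kubernetes convention that `metadata.generation` increments on every spec change while the controller records the last generation it processed in `status.observedGeneration`:

```go
package main

import "fmt"

// specChanged reports whether the spec changed since the controller last
// reconciled: any spec edit bumps metadata.generation past observedGeneration.
func specChanged(generation, observedGeneration int64) bool {
    return generation != observedGeneration
}

func main() {
    fmt.Println(specChanged(3, 2)) // true: a new spec change is pending
    fmt.Println(specChanged(3, 3)) // false: already reconciled
}
```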
### ✅ Test Results

- Test 1 (Blue→Green): ✅ SUCCESS
  - Change detected (CPU: 2→3)
  - Green deployed alongside blue
  - Waited for ready plus the 2-minute grace period
  - Switched active: blue→green
  - Cleaned up blue resources
  - Final state: only green running, Stable phase
### ✅ Known Issues - ALL FIXED

- ✅ FIXED: Active deployment gets updated during changes
  - Solution: added `return ctrl.Result{Requeue: true}, nil` after each status update in the state machine
  - Location: lines 237, 265, 284, 331
  - Result: state transitions complete before reconciliation runs
- ✅ FIXED: Immutable field handling
  - Solution: added error handling to catch immutable-field errors, delete the old deployment, and recreate the new one
  - Location: lines 561-580 (storage), lines 1438-1457 (schema)
  - Result: graceful handling of immutable field changes
- ✅ FIXED: Status showing Degraded when pods are Running
  - Solution: added "Degraded" to the phases that can be overridden when pods recover
  - Location: line 1238
  - Result: status correctly reflects the Running state after recovery
## Implementation Status

### ✅ ALL PHASES COMPLETED

All planned phases have been successfully implemented and tested:
- ✅ Phase 1: Foundation (CRD types, helper functions)
- ✅ Phase 2: Blue-green resource naming
- ✅ Phase 3: Release version generation
- ✅ Phase 4: Blue-green switching logic
- ✅ Phase 5: Rollback support (manual + automatic)
## Future Enhancements (Not in Original Plan)

These items were not part of the original blue-green implementation but could be added later.
## References

- Main controller: `controllers/metadataservices_controller.go`
- CRD types: `api/v1alpha1/metadataservices_types.go`
- Sample CR: `config/samples/e6data_v1alpha1_metadataservices.yaml`