Blue-Green Deployment Implementation Plan

Current Status

✅ Completed Phases

Phase 1 - Foundation

  1. CRD Types Extended (api/v1alpha1/metadataservices_types.go):
    • Added ReleaseVersion to MetadataServicesSpec (lines 28-31)
    • Added release tracking fields to MetadataServicesStatus:
      • ActiveStrategy - Currently active color (blue/green)
      • ActiveReleaseVersion - Currently active release version
      • PendingStrategy - Strategy being prepared for deployment
      • ReleaseHistory[] - Last 10 releases
    • Added ReleaseRecord type to track individual releases (sketched after this list):
      • Version, Strategy, StorageTag, SchemaTag, Timestamp, Status

  2. Helper Functions Added (controllers/metadataservices_controller.go):
    • buildLabels() - Now includes strategy in labels (lines 680-691)
    • getActiveStrategy() - Returns active strategy, or "blue" for a first deployment (lines 693-698)
    • needsNewRelease() - Checks if image tags changed (lines 700-720)

  3. Build Status: ✅ Successfully compiling
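
For reference, a minimal sketch of the new type and helper, assuming the field names listed above; the JSON tags are assumptions, and the authoritative definitions live in the files cited:

    // ReleaseRecord captures one release in the status history.
    type ReleaseRecord struct {
        Version    string      `json:"version"`
        Strategy   string      `json:"strategy"` // "blue" or "green"
        StorageTag string      `json:"storageTag"`
        SchemaTag  string      `json:"schemaTag"`
        Timestamp  metav1.Time `json:"timestamp"`
        Status     string      `json:"status"` // Active, Superseded, or Failed
    }

    // getActiveStrategy returns the active color, defaulting to "blue"
    // when nothing has been recorded yet (first deployment).
    func (r *MetadataServicesReconciler) getActiveStrategy(qm *e6datav1alpha1.MetadataServices) string {
        if qm.Status.ActiveStrategy == "" {
            return "blue"
        }
        return qm.Status.ActiveStrategy
    }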

Phase 2 - Blue-Green Resource Naming ✅ COMPLETED

Goal: Add strategy suffix to all resource names while maintaining single active deployment.

Status: ✅ All resources now include strategy suffix (-blue/-green)

Changes Completed:

  1. Updated Resource Names - Added strategy suffix to:
    • Deployments: <name>-storage-<strategy>, <name>-schema-<strategy>
    • ConfigMaps: <name>-storage-<strategy>, <name>-schema-<strategy>
    • Services: <name>-storage-<strategy>, <name>-schema-<strategy>

  2. Updated Reconciliation Functions:

    // In each reconcile function:
    func (r *MetadataServicesReconciler) reconcileStorageDeployment(ctx context.Context, qm *e6datav1alpha1.MetadataServices) error {
        strategy := r.getActiveStrategy(qm)
        deploymentName := fmt.Sprintf("%s-storage-%s", qm.Name, strategy)
        // ... rest of logic
    }
    

  3. Files Updated:

    • reconcileStorageConfigMaps() - line 282
    • reconcileStorageDeployment() - line 372
    • reconcileStorageService() - line 722
    • reconcileSchemaConfigMaps() - line 925
    • reconcileSchemaDeployment() - line 1007
    • reconcileSchemaService() - line 1297
    • constructStorageDeployment() - line 396
    • constructSchemaDeployment() - line 1037

  4. Volume Mount Updates: In constructStorageDeployment and constructSchemaDeployment, updated the configmap references (full volume wiring sketched below):

    // Storage configmap mount
    Name: fmt.Sprintf("%s-storage-%s", qm.Name, strategy)
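
For context, a sketch of the full volume wiring this produces in the pod spec; the volume name "storage-config" is an assumption:

    // Strategy-suffixed ConfigMap mounted as a volume
    Volumes: []corev1.Volume{{
        Name: "storage-config",
        VolumeSource: corev1.VolumeSource{
            ConfigMap: &corev1.ConfigMapVolumeSource{
                LocalObjectReference: corev1.LocalObjectReference{
                    Name: fmt.Sprintf("%s-storage-%s", qm.Name, strategy),
                },
            },
        },
    }},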
    

Phase 3 - Release Version Generation ✅ COMPLETED

Goal: Auto-generate release versions and track them.

Status: ✅ Release tracking fully implemented with auto-generation and history

Implementation Completed:

  1. Added generateReleaseVersion() function:

    // generateReleaseVersion builds an identifier like "v20060102-150405-<hash>",
    // where the 7-character hash is derived from the timestamp itself.
    func (r *MetadataServicesReconciler) generateReleaseVersion() string {
        timestamp := time.Now().Format("20060102-150405")
        hash := fmt.Sprintf("%x", sha256.Sum256([]byte(timestamp)))[:7]
        return fmt.Sprintf("v%s-%s", timestamp, hash)
    }
    

  2. Added createReleaseRecord() function:

    func (r *MetadataServicesReconciler) createReleaseRecord(qm *e6datav1alpha1.MetadataServices, strategy string) e6datav1alpha1.ReleaseRecord {
        version := qm.Spec.ReleaseVersion
        if version == "" {
            version = r.generateReleaseVersion()
        }
    
        return e6datav1alpha1.ReleaseRecord{
            Version:    version,
            Strategy:   strategy,
            StorageTag: qm.Spec.Storage.ImageTag,
            SchemaTag:  qm.Spec.Schema.ImageTag,
            Timestamp:  metav1.Now(),
            Status:     "Active",
        }
    }
    

  3. Added addToReleaseHistory() function:

    func (r *MetadataServicesReconciler) addToReleaseHistory(qm *e6datav1alpha1.MetadataServices, record e6datav1alpha1.ReleaseRecord) {
        // Mark previous releases as "Superseded"
        for i := range qm.Status.ReleaseHistory {
            if qm.Status.ReleaseHistory[i].Status == "Active" {
                qm.Status.ReleaseHistory[i].Status = "Superseded"
            }
        }
    
        // Add new record
        qm.Status.ReleaseHistory = append(qm.Status.ReleaseHistory, record)
    
        // Keep only last 10 releases
        if len(qm.Status.ReleaseHistory) > 10 {
            qm.Status.ReleaseHistory = qm.Status.ReleaseHistory[len(qm.Status.ReleaseHistory)-10:]
        }
    }
    

  4. Updated Reconcile() to track releases:

    // In Reconcile(), after setting defaults:
    if r.needsNewRelease(qm) {
        strategy := r.determineTargetStrategy(qm)
        record := r.createReleaseRecord(qm, strategy)
        r.addToReleaseHistory(qm, record)
    
        qm.Status.ActiveStrategy = strategy
        qm.Status.ActiveReleaseVersion = record.Version
    
        if err := r.Status().Update(ctx, qm); err != nil {
            return ctrl.Result{}, err
        }
    }
    

Phase 4 - Blue-Green Switching Logic ✅ COMPLETED

Goal: Deploy new version alongside old, switch traffic, cleanup old.

Status: ✅ Full state machine implemented with automatic deployment, switching, and cleanup

High-Level Flow (IMPLEMENTED):

1. Detect image tag change
2. Determine target strategy (flip from active)
3. Deploy new strategy resources:
   - Create <name>-storage-<new-strategy> configmap
   - Create <name>-storage-<new-strategy> deployment
   - Create <name>-storage-<new-strategy> service
   - Same for schema
4. Wait for new deployment to be ready (2 min grace period)
5. Update common configmap to point to new services:
   STORAGE_SERVICE_HOST=<name>-storage-<new-strategy>
   SCHEMA_SERVICE_HOST=<name>-schema-<new-strategy>
6. Update status:
   - ActiveStrategy = new-strategy
   - PendingStrategy = ""
7. Delete old strategy resources:
   - Delete <name>-storage-<old-strategy> deployment
   - Delete <name>-storage-<old-strategy> configmap
   - Delete <name>-storage-<old-strategy> service
   - Same for schema

Implementation Details Completed:

  1. Added DeploymentPhase to Status (api/v1alpha1/metadataservices_types.go:218-229):
    • Added DeploymentPhase enum: Stable, Deploying, Switching, Cleanup
    • Added PendingStrategyDeployedAt timestamp
    • Added OldStrategy for cleanup tracking

  2. Added Helper Functions (controllers/metadataservices_controller.go:800-970):

    • determineTargetStrategy() - Returns opposite of active strategy (sketched at the end of this phase's notes)
    • isStrategyReady() - Checks if both storage and schema deployments are ready
    • switchCommonConfigMap() - Updates service hosts to new strategy
    • cleanupOldStrategy() - Deletes all old strategy resources

  3. Implemented State Machine (controllers/metadataservices_controller.go:169-277):

    • Stable: Detects changes and initiates deployment
    • Deploying: Waits for new strategy to be ready + 2-minute grace period
    • Switching: Updates common configmap, switches active strategy
    • Cleanup: Deletes old resources, returns to Stable

  4. Updated Reconciliation Logic:

    • Reconciles both the active AND pending strategies during the Deploying phase
    • All reconcile functions now accept a strategy parameter
    • Renamed to the reconcile*ForStrategy() pattern

Original Implementation Notes:

  1. DeploymentPhase added to Status (for reference):

    // In MetadataServicesStatus, add:
    // +kubebuilder:validation:Enum=Stable;Deploying;Switching;Cleanup
    DeploymentPhase string `json:"deploymentPhase,omitempty"`
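    // Companion fields added in the same change (JSON tags assumed):
    // Set when the pending strategy becomes ready; drives the 2-minute grace period.
    PendingStrategyDeployedAt metav1.Time `json:"pendingStrategyDeployedAt,omitempty"`
    // Remembers which color awaits deletion during Cleanup.
    OldStrategy string `json:"oldStrategy,omitempty"`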
    

  2. State Machine in Reconcile():

    switch qm.Status.DeploymentPhase {
    case "":
        // Initialize to Stable
        qm.Status.DeploymentPhase = "Stable"
    
    case "Stable":
        if r.needsNewRelease(qm) {
            // Transition to Deploying
            qm.Status.DeploymentPhase = "Deploying"
            qm.Status.PendingStrategy = r.determineTargetStrategy(qm)
            // Deploy new strategy resources
        }
    
    case "Deploying":
        // Check if new deployment is ready
        if r.isStrategyReady(ctx, qm, qm.Status.PendingStrategy) {
            // Check grace period (2 minutes)
            if time.Since(qm.Status.PendingStrategyDeployedAt.Time) > 2*time.Minute {
                qm.Status.DeploymentPhase = "Switching"
            }
        }
    
    case "Switching":
        // Update common configmap to point at the new strategy's services
        if err := r.switchCommonConfigMap(ctx, qm, qm.Status.PendingStrategy); err != nil {
            return ctrl.Result{}, err
        }
    
        // Update status
        oldStrategy := qm.Status.ActiveStrategy
        qm.Status.ActiveStrategy = qm.Status.PendingStrategy
        qm.Status.PendingStrategy = ""
        qm.Status.DeploymentPhase = "Cleanup"
        qm.Status.OldStrategy = oldStrategy
    
    case "Cleanup":
        // Delete old strategy resources
        if err := r.cleanupOldStrategy(ctx, qm, qm.Status.OldStrategy); err != nil {
            return ctrl.Result{}, err
        }
        qm.Status.DeploymentPhase = "Stable"
        qm.Status.OldStrategy = ""
    }
    

  3. Add Helper Functions:

// isStrategyReady checks if both deployments for a strategy are fully ready
func (r *MetadataServicesReconciler) isStrategyReady(ctx context.Context, qm *e6datav1alpha1.MetadataServices, strategy string) bool {
    // Check storage deployment
    storageDep := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-storage-%s", qm.Name, strategy),
        Namespace: qm.Namespace,
    }, storageDep)
    // Guard against a nil Replicas pointer before dereferencing it
    if err != nil || storageDep.Spec.Replicas == nil ||
        storageDep.Status.ReadyReplicas != *storageDep.Spec.Replicas {
        return false
    }

    // Check schema deployment
    schemaDep := &appsv1.Deployment{}
    err = r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-schema-%s", qm.Name, strategy),
        Namespace: qm.Namespace,
    }, schemaDep)
    if err != nil || schemaDep.Spec.Replicas == nil ||
        schemaDep.Status.ReadyReplicas != *schemaDep.Spec.Replicas {
        return false
    }

    return true
}

// switchCommonConfigMap updates service hosts to new strategy
func (r *MetadataServicesReconciler) switchCommonConfigMap(ctx context.Context, qm *e6datav1alpha1.MetadataServices, newStrategy string) error {
    configMap := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      fmt.Sprintf("%s-common", qm.Spec.Workspace),
        Namespace: qm.Namespace,
    }, configMap)
    if err != nil {
        return err
    }

    // Rewrite config.properties wholesale with the new strategy's hosts
    configContent := ""
    configContent += fmt.Sprintf("STORAGE_SERVICE_HOST=%s-storage-%s\n", qm.Name, newStrategy)
    configContent += fmt.Sprintf("SECONDARY_STORAGE_SERVICE_HOST=%s-secondary-storage-%s\n", qm.Name, newStrategy)
    configContent += fmt.Sprintf("SCHEMA_SERVICE_HOST=%s-schema-%s\n", qm.Name, newStrategy)

    // Data can be nil on a ConfigMap created without any keys
    if configMap.Data == nil {
        configMap.Data = map[string]string{}
    }
    configMap.Data["config.properties"] = configContent
    return r.Update(ctx, configMap)
}

// cleanupOldStrategy deletes the Deployment, ConfigMap, and Service for both
// storage and schema under the old strategy. NotFound errors are ignored so
// cleanup stays idempotent across reconcile retries.
// (client is sigs.k8s.io/controller-runtime/pkg/client.)
func (r *MetadataServicesReconciler) cleanupOldStrategy(ctx context.Context, qm *e6datav1alpha1.MetadataServices, oldStrategy string) error {
    for _, component := range []string{"storage", "schema"} {
        meta := metav1.ObjectMeta{
            Name:      fmt.Sprintf("%s-%s-%s", qm.Name, component, oldStrategy),
            Namespace: qm.Namespace,
        }
        objects := []client.Object{
            &appsv1.Deployment{ObjectMeta: meta},
            &corev1.ConfigMap{ObjectMeta: meta},
            &corev1.Service{ObjectMeta: meta},
        }
        for _, obj := range objects {
            if err := r.Delete(ctx, obj); client.IgnoreNotFound(err) != nil {
                return err
            }
        }
    }
    return nil
}
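
determineTargetStrategy() is used by the state machine above but not shown in these notes; a minimal sketch, assuming it simply flips the active color:

// determineTargetStrategy returns the opposite color of the active strategy
func (r *MetadataServicesReconciler) determineTargetStrategy(qm *e6datav1alpha1.MetadataServices) string {
    if r.getActiveStrategy(qm) == "blue" {
        return "green"
    }
    return "blue"
}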

Phase 5 - Rollback Support ✅ COMPLETED

Goal: Allow rolling back to previous release (both manual and automatic).

Implementation Completed:

  1. Manual Rollback via Annotation (Lines 169-238), sketched below:
    • Annotation: e6data.io/rollback-to: <version>
    • Searches release history for the target version
    • Updates the CR spec with the target release's image tags
    • Removes the annotation to prevent retry loops
    • Triggers a blue-green deployment with the rolled-back version
    • Comprehensive comments explain the process
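
A minimal sketch of the annotation flow; findRelease() is a hypothetical helper, and the real logic lives at lines 169-238:

    const rollbackAnnotation = "e6data.io/rollback-to"

    if version, ok := qm.Annotations[rollbackAnnotation]; ok {
        if rec, found := findRelease(qm.Status.ReleaseHistory, version); found {
            // Point the spec back at the old tags; the normal blue-green
            // state machine then rolls them out as a new release.
            qm.Spec.Storage.ImageTag = rec.StorageTag
            qm.Spec.Schema.ImageTag = rec.SchemaTag
        }
        // Remove the annotation so the rollback is not retried forever.
        delete(qm.Annotations, rollbackAnnotation)
        if err := r.Update(ctx, qm); err != nil {
            return ctrl.Result{}, err
        }
    }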

  2. Automatic Rollback on Failure (Lines 295-381):

    • Detects failures after a 2-minute timeout (configurable)
    • Checks pod status for: ImagePullBackOff, CrashLoopBackOff, etc.
    • Marks the failed release as "Failed" in history
    • Cleans up failed deployment resources
    • Two scenarios handled:
      • First deployment failure: No rollback possible; mark as Failed and wait for a manual fix
      • Subsequent failures: Automatic rollback to the active strategy (zero downtime)
    • Comprehensive comments explain both scenarios

  3. Helper Function - isStrategyFailed() (Lines 1158-1213), sketched below:

    • Checks both storage and schema deployments for failures
    • Uses checkPodStatus() to detect various failure types
    • Returns (bool, string) with failure status and reason
    • Comprehensive function documentation
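
A hedged sketch of its shape; checkPodStatus() is named above, but the signature used here is an assumption:

    // isStrategyFailed reports whether either deployment of the given
    // strategy is failing, and the reason (e.g. "ImagePullBackOff").
    func (r *MetadataServicesReconciler) isStrategyFailed(ctx context.Context, qm *e6datav1alpha1.MetadataServices, strategy string) (bool, string) {
        for _, component := range []string{"storage", "schema"} {
            name := fmt.Sprintf("%s-%s-%s", qm.Name, component, strategy)
            if reason := r.checkPodStatus(ctx, name, qm.Namespace); reason != "" {
                return true, reason
            }
        }
        return false, ""
    }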

Test Results:

Manual Rollback:
  • Tested rollback from blue to green via annotation
  • Successfully reverted to the previous release
  • Blue-green deployment triggered correctly

Automatic Rollback:
  • Tested with a bad image tag (ImagePullBackOff)
  • Detected failure after 2 minutes
  • Automatically rolled back to the active strategy
  • Failed deployment cleaned up
  • Failed release marked in history

First Deployment Failure:
  • Logic implemented to handle first-deployment failures
  • Sets phase to "Failed" when no previous version exists
  • Waits for manual intervention

Code Locations:

  • Manual rollback: controllers/metadataservices_controller.go:169-238
  • Automatic rollback: controllers/metadataservices_controller.go:295-381
  • Failure detection: controllers/metadataservices_controller.go:1158-1213
  • All code includes comprehensive documentation

Testing Plan

Manual Testing Steps

  1. Initial Deployment (Blue):

    kubectl apply -f config/samples/e6data_v1alpha1_metadataservices.yaml
    # Verify: Should create blue resources
    kubectl get deployment -n autoscalingv2 | grep blue
    kubectl get metadataservices sample1 -n autoscalingv2 -o jsonpath='{.status.activeStrategy}'
    # Should show: blue
    

  2. Update Image Tag (Trigger Green):

    kubectl edit metadataservices sample1 -n autoscalingv2
    # Change storage.imageTag to new version
    # Wait 2+ minutes
    kubectl get deployment -n autoscalingv2
    # Should see both blue and green briefly, then only green
    

  3. Verify Common ConfigMap Updated:

    kubectl get configmap <workspace>-common -n autoscalingv2 -o yaml
    # Should show: STORAGE_SERVICE_HOST=sample1-storage-green
    

  4. Check Release History:

    kubectl get metadataservices sample1 -n autoscalingv2 -o jsonpath='{.status.releaseHistory}'
    # Should show 2 releases
    

  5. Test Rollback:

    kubectl annotate metadataservices sample1 -n autoscalingv2 e6data.io/rollback-to=<previous-version>
    # Wait for rollback to complete
    

Current Code State (Nov 5, 2025 - After Phase 4 Testing)

✅ Working Features

  • Labels: Include strategy field (dynamically set based on active strategy)
  • Resource names: All include strategy suffix (-blue/-green)
  • Release tracking: Fully implemented with auto-generation and history (last 10)
  • Change detection: Uses generation vs observedGeneration to detect ANY spec change (sketched after this list)
  • Blue-green logic: Complete state machine with automatic switching
  • Grace period: 2-minute wait after deployment ready before switching
  • Cleanup: Automatic cleanup of old strategy resources
  • First deployment: Only creates active strategy (blue), no blue-green
  • Subsequent deployments: Triggers blue-green automatically
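
A minimal sketch of the generation-based check; the ObservedGeneration status field is an assumption:

    // Any spec edit bumps metadata.generation, so comparing it against the
    // last generation the controller observed catches every change.
    func (r *MetadataServicesReconciler) needsNewRelease(qm *e6datav1alpha1.MetadataServices) bool {
        return qm.Generation != qm.Status.ObservedGeneration
    }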

✅ Test Results

  • Test 1 (Blue→Green): ✅ SUCCESS
    • Change detected (CPU: 2→3)
    • Green deployed alongside blue
    • Waited for ready + 2-minute grace period
    • Switched active: blue→green
    • Cleaned up blue resources
    • Final state: Only green running, Stable phase

✅ Known Issues - ALL FIXED

  1. FIXED: Active deployment gets updated during changes
    • Solution: Added return ctrl.Result{Requeue: true}, nil after each status update in the state machine
    • Location: Lines 237, 265, 284, 331
    • Result: State transitions complete before reconciliation runs

  2. FIXED: Immutable field handling
    • Solution: Added error handling to catch immutable-field errors, delete the old deployment, and recreate the new one
    • Location: Lines 561-580 (storage), Lines 1438-1457 (schema)
    • Result: Graceful handling of immutable field changes

  3. FIXED: Status showing Degraded when pods Running
    • Solution: Added "Degraded" to the phases that can be overridden when pods recover
    • Location: Line 1238
    • Result: Status correctly reflects Running state after recovery

Implementation Status

✅ ALL PHASES COMPLETED

All planned phases have been successfully implemented and tested:

  • ✅ Phase 1: Foundation (CRD types, helper functions)
  • ✅ Phase 2: Blue-green resource naming
  • ✅ Phase 3: Release version generation
  • ✅ Phase 4: Blue-green switching logic
  • ✅ Phase 5: Rollback support (manual + automatic)

Future Enhancements (Not in Original Plan)

Enhancements beyond the original blue-green scope can be tracked here as they are identified; none are listed yet.

References

  • Main controller: controllers/metadataservices_controller.go
  • CRD types: api/v1alpha1/metadataservices_types.go
  • Sample CR: config/samples/e6data_v1alpha1_metadataservices.yaml