Upgrade Guide

This guide explains how to upgrade the E6 Operator, CRDs, and MetadataServices workloads safely.

Overview

The E6 Operator consists of three independently upgradeable components:

  1. CRDs (CustomResourceDefinitions) - Schema for MetadataServices resources
  2. Operator - Controller that manages MetadataServices resources
  3. MetadataServices Workloads - Storage and Schema deployments

Upgrade Order

⚠️ Important: Always upgrade in this order:

1. CRDs (if schema changes)
2. Operator
3. MetadataServices workloads (automatic via blue-green)

Upgrading out of order may cause compatibility issues.
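
The full ordered upgrade, sketched with the Helm commands used later in this guide (chart names match the Helm sections below; the target version is illustrative):

# 1. CRDs first (only when the release changes the schema)
helm upgrade e6-operator-crds e6data/e6-operator-crds \
  --namespace e6-operator-system \
  --version 0.2.0

# 2. Then the operator
helm upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0

# 3. Workloads roll automatically once image tags are bumped
#    (see "Upgrading MetadataServices Workloads" below)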

Upgrade Types

Minor Upgrade

Example: v0.1.0 → v0.2.0

  • New features, bug fixes
  • May include new CRD fields (backward compatible)
  • No breaking changes
  • Downtime: None (blue-green deployment)

Major Upgrade

Example: v0.x → v1.0.0

  • Breaking changes possible
  • CRD schema changes
  • May require manual intervention
  • Downtime: Minimal (plan carefully)

Patch Upgrade

Example: v0.1.0 → v0.1.1

  • Bug fixes only
  • No CRD changes
  • Fully backward compatible
  • Downtime: None

Pre-Upgrade Checklist

1. Review Release Notes

# Check release notes for breaking changes
curl -s https://api.github.com/repos/e6data/e6-operator/releases/latest | \
  jq -r '.body'

2. Backup Current State

#!/bin/bash
# backup-operator-state.sh

BACKUP_DIR="backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BACKUP_DIR

echo "Backing up operator state..."

# Backup all MetadataServices resources
kubectl get metadataservices --all-namespaces -o yaml > $BACKUP_DIR/metadataservices.yaml

# Backup CRD
kubectl get crd metadataservices.e6data.io -o yaml > $BACKUP_DIR/crd.yaml

# Backup operator deployment
kubectl get deployment -n e6-operator-system e6-operator-controller-manager -o yaml \
  > $BACKUP_DIR/operator-deployment.yaml

# Backup operator RBAC
kubectl get clusterrole metadataservices-operator-manager-role -o yaml \
  > $BACKUP_DIR/clusterrole.yaml
kubectl get clusterrolebinding metadataservices-operator-manager-rolebinding -o yaml \
  > $BACKUP_DIR/clusterrolebinding.yaml

tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR/
echo "Backup complete: $BACKUP_DIR.tar.gz"

3. Check Current Versions

# Operator version
kubectl get deployment -n e6-operator-system e6-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# CRD version
kubectl get crd metadataservices.e6data.io \
  -o jsonpath='{.spec.versions[*].name}'

# MetadataServices workload versions
kubectl get metadataservices -A \
  -o custom-columns=NAME:.metadata.name,NAMESPACE:.metadata.namespace,\
STORAGE:.spec.storage.imageTag,SCHEMA:.spec.schema.imageTag

4. Verify Cluster Health

# Check operator health
kubectl get pods -n e6-operator-system

# Check MetadataServices resources
kubectl get metadataservices -A

# Check for degraded workloads
kubectl get metadataservices -A -o json | \
  jq -r '.items[] | select(.status.phase != "Stable") | "\(.metadata.name) - \(.status.phase)"'

5. Review Capacity

# Check cluster resources
kubectl top nodes

# Ensure sufficient capacity for blue-green deployments (2x during upgrade)
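
Blue-green runs the old and new workload pods side by side, so the namespace briefly needs roughly double its normal resource requests. A quick way to check node headroom before starting:

# Requested vs. allocatable resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"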

Upgrading the Operator

Method 1: Helm Upgrade (Recommended)

Step 1: Update Helm Repository

# Add/update Helm repo
helm repo add e6data https://e6data.github.io/helm-charts
helm repo update

Step 2: Review Changes

# Check what will change
helm diff upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0

# OR with custom values
helm diff upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0 \
  -f custom-values.yaml

Step 3: Upgrade Operator

# Upgrade with default values
helm upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0

# OR with custom values
helm upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0 \
  -f custom-values.yaml

Step 4: Verify Upgrade

# Check rollout status
kubectl rollout status deployment/e6-operator-controller-manager \
  -n e6-operator-system

# Verify new version
kubectl get deployment -n e6-operator-system e6-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# Check logs for errors
kubectl logs -n e6-operator-system \
  deployment/e6-operator-controller-manager \
  --tail=50

Method 2: kubectl/Kustomize Upgrade

Step 1: Update Manifests

# Clone or pull latest version
git clone https://github.com/e6data/e6-operator.git
cd e6-operator
git checkout v0.2.0

Step 2: Review Changes

# Preview what will change
kubectl diff -k config/default

Step 3: Apply Upgrade

# Apply updated manifests
kubectl apply -k config/default

# OR with custom overlay
kubectl apply -k overlays/production

Step 4: Verify Upgrade

# Check deployment status
kubectl rollout status deployment/e6-operator-controller-manager \
  -n e6-operator-system

# Verify version
kubectl get deployment -n e6-operator-system e6-operator-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

Method 3: Direct Image Update

⚠️ Not recommended - Use only for quick testing

# Update image directly
kubectl set image deployment/e6-operator-controller-manager \
  manager=your-registry/e6-operator:0.2.0 \
  -n e6-operator-system

# Watch rollout
kubectl rollout status deployment/e6-operator-controller-manager \
  -n e6-operator-system

Upgrading CRDs

When to Upgrade CRDs

Upgrade CRDs when:

  • Release notes indicate CRD schema changes
  • New fields are added to the MetadataServices spec/status
  • Validation rules are updated
  • New API versions are introduced
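
A quick way to confirm whether a release actually changes the CRD schema is to diff the published manifest against the live CRD before applying anything (the URL matches the kubectl apply example below; adjust the tag for your target release):

# Server-side diff of the new CRD manifest against the cluster
kubectl diff -f https://github.com/e6data/e6-operator/releases/download/v0.2.0/metadataservices.e6data.io.yaml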

CRD Upgrade Procedure

Method 1: Helm Upgrade

# Upgrade CRDs chart first
helm upgrade e6-operator-crds e6data/e6-operator-crds \
  --namespace e6-operator-system \
  --version 0.2.0

# Verify CRD updated
kubectl get crd metadataservices.e6data.io \
  -o jsonpath='{.spec.versions[*].name}'

Method 2: kubectl apply

# Apply updated CRD
kubectl apply -f https://github.com/e6data/e6-operator/releases/download/v0.2.0/metadataservices.e6data.io.yaml

# Verify
kubectl get crd metadataservices.e6data.io -o yaml | grep "version:"

CRD Upgrade Considerations

⚠️ Important CRD Limitations:

  1. Cannot remove fields - Only add new optional fields
  2. Cannot change field types - Field types are immutable
  3. Cannot rename fields - Use new fields, deprecate old ones
  4. Validation rules - Can be added but not easily removed
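
After a schema change, one way to sanity-check that new optional fields were added without disturbing existing ones is to inspect the served schema; the spec.storage path below is taken from the examples in this guide and may differ in your version:

# Inspect the served schema for MetadataServices
kubectl explain metadataservices.spec --recursive | head -40

# Or drill into a specific subtree, e.g. the storage block used in this guide
kubectl explain metadataservices.spec.storage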

Multi-Version CRD Support

The operator may support multiple CRD versions simultaneously:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
spec:
  versions:
  - name: v1alpha1  # Old version
    served: true
    storage: false  # Deprecated
  - name: v1beta1   # New version
    served: true
    storage: true   # Default

Migration Path:

# 1. New version added (both served)
# 2. Convert resources to new version
# 3. Deprecate old version
# 4. Remove old version (major release)
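
A hedged sketch of step 2: rewriting each object forces the API server to re-store it under the current storage version, after which the deprecated version can be dropped from the CRD's storedVersions list (the version name follows the multi-version example above; the --subresource flag requires kubectl 1.24+):

# Re-store every MetadataServices object under the new storage version
kubectl get metadataservices --all-namespaces -o yaml | kubectl replace -f -

# Then remove the old version from storedVersions so it can be retired later
kubectl patch crd metadataservices.e6data.io --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1beta1"]}}'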

Upgrading MetadataServices Workloads

MetadataServices workloads (storage and schema) use automatic blue-green deployment when image tags change.

Zero-Downtime Upgrade Process

Step 1: Update Image Tags

# Edit MetadataServices resource
kubectl edit metadataservices sample1 -n autoscalingv2

# Update image tags:
spec:
  storage:
    imageTag: "1.0.500-new"  # Updated from 1.0.437-old
  schema:
    imageTag: "1.0.600-new"  # Updated from 1.0.547-old

OR via kubectl patch:

kubectl patch metadataservices sample1 -n autoscalingv2 --type=merge -p '
{
  "spec": {
    "storage": {"imageTag": "1.0.500-new"},
    "schema": {"imageTag": "1.0.600-new"}
  }
}'

Step 2: Monitor Upgrade Progress

# Watch deployment phase
watch -n 2 "kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.deploymentPhase}: {.status.activeStrategy} -> {.status.pendingStrategy}'"

# Phases: Stable -> Deploying -> Switching -> Cleanup -> Stable

Expected Timeline:

Phase                Duration   Description
Stable → Deploying   0s         New strategy deployment initiated
Deploying            2-5 min    New pods starting, passing health checks
Deploying (grace)    2 min      Grace period for stability
Switching            10s        Traffic switched to new version
Cleanup              30s        Old version resources deleted
Stable               -          Upgrade complete

Total: ~5-10 minutes
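
To block until the rollout finishes instead of watching, kubectl wait can poll the same status field (the jsonpath form of --for requires kubectl 1.23+; the timeout is illustrative):

# Wait until the resource reports Stable again, or fail after 15 minutes
kubectl wait metadataservices/sample1 -n autoscalingv2 \
  --for=jsonpath='{.status.deploymentPhase}'=Stable \
  --timeout=15m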

Step 3: Verify Upgrade

# Check active version
kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.activeReleaseVersion}'

# Check current image tags
kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.spec.storage.imageTag}: {.spec.schema.imageTag}'

# Check pods are running
kubectl get pods -n autoscalingv2 -l app=sample1

# Check release history
kubectl get metadataservices sample1 -n autoscalingv2 \
  -o jsonpath='{.status.releaseHistory}' | jq .

Step 4: Test Application

# Test storage service
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://sample1-storage-green:9005

# Test schema service
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -- \
  curl -v http://sample1-schema-green:9006

# Check logs for errors
kubectl logs -n autoscalingv2 -l app=sample1 --tail=100 | grep ERROR

Batch Upgrades

For upgrading multiple MetadataServices resources:

#!/bin/bash
# batch-upgrade.sh

NEW_STORAGE_TAG="1.0.500-new"
NEW_SCHEMA_TAG="1.0.600-new"
NAMESPACE="autoscalingv2"

# Get all MetadataServices resources
RESOURCES=$(kubectl get metadataservices -n $NAMESPACE -o name)

for resource in $RESOURCES; do
  name=$(echo $resource | cut -d'/' -f2)

  echo "Upgrading $name..."

  kubectl patch metadataservices $name -n $NAMESPACE --type=merge -p "
{
  \"spec\": {
    \"storage\": {\"imageTag\": \"$NEW_STORAGE_TAG\"},
    \"schema\": {\"imageTag\": \"$NEW_SCHEMA_TAG\"}
  }
}"

  # Wait for upgrade to complete
  echo "Waiting for $name to stabilize..."
  while true; do
    phase=$(kubectl get metadataservices $name -n $NAMESPACE -o jsonpath='{.status.deploymentPhase}')
    if [ "$phase" = "Stable" ]; then
      echo "$name upgrade complete"
      break
    fi
    echo "  Phase: $phase"
    sleep 10
  done

  echo ""
done

echo "All upgrades complete"

Canary Upgrades

For gradual rollout:

  1. Test on staging first

    # Upgrade staging environment
    kubectl patch metadataservices sample1-staging -n staging --type=merge -p '...'
    
    # Verify for 24 hours
    # Monitor metrics, logs, errors
    

  2. Upgrade production in phases

    # Phase 1: 10% of workloads
    kubectl patch metadataservices workspace-1 -n prod --type=merge -p '...'
    
    # Monitor for 1 hour
    
    # Phase 2: 50% of workloads
    # Phase 3: 100% of workloads
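
Putting the phases together, a minimal sketch of a canary step that upgrades one workspace and waits for it to stabilize before widening the rollout (the workspace name and image tags are illustrative; the patch body follows the format shown earlier):

#!/bin/bash
# canary-upgrade.sh - upgrade a small subset first, then pause and observe

CANARY="workspace-1"   # phase 1: ~10% of workloads
NAMESPACE="prod"

kubectl patch metadataservices "$CANARY" -n "$NAMESPACE" --type=merge -p '
{
  "spec": {
    "storage": {"imageTag": "1.0.500-new"},
    "schema": {"imageTag": "1.0.600-new"}
  }
}'

# Wait for the canary to return to Stable before phases 2 and 3
kubectl wait metadataservices/"$CANARY" -n "$NAMESPACE" \
  --for=jsonpath='{.status.deploymentPhase}'=Stable --timeout=15m

echo "Canary stable - monitor metrics and logs before continuing"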
    

Rollback Procedures

Rollback Operator

Helm Rollback

# List release history
helm history e6-operator -n e6-operator-system

# Rollback to previous version
helm rollback e6-operator -n e6-operator-system

# OR rollback to specific revision
helm rollback e6-operator 3 -n e6-operator-system

kubectl Rollback

# Rollback deployment
kubectl rollout undo deployment/e6-operator-controller-manager \
  -n e6-operator-system

# OR to specific revision
kubectl rollout undo deployment/e6-operator-controller-manager \
  -n e6-operator-system --to-revision=2

Rollback CRDs

⚠️ Warning: CRD rollback is risky and not recommended.

Why CRD Rollback is Dangerous:

  • May remove fields that resources are using
  • Can cause validation errors
  • May lose data in removed fields

If absolutely necessary:

# Re-apply old CRD version
kubectl apply -f crd-v0.1.0.yaml

# Verify all resources still valid
kubectl get metadataservices -A

# Check for validation errors in operator logs
kubectl logs -n e6-operator-system deployment/e6-operator-controller-manager \
  | grep ERROR

Rollback MetadataServices Workloads

See Rollback Guide for detailed workload rollback procedures.

Quick rollback:

# Manual rollback to previous version
kubectl annotate metadataservices sample1 -n autoscalingv2 \
  e6data.io/rollback-to=$(kubectl get metadataservices sample1 -n autoscalingv2 \
    -o jsonpath='{.status.releaseHistory[-2].version}')

# OR automatic rollback on failure (happens automatically after 2 min)

Version Compatibility

Compatibility Matrix

Operator Version   CRD Version   Min Kubernetes   Storage Image   Schema Image
v0.1.0             v1alpha1      1.20+            1.0.437+        1.0.547+
v0.2.0             v1alpha1      1.22+            1.0.450+        1.0.550+
v1.0.0             v1beta1       1.24+            1.1.0+          1.1.0+

Skipping Versions

Minor versions: Can be skipped safely

# v0.1.0 -> v0.3.0 ✅ OK

Major versions: Must upgrade sequentially

# v0.5.0 -> v1.0.0 -> v2.0.0 ✅ OK
# v0.5.0 -> v2.0.0 ❌ NOT SAFE
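
A minimal sketch of stepping through intermediate major versions one at a time, reusing the Helm commands from this guide (version numbers are illustrative):

# Upgrade sequentially: v0.5.0 -> v1.0.0 -> v2.0.0
for version in 1.0.0 2.0.0; do
  helm upgrade e6-operator e6data/e6-operator \
    --namespace e6-operator-system \
    --version "$version"
  kubectl rollout status deployment/e6-operator-controller-manager \
    -n e6-operator-system
done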

Kubernetes Version Requirements

E6 Operator   Min K8s   Recommended K8s
v0.1.x        1.20      1.24+
v0.2.x        1.22      1.26+
v1.0.x        1.24      1.28+

Testing Upgrades

Dry-Run Upgrade

# Helm dry-run
helm upgrade e6-operator e6data/e6-operator \
  --namespace e6-operator-system \
  --version 0.2.0 \
  --dry-run --debug

# kubectl dry-run
kubectl apply -k config/default --dry-run=client

Test in Staging

# 1. Deploy operator to staging cluster
helm install e6-operator-staging e6data/e6-operator \
  --namespace e6-operator-staging \
  --create-namespace

# 2. Create test MetadataServices
kubectl apply -f test-metadataservices.yaml -n staging

# 3. Upgrade operator
helm upgrade e6-operator-staging e6data/e6-operator \
  --namespace e6-operator-staging \
  --version 0.2.0

# 4. Verify everything works
# 5. Proceed with production upgrade

Monitoring Upgrades

Key Metrics to Monitor

# Reconciliation errors during upgrade
rate(controller_runtime_reconcile_errors_total[5m])

# Deployment unavailability
count(kube_deployment_status_replicas_unavailable > 0)

# Pod restarts
rate(kube_pod_container_status_restarts_total[5m])

# Workqueue depth
workqueue_depth{name="metadataservices"}

Alert Rules for Upgrades

- alert: UpgradeStalled
  expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Deployment upgrade stalled"

- alert: HighErrorRateDuringUpgrade
  expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate during upgrade"
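
As written, these are bare rules; Prometheus expects them inside a groups block in a rule file. A minimal sketch that wraps the first rule and validates the file with promtool (file and group names are illustrative; if you use the Prometheus Operator, put the rules in a PrometheusRule object instead):

cat > upgrade-alerts.yml <<'EOF'
groups:
- name: e6-operator-upgrades
  rules:
  - alert: UpgradeStalled
    expr: kube_deployment_status_observed_generation != kube_deployment_metadata_generation
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Deployment upgrade stalled"
EOF

# Validate before loading into Prometheus
promtool check rules upgrade-alerts.yml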

Best Practices

1. Always Upgrade During Maintenance Window

  • Schedule upgrades during low-traffic periods
  • Inform stakeholders of planned upgrades
  • Have rollback plan ready

2. Test in Non-Production First

Development → Staging → Production

3. Upgrade One Component at a Time

CRDs → Operator → Workloads (one by one)

4. Monitor Closely

  • Watch operator logs
  • Monitor metrics
  • Check resource status
  • Verify application functionality

5. Document Upgrade Process

  • Record versions before/after
  • Document any issues encountered
  • Note rollback procedures used
  • Share lessons learned

6. Backup Before Upgrade

Always backup:

  • MetadataServices resources
  • CRD definitions
  • Operator configuration
  • RBAC manifests

Troubleshooting Upgrades

Operator Upgrade Fails

Check pod status:

kubectl get pods -n e6-operator-system
kubectl describe pod <pod-name> -n e6-operator-system
kubectl logs <pod-name> -n e6-operator-system --all-containers

Common issues:

  • Webhook certificate not ready
  • RBAC permission changes
  • Image pull errors
  • Resource constraints
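
A few quick checks for these (the names matched below are assumptions; your install may label its webhook configurations differently):

# Recent events often surface image pull and scheduling problems
kubectl get events -n e6-operator-system --sort-by=.lastTimestamp | tail -20

# Webhook configurations registered by the operator (names assumed)
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i e6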


CRD Upgrade Fails

Error: "field is immutable"

# CRD fields cannot be changed once set
# Solution: Create new CRD version, migrate resources

Error: "existing resources don't validate"

# Check which resources fail validation
kubectl get metadataservices -A -o yaml | kubectl apply --dry-run=server -f -

# Fix resources or adjust validation

Workload Upgrade Stuck

See Troubleshooting Guide - Blue-Green Issues

Additional Resources