CatalogRefreshSchedule

API Version: e6data.io/v1alpha1
Kind: CatalogRefreshSchedule
Short Names: crs, refreshschedule


1. Purpose

CatalogRefreshSchedule creates recurring catalog refresh operations using cron syntax. Like a Kubernetes CronJob, it creates CatalogRefresh CRs on a schedule.

Use CatalogRefreshSchedule for:

  • Nightly full refresh: Sync all changes overnight
  • Frequent delta refresh: Catch new tables every 30 minutes
  • Business hours refresh: Only refresh during work hours
  • Maintenance windows: Weekly full sync on weekends

2. High-level Behavior

When you create a CatalogRefreshSchedule CR, the operator:

  1. Parses the cron schedule and calculates the next run time
  2. At the scheduled time, creates a CatalogRefresh CR
  3. Applies the concurrency policy (skip, allow, or replace concurrent runs)
  4. Tracks history (recent runs, statistics)
  5. Cleans up old CatalogRefresh CRs based on the history limits

Schedule Evaluation

The operator evaluates schedules every 60 seconds. A refresh is triggered if:

  • Current time >= next scheduled time
  • Schedule is not suspended
  • Concurrency policy allows it (no running refresh, or policy is Allow/Replace)
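The three conditions above can be read as a single predicate. The following is a minimal sketch for reasoning about why a run did or did not start; the function name and parameters are hypothetical and this is not the operator's actual code.

```python
from datetime import datetime


def should_trigger(now, next_run, suspended, policy, active_count):
    """Return True when all three trigger conditions hold.

    Hypothetical helper: `active_count` is the number of CatalogRefresh CRs
    from this schedule that are still running.
    """
    if now < next_run:
        return False  # not due yet
    if suspended:
        return False  # spec.suspend is true
    if active_count > 0 and policy not in ("Allow", "Replace"):
        return False  # Forbid: a previous refresh is still running
    return True


due = datetime(2024, 1, 15, 2, 0)
print(should_trigger(datetime(2024, 1, 15, 2, 0, 30), due, False, "Forbid", 0))
```

Under this reading, a Forbid schedule whose previous refresh is still running is simply passed over until a later evaluation.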


3. Spec Reference

3.1 Top-level Fields

Field                          Type                  Required  Default  Description
e6CatalogRef                   LocalObjectReference  Yes       -        Reference to E6Catalog
schedule                       string                Yes       -        Cron expression (standard 5-field)
refreshType                    string                No        delta    Type: full or delta
databases                      []string              No        All      Specific databases to refresh
concurrencyPolicy              string                No        Forbid   Forbid, Allow, or Replace
suspend                        bool                  No        false    Suspend schedule
successfulRefreshHistoryLimit  int32                 No        3        Keep N successful CRs
failedRefreshHistoryLimit      int32                 No        1        Keep N failed CRs
timeout                        string                No        30m      Timeout per refresh
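To see how the defaults in this table combine with a partial spec, here is a small pre-flight check written as a sketch. The helper `check_spec` is hypothetical and only mirrors the table; the operator's own validation may be stricter or differ in detail.

```python
# Defaults copied from the field table above.
DEFAULTS = {
    "refreshType": "delta",
    "concurrencyPolicy": "Forbid",
    "suspend": False,
    "successfulRefreshHistoryLimit": 3,
    "failedRefreshHistoryLimit": 1,
    "timeout": "30m",
}


def check_spec(spec):
    """Return (spec_with_defaults, problems) for a spec given as a dict."""
    merged = {**DEFAULTS, **spec}
    problems = []
    if not (merged.get("e6CatalogRef") or {}).get("name"):
        problems.append("e6CatalogRef.name is required")
    if len(str(merged.get("schedule", "")).split()) != 5:
        problems.append("schedule must be a 5-field cron expression")
    if merged["refreshType"] not in ("full", "delta"):
        problems.append("refreshType must be full or delta")
    if merged["concurrencyPolicy"] not in ("Forbid", "Allow", "Replace"):
        problems.append("concurrencyPolicy must be Forbid, Allow, or Replace")
    return merged, problems


merged, problems = check_spec(
    {"e6CatalogRef": {"name": "data-lake"}, "schedule": "0 2 * * *"}
)
print(merged["refreshType"], merged["timeout"], problems)
```

An omitted `databases` field is left absent here, which matches the table's default of refreshing all databases.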

3.2 Cron Schedule Format

Standard 5-field cron expression:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sun = 0)
│ │ │ │ │
* * * * *

Common patterns:

Pattern         Schedule
0 2 * * *       Daily at 2:00 AM
*/30 * * * *    Every 30 minutes
0 */6 * * *     Every 6 hours
0 0 * * 0       Weekly on Sunday at midnight
0 9-17 * * 1-5  Hourly 9 AM - 5 PM, Mon-Fri
0 3 * * 6       Saturday at 3:00 AM
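To preview when a pattern would next fire, the sketch below implements a deliberately small cron matcher using only the standard library. It handles `*`, lists, ranges, and steps, and treats day-of-month and day-of-week the way classic cron does (either may match when both are restricted). The function names are illustrative, time zones are ignored, and the operator's own parser may behave differently on edge cases.

```python
from datetime import datetime, timedelta

# (min, max) for minute, hour, day of month, month, day of week (Sun = 0)
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def parse_field(field, lo, hi):
    """Expand one cron field into the set of integer values it matches."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if not (lo <= start <= end <= hi) or step < 1:
            raise ValueError(f"invalid cron field: {field!r}")
        values.update(range(start, end + 1, step))
    return values


def next_run(expr, after):
    """Return the first minute strictly after `after` that matches `expr`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected a 5-field cron expression")
    minutes, hours, days, months, weekdays = (
        parse_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    dom_any, dow_any = fields[2] == "*", fields[4] == "*"
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # scan at most one year, minute by minute
        cron_dow = (t.weekday() + 1) % 7  # Python: Mon = 0; cron: Sun = 0
        dom_ok, dow_ok = t.day in days, cron_dow in weekdays
        # If either day field is "*", both must match; otherwise either may.
        day_ok = (dom_ok and dow_ok) if (dom_any or dow_any) else (dom_ok or dow_ok)
        if t.minute in minutes and t.hour in hours and t.month in months and day_ok:
            return t
        t += timedelta(minutes=1)
    raise ValueError("no matching time within one year")


print(next_run("0 9-17 * * 1-5", datetime(2024, 1, 19, 17, 0)))
```

The example starts from Friday 17:00, so the business-hours pattern does not fire again until Monday at 09:00.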

3.3 Concurrency Policy

Policy   Behavior
Forbid   Skip new run if previous still running (recommended)
Allow    Allow concurrent runs (not recommended for catalog refresh)
Replace  Cancel running refresh and start new one
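The policy table can be expressed as a small decision function. This is a sketch of the documented behavior, not the operator's implementation; `plan_run` and its return shape are hypothetical.

```python
def plan_run(policy, active):
    """Decide what to do when a run is due.

    `active` is a list of names of refreshes still running for this schedule.
    Returns (create_new, to_cancel).
    """
    if policy not in ("Forbid", "Allow", "Replace"):
        raise ValueError(f"unknown concurrencyPolicy: {policy!r}")
    if not active:
        return True, []  # nothing running: every policy starts a new run
    if policy == "Forbid":
        return False, []  # skip this run
    if policy == "Allow":
        return True, []  # start alongside the running refresh
    return True, list(active)  # Replace: cancel running refreshes, then start


print(plan_run("Replace", ["data-lake-nightly-20240115-020000"]))
```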

4. Example Manifests

4.1 Nightly Full Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-nightly
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 2 * * *"  # Daily at 2:00 AM
  refreshType: full
  timeout: 2h
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 7  # Keep last week
  failedRefreshHistoryLimit: 3

4.2 Frequent Delta Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-frequent
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "*/30 * * * *"  # Every 30 minutes
  refreshType: delta
  timeout: 15m
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 3
  failedRefreshHistoryLimit: 1

4.3 Business Hours Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-business-hours
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 9-17 * * 1-5"  # Hourly, 9 AM - 5 PM, Mon-Fri
  refreshType: delta
  timeout: 30m
  concurrencyPolicy: Forbid

4.4 Weekend Full Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-weekly-full
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 3 * * 6"  # Saturday at 3:00 AM
  refreshType: full
  timeout: 4h
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 4  # Keep last month

4.5 Database-Specific Schedule

apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: sales-db-hourly
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 * * * *"  # Every hour
  refreshType: delta
  databases:
    - sales
    - orders
  timeout: 15m

4.6 Combined Strategy (Full + Delta)

Create two schedules for the same catalog:

# Nightly full refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-nightly-full
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 2 * * *"
  refreshType: full
  timeout: 2h
---
# Every 30 min delta refresh (skips during full refresh via Forbid)
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-frequent-delta
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "*/30 * * * *"
  refreshType: delta
  timeout: 15m
  concurrencyPolicy: Forbid

5. Status & Lifecycle

5.1 Status Fields

Field               Type                   Description
lastScheduleTime    Time                   When last refresh was triggered
lastSuccessfulTime  Time                   When last refresh succeeded
lastRefreshStatus   string                 Last refresh result
active              []ObjectReference      Currently running CatalogRefresh CRs
recentHistory       []RefreshHistoryEntry  Last 5 refresh executions
statistics          RefreshStatistics      Aggregate metrics
conditions          []Condition            Detailed conditions

5.2 Recent History

status:
  recentHistory:
    - refreshName: data-lake-nightly-20240115-020000
      startTime: "2024-01-15T02:00:00Z"
      completionTime: "2024-01-15T02:45:30Z"
      status: Succeeded
      databasesRefreshed: 15
      tablesRefreshed: 1250
      durationSeconds: 2730
    - refreshName: data-lake-nightly-20240114-020000
      startTime: "2024-01-14T02:00:00Z"
      completionTime: "2024-01-14T02:38:15Z"
      status: PartialSuccess
      databasesRefreshed: 15
      tablesRefreshed: 1245
      durationSeconds: 2295
      failureMessage: "5 tables failed to refresh"

5.3 Statistics

status:
  statistics:
    totalRuns: 45
    successfulRuns: 42
    failedRuns: 2
    timedOutRuns: 1
    averageDurationSeconds: 2500
    lastFailureTime: "2024-01-10T02:30:00Z"
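These counters are easier to compare across schedules as ratios. The sketch below works on the statistics block parsed into a dict (for example from `kubectl get crs ... -o json`); the helper name `run_ratios` is hypothetical.

```python
def run_ratios(stats):
    """Return success, failure, and timeout ratios from a statistics dict."""
    total = stats["totalRuns"]
    if total == 0:
        return {"success": 0.0, "failed": 0.0, "timedOut": 0.0}
    return {
        "success": stats["successfulRuns"] / total,
        "failed": stats["failedRuns"] / total,
        "timedOut": stats["timedOutRuns"] / total,
    }


# Values taken from the example status above.
example = {"totalRuns": 45, "successfulRuns": 42, "failedRuns": 2, "timedOutRuns": 1}
print({name: round(value, 3) for name, value in run_ratios(example).items()})
```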

5.4 Conditions

Type          Description
Scheduled     Schedule is valid and active
Suspended     Schedule is suspended
CatalogReady  Referenced catalog is ready

Dependencies

CRD        Relationship
E6Catalog  Required - must exist in same namespace

Creates

CRD             Relationship
CatalogRefresh  Creates CRs on schedule

7. Troubleshooting

7.1 Common Issues

Schedule Not Triggering

Symptoms: No CatalogRefresh CRs created at scheduled time.

Checks:

# Verify schedule is not suspended
kubectl get crs data-lake-nightly -o jsonpath='{.spec.suspend}'

# Check last schedule time
kubectl get crs data-lake-nightly -o jsonpath='{.status.lastScheduleTime}'

# Verify catalog is ready
kubectl get e6cat data-lake -o jsonpath='{.status.phase}'

# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i schedule

Too Many CatalogRefresh CRs

Symptoms: Many old CatalogRefresh CRs cluttering namespace.

Cause: History limits too high or cleanup not running.

Resolution:

spec:
  successfulRefreshHistoryLimit: 3  # Default is 3; reduce if it was set higher
  failedRefreshHistoryLimit: 1

Manual cleanup:

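# Delete old successful refreshes
# (field selectors on custom resources usually support only metadata.name and
# metadata.namespace, so filter on status.phase with jq instead)
kubectl get catalogrefresh -l e6data.io/schedule=data-lake-nightly -o json \
  | jq -r '.items[] | select(.status.phase == "Succeeded") | .metadata.name' \
  | xargs -r kubectl delete catalogrefresh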

Refreshes Running Concurrently

Symptoms: Multiple CatalogRefresh CRs in Running phase.

Cause: concurrencyPolicy is set to Allow.

Resolution:

spec:
  concurrencyPolicy: Forbid  # Recommended

Refresh Always Skipped

Symptoms: lastScheduleTime updates but no new CatalogRefresh created.

Cause: The previous refresh is always still running when the schedule triggers.

Resolution:

  1. Increase timeout to let refreshes complete
  2. Reduce schedule frequency
  3. Use database-specific refreshes
  4. Consider concurrencyPolicy: Replace (cancels slow runs)

7.2 Useful Commands

# Get schedule status
kubectl get crs data-lake-nightly -o yaml

# List all schedules
kubectl get crs

# Check active refreshes
kubectl get crs data-lake-nightly -o jsonpath='{.status.active}'

# View recent history
kubectl get crs data-lake-nightly -o jsonpath='{.status.recentHistory}' | jq

# View statistics
kubectl get crs data-lake-nightly -o jsonpath='{.status.statistics}' | jq

# Suspend schedule
kubectl patch crs data-lake-nightly --type=merge -p '{"spec":{"suspend":true}}'

# Resume schedule
kubectl patch crs data-lake-nightly --type=merge -p '{"spec":{"suspend":false}}'

# List CatalogRefresh CRs created by schedule
kubectl get catalogrefresh -l e6data.io/schedule=data-lake-nightly

# Find the next run time (approximate)
# The operator logs the next run time, so check the operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i schedule

8. Best Practices

8.1 Schedule Strategy

Use Case              Recommended Pattern
Near real-time        Delta every 15-30 min
Daily sync            Full at 2-4 AM local time
High-change catalogs  Delta every 15 min + full weekly
Stable catalogs       Delta hourly + full monthly
Cost-sensitive        Delta every 6 hours + full weekly

8.2 History Limits

Environment  successfulRefreshHistoryLimit          failedRefreshHistoryLimit
Development  1                                      1
Staging      3                                      2
Production   7 (weekly full) or 3 (frequent delta)  3
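To see what a given pair of limits would retain, the sketch below picks the finished refreshes that fall outside the limits, keeping the newest of each kind. It is a hypothetical illustration of history-limit pruning; how the operator classifies outcomes such as PartialSuccess is not specified here, so the caller passes an explicit `succeeded` flag.

```python
def refreshes_to_delete(finished, successful_limit, failed_limit):
    """Return names of finished refreshes that exceed the history limits.

    `finished` is a list of (name, completion_time, succeeded) tuples, where
    completion_time is an ISO 8601 UTC string so that text order is time order.
    """
    to_delete = []
    for succeeded, limit in ((True, successful_limit), (False, failed_limit)):
        group = sorted(
            (r for r in finished if r[2] == succeeded),
            key=lambda r: r[1],
            reverse=True,  # newest first
        )
        to_delete.extend(name for name, _, _ in group[limit:])
    return sorted(to_delete)


runs = [
    ("nightly-0112", "2024-01-12T02:40:00Z", True),
    ("nightly-0113", "2024-01-13T02:41:00Z", False),
    ("nightly-0114", "2024-01-14T02:38:15Z", True),
    ("nightly-0115", "2024-01-15T02:45:30Z", True),
]
print(refreshes_to_delete(runs, successful_limit=2, failed_limit=1))
```

With a successful limit of 2, only the oldest successful run is selected; the single failed run stays because it is within its limit of 1.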

8.3 Timeout Guidelines

Set timeout to roughly twice the expected duration to leave a buffer. Use the average duration from statistics as the baseline:

# Check average duration from statistics
kubectl get crs data-lake-nightly -o jsonpath='{.status.statistics.averageDurationSeconds}'
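Turning that average into a timeout value is simple arithmetic. The sketch below doubles the average and rounds up to whole minutes; it assumes the `timeout` field accepts minute values such as `15m` and `30m`, as the examples in this document do, and the helper name is hypothetical.

```python
import math


def suggested_timeout(average_duration_seconds, factor=2):
    """Return a timeout string of `factor` times the average, in whole minutes."""
    budget_seconds = average_duration_seconds * factor
    minutes = max(1, math.ceil(budget_seconds / 60))
    return f"{minutes}m"


# averageDurationSeconds from the example statistics is 2500.
print(suggested_timeout(2500))
```

For the example average of 2500 seconds this gives 84 minutes, which sits comfortably inside the 2h timeout used by the nightly full refresh manifest.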

8.4 Monitoring

Key metrics to monitor:

  • successfulRuns vs failedRuns ratio
  • averageDurationSeconds trend (increasing = problem)
  • lastFailureTime (recent failures)
  • Active refreshes count (should be 0 or 1)

Set up alerts for:

  • Failed refresh (immediate)
  • Refresh duration > 2x average
  • No successful refresh in expected window
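The three alert rules can be prototyped before wiring them into a monitoring system. The sketch below is one possible reading of them; the function name, parameters, and alert strings are hypothetical, and the inputs would come from the schedule's status fields.

```python
from datetime import datetime, timedelta


def refresh_alerts(last_run_failed, last_duration_s, average_duration_s,
                   last_successful, now, expected_window):
    """Return the list of alert messages that currently apply."""
    alerts = []
    if last_run_failed:
        alerts.append("refresh failed")
    if average_duration_s > 0 and last_duration_s > 2 * average_duration_s:
        alerts.append("refresh duration above 2x average")
    if now - last_successful > expected_window:
        alerts.append("no successful refresh in expected window")
    return alerts


print(refresh_alerts(
    last_run_failed=False,
    last_duration_s=2730,
    average_duration_s=2500,
    last_successful=datetime(2024, 1, 15, 2, 45, 30),
    now=datetime(2024, 1, 16, 3, 0),
    expected_window=timedelta(hours=24),
))
```

For a nightly schedule, an expected window slightly longer than 24 hours avoids alerting while the current night's run is still in progress.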