CatalogRefreshSchedule
API Version: e6data.io/v1alpha1
Kind: CatalogRefreshSchedule
Short Names: crs, refreshschedule
1. Purpose
CatalogRefreshSchedule creates recurring catalog refresh operations using cron syntax. Like a Kubernetes CronJob, it creates CatalogRefresh CRs on a schedule.
Use CatalogRefreshSchedule for:
- Nightly full refresh: Sync all changes overnight
- Frequent delta refresh: Catch new tables every 30 minutes
- Business hours refresh: Only refresh during work hours
- Maintenance windows: Weekly full sync on weekends
2. High-level Behavior
When you create a CatalogRefreshSchedule CR, the operator:
- Parses cron schedule and calculates next run time
- At scheduled time, creates a CatalogRefresh CR
- Applies concurrency policy (skip, allow, or replace concurrent runs)
- Tracks history (recent runs, statistics)
- Cleans up old CatalogRefresh CRs based on history limits
Schedule Evaluation
The operator evaluates schedules every 60 seconds. A refresh is triggered if all of the following hold:
- Current time >= next scheduled time
- Schedule is not suspended
- Concurrency policy allows it (no running refresh, or policy is Allow or Replace)
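The trigger condition above can be sketched as a small shell function. This only illustrates the documented rules and is not the operator's actual code; `should_trigger` and its arguments are hypothetical names.

```shell
# Hypothetical helper: decide whether a schedule should fire now.
# Arguments: now_epoch next_run_epoch suspended(true|false) policy active_count
should_trigger() {
  now=$1 next=$2 suspended=$3 policy=$4 active=$5
  [ "$now" -ge "$next" ] || return 1        # not due yet
  [ "$suspended" = "false" ] || return 1    # suspended schedules never fire
  case "$policy" in
    Allow|Replace) return 0 ;;              # these policies permit a new run
    *) [ "$active" -eq 0 ] ;;               # Forbid: only when nothing is running
  esac
}
```

For example, `should_trigger 1000 900 false Forbid 1` returns a non-zero status because a refresh is still active under Forbid.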
3. Spec Reference
3.1 Top-level Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| e6CatalogRef | LocalObjectReference | Yes | - | Reference to E6Catalog |
| schedule | string | Yes | - | Cron expression (standard 5-field) |
| refreshType | string | No | delta | Type: full or delta |
| databases | []string | No | All | Specific databases to refresh |
| concurrencyPolicy | string | No | Forbid | Forbid, Allow, or Replace |
| suspend | bool | No | false | Suspend schedule |
| successfulRefreshHistoryLimit | int32 | No | 3 | Keep N successful CRs |
| failedRefreshHistoryLimit | int32 | No | 1 | Keep N failed CRs |
| timeout | string | No | 30m | Timeout per refresh |
3.2 Cron Schedule Format
Standard 5-field cron expression:
┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of week (0 - 6, Sun = 0)
│ │ │ │ │
* * * * *
Common patterns:
| Pattern | Schedule |
|---|---|
| 0 2 * * * | Daily at 2:00 AM |
| */30 * * * * | Every 30 minutes |
| 0 */6 * * * | Every 6 hours |
| 0 0 * * 0 | Weekly on Sunday at midnight |
| 0 9-17 * * 1-5 | Hourly 9 AM - 5 PM, Mon-Fri |
| 0 3 * * 6 | Saturday at 3:00 AM |
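Since the schedule field takes a standard 5-field expression, an expression with an extra seconds field would not match that format. A minimal pre-flight check, assuming a hypothetical helper named `is_five_field_cron`, could count the fields before a manifest is applied. It does not validate value ranges.

```shell
# Hypothetical helper: succeeds only if the expression has exactly 5 fields.
# awk does the counting so that "*" is never glob-expanded by the shell.
is_five_field_cron() {
  n=$(printf '%s\n' "$1" | awk 'NR == 1 { print NF }')
  [ "$n" -eq 5 ]
}
```

Usage: `is_five_field_cron "*/30 * * * *" && echo "looks like a 5-field expression"`.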
3.3 Concurrency Policy
| Policy | Behavior |
|---|---|
| Forbid | Skip new run if previous still running (recommended) |
| Allow | Allow concurrent runs (not recommended for catalog refresh) |
| Replace | Cancel running refresh and start new one |
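The table can be read as a mapping from policy and current state to an action. The sketch below expresses that mapping; `concurrency_action` and the action names `create`, `skip`, and `replace` are hypothetical labels chosen for illustration, not operator output.

```shell
# Hypothetical helper: print the action implied by the policy table.
# Arguments: policy active_count
concurrency_action() {
  policy=$1 active=$2
  # With nothing running, every policy simply creates the new refresh.
  if [ "$active" -eq 0 ]; then echo create; return 0; fi
  case "$policy" in
    Forbid)  echo skip ;;      # previous run still active: skip this run
    Allow)   echo create ;;    # run alongside the active refresh
    Replace) echo replace ;;   # cancel the active refresh, then start a new one
    *)       echo "unknown policy: $policy" >&2; return 1 ;;
  esac
}
```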
4. Example Manifests
4.1 Nightly Full Refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-nightly
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 2 * * *"  # Daily at 2:00 AM
  refreshType: full
  timeout: 2h
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 7  # Keep last week
  failedRefreshHistoryLimit: 3
4.2 Frequent Delta Refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-frequent
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "*/30 * * * *"  # Every 30 minutes
  refreshType: delta
  timeout: 15m
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 3
  failedRefreshHistoryLimit: 1
4.3 Business Hours Refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-business-hours
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 9-17 * * 1-5"  # Hourly, 9 AM - 5 PM, Mon-Fri
  refreshType: delta
  timeout: 30m
  concurrencyPolicy: Forbid
4.4 Weekend Full Refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-weekly-full
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 3 * * 6"  # Saturday at 3:00 AM
  refreshType: full
  timeout: 4h
  concurrencyPolicy: Forbid
  successfulRefreshHistoryLimit: 4  # Keep last month
4.5 Database-Specific Schedule
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: sales-db-hourly
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 * * * *"  # Every hour
  refreshType: delta
  databases:
    - sales
    - orders
  timeout: 15m
4.6 Combined Strategy (Full + Delta)
Create two schedules for the same catalog:
# Nightly full refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-nightly-full
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "0 2 * * *"
  refreshType: full
  timeout: 2h
---
# Every 30 min delta refresh (skips during full refresh via Forbid)
apiVersion: e6data.io/v1alpha1
kind: CatalogRefreshSchedule
metadata:
  name: data-lake-frequent-delta
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  schedule: "*/30 * * * *"
  refreshType: delta
  timeout: 15m
  concurrencyPolicy: Forbid
5. Status & Lifecycle
5.1 Status Fields
| Field | Type | Description |
|---|---|---|
| lastScheduleTime | Time | When last refresh was triggered |
| lastSuccessfulTime | Time | When last refresh succeeded |
| lastRefreshStatus | string | Last refresh result |
| active | []ObjectReference | Currently running CatalogRefresh CRs |
| recentHistory | []RefreshHistoryEntry | Last 5 refresh executions |
| statistics | RefreshStatistics | Aggregate metrics |
| conditions | []Condition | Detailed conditions |
5.2 Recent History
status:
  recentHistory:
    - refreshName: data-lake-nightly-20240115-020000
      startTime: "2024-01-15T02:00:00Z"
      completionTime: "2024-01-15T02:45:30Z"
      status: Succeeded
      databasesRefreshed: 15
      tablesRefreshed: 1250
      durationSeconds: 2730
    - refreshName: data-lake-nightly-20240114-020000
      startTime: "2024-01-14T02:00:00Z"
      completionTime: "2024-01-14T02:38:15Z"
      status: PartialSuccess
      databasesRefreshed: 15
      tablesRefreshed: 1245
      durationSeconds: 2295
      failureMessage: "5 tables failed to refresh"
5.3 Statistics
status:
  statistics:
    totalRuns: 45
    successfulRuns: 42
    failedRuns: 2
    timedOutRuns: 1
    averageDurationSeconds: 2500
    lastFailureTime: "2024-01-10T02:30:00Z"
5.4 Conditions
| Type | Description |
|---|---|
| Scheduled | Schedule is valid and active |
| Suspended | Schedule is suspended |
| CatalogReady | Referenced catalog is ready |
6. Related Resources
Dependencies
| CRD | Relationship |
|---|---|
| E6Catalog | Required - must exist in same namespace |
Creates
| CRD | Relationship |
|---|---|
| CatalogRefresh | Creates CRs on schedule |
7. Troubleshooting
7.1 Common Issues
Schedule Not Triggering
Symptoms: No CatalogRefresh CRs created at scheduled time.
Checks:
# Verify schedule is not suspended
kubectl get crs data-lake-nightly -o jsonpath='{.spec.suspend}'
# Check last schedule time
kubectl get crs data-lake-nightly -o jsonpath='{.status.lastScheduleTime}'
# Verify catalog is ready
kubectl get e6cat data-lake -o jsonpath='{.status.phase}'
# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i schedule
Too Many CatalogRefresh CRs
Symptoms: Many old CatalogRefresh CRs cluttering namespace.
Cause: History limits too high or cleanup not running.
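Resolution: Lower successfulRefreshHistoryLimit and failedRefreshHistoryLimit on the schedule. If old CRs still accumulate, check the operator logs to confirm that cleanup is running.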
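One way to lower the limits on an existing schedule is a merge patch, in the same style as the suspend and resume commands in section 7.2. The helper name `history_limit_patch` and the values 3 and 1 are illustrative assumptions, not recommended settings for every environment.

```shell
# Hypothetical helper: build a merge patch that sets both history limits.
# Arguments: successful_limit failed_limit
history_limit_patch() {
  printf '{"spec":{"successfulRefreshHistoryLimit":%d,"failedRefreshHistoryLimit":%d}}' "$1" "$2"
}

# Usage (requires cluster access):
#   kubectl patch crs data-lake-nightly --type=merge -p "$(history_limit_patch 3 1)"
```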
Manual cleanup:
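# Delete old successful refreshes created by this schedule.
# Custom resources support field selectors only on metadata.name and metadata.namespace
# unless the CRD declares extra selectable fields, so filter status.phase with JSONPath.
kubectl get catalogrefresh -l e6data.io/schedule=data-lake-nightly \
  -o jsonpath='{range .items[?(@.status.phase=="Succeeded")]}{.metadata.name}{"\n"}{end}' \
  | xargs kubectl delete catalogrefresh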
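To keep only the newest N refreshes instead of deleting by phase, a small filter can pick deletion candidates from a list sorted oldest-first. `prune_candidates` is a hypothetical helper, and the sketch assumes creation time reflects run order. It does not distinguish successful from failed runs.

```shell
# Hypothetical helper: read names (oldest first) on stdin and print every
# name except the newest N, i.e. the candidates for deletion.
# Argument: number of entries to keep
prune_candidates() {
  awk -v keep="$1" '
    { line[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print line[i] }
  '
}

# Usage (requires cluster access):
#   kubectl get catalogrefresh -l e6data.io/schedule=data-lake-nightly \
#     --sort-by=.metadata.creationTimestamp -o name \
#     | prune_candidates 3 | xargs kubectl delete
```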
Refreshes Running Concurrently
Symptoms: Multiple CatalogRefresh CRs in Running phase.
Cause: concurrencyPolicy: Allow set.
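Resolution: Set concurrencyPolicy to Forbid so that a new run is skipped while the previous one is still running.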
Refresh Always Skipped
Symptoms: lastScheduleTime updates but no new CatalogRefresh created.
Cause: Previous refresh always running when schedule triggers.
Resolution:
1. Increase timeout to let refreshes complete
2. Reduce schedule frequency
3. Use database-specific refreshes
4. Consider concurrencyPolicy: Replace (cancels slow runs)
7.2 Useful Commands
# Get schedule status
kubectl get crs data-lake-nightly -o yaml
# List all schedules
kubectl get crs
# Check active refreshes
kubectl get crs data-lake-nightly -o jsonpath='{.status.active}'
# View recent history
kubectl get crs data-lake-nightly -o jsonpath='{.status.recentHistory}' | jq
# View statistics
kubectl get crs data-lake-nightly -o jsonpath='{.status.statistics}' | jq
# Suspend schedule
kubectl patch crs data-lake-nightly --type=merge -p '{"spec":{"suspend":true}}'
# Resume schedule
kubectl patch crs data-lake-nightly --type=merge -p '{"spec":{"suspend":false}}'
# List CatalogRefresh CRs created by schedule
kubectl get catalogrefresh -l e6data.io/schedule=data-lake-nightly
# Find the next run time (approximate): the operator logs it, so check the operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i schedule
8. Best Practices
8.1 Schedule Strategy
| Use Case | Recommended Pattern |
|---|---|
| Near real-time | Delta every 15-30 min |
| Daily sync | Full at 2-4 AM local time |
| High-change catalogs | Delta every 15 min + full weekly |
| Stable catalogs | Delta hourly + full monthly |
| Cost-sensitive | Delta every 6 hours + full weekly |
8.2 History Limits
| Environment | successfulRefreshHistoryLimit | failedRefreshHistoryLimit |
|---|---|---|
| Development | 1 | 1 |
| Staging | 3 | 2 |
| Production | 7 (nightly full, one week of history) or 3 (frequent delta) | 3 |
8.3 Timeout Guidelines
Set timeout to about twice the expected refresh duration to leave a buffer:
# Check average duration from statistics
kubectl get crs data-lake-nightly -o jsonpath='{.status.statistics.averageDurationSeconds}'
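The reported average can be turned into a concrete timeout value. This sketch assumes the timeout field accepts a whole number of minutes such as 84m, in line with the 15m and 30m values used in the example manifests. `suggest_timeout` is a hypothetical helper.

```shell
# Hypothetical helper: print twice the average duration, rounded up to
# whole minutes, in the "<N>m" form used by the example manifests.
# Argument: average duration in seconds
suggest_timeout() {
  avg=$1
  echo "$(( (2 * avg + 59) / 60 ))m"
}
```

With the averageDurationSeconds of 2500 from section 5.3, `suggest_timeout 2500` prints `84m`.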
8.4 Monitoring
Key metrics to monitor:
- successfulRuns vs failedRuns ratio
- averageDurationSeconds trend (increasing = problem)
- lastFailureTime (recent failures)
- Active refreshes count (should be 0 or 1)
Set up alerts for:
- Failed refresh (immediate)
- Refresh duration > 2x average
- No successful refresh in expected window
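The alert rules above reduce to a few integer comparisons once the values are read from .status.statistics and .status.recentHistory. The helpers below are hypothetical names for illustration. They assume durations in seconds and timestamps already converted to epoch seconds.

```shell
# Integer success rate in percent; prints 0 when there are no runs yet.
# Arguments: successful_runs total_runs
success_rate_percent() {
  successful=$1 total=$2
  if [ "$total" -eq 0 ]; then echo 0; else echo $(( successful * 100 / total )); fi
}

# Succeeds (exit 0) when the last run took more than twice the average.
# Arguments: last_duration_seconds average_duration_seconds
duration_alert() {
  last=$1 avg=$2
  [ "$last" -gt $(( 2 * avg )) ]
}

# Succeeds (exit 0) when no refresh has succeeded within the expected window.
# Arguments: now_epoch last_success_epoch window_seconds
stale_alert() {
  now=$1 last_success=$2 window=$3
  [ $(( now - last_success )) -gt "$window" ]
}
```

With the sample statistics from section 5.3, `success_rate_percent 42 45` prints `93`.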