CatalogRefresh
- API Version: e6data.io/v1alpha1
- Kind: CatalogRefresh
- Short Names: cr, catalogref
1. Purpose
CatalogRefresh triggers a one-time metadata refresh operation on an E6Catalog. This is similar to a Kubernetes Job - it runs once and tracks completion status.
Use CatalogRefresh when you need one of the following:
- Full Refresh: Sync all metadata from scratch (new tables, schema changes, dropped tables)
- Delta Refresh: Incrementally sync only new tables not already in cache (faster)
- Database-specific Refresh: Refresh only specific databases
For scheduled refreshes, use CatalogRefreshSchedule instead.
2. High-level Behavior
When you create a CatalogRefresh CR, the operator:
- Validates the referenced E6Catalog exists and is in Ready phase
- Acquires global catalog lock (prevents concurrent refreshes on same catalog)
- Calls storage service HTTP API to initiate refresh
- Polls operation status every 10 seconds until completion
- Updates CR status with results (success, partial_success, failed, timed out)
- Releases lock when complete
API Operation
| Action | HTTP Endpoint | Method |
|---|---|---|
| Refresh | /api/v1/catalogs/{name}/refresh | POST |
Concurrency Control
Only one refresh can run per catalog at a time. If a CatalogRefresh is created while another is running:
- The new refresh enters Pending phase
- It waits for the running refresh to complete
- Then proceeds automatically
3. Spec Reference
3.1 Top-level Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| e6CatalogRef | LocalObjectReference | Yes | - | Reference to E6Catalog to refresh |
| refreshType | string | Yes | - | Type: full or delta |
| databases | []string | No | All | Specific databases to refresh |
| timeout | string | No | 30m | Maximum operation duration |
3.2 Refresh Types
| Type | Description | Use Case |
|---|---|---|
| full | Refreshes ALL metadata from source catalog | Schema changes, dropped tables, full sync |
| delta | Only refreshes NEW tables not in cache | New tables added, faster incremental sync |
3.3 Timeout Format
Duration string with units: ns, us, ms, s, m, h
Examples: 30m, 1h, 90s, 1h30m
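To compare or validate timeouts in scripts, it can help to convert these strings to seconds. The function below is a minimal sketch, not the operator's parser: the name duration_to_seconds is hypothetical, and it deliberately handles only whole-number h, m and s components, rejecting sub-second units (ns, us, ms) and fractional values.

```shell
# Convert a duration such as 30m, 90s or 1h30m to whole seconds.
# Simplification (assumption of this sketch): only integer h/m/s
# components are accepted; ns, us, ms and fractions are rejected.
duration_to_seconds() {
  _rest=$1
  _total=0
  [ -n "$_rest" ] || return 1
  while [ -n "$_rest" ]; do
    _num=${_rest%%[!0-9]*}            # leading run of digits
    [ -n "$_num" ] || return 1        # each component must start with a number
    _rest=${_rest#"$_num"}
    # Strip leading zeros so the shell does not read the value as octal.
    while [ "${#_num}" -gt 1 ] && [ "${_num#0}" != "$_num" ]; do
      _num=${_num#0}
    done
    case $_rest in
      ms*) return 1 ;;                # sub-second unit, not supported here
      h*) _total=$((_total + _num * 3600)); _rest=${_rest#h} ;;
      m*) _total=$((_total + _num * 60));   _rest=${_rest#m} ;;
      s*) _total=$((_total + _num));        _rest=${_rest#s} ;;
      *)  return 1 ;;                 # missing or unknown unit
    esac
  done
  echo "$_total"
}
```

For example, `duration_to_seconds 1h30m` prints 5400, while `duration_to_seconds 500ms` returns a non-zero status.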
4. Example Manifests
4.1 Full Refresh (All Databases)
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-full-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  timeout: 1h
4.2 Delta Refresh (Incremental)
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-delta-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta
  timeout: 30m
4.3 Database-Specific Refresh
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: sales-db-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases:
    - sales
    - orders
  timeout: 15m
4.4 Generated Name Pattern (for Automation)
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  generateName: data-lake-refresh-  # Kubernetes adds random suffix
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta
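A manifest that uses generateName has to be submitted with kubectl create, because kubectl apply requires a fixed metadata.name. The helper below is a small sketch of that step; the function name create_refresh and the manifest file name in the usage note are hypothetical.

```shell
# Create a CatalogRefresh from a manifest that uses metadata.generateName
# and print the generated object reference.
# kubectl create is used because kubectl apply rejects generateName.
create_refresh() {
  # $1: path to the manifest, $2: namespace
  kubectl create -f "$1" -n "$2" -o name
}
```

With `-o name`, kubectl prints the resource type and the generated name, so an automation script can capture it with something like `ref=$(create_refresh refresh.yaml workspace-analytics-prod)` and use it in later status checks.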
5. Status & Lifecycle
5.1 Status Fields
| Field | Type | Description |
|---|---|---|
| phase | CatalogRefreshPhase | Current phase |
| startTime | Time | When refresh started |
| completionTime | Time | When refresh completed |
| databasesRefreshed | int | Number of databases processed |
| tablesRefreshed | int | Number of tables processed |
| failures | []RefreshFailure | List of failures (if any) |
| message | string | Human-readable status |
| diagnosticsFilePath | string | Path to detailed diagnostics |
| conditions | []Condition | Detailed conditions |
5.2 Phase Values
| Phase | Description |
|---|---|
| Pending | Waiting for another refresh to complete |
| Running | Refresh in progress |
| Succeeded | Completed successfully (all items) |
| PartialSuccess | Completed with some failures |
| Failed | Complete failure |
| TimedOut | Exceeded timeout duration |
5.3 Phase Transitions
                    ┌────────────────────────────────┐
                    │                                │
                    ▼                                │
┌─────────┐    ┌─────────┐    ┌───────────┐          │
│ Pending │───▶│ Running │───▶│ Succeeded │          │
└─────────┘    └────┬────┘    └───────────┘          │
                    │                                │
                    ├────────▶ PartialSuccess ───────┤
                    │                                │
                    ├────────▶ Failed ───────────────┤
                    │                                │
                    └────────▶ TimedOut ─────────────┘
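Scripts that create a CatalogRefresh often need to block until it reaches one of the terminal phases listed in 5.2. The loop below is a sketch of client-side polling with kubectl. The function name wait_for_refresh, the POLL_SECONDS variable and the exit-code convention are assumptions made for illustration, not part of the operator.

```shell
# Poll a CatalogRefresh until it reaches a terminal phase and print that phase.
# Exit status (convention of this sketch): 0 for Succeeded,
# 1 for PartialSuccess, Failed or TimedOut, 2 if kubectl itself fails.
# POLL_SECONDS is a hypothetical knob; it defaults to 10 seconds.
wait_for_refresh() {
  # $1: CatalogRefresh name, $2: namespace
  while :; do
    _phase=$(kubectl get catalogrefresh "$1" -n "$2" \
      -o jsonpath='{.status.phase}') || return 2
    case $_phase in
      Succeeded) echo "$_phase"; return 0 ;;
      PartialSuccess|Failed|TimedOut) echo "$_phase"; return 1 ;;
      *) sleep "${POLL_SECONDS:-10}" ;;   # Pending, Running, or phase not set yet
    esac
  done
}
```

A caller might run `wait_for_refresh data-lake-full-refresh workspace-analytics-prod || echo "refresh did not fully succeed"` and then inspect status.failures.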
5.4 Failure Details
status:
  phase: PartialSuccess
  databasesRefreshed: 10
  tablesRefreshed: 450
  failures:
    - type: table
      name: sales.corrupted_table
      reason: "Failed to read schema: Invalid parquet file"
    - type: database
      name: temp_db
      reason: "Access denied: insufficient permissions"
  diagnosticsFilePath: "s3://bucket/diagnostics/refresh-2024-01-15T10-00-00.json"
  message: "Refresh completed with 2 failures. Duration: 5m30s"
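For a quick overview of a PartialSuccess result without jq, the failures list can be flattened to one line per entry with kubectl's JSONPath range syntax. This is a sketch under the assumption that each failure carries the type, name and reason fields shown above; the function name list_refresh_failures is hypothetical.

```shell
# Print one line per failure: type, name and reason separated by tabs.
list_refresh_failures() {
  # $1: CatalogRefresh name, $2: namespace
  kubectl get catalogrefresh "$1" -n "$2" \
    -o jsonpath='{range .status.failures[*]}{.type}{"\t"}{.name}{"\t"}{.reason}{"\n"}{end}'
}
```

The tab-separated output works with standard tools, for example `list_refresh_failures sales-db-refresh workspace-analytics-prod | cut -f1 | sort | uniq -c` to count failures by type.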
6. Related Resources
Dependencies
| CRD | Relationship |
|---|---|
| E6Catalog | Required - must be in Ready phase |
Created By
| CRD | Relationship |
|---|---|
| CatalogRefreshSchedule | Creates CatalogRefresh CRs on schedule |
7. Troubleshooting
7.1 Common Issues
Refresh Stuck in Pending
Symptoms: Phase stays Pending.
Cause: Another refresh is running on the same catalog.
Check:
# Find running refreshes (the status.phase field selector works only if the CRD declares it as a selectable field)
kubectl get catalogrefresh -l e6data.io/catalog=data-lake --field-selector=status.phase=Running
# Check if catalog is being created/updated
kubectl get e6cat data-lake -o jsonpath='{.status.phase}'
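Refresh TimedOut
Symptoms: Phase is TimedOut.
Causes:
1. Large catalog with many tables
2. Slow source catalog (Hive/Glue API)
3. Network issues
Resolution: Create a new CatalogRefresh with a longer timeout (see 8.2 Timeout Guidelines), or refresh specific databases with the databases field (see 4.3).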
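Both options (a longer timeout, or a narrower database list) amount to submitting a new manifest with different spec values. The generator below is a sketch for scripting that: the function name print_refresh_manifest and its argument order are hypothetical, and refreshType is fixed to full for simplicity.

```shell
# Print a CatalogRefresh manifest with a chosen timeout and, optionally,
# a list of databases. refreshType is hard-coded to full in this sketch.
print_refresh_manifest() {
  # $1: object name, $2: E6Catalog name, $3: timeout, remaining args: databases
  [ "$#" -ge 3 ] || return 1
  _name=$1
  _catalog=$2
  _timeout=$3
  shift 3
  cat <<EOF
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: $_name
spec:
  e6CatalogRef:
    name: $_catalog
  refreshType: full
  timeout: $_timeout
EOF
  if [ "$#" -gt 0 ]; then
    echo "  databases:"
    for _db in "$@"; do
      echo "    - $_db"
    done
  fi
}
```

A retry with a longer timeout could then look like `print_refresh_manifest data-lake-retry data-lake 2h | kubectl apply -n workspace-analytics-prod -f -`, and appending database names (for example `sales orders`) limits the refresh to those databases.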
PartialSuccess - Some Tables Failed
Symptoms: Phase is PartialSuccess with failures.
Resolution:
# View inline failures
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.failures}' | jq
# Get full diagnostics
DIAG_PATH=$(kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.diagnosticsFilePath}')
aws s3 cp "$DIAG_PATH" - | jq
Common failure reasons:
- Schema inference errors (unsupported types)
- Permission denied on specific tables
- Corrupt metadata files
- Network timeouts on large tables
7.2 Useful Commands
# Get refresh status
kubectl get catalogrefresh data-lake-refresh -o yaml
# Watch refresh progress
kubectl get catalogrefresh -w
# List all refreshes for a catalog
kubectl get catalogrefresh -l e6data.io/catalog=data-lake
# Get refresh duration
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.message}'
# Delete old refreshes
kubectl delete catalogrefresh -l e6data.io/catalog=data-lake --field-selector=status.phase=Succeeded
# Count tables refreshed
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.tablesRefreshed}'
8. Best Practices
8.1 Choosing Refresh Type
| Scenario | Recommended Type |
|---|---|
| New tables added | delta |
| Schema changed | full |
| Tables dropped | full |
| Regular sync | delta |
| After catalog outage | full |
| Partition updates | delta (partitions auto-refresh) |
8.2 Timeout Guidelines
| Catalog Size | Recommended Timeout |
|---|---|
| Small (< 100 tables) | 15m |
| Medium (100-1000 tables) | 30m |
| Large (1000-10000 tables) | 1h |
| Very large (> 10000 tables) | 2h or database-specific |
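When refreshes are generated by a script, the table above can be encoded as a small lookup. This is only a sketch: the function name recommended_timeout is hypothetical, and because the table's ranges share the boundary values 1000 and 10000, the sketch assigns those to the smaller bucket.

```shell
# Map an approximate table count to the timeout suggested in the table above.
# Boundary handling (assumption): exactly 1000 maps to 30m, exactly 10000 to 1h.
recommended_timeout() {
  # $1: approximate number of tables in the catalog
  if [ "$1" -lt 100 ]; then
    echo 15m
  elif [ "$1" -le 1000 ]; then
    echo 30m
  elif [ "$1" -le 10000 ]; then
    echo 1h
  else
    echo 2h
  fi
}
```

For catalogs in the largest bucket, the table also suggests database-specific refreshes as an alternative to a single 2h run.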
8.3 Database-Specific Refresh
For very large catalogs, refresh databases individually:
# Create multiple targeted refreshes
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: refresh-sales
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases: [sales]
  timeout: 30m
---
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: refresh-analytics
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases: [analytics]
  timeout: 30m
8.4 Cleanup
CatalogRefresh CRs are not automatically deleted, so clean up old refreshes periodically (for example, with the delete command in 7.2 Useful Commands).
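One way to script the cleanup is to list the refreshes for a catalog, filter on status.phase in the shell, and delete the matches. The sketch below assumes the e6data.io/catalog label used in the commands in 7.2; the function name delete_refreshes_in_phase is hypothetical. Filtering client-side avoids depending on field-selector support for status.phase.

```shell
# Delete CatalogRefresh objects for one catalog that are in a given phase.
delete_refreshes_in_phase() {
  # $1: catalog label value, $2: namespace, $3: phase (for example Succeeded)
  kubectl get catalogrefresh -n "$2" -l "e6data.io/catalog=$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' |
  while IFS="$(printf '\t')" read -r _name _phase; do
    if [ "$_phase" = "$3" ]; then
      # stdin is redirected so the delete cannot consume the list being read
      kubectl delete catalogrefresh "$_name" -n "$2" </dev/null
    fi
  done
}
```

For example, `delete_refreshes_in_phase data-lake workspace-analytics-prod Succeeded` removes completed refreshes and leaves Running, Pending and failed ones in place for inspection.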