
CatalogRefresh

API Version: e6data.io/v1alpha1
Kind: CatalogRefresh
Short Names: cr, catalogref


1. Purpose

CatalogRefresh triggers a one-time metadata refresh operation on an E6Catalog. Like a Kubernetes Job, it runs once and tracks completion status.

Use CatalogRefresh when you need to:

  • Full Refresh: Sync all metadata from scratch (new tables, schema changes, dropped tables)
  • Delta Refresh: Incrementally sync only new tables not already in cache (faster)
  • Database-specific Refresh: Refresh only specific databases

For scheduled refreshes, use CatalogRefreshSchedule instead.


2. High-level Behavior

When you create a CatalogRefresh CR, the operator:

  1. Validates the referenced E6Catalog exists and is in Ready phase
  2. Acquires global catalog lock (prevents concurrent refreshes on same catalog)
  3. Calls storage service HTTP API to initiate refresh
  4. Polls operation status every 10 seconds until completion
  5. Updates the CR status with the outcome (Succeeded, PartialSuccess, Failed, or TimedOut)
  6. Releases lock when complete
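
The control flow of steps 4 and 5 can be sketched as a polling loop with a deadline. This is a minimal illustration, not the operator's implementation: `poll_refresh`, `fetch_status`, the status strings it returns, and the injectable `clock`/`sleep` parameters are hypothetical. Only the 10-second interval and the four outcomes come from the list above.

```python
import time
from typing import Callable

# Outcomes reported by the (hypothetical) status call that end the loop.
TERMINAL = {"Succeeded", "PartialSuccess", "Failed"}


def poll_refresh(fetch_status: Callable[[], str],
                 timeout_s: float,
                 interval_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until a terminal status is seen or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            return "TimedOut"
        sleep(interval_s)


class FakeClock:
    """Test clock that advances only when sleep() is called."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Simulated run: two "Running" polls, then a terminal status.
clock = FakeClock()
answers = iter(["Running", "Running", "PartialSuccess"])
result = poll_refresh(lambda: next(answers), timeout_s=1800,
                      clock=clock.time, sleep=clock.sleep)
print(result, clock.now)
```

Passing the clock and sleep functions in makes the loop testable without waiting in real time.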

API Operation

| Action | HTTP Endpoint | Method |
|---|---|---|
| Refresh | /api/v1/catalogs/{name}/refresh | POST |

Concurrency Control

Only one refresh can run per catalog at a time. If a CatalogRefresh is created while another is running:

  • The new refresh enters Pending phase
  • It waits for the running refresh to complete
  • Then proceeds automatically
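
The queuing behavior can be modeled with one lock per catalog. The sketch below is an in-memory illustration only: the `CatalogLocks` class is hypothetical, and the first-in, first-out ordering of waiting refreshes is an assumption that the text above does not state.

```python
from collections import defaultdict, deque


class CatalogLocks:
    """One running refresh per catalog; later arrivals wait (FIFO assumed)."""

    def __init__(self):
        self.running = {}                   # catalog name -> running refresh name
        self.waiting = defaultdict(deque)   # catalog name -> queued refresh names

    def submit(self, catalog: str, refresh: str) -> str:
        """Register a refresh and return the phase it enters."""
        if catalog in self.running:
            self.waiting[catalog].append(refresh)
            return "Pending"
        self.running[catalog] = refresh
        return "Running"

    def finish(self, catalog: str):
        """Release the lock; return the refresh promoted to Running, if any."""
        del self.running[catalog]
        if self.waiting[catalog]:
            promoted = self.waiting[catalog].popleft()
            self.running[catalog] = promoted
            return promoted
        return None


locks = CatalogLocks()
first = locks.submit("data-lake", "refresh-a")    # lock is free
second = locks.submit("data-lake", "refresh-b")   # same catalog, must wait
other = locks.submit("sales-lake", "refresh-c")   # different catalog, independent
promoted = locks.finish("data-lake")              # refresh-a done, refresh-b proceeds
print(first, second, other, promoted)
```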


3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| e6CatalogRef | LocalObjectReference | Yes | - | Reference to the E6Catalog to refresh |
| refreshType | string | Yes | - | Refresh type: full or delta |
| databases | []string | No | All | Specific databases to refresh |
| timeout | string | No | 30m | Maximum operation duration |

3.2 Refresh Types

| Type | Description | Use Case |
|---|---|---|
| full | Refreshes ALL metadata from source catalog | Schema changes, dropped tables, full sync |
| delta | Only refreshes NEW tables not in cache | New tables added, faster incremental sync |
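
The difference between the two types can be expressed with set operations on table names. This is a conceptual sketch, not the storage service's algorithm: `plan_refresh` and its `add`/`recheck`/`drop` buckets are hypothetical names chosen for illustration.

```python
def plan_refresh(refresh_type, source, cached):
    """Return which tables a refresh would touch, given source and cached table sets."""
    new_tables = source - cached
    if refresh_type == "delta":
        # Delta looks only at tables missing from the cache.
        return {"add": sorted(new_tables), "recheck": [], "drop": []}
    if refresh_type == "full":
        # Full also re-reads known tables and removes ones gone from the source.
        return {
            "add": sorted(new_tables),
            "recheck": sorted(source & cached),
            "drop": sorted(cached - source),
        }
    raise ValueError(f"refreshType must be 'full' or 'delta', got {refresh_type!r}")


source = {"sales.orders", "sales.customers", "sales.returns"}
cached = {"sales.orders", "sales.customers", "sales.legacy"}

delta_plan = plan_refresh("delta", source, cached)
full_plan = plan_refresh("full", source, cached)
print(delta_plan)
print(full_plan)
```

In this model a delta refresh never notices the dropped `sales.legacy` table or a schema change in `sales.orders`, which is why those cases call for a full refresh.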

3.3 Timeout Format

Duration string with units: ns, us, ms, s, m, h

Examples: 30m, 1h, 90s, 1h30m
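
The listed units match those accepted by Go's `time.ParseDuration`, so the sketch below assumes that syntax (a sequence of number-plus-unit pairs). It is a simplified, hypothetical validator for checking a timeout value before submitting a manifest; it does not handle signs or the `µs` spelling.

```python
import re

# Seconds per unit, for the units listed above.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration string such as '1h30m' to seconds."""
    pos, total = 0, 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError("empty duration")
    return total


for example in ("30m", "1h", "90s", "1h30m"):
    print(example, parse_duration(example))
```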


4. Example Manifests

4.1 Full Refresh (All Databases)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-full-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  timeout: 1h

4.2 Delta Refresh (Incremental)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-delta-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta
  timeout: 30m

4.3 Database-Specific Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: sales-db-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases:
    - sales
    - orders
  timeout: 15m

4.4 Generated Name Pattern (for Automation)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  generateName: data-lake-refresh-  # Kubernetes appends a random suffix; submit with `kubectl create`, not `kubectl apply`
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
|---|---|---|
| phase | CatalogRefreshPhase | Current phase |
| startTime | Time | When refresh started |
| completionTime | Time | When refresh completed |
| databasesRefreshed | int | Number of databases processed |
| tablesRefreshed | int | Number of tables processed |
| failures | []RefreshFailure | List of failures (if any) |
| message | string | Human-readable status |
| diagnosticsFilePath | string | Path to detailed diagnostics |
| conditions | []Condition | Detailed conditions |

5.2 Phase Values

| Phase | Description |
|---|---|
| Pending | Waiting for another refresh to complete |
| Running | Refresh in progress |
| Succeeded | Completed successfully (all items) |
| PartialSuccess | Completed with some failures |
| Failed | Complete failure |
| TimedOut | Exceeded timeout duration |

5.3 Phase Transitions

                    ┌─────────────────────────────────┐
                    │                                 │
                    ▼                                 │
┌─────────┐    ┌─────────┐    ┌───────────┐           │
│ Pending │───▶│ Running │───▶│ Succeeded │           │
└─────────┘    └────┬────┘    └───────────┘           │
                    │                                 │
                    ├────────▶ PartialSuccess ────────┤
                    │                                 │
                    ├────────▶ Failed ────────────────┤
                    │                                 │
                    └────────▶ TimedOut ──────────────┘
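
For tooling that watches a CatalogRefresh, the phases can be encoded as a small transition table. The sketch below models only the forward arrows (Pending to Running, then Running to one of the four outcomes) and treats the outcomes as final. That finality is an assumption based on the "runs once" description in section 1, and the helper names are hypothetical.

```python
# Forward transitions only; outcome phases have no outgoing edges (assumed).
ALLOWED = {
    "Pending": {"Running"},
    "Running": {"Succeeded", "PartialSuccess", "Failed", "TimedOut"},
}
TERMINAL = {"Succeeded", "PartialSuccess", "Failed", "TimedOut"}


def can_transition(current: str, nxt: str) -> bool:
    """True if moving from `current` to `nxt` follows a forward arrow."""
    return nxt in ALLOWED.get(current, set())


def is_terminal(phase: str) -> bool:
    """True once the refresh has reached an outcome phase."""
    return phase in TERMINAL


print(can_transition("Pending", "Running"))
print(can_transition("Pending", "Succeeded"))
print(is_terminal("TimedOut"))
```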

5.4 Failure Details

status:
  phase: PartialSuccess
  databasesRefreshed: 10
  tablesRefreshed: 450
  failures:
    - type: table
      name: sales.corrupted_table
      reason: "Failed to read schema: Invalid parquet file"
    - type: database
      name: temp_db
      reason: "Access denied: insufficient permissions"
  diagnosticsFilePath: "s3://bucket/diagnostics/refresh-2024-01-15T10-00-00.json"
  message: "Refresh completed with 2 failures. Duration: 5m30s"
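
A status block like this is easy to summarize in a script. The sketch below uses the example values above as a Python dict (as would be obtained by parsing the CR's JSON output); `summarize_failures` is a hypothetical helper.

```python
from collections import Counter

# The example status above, as parsed JSON.
status = {
    "phase": "PartialSuccess",
    "databasesRefreshed": 10,
    "tablesRefreshed": 450,
    "failures": [
        {"type": "table", "name": "sales.corrupted_table",
         "reason": "Failed to read schema: Invalid parquet file"},
        {"type": "database", "name": "temp_db",
         "reason": "Access denied: insufficient permissions"},
    ],
}


def summarize_failures(status):
    """Return failure counts per type and one readable line per failure."""
    failures = status.get("failures") or []   # field is absent when nothing failed
    counts = dict(Counter(f["type"] for f in failures))
    lines = [f'{f["type"]} {f["name"]}: {f["reason"]}' for f in failures]
    return counts, lines


counts, lines = summarize_failures(status)
print(counts)
for line in lines:
    print(line)
```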

Dependencies

| CRD | Relationship |
|---|---|
| E6Catalog | Required - must be in Ready phase |

Created By

| CRD | Relationship |
|---|---|
| CatalogRefreshSchedule | Creates CatalogRefresh CRs on schedule |

7. Troubleshooting

7.1 Common Issues

Refresh Stuck in Pending

Symptoms:

$ kubectl get catalogrefresh
NAME                      CATALOG     TYPE   PHASE
data-lake-refresh-abc     data-lake   full   Pending

Cause: Another refresh is running on the same catalog.

Check:

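# Find running refreshes (filtered client-side; a --field-selector on status.phase
# works only if the CRD declares that path under selectableFields)
kubectl get catalogrefresh -l e6data.io/catalog=data-lake -o json \
  | jq -r '.items[] | select(.status.phase == "Running") | .metadata.name'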

# Check if catalog is being created/updated
kubectl get e6cat data-lake -o jsonpath='{.status.phase}'
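
The same check can be scripted against the output of `kubectl get catalogrefresh -o json`. The sketch below matches on `spec.e6CatalogRef.name` rather than a label, so it does not depend on how the resources are labeled. `blocking_refreshes` and the sample items are hypothetical.

```python
def blocking_refreshes(items, catalog):
    """Names of Running refreshes that target the given catalog."""
    return [
        item["metadata"]["name"]
        for item in items
        if item.get("spec", {}).get("e6CatalogRef", {}).get("name") == catalog
        and item.get("status", {}).get("phase") == "Running"
    ]


# Stand-in for json.loads(<kubectl output>)["items"].
items = [
    {"metadata": {"name": "data-lake-refresh-abc"},
     "spec": {"e6CatalogRef": {"name": "data-lake"}},
     "status": {"phase": "Pending"}},
    {"metadata": {"name": "data-lake-refresh-xyz"},
     "spec": {"e6CatalogRef": {"name": "data-lake"}},
     "status": {"phase": "Running"}},
    {"metadata": {"name": "other-refresh"},
     "spec": {"e6CatalogRef": {"name": "other-catalog"}},
     "status": {"phase": "Running"}},
]

blockers = blocking_refreshes(items, "data-lake")
print(blockers)
```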

Refresh TimedOut

Symptoms: Phase is TimedOut.

Causes:

  1. Large catalog with many tables
  2. Slow source catalog (Hive/Glue API)
  3. Network issues

Resolution:

# Increase timeout for large catalogs
spec:
  timeout: 2h  # Instead of default 30m

Or refresh specific databases:

spec:
  databases:
    - high_priority_db
  timeout: 30m

PartialSuccess - Some Tables Failed

Symptoms: Phase is PartialSuccess with failures.

Resolution:

# View inline failures
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.failures}' | jq

# Get full diagnostics
DIAG_PATH=$(kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.diagnosticsFilePath}')
aws s3 cp "$DIAG_PATH" - | jq

Common failure reasons:

  • Schema inference errors (unsupported types)
  • Permission denied on specific tables
  • Corrupt metadata files
  • Network timeouts on large tables

7.2 Useful Commands

# Get refresh status
kubectl get catalogrefresh data-lake-refresh -o yaml

# Watch refresh progress
kubectl get catalogrefresh -w

# List all refreshes for a catalog
kubectl get catalogrefresh -l e6data.io/catalog=data-lake

# Get refresh duration
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.message}'

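# Delete succeeded refreshes (filtered client-side; a --field-selector on
# status.phase works only if the CRD declares that path under selectableFields)
kubectl get catalogrefresh -l e6data.io/catalog=data-lake -o json \
  | jq -r '.items[] | select(.status.phase == "Succeeded") | .metadata.name' \
  | xargs -r kubectl delete catalogrefresh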

# Count tables refreshed
kubectl get catalogrefresh data-lake-refresh -o jsonpath='{.status.tablesRefreshed}'

8. Best Practices

8.1 Choosing Refresh Type

| Scenario | Recommended Type |
|---|---|
| New tables added | delta |
| Schema changed | full |
| Tables dropped | full |
| Regular sync | delta |
| After catalog outage | full |
| Partition updates | delta (partitions auto-refresh) |

8.2 Timeout Guidelines

| Catalog Size | Recommended Timeout |
|---|---|
| Small (< 100 tables) | 15m |
| Medium (100-1000 tables) | 30m |
| Large (1000-10000 tables) | 1h |
| Very large (> 10000 tables) | 2h or database-specific |
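
These guidelines translate directly into a lookup that automation can use when generating manifests. In the sketch below, `recommend_timeout` is a hypothetical helper. The table's ranges share their endpoints, so assigning exactly 1000 tables to the medium tier and exactly 10000 to the large tier is an assumption.

```python
def recommend_timeout(table_count: int) -> str:
    """Map a table count to the timeout tier from the guidelines above."""
    if table_count < 0:
        raise ValueError("table_count must be non-negative")
    if table_count < 100:
        return "15m"
    if table_count <= 1000:      # boundary placement is an assumption
        return "30m"
    if table_count <= 10000:     # boundary placement is an assumption
        return "1h"
    return "2h"                  # or split into database-specific refreshes


for count in (50, 500, 5000, 25000):
    print(count, recommend_timeout(count))
```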

8.3 Database-Specific Refresh

For very large catalogs, refresh databases individually:

# Create multiple targeted refreshes
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: refresh-sales
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases: [sales]
  timeout: 30m
---
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: refresh-analytics
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases: [analytics]
  timeout: 30m
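
When the database list is long, the per-database manifests can be generated instead of written by hand. The sketch below is illustrative: `per_database_refreshes` is a hypothetical helper that wraps the manifests in a `v1` `List` object and emits JSON, both of which kubectl accepts. Underscores in database names are replaced in the resource name because Kubernetes object names do not allow them.

```python
import json


def per_database_refreshes(catalog, databases, timeout="30m"):
    """Build one full-refresh CatalogRefresh per database, wrapped in a List."""
    items = []
    for db in databases:
        items.append({
            "apiVersion": "e6data.io/v1alpha1",
            "kind": "CatalogRefresh",
            # Resource names must be lowercase and may not contain underscores.
            "metadata": {"name": "refresh-" + db.lower().replace("_", "-")},
            "spec": {
                "e6CatalogRef": {"name": catalog},
                "refreshType": "full",
                "databases": [db],
                "timeout": timeout,
            },
        })
    return {"apiVersion": "v1", "kind": "List", "items": items}


bundle = per_database_refreshes("data-lake", ["sales", "temp_db"])
print(json.dumps(bundle, indent=2))
```

Because only one refresh runs per catalog at a time, refreshes created together this way run one after another rather than in parallel.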

8.4 Cleanup

CatalogRefresh CRs are not automatically deleted. Clean up old refreshes:

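# Delete non-running refreshes older than 7 days (label selectors cannot compare
# ages, so filter on creationTimestamp client-side)
kubectl get catalogrefresh -o json \
  | jq -r '.items[] | select(.status.phase != "Running" and (.metadata.creationTimestamp | fromdateiso8601) < (now - 604800)) | .metadata.name' \
  | xargs -r kubectl delete catalogrefresh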

# Or use CatalogRefreshSchedule which auto-cleans history
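
The age-based selection can also be written as a small script over the output of `kubectl get catalogrefresh -o json`. The sketch below only computes which names to delete: `stale_refreshes` and the sample items are hypothetical, and the timestamp format is the one Kubernetes uses for `metadata.creationTimestamp`.

```python
from datetime import datetime, timedelta, timezone


def stale_refreshes(items, now, max_age=timedelta(days=7)):
    """Names of refreshes older than max_age that are not currently Running."""
    cutoff = now - max_age
    names = []
    for item in items:
        if item.get("status", {}).get("phase") == "Running":
            continue  # never delete a refresh that is in progress
        created = datetime.strptime(
            item["metadata"]["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if created < cutoff:
            names.append(item["metadata"]["name"])
    return names


# Stand-in for json.loads(<kubectl output>)["items"].
items = [
    {"metadata": {"name": "old-ok", "creationTimestamp": "2024-01-01T10:00:00Z"},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "recent-ok", "creationTimestamp": "2024-01-14T10:00:00Z"},
     "status": {"phase": "Succeeded"}},
    {"metadata": {"name": "old-running", "creationTimestamp": "2024-01-01T10:00:00Z"},
     "status": {"phase": "Running"}},
]

to_delete = stale_refreshes(items, now=datetime(2024, 1, 15, tzinfo=timezone.utc))
print(to_delete)
```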