CatalogRefresh

API Version: e6data.io/v1alpha1
Kind: CatalogRefresh
Short Names: cr, catalogref

1. Purpose

CatalogRefresh triggers a one-time metadata refresh operation on an E6Catalog. This is similar to a Kubernetes Job - it runs once and tracks completion status.

Use CatalogRefresh when you need one of the following:

- Full Refresh: Sync all metadata from scratch (new tables, schema changes, dropped tables)
- Delta Refresh: Incrementally sync only new tables not already in cache (faster)
- Database-specific Refresh: Refresh only specific databases

For scheduled refreshes, use CatalogRefreshSchedule instead.

2. High-level Behavior

When you create a CatalogRefresh CR, the operator:

1. Validates that the referenced E6Catalog exists and is in the Ready phase
2. Acquires the global catalog lock (prevents concurrent refreshes on the same catalog)
3. Calls the storage service HTTP API to initiate the refresh
4. Polls the operation status every 10 seconds until completion
5. Updates the CR status with the result (Succeeded, PartialSuccess, Failed, or TimedOut)
6. Releases the lock when complete
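The sequence above can be sketched in a few lines of Python. This is only an illustration of the described flow, not the operator's actual code: `run_refresh`, `held_locks`, `start_refresh`, and `poll_phase` are hypothetical names, and leaving the refresh in Pending when the catalog is not Ready is an assumption.

```python
import time

POLL_INTERVAL_SECONDS = 10  # the text states a 10-second polling interval
TERMINAL_PHASES = {"Succeeded", "PartialSuccess", "Failed"}


def run_refresh(catalog, catalog_phase, held_locks, start_refresh, poll_phase,
                timeout_seconds, sleep=time.sleep, clock=time.monotonic):
    """Return the phase a refresh ends up in (all helpers are hypothetical)."""
    # 1. The referenced catalog must be Ready (assumed: otherwise stay Pending).
    if catalog_phase != "Ready":
        return "Pending"
    # 2. One refresh per catalog: stay Pending while another holds the lock.
    if catalog in held_locks:
        return "Pending"
    held_locks.add(catalog)
    try:
        # 3. Ask the storage service to start the refresh.
        start_refresh(catalog)
        deadline = clock() + timeout_seconds
        # 4. Poll until a terminal phase is reported or the timeout passes.
        while True:
            phase = poll_phase(catalog)
            if phase in TERMINAL_PHASES:
                return phase  # 5. the result that ends up in the CR status
            if clock() >= deadline:
                return "TimedOut"
            sleep(POLL_INTERVAL_SECONDS)
    finally:
        # 6. Release the lock whatever the outcome.
        held_locks.discard(catalog)
```

Passing `sleep` and `clock` as parameters keeps the sketch testable without real waiting.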

API Operation

Action	HTTP Endpoint	Method
Refresh	`/api/v1/catalogs/{name}/refresh`	POST
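As a rough illustration of the endpoint in the table, the snippet below builds (but does not send) the POST request. The base URL is a hypothetical placeholder, and no request body is shown because the payload format is not documented here.

```python
from urllib.parse import quote
from urllib.request import Request

# Hypothetical address; the real storage service host and port are not given here.
STORAGE_SERVICE_URL = "http://storage-service:8080"


def build_refresh_request(catalog_name: str) -> Request:
    """Build (but do not send) the POST request for a catalog refresh."""
    # Percent-encode the catalog name so it is safe as a single path segment.
    path = f"/api/v1/catalogs/{quote(catalog_name, safe='')}/refresh"
    return Request(STORAGE_SERVICE_URL + path, method="POST")
```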

Concurrency Control

Only one refresh can run per catalog at a time. If a CatalogRefresh is created while another is running:

- The new refresh enters the Pending phase
- It waits for the running refresh to complete
- It then proceeds automatically
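The queuing behavior can be modelled with a small per-catalog lock table. This is a sketch for intuition only: the class and method names are hypothetical, and first-in-first-out ordering of waiting refreshes is an assumption, since the order is not stated.

```python
from collections import defaultdict, deque


class CatalogLocks:
    """Per-catalog lock with a waiting queue (FIFO order is an assumption)."""

    def __init__(self):
        self.running = {}                  # catalog -> name of the running refresh
        self.waiting = defaultdict(deque)  # catalog -> queued refresh names

    def submit(self, catalog, refresh):
        """Return the phase the newly created refresh enters."""
        if catalog in self.running:
            self.waiting[catalog].append(refresh)
            return "Pending"
        self.running[catalog] = refresh
        return "Running"

    def complete(self, catalog):
        """Finish the running refresh; return the refresh promoted next, if any."""
        self.running.pop(catalog, None)
        if self.waiting[catalog]:
            promoted = self.waiting[catalog].popleft()
            self.running[catalog] = promoted
            return promoted
        return None
```

Refreshes on different catalogs never block each other in this model, which matches the "per catalog" wording.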

3. Spec Reference

3.1 Top-level Fields

Field	Type	Required	Default	Description
`e6CatalogRef`	LocalObjectReference	Yes	-	Reference to E6Catalog to refresh
`refreshType`	string	Yes	-	Type: `full` or `delta`
`databases`	[]string	No	All	Specific databases to refresh
`timeout`	string	No	`30m`	Maximum operation duration

3.2 Refresh Types

Type	Description	Use Case
`full`	Refreshes ALL metadata from source catalog	Schema changes, dropped tables, full sync
`delta`	Only refreshes NEW tables not in cache	New tables added, faster incremental sync

3.3 Timeout Format

Duration string with units: ns, us, ms, s, m, h

Examples: 30m, 1h, 90s, 1h30m
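To make the format concrete, here is a minimal parser that converts such a string to seconds. It is a sketch, not the operator's validation logic: `parse_timeout` is a hypothetical helper, and it accepts whole-number values only.

```python
import re

# Seconds per unit, for the units listed above.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# Two-letter units come first so "ms" is not read as "m" followed by "s".
_TOKEN = re.compile(r"(\d+)(ns|us|ms|s|m|h)")


def parse_timeout(value: str) -> float:
    """Convert a duration string such as '1h30m' into seconds."""
    if not value:
        raise ValueError("empty duration")
    pos = 0
    total = 0.0
    while pos < len(value):
        match = _TOKEN.match(value, pos)
        if match is None:
            raise ValueError(f"invalid duration: {value!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    return total
```

For example, `parse_timeout("1h30m")` adds 3600 seconds for `1h` and 1800 seconds for `30m`.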

4. Example Manifests

4.1 Full Refresh (All Databases)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-full-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  timeout: 1h

4.2 Delta Refresh (Incremental)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-delta-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta
  timeout: 30m

4.3 Database-Specific Refresh

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: sales-db-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: full
  databases:
    - sales
    - orders
  timeout: 15m

4.4 Generated Name Pattern (for Automation)

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  generateName: data-lake-refresh-  # Kubernetes adds random suffix
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
  refreshType: delta
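For automation, the same manifest can be generated programmatically. The sketch below builds it as a Python dictionary and prints it as JSON, which kubectl accepts as well as YAML. `build_refresh` is a hypothetical helper, and the default namespace is simply taken from the examples above.

```python
import json


def build_refresh(catalog, refresh_type="delta",
                  namespace="workspace-analytics-prod",
                  databases=None, timeout=None):
    """Build a CatalogRefresh manifest that relies on generateName."""
    if refresh_type not in ("full", "delta"):
        raise ValueError("refreshType must be 'full' or 'delta'")
    spec = {"e6CatalogRef": {"name": catalog}, "refreshType": refresh_type}
    # Optional fields are left out so the documented defaults apply.
    if databases:
        spec["databases"] = list(databases)
    if timeout:
        spec["timeout"] = timeout
    return {
        "apiVersion": "e6data.io/v1alpha1",
        "kind": "CatalogRefresh",
        "metadata": {"generateName": f"{catalog}-refresh-", "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    print(json.dumps(build_refresh("data-lake"), indent=2))
```

The output can be piped to `kubectl create -f -`. Manifests that use `generateName` need `kubectl create` rather than `kubectl apply`, because `apply` requires a fixed `metadata.name`.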

5. Status & Lifecycle

5.1 Status Fields

Field	Type	Description
`phase`	CatalogRefreshPhase	Current phase
`startTime`	Time	When refresh started
`completionTime`	Time	When refresh completed
`databasesRefreshed`	int	Number of databases processed
`tablesRefreshed`	int	Number of tables processed
`failures`	[]RefreshFailure	List of failures (if any)
`message`	string	Human-readable status
`diagnosticsFilePath`	string	Path to detailed diagnostics
`conditions`	[]Condition	Detailed conditions

5.2 Phase Values

Phase	Description
`Pending`	Waiting for another refresh to complete
`Running`	Refresh in progress
`Succeeded`	Completed successfully (all items)
`PartialSuccess`	Completed with some failures
`Failed`	Complete failure
`TimedOut`	Exceeded timeout duration

5.3 Phase Transitions

                    ┌─────────────────────────────────┐
                    │                                 │
                    ▼                                 │
┌─────────┐    ┌─────────┐    ┌───────────┐           │
│ Pending │───▶│ Running │───▶│ Succeeded │           │
└─────────┘    └────┬────┘    └───────────┘           │
                    │                                 │
                    ├────────▶ PartialSuccess ────────┤
                    │                                 │
                    ├────────▶ Failed ────────────────┤
                    │                                 │
                    └────────▶ TimedOut ──────────────┘

5.4 Failure Details

status:
  phase: PartialSuccess
  databasesRefreshed: 10
  tablesRefreshed: 450
  failures:
    - type: table
      name: sales.corrupted_table
      reason: "Failed to read schema: Invalid parquet file"
    - type: database
      name: temp_db
      reason: "Access denied: insufficient permissions"
  diagnosticsFilePath: "s3://bucket/diagnostics/refresh-2024-01-15T10-00-00.json"
  message: "Refresh completed with 2 failures. Duration: 5m30s"
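A status block like the one above is easy to post-process. The following sketch groups the failures by type; `summarize_failures` is a hypothetical helper, and it assumes the status has already been loaded into a dictionary, for example from the output of `kubectl get catalogrefresh <name> -o json`.

```python
from collections import Counter


def summarize_failures(status: dict) -> dict:
    """Group the entries of status.failures by their type."""
    failures = status.get("failures") or []
    counts = Counter(failure["type"] for failure in failures)
    return {
        "phase": status.get("phase"),
        "counts": dict(counts),
        "failed_tables": [f["name"] for f in failures if f["type"] == "table"],
        "failed_databases": [f["name"] for f in failures if f["type"] == "database"],
    }


# The example status from this section, as a dictionary.
example_status = {
    "phase": "PartialSuccess",
    "databasesRefreshed": 10,
    "tablesRefreshed": 450,
    "failures": [
        {"type": "table", "name": "sales.corrupted_table",
         "reason": "Failed to read schema: Invalid parquet file"},
        {"type": "database", "name": "temp_db",
         "reason": "Access denied: insufficient permissions"},
    ],
}
summary = summarize_failures(example_status)
```

The failed database names can then be fed into the `databases` field of a follow-up CatalogRefresh once the underlying problem is fixed.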

Dependencies

CRD	Relationship
E6Catalog	Required - must be in `Ready` phase

Created By

CRD	Relationship
CatalogRefreshSchedule	Creates CatalogRefresh CRs on schedule

7. Troubleshooting

7.1 Common Issues

Refresh Stuck in Pending

Symptoms:

$ kubectl get catalogrefresh
NAME                      CATALOG     TYPE   PHASE
data-lake-refresh-abc     data-lake   full   Pending

Cause: Another refresh is running on the same catalog.

Check:

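# Find running refreshes
kubectl get catalogrefresh -l e6data.io/catalog=data-lake --field-selector=status.phase=Running
# If the field selector is rejected (custom resources support it only when the CRD declares the field selectable), filter the output instead
kubectl get catalogrefresh -l e6data.io/catalog=data-lake | grep Running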

# Check if catalog is being created/updated
kubectl get e6cat data-lake -o jsonpath='{.status.phase}'

Refresh TimedOut

Symptoms: Phase is TimedOut.

Causes:

1. Large catalog with many tables
2. Slow source catalog (Hive/Glue API)
3. Network issues

Resolution:

# Increase timeout for large catalogs
spec:
  timeout: 2h  # Instead of default 30m

Or refresh specific databases:

spec:
  databases:
    - high_priority_db
  timeout: 30m

PartialSuccess - Some Tables Failed

Symptoms: Phase is PartialSuccess with failures.