# E6Catalog

- API Version: `e6data.io/v1alpha1`
- Kind: `E6Catalog`
- Short Names: `e6cat`
## 1. Purpose

E6Catalog registers and manages external data catalogs with the e6data storage service. It supports multiple catalog types:
- HIVE: Apache Hive Metastore
- GLUE: AWS Glue Data Catalog
- UNITY: Databricks Unity Catalog
- ICEBERG: Apache Iceberg catalogs (REST, Hive, Hadoop)
- DELTA: Delta Lake catalogs
Create an E6Catalog after MetadataServices is running to connect your data lake metadata to e6data for querying.
## 2. High-level Behavior

When you create an E6Catalog CR, the operator:

- Discovers MetadataServices via `metadataServicesRef` to find the storage service endpoint
- Attempts the primary storage service first, falling back to the secondary (if HA is enabled)
- Calls the storage service HTTP API to register the catalog asynchronously
- Polls the operation status until it completes (success, partial_success, or failed)
- Updates the CR status with catalog details and any failures
### API Operations
| Action | HTTP Endpoint | Method |
|---|---|---|
| Create | /api/v1/catalogs | POST |
| Update | /api/v1/catalogs/{name} | PUT |
| Delete | /api/v1/catalogs/{name} | DELETE |
| Status | /api/v1/catalogs/{name}/status | GET |
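For illustration, here is roughly what the operator's registration call looks like when made by hand. A minimal sketch only: the endpoint paths follow the table above, but the storage service host/port and the request payload fields are assumptions, not a documented contract.

```bash
# Sketch: register a catalog directly against the storage service API.
# The operator normally does this for you. Endpoint paths follow the table
# above; the host, port, and payload shape below are hypothetical.
STORAGE_ENDPOINT="http://analytics-prod-storage.workspace-analytics-prod.svc:9005"

# Asynchronous create — returns immediately while the sync runs.
curl -sS -X POST "${STORAGE_ENDPOINT}/api/v1/catalogs" \
  -H "Content-Type: application/json" \
  -d '{"catalogName": "data-lake", "catalogType": "GLUE"}'

# Poll until the operation reports success, partial_success, or failed.
curl -sS "${STORAGE_ENDPOINT}/api/v1/catalogs/data-lake/status"
```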
### No Child Resources

E6Catalog does not create Kubernetes resources. It manages catalog registration in the storage service via HTTP API calls.
## 3. Spec Reference

### 3.1 Top-level Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `catalogName` | string | No | CR name | Catalog name in the storage service |
| `catalogType` | string | Yes | - | Type: HIVE, GLUE, UNITY, ICEBERG, DELTA |
| `metadataServicesRef` | string | Yes | - | Name of a MetadataServices in the same namespace |
| `connectionMetadata` | ConnectionMetadata | Yes | - | Catalog and storage connection details |
| `isDefault` | bool | No | false | Set as the default catalog for queries |
| `schemas` | []string | No | `["*"]` | Schemas to include (`["*"]` = all) |
| `tables` | map[string][]string | No | `{"*": ["*"]}` | Tables to include, per schema |
| `columns` | map[string]map[string][]string | No | All | Columns to include, per table |
| `userContext` | UserContext | No | - | Governance user context |
### 3.2 ConnectionMetadata

| Field | Type | Required | Description |
|---|---|---|---|
| `catalogConnection` | CatalogConnection | Yes | Catalog-specific connection |
| `storageConnection` | StorageConnection | No | Storage backend (deprecated; use MetadataServices) |
### 3.3 CatalogConnection

| Field | Type | Required | Description |
|---|---|---|---|
| `hiveConnection` | HiveConnection | Conditional | For HIVE type |
| `glueConnection` | GlueConnection | Conditional | For GLUE type |
| `unityConnection` | UnityConnection | Conditional | For UNITY type |
| `icebergConnection` | IcebergConnection | Conditional | For ICEBERG type |
| `deltaConnection` | DeltaConnection | Conditional | For DELTA type |
### 3.4 HiveConnection

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `host` | string | Yes | - | Hive metastore host |
| `port` | int32 | Yes | - | Hive metastore port (usually 9083) |
| `catalog` | string | No | default | Hive catalog name |
### 3.5 GlueConnection

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `region` | string | Yes | - | AWS region (e.g., us-east-1) |
| `catalogId` | string | No | Current account | AWS account ID |
### 3.6 UnityConnection

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `host` | string | Yes | - | Databricks workspace host |
| `port` | int32 | Yes | - | Connection port (usually 443) |
| `bearerToken` | string | Yes | - | Databricks API token |
| `catalogName` | string | Yes | - | Unity Catalog name |
#### Understanding Unity Catalog

Databricks Unity Catalog is a unified governance solution for data and AI on the Databricks Lakehouse Platform. When integrating with e6data:

Key Concepts:

- Metastore: the top-level container for Unity Catalog metadata
- Catalog: a container for schemas (databases)
- Schema: a container for tables, views, and functions
- Table: data stored in Delta Lake format

Authentication:

- Uses Bearer Token authentication via Databricks Personal Access Tokens (PAT)
- Generate tokens in the Databricks workspace: Settings → Developer → Access Tokens
- The token requires SELECT privilege on the catalogs/schemas you want to access

Data Access: Unity Catalog manages data in cloud storage (S3, ADLS, GCS). e6data:

1. Connects to Unity Catalog for metadata (table definitions, schemas)
2. Reads data directly from the underlying cloud storage
3. Respects Unity Catalog governance policies (if `userContext` is configured)

Naming Conventions:

- `catalogName` in `unityConnection`: the Unity Catalog name in Databricks (e.g., `main`, `prod_catalog`)
- `spec.catalogName`: how e6data will reference this catalog (usually matches the Unity name)

Governance Integration: When using Unity Catalog with governance enabled in MetadataServices (`governance.provider: unity`):

- e6data syncs permissions from Unity Catalog
- Row-level security and column masking are applied
- Set `userContext` for user-specific access filtering
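Before creating the CR, it can help to confirm that the PAT can actually see the catalog through the Unity Catalog REST API. A quick sanity-check sketch; the workspace host, token, and catalog name are placeholders matching the examples below:

```bash
# Verify a Databricks PAT can read the Unity Catalog metadata e6data will need.
# Host, token, and catalog name are placeholders.
DATABRICKS_HOST="my-workspace.cloud.databricks.com"
DATABRICKS_TOKEN="dapi_your_token_here"

# Fetch the catalog the E6Catalog CR will point at (Unity Catalog REST API).
curl -sS "https://${DATABRICKS_HOST}/api/2.1/unity-catalog/catalogs/main" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}"

# List schemas in that catalog to confirm schema-level visibility.
curl -sS "https://${DATABRICKS_HOST}/api/2.1/unity-catalog/schemas?catalog_name=main" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}"
```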
### 3.7 IcebergConnection

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | - | Catalog type: hive, hadoop, rest |
| `uri` | string | No | - | Catalog URI |
| `warehouse` | string | No | - | Warehouse location |
| `properties` | map[string]string | No | {} | Additional catalog properties |
### 3.8 DeltaConnection

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | Yes | - | Catalog type: glue, hive |
| `properties` | map[string]string | No | {} | Delta-specific properties |
### 3.9 UserContext

| Field | Type | Required | Description |
|---|---|---|---|
| `userName` | string | No | Username for governance filtering |
| `userEmail` | string | No | Email for governance filtering |
## 4. Example Manifests

### 4.1 AWS Glue Catalog

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: data-lake
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod
  isDefault: true
  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1
        # catalogId: "123456789012" # Optional; defaults to the current account
```
### 4.2 Hive Metastore Catalog

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: hive-warehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: HIVE
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      hiveConnection:
        host: hive-metastore.data-platform.svc.cluster.local
        port: 9083
        catalog: default
  # Include only specific schemas
  schemas:
    - sales
    - marketing
    - finance
  # Include specific tables per schema
  tables:
    sales:
      - orders
      - customers
      - products
    marketing:
      - "*" # All tables in marketing
    finance:
      - transactions
      - accounts
```
### 4.3 Databricks Unity Catalog

Basic Unity Catalog Setup:

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-prod
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Your Databricks workspace URL (without https://)
        host: adb-1234567890.azuredatabricks.net
        # Always 443 for Databricks
        port: 443
        # Personal Access Token from Databricks
        bearerToken: dapi1234567890abcdef
        # The Unity Catalog name (visible in Databricks Catalog Explorer)
        catalogName: main
  # Optional: governance integration for user-specific access
  userContext:
    userName: analytics-service
    userEmail: analytics@example.com
```
Unity Catalog with Schema Filtering:

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-filtered
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  isDefault: true # Make this the default catalog for queries
  connectionMetadata:
    catalogConnection:
      unityConnection:
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_your_token_here
        catalogName: production_catalog
  # Only include specific schemas from Unity Catalog
  schemas:
    - gold   # Curated data
    - silver # Cleaned data
    - bronze # Raw data
    # Excludes: staging, temp, dev schemas
  # Table filtering per schema
  tables:
    gold:
      - customers
      - orders
      - products
    silver:
      - "*" # All tables in silver
    bronze:
      - events # Only the events table from bronze
```
Unity Catalog with Azure Databricks:

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: azure-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Azure Databricks workspace URL
        host: adb-123456789012345.6.azuredatabricks.net
        port: 443
        bearerToken: dapi_azure_token_12345
        catalogName: main
# For Azure: ensure MetadataServices has ADLS access configured
# via Azure Workload Identity or storage account keys
```
Unity Catalog with AWS Databricks:

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: aws-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      unityConnection:
        # AWS Databricks workspace URL
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_aws_token_67890
        catalogName: main
# For AWS: ensure MetadataServices has S3 access configured
# via IRSA (IAM Roles for Service Accounts)
```
Important Notes for Unity Catalog:

- Token Generation:
  - Go to the Databricks workspace → Settings → Developer → Access Tokens
  - Create a new token with a sufficient expiration
  - Store the token securely (consider using Kubernetes Secrets; see the sketch below)
- Required Permissions:
  - `USE CATALOG` on the catalog
  - `USE SCHEMA` on schemas you want to access
  - `SELECT` on tables you want to query
- Data Access:
  - e6data queries the underlying data directly (S3/ADLS/GCS)
  - Ensure the MetadataServices ServiceAccount has cloud storage access
  - Unity Catalog handles metadata; e6data handles data reading
- Best Practices:
  - Use service accounts/tokens for production
  - Rotate tokens regularly
  - Filter to only the necessary schemas/tables
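Because `unityConnection.bearerToken` is a plain string in the spec, one way to keep the token out of version control is to hold it in a Secret and render the manifest at apply time. A minimal sketch, assuming a `catalog.yaml` that contains a `${DATABRICKS_TOKEN}` placeholder; the Secret and file names are hypothetical:

```bash
# Keep the PAT in a Secret rather than in Git (names here are hypothetical).
kubectl create secret generic databricks-pat \
  --from-literal=token=dapi_your_token_here \
  -n workspace-analytics-prod

# Render the manifest with the token at apply time. catalog.yaml is assumed
# to contain `bearerToken: ${DATABRICKS_TOKEN}` as a placeholder.
DATABRICKS_TOKEN=$(kubectl get secret databricks-pat -n workspace-analytics-prod \
  -o jsonpath='{.data.token}' | base64 -d)
export DATABRICKS_TOKEN
envsubst < catalog.yaml | kubectl apply -f -
```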
### 4.4 Apache Iceberg Catalog (REST)

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: iceberg-lakehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: ICEBERG
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      icebergConnection:
        type: rest
        uri: https://iceberg-rest-catalog.example.com
        warehouse: s3://data-lake/warehouse
        properties:
          credential: "client-id:client-secret"
          scope: "catalog"
```
### 4.5 Delta Lake Catalog

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: delta-tables
  namespace: workspace-analytics-prod
spec:
  catalogType: DELTA
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      deltaConnection:
        type: glue
        properties:
          "spark.sql.catalog.delta.type": "glue"
  # Column-level filtering
  columns:
    sales:
      customers:
        - id
        - name
        - email # Exclude PII columns like SSN, phone
      orders:
        - "*" # All columns
```
### 4.6 Fine-Grained Filtering Example

```yaml
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: filtered-catalog
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1
  # Include only these schemas
  schemas:
    - public_data
    - analytics
    - reports
  # Table filtering per schema
  tables:
    public_data:
      - "*" # All tables
    analytics:
      - user_activity
      - page_views
      - conversions
    reports:
      - daily_summary
      - weekly_metrics
  # Column filtering per table
  columns:
    analytics:
      user_activity:
        - timestamp
        - event_type
        - user_id # Exclude raw IP, user_agent
      page_views:
        - "*"
```
## 5. Status & Lifecycle

### 5.1 Status Fields

| Field | Type | Description |
|---|---|---|
| `phase` | string | Current lifecycle phase |
| `operationStatus` | OperationStatus | Current async operation details |
| `activeStorageService` | string | Storage service being used |
| `storageServiceEndpoint` | string | Full HTTP endpoint |
| `catalogDetails` | CatalogDetails | Catalog info from the API |
| `lastRefreshTime` | Time | Last successful refresh |
| `conditions` | []Condition | Detailed status conditions |
| `observedGeneration` | int64 | Last observed spec generation |
### 5.2 Phase Values

| Phase | Description |
|---|---|
| `Waiting` | Waiting for MetadataServices |
| `Creating` | Create operation in progress |
| `Ready` | Catalog registered and operational |
| `Updating` | Update operation in progress |
| `Refreshing` | Refresh operation in progress |
| `Deleting` | Delete operation in progress |
| `Failed` | Operation failed |
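In scripts or CI, you can block until the catalog reaches the `Ready` phase. A minimal sketch using `kubectl wait` with a JSONPath condition (available in kubectl v1.23+), reusing the catalog name from the examples above:

```bash
# Block until the catalog is registered, failing after five minutes.
kubectl wait e6cat/data-lake \
  --for=jsonpath='{.status.phase}'=Ready \
  --timeout=300s \
  -n workspace-analytics-prod
```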
### 5.3 Operation Status

Example `status.operationStatus` after a partially successful create:

```yaml
operationStatus:
  operation: create
  status: partial_success
  message: "Some tables failed to sync"
  startTime: "2024-01-15T10:00:00Z"
  lastUpdated: "2024-01-15T10:05:00Z"
  totalDBsRefreshed: 5
  totalTablesRefreshed: 150
  diagnosticsFilePath: "s3://bucket/diagnostics/catalog-create-2024-01-15.json"
  failures:
    - type: table
      name: sales.broken_table
      reason: "Schema inference failed: unsupported column type"
    - type: table
      name: analytics.corrupt_data
      reason: "Unable to read partition metadata"
```
### 5.4 Partial Success Handling

The operator handles three operation outcomes:

| Status | Phase | Meaning |
|---|---|---|
| `success` | Ready | All tables/databases synced |
| `partial_success` | Ready | Catalog usable; some items failed |
| `failed` | Failed | Complete failure; catalog not usable |

Partial success means the catalog is operational but some tables could not be synced. Check the `failures` array for details; the snippet below shows one way to list them.
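```bash
# Print the name and reason of every item that failed to sync.
kubectl get e6cat data-lake -n workspace-analytics-prod -o json \
  | jq -r '.status.operationStatus.failures[] | "\(.name): \(.reason)"'
```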
## 6. Related Resources

### Dependencies
| CRD | Relationship |
|---|---|
| MetadataServices | Required - provides storage service endpoint |
### Referencing CRDs
| CRD | Reference Field |
|---|---|
| CatalogRefresh | spec.e6CatalogRef.name |
| CatalogRefreshSchedule | spec.e6CatalogRef.name |
| Governance | spec.catalogName |
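For example, a CatalogRefresh pointing at the catalog from section 4.1 would reference it as sketched below. Only `spec.e6CatalogRef.name` is documented in the table above; the `apiVersion` and the rest of the shape are assumptions:

```bash
# Hypothetical sketch: trigger a refresh of an existing E6Catalog.
# Only spec.e6CatalogRef.name is grounded in this page; other fields are assumed.
kubectl apply -f - <<'EOF'
apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake
EOF
```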
## 7. Troubleshooting

### 7.1 Common Issues

#### Catalog Stuck in "Creating"

Symptoms: `phase` stays `Creating` and the operation never reaches a terminal state.

Causes:

1. Storage service not responding
2. Network connectivity issues
3. Catalog source (Glue/Hive) unreachable
Checks:

```bash
# Check the operation status
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}'

# Verify the storage service is running
kubectl get pods -l app.kubernetes.io/name=storage

# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage --tail=100 | grep -i catalog
```
#### Partial Success with Failures

Symptoms: `phase` is `Ready` but `operationStatus.status` is `partial_success` with entries in `failures`.

Resolution:

1. Check the `failures` array for specific issues
2. Download the full diagnostics from `diagnosticsFilePath`
3. Fix source catalog issues (permissions, corrupt tables)
4. Trigger a manual refresh

```bash
# Get the diagnostics file path
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.diagnosticsFilePath}'

# Download and inspect (example for S3)
aws s3 cp s3://bucket/diagnostics/catalog-create-2024-01-15.json - | jq
```
#### Connection Refused to Hive Metastore

Symptoms: Phase `Failed` with a connection refused error.

Checks:

```bash
# Verify the Hive metastore is accessible
kubectl run -it --rm debug --image=busybox -- nc -zv hive-metastore.data-platform 9083

# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup hive-metastore.data-platform
```
#### AWS Glue Access Denied

Symptoms: Phase `Failed` with an AWS authorization error.

Checks:

```bash
# Verify IAM permissions on the storage service SA
kubectl get sa analytics-prod -o yaml

# Check the IRSA annotation
kubectl get sa analytics-prod -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Test Glue access from the storage pod
kubectl exec -it analytics-prod-storage-blue-xxx -- aws glue get-databases
```
### 7.2 Useful Commands

```bash
# Get catalog status
kubectl get e6cat data-lake -o yaml

# Check all catalogs in the namespace
kubectl get e6cat

# View operation details
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}' | jq

# Get catalog details from the API
kubectl get e6cat data-lake -o jsonpath='{.status.catalogDetails}' | jq

# Force recreation (delete and recreate)
kubectl delete e6cat data-lake
kubectl apply -f catalog.yaml

# Check which storage service endpoint is being used
kubectl get e6cat data-lake -o jsonpath='{.status.storageServiceEndpoint}'
```
## 8. Validation Webhooks

E6Catalog has 20+ validation checks. Representative examples:
| Check | Error Message |
|---|---|
| Missing catalogType | `spec.catalogType is required` |
| Invalid catalogType | `spec.catalogType must be HIVE, GLUE, UNITY, ICEBERG, or DELTA` |
| Missing metadataServicesRef | `spec.metadataServicesRef is required` |
| Missing connection for type | `hiveConnection is required when catalogType is HIVE` |
| Invalid Hive port | `hiveConnection.port must be between 1 and 65535` |
| Missing Glue region | `glueConnection.region is required` |
| Missing Unity token | `unityConnection.bearerToken is required` |
| Invalid Iceberg type | `icebergConnection.type must be hive, hadoop, or rest` |
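To see the webhook in action without creating anything, a server-side dry run surfaces the validation error. A sketch; the exact error text should follow the messages in the table above:

```bash
# Expect a webhook rejection: catalogType is missing from this manifest.
kubectl apply --dry-run=server -f - <<'EOF'
apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: invalid-catalog
  namespace: workspace-analytics-prod
spec:
  metadataServicesRef: analytics-prod
  connectionMetadata:
    catalogConnection: {}
EOF
# Expected (per the table above): spec.catalogType is required
```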