E6Catalog

API Version: e6data.io/v1alpha1
Kind: E6Catalog
Short Names: e6cat


1. Purpose

E6Catalog registers and manages external data catalogs with the e6data storage service. It supports multiple catalog types:

  • HIVE: Apache Hive Metastore
  • GLUE: AWS Glue Data Catalog
  • UNITY: Databricks Unity Catalog
  • ICEBERG: Apache Iceberg catalogs (REST, Hive, Hadoop)
  • DELTA: Delta Lake catalogs

Create an E6Catalog after the referenced MetadataServices is running to connect your data lake metadata to e6data for querying.


2. High-level Behavior

When you create an E6Catalog CR, the operator:

  1. Discovers the MetadataServices named by metadataServicesRef to find the storage service endpoint
  2. Tries the primary storage service first, falling back to the secondary (if HA is enabled)
  3. Calls the storage service HTTP API to register the catalog asynchronously
  4. Polls the operation status until it completes (success, partial_success, or failed)
  5. Updates the CR status with catalog details and any failures

API Operations

| Action | HTTP Endpoint                  | Method |
|--------|--------------------------------|--------|
| Create | /api/v1/catalogs               | POST   |
| Update | /api/v1/catalogs/{name}        | PUT    |
| Delete | /api/v1/catalogs/{name}        | DELETE |
| Status | /api/v1/catalogs/{name}/status | GET    |
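
For illustration, here is roughly what the operator's create call looks like. The endpoint path comes from the table above; the payload shape is an assumption for this sketch, not a documented contract, and status.storageServiceEndpoint is assumed to hold the base URL:

# Hypothetical sketch of the create request the operator sends
STORAGE_EP=$(kubectl get e6cat data-lake -o jsonpath='{.status.storageServiceEndpoint}')
curl -X POST "${STORAGE_EP}/api/v1/catalogs" \
  -H "Content-Type: application/json" \
  -d '{"name": "data-lake", "type": "GLUE", "connection": {"region": "us-east-1"}}'  # illustrative payload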

No Child Resources

E6Catalog does not create Kubernetes resources. It manages catalog registration in the storage service via HTTP API calls.


3. Spec Reference

3.1 Top-level Fields

| Field               | Type                           | Required | Default      | Description                                       |
|---------------------|--------------------------------|----------|--------------|---------------------------------------------------|
| catalogName         | string                         | No       | CR name      | Catalog name in the storage service               |
| catalogType         | string                         | Yes      | -            | Type: HIVE, GLUE, UNITY, ICEBERG, DELTA           |
| metadataServicesRef | string                         | Yes      | -            | Name of a MetadataServices in the same namespace  |
| connectionMetadata  | ConnectionMetadata             | Yes      | -            | Catalog and storage connection details            |
| isDefault           | bool                           | No       | false        | Set as the default catalog for queries            |
| schemas             | []string                       | No       | ["*"]        | Schemas to include (["*"] = all)                  |
| tables              | map[string][]string            | No       | {"*": ["*"]} | Tables per schema to include                      |
| columns             | map[string]map[string][]string | No       | All          | Columns per table to include                      |
| userContext         | UserContext                    | No       | -            | Governance user context                           |
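
Because tables and columns nest by schema and then by table, the YAML shape can be hard to infer from the types alone. A minimal sketch of how the three filtering fields line up (schema, table, and column names here are placeholders):

spec:
  schemas:
    - sales
  tables:
    sales:              # schema -> list of tables
      - customers
  columns:
    sales:              # schema -> table -> list of columns
      customers:
        - id
        - name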

3.2 ConnectionMetadata

| Field             | Type              | Required | Description                                        |
|-------------------|-------------------|----------|----------------------------------------------------|
| catalogConnection | CatalogConnection | Yes      | Catalog-specific connection                        |
| storageConnection | StorageConnection | No       | Storage backend (deprecated; use MetadataServices) |

3.3 CatalogConnection

| Field             | Type              | Required    | Description      |
|-------------------|-------------------|-------------|------------------|
| hiveConnection    | HiveConnection    | Conditional | For HIVE type    |
| glueConnection    | GlueConnection    | Conditional | For GLUE type    |
| unityConnection   | UnityConnection   | Conditional | For UNITY type   |
| icebergConnection | IcebergConnection | Conditional | For ICEBERG type |
| deltaConnection   | DeltaConnection   | Conditional | For DELTA type   |

3.4 HiveConnection

| Field   | Type   | Required | Default | Description                        |
|---------|--------|----------|---------|------------------------------------|
| host    | string | Yes      | -       | Hive metastore host                |
| port    | int32  | Yes      | -       | Hive metastore port (usually 9083) |
| catalog | string | No       | default | Hive catalog name                  |

3.5 GlueConnection

| Field     | Type   | Required | Default         | Description                  |
|-----------|--------|----------|-----------------|------------------------------|
| region    | string | Yes      | -               | AWS region (e.g., us-east-1) |
| catalogId | string | No       | Current account | AWS account ID               |

3.6 UnityConnection

| Field       | Type   | Required | Default | Description                   |
|-------------|--------|----------|---------|-------------------------------|
| host        | string | Yes      | -       | Databricks workspace host     |
| port        | int32  | Yes      | -       | Connection port (usually 443) |
| bearerToken | string | Yes      | -       | Databricks API token          |
| catalogName | string | Yes      | -       | Unity Catalog name            |

Understanding Unity Catalog

Databricks Unity Catalog is a unified governance solution for data and AI on the Databricks Lakehouse Platform. When integrating with e6data:

Key Concepts:

  • Metastore: The top-level container for Unity Catalog metadata
  • Catalog: A container for schemas (databases)
  • Schema: A container for tables, views, and functions
  • Table: Data stored in Delta Lake format

Authentication:

  • Uses Bearer Token authentication via Databricks Personal Access Tokens (PAT)
  • Generate tokens in the Databricks workspace: Settings → Developer → Access Tokens
  • The token requires SELECT privilege on the catalogs/schemas you want to access
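
Before wiring a token into a manifest, it can be worth checking it against the workspace directly. A sketch using the Databricks Unity Catalog REST API to list the catalogs visible to the token (host and token are placeholders; verify the endpoint against your Databricks API version):

# Hypothetical token sanity check: list Unity catalogs this token can see
curl -s -H "Authorization: Bearer dapi_your_token_here" \
  "https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/catalogs" | jq '.catalogs[].name'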

Data Access: Unity Catalog manages data in cloud storage (S3, ADLS, GCS). e6data:

  1. Connects to Unity Catalog for metadata (table definitions, schemas)
  2. Reads data directly from the underlying cloud storage
  3. Respects Unity Catalog governance policies (if userContext is configured)

Naming Conventions:

  • catalogName in unityConnection: The Unity Catalog name in Databricks (e.g., main, prod_catalog)
  • spec.catalogName: How e6data will reference this catalog (usually matches the Unity name)

Governance Integration: When using Unity Catalog with governance enabled in MetadataServices (governance.provider: unity):

  • e6data syncs permissions from Unity Catalog
  • Row-level security and column masking are applied
  • Set userContext for user-specific access filtering

3.7 IcebergConnection

| Field      | Type              | Required | Default | Description                      |
|------------|-------------------|----------|---------|----------------------------------|
| type       | string            | Yes      | -       | Catalog type: hive, hadoop, rest |
| uri        | string            | No       | -       | Catalog URI                      |
| warehouse  | string            | No       | -       | Warehouse location               |
| properties | map[string]string | No       | {}      | Additional catalog properties    |

3.8 DeltaConnection

| Field      | Type              | Required | Default | Description               |
|------------|-------------------|----------|---------|---------------------------|
| type       | string            | Yes      | -       | Catalog type: glue, hive  |
| properties | map[string]string | No       | {}      | Delta-specific properties |

3.9 UserContext

| Field     | Type   | Required | Description                       |
|-----------|--------|----------|-----------------------------------|
| userName  | string | No       | Username for governance filtering |
| userEmail | string | No       | Email for governance filtering    |

4. Example Manifests

4.1 AWS Glue Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: data-lake
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod
  isDefault: true

  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1
        # catalogId: "123456789012"  # Optional, defaults to current account

4.2 Hive Metastore Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: hive-warehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: HIVE
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      hiveConnection:
        host: hive-metastore.data-platform.svc.cluster.local
        port: 9083
        catalog: default

  # Include only specific schemas
  schemas:
    - sales
    - marketing
    - finance

  # Include specific tables per schema
  tables:
    sales:
      - orders
      - customers
      - products
    marketing:
      - "*"  # All tables in marketing
    finance:
      - transactions
      - accounts

4.3 Databricks Unity Catalog

Basic Unity Catalog Setup:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-prod
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Your Databricks workspace URL (without https://)
        host: adb-1234567890.azuredatabricks.net
        # Always 443 for Databricks
        port: 443
        # Personal Access Token from Databricks
        bearerToken: dapi1234567890abcdef
        # The Unity Catalog name (visible in Databricks Catalog Explorer)
        catalogName: main

  # Optional: Governance integration for user-specific access
  userContext:
    userName: analytics-service
    userEmail: analytics@example.com

Unity Catalog with Schema Filtering:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-filtered
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  isDefault: true  # Make this the default catalog for queries

  connectionMetadata:
    catalogConnection:
      unityConnection:
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_your_token_here
        catalogName: production_catalog

  # Only include specific schemas from Unity Catalog
  schemas:
    - gold        # Curated data
    - silver      # Cleaned data
    - bronze      # Raw data
    # Excludes: staging, temp, dev schemas

  # Table filtering per schema
  tables:
    gold:
      - customers
      - orders
      - products
    silver:
      - "*"  # All tables in silver
    bronze:
      - events  # Only events table from bronze

Unity Catalog with Azure Databricks:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: azure-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Azure Databricks workspace URL
        host: adb-123456789012345.6.azuredatabricks.net
        port: 443
        bearerToken: dapi_azure_token_12345
        catalogName: main

  # For Azure: Ensure MetadataServices has ADLS access configured
  # via Azure Workload Identity or storage account keys

Unity Catalog with AWS Databricks:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: aws-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # AWS Databricks workspace URL
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_aws_token_67890
        catalogName: main

  # For AWS: Ensure MetadataServices has S3 access configured
  # via IRSA (IAM Roles for Service Accounts)

Important Notes for Unity Catalog:

  1. Token Generation:
     • Go to the Databricks workspace → Settings → Developer → Access Tokens
     • Create a new token with sufficient expiration
     • Store the token securely (consider using Kubernetes Secrets; see the sketch below)

  2. Required Permissions:
     • USE CATALOG on the catalog
     • USE SCHEMA on schemas you want to access
     • SELECT on tables you want to query

  3. Data Access:
     • e6data queries the underlying data directly (S3/ADLS/GCS)
     • Ensure the MetadataServices ServiceAccount has cloud storage access
     • Unity Catalog handles metadata; e6data handles data reading

  4. Best Practices:
     • Use service accounts/tokens for production
     • Rotate tokens regularly
     • Filter to only the necessary schemas/tables
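
The spec reference above defines unityConnection.bearerToken as a plain string, so the CRD does not resolve Secret references itself. A common pattern is to keep the token in a Kubernetes Secret and template it into the manifest at deploy time (e.g., with Helm or Kustomize). A minimal sketch, assuming your deployment tooling performs the substitution:

# Keep the PAT out of version control
# (the E6Catalog manifest still needs the literal value templated in)
kubectl create secret generic databricks-pat \
  --namespace workspace-analytics-prod \
  --from-literal=token=dapi_your_token_here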

4.4 Apache Iceberg Catalog (REST)

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: iceberg-lakehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: ICEBERG
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      icebergConnection:
        type: rest
        uri: https://iceberg-rest-catalog.example.com
        warehouse: s3://data-lake/warehouse
        properties:
          credential: "client-id:client-secret"
          scope: "catalog"

4.5 Delta Lake Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: delta-tables
  namespace: workspace-analytics-prod
spec:
  catalogType: DELTA
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      deltaConnection:
        type: glue
        properties:
          "spark.sql.catalog.delta.type": "glue"

  # Column-level filtering
  columns:
    sales:
      customers:
        - id
        - name
        - email  # Only listed columns are exposed; PII like SSN and phone is excluded
      orders:
        - "*"  # All columns

4.6 Fine-Grained Filtering Example

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: filtered-catalog
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1

  # Include only these schemas
  schemas:
    - public_data
    - analytics
    - reports

  # Table filtering per schema
  tables:
    public_data:
      - "*"  # All tables
    analytics:
      - user_activity
      - page_views
      - conversions
    reports:
      - daily_summary
      - weekly_metrics

  # Column filtering per table
  columns:
    analytics:
      user_activity:
        - timestamp
        - event_type
        - user_id  # Only listed columns are exposed; raw IP and user_agent are excluded
      page_views:
        - "*"

5. Status & Lifecycle

5.1 Status Fields

| Field                  | Type            | Description                     |
|------------------------|-----------------|---------------------------------|
| phase                  | string          | Current lifecycle phase         |
| operationStatus        | OperationStatus | Current async operation details |
| activeStorageService   | string          | Storage service being used      |
| storageServiceEndpoint | string          | Full HTTP endpoint              |
| catalogDetails         | CatalogDetails  | Catalog info from the API       |
| lastRefreshTime        | Time            | Last successful refresh         |
| conditions             | []Condition     | Detailed status conditions      |
| observedGeneration     | int64           | Last observed spec generation   |

5.2 Phase Values

| Phase      | Description                         |
|------------|-------------------------------------|
| Waiting    | Waiting for MetadataServices        |
| Creating   | Create operation in progress        |
| Ready      | Catalog registered and operational  |
| Updating   | Update operation in progress        |
| Refreshing | Refresh operation in progress       |
| Deleting   | Delete operation in progress        |
| Failed     | Operation failed                    |
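
To follow these phase transitions live while the operator reconciles, watch the resource:

# Watch phase changes as the catalog is registered
kubectl get e6cat data-lake -w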

5.3 Operation Status

operationStatus:
  operation: create
  status: partial_success
  message: "Some tables failed to sync"
  startTime: "2024-01-15T10:00:00Z"
  lastUpdated: "2024-01-15T10:05:00Z"
  totalDBsRefreshed: 5
  totalTablesRefreshed: 150
  diagnosticsFilePath: "s3://bucket/diagnostics/catalog-create-2024-01-15.json"
  failures:
    - type: table
      name: sales.broken_table
      reason: "Schema inference failed: unsupported column type"
    - type: table
      name: analytics.corrupt_data
      reason: "Unable to read partition metadata"

5.4 Partial Success Handling

The operator handles three operation outcomes:

| Status          | Phase  | Meaning                               |
|-----------------|--------|---------------------------------------|
| success         | Ready  | All tables/databases synced           |
| partial_success | Ready  | Catalog usable; some items failed     |
| failed          | Failed | Complete failure; catalog not usable  |

Partial success means the catalog is operational but some tables could not be synced. Check the failures array for details, for example with the command below.
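
A quick way to summarize what failed (assumes jq is installed):

# Print one "table: reason" line per failure
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.failures}' \
  | jq -r '.[] | "\(.name): \(.reason)"'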


6. Dependencies

| CRD              | Relationship                                      |
|------------------|---------------------------------------------------|
| MetadataServices | Required - provides the storage service endpoint  |

Referencing CRDs

| CRD                    | Reference Field        |
|------------------------|------------------------|
| CatalogRefresh         | spec.e6CatalogRef.name |
| CatalogRefreshSchedule | spec.e6CatalogRef.name |
| Governance             | spec.catalogName       |
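
For context, a minimal sketch of a CatalogRefresh pointing at this catalog. Only spec.e6CatalogRef.name is taken from the table above; the rest of the shape is an assumption:

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake   # must match the E6Catalog name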

7. Troubleshooting

7.1 Common Issues

Catalog Stuck in "Creating"

Symptoms:

$ kubectl get e6cat
NAME        TYPE   PHASE      SERVICE
data-lake   GLUE   Creating   analytics-prod-storage

Causes:

  1. Storage service not responding
  2. Network connectivity issues
  3. Catalog source (Glue/Hive) unreachable

Checks:

# Check operation status
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}'

# Verify storage service is running
kubectl get pods -l app.kubernetes.io/name=storage

# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage --tail=100 | grep -i catalog

Partial Success with Failures

Symptoms:

$ kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.failures}'

Resolution:

  1. Check the failures array for specific issues
  2. Download the full diagnostics from diagnosticsFilePath
  3. Fix source catalog issues (permissions, corrupt tables)
  4. Trigger a manual refresh

# Get diagnostics file path
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.diagnosticsFilePath}'

# Download and inspect (example for S3)
aws s3 cp s3://bucket/diagnostics/catalog-create-2024-01-15.json - | jq

Connection Refused to Hive Metastore

Symptoms: Phase Failed with connection refused error.

Checks:

# Verify Hive metastore is accessible
kubectl run -it --rm debug --image=busybox -- nc -zv hive-metastore.data-platform 9083

# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup hive-metastore.data-platform

AWS Glue Access Denied

Symptoms: Phase Failed with AWS authorization error.

Checks:

# Verify IAM permissions on storage service SA
kubectl get sa analytics-prod -o yaml

# Check IRSA annotation
kubectl get sa analytics-prod -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Test Glue access from storage pod
kubectl exec -it analytics-prod-storage-blue-xxx -- aws glue get-databases

7.2 Useful Commands

# Get catalog status
kubectl get e6cat data-lake -o yaml

# Check all catalogs in namespace
kubectl get e6cat

# View operation details
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}' | jq

# Get catalog details from API
kubectl get e6cat data-lake -o jsonpath='{.status.catalogDetails}' | jq

# Force recreation (delete and recreate)
kubectl delete e6cat data-lake
kubectl apply -f catalog.yaml

# Check storage service endpoint being used
kubectl get e6cat data-lake -o jsonpath='{.status.storageServiceEndpoint}'

8. Validation Webhooks

E6Catalog enforces 20+ validation checks via admission webhooks; the most common include:

| Check                       | Error Message                                                 |
|-----------------------------|---------------------------------------------------------------|
| Missing catalogType         | spec.catalogType is required                                  |
| Invalid catalogType         | spec.catalogType must be HIVE, GLUE, UNITY, ICEBERG, or DELTA |
| Missing metadataServicesRef | spec.metadataServicesRef is required                          |
| Missing connection for type | hiveConnection is required when catalogType is HIVE           |
| Invalid Hive port           | hiveConnection.port must be between 1 and 65535               |
| Missing Glue region         | glueConnection.region is required                             |
| Missing Unity token         | unityConnection.bearerToken is required                       |
| Invalid Iceberg type        | icebergConnection.type must be hive, hadoop, or rest          |
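
To surface these errors before the resource is persisted, run a server-side dry run, which exercises the validating webhooks:

# Validate the manifest against the admission webhooks without creating anything
kubectl apply --dry-run=server -f catalog.yaml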