E6Catalog

API Version: e6data.io/v1alpha1
Kind: E6Catalog
Short Names: e6cat


1. Purpose

E6Catalog registers and manages external data catalogs with the e6data storage service. It supports multiple catalog types:

  • HIVE: Apache Hive Metastore
  • GLUE: AWS Glue Data Catalog
  • UNITY: Databricks Unity Catalog
  • ICEBERG: Apache Iceberg catalogs (REST, Hive, Hadoop)
  • DELTA: Delta Lake catalogs

Create an E6Catalog after the referenced MetadataServices is running to connect your data lake metadata to e6data for querying.


2. High-level Behavior

When you create an E6Catalog CR, the operator:

  1. Discovers the MetadataServices named by metadataServicesRef to find the storage service endpoint
  2. Tries the primary storage service first, falling back to the secondary (if HA is enabled)
  3. Calls the storage service HTTP API to register the catalog asynchronously
  4. Polls the operation status until it completes (success, partial_success, or failed)
  5. Updates the CR status with catalog details and any failures

API Operations

| Action | HTTP Endpoint                  | Method |
|--------|--------------------------------|--------|
| Create | /api/v1/catalogs               | POST   |
| Update | /api/v1/catalogs/{name}        | PUT    |
| Delete | /api/v1/catalogs/{name}        | DELETE |
| Status | /api/v1/catalogs/{name}/status | GET    |
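
For illustration, here is roughly what the operator's create call looks like. The endpoint path comes from the table above; the payload shape is an assumption for this sketch, not a documented contract, and status.storageServiceEndpoint is assumed to hold the base URL:

# Hypothetical sketch of the create request the operator sends
STORAGE_EP=$(kubectl get e6cat data-lake -o jsonpath='{.status.storageServiceEndpoint}')
curl -X POST "${STORAGE_EP}/api/v1/catalogs" \
  -H "Content-Type: application/json" \
  -d '{"name": "data-lake", "type": "GLUE", "connection": {"region": "us-east-1"}}'  # illustrative payload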

No Child Resources

E6Catalog does not create Kubernetes resources. It manages catalog registration in the storage service via HTTP API calls.


3. Spec Reference

3.1 Top-level Fields

| Field               | Type                           | Required | Default      | Description                                       |
|---------------------|--------------------------------|----------|--------------|---------------------------------------------------|
| catalogName         | string                         | No       | CR name      | Catalog name in the storage service               |
| catalogType         | string                         | Yes      | -            | Type: HIVE, GLUE, UNITY, ICEBERG, DELTA           |
| metadataServicesRef | string                         | Yes      | -            | Name of a MetadataServices in the same namespace  |
| connectionMetadata  | ConnectionMetadata             | Yes      | -            | Catalog and storage connection details            |
| isDefault           | bool                           | No       | false        | Set as the default catalog for queries            |
| schemas             | []string                       | No       | ["*"]        | Schemas to include (["*"] = all)                  |
| tables              | map[string][]string            | No       | {"*": ["*"]} | Tables per schema to include                      |
| columns             | map[string]map[string][]string | No       | All          | Columns per table to include                      |
| userContext         | UserContext                    | No       | -            | Governance user context                           |
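
Because tables and columns nest by schema and then by table, the YAML shape can be hard to infer from the types alone. A minimal sketch of how the three filtering fields line up (schema, table, and column names here are placeholders):

spec:
  schemas:
    - sales
  tables:
    sales:              # schema -> list of tables
      - customers
  columns:
    sales:              # schema -> table -> list of columns
      customers:
        - id
        - name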

3.2 ConnectionMetadata

| Field             | Type              | Required | Description                                        |
|-------------------|-------------------|----------|----------------------------------------------------|
| catalogConnection | CatalogConnection | Yes      | Catalog-specific connection                        |
| storageConnection | StorageConnection | No       | Storage backend (deprecated; use MetadataServices) |

3.3 CatalogConnection

| Field             | Type              | Required    | Description      |
|-------------------|-------------------|-------------|------------------|
| hiveConnection    | HiveConnection    | Conditional | For HIVE type    |
| glueConnection    | GlueConnection    | Conditional | For GLUE type    |
| unityConnection   | UnityConnection   | Conditional | For UNITY type   |
| icebergConnection | IcebergConnection | Conditional | For ICEBERG type |
| deltaConnection   | DeltaConnection   | Conditional | For DELTA type   |

3.4 HiveConnection

| Field   | Type   | Required | Default | Description                        |
|---------|--------|----------|---------|------------------------------------|
| host    | string | Yes      | -       | Hive metastore host                |
| port    | int32  | Yes      | -       | Hive metastore port (usually 9083) |
| catalog | string | No       | default | Hive catalog name                  |

3.5 GlueConnection

| Field     | Type   | Required | Default         | Description                  |
|-----------|--------|----------|-----------------|------------------------------|
| region    | string | Yes      | -               | AWS region (e.g., us-east-1) |
| catalogId | string | No       | Current account | AWS account ID               |

3.6 UnityConnection

| Field       | Type   | Required | Default | Description                   |
|-------------|--------|----------|---------|-------------------------------|
| host        | string | Yes      | -       | Databricks workspace host     |
| port        | int32  | Yes      | -       | Connection port (usually 443) |
| bearerToken | string | Yes      | -       | Databricks API token          |
| catalogName | string | Yes      | -       | Unity Catalog name            |

Understanding Unity Catalog

Databricks Unity Catalog is a unified governance solution for data and AI on the Databricks Lakehouse Platform. When integrating with e6data:

Key Concepts:

  • Metastore: The top-level container for Unity Catalog metadata
  • Catalog: A container for schemas (databases)
  • Schema: A container for tables, views, and functions
  • Table: Data stored in Delta Lake format

Authentication:

  • Uses Bearer Token authentication via Databricks Personal Access Tokens (PAT)
  • Generate tokens in the Databricks workspace: Settings → Developer → Access Tokens
  • The token requires SELECT privilege on the catalogs/schemas you want to access
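
Before wiring a token into a manifest, it can be worth checking it against the workspace directly. A sketch using the Databricks Unity Catalog REST API to list the catalogs visible to the token (host and token are placeholders; verify the endpoint against your Databricks API version):

# Hypothetical token sanity check: list Unity catalogs this token can see
curl -s -H "Authorization: Bearer dapi_your_token_here" \
  "https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/catalogs" | jq '.catalogs[].name'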

Data Access: Unity Catalog manages data in cloud storage (S3, ADLS, GCS). e6data:

  1. Connects to Unity Catalog for metadata (table definitions, schemas)
  2. Reads data directly from the underlying cloud storage
  3. Respects Unity Catalog governance policies (if userContext is configured)

Naming Conventions:

  • catalogName in unityConnection: The Unity Catalog name in Databricks (e.g., main, prod_catalog)
  • spec.catalogName: How e6data will reference this catalog (usually matches the Unity name)

Governance Integration: When using Unity Catalog with governance enabled in MetadataServices (governance.provider: unity):

  • e6data syncs permissions from Unity Catalog
  • Row-level security and column masking are applied
  • Set userContext for user-specific access filtering

3.7 IcebergConnection

| Field      | Type              | Required | Default | Description                      |
|------------|-------------------|----------|---------|----------------------------------|
| type       | string            | Yes      | -       | Catalog type: hive, hadoop, rest |
| uri        | string            | No       | -       | Catalog URI                      |
| warehouse  | string            | No       | -       | Warehouse location               |
| properties | map[string]string | No       | {}      | Additional catalog properties    |

3.8 DeltaConnection

| Field      | Type              | Required | Default | Description               |
|------------|-------------------|----------|---------|---------------------------|
| type       | string            | Yes      | -       | Catalog type: glue, hive  |
| properties | map[string]string | No       | {}      | Delta-specific properties |

3.9 UserContext

| Field     | Type   | Required | Description                       |
|-----------|--------|----------|-----------------------------------|
| userName  | string | No       | Username for governance filtering |
| userEmail | string | No       | Email for governance filtering    |

4. Example Manifests

4.1 AWS Glue Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: data-lake
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod
  isDefault: true

  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1
        # catalogId: "123456789012"  # Optional, defaults to current account

4.2 Hive Metastore Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: hive-warehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: HIVE
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      hiveConnection:
        host: hive-metastore.data-platform.svc.cluster.local
        port: 9083
        catalog: default

  # Include only specific schemas
  schemas:
    - sales
    - marketing
    - finance

  # Include specific tables per schema
  tables:
    sales:
      - orders
      - customers
      - products
    marketing:
      - "*"  # All tables in marketing
    finance:
      - transactions
      - accounts

4.3 Databricks Unity Catalog

Basic Unity Catalog Setup:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-prod
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Your Databricks workspace URL (without https://)
        host: adb-1234567890.azuredatabricks.net
        # Always 443 for Databricks
        port: 443
        # Personal Access Token from Databricks
        bearerToken: dapi1234567890abcdef
        # The Unity Catalog name (visible in Databricks Catalog Explorer)
        catalogName: main

  # Optional: Governance integration for user-specific access
  userContext:
    userName: analytics-service
    userEmail: analytics@example.com

Unity Catalog with Schema Filtering:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: unity-filtered
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod
  isDefault: true  # Make this the default catalog for queries

  connectionMetadata:
    catalogConnection:
      unityConnection:
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_your_token_here
        catalogName: production_catalog

  # Only include specific schemas from Unity Catalog
  schemas:
    - gold        # Curated data
    - silver      # Cleaned data
    - bronze      # Raw data
    # Excludes: staging, temp, dev schemas

  # Table filtering per schema
  tables:
    gold:
      - customers
      - orders
      - products
    silver:
      - "*"  # All tables in silver
    bronze:
      - events  # Only events table from bronze

Unity Catalog with Azure Databricks:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: azure-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # Azure Databricks workspace URL
        host: adb-123456789012345.6.azuredatabricks.net
        port: 443
        bearerToken: dapi_azure_token_12345
        catalogName: main

  # For Azure: Ensure MetadataServices has ADLS access configured
  # via Azure Workload Identity or storage account keys

Unity Catalog with AWS Databricks:

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: aws-unity
  namespace: workspace-analytics-prod
spec:
  catalogType: UNITY
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      unityConnection:
        # AWS Databricks workspace URL
        host: my-workspace.cloud.databricks.com
        port: 443
        bearerToken: dapi_aws_token_67890
        catalogName: main

  # For AWS: Ensure MetadataServices has S3 access configured
  # via IRSA (IAM Roles for Service Accounts)

Important Notes for Unity Catalog:

  1. Token Generation:
     • Go to the Databricks workspace → Settings → Developer → Access Tokens
     • Create a new token with sufficient expiration
     • Store the token securely (consider using Kubernetes Secrets; see the sketch below)

  2. Required Permissions:
     • USE CATALOG on the catalog
     • USE SCHEMA on schemas you want to access
     • SELECT on tables you want to query

  3. Data Access:
     • e6data queries the underlying data directly (S3/ADLS/GCS)
     • Ensure the MetadataServices ServiceAccount has cloud storage access
     • Unity Catalog handles metadata; e6data handles data reading

  4. Best Practices:
     • Use service accounts/tokens for production
     • Rotate tokens regularly
     • Filter to only the necessary schemas/tables
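
The spec reference above defines unityConnection.bearerToken as a plain string, so the CRD does not resolve Secret references itself. A common pattern is to keep the token in a Kubernetes Secret and template it into the manifest at deploy time (e.g., with Helm or Kustomize). A minimal sketch, assuming your deployment tooling performs the substitution:

# Keep the PAT out of version control
# (the E6Catalog manifest still needs the literal value templated in)
kubectl create secret generic databricks-pat \
  --namespace workspace-analytics-prod \
  --from-literal=token=dapi_your_token_here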

4.4 Apache Iceberg Catalog (REST)

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: iceberg-lakehouse
  namespace: workspace-analytics-prod
spec:
  catalogType: ICEBERG
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      icebergConnection:
        type: rest
        uri: https://iceberg-rest-catalog.example.com
        warehouse: s3://data-lake/warehouse
        properties:
          credential: "client-id:client-secret"
          scope: "catalog"

4.5 Delta Lake Catalog

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: delta-tables
  namespace: workspace-analytics-prod
spec:
  catalogType: DELTA
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      deltaConnection:
        type: glue
        properties:
          "spark.sql.catalog.delta.type": "glue"

  # Column-level filtering
  columns:
    sales:
      customers:
        - id
        - name
        - email  # Only listed columns are exposed; PII like SSN and phone is excluded
      orders:
        - "*"  # All columns

4.6 Fine-Grained Filtering Example

apiVersion: e6data.io/v1alpha1
kind: E6Catalog
metadata:
  name: filtered-catalog
  namespace: workspace-analytics-prod
spec:
  catalogType: GLUE
  metadataServicesRef: analytics-prod

  connectionMetadata:
    catalogConnection:
      glueConnection:
        region: us-east-1

  # Include only these schemas
  schemas:
    - public_data
    - analytics
    - reports

  # Table filtering per schema
  tables:
    public_data:
      - "*"  # All tables
    analytics:
      - user_activity
      - page_views
      - conversions
    reports:
      - daily_summary
      - weekly_metrics

  # Column filtering per table
  columns:
    analytics:
      user_activity:
        - timestamp
        - event_type
        - user_id  # Only listed columns are exposed; raw IP and user_agent are excluded
      page_views:
        - "*"

5. Status & Lifecycle

5.1 Status Fields

| Field                  | Type            | Description                     |
|------------------------|-----------------|---------------------------------|
| phase                  | string          | Current lifecycle phase         |
| operationStatus        | OperationStatus | Current async operation details |
| activeStorageService   | string          | Storage service being used      |
| storageServiceEndpoint | string          | Full HTTP endpoint              |
| catalogDetails         | CatalogDetails  | Catalog info from the API       |
| lastRefreshTime        | Time            | Last successful refresh         |
| conditions             | []Condition     | Detailed status conditions      |
| observedGeneration     | int64           | Last observed spec generation   |

5.2 Phase Values

| Phase      | Description                         |
|------------|-------------------------------------|
| Waiting    | Waiting for MetadataServices        |
| Creating   | Create operation in progress        |
| Ready      | Catalog registered and operational  |
| Updating   | Update operation in progress        |
| Refreshing | Refresh operation in progress       |
| Deleting   | Delete operation in progress        |
| Failed     | Operation failed                    |
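
To follow these phase transitions live while the operator reconciles, watch the resource:

# Watch phase changes as the catalog is registered
kubectl get e6cat data-lake -w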

5.3 Operation Status

operationStatus:
  operation: create
  status: partial_success
  message: "Some tables failed to sync"
  startTime: "2024-01-15T10:00:00Z"
  lastUpdated: "2024-01-15T10:05:00Z"
  totalDBsRefreshed: 5
  totalTablesRefreshed: 150
  diagnosticsFilePath: "s3://bucket/diagnostics/catalog-create-2024-01-15.json"
  failures:
    - type: table
      name: sales.broken_table
      reason: "Schema inference failed: unsupported column type"
    - type: table
      name: analytics.corrupt_data
      reason: "Unable to read partition metadata"

5.4 Partial Success Handling

The operator handles three operation outcomes:

| Status          | Phase  | Meaning                               |
|-----------------|--------|---------------------------------------|
| success         | Ready  | All tables/databases synced           |
| partial_success | Ready  | Catalog usable; some items failed     |
| failed          | Failed | Complete failure; catalog not usable  |

Partial success means the catalog is operational but some tables could not be synced. Check the failures array for details, for example with the command below.
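
A quick way to summarize what failed (assumes jq is installed):

# Print one "table: reason" line per failure
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.failures}' \
  | jq -r '.[] | "\(.name): \(.reason)"'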


6. Dependencies

| CRD              | Relationship                                      |
|------------------|---------------------------------------------------|
| MetadataServices | Required - provides the storage service endpoint  |

Referencing CRDs

| CRD                    | Reference Field        |
|------------------------|------------------------|
| CatalogRefresh         | spec.e6CatalogRef.name |
| CatalogRefreshSchedule | spec.e6CatalogRef.name |
| Governance             | spec.catalogName       |
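
For context, a minimal sketch of a CatalogRefresh pointing at this catalog. Only spec.e6CatalogRef.name is taken from the table above; the rest of the shape is an assumption:

apiVersion: e6data.io/v1alpha1
kind: CatalogRefresh
metadata:
  name: data-lake-refresh
  namespace: workspace-analytics-prod
spec:
  e6CatalogRef:
    name: data-lake   # must match the E6Catalog name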

7. Troubleshooting

7.1 Common Issues

Catalog Stuck in "Creating"

Symptoms:

$ kubectl get e6cat
NAME        TYPE   PHASE      SERVICE
data-lake   GLUE   Creating   analytics-prod-storage

Causes:

  1. Storage service not responding
  2. Network connectivity issues
  3. Catalog source (Glue/Hive) unreachable

Checks:

# Check operation status
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}'

# Verify storage service is running
kubectl get pods -l app.kubernetes.io/name=storage

# Check storage service logs
kubectl logs -l app.kubernetes.io/name=storage --tail=100 | grep -i catalog

Partial Success with Failures

Symptoms:

$ kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.failures}'

Resolution:

  1. Check the failures array for specific issues
  2. Download the full diagnostics from diagnosticsFilePath
  3. Fix source catalog issues (permissions, corrupt tables)
  4. Trigger a manual refresh

# Get diagnostics file path
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus.diagnosticsFilePath}'

# Download and inspect (example for S3)
aws s3 cp s3://bucket/diagnostics/catalog-create-2024-01-15.json - | jq

Connection Refused to Hive Metastore

Symptoms: Phase Failed with connection refused error.

Checks:

# Verify Hive metastore is accessible
kubectl run -it --rm debug --image=busybox -- nc -zv hive-metastore.data-platform 9083

# Check DNS resolution
kubectl run -it --rm debug --image=busybox -- nslookup hive-metastore.data-platform

AWS Glue Access Denied

Symptoms: Phase Failed with AWS authorization error.

Checks:

# Verify IAM permissions on storage service SA
kubectl get sa analytics-prod -o yaml

# Check IRSA annotation
kubectl get sa analytics-prod -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'

# Test Glue access from storage pod
kubectl exec -it analytics-prod-storage-blue-xxx -- aws glue get-databases

7.2 Useful Commands

# Get catalog status
kubectl get e6cat data-lake -o yaml

# Check all catalogs in namespace
kubectl get e6cat

# View operation details
kubectl get e6cat data-lake -o jsonpath='{.status.operationStatus}' | jq

# Get catalog details from API
kubectl get e6cat data-lake -o jsonpath='{.status.catalogDetails}' | jq

# Force recreation (delete and recreate)
kubectl delete e6cat data-lake
kubectl apply -f catalog.yaml

# Check storage service endpoint being used
kubectl get e6cat data-lake -o jsonpath='{.status.storageServiceEndpoint}'

8. Validation Webhooks

E6Catalog enforces 20+ validation checks via admission webhooks; the most common include:

| Check                       | Error Message                                                 |
|-----------------------------|---------------------------------------------------------------|
| Missing catalogType         | spec.catalogType is required                                  |
| Invalid catalogType         | spec.catalogType must be HIVE, GLUE, UNITY, ICEBERG, or DELTA |
| Missing metadataServicesRef | spec.metadataServicesRef is required                          |
| Missing connection for type | hiveConnection is required when catalogType is HIVE           |
| Invalid Hive port           | hiveConnection.port must be between 1 and 65535               |
| Missing Glue region         | glueConnection.region is required                             |
| Missing Unity token         | unityConnection.bearerToken is required                       |
| Invalid Iceberg type        | icebergConnection.type must be hive, hadoop, or rest          |
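
To surface these errors before the resource is persisted, run a server-side dry run, which exercises the validating webhooks:

# Validate the manifest against the admission webhooks without creating anything
kubectl apply --dry-run=server -f catalog.yaml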