MetadataServices

API Version: e6data.io/v1alpha1
Kind: MetadataServices
Short Names: mds, metadata


1. Purpose

MetadataServices manages the storage service and schema service components of the e6data analytics platform. These services are responsible for:

  • Storage Service: Handles table metadata caching, partition discovery, and data file location resolution across cloud object stores (S3, GCS, Azure Blob)
  • Schema Service: Provides schema inference, column statistics, and metadata for query optimization

Create a MetadataServices resource when setting up a new e6data workspace. This is typically the first CRD you deploy after NamespaceConfig, as QueryService and E6Catalog depend on it.

Note: Infrastructure settings (cloud, storage backend, tolerations, node selectors, image pull secrets) are now managed by NamespaceConfig. MetadataServices inherits these settings automatically.


2. High-level Behavior

When you create a MetadataServices CR, the operator:

  1. Inherits infrastructure settings from NamespaceConfig in the same namespace
  2. Creates ConfigMaps with auto-populated configuration variables (CLOUD, WORKSPACE, E6_BUCKET, etc.)
  3. Deploys Storage Service (primary, and optionally secondary for HA)
  4. Deploys Schema Service for schema inference
  5. Creates Services (ClusterIP) for internal access
  6. Implements Blue-Green deployment for zero-downtime updates
  7. Tracks release history (last 10 releases) for rollback support

Prerequisites

  • NamespaceConfig must exist in the same namespace (provides cloud, storage backend, scheduling config)

Child Resources Created

| Resource Type | Name Pattern | Purpose |
| --- | --- | --- |
| Deployment | {name}-storage-{blue\|green} | Storage service pods |
| Deployment | {name}-storage-secondary-{blue\|green} | Secondary storage (if HA enabled) |
| Deployment | {name}-schema-{blue\|green} | Schema service pods |
| ConfigMap | {name}-storage-config-{blue\|green} | Storage config.properties |
| ConfigMap | {name}-schema-config-{blue\|green} | Schema config.properties |
| ConfigMap | {name}-common-config | Active strategy routing |
| Secret | {name}-common-secret | Shared secrets |
| Service | {name}-storage | Storage service endpoint |
| Service | {name}-storage-secondary | Secondary storage endpoint |
| Service | {name}-schema | Schema service endpoint |
| ServiceAccount | {workspace} | Pod identity (if autoCreateRBAC) |

External Dependencies

  • Object Storage: S3, GCS, or Azure Blob bucket (specified in storageBackend)
  • IAM/Workload Identity: Service account with read access to data lake
  • Kubernetes: 1.24+ recommended
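
On AWS, the IAM/workload-identity requirement is commonly met by annotating the data ServiceAccount with an IAM role via EKS IRSA. A minimal sketch — the role ARN and account ID are illustrative placeholders, not values from this document:

```yaml
# Sketch: grant data-lake read access to pods via EKS IRSA.
# The role ARN below is a placeholder; the ServiceAccount name must match
# serviceAccounts.data in NamespaceConfig.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: analytics-prod-sa
  namespace: workspace-analytics-prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/e6data-data-lake-read
```

On GCP and Azure the equivalent is Workload Identity / workload identity federation, configured on the same ServiceAccount.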

3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| workspace | string | No | CR name | Workspace name (used for namespacing and node scheduling) |
| tenant | string | Yes | - | Tenant identifier (customer/organization ID) |
| releaseVersion | string | No | Auto-generated | Version identifier for tracking releases |
| storage | StorageSpec | No | See defaults | Storage service configuration |
| schema | SchemaSpec | No | See defaults | Schema service configuration |
| podAnnotations | map[string]string | No | {} | Annotations for all pods (Prometheus scraping, etc.) |
| governance | GovernanceSpec | No | disabled | Data governance configuration |

Inherited from NamespaceConfig

The following fields are inherited from NamespaceConfig and are no longer specified in MetadataServices:

| Field | Description |
| --- | --- |
| cloud | Cloud provider (AWS, GCP, AZURE) |
| storageBackend | Object storage path (s3a://, gs://, abfs://) |
| s3Endpoint | Custom S3 endpoint for S3-compatible storage |
| imageRepository | Container registry path |
| imagePullSecrets | Secrets for private registries |
| tolerations | Pod tolerations for scheduling |
| nodeSelector | Node labels for pod placement |
| affinity | Advanced scheduling rules |
| karpenterNodePool | Karpenter NodePool name |
| serviceAccount | ServiceAccount for pods (via serviceAccounts.data) |

3.2 Storage (StorageSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| imageTag | string | Yes | - | Image tag/version (e.g., 3.0.217) |
| replicas | int32 | No | 1 | Number of storage pods |
| resources | ResourceSpec | No | - | CPU/memory limits |
| ports | PortSpec | No | See defaults | Service ports |
| environmentVariables | map[string]string | No | {} | Container environment variables |
| configVariables | map[string]string | No | {} | config.properties entries |
| ha | HASpec | No | disabled | High availability (secondary storage) |

Auto-populated Environment Variables:

  • IS_KUBE=true
  • POD_NAME, POD_IP, NAMESPACE (from pod metadata)
  • JAVA_TOOL_OPTIONS (auto-calculated Xmx/Xms at 80% of memory)

Auto-populated Config Variables:

  • CLOUD, ALIAS, WORKSPACE, E6_BUCKET
  • STORAGE_SERVICE_HOST, SCHEMA_SERVICE_HOST

3.3 Schema (SchemaSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| imageTag | string | Yes | - | Image tag/version |
| replicas | int32 | No | 1 | Number of schema pods |
| resources | ResourceSpec | No | 30Gi memory, 16 CPU | CPU/memory limits |
| ports | PortSpec | No | See defaults | Service ports |
| environmentVariables | map[string]string | No | {} | Container environment variables |
| configVariables | map[string]string | No | {} | config.properties entries |

3.4 ResourceSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| memory | string | Yes | - | Memory limit (e.g., 8Gi, 30Gi) |
| cpu | string | Yes | - | CPU limit (e.g., 2, 16) |

3.5 PortSpec

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| thrift | int32 | 9005 (storage), 9006 (schema) | Thrift RPC port |
| web | int32 | 8081 | HTTP API port |
| metrics | int32 | 9090 | Prometheus metrics port |

3.6 Governance (GovernanceSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Enable governance integration |
| provider | string | No | ranger | Provider: ranger, unity, lakeformation |
| policyPath | string | No | - | Path in bucket for Ranger policies |
| unity | UnityGovernanceSpec | No | - | Unity Catalog settings |
| lakeFormation | LakeFormationGovernanceSpec | No | - | AWS Lake Formation settings |
| filtering | FilteringSpec | No | all enabled | Catalog/schema/table/column filtering |
| queryRewriting | QueryRewritingSpec | No | enabled | Row-level filtering and column masking |

3.7 HA (HASpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Deploy secondary storage for HA |
| replicas | int32 | No | Primary replicas | Override replica count |
| resources | ResourceSpec | No | Primary resources | Override resources |
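
Since replicas and resources both default to the primary's values, the minimal HA configuration is a single flag. A sketch:

```yaml
storage:
  imageTag: "3.0.217"
  ha:
    enabled: true   # secondary inherits the primary's replicas and resources
```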

3.8 ConfigVariables Reference

ConfigVariables are written to a config.properties file mounted in the container. These configure storage and schema service behavior.
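
As an illustration, each configVariables entry becomes one key=value line in the mounted file (assumed rendering; the values below are arbitrary examples):

```yaml
# CR fragment: MAX_TABLES_TO_REFRESH: "200" is written to config.properties
# as MAX_TABLES_TO_REFRESH=200 (assumed key=value rendering).
storage:
  configVariables:
    MAX_TABLES_TO_REFRESH: "200"
    REFRESH_TIMEOUT_SECONDS: "900"
```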

Auto-Populated Variables (Do Not Override)

The operator sets these values automatically; any values you specify for them in configVariables are ignored:

| Variable | Auto-Value | Description |
| --- | --- | --- |
| CLOUD | cloud (from NamespaceConfig) or auto-detected | Cloud provider (AWS/GCP/AZURE) |
| ALIAS | spec.workspace | Alias (same as workspace) |
| WORKSPACE | spec.workspace | Workspace name |
| E6_BUCKET | storageBackend (from NamespaceConfig) | Object storage path |
| STORAGE_SERVICE_HOST | {name}-storage | Storage service hostname |
| SCHEMA_SERVICE_HOST | {name}-schema | Schema service hostname |

Common Storage ConfigVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| ENABLE_TABLES_BACKGROUND_REFRESH | bool | true | Enable background table refresh |
| BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES | int | 60 | Window for recently accessed tables |
| MAX_TABLES_TO_REFRESH | int | 100 | Max tables per background refresh cycle |
| REFRESH_TIMEOUT_SECONDS | int | 1800 | Refresh operation timeout (30 min) |
| ENABLE_RANGER_AUTH | bool | false | Enable Ranger authorization |
| ENABLE_SCHEMA_AUTHZ | bool | false | Enable schema authorization |
| PERMISSIONS_REFRESH_INTERVAL_SECONDS | int | 30 | Permissions cache refresh interval |
| DELTA_READER_THREADPOOL_SIZE | int | 1000 | Delta reader thread pool size |
| DELTA_SKIP_TABLE_UUID_CHECK | bool | true | Skip Delta table UUID validation |
| DELTA_TABLE_PARTITION_SOFT_REFRESH_DURATION_SECONDS | int | 300 | Delta partition soft refresh |
| ENABLE_ICEBERG_POSITIONAL_DELETES | bool | false | Support Iceberg positional deletes |
| INITIALIZE_TABLES_WITH_PARTITIONS_ON_STARTUP | bool | true | Load partitions on startup |
| IS_128BIT_NUMERIC_SUPPORTED | bool | true | Support 128-bit decimals |
| ENABLE_V2 | bool | true | Enable V2 API |
| FETCH_PERMISSION_FROM_UNITY_CATALOG | bool | false | Fetch permissions from Unity |

Common Schema ConfigVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| SCHEMA_CACHE_TTL_SECONDS | int | 3600 | Schema cache time-to-live |
| MAX_SCHEMA_CACHE_SIZE | int | 10000 | Maximum schemas in cache |
| ENABLE_COLUMN_STATISTICS | bool | true | Collect column statistics |
| STATISTICS_SAMPLE_ROWS | int | 10000 | Rows to sample for statistics |

3.9 EnvironmentVariables Reference

EnvironmentVariables are set as container environment variables.

Auto-Populated Variables (Do Not Override)

The operator automatically sets these values from pod metadata:

| Variable | Auto-Value | Description |
| --- | --- | --- |
| IS_KUBE | "true" | Indicates Kubernetes environment |
| POD_NAME | Pod metadata | Current pod name |
| POD_IP | Pod status | Current pod IP |
| NAMESPACE | Pod metadata | Current namespace |
| JAVA_TOOL_OPTIONS | Auto-calculated | JVM options (80% of memory) |

Note on JAVA_TOOL_OPTIONS: The operator automatically calculates JVM heap settings based on container memory:

  • Xmx and Xms are set to 80% of resources.memory
  • Example: for 30Gi memory → -Xmx24G -Xms24G
  • Includes: -Djava.io.tmpdir=/tmp, OOM exit settings, G1GC config, JMX agent

Common EnvironmentVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| E6_LOGGING_LEVEL | string | E6_INFO | Log level: E6_DEBUG, E6_INFO, E6_WARN, E6_ERROR |
| LOG_FORMAT | string | json | Log format: json, text |
| TZ | string | UTC | Timezone |
| ENABLE_JMX | bool | true | Enable JMX metrics |

Example: Custom Configuration

storage:
  imageTag: "3.0.217"
  resources:
    memory: "30Gi"
    cpu: "16"
  configVariables:
    ENABLE_TABLES_BACKGROUND_REFRESH: "true"
    BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES: "120"
    MAX_TABLES_TO_REFRESH: "200"
    ENABLE_RANGER_AUTH: "true"
    ENABLE_SCHEMA_AUTHZ: "true"
  environmentVariables:
    E6_LOGGING_LEVEL: "E6_DEBUG"

schema:
  imageTag: "3.0.217"
  resources:
    memory: "30Gi"
    cpu: "16"
  configVariables:
    SCHEMA_CACHE_TTL_SECONDS: "7200"
    ENABLE_COLUMN_STATISTICS: "true"
  environmentVariables:
    E6_LOGGING_LEVEL: "E6_INFO"

4. Example Manifests

Important: Before creating MetadataServices, ensure a NamespaceConfig exists in the same namespace with cloud, storage backend, and scheduling settings.

4.1 Minimal Example

# First, create NamespaceConfig
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics-prod
spec:
  storageBackend: s3a://acme-data-lake
---
# Then, create MetadataServices
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-prod
  namespace: workspace-analytics-prod
spec:
  workspace: analytics-prod
  tenant: acme-corp

  storage:
    imageTag: "3.0.217"
    resources:
      memory: "8Gi"
      cpu: "4"

  schema:
    imageTag: "3.0.217"
    resources:
      memory: "16Gi"
      cpu: "8"

4.2 Production Example (Full Configuration)

# NamespaceConfig with all infrastructure settings
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics-prod
spec:
  cloud: AWS
  storageBackend: s3a://acme-data-lake-prod
  imageRepository: us-docker.pkg.dev/e6data-analytics/e6-engine
  imagePullSecrets:
    - e6data-registry-secret
  serviceAccounts:
    data: analytics-prod-sa
  karpenterNodePool: metadata-services
  tolerations:
    - key: "e6data-workspace-name"
      operator: "Equal"
      value: "analytics-prod"
      effect: "NoSchedule"
  nodeSelector:
    e6data-workspace-name: analytics-prod
---
# MetadataServices - now simplified (inherits from NamespaceConfig)
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-prod
  namespace: workspace-analytics-prod
  labels:
    e6data.io/workspace: analytics-prod
    e6data.io/environment: production
spec:
  workspace: analytics-prod
  tenant: acme-corp

  # Storage service configuration
  storage:
    imageTag: "3.0.217"
    replicas: 2
    resources:
      memory: "30Gi"
      cpu: "16"
    ports:
      thrift: 9005
      web: 8081
      metrics: 9090
    environmentVariables:
      E6_LOGGING_LEVEL: "E6_INFO"
    configVariables:
      ENABLE_TABLES_BACKGROUND_REFRESH: "true"
      BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES: "60"
      MAX_TABLES_TO_REFRESH: "100"
      REFRESH_TIMEOUT_SECONDS: "1800"

    # High availability with secondary storage
    ha:
      enabled: true
      replicas: 2
      resources:
        memory: "30Gi"
        cpu: "16"

  # Schema service configuration
  schema:
    imageTag: "3.0.217"
    replicas: 2
    resources:
      memory: "30Gi"
      cpu: "16"

  # Governance configuration
  governance:
    enabled: true
    provider: ranger
    policyPath: "governance/policies"
    filtering:
      catalog: true
      schema: true
      table: true
      column: true
    queryRewriting:
      enabled: true

  # Pod annotations for Prometheus scraping
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/metrics"

4.3 S3-Compatible Storage (Linode/Wasabi/MinIO)

# NamespaceConfig for S3-compatible storage
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics
spec:
  cloud: AWS  # Use AWS for S3-compatible storage
  storageBackend: s3a://my-bucket
  s3Endpoint: https://us-east-1.linodeobjects.com
---
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-linode
  namespace: workspace-analytics
spec:
  workspace: analytics-linode
  tenant: startup-corp

  storage:
    imageTag: "3.0.217"
    resources:
      memory: "8Gi"
      cpu: "4"

  schema:
    imageTag: "3.0.217"
    resources:
      memory: "16Gi"
      cpu: "8"

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
| --- | --- | --- |
| phase | string | Current lifecycle phase |
| message | string | Human-readable status message |
| ready | bool | true when all components are ready |
| storageDeployment | DeploymentStatus | Storage deployment status |
| secondaryStorageDeployment | DeploymentStatus | Secondary storage status (if HA enabled) |
| schemaDeployment | DeploymentStatus | Schema deployment status |
| observedGeneration | int64 | Last observed spec generation |
| activeStrategy | string | Current active deployment (blue or green) |
| activeReleaseVersion | string | Currently running version |
| pendingStrategy | string | Deployment being prepared |
| deploymentPhase | string | Blue-green phase: Stable, Deploying, Switching, Draining, Cleanup |
| releaseHistory | []ReleaseRecord | Last 10 releases for rollback |
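
The strategy-related fields can be read at a glance with jsonpath; a sketch using the analytics-prod CR from the earlier examples:

```shell
# Print the active strategy and current blue-green phase,
# e.g. "blue Stable".
kubectl get mds analytics-prod \
  -o jsonpath='{.status.activeStrategy} {.status.deploymentPhase}'
```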

5.2 Phase Values

| Phase | Description |
| --- | --- |
| Pending | Resource created, waiting to reconcile |
| Creating | Initial deployment in progress |
| Running | All components healthy and serving |
| Updating | Blue-green update in progress |
| Failed | Deployment failed (check conditions) |
| Terminating | Deletion in progress |
| Degraded | Partially healthy (some pods unhealthy) |

5.3 Deployment Phases (Blue-Green)

| Phase | Description |
| --- | --- |
| Stable | Single strategy active, no changes pending |
| Deploying | New strategy being deployed |
| Switching | Traffic switching to new strategy |
| Cleanup | Old strategy resources being removed |

5.4 Conditions

| Type | Description |
| --- | --- |
| Ready | All deployments are ready |
| StorageReady | Storage service is healthy |
| SchemaReady | Schema service is healthy |
| SecondaryStorageReady | Secondary storage is healthy (if HA) |
| Progressing | Reconciliation in progress |
| Available | At least one pod is available |
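
Assuming the controller publishes these as standard metav1 conditions, scripts can block on readiness with kubectl wait; a sketch:

```shell
# Block until the Ready condition is True (or time out after 10 minutes).
kubectl wait metadataservices/analytics-prod \
  --for=condition=Ready --timeout=10m
```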

6. Dependencies

| CRD | Relationship |
| --- | --- |
| NamespaceConfig | Required - provides cloud, storage, scheduling configuration |

CRDs that Reference MetadataServices

| CRD | Reference Field | Relationship |
| --- | --- | --- |
| E6Catalog | spec.metadataServicesRef | Discovers storage service endpoint |
| QueryService | Same namespace | Uses same NamespaceConfig settings |

Labels Applied to Child Resources

app.kubernetes.io/name: {storage|schema|storage-secondary}
app.kubernetes.io/instance: {cr-name}
app.kubernetes.io/component: {storage|schema}
app.kubernetes.io/managed-by: e6-operator
e6data.io/workspace: {workspace}
e6data.io/strategy: {blue|green}

7. Troubleshooting

7.1 Common Issues

Storage Service CrashLoopBackOff

Symptoms:

$ kubectl get pods -l app.kubernetes.io/name=storage
NAME                                    READY   STATUS             RESTARTS
analytics-prod-storage-blue-xxx         0/1     CrashLoopBackOff   5

Possible Causes:

  1. Invalid storageBackend path
  2. Missing IAM permissions for S3/GCS/Azure
  3. Incorrect s3Endpoint for S3-compatible storage
  4. Java heap too large for container memory

Suggested Checks:

# Check pod logs
kubectl logs -l app.kubernetes.io/name=storage --tail=100

# Verify storage backend access
kubectl exec -it analytics-prod-storage-blue-xxx -- aws s3 ls s3://bucket

# Check Java options
kubectl get cm analytics-prod-storage-config-blue -o yaml | grep JAVA_TOOL_OPTIONS

Pods Stuck in Pending

Symptoms:

$ kubectl get pods
NAME                              READY   STATUS
analytics-prod-storage-blue-xxx   0/1     Pending

Possible Causes:

  1. Insufficient cluster resources
  2. NodeSelector/tolerations don't match any nodes
  3. Karpenter provisioner not ready

Suggested Checks:

# Check pod events
kubectl describe pod analytics-prod-storage-blue-xxx

# Check node availability
kubectl get nodes -l e6data-workspace-name=analytics-prod

# Check Karpenter provisioner
kubectl get nodepools

Blue-Green Stuck in Deploying

Symptoms:

$ kubectl get metadataservices
NAME             PHASE      DEPLOYMENT-PHASE
analytics-prod   Updating   Deploying

Possible Causes:

  1. New deployment pods failing health checks
  2. Insufficient resources for new strategy
  3. Image pull failures

Suggested Checks:

# Check both strategies
kubectl get deploy -l e6data.io/workspace=analytics-prod

# Check pending strategy pods
kubectl get pods -l e6data.io/strategy=green

# Force rollback via annotation (if needed)
kubectl annotate metadataservices analytics-prod e6data.io/rollback-to=previous

7.2 Useful Commands

# Get MetadataServices status
kubectl get mds analytics-prod -o yaml

# Watch deployment progress
kubectl get mds -w

# Check all resources created by operator
kubectl get all -l app.kubernetes.io/instance=analytics-prod

# View release history
kubectl get mds analytics-prod -o jsonpath='{.status.releaseHistory[*].version}'

# Trigger manual rollback
kubectl annotate mds analytics-prod e6data.io/rollback-to=v1.0.0

# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator --tail=200

8. Validation Webhooks

MetadataServices is validated by admission webhooks with 30+ checks at apply time:

| Check | Error Message |
| --- | --- |
| Missing workspace | spec.workspace is required |
| Missing tenant | spec.tenant is required |
| Invalid cloud | spec.cloud must be AWS, GCP, or AZURE |
| Invalid storageBackend | spec.storageBackend must start with s3a://, gs://, or abfs:// |
| StorageBackend/cloud mismatch | spec.storageBackend (s3a://) requires cloud=AWS |
| Missing imageTag | spec.storage.imageTag is required |
| "latest" tag in production | WARNING: Using "latest" tag is not recommended |
| Minimum resources | memory must be at least 1Gi |
| Port conflicts | thrift, web, and metrics ports must be different |
| Immutable fields on update | spec.workspace cannot be changed |
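
Because these checks run in admission webhooks, a manifest can be exercised against them without persisting anything by using a server-side dry run; a sketch (the filename is a placeholder):

```shell
# Send the manifest through the validation webhooks without creating
# the resource; webhook errors are reported exactly as on a real apply.
kubectl apply --dry-run=server -f metadataservices.yaml
```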