MetadataServices

API Version: e6data.io/v1alpha1
Kind: MetadataServices
Short Names: mds, metadata


1. Purpose

MetadataServices manages the storage service and schema service components of the e6data analytics platform. These services are responsible for:

  • Storage Service: Handles table metadata caching, partition discovery, and data file location resolution across cloud object stores (S3, GCS, Azure Blob)
  • Schema Service: Provides schema inference, column statistics, and metadata for query optimization

Create a MetadataServices resource when setting up a new e6data workspace. This is typically the first CRD you deploy after NamespaceConfig, as QueryService and E6Catalog depend on it.

Note: Infrastructure settings (cloud, storage backend, tolerations, node selectors, image pull secrets) are now managed by NamespaceConfig. MetadataServices inherits these settings automatically.


2. High-level Behavior

When you create a MetadataServices CR, the operator:

  1. Inherits infrastructure settings from NamespaceConfig in the same namespace
  2. Creates ConfigMaps with auto-populated configuration variables (CLOUD, WORKSPACE, E6_BUCKET, etc.)
  3. Deploys Storage Service (primary, and optionally secondary for HA)
  4. Deploys Schema Service for schema inference
  5. Creates Services (ClusterIP) for internal access
  6. Implements Blue-Green deployment for zero-downtime updates
  7. Tracks release history (last 10 releases) for rollback support

Prerequisites

  • NamespaceConfig must exist in the same namespace (provides cloud, storage backend, scheduling config)

Child Resources Created

| Resource Type | Name Pattern | Purpose |
| --- | --- | --- |
| Deployment | {name}-storage-{blue\|green} | Storage service pods |
| Deployment | {name}-storage-secondary-{blue\|green} | Secondary storage (if HA enabled) |
| Deployment | {name}-schema-{blue\|green} | Schema service pods |
| ConfigMap | {name}-storage-config-{blue\|green} | Storage config.properties |
| ConfigMap | {name}-schema-config-{blue\|green} | Schema config.properties |
| ConfigMap | {name}-common-config | Active strategy routing |
| Secret | {name}-common-secret | Shared secrets |
| Service | {name}-storage | Storage service endpoint |
| Service | {name}-storage-secondary | Secondary storage endpoint |
| Service | {name}-schema | Schema service endpoint |
| ServiceAccount | {workspace} | Pod identity (if autoCreateRBAC) |

External Dependencies

  • Object Storage: S3, GCS, or Azure Blob bucket (specified in storageBackend)
  • IAM/Workload Identity: Service account with read access to data lake
  • Kubernetes: 1.24+ recommended
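
On AWS, the IAM/workload-identity requirement is commonly met by annotating the data ServiceAccount with an IAM role via EKS IRSA. A minimal sketch — the role ARN and account ID are illustrative placeholders, not values from this document:

```yaml
# Sketch: grant data-lake read access to pods via EKS IRSA.
# The role ARN below is a placeholder; the ServiceAccount name must match
# serviceAccounts.data in NamespaceConfig.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: analytics-prod-sa
  namespace: workspace-analytics-prod
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/e6data-data-lake-read
```

On GCP and Azure the equivalent is Workload Identity / workload identity federation, configured on the same ServiceAccount.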

3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| workspace | string | No | CR name | Workspace name (used for namespacing and node scheduling) |
| tenant | string | Yes | - | Tenant identifier (customer/organization ID) |
| releaseVersion | string | No | Auto-generated | Version identifier for tracking releases |
| storage | StorageSpec | No | See defaults | Storage service configuration |
| schema | SchemaSpec | No | See defaults | Schema service configuration |
| podAnnotations | map[string]string | No | {} | Annotations for all pods (Prometheus scraping, etc.) |
| governance | GovernanceSpec | No | disabled | Data governance configuration |

Inherited from NamespaceConfig

The following fields are inherited from NamespaceConfig and are no longer specified in MetadataServices:

| Field | Description |
| --- | --- |
| cloud | Cloud provider (AWS, GCP, AZURE) |
| storageBackend | Object storage path (s3a://, gs://, abfs://) |
| s3Endpoint | Custom S3 endpoint for S3-compatible storage |
| imageRepository | Container registry path |
| imagePullSecrets | Secrets for private registries |
| tolerations | Pod tolerations for scheduling |
| nodeSelector | Node labels for pod placement |
| affinity | Advanced scheduling rules |
| karpenterNodePool | Karpenter NodePool name |
| serviceAccount | ServiceAccount for pods (via serviceAccounts.data) |

3.2 Storage (StorageSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| imageTag | string | Yes | - | Image tag/version (e.g., 3.0.217) |
| replicas | int32 | No | 1 | Number of storage pods |
| resources | ResourceSpec | No | - | CPU/memory limits |
| ports | PortSpec | No | See defaults | Service ports |
| environmentVariables | map[string]string | No | {} | Container environment variables |
| configVariables | map[string]string | No | {} | config.properties entries |
| ha | HASpec | No | disabled | High availability (secondary storage) |

Auto-populated Environment Variables:

  • IS_KUBE=true
  • POD_NAME, POD_IP, NAMESPACE (from pod metadata)
  • JAVA_TOOL_OPTIONS (auto-calculated Xmx/Xms at 80% of memory)

Auto-populated Config Variables:

  • CLOUD, ALIAS, WORKSPACE, E6_BUCKET
  • STORAGE_SERVICE_HOST, SCHEMA_SERVICE_HOST

3.3 Schema (SchemaSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| imageTag | string | Yes | - | Image tag/version |
| replicas | int32 | No | 1 | Number of schema pods |
| resources | ResourceSpec | No | 30Gi memory, 16 CPU | CPU/memory limits |
| ports | PortSpec | No | See defaults | Service ports |
| environmentVariables | map[string]string | No | {} | Container environment variables |
| configVariables | map[string]string | No | {} | config.properties entries |

3.4 ResourceSpec

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| memory | string | Yes | - | Memory limit (e.g., 8Gi, 30Gi) |
| cpu | string | Yes | - | CPU limit (e.g., 2, 16) |

3.5 PortSpec

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| thrift | int32 | 9005 (storage), 9006 (schema) | Thrift RPC port |
| web | int32 | 8081 | HTTP API port |
| metrics | int32 | 9090 | Prometheus metrics port |

3.6 Governance (GovernanceSpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Enable governance integration |
| provider | string | No | ranger | Provider: ranger, unity, lakeformation |
| policyPath | string | No | - | Path in bucket for Ranger policies |
| unity | UnityGovernanceSpec | No | - | Unity Catalog settings |
| lakeFormation | LakeFormationGovernanceSpec | No | - | AWS Lake Formation settings |
| filtering | FilteringSpec | No | all enabled | Catalog/schema/table/column filtering |
| queryRewriting | QueryRewritingSpec | No | enabled | Row-level filtering and column masking |

3.7 HA (HASpec)

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enabled | bool | No | false | Deploy secondary storage for HA |
| replicas | int32 | No | Primary replicas | Override replica count |
| resources | ResourceSpec | No | Primary resources | Override resources |
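
Since replicas and resources both default to the primary's values, the minimal HA configuration is a single flag. A sketch:

```yaml
storage:
  imageTag: "3.0.217"
  ha:
    enabled: true   # secondary inherits the primary's replicas and resources
```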

3.8 ConfigVariables Reference

ConfigVariables are written to a config.properties file mounted in the container. These configure storage and schema service behavior.
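
As an illustration, each configVariables entry becomes one key=value line in the mounted file (assumed rendering; the values below are arbitrary examples):

```yaml
# CR fragment: MAX_TABLES_TO_REFRESH: "200" is written to config.properties
# as MAX_TABLES_TO_REFRESH=200 (assumed key=value rendering).
storage:
  configVariables:
    MAX_TABLES_TO_REFRESH: "200"
    REFRESH_TIMEOUT_SECONDS: "900"
```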

Auto-Populated Variables (Do Not Override)

The operator sets these values automatically; any values you specify for them in configVariables are ignored:

| Variable | Auto-Value | Description |
| --- | --- | --- |
| CLOUD | cloud (from NamespaceConfig) or auto-detected | Cloud provider (AWS/GCP/AZURE) |
| ALIAS | spec.workspace | Alias (same as workspace) |
| WORKSPACE | spec.workspace | Workspace name |
| E6_BUCKET | storageBackend (from NamespaceConfig) | Object storage path |
| STORAGE_SERVICE_HOST | {name}-storage | Storage service hostname |
| SCHEMA_SERVICE_HOST | {name}-schema | Schema service hostname |

Common Storage ConfigVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| ENABLE_TABLES_BACKGROUND_REFRESH | bool | true | Enable background table refresh |
| BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES | int | 60 | Window for recently accessed tables |
| MAX_TABLES_TO_REFRESH | int | 100 | Max tables per background refresh cycle |
| REFRESH_TIMEOUT_SECONDS | int | 1800 | Refresh operation timeout (30 min) |
| ENABLE_RANGER_AUTH | bool | false | Enable Ranger authorization |
| ENABLE_SCHEMA_AUTHZ | bool | false | Enable schema authorization |
| PERMISSIONS_REFRESH_INTERVAL_SECONDS | int | 30 | Permissions cache refresh interval |
| DELTA_READER_THREADPOOL_SIZE | int | 1000 | Delta reader thread pool size |
| DELTA_SKIP_TABLE_UUID_CHECK | bool | true | Skip Delta table UUID validation |
| DELTA_TABLE_PARTITION_SOFT_REFRESH_DURATION_SECONDS | int | 300 | Delta partition soft refresh |
| ENABLE_ICEBERG_POSITIONAL_DELETES | bool | false | Support Iceberg positional deletes |
| INITIALIZE_TABLES_WITH_PARTITIONS_ON_STARTUP | bool | true | Load partitions on startup |
| IS_128BIT_NUMERIC_SUPPORTED | bool | true | Support 128-bit decimals |
| ENABLE_V2 | bool | true | Enable V2 API |
| FETCH_PERMISSION_FROM_UNITY_CATALOG | bool | false | Fetch permissions from Unity |

Common Schema ConfigVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| SCHEMA_CACHE_TTL_SECONDS | int | 3600 | Schema cache time-to-live |
| MAX_SCHEMA_CACHE_SIZE | int | 10000 | Maximum schemas in cache |
| ENABLE_COLUMN_STATISTICS | bool | true | Collect column statistics |
| STATISTICS_SAMPLE_ROWS | int | 10000 | Rows to sample for statistics |

3.9 EnvironmentVariables Reference

EnvironmentVariables are set as container environment variables.

Auto-Populated Variables (Do Not Override)

The operator automatically sets these values from pod metadata:

| Variable | Auto-Value | Description |
| --- | --- | --- |
| IS_KUBE | "true" | Indicates Kubernetes environment |
| POD_NAME | Pod metadata | Current pod name |
| POD_IP | Pod status | Current pod IP |
| NAMESPACE | Pod metadata | Current namespace |
| JAVA_TOOL_OPTIONS | Auto-calculated | JVM options (80% of memory) |

Note on JAVA_TOOL_OPTIONS: The operator automatically calculates JVM heap settings based on container memory:

  • Xmx and Xms are set to 80% of resources.memory
  • Example: for 30Gi memory → -Xmx24G -Xms24G
  • Includes: -Djava.io.tmpdir=/tmp, OOM exit settings, G1GC config, JMX agent

Common EnvironmentVariables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| E6_LOGGING_LEVEL | string | E6_INFO | Log level: E6_DEBUG, E6_INFO, E6_WARN, E6_ERROR |
| LOG_FORMAT | string | json | Log format: json, text |
| TZ | string | UTC | Timezone |
| ENABLE_JMX | bool | true | Enable JMX metrics |

Example: Custom Configuration

storage:
  imageTag: "3.0.217"
  resources:
    memory: "30Gi"
    cpu: "16"
  configVariables:
    ENABLE_TABLES_BACKGROUND_REFRESH: "true"
    BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES: "120"
    MAX_TABLES_TO_REFRESH: "200"
    ENABLE_RANGER_AUTH: "true"
    ENABLE_SCHEMA_AUTHZ: "true"
  environmentVariables:
    E6_LOGGING_LEVEL: "E6_DEBUG"

schema:
  imageTag: "3.0.217"
  resources:
    memory: "30Gi"
    cpu: "16"
  configVariables:
    SCHEMA_CACHE_TTL_SECONDS: "7200"
    ENABLE_COLUMN_STATISTICS: "true"
  environmentVariables:
    E6_LOGGING_LEVEL: "E6_INFO"

4. Example Manifests

Important: Before creating MetadataServices, ensure a NamespaceConfig exists in the same namespace with cloud, storage backend, and scheduling settings.

4.1 Minimal Example

# First, create NamespaceConfig
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics-prod
spec:
  storageBackend: s3a://acme-data-lake
---
# Then, create MetadataServices
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-prod
  namespace: workspace-analytics-prod
spec:
  workspace: analytics-prod
  tenant: acme-corp

  storage:
    imageTag: "3.0.217"
    resources:
      memory: "8Gi"
      cpu: "4"

  schema:
    imageTag: "3.0.217"
    resources:
      memory: "16Gi"
      cpu: "8"

4.2 Production Example (Full Configuration)

# NamespaceConfig with all infrastructure settings
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics-prod
spec:
  cloud: AWS
  storageBackend: s3a://acme-data-lake-prod
  imageRepository: us-docker.pkg.dev/e6data-analytics/e6-engine
  imagePullSecrets:
    - e6data-registry-secret
  serviceAccounts:
    data: analytics-prod-sa
  karpenterNodePool: metadata-services
  tolerations:
    - key: "e6data-workspace-name"
      operator: "Equal"
      value: "analytics-prod"
      effect: "NoSchedule"
  nodeSelector:
    e6data-workspace-name: analytics-prod
---
# MetadataServices - now simplified (inherits from NamespaceConfig)
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-prod
  namespace: workspace-analytics-prod
  labels:
    e6data.io/workspace: analytics-prod
    e6data.io/environment: production
spec:
  workspace: analytics-prod
  tenant: acme-corp

  # Storage service configuration
  storage:
    imageTag: "3.0.217"
    replicas: 2
    resources:
      memory: "30Gi"
      cpu: "16"
    ports:
      thrift: 9005
      web: 8081
      metrics: 9090
    environmentVariables:
      E6_LOGGING_LEVEL: "E6_INFO"
    configVariables:
      ENABLE_TABLES_BACKGROUND_REFRESH: "true"
      BACKGROUND_REFRESH_TABLE_ACCESS_WINDOW_MINUTES: "60"
      MAX_TABLES_TO_REFRESH: "100"
      REFRESH_TIMEOUT_SECONDS: "1800"

    # High availability with secondary storage
    ha:
      enabled: true
      replicas: 2
      resources:
        memory: "30Gi"
        cpu: "16"

  # Schema service configuration
  schema:
    imageTag: "3.0.217"
    replicas: 2
    resources:
      memory: "30Gi"
      cpu: "16"

  # Governance configuration
  governance:
    enabled: true
    provider: ranger
    policyPath: "governance/policies"
    filtering:
      catalog: true
      schema: true
      table: true
      column: true
    queryRewriting:
      enabled: true

  # Pod annotations for Prometheus scraping
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/metrics"

4.3 S3-Compatible Storage (Linode/Wasabi/MinIO)

# NamespaceConfig for S3-compatible storage
apiVersion: e6data.io/v1alpha1
kind: NamespaceConfig
metadata:
  name: config
  namespace: workspace-analytics
spec:
  cloud: AWS  # Use AWS for S3-compatible storage
  storageBackend: s3a://my-bucket
  s3Endpoint: https://us-east-1.linodeobjects.com
---
apiVersion: e6data.io/v1alpha1
kind: MetadataServices
metadata:
  name: analytics-linode
  namespace: workspace-analytics
spec:
  workspace: analytics-linode
  tenant: startup-corp

  storage:
    imageTag: "3.0.217"
    resources:
      memory: "8Gi"
      cpu: "4"

  schema:
    imageTag: "3.0.217"
    resources:
      memory: "16Gi"
      cpu: "8"

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
| --- | --- | --- |
| phase | string | Current lifecycle phase |
| message | string | Human-readable status message |
| ready | bool | true when all components are ready |
| storageDeployment | DeploymentStatus | Storage deployment status |
| secondaryStorageDeployment | DeploymentStatus | Secondary storage status (if HA enabled) |
| schemaDeployment | DeploymentStatus | Schema deployment status |
| observedGeneration | int64 | Last observed spec generation |
| activeStrategy | string | Current active deployment (blue or green) |
| activeReleaseVersion | string | Currently running version |
| pendingStrategy | string | Deployment being prepared |
| deploymentPhase | string | Blue-green phase: Stable, Deploying, Switching, Draining, Cleanup |
| releaseHistory | []ReleaseRecord | Last 10 releases for rollback |
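
The strategy-related fields can be read at a glance with jsonpath; a sketch using the analytics-prod CR from the earlier examples:

```shell
# Print the active strategy and current blue-green phase,
# e.g. "blue Stable".
kubectl get mds analytics-prod \
  -o jsonpath='{.status.activeStrategy} {.status.deploymentPhase}'
```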

5.2 Phase Values

| Phase | Description |
| --- | --- |
| Pending | Resource created, waiting to reconcile |
| Creating | Initial deployment in progress |
| Running | All components healthy and serving |
| Updating | Blue-green update in progress |
| Failed | Deployment failed (check conditions) |
| Terminating | Deletion in progress |
| Degraded | Partially healthy (some pods unhealthy) |

5.3 Deployment Phases (Blue-Green)

| Phase | Description |
| --- | --- |
| Stable | Single strategy active, no changes pending |
| Deploying | New strategy being deployed |
| Switching | Traffic switching to new strategy |
| Cleanup | Old strategy resources being removed |

5.4 Conditions

| Type | Description |
| --- | --- |
| Ready | All deployments are ready |
| StorageReady | Storage service is healthy |
| SchemaReady | Schema service is healthy |
| SecondaryStorageReady | Secondary storage is healthy (if HA) |
| Progressing | Reconciliation in progress |
| Available | At least one pod is available |
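
Assuming the controller publishes these as standard metav1 conditions, scripts can block on readiness with kubectl wait; a sketch:

```shell
# Block until the Ready condition is True (or time out after 10 minutes).
kubectl wait metadataservices/analytics-prod \
  --for=condition=Ready --timeout=10m
```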

6. Dependencies

| CRD | Relationship |
| --- | --- |
| NamespaceConfig | Required - provides cloud, storage, scheduling configuration |

CRDs that Reference MetadataServices

| CRD | Reference Field | Relationship |
| --- | --- | --- |
| E6Catalog | spec.metadataServicesRef | Discovers storage service endpoint |
| QueryService | Same namespace | Uses same NamespaceConfig settings |

Labels Applied to Child Resources

app.kubernetes.io/name: {storage|schema|storage-secondary}
app.kubernetes.io/instance: {cr-name}
app.kubernetes.io/component: {storage|schema}
app.kubernetes.io/managed-by: e6-operator
e6data.io/workspace: {workspace}
e6data.io/strategy: {blue|green}

7. Troubleshooting

7.1 Common Issues

Storage Service CrashLoopBackOff

Symptoms:

$ kubectl get pods -l app.kubernetes.io/name=storage
NAME                                    READY   STATUS             RESTARTS
analytics-prod-storage-blue-xxx         0/1     CrashLoopBackOff   5

Possible Causes:

  1. Invalid storageBackend path
  2. Missing IAM permissions for S3/GCS/Azure
  3. Incorrect s3Endpoint for S3-compatible storage
  4. Java heap too large for container memory

Suggested Checks:

# Check pod logs
kubectl logs -l app.kubernetes.io/name=storage --tail=100

# Verify storage backend access
kubectl exec -it analytics-prod-storage-blue-xxx -- aws s3 ls s3://bucket

# Check Java options
kubectl get cm analytics-prod-storage-config-blue -o yaml | grep JAVA_TOOL_OPTIONS

Pods Stuck in Pending

Symptoms:

$ kubectl get pods
NAME                              READY   STATUS
analytics-prod-storage-blue-xxx   0/1     Pending

Possible Causes:

  1. Insufficient cluster resources
  2. NodeSelector/tolerations don't match any nodes
  3. Karpenter provisioner not ready

Suggested Checks:

# Check pod events
kubectl describe pod analytics-prod-storage-blue-xxx

# Check node availability
kubectl get nodes -l e6data-workspace-name=analytics-prod

# Check Karpenter provisioner
kubectl get nodepools

Blue-Green Stuck in Deploying

Symptoms:

$ kubectl get metadataservices
NAME             PHASE      DEPLOYMENT-PHASE
analytics-prod   Updating   Deploying

Possible Causes:

  1. New deployment pods failing health checks
  2. Insufficient resources for new strategy
  3. Image pull failures

Suggested Checks:

# Check both strategies
kubectl get deploy -l e6data.io/workspace=analytics-prod

# Check pending strategy pods
kubectl get pods -l e6data.io/strategy=green

# Force rollback via annotation (if needed)
kubectl annotate metadataservices analytics-prod e6data.io/rollback-to=previous

7.2 Useful Commands

# Get MetadataServices status
kubectl get mds analytics-prod -o yaml

# Watch deployment progress
kubectl get mds -w

# Check all resources created by operator
kubectl get all -l app.kubernetes.io/instance=analytics-prod

# View release history
kubectl get mds analytics-prod -o jsonpath='{.status.releaseHistory[*].version}'

# Trigger manual rollback
kubectl annotate mds analytics-prod e6data.io/rollback-to=v1.0.0

# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator --tail=200

8. Validation Webhooks

MetadataServices is validated by admission webhooks with 30+ checks at apply time:

| Check | Error Message |
| --- | --- |
| Missing workspace | spec.workspace is required |
| Missing tenant | spec.tenant is required |
| Invalid cloud | spec.cloud must be AWS, GCP, or AZURE |
| Invalid storageBackend | spec.storageBackend must start with s3a://, gs://, or abfs:// |
| StorageBackend/cloud mismatch | spec.storageBackend (s3a://) requires cloud=AWS |
| Missing imageTag | spec.storage.imageTag is required |
| "latest" tag in production | WARNING: Using "latest" tag is not recommended |
| Minimum resources | memory must be at least 1Gi |
| Port conflicts | thrift, web, and metrics ports must be different |
| Immutable fields on update | spec.workspace cannot be changed |
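
Because these checks run in admission webhooks, a manifest can be exercised against them without persisting anything by using a server-side dry run; a sketch (the filename is a placeholder):

```shell
# Send the manifest through the validation webhooks without creating
# the resource; webhook errors are reported exactly as on a real apply.
kubectl apply --dry-run=server -f metadataservices.yaml
```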