Pool¶
API Version: e6data.io/v1alpha1
Kind: Pool
Short Names: pool
1. Purpose¶
Pool provides shared compute resources that multiple QueryServices can use for burst capacity. Instead of each QueryService provisioning its own nodes, they share a common pool of warm nodes.
Key benefits:
- Cost efficiency: Share nodes across multiple clusters
- Faster scaling: Pre-warmed nodes with cached images
- Burst capacity: Scale beyond regular node allocation
- Resource optimization: Better utilization of expensive instances
2. High-level Behavior¶
When you create a Pool CR, the operator:
- Detects cloud provider and provisioning method (Karpenter, cluster-autoscaler, etc.)
- Creates Karpenter NodePool/NodeClass (for AWS/GCP/Azure with Karpenter)
- Deploys warmup DaemonSets to pre-cache executor images on pool nodes
- Tracks allocations from QueryServices that reference the pool
- Manages capacity (available vs occupied executors)
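The detected provider and provisioning method are surfaced in status (see section 5.1). A quick way to confirm what the operator decided, using the burst-pool example name from later sections:

# Show the detected cloud and provisioning method for a pool
kubectl get pool burst-pool -o jsonpath='{.status.cloud}{"\n"}{.status.provisioningMethod}{"\n"}'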
Karpenter vs Non-Karpenter Mode¶
The Pool CRD operates in two distinct modes depending on whether Karpenter is available:
| Feature | Karpenter Mode | Non-Karpenter Mode |
|---|---|---|
| Clouds | AWS, GCP, Azure | Linode, DigitalOcean, On-prem |
| Node Provisioning | Automatic via Karpenter | Manual (pre-existing node pools) |
| NodePool/NodeClass | Created by operator | Not created |
| Instance Type | Configurable, dynamic | Fixed by cloud provider |
| nodeSelector | Optional (derived from Karpenter) | Required |
| Scale-to-Zero | Yes | Depends on cloud provider |
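If you are unsure which mode applies to your cluster, one quick check is whether the Karpenter CRDs are installed (CRD names follow the API groups listed in section 6):

# Karpenter mode requires Karpenter's CRDs; a NotFound error here implies non-Karpenter mode
kubectl get crd nodepools.karpenter.sh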
Karpenter Mode (AWS/GCP/Azure)¶
When Karpenter is detected, the operator:
1. Creates a Karpenter NodePool with scaling limits
2. Creates a cloud-specific NodeClass (EC2NodeClass, GCPNodeClass, AKSNodeClass)
3. Automatically provisions/deprovisions nodes based on demand
4. Derives the instance type from attached QueryServices or explicit config
# Karpenter mode - operator creates NodePool and NodeClass
spec:
  minExecutors: 0
  maxExecutors: 20
  instanceConfig:
    instanceType: r7gd.16xlarge  # Optional - can be derived
    spotEnabled: true
Non-Karpenter Mode (Linode/DigitalOcean/On-prem)¶
When Karpenter is not available, the operator:
1. Does NOT create any Karpenter resources
2. Relies on pre-existing node pools (LKE pools, DOKS pools, etc.)
3. Uses nodeSelector to target pool nodes
4. Uses tolerations if pool nodes have taints
# Non-Karpenter mode - requires nodeSelector
spec:
  minExecutors: 2
  maxExecutors: 10
  # REQUIRED: Identify which nodes belong to this pool
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID
  # Optional: If pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"
Important: For non-Karpenter clouds, you must:
1. Create the node pool manually in your cloud console (e.g., LKE node pool, DOKS node pool)
2. Note the identifying label (pool ID, node pool name, etc.)
3. Specify that label in nodeSelector
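For example, on Linode LKE the pool ID label can be read straight off the nodes (-L prints the label value as a column):

# Find the pool ID to copy into spec.nodeSelector
kubectl get nodes -L lke.linode.com/pool-id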
Child Resources Created¶
| Resource Type | Name Pattern | Purpose |
|---|---|---|
| NodePool (Karpenter) | {name}-nodepool | Node provisioning rules |
| EC2NodeClass (AWS) | {name}-nodeclass | AWS-specific node config |
| GCPNodeClass (GCP) | {name}-nodeclass | GCP-specific node config |
| AKSNodeClass (Azure) | {name}-nodeclass | Azure-specific node config |
| DaemonSet | {name}-warmup-{image-hash} | Image caching per unique image |
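For a pool named burst-pool on AWS, the child resources from the table above can be inspected directly (the same names and warmup DaemonSet label appear in the troubleshooting commands in section 7):

kubectl get nodepool burst-pool-nodepool
kubectl get ec2nodeclass burst-pool-nodeclass
kubectl get ds -l e6data.io/pool=burst-pool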
QueryService Integration¶
When a QueryService references a Pool via executor.poolRef:
1. The Pool validates QueryService compatibility (executor resources fit on pool nodes)
2. The QueryService creates a pool executor deployment ({name}-executor-pool-{strategy})
3. Pool executors schedule on pool nodes (via node selector/affinity)
4. The Pool tracks the allocation in status.allocations
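The QueryService side of this handshake belongs to the QueryService CRD reference, but as a minimal sketch (assuming executor.poolRef takes the same name/namespace shape as QueryServiceReference in section 3.3):

apiVersion: e6data.io/v1alpha1
kind: QueryService
metadata:
  name: analytics-cluster
  namespace: workspace-analytics-prod
  labels:
    e6data.io/pool: burst-pool   # matched by the pool's queryServiceSelector
spec:
  executor:
    poolRef:                     # assumed shape; see the QueryService CRD reference
      name: burst-pool
      namespace: e6-pools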
3. Spec Reference¶
3.1 Top-level Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| minExecutors | int32 | No | 0 | Minimum executor slots (baseline capacity) |
| maxExecutors | int32 | Yes | - | Maximum executor slots |
| executorsPerNode | int32 | No | 1 | Executors scheduled per node |
| instanceConfig | PoolInstanceConfig | No | - | Node/instance configuration |
| inheritNodeConfigFrom | QueryServiceReference | No | - | Inherit node config from a QueryService |
| imageConfig | PoolImageConfig | No | - | Image caching configuration |
| allowedQueryServices | []QueryServiceReference | No | - | Explicit allow list of QueryServices |
| queryServiceSelector | LabelSelector | No | - | Label-based QueryService selection |
| storageAgent | PoolStorageAgentSpec | No | - | Storage agent DaemonSet configuration |
| nodeSelector | map[string]string | No | - | Node labels for pool nodes (required in non-Karpenter mode) |
| tolerations | []Toleration | No | [] | Tolerations for pool workloads |
Note: Either allowedQueryServices OR queryServiceSelector must be specified (not both empty).
3.2 InstanceConfig¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| instanceType | string | No | Derived | Explicit instance type (e.g., r7gd.16xlarge) |
| instanceFamily | string | No | - | Preferred family for auto-selection |
| autoUpgrade | bool | No | false | Auto-upgrade instance when a larger QueryService attaches |
| spotEnabled | bool | No | false | Use spot/preemptible instances |
3.3 QueryServiceReference¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | QueryService name |
| namespace | string | No | Pool namespace | QueryService namespace |
3.4 ImageConfig¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| pullSecret | SecretReference | No | - | Registry credentials for pulling cached images |
| cachedImages | []string | No | [] | Explicit list of images to cache |
| autoCollectImages | bool | No | true | Auto-cache images from attached QueryServices |
| unusedImageRetention | string | No | 1h | How long to keep warmup DaemonSets for unused images |
4. Example Manifests¶
4.1 Basic Burst Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: burst-pool
  namespace: e6-pools
spec:
  minExecutors: 2      # Always keep 2 slots warm
  maxExecutors: 20     # Can scale to 20 executors
  executorsPerNode: 1  # One executor per node
  # Inherit instance type from existing QueryService
  inheritNodeConfigFrom:
    name: analytics-cluster
    namespace: workspace-analytics-prod
  # Auto-cache images from attached QueryServices
  imageConfig:
    autoCollectImages: true
    unusedImageRetention: 2h
  # Allow any QueryService with this label
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: burst-pool
4.2 Explicit Instance Type Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: high-memory-pool
  namespace: e6-pools
spec:
  minExecutors: 0  # Scale to zero when idle
  maxExecutors: 50
  executorsPerNode: 1
  instanceConfig:
    instanceType: r7gd.16xlarge  # Explicit instance type
    spotEnabled: true            # Use spot instances
  imageConfig:
    autoCollectImages: true
    pullSecret:
      name: e6data-registry-secret
      namespace: e6-pools
  # Explicit allow list
  allowedQueryServices:
    - name: analytics-cluster
      namespace: workspace-analytics-prod
    - name: reporting-cluster
      namespace: workspace-reporting
4.3 Non-Karpenter Pool (Linode/DigitalOcean)¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: linode-pool
  namespace: e6-pools
spec:
  minExecutors: 2
  maxExecutors: 10
  executorsPerNode: 1
  # For non-Karpenter clouds, nodeSelector is REQUIRED
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID
  # Tolerations if pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"
  imageConfig:
    autoCollectImages: true
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: linode-pool
4.4 Multi-Executor Per Node Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: shared-node-pool
  namespace: e6-pools
spec:
  minExecutors: 4
  maxExecutors: 32
  executorsPerNode: 4  # 4 executors share each node
  instanceConfig:
    instanceType: r6gd.8xlarge  # 32 vCPU, 256 GiB (enough for 4 executors)
    spotEnabled: true
  imageConfig:
    autoCollectImages: true
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: shared-pool
4.5 Pool with Explicit Cached Images¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: prewarmed-pool
  namespace: e6-pools
spec:
  minExecutors: 5
  maxExecutors: 25
  executorsPerNode: 1
  instanceConfig:
    instanceFamily: r7gd
    autoUpgrade: true  # Upgrade instance if larger QS attaches
  imageConfig:
    autoCollectImages: false  # Don't auto-collect
    cachedImages:
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.219
    pullSecret:
      name: registry-secret
  allowedQueryServices:
    - name: prod-cluster
      namespace: workspace-prod
5. Status & Lifecycle¶
5.1 Status Fields¶
| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status message |
| cloud | string | Detected cloud provider |
| provisioningMethod | string | Node provisioning method |
| derivedInstanceType | string | Instance type in use |
| derivedFrom | string | Where the instance type came from |
| totalExecutors | int32 | Total executor capacity |
| availableExecutors | int32 | Free executor slots |
| occupiedExecutors | int32 | In-use executor slots |
| currentNodes | int32 | Active pool nodes |
| nodePoolName | string | Karpenter NodePool name |
| nodeClassName | string | Karpenter NodeClass name |
| allocations | []PoolAllocation | Per-QueryService allocations |
| cachedImages | []CachedImageStatus | Image caching status |
| attachedQueryServices | []AttachedQueryServiceStatus | Per-QueryService compatibility status |
5.2 Phase Values¶
| Phase | Description |
|---|---|
| Pending | Initial setup in progress |
| Creating | Creating Karpenter resources |
| Active | Pool ready for allocations |
| Suspended | Pool suspended (no new allocations) |
| Suspending | Suspension in progress |
| Resuming | Resume in progress |
| Failed | Setup failed |
| Deleting | Cleanup in progress |
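A one-liner for watching phases across pools:

kubectl get pool -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message' -w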
5.3 Allocations¶
status:
  allocations:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      poolExecutors: 5
      regularExecutors: 4  # The QueryService's own (non-pool) executors, for reference
      allocatedAt: "2024-01-15T10:00:00Z"
    - queryService:
        name: reporting-cluster
        namespace: workspace-reporting
      poolExecutors: 3
      regularExecutors: 2
      allocatedAt: "2024-01-15T11:30:00Z"
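In this example the pool has 8 occupied slots (5 + 3), which should match status.occupiedExecutors. The same sum can be computed with jq (already used by the troubleshooting commands in section 7):

kubectl get pool burst-pool -o jsonpath='{.status.allocations}' | jq '[.[].poolExecutors] | add'
# => 8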
5.4 Cached Images Status¶
status:
  cachedImages:
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      hash: a1b2c3d4
      source: QueryService/workspace-analytics-prod/analytics-cluster
      warmupStatus: Ready
      daemonSetName: burst-pool-warmup-a1b2c3d4
      nodesReady: 5
      nodesTotal: 5
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      hash: e5f6g7h8
      source: QueryService/workspace-reporting/reporting-cluster
      warmupStatus: Pending
      daemonSetName: burst-pool-warmup-e5f6g7h8
      nodesReady: 2
      nodesTotal: 5
5.5 Attached QueryServices¶
status:
  attachedQueryServices:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      compatible: true
      instanceType: r7gd.16xlarge
      requiredCpu: "30"
      requiredMemory: "60Gi"
      message: "Compatible with pool instance type"
      lastChecked: "2024-01-15T12:00:00Z"
    - queryService:
        name: huge-cluster
        namespace: workspace-huge
      compatible: false
      requiredCpu: "120"
      requiredMemory: "500Gi"
      message: "Executor resources exceed pool instance capacity"
      lastChecked: "2024-01-15T12:00:00Z"
6. Related Resources¶
References¶
| CRD | Relationship |
|---|---|
| QueryService | References Pool via executor.poolRef |
Creates (Karpenter clouds)¶
| Resource | API Group |
|---|---|
| NodePool | karpenter.sh/v1 |
| EC2NodeClass | karpenter.k8s.aws/v1 |
| GCPNodeClass | karpenter.k8s.gcp/v1 |
| AKSNodeClass | karpenter.azure.com/v1 |
7. Troubleshooting¶
7.1 Common Issues¶
Pool Stuck in Pending¶
Symptoms: status.phase stays Pending and never progresses to Active.
Causes:
1. Karpenter not installed (for AWS/GCP/Azure)
2. Missing nodeSelector (for non-Karpenter clouds)
3. Neither allowedQueryServices nor queryServiceSelector specified
Checks:
# Check pool events
kubectl describe pool burst-pool
# Verify Karpenter is running
kubectl get pods -n karpenter
# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i pool
QueryService Can't Attach to Pool¶
Symptoms: Pool executor deployment not created.
Checks:
# Verify QueryService has poolRef
kubectl get qs analytics-cluster -o jsonpath='{.spec.executor.poolRef}'
# Check if QueryService matches pool's selector
kubectl get qs analytics-cluster -o jsonpath='{.metadata.labels}'
# Check attached status
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices}' | jq
# Look for compatibility issues
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq
Warmup DaemonSets Not Running¶
Symptoms: status.cachedImages[].warmupStatus shows Failed or stays Pending.
Checks:
# List warmup DaemonSets
kubectl get ds -l e6data.io/pool=burst-pool
# Check DaemonSet status
kubectl describe ds burst-pool-warmup-a1b2c3d4
# Check for image pull errors
kubectl get pods -l e6data.io/component=warmup -o wide
# Verify pull secret exists
kubectl get secret e6data-registry-secret
Pool Nodes Not Scaling¶
Symptoms: currentNodes: 0 despite allocations.
Checks:
# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter | grep burst-pool
# Verify instance type availability
# (AWS example)
aws ec2 describe-instance-type-offerings --location-type availability-zone \
--filters Name=instance-type,Values=r7gd.16xlarge
7.2 Useful Commands¶
# Get pool status
kubectl get pool burst-pool -o yaml
# Watch pool status
kubectl get pool -w
# Check allocations
kubectl get pool burst-pool -o jsonpath='{.status.allocations}' | jq
# Check available capacity
kubectl get pool burst-pool -o jsonpath='{.status.availableExecutors}'
# List pool nodes
kubectl get nodes -l karpenter.sh/nodepool=burst-pool-nodepool
# Check warmup status
kubectl get pool burst-pool -o jsonpath='{.status.cachedImages}' | jq
# Force warmup DaemonSet recreation
kubectl delete ds -l e6data.io/pool=burst-pool,e6data.io/component=warmup
# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml
# Check Karpenter NodeClass (AWS)
kubectl get ec2nodeclass burst-pool-nodeclass -o yaml
8. Best Practices¶
8.1 Sizing Guidelines¶
| Cluster Count | minExecutors | maxExecutors |
|---|---|---|
| 1-2 clusters | 0-2 | 10-20 |
| 3-5 clusters | 2-5 | 30-50 |
| 5+ clusters | 5-10 | 50-100 |
8.2 Instance Type Selection¶
| Executor Memory | Recommended Instance (AWS) |
|---|---|
| 30Gi | r7gd.4xlarge, r6gd.4xlarge |
| 60Gi | r7gd.8xlarge, r6gd.8xlarge |
| 120Gi | r7gd.16xlarge, r6gd.16xlarge |
| 240Gi+ | r7gd.metal, x2gd instances |
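When executorsPerNode is greater than 1, multiply before choosing a row: four ~60Gi executors need roughly 240 GiB plus system overhead per node, which is why example 4.4 pairs executorsPerNode: 4 with r6gd.8xlarge (256 GiB).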
8.3 Cost Optimization¶
- Use spot instances for burst capacity (instanceConfig.spotEnabled: true)
- Set minExecutors: 0 for infrequently used pools
- Share pools across multiple QueryServices with similar requirements
- Use inheritNodeConfigFrom to automatically match existing QueryService instance types
8.4 Image Caching Strategy¶
- Leave autoCollectImages: true for most cases (fully automatic)
- List explicit cachedImages when you need specific versions pre-warmed
- Raise unusedImageRetention (e.g., to 2h; default 1h) to avoid thrashing during rolling deployments