Pool

API Version: e6data.io/v1alpha1 Kind: Pool Short Names: pool


1. Purpose

Pool provides shared compute resources that multiple QueryServices can use for burst capacity. Instead of each QueryService provisioning its own nodes, they share a common pool of warm nodes.

Key benefits:

  • Cost efficiency: Share nodes across multiple clusters
  • Faster scaling: Pre-warmed nodes with cached images
  • Burst capacity: Scale beyond regular node allocation
  • Resource optimization: Better utilization of expensive instances

2. High-level Behavior

When you create a Pool CR, the operator:

  1. Detects cloud provider and provisioning method (Karpenter, cluster-autoscaler, etc.)
  2. Creates Karpenter NodePool/NodeClass (for AWS/GCP/Azure with Karpenter)
  3. Deploys warmup DaemonSets to pre-cache executor images on pool nodes
  4. Tracks allocations from QueryServices that reference the pool
  5. Manages capacity (available vs occupied executors)

Karpenter vs Non-Karpenter Mode

The Pool CRD operates in two distinct modes depending on whether Karpenter is available:

| Feature | Karpenter Mode | Non-Karpenter Mode |
|---|---|---|
| Clouds | AWS, GCP, Azure | Linode, DigitalOcean, On-prem |
| Node Provisioning | Automatic via Karpenter | Manual (pre-existing node pools) |
| NodePool/NodeClass | Created by operator | Not created |
| Instance Type | Configurable, dynamic | Fixed by cloud provider |
| nodeSelector | Optional (derived from Karpenter) | Required |
| Scale-to-Zero | Yes | Depends on cloud provider |

Karpenter Mode (AWS/GCP/Azure)

When Karpenter is detected, the operator:

  1. Creates a Karpenter NodePool with scaling limits
  2. Creates a cloud-specific NodeClass (EC2NodeClass, GCPNodeClass, AKSNodeClass)
  3. Automatically provisions/deprovisions nodes based on demand
  4. Derives the instance type from attached QueryServices or explicit config

# Karpenter mode - operator creates NodePool and NodeClass
spec:
  minExecutors: 0
  maxExecutors: 20
  instanceConfig:
    instanceType: r7gd.16xlarge  # Optional - can be derived
    spotEnabled: true

Non-Karpenter Mode (Linode/DigitalOcean/On-prem)

When Karpenter is not available, the operator:

  1. Does NOT create any Karpenter resources
  2. Relies on pre-existing node pools (LKE pools, DOKS pools, etc.)
  3. Uses nodeSelector to target pool nodes
  4. Uses tolerations if pool nodes have taints

# Non-Karpenter mode - requires nodeSelector
spec:
  minExecutors: 2
  maxExecutors: 10

  # REQUIRED: Identify which nodes belong to this pool
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID

  # Optional: If pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"

Important: For non-Karpenter clouds, you must:

  1. Create the node pool manually in your cloud console (e.g., LKE node pool, DOKS node pool)
  2. Note the identifying label (pool ID, node pool name, etc.), as shown in the lookup sketch below
  3. Specify that label in nodeSelector
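
Not sure which label identifies your pool nodes? Listing node labels is usually enough to spot it. A minimal sketch; the grep pattern is just the Linode example used above, and other clouds use different label keys:

# Show all labels on the cluster's nodes and look for the pool-identifying one
kubectl get nodes --show-labels

# Linode LKE example: filter for the pool ID label
kubectl get nodes --show-labels | grep 'lke.linode.com/pool-id'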

Child Resources Created

| Resource Type | Name Pattern | Purpose |
|---|---|---|
| NodePool (Karpenter) | {name}-nodepool | Node provisioning rules |
| EC2NodeClass (AWS) | {name}-nodeclass | AWS-specific node config |
| GCPNodeClass (GCP) | {name}-nodeclass | GCP-specific node config |
| AKSNodeClass (Azure) | {name}-nodeclass | Azure-specific node config |
| DaemonSet | {name}-warmup-{image-hash} | Image caching per unique image |

QueryService Integration

When a QueryService references a Pool via executor.poolRef:

  1. Pool validates QueryService compatibility (resources fit on pool nodes)
  2. QueryService creates a pool executor deployment ({name}-executor-pool-{strategy})
  3. Pool executors schedule on pool nodes (via node selector/affinity)
  4. Pool tracks the allocation in status.allocations
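
For illustration, a trimmed QueryService excerpt showing only the pool-related fields. This is a sketch: the spec.executor.poolRef path and the e6data.io/pool label come from the examples and troubleshooting commands on this page, while the name/namespace shape of poolRef and the omitted QueryService fields are assumptions.

# Hypothetical QueryService excerpt - only pool-related fields shown
apiVersion: e6data.io/v1alpha1
kind: QueryService
metadata:
  name: analytics-cluster
  namespace: workspace-analytics-prod
  labels:
    e6data.io/pool: burst-pool    # matches the pool's queryServiceSelector
spec:
  executor:
    poolRef:                      # field shape (name/namespace) assumed
      name: burst-pool
      namespace: e6-pools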


3. Spec Reference

3.1 Top-level Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| minExecutors | int32 | No | 0 | Minimum executor slots (baseline capacity) |
| maxExecutors | int32 | Yes | - | Maximum executor slots |
| executorsPerNode | int32 | No | 1 | Executors per node |
| instanceConfig | PoolInstanceConfig | No | - | Node/instance configuration |
| inheritNodeConfigFrom | QueryServiceReference | No | - | Inherit config from QueryService |
| imageConfig | PoolImageConfig | No | - | Image caching configuration |
| allowedQueryServices | []QueryServiceReference | No | - | Explicit allowed list |
| queryServiceSelector | LabelSelector | No | - | Label-based selection |
| storageAgent | PoolStorageAgentSpec | No | - | Storage agent DaemonSet |
| nodeSelector | map[string]string | No | - | Node labels for pool nodes |
| tolerations | []Toleration | No | [] | Tolerations for pool workloads |

Note: Either allowedQueryServices OR queryServiceSelector must be specified (not both empty).

3.2 InstanceConfig

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| instanceType | string | No | Derived | Explicit instance type (e.g., r7gd.16xlarge) |
| instanceFamily | string | No | - | Preferred family for auto-selection |
| autoUpgrade | bool | No | false | Auto-upgrade instance when a larger QueryService attaches |
| spotEnabled | bool | No | false | Use spot/preemptible instances |

3.3 QueryServiceReference

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | QueryService name |
| namespace | string | No | Pool namespace | QueryService namespace |

3.4 ImageConfig

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| pullSecret | SecretReference | No | - | Registry credentials |
| cachedImages | []string | No | [] | Explicit images to cache |
| autoCollectImages | bool | No | true | Auto-cache images from attached QueryServices |
| unusedImageRetention | string | No | 1h | How long to keep warmup DaemonSets for images no longer in use |

4. Example Manifests

4.1 Basic Burst Pool

apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: burst-pool
  namespace: e6-pools
spec:
  minExecutors: 2      # Always keep 2 slots warm
  maxExecutors: 20     # Can scale to 20 executors
  executorsPerNode: 1  # One executor per node

  # Inherit instance type from existing QueryService
  inheritNodeConfigFrom:
    name: analytics-cluster
    namespace: workspace-analytics-prod

  # Auto-cache images from attached QueryServices
  imageConfig:
    autoCollectImages: true
    unusedImageRetention: 2h

  # Allow any QueryService with this label
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: burst-pool

4.2 Explicit Instance Type Pool

apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: high-memory-pool
  namespace: e6-pools
spec:
  minExecutors: 0      # Scale to zero when idle
  maxExecutors: 50
  executorsPerNode: 1

  instanceConfig:
    instanceType: r7gd.16xlarge  # Explicit instance type
    spotEnabled: true             # Use spot instances

  imageConfig:
    autoCollectImages: true
    pullSecret:
      name: e6data-registry-secret
      namespace: e6-pools

  # Explicit allow list
  allowedQueryServices:
    - name: analytics-cluster
      namespace: workspace-analytics-prod
    - name: reporting-cluster
      namespace: workspace-reporting

4.3 Non-Karpenter Pool (Linode/DigitalOcean)

apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: linode-pool
  namespace: e6-pools
spec:
  minExecutors: 2
  maxExecutors: 10
  executorsPerNode: 1

  # For non-Karpenter clouds, nodeSelector is REQUIRED
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID

  # Tolerations if pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"

  imageConfig:
    autoCollectImages: true

  queryServiceSelector:
    matchLabels:
      e6data.io/pool: linode-pool

4.4 Multi-Executor Per Node Pool

apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: shared-node-pool
  namespace: e6-pools
spec:
  minExecutors: 4
  maxExecutors: 32
  executorsPerNode: 4  # 4 executors share each node

  instanceConfig:
    instanceType: r6gd.8xlarge  # 32 vCPU, 256 GiB (enough for 4 executors)
    spotEnabled: true

  imageConfig:
    autoCollectImages: true

  queryServiceSelector:
    matchLabels:
      e6data.io/pool: shared-pool

4.5 Pool with Explicit Cached Images

apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: prewarmed-pool
  namespace: e6-pools
spec:
  minExecutors: 5
  maxExecutors: 25
  executorsPerNode: 1

  instanceConfig:
    instanceFamily: r7gd
    autoUpgrade: true  # Upgrade instance if larger QS attaches

  imageConfig:
    autoCollectImages: false  # Don't auto-collect
    cachedImages:
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.219
    pullSecret:
      name: registry-secret

  allowedQueryServices:
    - name: prod-cluster
      namespace: workspace-prod

5. Status & Lifecycle

5.1 Status Fields

| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status |
| cloud | string | Detected cloud provider |
| provisioningMethod | string | Node provisioning method |
| derivedInstanceType | string | Instance type in use |
| derivedFrom | string | Where the instance type came from |
| totalExecutors | int32 | Total executor capacity |
| availableExecutors | int32 | Free executor slots |
| occupiedExecutors | int32 | In-use executor slots |
| currentNodes | int32 | Active pool nodes |
| nodePoolName | string | Karpenter NodePool name |
| nodeClassName | string | Karpenter NodeClass name |
| allocations | []PoolAllocation | Per-QueryService allocations |
| cachedImages | []CachedImageStatus | Image caching status |
| attachedQueryServices | []AttachedQueryServiceStatus | Compatibility status |
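
The capacity-related fields can also be pulled into a single view with a plain kubectl query; a small sketch (the column names are arbitrary, and the default kubectl get pool output already shows some of these):

# Summarize pool capacity from the status fields above
kubectl get pool burst-pool -o custom-columns=\
PHASE:.status.phase,AVAILABLE:.status.availableExecutors,OCCUPIED:.status.occupiedExecutors,TOTAL:.status.totalExecutors,NODES:.status.currentNodes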

5.2 Phase Values

| Phase | Description |
|---|---|
| Pending | Initial setup in progress |
| Creating | Creating Karpenter resources |
| Active | Pool ready for allocations |
| Suspended | Pool suspended (no new allocations) |
| Suspending | Suspension in progress |
| Resuming | Resume in progress |
| Failed | Setup failed |
| Deleting | Cleanup in progress |
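
If you script against these phases, a standard kubectl wait can block until the pool is usable; a small sketch (the timeout value is arbitrary, and jsonpath waits require kubectl 1.23+):

# Block until the pool reports phase Active
kubectl wait pool/burst-pool --for=jsonpath='{.status.phase}'=Active --timeout=5m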

5.3 Allocations

status:
  allocations:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      poolExecutors: 5
      regularExecutors: 4  # For reference
      allocatedAt: "2024-01-15T10:00:00Z"
    - queryService:
        name: reporting-cluster
        namespace: workspace-reporting
      poolExecutors: 3
      regularExecutors: 2
      allocatedAt: "2024-01-15T11:30:00Z"

5.4 Cached Images Status

status:
  cachedImages:
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      hash: a1b2c3d4
      source: QueryService/workspace-analytics-prod/analytics-cluster
      warmupStatus: Ready
      daemonSetName: burst-pool-warmup-a1b2c3d4
      nodesReady: 5
      nodesTotal: 5
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      hash: e5f6g7h8
      source: QueryService/workspace-reporting/reporting-cluster
      warmupStatus: Pending
      daemonSetName: burst-pool-warmup-e5f6g7h8
      nodesReady: 2
      nodesTotal: 5

5.5 Attached QueryServices

status:
  attachedQueryServices:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      compatible: true
      instanceType: r7gd.16xlarge
      requiredCpu: "30"
      requiredMemory: "60Gi"
      message: "Compatible with pool instance type"
      lastChecked: "2024-01-15T12:00:00Z"
    - queryService:
        name: huge-cluster
        namespace: workspace-huge
      compatible: false
      requiredCpu: "120"
      requiredMemory: "500Gi"
      message: "Executor resources exceed pool instance capacity"
      lastChecked: "2024-01-15T12:00:00Z"

6. References

| CRD | Relationship |
|---|---|
| QueryService | References Pool via executor.poolRef |

Creates (Karpenter clouds)

| Resource | API Group |
|---|---|
| NodePool | karpenter.sh/v1 |
| EC2NodeClass | karpenter.k8s.aws/v1 |
| GCPNodeClass | karpenter.k8s.gcp/v1 |
| AKSNodeClass | karpenter.azure.com/v1 |

7. Troubleshooting

7.1 Common Issues

Pool Stuck in Pending

Symptoms:

$ kubectl get pool
NAME         PHASE     AVAILABLE   OCCUPIED   TOTAL
burst-pool   Pending   0           0          0

Causes:

  1. Karpenter not installed (for AWS/GCP/Azure)
  2. Missing nodeSelector (for non-Karpenter clouds)
  3. Neither allowedQueryServices nor queryServiceSelector specified

Checks:

# Check pool events
kubectl describe pool burst-pool

# Verify Karpenter is running
kubectl get pods -n karpenter

# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i pool

QueryService Can't Attach to Pool

Symptoms: Pool executor deployment not created.

Checks:

# Verify QueryService has poolRef
kubectl get qs analytics-cluster -o jsonpath='{.spec.executor.poolRef}'

# Check if QueryService matches pool's selector
kubectl get qs analytics-cluster -o jsonpath='{.metadata.labels}'

# Check attached status
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices}' | jq

# Look for compatibility issues
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq

Warmup DaemonSets Not Running

Symptoms: cachedImages[].warmupStatus: Failed or Pending.

Checks:

# List warmup DaemonSets
kubectl get ds -l e6data.io/pool=burst-pool

# Check DaemonSet status
kubectl describe ds burst-pool-warmup-a1b2c3d4

# Check for image pull errors
kubectl get pods -l e6data.io/component=warmup -o wide

# Verify pull secret exists
kubectl get secret e6data-registry-secret

Pool Nodes Not Scaling

Symptoms: currentNodes: 0 despite allocations.

Checks:

# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter | grep burst-pool

# Verify instance type availability
# (AWS example)
aws ec2 describe-instance-type-offerings --location-type availability-zone \
  --filters Name=instance-type,Values=r7gd.16xlarge

7.2 Useful Commands

# Get pool status
kubectl get pool burst-pool -o yaml

# Watch pool status
kubectl get pool -w

# Check allocations
kubectl get pool burst-pool -o jsonpath='{.status.allocations}' | jq

# Check available capacity
kubectl get pool burst-pool -o jsonpath='{.status.availableExecutors}'

# List pool nodes
kubectl get nodes -l karpenter.sh/nodepool=burst-pool-nodepool

# Check warmup status
kubectl get pool burst-pool -o jsonpath='{.status.cachedImages}' | jq

# Force warmup DaemonSet recreation
kubectl delete ds -l e6data.io/pool=burst-pool,e6data.io/component=warmup

# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml

# Check Karpenter NodeClass (AWS)
kubectl get ec2nodeclass burst-pool-nodeclass -o yaml

8. Best Practices

8.1 Sizing Guidelines

| Cluster Count | minExecutors | maxExecutors |
|---|---|---|
| 1-2 clusters | 0-2 | 10-20 |
| 3-5 clusters | 2-5 | 30-50 |
| 5+ clusters | 5-10 | 50-100 |

8.2 Instance Type Selection

| Executor Memory | Recommended Instance (AWS) |
|---|---|
| 30Gi | r7gd.4xlarge, r6gd.4xlarge |
| 60Gi | r7gd.8xlarge, r6gd.8xlarge |
| 120Gi | r7gd.16xlarge, r6gd.16xlarge |
| 240Gi+ | r7gd.metal, x2gd instances |
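
A quick sizing check when executorsPerNode is greater than 1: the combined executor memory must fit on the instance with headroom for system pods. Illustrative arithmetic only, using the r6gd.8xlarge figure quoted in example 4.4:

# Rule of thumb: executorsPerNode * executor memory <= node memory, minus system overhead
# Example: 4 executors * 60Gi = 240Gi -> fits an r6gd.8xlarge (~256 GiB) with ~16 GiB to spare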

8.3 Cost Optimization

  1. Use spot instances for burst capacity:

    instanceConfig:
      spotEnabled: true
    

  2. Set minExecutors: 0 for infrequently used pools

  3. Share pools across multiple QueryServices with similar requirements

  4. Use inheritNodeConfigFrom to automatically match existing QueryService instance types

8.4 Image Caching Strategy

  • Use autoCollectImages: true for most cases; images from attached QueryServices are cached automatically
  • List explicit cachedImages when you need specific versions pre-warmed
  • Raise unusedImageRetention (default 1h) to something like 2h to avoid warmup DaemonSet churn during rolling deployments