Pool¶
API Version: e6data.io/v1alpha1
Kind: Pool
Short Names: pool
1. Purpose¶
Pool provides shared compute resources that multiple QueryServices can use for burst capacity. Instead of each QueryService provisioning its own nodes, they share a common pool of warm nodes.
Key benefits:
- Cost efficiency: Share nodes across multiple clusters
- Faster scaling: Pre-warmed nodes with cached images
- Burst capacity: Scale beyond regular node allocation
- Resource optimization: Better utilization of expensive instances
2. High-level Behavior¶
When you create a Pool CR, the operator:
- Detects cloud provider and provisioning method (Karpenter, cluster-autoscaler, etc.)
- Creates Karpenter NodePool/NodeClass (for AWS/GCP/Azure with Karpenter)
- Deploys warmup DaemonSets to pre-cache executor images on pool nodes
- Tracks allocations from QueryServices that reference the pool
- Manages capacity (available vs occupied executors)
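The detected provider and provisioning method are surfaced in status (see section 5.1). A quick way to confirm what the operator decided, using the burst-pool example name from later sections:

# Show the detected cloud and provisioning method for a pool
kubectl get pool burst-pool -o jsonpath='{.status.cloud}{"\n"}{.status.provisioningMethod}{"\n"}'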
Karpenter vs Non-Karpenter Mode¶
The Pool CRD operates in two distinct modes depending on whether Karpenter is available:
| Feature | Karpenter Mode | Non-Karpenter Mode |
|---|---|---|
| Clouds | AWS, GCP, Azure | Linode, DigitalOcean, On-prem |
| Node Provisioning | Automatic via Karpenter | Manual (pre-existing node pools) |
| NodePool/NodeClass | Created by operator | Not created |
| Instance Type | Configurable, dynamic | Fixed by cloud provider |
| nodeSelector | Optional (derived from Karpenter) | Required |
| Scale-to-Zero | Yes | Depends on cloud provider |
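If you are unsure which mode applies to your cluster, one quick check is whether the Karpenter CRDs are installed (CRD names follow the API groups listed in section 6):

# Karpenter mode requires Karpenter's CRDs; a NotFound error here implies non-Karpenter mode
kubectl get crd nodepools.karpenter.sh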
Karpenter Mode (AWS/GCP/Azure)¶
When Karpenter is detected, the operator:
1. Creates a Karpenter NodePool with scaling limits
2. Creates a cloud-specific NodeClass (EC2NodeClass, GCPNodeClass, AKSNodeClass)
3. Automatically provisions/deprovisions nodes based on demand
4. Derives the instance type from attached QueryServices or explicit config
# Karpenter mode - operator creates NodePool and NodeClass
spec:
  minExecutors: 0
  maxExecutors: 20
  instanceConfig:
    instanceType: r7gd.16xlarge  # Optional - can be derived
    spotEnabled: true
Non-Karpenter Mode (Linode/DigitalOcean/On-prem)¶
When Karpenter is not available, the operator:
1. Does NOT create any Karpenter resources
2. Relies on pre-existing node pools (LKE pools, DOKS pools, etc.)
3. Uses nodeSelector to target pool nodes
4. Uses tolerations if pool nodes have taints
# Non-Karpenter mode - requires nodeSelector
spec:
  minExecutors: 2
  maxExecutors: 10
  # REQUIRED: Identify which nodes belong to this pool
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID
  # Optional: If pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"
Important: For non-Karpenter clouds, you must:
1. Create the node pool manually in your cloud console (e.g., LKE node pool, DOKS node pool)
2. Note the identifying label (pool ID, node pool name, etc.)
3. Specify that label in nodeSelector
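For example, on Linode LKE the pool ID label can be read straight off the nodes (-L prints the label value as a column):

# Find the pool ID to copy into spec.nodeSelector
kubectl get nodes -L lke.linode.com/pool-id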
Child Resources Created¶
| Resource Type | Name Pattern | Purpose |
|---|---|---|
| NodePool (Karpenter) | {name}-nodepool | Node provisioning rules |
| EC2NodeClass (AWS) | {name}-nodeclass | AWS-specific node config |
| GCPNodeClass (GCP) | {name}-nodeclass | GCP-specific node config |
| AKSNodeClass (Azure) | {name}-nodeclass | Azure-specific node config |
| DaemonSet | {name}-warmup-{image-hash} | Image caching per unique image |
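For a pool named burst-pool on AWS, the child resources from the table above can be inspected directly (the same names and warmup DaemonSet label appear in the troubleshooting commands in section 7):

kubectl get nodepool burst-pool-nodepool
kubectl get ec2nodeclass burst-pool-nodeclass
kubectl get ds -l e6data.io/pool=burst-pool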
QueryService Integration¶
When a QueryService references a Pool via executor.poolRef:
1. The Pool validates QueryService compatibility (executor resources fit on pool nodes)
2. The QueryService creates a pool executor deployment ({name}-executor-pool-{strategy})
3. Pool executors schedule on pool nodes (via node selector/affinity)
4. The Pool tracks the allocation in status.allocations
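The QueryService side of this handshake belongs to the QueryService CRD reference, but as a minimal sketch (assuming executor.poolRef takes the same name/namespace shape as QueryServiceReference in section 3.3):

apiVersion: e6data.io/v1alpha1
kind: QueryService
metadata:
  name: analytics-cluster
  namespace: workspace-analytics-prod
  labels:
    e6data.io/pool: burst-pool   # matched by the pool's queryServiceSelector
spec:
  executor:
    poolRef:                     # assumed shape; see the QueryService CRD reference
      name: burst-pool
      namespace: e6-pools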
3. Spec Reference¶
3.1 Top-level Fields¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| minExecutors | int32 | No | 0 | Minimum executor slots (baseline capacity) |
| maxExecutors | int32 | Yes | - | Maximum executor slots |
| executorsPerNode | int32 | No | 1 | Executors scheduled per node |
| instanceConfig | PoolInstanceConfig | No | - | Node/instance configuration |
| inheritNodeConfigFrom | QueryServiceReference | No | - | Inherit node config from a QueryService |
| imageConfig | PoolImageConfig | No | - | Image caching configuration |
| allowedQueryServices | []QueryServiceReference | No | - | Explicit allow list of QueryServices |
| queryServiceSelector | LabelSelector | No | - | Label-based QueryService selection |
| storageAgent | PoolStorageAgentSpec | No | - | Storage agent DaemonSet configuration |
| nodeSelector | map[string]string | No | - | Node labels for pool nodes (required in non-Karpenter mode) |
| tolerations | []Toleration | No | [] | Tolerations for pool workloads |
Note: Either allowedQueryServices OR queryServiceSelector must be specified (not both empty).
3.2 InstanceConfig¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| instanceType | string | No | Derived | Explicit instance type (e.g., r7gd.16xlarge) |
| instanceFamily | string | No | - | Preferred family for auto-selection |
| autoUpgrade | bool | No | false | Auto-upgrade instance when a larger QueryService attaches |
| spotEnabled | bool | No | false | Use spot/preemptible instances |
3.3 QueryServiceReference¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | Yes | - | QueryService name |
| namespace | string | No | Pool namespace | QueryService namespace |
3.4 ImageConfig¶
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| pullSecret | SecretReference | No | - | Registry credentials for pulling cached images |
| cachedImages | []string | No | [] | Explicit list of images to cache |
| autoCollectImages | bool | No | true | Auto-cache images from attached QueryServices |
| unusedImageRetention | string | No | 1h | How long to keep warmup DaemonSets for unused images |
4. Example Manifests¶
4.1 Basic Burst Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: burst-pool
  namespace: e6-pools
spec:
  minExecutors: 2      # Always keep 2 slots warm
  maxExecutors: 20     # Can scale to 20 executors
  executorsPerNode: 1  # One executor per node
  # Inherit instance type from existing QueryService
  inheritNodeConfigFrom:
    name: analytics-cluster
    namespace: workspace-analytics-prod
  # Auto-cache images from attached QueryServices
  imageConfig:
    autoCollectImages: true
    unusedImageRetention: 2h
  # Allow any QueryService with this label
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: burst-pool
4.2 Explicit Instance Type Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: high-memory-pool
  namespace: e6-pools
spec:
  minExecutors: 0  # Scale to zero when idle
  maxExecutors: 50
  executorsPerNode: 1
  instanceConfig:
    instanceType: r7gd.16xlarge  # Explicit instance type
    spotEnabled: true            # Use spot instances
  imageConfig:
    autoCollectImages: true
    pullSecret:
      name: e6data-registry-secret
      namespace: e6-pools
  # Explicit allow list
  allowedQueryServices:
    - name: analytics-cluster
      namespace: workspace-analytics-prod
    - name: reporting-cluster
      namespace: workspace-reporting
4.3 Non-Karpenter Pool (Linode/DigitalOcean)¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: linode-pool
  namespace: e6-pools
spec:
  minExecutors: 2
  maxExecutors: 10
  executorsPerNode: 1
  # For non-Karpenter clouds, nodeSelector is REQUIRED
  nodeSelector:
    lke.linode.com/pool-id: "785603"  # Linode LKE pool ID
  # Tolerations if pool nodes have taints
  tolerations:
    - key: "e6data.io/pool"
      operator: "Equal"
      value: "burst"
      effect: "NoSchedule"
  imageConfig:
    autoCollectImages: true
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: linode-pool
4.4 Multi-Executor Per Node Pool¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: shared-node-pool
  namespace: e6-pools
spec:
  minExecutors: 4
  maxExecutors: 32
  executorsPerNode: 4  # 4 executors share each node
  instanceConfig:
    instanceType: r6gd.8xlarge  # 32 vCPU, 256 GiB (enough for 4 executors)
    spotEnabled: true
  imageConfig:
    autoCollectImages: true
  queryServiceSelector:
    matchLabels:
      e6data.io/pool: shared-pool
4.5 Pool with Explicit Cached Images¶
apiVersion: e6data.io/v1alpha1
kind: Pool
metadata:
  name: prewarmed-pool
  namespace: e6-pools
spec:
  minExecutors: 5
  maxExecutors: 25
  executorsPerNode: 1
  instanceConfig:
    instanceFamily: r7gd
    autoUpgrade: true  # Upgrade instance if larger QS attaches
  imageConfig:
    autoCollectImages: false  # Don't auto-collect
    cachedImages:
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      - us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.219
    pullSecret:
      name: registry-secret
  allowedQueryServices:
    - name: prod-cluster
      namespace: workspace-prod
5. Status & Lifecycle¶
5.1 Status Fields¶
| Field | Type | Description |
|---|---|---|
| phase | string | Current lifecycle phase |
| message | string | Human-readable status message |
| cloud | string | Detected cloud provider |
| provisioningMethod | string | Node provisioning method |
| derivedInstanceType | string | Instance type in use |
| derivedFrom | string | Where the instance type came from |
| totalExecutors | int32 | Total executor capacity |
| availableExecutors | int32 | Free executor slots |
| occupiedExecutors | int32 | In-use executor slots |
| currentNodes | int32 | Active pool nodes |
| nodePoolName | string | Karpenter NodePool name |
| nodeClassName | string | Karpenter NodeClass name |
| allocations | []PoolAllocation | Per-QueryService allocations |
| cachedImages | []CachedImageStatus | Image caching status |
| attachedQueryServices | []AttachedQueryServiceStatus | Per-QueryService compatibility status |
5.2 Phase Values¶
| Phase | Description |
|---|---|
| Pending | Initial setup in progress |
| Creating | Creating Karpenter resources |
| Active | Pool ready for allocations |
| Suspended | Pool suspended (no new allocations) |
| Suspending | Suspension in progress |
| Resuming | Resume in progress |
| Failed | Setup failed |
| Deleting | Cleanup in progress |
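A one-liner for watching phases across pools:

kubectl get pool -o custom-columns='NAME:.metadata.name,PHASE:.status.phase,MESSAGE:.status.message' -w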
5.3 Allocations¶
status:
  allocations:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      poolExecutors: 5
      regularExecutors: 4  # The QueryService's own (non-pool) executors, for reference
      allocatedAt: "2024-01-15T10:00:00Z"
    - queryService:
        name: reporting-cluster
        namespace: workspace-reporting
      poolExecutors: 3
      regularExecutors: 2
      allocatedAt: "2024-01-15T11:30:00Z"
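In this example the pool has 8 occupied slots (5 + 3), which should match status.occupiedExecutors. The same sum can be computed with jq (already used by the troubleshooting commands in section 7):

kubectl get pool burst-pool -o jsonpath='{.status.allocations}' | jq '[.[].poolExecutors] | add'
# => 8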
5.4 Cached Images Status¶
status:
  cachedImages:
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.217
      hash: a1b2c3d4
      source: QueryService/workspace-analytics-prod/analytics-cluster
      warmupStatus: Ready
      daemonSetName: burst-pool-warmup-a1b2c3d4
      nodesReady: 5
      nodesTotal: 5
    - image: us-docker.pkg.dev/e6data-analytics/e6-engine/executor:3.0.218
      hash: e5f6g7h8
      source: QueryService/workspace-reporting/reporting-cluster
      warmupStatus: Pending
      daemonSetName: burst-pool-warmup-e5f6g7h8
      nodesReady: 2
      nodesTotal: 5
5.5 Attached QueryServices¶
status:
  attachedQueryServices:
    - queryService:
        name: analytics-cluster
        namespace: workspace-analytics-prod
      compatible: true
      instanceType: r7gd.16xlarge
      requiredCpu: "30"
      requiredMemory: "60Gi"
      message: "Compatible with pool instance type"
      lastChecked: "2024-01-15T12:00:00Z"
    - queryService:
        name: huge-cluster
        namespace: workspace-huge
      compatible: false
      requiredCpu: "120"
      requiredMemory: "500Gi"
      message: "Executor resources exceed pool instance capacity"
      lastChecked: "2024-01-15T12:00:00Z"
6. Related Resources¶
References¶
| CRD | Relationship |
|---|---|
| QueryService | References Pool via executor.poolRef |
Creates (Karpenter clouds)¶
| Resource | API Group |
|---|---|
| NodePool | karpenter.sh/v1 |
| EC2NodeClass | karpenter.k8s.aws/v1 |
| GCPNodeClass | karpenter.k8s.gcp/v1 |
| AKSNodeClass | karpenter.azure.com/v1 |
7. Troubleshooting¶
7.1 Common Issues¶
Pool Stuck in Pending¶
Symptoms: status.phase stays Pending and never progresses to Active.
Causes:
1. Karpenter not installed (for AWS/GCP/Azure)
2. Missing nodeSelector (for non-Karpenter clouds)
3. Neither allowedQueryServices nor queryServiceSelector specified
Checks:
# Check pool events
kubectl describe pool burst-pool
# Verify Karpenter is running
kubectl get pods -n karpenter
# Check operator logs
kubectl logs -n e6-operator-system -l app=e6-operator | grep -i pool
QueryService Can't Attach to Pool¶
Symptoms: Pool executor deployment not created.
Checks:
# Verify QueryService has poolRef
kubectl get qs analytics-cluster -o jsonpath='{.spec.executor.poolRef}'
# Check if QueryService matches pool's selector
kubectl get qs analytics-cluster -o jsonpath='{.metadata.labels}'
# Check attached status
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices}' | jq
# Look for compatibility issues
kubectl get pool burst-pool -o jsonpath='{.status.attachedQueryServices[?(@.compatible==false)]}' | jq
Warmup DaemonSets Not Running¶
Symptoms: status.cachedImages[].warmupStatus shows Failed or stays Pending.
Checks:
# List warmup DaemonSets
kubectl get ds -l e6data.io/pool=burst-pool
# Check DaemonSet status
kubectl describe ds burst-pool-warmup-a1b2c3d4
# Check for image pull errors
kubectl get pods -l e6data.io/component=warmup -o wide
# Verify pull secret exists
kubectl get secret e6data-registry-secret
Pool Nodes Not Scaling¶
Symptoms: currentNodes: 0 despite allocations.
Checks:
# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter | grep burst-pool
# Verify instance type availability
# (AWS example)
aws ec2 describe-instance-type-offerings --location-type availability-zone \
--filters Name=instance-type,Values=r7gd.16xlarge
7.2 Useful Commands¶
# Get pool status
kubectl get pool burst-pool -o yaml
# Watch pool status
kubectl get pool -w
# Check allocations
kubectl get pool burst-pool -o jsonpath='{.status.allocations}' | jq
# Check available capacity
kubectl get pool burst-pool -o jsonpath='{.status.availableExecutors}'
# List pool nodes
kubectl get nodes -l karpenter.sh/nodepool=burst-pool-nodepool
# Check warmup status
kubectl get pool burst-pool -o jsonpath='{.status.cachedImages}' | jq
# Force warmup DaemonSet recreation
kubectl delete ds -l e6data.io/pool=burst-pool,e6data.io/component=warmup
# Check Karpenter NodePool
kubectl get nodepool burst-pool-nodepool -o yaml
# Check Karpenter NodeClass (AWS)
kubectl get ec2nodeclass burst-pool-nodeclass -o yaml
8. Best Practices¶
8.1 Sizing Guidelines¶
| Cluster Count | minExecutors | maxExecutors |
|---|---|---|
| 1-2 clusters | 0-2 | 10-20 |
| 3-5 clusters | 2-5 | 30-50 |
| 5+ clusters | 5-10 | 50-100 |
8.2 Instance Type Selection¶
| Executor Memory | Recommended Instance (AWS) |
|---|---|
| 30Gi | r7gd.4xlarge, r6gd.4xlarge |
| 60Gi | r7gd.8xlarge, r6gd.8xlarge |
| 120Gi | r7gd.16xlarge, r6gd.16xlarge |
| 240Gi+ | r7gd.metal, x2gd instances |
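When executorsPerNode is greater than 1, multiply before choosing a row: four ~60Gi executors need roughly 240 GiB plus system overhead per node, which is why example 4.4 pairs executorsPerNode: 4 with r6gd.8xlarge (256 GiB).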
8.3 Cost Optimization¶
- Use spot instances for burst capacity (instanceConfig.spotEnabled: true)
- Set minExecutors: 0 for infrequently used pools
- Share pools across multiple QueryServices with similar requirements
- Use inheritNodeConfigFrom to automatically match existing QueryService instance types
8.4 Image Caching Strategy¶
- Leave autoCollectImages: true for most cases (fully automatic)
- List explicit cachedImages when you need specific versions pre-warmed
- Raise unusedImageRetention (e.g., to 2h; default 1h) to avoid thrashing during rolling deployments