KEDA Autoscaling Behavior Documentation
Overview
This document describes how KEDA autoscales raibid-ci agents based on Redis Streams queue depth using ScaledJob.
Scaling Lifecycle
1. Queue Empty (Scale-to-Zero)
State: No jobs in Redis Streams
KEDA Action: No Kubernetes Jobs exist
Resource Usage: Zero (only the KEDA operator is running)
Redis Queue: []
Kubernetes Jobs: 0
Agent Pods: 0
2. Job Added to Queue
Event: Developer pushes code, triggering a CI job
Action: Job dispatcher adds an entry to Redis Streams
XADD raibid:jobs * \
job_id abc123 \
repo raibid-labs/app \
branch feature/new \
commit def456

Redis State:
Stream: raibid:jobs
Length: 1
Pending Entries: 1 (in consumer group raibid-workers)
3. KEDA Detects Job (Polling)
Timing: Within 10 seconds (polling interval)
Action: KEDA queries the Redis Streams trigger
KEDA runs this logic:
pending_count = XPENDING raibid:jobs raibid-workers
if pending_count >= pendingEntriesCount (1):
    create_kubernetes_job()
KEDA Logs:
[INFO] scaledjob/raibid-ci-agent: Scaling from 0 to 1 jobs
[INFO] redis-streams: pending entries = 1, threshold = 1
4. Kubernetes Job Created
Timing: Within 15 seconds of queue detection
Action: KEDA creates a Job from the ScaledJob template
apiVersion: batch/v1
kind: Job
metadata:
  name: raibid-ci-agent-abc123
  namespace: raibid-ci
  ownerReferences:
    - apiVersion: keda.sh/v1alpha1
      kind: ScaledJob
      name: raibid-ci-agent
spec:
  template:
    spec:
      containers:
        - name: rust-agent
          image: ghcr.io/raibid-labs/rust-agent:latest
          # ... environment, resources, etc.

5. Pod Scheduled and Running
Timing: Within 30 seconds (image pull + pod start)
Actions:
- Kubernetes scheduler assigns pod to node
- Kubelet pulls container image (if not cached)
- Container starts, agent begins execution
Agent Actions:
1. Connect to Redis
2. Read from consumer group: XREADGROUP GROUP raibid-workers consumer1 COUNT 1 STREAMS raibid:jobs >
3. Process job (build, test, etc)
4. Acknowledge job: XACK raibid:jobs raibid-workers <job-id>
5. Exit with code 0 (success) or 1 (failure)
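The five agent steps above can be sketched as a small consume/acknowledge loop. This is an illustrative sketch against a redis-py style client; `run_job` and `FakeStreamClient` are stand-ins for testing the flow locally, not part of raibid-ci:

```python
# Sketch of the agent loop (steps 1-5), assuming a redis-py style client.
# run_job and FakeStreamClient are illustrative stand-ins.

def process_next_job(client, stream="raibid:jobs",
                     group="raibid-workers", consumer="consumer1"):
    """Read one new entry from the consumer group, run it, XACK on success."""
    # ">" requests entries never delivered to any consumer in the group
    resp = client.xreadgroup(group, consumer, {stream: ">"}, count=1)
    if not resp:
        return None  # queue empty: exit so the Job completes
    _, entries = resp[0]
    entry_id, fields = entries[0]
    if run_job(fields):                       # build, test, etc.
        client.xack(stream, group, entry_id)  # remove from the pending list
        return entry_id, True
    return entry_id, False                    # left pending for retry/reclaim

def run_job(fields):
    # Placeholder: a real agent would check out fields["repo"] and build it.
    return True

class FakeStreamClient:
    """In-memory stand-in implementing only the two calls the loop uses."""
    def __init__(self, entries):
        self.entries = list(entries)  # [(entry_id, fields), ...]
        self.acked = []
    def xreadgroup(self, group, consumer, streams, count=1):
        if not self.entries:
            return []
        stream = next(iter(streams))
        return [[stream, [self.entries.pop(0)]]]
    def xack(self, stream, group, entry_id):
        self.acked.append(entry_id)
```

In a real deployment `client` would be a `redis.Redis(...)` connection; a failed job is deliberately not acknowledged, so it stays in the pending entries list where it can be reclaimed.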
6. Job Completion
Timing: Variable (depends on job duration)
Actions:
- Pod exits
- Job status updated to Complete or Failed
- Pod enters Completed state
Kubernetes State:
Job: raibid-ci-agent-abc123
Status: Complete
Succeeded: 1
Failed: 0
Start Time: 2025-11-01T10:00:00Z
Completion Time: 2025-11-01T10:05:30Z
Duration: 5m30s
7. Job History Management
KEDA Action: Keep completed jobs based on history limits
successfulJobsHistoryLimit: 3 # Keep last 3 successful
failedJobsHistoryLimit: 5     # Keep last 5 failed

Old jobs are automatically deleted to prevent resource accumulation.
8. Return to Scale-to-Zero
Condition: No pending entries in Redis Streams
Timing: Immediate (no cooldown for ScaledJob)
Action: No new jobs created
Redis Queue: [] (all processed)
Kubernetes Jobs: 3 (completed, kept for history)
Active Pods: 0
Scaling Scenarios
Scenario A: Single Job
| Time | Queue | Jobs | Pods | Action |
|---|---|---|---|---|
| 0s | 0 | 0 | 0 | Idle |
| 10s | 1 | 0 | 0 | KEDA detects job |
| 15s | 1 | 1 | 0 | Job created |
| 20s | 1 | 1 | 1 | Pod running |
| 5m | 0 | 1 | 0 | Job complete, pod terminated |
Scenario B: Burst of Jobs
| Time | Queue | Jobs | Pods | Action |
|---|---|---|---|---|
| 0s | 0 | 0 | 0 | Idle |
| 5s | 10 | 0 | 0 | 10 jobs added |
| 15s | 10 | 10 | 5 | KEDA creates 10 jobs, 5 pods running |
| 20s | 10 | 10 | 10 | All 10 pods running (max replicas) |
| 25s | 8 | 10 | 10 | 2 jobs complete |
| 30s | 5 | 10 | 10 | 5 jobs complete |
| 35s | 2 | 10 | 8 | 8 jobs complete |
| 40s | 0 | 10 | 2 | All jobs complete, last 2 pods finishing |
| 45s | 0 | 10 | 0 | All pods terminated |
Scenario C: Continuous Flow
| Time | Queue | Jobs | Pods | Action |
|---|---|---|---|---|
| 0s | 5 | 5 | 5 | 5 jobs processing |
| 10s | 5 | 5 | 5 | 2 complete, 2 new jobs added |
| 20s | 5 | 5 | 5 | Steady state (jobs in = jobs out) |
| 30s | 8 | 8 | 8 | Burst: 3 new jobs added |
| 40s | 10 | 10 | 10 | Max replicas reached, 2 jobs queued |
| 50s | 7 | 10 | 10 | 3 jobs complete |
| 60s | 5 | 7 | 7 | Back to steady state |
Scenario D: Overload (Queue Backup)
| Time | Queue | Jobs | Pods | Action |
|---|---|---|---|---|
| 0s | 50 | 0 | 0 | Massive job backlog |
| 10s | 50 | 10 | 5 | KEDA creates max jobs (10), 5 running |
| 20s | 50 | 10 | 10 | All 10 pods running |
| 5m | 45 | 10 | 10 | 5 jobs complete, 5 new jobs started |
| 10m | 40 | 10 | 10 | Still at max capacity |
| 15m | 30 | 10 | 10 | Processing continues |
| ... | | | | |
Queue processes at: 10 jobs per average_job_duration
Key Point: Queue will process at maximum throughput (10 concurrent jobs). Excess jobs wait in Redis.
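The "maximum throughput" claim can be made concrete with a toy drain-time model. This is an assumption-laden sketch (uniform job durations, jobs starting in synchronized waves), not a scheduler simulation:

```python
import math

def drain_time_minutes(backlog, max_replicas, avg_job_minutes):
    """Idealized backlog drain time: jobs run in waves of max_replicas."""
    waves = math.ceil(backlog / max_replicas)  # full batches of concurrent jobs
    return waves * avg_job_minutes

# Scenario D: 50-job backlog, 10 replicas, 5-minute jobs -> 5 waves -> 25 minutes
```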
Scaling Triggers
Redis Streams Trigger
KEDA queries Redis for pending entries:
# What KEDA runs
XPENDING raibid:jobs raibid-workers
# Returns:
# [
# lowest_pending_id,
# highest_pending_id,
# pending_count,
# consumers
# ]

Scaling Logic:
def should_scale():
    pending = get_pending_count()
    running = get_running_jobs()
    desired = min(pending, max_replicas)
    if desired > running:
        create_jobs(desired - running)
    # ScaledJob doesn't scale down - jobs complete naturally

Trigger Metadata
metadata:
  address: raibid-redis-master.raibid-redis.svc.cluster.local:6379
  stream: raibid:jobs
  consumerGroup: raibid-workers
  pendingEntriesCount: "1"   # Minimum to trigger
  streamLength: "5"          # Total stream length threshold
  lagCount: "5"              # Consumer lag threshold
  activationLagCount: "0"    # Start scaling immediately

Multi-Metric Scaling
KEDA can combine multiple metrics:
triggers:
  - type: redis-streams
    metadata:
      pendingEntriesCount: "1"
  - type: cron
    metadata:
      timezone: UTC
      start: 0 8 * * 1-5   # 8 AM weekdays
      end: 0 18 * * 1-5    # 6 PM weekdays
      desiredReplicas: "5" # Keep 5 warm during business hours

Scaling Strategies
Default Strategy
Algorithm: 1 job per pending entry
Behavior: Conservative, predictable
scalingStrategy:
  strategy: "default"

Example:
- 5 pending entries → 5 Jobs created
- 20 pending entries, max 10 → 10 Jobs created
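The two bullets above reduce to a one-line cap. A minimal sketch (the function name is illustrative, not KEDA's):

```python
def default_strategy_jobs(pending_entries, max_replicas):
    """Default strategy: one Job per pending entry, capped at maxReplicaCount."""
    return min(pending_entries, max_replicas)
```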
Accurate Strategy
Algorithm: Precise calculation, minimal overprovisioning
Behavior: Slower to scale, most efficient
scalingStrategy:
  strategy: "accurate"

Use Case: Cost-sensitive environments, predictable workloads
Eager Strategy
Algorithm: Aggressive scaling
Behavior: Fast response, may overprovision
scalingStrategy:
  strategy: "eager"

Use Case: Time-sensitive CI, fast feedback required
Custom Strategy
Algorithm: User-defined logic
scalingStrategy:
  strategy: "custom"
  customScalingQueueLengthDeduction: 1
  customScalingRunningJobPercentage: "0.5"
  pendingPodConditions:
    - "Ready"
    - "PodScheduled"

Parameters:
- customScalingQueueLengthDeduction: Subtract from the queue length (accounts for already-running jobs)
- customScalingRunningJobPercentage: Consider a percentage of running jobs
- pendingPodConditions: Wait for these pod conditions before counting a pod as "running"
Performance Characteristics
Latency Metrics
| Metric | Target | Actual (Typical) |
|---|---|---|
| Queue detection | 10s | 5-15s (polling interval) |
| Job creation | 5s | 2-5s |
| Pod start (cached image) | 10s | 5-15s |
| Pod start (pull image) | 60s | 30-120s |
| Total (cached) | 25s | 15-35s |
| Total (uncached) | 75s | 45-150s |
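The "Total" target rows are just the three component targets summed, which is easy to sanity-check:

```python
def total_target_seconds(pod_start_target_s):
    """Sum the target latencies from the table: detection + creation + pod start."""
    queue_detection_s = 10  # polling interval target
    job_creation_s = 5      # KEDA Job creation target
    return queue_detection_s + job_creation_s + pod_start_target_s

# cached image: 10 + 5 + 10 = 25s; uncached: 10 + 5 + 60 = 75s
```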
Throughput
Maximum Throughput: max_replicas / average_job_duration
Examples:
- 10 max replicas, 5-minute jobs: 2 jobs/minute = 120 jobs/hour
- 10 max replicas, 30-second jobs: 20 jobs/minute = 1,200 jobs/hour
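The throughput formula as a helper (illustrative, not part of any raibid tooling):

```python
def jobs_per_hour(max_replicas, avg_job_minutes):
    """Maximum throughput: max_replicas jobs complete every avg_job_minutes."""
    return max_replicas * (60 / avg_job_minutes)
```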
Resource Efficiency
Idle Cost: $0 (scale-to-zero)
Active Cost: Only running jobs
Overhead: KEDA operator (~250m CPU, ~320Mi RAM)
Scaling Policies
Job History Retention
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5

Why:
- Keep recent jobs for debugging
- Prevent resource accumulation
- Failed jobs retained longer for troubleshooting
Polling Interval
pollingInterval: 10  # seconds

Trade-offs:
- Lower (5s): Faster response, higher Redis load
- Higher (30s): Lower overhead, slower response
Recommendation: 10s for most workloads
Maximum Replicas
maxReplicaCount: 10

Calculation:
max_replicas = min(
    available_cluster_resources / job_resource_request,
    desired_parallelism,
    cost_budget_limit
)

DGX Spark Example (20 cores, 128GB RAM):
# Each job: 1 CPU, 2GB RAM
max_cpu_replicas = 20 / 1 = 20
max_mem_replicas = 128 / 2 = 64
max_replicas = min(20, 64) = 20
# With 50% reserved for system:
max_replicas = 10

Autoscaling Best Practices
1. Right-Size Resource Requests
resources:
  requests:
    cpu: 1000m     # Based on actual usage
    memory: 2Gi    # 80% of typical usage
  limits:
    cpu: 4000m     # 150-200% of requests
    memory: 8Gi    # 150-200% of requests

2. Use Consumer Groups Correctly
# Create consumer group before deploying ScaledJob
redis-cli XGROUP CREATE raibid:jobs raibid-workers 0 MKSTREAM

3. Monitor Queue Depth
# Watch queue depth
watch -n 5 'redis-cli XLEN raibid:jobs'
# Check pending entries
redis-cli XPENDING raibid:jobs raibid-workers

4. Set Appropriate Job Timeouts
jobTargetRef:
  activeDeadlineSeconds: 3600  # Kill the Job after 1 hour
  backoffLimit: 2              # Retry failed jobs twice

5. Implement Health Checks
containers:
  - name: rust-agent
    livenessProbe:
      exec:
        command: ["/bin/sh", "-c", "pgrep -f rust-agent"]
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      exec:
        command: ["/bin/sh", "-c", "test -f /tmp/healthy"]
      initialDelaySeconds: 5
      periodSeconds: 5

6. Use Pod Anti-Affinity
Spread jobs across nodes:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: raibid-ci-agent
          topologyKey: kubernetes.io/hostname

7. Enable Metrics Collection
# Agent should expose metrics
- name: METRICS_ENABLED
  value: "true"
- name: METRICS_PORT
  value: "9090"

Troubleshooting Scaling Issues
Jobs Not Scaling
Symptom: Queue has jobs, but no Kubernetes Jobs created
Debug Steps:
# 1. Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator --tail=100
# 2. Check ScaledJob status
kubectl describe scaledjob raibid-ci-agent -n raibid-ci
# 3. Verify trigger authentication
kubectl get secret raibid-redis-auth -n raibid-ci
# 4. Test Redis connection
kubectl run redis-test --rm -it --image=redis -- \
redis-cli -h raibid-redis-master.raibid-redis.svc.cluster.local PING
# 5. Check pending entries
kubectl exec -n raibid-redis raibid-redis-master-0 -- \
  redis-cli XPENDING raibid:jobs raibid-workers

Slow Scaling
Symptom: Jobs created but pods take too long to start
Debug Steps:
# Check pod events
kubectl describe pod -n raibid-ci <pod-name>
# Common issues:
# - Image pull (pull image to all nodes beforehand)
# - Resource constraints (check node resources)
# - Scheduling delays (check node availability)

Stuck Jobs
Symptom: Jobs running but never complete
Debug Steps:
# Check job logs
kubectl logs -n raibid-ci job/<job-name>
# Check Redis ACK
kubectl exec -n raibid-redis raibid-redis-master-0 -- \
redis-cli XPENDING raibid:jobs raibid-workers
# Common issues:
# - Agent not calling XACK
# - Agent crashed before completion
# - Redis connection lost

Metrics and Monitoring
Key Metrics to Track
- Queue Depth: XLEN raibid:jobs
- Pending Entries: XPENDING raibid:jobs raibid-workers
- Active Jobs: kubectl get jobs -n raibid-ci
- Job Success Rate: successful_jobs / total_jobs
- Average Job Duration: Time from start to completion
- Time to Scale: Time from queue add to pod running
- Resource Utilization: CPU/memory usage per job
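Two of the derived metrics above, written out as plain functions (hypothetical helpers, named here only for illustration):

```python
def success_rate(successful_jobs, total_jobs):
    """Job Success Rate: successful_jobs / total_jobs (0.0 when no jobs yet)."""
    return successful_jobs / total_jobs if total_jobs else 0.0

def time_to_scale_seconds(enqueued_at, pod_running_at):
    """Time to Scale: seconds between queue add and pod running."""
    return pod_running_at - enqueued_at
```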
Prometheus Metrics
# KEDA exposes metrics
- keda_scaler_errors_total
- keda_scaled_job_paused
- keda_scaledjob_max_replicas