KEDA Deployment Checklist
Use this checklist when deploying KEDA for raibid-ci.
Pre-Deployment
- k3s cluster is running (kubectl cluster-info)
- Redis is deployed and healthy (kubectl get pods -n raibid-redis)
- Helm 3.x is installed (helm version)
- kubectl has cluster access
- Sufficient cluster resources (min 500m CPU, 1Gi RAM available)
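The prerequisite checks above could be gathered into a small preflight sketch. The helper names (`require_cmd`, `cpu_to_millicores`, `preflight`) are hypothetical and not part of raibid-ci. The CPU conversion only handles whole cores or `m`-suffixed values.

```shell
# Hypothetical preflight helpers; names are illustrative, not part of raibid-ci.

# Succeeds if the named command is on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Converts a Kubernetes CPU quantity to millicores.
# Assumption: input is either "<n>m" or a whole number of cores.
cpu_to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo "$(( $1 * 1000 ))" ;;
  esac
}

# Runs the checklist's pre-deployment commands and stops at the first failure.
preflight() {
  local cmd
  for cmd in kubectl helm; do
    require_cmd "$cmd" || { echo "missing: $cmd" >&2; return 1; }
  done
  kubectl cluster-info >/dev/null || { echo "cluster unreachable" >&2; return 1; }
  kubectl get pods -n raibid-redis || return 1
  helm version --short
}
```

Calling `preflight` before step 1 runs the tool, cluster, and Redis checks in order. `cpu_to_millicores` can help compare reported capacity against the 500m minimum.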
Deployment Steps
1. Deploy KEDA Operator
- Add KEDA Helm repository
    helm repo add kedacore https://kedacore.github.io/charts
    helm repo update
- Create namespaces
    kubectl apply -f namespace.yaml
- Install KEDA via Helm
    helm upgrade --install raibid-keda kedacore/keda \
      --namespace keda \
      --version 2.12.0 \
      --values values.yaml \
      --wait
- Verify KEDA pods are running
    kubectl get pods -n keda
  Expected: keda-operator, keda-metrics-apiserver, keda-admission-webhooks (all Running)
- Verify CRDs are installed
    kubectl get crd | grep keda
  Expected: scaledobjects.keda.sh, scaledjobs.keda.sh, triggerauthentications.keda.sh
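Checking the CRD list by eye is easy to get wrong, so a scripted check may help. The sketch below assumes the CRD list comes from `kubectl get crd -o name`, which prints entries such as `customresourcedefinition.apiextensions.k8s.io/scaledjobs.keda.sh`. The function name `missing_keda_crds` is hypothetical.

```shell
# Hypothetical helper: reads `kubectl get crd -o name` output on stdin and
# prints each expected KEDA CRD that is absent. Prints nothing when all exist.
missing_keda_crds() {
  local installed crd
  installed=$(cat)
  for crd in scaledobjects.keda.sh scaledjobs.keda.sh triggerauthentications.keda.sh; do
    # Anchor on "/" so clustertriggerauthentications.keda.sh does not
    # satisfy the check for triggerauthentications.keda.sh.
    printf '%s\n' "$installed" | grep -q "/${crd}\$" || echo "$crd"
  done
}
```

A possible invocation is `kubectl get crd -o name | missing_keda_crds`. Empty output means all three CRDs from the checklist are installed.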
2. Configure Authentication
- Ensure Redis auth secret exists in raibid-ci namespace
    kubectl get secret raibid-redis-auth -n raibid-ci
- If missing, create Redis auth secret:
    kubectl create secret generic raibid-redis-auth \
      -n raibid-ci \
      --from-literal=password=<redis-password> \
      --from-literal=address=raibid-redis-master.raibid-redis.svc.cluster.local:6379
- Deploy TriggerAuthentication
    kubectl apply -f triggerauth.yaml
- Verify TriggerAuthentication
    kubectl get triggerauthentication -n raibid-ci
    kubectl describe triggerauthentication raibid-redis-trigger-auth -n raibid-ci
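The contents of triggerauth.yaml are not shown in this checklist. As a rough illustration only, the sketch below prints a TriggerAuthentication that maps the `password` and `address` keys of the secret to scaler parameters through KEDA's `secretTargetRef`. The function name `render_trigger_auth` is hypothetical, and the real manifest may differ.

```shell
# Hypothetical sketch of what triggerauth.yaml might contain; the real file is
# not shown in this checklist. Prints a TriggerAuthentication manifest that maps
# the two keys of the Redis auth secret to scaler parameters.
render_trigger_auth() {
  local name="${1:-raibid-redis-trigger-auth}"
  local secret="${2:-raibid-redis-auth}"
  cat <<EOF
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: ${name}
  namespace: raibid-ci
spec:
  secretTargetRef:
    - parameter: password
      name: ${secret}
      key: password
    - parameter: address
      name: ${secret}
      key: address
EOF
}
```

Comparing this output with the repository's triggerauth.yaml is one way to confirm that the secret keys created in the previous step match what the manifest expects.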
3. Deploy ScaledJob
- Review ScaledJob configuration
  - Max replicas: 10
  - Polling interval: 10s
  - Container image: ghcr.io/raibid-labs/rust-agent:latest
  - Resource limits are appropriate
- Deploy ScaledJob
    kubectl apply -f scaledjob.yaml
- Verify ScaledJob is created
    kubectl get scaledjob raibid-ci-agent -n raibid-ci
    kubectl describe scaledjob raibid-ci-agent -n raibid-ci
4. Create Redis Consumer Group
- Create consumer group in Redis (if it does not already exist)
    kubectl exec -n raibid-redis raibid-redis-master-0 -- \
      redis-cli -a <password> XGROUP CREATE raibid:jobs raibid-workers 0 MKSTREAM
  Note: Ignore the “BUSYGROUP” error if the group already exists.
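To make this step safe to re-run, the BUSYGROUP reply can be treated as success in a wrapper. This is a minimal sketch: `ensure_consumer_group` is a hypothetical name, and `REDIS_PASSWORD` is an assumed environment variable holding the Redis password.

```shell
# Hypothetical wrapper: creates the consumer group and treats Redis's
# BUSYGROUP reply (group already exists) as success.
# Assumption: REDIS_PASSWORD holds the Redis password.
ensure_consumer_group() {
  local out
  [ -n "${REDIS_PASSWORD:-}" ] || { echo "REDIS_PASSWORD is not set" >&2; return 1; }
  out=$(kubectl exec -n raibid-redis raibid-redis-master-0 -- \
    redis-cli -a "$REDIS_PASSWORD" \
    XGROUP CREATE raibid:jobs raibid-workers 0 MKSTREAM 2>&1) || true
  case "$out" in
    *BUSYGROUP*) echo "exists" ;;
    *OK*)        echo "created" ;;
    *)           echo "$out" >&2; return 1 ;;
  esac
}
```

The function prints `created` or `exists` and returns non-zero for any other reply, such as an authentication error.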
Post-Deployment Validation
Automated Validation
- Run validation script
    ./validate-keda.sh
  Expected: All checks pass
Manual Validation
- Check KEDA operator logs for errors
    kubectl logs -n keda -l app=keda-operator --tail=50
- Verify no error events
    kubectl get events -n keda --sort-by='.lastTimestamp' | tail -10
    kubectl get events -n raibid-ci --sort-by='.lastTimestamp' | tail -10
- Check ScaledJob status
    kubectl get scaledjob raibid-ci-agent -n raibid-ci -o yaml
  Look for: status.conditions showing Ready=True
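Rather than scanning the full YAML, the Ready condition can be extracted with a JSONPath filter. The sketch below assumes the condition type is named `Ready`, as the check above implies. `scaledjob_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds only when the ScaledJob reports Ready=True.
scaledjob_ready() {
  local name="${1:-raibid-ci-agent}"
  local ns="${2:-raibid-ci}"
  local status
  status=$(kubectl get scaledjob "$name" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}') || return 1
  [ "$status" = "True" ]
}
```

For example, `scaledjob_ready && echo "ScaledJob is ready"` gives a yes/no answer that is easy to use in a script.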
Functional Testing
Test Scale From Zero
- Ensure no jobs are running
    kubectl get jobs -n raibid-ci
- Add test job to Redis
    kubectl port-forward -n raibid-redis svc/raibid-redis-master 6379:6379 &
    redis-cli -a <password> XADD raibid:jobs '*' \
      job_id test-001 \
      repo raibid-labs/test \
      branch main \
      commit abc123
- Watch KEDA create job (within 15 seconds)
    kubectl get jobs -n raibid-ci -w
  Expected: New job appears
- Watch pod spawn
    kubectl get pods -n raibid-ci -w
  Expected: New pod in Running state
- Clean up test
    kubectl delete jobs -n raibid-ci -l app=raibid-ci-agent
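The 15-second expectation in the scale-from-zero test can be checked with a polling loop instead of watching by hand. This is a sketch under stated assumptions: `wait_for` and `jobs_exist` are hypothetical helpers, and the existence of any Job in raibid-ci is taken as the success signal.

```shell
# Hypothetical helper: polls a command until it succeeds or the timeout expires.
# Usage: wait_for <timeout-seconds> <interval-seconds> <command...>
wait_for() {
  local timeout="$1"
  local interval="$2"
  local elapsed=0
  shift 2
  while ! "$@"; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  return 0
}

# Succeeds when at least one Job exists in the raibid-ci namespace.
jobs_exist() {
  [ -n "$(kubectl get jobs -n raibid-ci -o name 2>/dev/null)" ]
}
```

After adding the test entry to the stream, `wait_for 15 1 jobs_exist && echo "job created in time"` reports whether KEDA reacted within the expected window.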
Test Autoscaling
- Run autoscaling test script
    ./test-autoscaling.sh 5
  Expected: 5 jobs created, pods spawn, scaling works
Performance Verification
- Measure scale-up latency
  - Queue detection: < 15 seconds
  - Job creation: < 5 seconds
  - Pod start (cached image): < 30 seconds
  - Total latency: < 50 seconds
- Verify resource usage
    kubectl top pods -n keda
    kubectl top pods -n raibid-ci
- Check KEDA operator resource consumption
  Expected: < 100m CPU, < 200Mi RAM
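The operator resource expectation can be compared against `kubectl top` output mechanically. The sketch below assumes `kubectl top pods --no-headers` reports CPU in millicores (`m`) and memory in `Mi`, which is the usual format. `check_usage` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl top pods --no-headers` output on stdin
# (columns: NAME CPU MEMORY, e.g. "keda-operator-abc 12m 85Mi").
# Prints each pod at or above the limits and returns non-zero if any is found.
# Arguments: CPU limit in millicores (default 100), memory limit in Mi (default 200).
check_usage() {
  awk -v cpu_max="${1:-100}" -v mem_max="${2:-200}" '
    {
      cpu = $2; mem = $3
      sub(/m$/, "", cpu); sub(/Mi$/, "", mem)
      if (cpu + 0 >= cpu_max + 0 || mem + 0 >= mem_max + 0) { print $1; bad = 1 }
    }
    END { exit bad + 0 }
  '
}
```

A possible invocation is `kubectl top pods -n keda -l app=keda-operator --no-headers | check_usage`, which stays silent when the operator is within the 100m CPU and 200Mi RAM expectation.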
Monitoring Setup
- Configure log aggregation for KEDA operator
- Set up alerts for KEDA failures
- Create dashboard for queue depth and scaling metrics
- Configure job success/failure tracking
Documentation
- Update deployment runbook with any environment-specific notes
- Document any custom configuration changes
- Record KEDA operator version deployed
- Note any known issues or workarounds
Troubleshooting Checklist
If issues occur, check:
- KEDA operator logs:
    kubectl logs -n keda -l app=keda-operator
- ScaledJob events:
    kubectl describe scaledjob raibid-ci-agent -n raibid-ci
- Redis connectivity:
    kubectl exec -n raibid-redis raibid-redis-master-0 -- redis-cli -a <password> PING
- TriggerAuth secret:
    kubectl get secret raibid-redis-auth -n raibid-ci
- Consumer group exists:
    kubectl exec -n raibid-redis raibid-redis-master-0 -- redis-cli -a <password> XINFO GROUPS raibid:jobs
- Resource quotas:
    kubectl describe resourcequota -n raibid-ci
- Image pull secrets:
    kubectl get pods -n raibid-ci
  (check for ImagePullBackOff)
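The image pull check can be automated by filtering the STATUS column of the pod listing. This sketch assumes the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, RESTARTS, AGE). `image_pull_failures` is a hypothetical helper name.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints the names of pods whose STATUS shows an image pull failure.
image_pull_failures() {
  awk '$3 == "ImagePullBackOff" || $3 == "ErrImagePull" { print $1 }'
}
```

For example, `kubectl get pods -n raibid-ci --no-headers | image_pull_failures` lists only the affected pods, which can then be inspected with `kubectl describe pod`.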
Rollback Procedure
If deployment fails and rollback is needed:
- Delete ScaledJob
    kubectl delete scaledjob raibid-ci-agent -n raibid-ci
- Delete TriggerAuthentication
    kubectl delete triggerauthentication raibid-redis-trigger-auth -n raibid-ci
- Uninstall KEDA
    helm uninstall raibid-keda -n keda
- Delete namespace (if needed)
    kubectl delete namespace keda
- Review logs and errors before redeployment
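The rollback steps could be wrapped in a script with a dry-run mode so the sequence can be reviewed before anything is deleted. This is a sketch only: `run`, `rollback_keda`, `DRY_RUN`, and `DELETE_NAMESPACE` are hypothetical names. The `--ignore-not-found` flag on `kubectl delete` keeps re-runs from failing on resources that are already gone.

```shell
# Hypothetical rollback wrapper. With DRY_RUN=1 it only prints the commands.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Removes KEDA resources in the checklist's order: dependents first, then the
# operator. The namespace is deleted only when DELETE_NAMESPACE=1.
rollback_keda() {
  run kubectl delete scaledjob raibid-ci-agent -n raibid-ci --ignore-not-found
  run kubectl delete triggerauthentication raibid-redis-trigger-auth -n raibid-ci --ignore-not-found
  run helm uninstall raibid-keda -n keda
  if [ "${DELETE_NAMESPACE:-0}" = "1" ]; then
    run kubectl delete namespace keda --ignore-not-found
  fi
}
```

Setting `DRY_RUN=1` and calling `rollback_keda` prints the planned commands. Unsetting it performs the rollback.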
Sign-Off
- Deployment completed by: _________________ Date: _________
- Validation completed by: _________________ Date: _________
- Approved for production: ________________ Date: _________
Notes
Record any deployment-specific notes, issues, or deviations from standard procedure:
_________________________________________________________________________
_________________________________________________________________________
_________________________________________________________________________