# Infrastructure as Code

This directory contains all infrastructure-as-code (IaC) for the raibid-ci system. Infrastructure is kept separate from application code to enable independent validation, versioning, and deployment.
## Directory Structure

```
infra/
├── k3s/          # k3s cluster configuration
├── gitea/        # Gitea Git server manifests
├── redis/        # Redis job queue manifests
├── flux/         # Flux GitOps configuration
├── keda/         # KEDA autoscaling resources
├── scripts/      # Validation and deployment scripts
├── Taskfile.yml  # Task automation
└── README.md     # This file
```
## Component Overview

### k3s - Lightweight Kubernetes

- Purpose: Foundation Kubernetes cluster for DGX Spark
- Location: `/infra/k3s/`
- Deployment: Installed via raibid-cli
- Dependencies: None (base layer)
### Gitea - Git Server & OCI Registry

- Purpose: Self-hosted Git with container registry
- Location: `/infra/gitea/`
- Deployment: Helm chart with custom values
- Dependencies: k3s
### Redis - Job Queue

- Purpose: Redis Streams for job queue management
- Location: `/infra/redis/`
- Deployment: Helm chart with custom values
- Dependencies: k3s
### Flux - GitOps CD

- Purpose: Continuous delivery from Gitea
- Location: `/infra/flux/`
- Deployment: Flux bootstrap
- Dependencies: k3s, Gitea
### KEDA - Event-Driven Autoscaling

- Purpose: Scale CI agents based on job queue depth
- Location: `/infra/keda/`
- Deployment: Helm chart with ScaledObject
- Dependencies: k3s, Redis
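As a rough sketch, the ScaledObject pairing KEDA with the Redis Streams queue might look like the following. The deployment name, stream name, consumer group, and thresholds here are illustrative placeholders, not the actual values used by raibid-ci:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ci-agent-scaler
  namespace: raibid-keda
spec:
  scaleTargetRef:
    name: ci-agent              # Deployment to scale (placeholder name)
  minReplicaCount: 0            # Scale to zero when the queue is empty
  maxReplicaCount: 10
  triggers:
    - type: redis-streams
      metadata:
        address: redis.raibid-redis.svc.cluster.local:6379
        stream: ci-jobs         # Job stream name (assumption)
        consumerGroup: ci-agents
        pendingEntriesCount: "5"  # Scale up when backlog exceeds this
```

The `redis-streams` trigger scales on pending entries in a consumer group, which matches the "scale on queue depth" behavior described above.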
## Deployment Sequence

Infrastructure components must be deployed in dependency order:

1. k3s - Base Kubernetes cluster
2. Redis - Job queue (can be deployed in parallel with Gitea)
3. Gitea - Git server (can be deployed in parallel with Redis)
4. KEDA - Autoscaler (requires Redis)
5. Flux - GitOps (requires Gitea)
### Automated Deployment

```bash
# Deploy all components in correct order
task infra:deploy-all

# Or deploy individually
task infra:deploy-k3s
task infra:deploy-redis
task infra:deploy-gitea
task infra:deploy-keda
task infra:deploy-flux
```

### Manual Deployment
```bash
# Via raibid-cli (recommended)
raibid-cli setup all

# Or individual components
raibid-cli setup k3s
raibid-cli setup redis
raibid-cli setup gitea
raibid-cli setup keda
raibid-cli setup flux
```

## Validation
All infrastructure manifests can be validated before deployment:

```bash
# Validate all manifests
task infra:validate

# Validate specific component
task infra:validate-gitea
task infra:validate-redis
task infra:validate-keda

# Lint manifests
task infra:lint
```

### Validation Scripts
Located in `/infra/scripts/`:

- `validate-manifests.sh` - YAML syntax and schema validation
- `lint-manifests.sh` - Linting with yamllint and kubeval
- `check-dependencies.sh` - Verify dependency order
## Configuration

Each component has its own configuration approach:

### Helm-based Components (Gitea, Redis, KEDA)

Configuration via Helm values files:

- `values.yaml` - Default production values
- `values-dev.yaml` - Development overrides
- `values-test.yaml` - Testing overrides
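For illustration, a development override file typically pares back resources and persistence relative to the production defaults. This is a hypothetical `values-dev.yaml` sketch; the actual keys depend on each chart's schema:

```yaml
# values-dev.yaml - hypothetical development overrides
replicaCount: 1
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 250m
    memory: 256Mi
persistence:
  enabled: false   # Skip persistent volumes for throwaway dev clusters
```

Overrides are layered on top of the defaults at install time, e.g. `helm install -f values.yaml -f values-dev.yaml`.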
### k3s Configuration

Configuration via install scripts and config files:

- `config.yaml` - k3s cluster config
- `install-flags.txt` - Installation flags
### Flux Configuration

Configuration via GitRepository and Kustomization resources:

- `flux-system/` - Flux system components
- `clusters/` - Cluster-specific configs
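The GitRepository/Kustomization pair wiring Flux to Gitea might be sketched as follows. The in-cluster Gitea URL, repository path, and sync intervals are assumptions for illustration:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: raibid-ci
  namespace: flux-system
spec:
  interval: 1m
  # In-cluster Gitea service URL (assumed hostname and port)
  url: http://gitea-http.raibid-gitea.svc.cluster.local:3000/raibid/raibid-ci.git
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: raibid-ci
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: raibid-ci
  path: ./infra          # Directory to reconcile (assumption)
  prune: true            # Remove resources deleted from Git
```

With `prune: true`, Flux keeps the cluster converged on the Git state, which is what makes Git the source of truth for recovery.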
## CI/CD Integration

Infrastructure validation is integrated into GitHub Actions (`.github/workflows/infra-validation.yml`):

- Validates all manifests on PR
- Lints YAML files
- Checks Helm chart syntax
- Verifies dependency order

## Manifest Standards
All Kubernetes manifests must follow these standards:

### YAML Format

- 2-space indentation
- No tabs
- UTF-8 encoding
- LF line endings

### Metadata

All resources must have `metadata.labels`:

- `app.kubernetes.io/name`
- `app.kubernetes.io/component`
- `app.kubernetes.io/part-of: raibid-ci`
- `app.kubernetes.io/managed-by`
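A metadata block following this labeling standard could look like the sketch below (the resource name and component value are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitea
  namespace: raibid-gitea
  labels:
    app.kubernetes.io/name: gitea          # The application name
    app.kubernetes.io/component: git-server  # Role within the system (assumption)
    app.kubernetes.io/part-of: raibid-ci   # Umbrella system, per the standard
    app.kubernetes.io/managed-by: Helm     # Tool managing this resource
```

These are the Kubernetes "recommended labels", which also make label selectors like `-l app.kubernetes.io/name=gitea` work consistently across components.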
### Namespaces

- Each component runs in a dedicated namespace
- Namespace naming: `raibid-{component}`
- Namespace manifests are included in each component directory
### Resource Limits

- All pods must have resource requests and limits
- CPU in millicores (e.g., `100m`)
- Memory with a unit suffix (e.g., `256Mi`)
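In a container spec, that standard translates to a `resources` block like this (the specific values are examples, not mandated defaults):

```yaml
containers:
  - name: ci-agent            # Placeholder container name
    image: registry.example/ci-agent:1.0.0
    resources:
      requests:               # Guaranteed baseline for scheduling
        cpu: 100m
        memory: 256Mi
      limits:                 # Hard ceiling enforced by the kubelet
        cpu: 500m
        memory: 512Mi
```

Exceeding the memory limit gets the container OOM-killed, while the CPU limit only throttles, so memory limits deserve the most care.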
## Development Workflow

### Adding New Infrastructure

1. Create a component directory under `/infra/`
2. Add manifests with proper labels and namespaces
3. Create a Helm values file if using Helm
4. Add validation tests
5. Update the dependency chain if needed
6. Document in the component README
### Testing Changes

```bash
# Validate changes
task infra:validate

# Dry-run deployment
kubectl apply --dry-run=client -f infra/component/

# Deploy to test cluster
task infra:deploy-test
```

### Submitting Changes
1. Create a feature branch from `main`
2. Make infrastructure changes
3. Validate locally: `task infra:validate`
4. Commit with a descriptive message
5. Push and create a PR
6. CI will validate automatically
7. Merge after approval
## Troubleshooting

### Validation Failures

```bash
# Check YAML syntax
yamllint infra/

# Check Kubernetes schemas
kubeval infra/**/*.yaml

# Verify Helm charts
helm lint infra/gitea/chart
```

### Deployment Issues
```bash
# Check component status
kubectl get all -n raibid-{component}

# View component logs
kubectl logs -n raibid-{component} -l app.kubernetes.io/name={component}

# Describe failing pods
kubectl describe pod -n raibid-{component} {pod-name}
```

### Rollback
```bash
# Rollback via Helm (for Helm-based components)
helm rollback {release} -n {namespace}

# Or tear down and redeploy
raibid-cli teardown {component}
raibid-cli setup {component}
```

## Security Considerations
### Secrets Management

- Never commit secrets to Git
- Use Kubernetes Secrets or an external secrets manager
- Rotate credentials regularly
### Network Policies

- Isolate namespaces with NetworkPolicies
- Restrict ingress/egress traffic
- Allow only required pod-to-pod communication
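As one concrete sketch of "allow only required pod-to-pod communication", a policy admitting traffic to Redis only from the KEDA namespace might look like this (namespace and label values are assumptions based on the conventions above):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-keda-to-redis
  namespace: raibid-redis
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: redis   # Applies to the Redis pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: raibid-keda
      ports:
        - protocol: TCP
          port: 6379                  # Redis port
```

Once a pod is selected by any NetworkPolicy, all ingress not explicitly allowed is denied, so a policy like this effectively isolates Redis from other namespaces.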
### RBAC

- Apply the least-privilege principle
- Use dedicated ServiceAccounts
- Limit cluster-admin usage
### Image Security

- Use specific image tags (not `latest`)
- Scan images for vulnerabilities
- Use a private registry for production
## Monitoring

### Health Checks

- All pods have liveness and readiness probes
- Services have health check endpoints
- Ingress has health checks configured
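A typical probe pair in a container spec looks like the sketch below; the endpoint paths and port are placeholders, since each component exposes its own health endpoints:

```yaml
livenessProbe:              # Restart the container if this fails
  httpGet:
    path: /healthz          # Assumed health endpoint
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:             # Remove the pod from Service endpoints if this fails
  httpGet:
    path: /readyz           # Assumed readiness endpoint
    port: 3000
  periodSeconds: 5
```

Liveness failures trigger restarts while readiness failures only gate traffic, so the readiness probe should be the stricter of the two.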
### Metrics

- Prometheus metrics enabled for all components
- ServiceMonitors for metric scraping
- Custom metrics for autoscaling
### Logging

- Structured JSON logs
- Centralized log aggregation (future)
- Log retention policies
## Backup and Recovery

### GitOps State

- Flux ensures declarative state
- Git is the source of truth
- Restore by re-applying from Git
### Persistent Data

#### Gitea Repositories

```bash
# Backup
kubectl exec -n raibid-gitea {pod} -- tar czf /tmp/repos.tar.gz /data/git
kubectl cp raibid-gitea/{pod}:/tmp/repos.tar.gz ./backup/repos.tar.gz

# Restore
kubectl cp ./backup/repos.tar.gz raibid-gitea/{pod}:/tmp/repos.tar.gz
kubectl exec -n raibid-gitea {pod} -- tar xzf /tmp/repos.tar.gz -C /
```

#### Redis Data
```bash
# Backup
kubectl exec -n raibid-redis {pod} -- redis-cli SAVE
kubectl cp raibid-redis/{pod}:/data/dump.rdb ./backup/redis.rdb

# Restore
kubectl cp ./backup/redis.rdb raibid-redis/{pod}:/data/dump.rdb
kubectl delete pod -n raibid-redis {pod}  # Restart to load dump
```

## Production Readiness Checklist
Before deploying to production:

- [ ] All manifests validated
- [ ] Resource limits configured
- [ ] Secrets externalized
- [ ] Network policies applied
- [ ] RBAC configured
- [ ] Monitoring enabled
- [ ] Logging configured
- [ ] Backup strategy implemented
- [ ] Disaster recovery tested
- [ ] Documentation updated
## References

### Related Documentation

- Project README
- CLAUDE.md - Project overview
- Workstreams - Development workstreams
## Support

For issues or questions:

- Open an issue on GitHub
- Check the docs for detailed guides
- Review component-specific READMEs