k3s Configuration

Lightweight Kubernetes distribution optimized for the DGX Spark ARM64 platform.

Overview

k3s is the foundation layer for the raibid-ci infrastructure. It provides a production-ready Kubernetes cluster with a minimal resource footprint, specifically configured for the DGX Spark’s ARM64 architecture.

Quick Start

# Automated installation (recommended; standard mode needs root)
sudo ./install.sh
 
# Rootless mode (no root required)
./install.sh --rootless
 
# Validate installation
./validate-installation.sh

Configuration Files

| File | Purpose |
| --- | --- |
| config.yaml | Standard mode k3s cluster configuration |
| rootless-config.yaml | Rootless mode configuration |
| install-flags.txt | Installation flags reference |
| namespaces.yaml | Namespace definitions for CI, infrastructure, monitoring |
| registries.yaml | OCI registry configuration for Gitea |
| storageclass.yaml | Local storage provisioner configuration |
| resource-quotas.yaml | Resource limits and quotas for namespaces |
| coredns-custom.yaml | Custom DNS entries and CoreDNS configuration |
| install.sh | Automated installation script with checksum verification |
| validate-installation.sh | Post-installation validation script |
| INSTALLATION.md | Detailed installation runbook |

Installation

The automated installation script handles everything:

cd /home/beengud/raibid-labs/raibid-ci/infra/k3s
sudo ./install.sh

What it does:

  • Verifies ARM64 architecture
  • Downloads k3s with checksum verification
  • Configures DGX Spark optimizations
  • Creates namespaces and applies manifests
  • Sets up storage and resource quotas
  • Validates installation
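
The first two steps above can be sketched in a few lines of shell. This is only an illustration of the idea, not the contents of install.sh: the helper names `is_arm64` and `verify_sha256` are hypothetical, and the real script may perform these checks differently.

```shell
#!/bin/sh
# Hypothetical helpers; the real install.sh may be structured differently.

# Succeeds when the given machine string (as printed by `uname -m`) is ARM64.
is_arm64() {
    case "$1" in
        aarch64|arm64) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the SHA-256 digest of file $1 equals the expected hex digest $2.
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{print $1}')
    [ "$actual" = "$2" ]
}

# Report the architecture of the current host; the download step is omitted.
if is_arm64 "$(uname -m)"; then
    echo "architecture ok: $(uname -m)"
else
    echo "warning: expected ARM64, found $(uname -m)" >&2
fi
```

A caller would typically abort the installation when either helper fails, so that a binary for the wrong platform or with a mismatched digest is never executed.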

See INSTALLATION.md for the detailed installation guide.

Via raibid-cli (Future)

raibid-cli setup k3s

Manual Installation

See INSTALLATION.md for step-by-step manual installation instructions.

Features

Lightweight & Fast

  • Single binary: ~100MB
  • Fast startup: <2 minutes to cluster ready
  • Low memory footprint: ~512MB base
  • Optimized for ARM64

Integrated Components

  • Traefik: Disabled (using custom ingress)
  • CoreDNS: Customized for raibid-ci
  • Local-path storage: Configured for DGX Spark
  • Metrics server: Enabled for autoscaling
  • Flannel CNI: VXLAN backend for networking

Security Features

  • Secrets encryption at rest
  • TLS for all components
  • RBAC enabled by default
  • Network policies support

DGX Spark Optimizations

  • ARM64 native binary
  • Resource reservations (4 cores, 16GB for system)
  • Kubernetes reservations (2 cores, 8GB for k3s)
  • Max 110 pods per node
  • Overlayfs snapshotter for performance
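
Reservations and the pod limit listed above are usually handed to the kubelet through the `kubelet-arg` list in the k3s configuration file. The sketch below prints such a fragment; `render_kubelet_args` is a hypothetical helper, and whether this repository's config.yaml expresses the values in exactly this way is an assumption.

```shell
#!/bin/sh
# Prints a k3s config.yaml fragment that passes reservations to the kubelet.
# Arguments: system CPU, system memory, kube CPU, kube memory, max pods.
# Hypothetical helper; the repository's config.yaml may differ.
render_kubelet_args() {
    cat <<EOF
kubelet-arg:
  - "system-reserved=cpu=$1,memory=$2"
  - "kube-reserved=cpu=$3,memory=$4"
  - "max-pods=$5"
EOF
}

# Values taken from the list above.
render_kubelet_args 4000m 16Gi 2000m 8Gi 110
```

`system-reserved`, `kube-reserved`, and `max-pods` are standard kubelet flags; only the way they are assembled here is illustrative.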

Configuration Options

Standard Mode

# /etc/rancher/k3s/config.yaml
write-kubeconfig-mode: "0644"
node-label:
  - "raibid-ci=true"
  - "arch=arm64"
disable:
  - traefik
secrets-encryption: true
snapshotter: "overlayfs"

Rootless Mode

# ~/.config/k3s/config.yaml
rootless: true
write-kubeconfig-mode: "0644"
snapshotter: "overlayfs"
disable:
  - traefik

Resource Reservations

| Component | CPU | Memory |
| --- | --- | --- |
| System Reserved | 4000m | 16Gi |
| Kubernetes Reserved | 2000m | 8Gi |
| Available for Workloads | 14 cores | 104Gi |
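
The "Available for Workloads" row follows from subtracting both reservations from the node totals (20 cores and a nominal 128 GB of memory, per the hardware figures in this document). A minimal worked example, with `available` as a hypothetical helper name:

```shell
#!/bin/sh
# Subtracts the system and Kubernetes reservations from a node total.
# Arguments: total, system reserved, Kubernetes reserved (same unit for all).
available() {
    echo $(( $1 - $2 - $3 ))
}

cpu=$(available 20 4 2)      # cores
mem=$(available 128 16 8)    # memory, treating the nominal 128 GB as 128Gi
echo "available: ${cpu} cores, ${mem}Gi"
```

In practice the kubelet also subtracts its eviction thresholds when computing allocatable capacity, so the figure reported by `kubectl describe node` is likely to be slightly lower.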

Namespace Quotas

| Namespace | CPU Quota | Memory Quota | Storage Quota |
| --- | --- | --- | --- |
| raibid-ci | 10 cores | 80Gi | 100Gi |
| raibid-infrastructure | 6-8 cores | 32-40Gi | 500Gi |
| raibid-monitoring | 2-4 cores | 8-16Gi | 100Gi |
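
For illustration, a quota row such as the one for raibid-ci could map to a Kubernetes ResourceQuota object like the one printed below. This is a sketch under assumptions: `render_quota` and the `<namespace>-quota` object name are hypothetical, and resource-quotas.yaml may constrain limits rather than requests, or use different keys.

```shell
#!/bin/sh
# Prints a ResourceQuota manifest for one namespace.
# Arguments: namespace, CPU cores, memory, storage.
# Hypothetical helper; object name and chosen keys are assumptions.
render_quota() {
    cat <<EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: $1-quota
  namespace: $1
spec:
  hard:
    requests.cpu: "$2"
    requests.memory: $3
    requests.storage: $4
EOF
}

# Values from the raibid-ci row of the table.
render_quota raibid-ci 10 80Gi 100Gi
```

The output could be piped to `kubectl apply -f -` on a live cluster; that step is left out so the sketch runs without one.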

Resource Requirements

Minimum

  • CPU: 2 cores
  • Memory: 4GB
  • Disk: 20GB

Recommended (DGX Spark)

  • CPU: 20 cores (10x Cortex-X925, 10x Cortex-A725)
  • Memory: 128GB LPDDR5x
  • Disk: 100GB+ NVMe

Validation

Automated Validation

./validate-installation.sh

Tests performed:

  • k3s binary and service status
  • kubectl connectivity
  • Node ready state and labels
  • Namespace creation
  • System pods (CoreDNS, metrics-server)
  • Storage provisioning
  • DNS resolution
  • Networking (CNI plugins)
  • Resource quotas and limits
  • Platform verification (ARM64)
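
A validation script of this kind is often built around a small check runner that executes each test, records the result, and exits non-zero if anything failed. The following is a minimal sketch of that pattern, not the actual validate-installation.sh; the names `check` and `summary` are hypothetical.

```shell
#!/bin/sh
passed=0
failed=0

# Runs a command quietly and records PASS or FAIL under the given label.
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        passed=$((passed + 1))
        echo "PASS  $label"
    else
        failed=$((failed + 1))
        echo "FAIL  $label"
    fi
}

# Prints totals and succeeds only when no check failed.
summary() {
    echo "passed=$passed failed=$failed"
    [ "$failed" -eq 0 ]
}

# Example usage against a live cluster (commented out so the sketch runs anywhere):
# check "kubectl connectivity" kubectl cluster-info
# check "raibid-ci namespace"  kubectl get namespace raibid-ci
# summary
```

Because `summary` returns a failing status when any check failed, the script can be used directly as a gate in automation.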

Manual Validation

# Check cluster status
kubectl cluster-info
kubectl get nodes -o wide
 
# Verify namespaces
kubectl get namespaces
 
# Check system pods
kubectl get pods -n kube-system
 
# Test storage
kubectl get storageclass
kubectl get pvc -A
 
# Verify resource quotas
kubectl get resourcequota -A
kubectl get limitrange -A
 
# Test cluster
kubectl run test --image=nginx --rm -it -- /bin/sh

Troubleshooting

Service Not Starting

# Check service status
sudo systemctl status k3s
 
# View logs
sudo journalctl -u k3s -f
 
# Restart service
sudo systemctl restart k3s

Network Issues

# Check CNI plugins
ls /var/lib/rancher/k3s/data/current/bin
 
# Test DNS
kubectl run test --rm -it --image=busybox -- nslookup kubernetes.default
 
# Restart CoreDNS
kubectl rollout restart deployment/coredns -n kube-system

Storage Issues

# Check storage provisioner
kubectl get pods -n kube-system -l app=local-path-provisioner
kubectl logs -n kube-system -l app=local-path-provisioner
 
# Verify storage directory
ls -la /var/lib/rancher/k3s/storage

See INSTALLATION.md for the comprehensive troubleshooting guide.

Uninstallation

Standard Mode

# Via raibid-cli (future)
raibid-cli teardown k3s
 
# Manual
sudo /usr/local/bin/k3s-uninstall.sh

Rootless Mode

# Via raibid-cli (future)
raibid-cli teardown k3s
 
# Manual
k3s-rootless-uninstall.sh

Upgrading

Via Automated Script

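# Set the desired version and run the install script (handles the upgrade).
# The variable is passed on the sudo command line because sudo does not
# preserve exported variables by default.
sudo K3S_VERSION=v1.29.0+k3s1 ./install.sh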

Manual Upgrade

# Stop k3s
sudo systemctl stop k3s
 
# Download new version (the upstream installer reads INSTALL_K3S_VERSION)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.29.0+k3s1 sh -
 
# Restart k3s
sudo systemctl start k3s
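
Before upgrading, it can help to confirm that the target version is actually newer than the installed one. The sketch below compares two k3s version strings using version-aware sorting; `needs_upgrade` is a hypothetical helper and relies on `sort -V` being available (it is in GNU coreutils).

```shell
#!/bin/sh
# Succeeds when target version $2 sorts strictly after current version $1.
needs_upgrade() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$2" ]
}

if needs_upgrade v1.28.5+k3s1 v1.29.0+k3s1; then
    echo "upgrade available"
fi

# On a live node the current version could be read from the binary. This
# assumes the first line of `k3s --version` has the form
# "k3s version <version> (<commit>)":
# current=$(k3s --version | awk 'NR==1 {print $3}')
```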

Monitoring

Cluster Metrics

# Node metrics
kubectl top nodes
 
# Pod metrics
kubectl top pods -A
 
# Describe node for detailed info
kubectl describe node

Health Checks

# API server health
kubectl get --raw='/livez?verbose'
 
# Component status (deprecated since Kubernetes v1.19; prefer the /livez and /readyz endpoints)
kubectl get componentstatus
 
# Events
kubectl get events -A --sort-by='.lastTimestamp'
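
The verbose `/livez` output lists one check per line, prefixed with `[+]` for passing and `[-]` for failing checks. A small filter, sketched here with the hypothetical name `failed_checks`, can reduce that output to the names of the failing checks:

```shell
#!/bin/sh
# Reads verbose health output on stdin and prints the name of each failing
# check (lines that start with "[-]").
failed_checks() {
    sed -n 's/^\[-]\([^ ]*\).*/\1/p'
}

# Sample input that imitates the endpoint's output format.
printf '%s\n' '[+]ping ok' '[-]etcd failed: reason withheld' 'livez check failed' |
    failed_checks

# Against a live cluster:
# kubectl get --raw='/livez?verbose' | failed_checks
```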

Architecture

Deployment Model

DGX Spark (ARM64)
├─ System Layer (4 cores, 16GB)
│  └─ Ubuntu 22.04 LTS
├─ k3s Layer (2 cores, 8GB)
│  ├─ API Server
│  ├─ Controller Manager
│  ├─ Scheduler
│  ├─ CoreDNS
│  ├─ Flannel CNI
│  └─ Local Path Provisioner
├─ Infrastructure Layer (6-8 cores, 32-40GB)
│  ├─ Gitea (namespace: raibid-infrastructure)
│  ├─ Redis (namespace: raibid-infrastructure)
│  ├─ KEDA (namespace: keda)
│  └─ Flux (namespace: flux-system)
├─ CI Layer (10 cores, 80GB)
│  └─ CI Agents (namespace: raibid-ci)
└─ Monitoring Layer (2-4 cores, 8-16GB)
   └─ Observability Stack (namespace: raibid-monitoring)

Network Architecture

  • Cluster CIDR: 10.42.0.0/16
  • Service CIDR: 10.43.0.0/16
  • Flannel Backend: VXLAN
  • DNS: CoreDNS (10.43.0.10)
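
When debugging connectivity, it is sometimes useful to tell whether an address belongs to the pod range or the service range listed above. The following sketch checks CIDR membership with plain shell arithmetic; `ip_to_int` and `ip_in_cidr` are hypothetical helpers and handle IPv4 only.

```shell
#!/bin/sh
# Converts a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeeds when address $1 lies inside CIDR block $2 (for example 10.42.0.0/16).
ip_in_cidr() {
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

ip_in_cidr 10.42.3.7 10.42.0.0/16 && echo "10.42.3.7 is in the cluster (pod) CIDR"
ip_in_cidr 10.43.0.10 10.43.0.0/16 && echo "10.43.0.10 is in the service CIDR"
```

The address 10.42.3.7 is an arbitrary example of a pod IP; 10.43.0.10 is the CoreDNS service address given in the list.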

Storage Architecture

  • Provisioner: local-path (Rancher)
  • Storage Path: /var/lib/rancher/k3s/storage
  • Reclaim Policy: Delete
  • Volume Binding: WaitForFirstConsumer

Best Practices

Resource Management

  • Always set resource requests and limits
  • Use resource quotas to prevent resource exhaustion
  • Monitor resource usage regularly

Security

  • Enable secrets encryption (already configured)
  • Use RBAC for access control
  • Apply network policies for isolation
  • Rotate credentials regularly

High Availability

  • For production, consider a multi-node setup
  • Back up the cluster datastore regularly (SQLite by default on a single-server k3s; etcd only when embedded etcd is enabled)
  • Monitor cluster health proactively

Performance

  • Use overlayfs snapshotter for better I/O
  • Configure appropriate resource reservations
  • Enable metrics server for autoscaling
  • Tune garbage collection thresholds
