k3s Installation Runbook
Complete guide for installing and configuring k3s on DGX Spark.
Table of Contents
- Prerequisites
- Pre-Installation Checklist
- Installation Methods
- Post-Installation Verification
- Troubleshooting
- Rollback Procedure
Prerequisites
Hardware Requirements
- Platform: NVIDIA DGX Spark
- Architecture: ARM64 (aarch64)
- CPU: 20 cores (10x Cortex-X925, 10x Cortex-A725)
- Memory: 128GB LPDDR5x
- Storage: 20GB+ available in /var/lib
Software Requirements
- OS: Ubuntu 22.04 LTS
- Kernel: 5.15+
- User: Non-root user with sudo privileges
- Network: Internet connectivity for downloading k3s
Optional (for Rootless Mode)
- slirp4netns
- fuse-overlayfs
- uidmap
- User subordinate UID/GID mappings
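The rootless prerequisites listed above can be checked before running the installer. The following is a minimal sketch, not part of the provided scripts: the helper names has_subid_range and check_rootless_prereqs are hypothetical, and it assumes the uidmap package is detected through its newuidmap and newgidmap binaries.

```shell
# Sketch only: both helpers are hypothetical and not part of install.sh.

# Return 0 when file $2 contains a subordinate ID range for user $1.
has_subid_range() {
  grep -q "^$1:" "$2" 2>/dev/null
}

# Report missing rootless dependencies for user $1 (default: current user).
check_rootless_prereqs() {
  local user missing bin file
  user="${1:-$(id -un)}"
  missing=0
  # newuidmap and newgidmap are the binaries shipped by the uidmap package.
  for bin in slirp4netns fuse-overlayfs newuidmap newgidmap; do
    command -v "$bin" >/dev/null 2>&1 || { echo "missing binary: $bin"; missing=1; }
  done
  for file in /etc/subuid /etc/subgid; do
    has_subid_range "$user" "$file" || { echo "no range for $user in $file"; missing=1; }
  done
  return "$missing"
}
```

Running check_rootless_prereqs raibid-agent prints one line per missing item and returns non-zero if anything is absent.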
Pre-Installation Checklist
Before running the installation script, verify:
- System architecture is ARM64: uname -m shows aarch64
- At least 4GB RAM available: free -h
- At least 20GB disk space in /var/lib: df -h /var/lib
- User has sudo privileges: sudo -v
- No existing k3s installation (or planned upgrade)
- Firewall allows required ports (see below)
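The checklist can also be run as a single preflight function. This is a sketch under stated assumptions: the function names are hypothetical, the 4GB and 20GB thresholds are treated as GiB, and available memory is read from MemAvailable in /proc/meminfo (the value behind the "available" column of free -h).

```shell
# Sketch only: hypothetical helpers mirroring the checklist thresholds.

# Return 0 when the given machine string is ARM64.
is_arm64() {
  [ "$1" = "aarch64" ]
}

# Return 0 when $1 (a KiB count) is at least $2 GiB.
kib_at_least_gib() {
  [ "$1" -ge $(( $2 * 1024 * 1024 )) ]
}

# Run all checks against the local machine; returns non-zero on any failure.
preflight() {
  local failed mem_kib disk_kib
  failed=0
  is_arm64 "$(uname -m)" || { echo "FAIL: architecture is $(uname -m), expected aarch64"; failed=1; }
  mem_kib=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  kib_at_least_gib "${mem_kib:-0}" 4 || { echo "FAIL: less than 4GB RAM available"; failed=1; }
  disk_kib=$(df -Pk /var/lib | awk 'NR==2 {print $4}')
  kib_at_least_gib "${disk_kib:-0}" 20 || { echo "FAIL: less than 20GB free in /var/lib"; failed=1; }
  if command -v k3s >/dev/null 2>&1; then
    echo "WARN: existing k3s binary found at $(command -v k3s)"
  fi
  return "$failed"
}
```

Calling preflight before the installer prints one FAIL line per unmet requirement. The sudo and firewall items still need to be checked by hand.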
Required Ports
| Port | Protocol | Purpose | Direction |
|---|---|---|---|
| 6443 | TCP | Kubernetes API | Inbound |
| 10250 | TCP | Kubelet metrics | Inbound |
| 8472 | UDP | Flannel VXLAN | Bidirectional |
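If ufw is the active firewall (an assumption; the troubleshooting section checks it with sudo ufw status), the table above translates into one allow rule per port. The sketch below only prints the rules so they can be reviewed first; ufw_rules is a hypothetical helper name.

```shell
# Sketch only: print one ufw rule per row of the Required Ports table.
ufw_rules() {
  while read -r port proto; do
    echo "ufw allow ${port}/${proto}"
  done <<'EOF'
6443 tcp
10250 tcp
8472 udp
EOF
}
```

After reviewing the output of ufw_rules, each printed line can be run with sudo. On a single-node cluster the Flannel VXLAN rule is normally only relevant once additional nodes join.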
Installation Methods
Method 1: Automated Installation (Recommended)
Use the provided installation script for a fully automated setup.
Standard Mode (Requires Root)
# Navigate to k3s directory
cd /home/beengud/raibid-labs/raibid-ci/infra/k3s
# Run installation script
sudo ./install.sh
What it does:
- Checks system architecture and requirements
- Downloads k3s v1.28.4+k3s1 for ARM64
- Verifies checksum for security
- Installs k3s binary to /usr/local/bin
- Configures k3s with DGX Spark optimizations
- Creates namespaces, storage, and resource quotas
- Configures CoreDNS customizations
- Validates installation
Duration: ~5 minutes
Rootless Mode (No Root Required)
# Navigate to k3s directory
cd /home/beengud/raibid-labs/raibid-ci/infra/k3s
# Run installation script in rootless mode
./install.sh --rootless
What it does:
- Checks rootless prerequisites
- Installs rootless dependencies if needed
- Configures subordinate UID/GID mappings
- Installs k3s in rootless mode for user raibid-agent
- Configures user-level kubeconfig
- Applies manifests
Duration: ~7 minutes (includes dependency installation)
Note: Rootless mode has some limitations:
- No LoadBalancer service type
- No HostPort access
- No privileged containers
- Slower networking (uses slirp4netns)
Method 2: Manual Installation
For advanced users who need fine-grained control.
Step 1: Download and Verify k3s
# Set version
K3S_VERSION=v1.28.4+k3s1
# Download k3s binary
curl -sfL "https://github.com/k3s-io/k3s/releases/download/${K3S_VERSION}/k3s-arm64" \
-o /tmp/k3s
# Download checksum
curl -sfL "https://github.com/k3s-io/k3s/releases/download/${K3S_VERSION}/sha256sum-arm64.txt" \
-o /tmp/k3s-checksum.txt
# Verify checksum
expected=$(grep "k3s-arm64" /tmp/k3s-checksum.txt | awk '{print $1}')
actual=$(sha256sum /tmp/k3s | awk '{print $1}')
if [ "$expected" = "$actual" ]; then
  echo "Checksum verified"
else
  echo "Checksum mismatch!"
  exit 1
fi
# Install binary
sudo install -o root -g root -m 0755 /tmp/k3s /usr/local/bin/k3s
sudo ln -sf /usr/local/bin/k3s /usr/local/bin/kubectl
Step 2: Configure k3s
# Create config directory
sudo mkdir -p /etc/rancher/k3s
# Copy configuration files
sudo cp config.yaml /etc/rancher/k3s/config.yaml
sudo cp registries.yaml /etc/rancher/k3s/registries.yaml
Step 3: Install k3s Service
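# Run the k3s installer to create and start the systemd service.
# INSTALL_K3S_SKIP_DOWNLOAD=true reuses the binary verified in Step 1
# instead of downloading a new one.
curl -sfL https://get.k3s.io | INSTALL_K3S_SKIP_DOWNLOAD=true sh -s - --config=/etc/rancher/k3s/config.yaml
# Check that the node has registered (repeat until STATUS shows Ready)
sudo k3s kubectl get nodes
Step 4: Set Up Kubeconfig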
# Create .kube directory
mkdir -p ~/.kube
# Copy kubeconfig
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
chmod 600 ~/.kube/config
# Test kubectl
kubectl cluster-info
Step 5: Apply Manifests
# Create namespaces
kubectl apply -f namespaces.yaml
# Configure storage
kubectl apply -f storageclass.yaml
# Apply resource quotas
kubectl apply -f resource-quotas.yaml
# Customize CoreDNS
kubectl apply -f coredns-custom.yaml
kubectl rollout restart deployment/coredns -n kube-system
Method 3: Via raibid-cli (Future)
Once the raibid-cli tool is fully implemented:
# One-command installation
raibid-cli setup k3s
# Or with options
raibid-cli setup k3s --rootless --version=v1.28.4+k3s1
Post-Installation Verification
After installation, run the validation script:
# Navigate to k3s directory
cd /home/beengud/raibid-labs/raibid-ci/infra/k3s
# Run validation tests
./validate-installation.sh
Expected Output
==================================
k3s Installation Validation
==================================
Testing: k3s binary exists... PASS
Testing: k3s service is active... PASS
Testing: kubectl command available... PASS
Testing: kubectl cluster communication... PASS
Testing: Node is Ready... PASS
Testing: Node has raibid-ci label... PASS
Checking namespaces...
Testing: Namespace kube-system exists... PASS
Testing: Namespace raibid-ci exists... PASS
Testing: Namespace raibid-infrastructure exists... PASS
Testing: Namespace raibid-monitoring exists... PASS
Checking system pods...
Testing: CoreDNS is running... PASS
Testing: Metrics server is running... PASS
Testing: Local storage class exists... PASS
Testing storage provisioning...
Testing: PVC creation and binding... PASS
Checking networking...
Testing: CNI plugins exist... PASS
Testing: DNS resolution... PASS
Testing: kubeconfig is readable... PASS
Checking resource configuration...
Testing: Max pods configuration... PASS
Checking platform...
Testing: k3s is ARM64 binary... PASS
==================================
Validation Summary
==================================
Total tests: 19
Passed tests: 19
Failed tests: 0
All validation tests passed!
k3s cluster is ready for use.
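When the validation script is run from automation, its output can be reduced to a short summary. The sketch below assumes failing checks end in FAIL in the same position where passing checks end in PASS, which the sample output above does not confirm; summarize_validation is a hypothetical helper name.

```shell
# Sketch only: read validation output on stdin, list failing checks,
# print a one-line summary, and return non-zero if any check failed.
summarize_validation() {
  awk '
    /^Testing: .*PASS$/ { pass++ }
    /^Testing: .*FAIL$/ { fail++; print "FAILED: " $0 }
    END { printf "passed=%d failed=%d\n", pass, fail; exit (fail > 0) }
  '
}
```

Typical use would be ./validate-installation.sh | summarize_validation, so a pipeline step fails as soon as a check fails.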
Manual Verification Commands
# Check cluster info
kubectl cluster-info
# Check node status
kubectl get nodes -o wide
# Check all namespaces
kubectl get namespaces
# Check system pods
kubectl get pods -A
# Check storage classes
kubectl get storageclass
# Check resource quotas
kubectl get resourcequota -A
# Check limit ranges
kubectl get limitrange -A
# Test PVC creation
cat <<EOF | kubectl apply -f -
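apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
EOF
# Check PVC status. If the storage class uses WaitForFirstConsumer binding
# (the default for k3s local-path), the PVC stays Pending until a pod mounts it.
kubectl get pvc test-pvc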
# Cleanup
kubectl delete pvc test-pvc
Troubleshooting
Installation Fails
Problem: Installation script exits with error.
Solution:
- Check system logs:
sudo journalctl -u k3s -n 50
- Verify architecture:
uname -m # Should show aarch64
- Check available resources:
free -h
df -h /var/lib
- Ensure no firewall blocking:
sudo ufw status
k3s Service Won’t Start
Problem: k3s service fails to start.
Solution:
- Check service status:
sudo systemctl status k3s
- View detailed logs:
sudo journalctl -u k3s -f
- Check configuration:
sudo cat /etc/rancher/k3s/config.yaml
- Restart service:
sudo systemctl restart k3s
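After a restart the API server and node can take a while to come back, so a single status check may fail even when the service is healthy. The following generic retry helper is a sketch; the name retry and its argument order are hypothetical.

```shell
# Sketch only: retry ATTEMPTS DELAY_SECONDS COMMAND...
# Reruns COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts delay i
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}
```

For example, retry 30 5 kubectl wait --for=condition=Ready node --all --timeout=10s polls for up to a few minutes until the node reports Ready.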
kubectl Commands Fail
Problem: kubectl cannot communicate with cluster.
Solution:
- Check kubeconfig:
ls -la ~/.kube/config
cat ~/.kube/config
- Verify k3s is running:
sudo systemctl status k3s
- Check API server (any HTTP response, even an authorization error, shows it is listening):
curl -k "https://localhost:6443/livez?verbose"
- Recreate kubeconfig:
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config
sudo chown $(id -u):$(id -g) ~/.kube/config
chmod 600 ~/.kube/config
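A common cause of kubectl failures is a kubeconfig that points at the wrong API server address. The sketch below extracts the first server entry from a kubeconfig; kubeconfig_server is a hypothetical helper, and the simple awk match assumes the usual one-key-per-line YAML layout that k3s writes.

```shell
# Sketch only: print the first "server:" URL found in a kubeconfig file.
# Defaults to ~/.kube/config when no path is given.
kubeconfig_server() {
  awk '$1 == "server:" { print $2; exit }' "${1:-$HOME/.kube/config}"
}
```

Comparing kubeconfig_server with kubeconfig_server /etc/rancher/k3s/k3s.yaml (the latter run with sudo if the file is not readable) shows whether the user copy has drifted from the one k3s generated.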
Pods Not Starting
Problem: Pods stuck in Pending or CrashLoopBackOff.
Solution:
- Describe the pod:
kubectl describe pod <pod-name> -n <namespace>
- Check events:
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
- Check resource availability:
kubectl top nodes
kubectl describe node
- Check logs:
kubectl logs <pod-name> -n <namespace>
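To find which pods need the steps above, the pod listing can be filtered down to those that are neither Running nor Completed. This is a sketch: unhealthy_pods is a hypothetical helper, and it assumes the column order of kubectl get pods -A --no-headers (namespace, name, ready, status).

```shell
# Sketch only: filter "kubectl get pods -A --no-headers" output on stdin
# and print namespace/name and status for pods that are not healthy.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

Typical use would be kubectl get pods -A --no-headers | unhealthy_pods, followed by kubectl describe pod on each entry it prints.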
Storage Issues
Problem: PVCs not binding.
Solution:
- Check storage class:
kubectl get storageclass
- Check local-path provisioner:
kubectl get pods -n kube-system -l app=local-path-provisioner
kubectl logs -n kube-system -l app=local-path-provisioner
- Verify storage directory:
ls -la /var/lib/rancher/k3s/storage
- Check PVC status:
kubectl describe pvc <pvc-name>
DNS Not Working
Problem: Pods cannot resolve DNS names.
Solution:
- Check CoreDNS:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
- Test DNS from a pod (busybox 1.28 is pinned because nslookup in newer busybox images is known to give misleading results):
kubectl run test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default
- Check CoreDNS config:
kubectl get configmap coredns -n kube-system -o yaml
- Restart CoreDNS:
kubectl rollout restart deployment/coredns -n kube-system
Rollback Procedure
If installation fails and you need to start over:
Standard Mode
# Stop k3s service
sudo systemctl stop k3s
# Uninstall k3s
sudo /usr/local/bin/k3s-uninstall.sh
# Remove configuration
sudo rm -rf /etc/rancher/k3s
# Remove data
sudo rm -rf /var/lib/rancher/k3s
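# Remove the k3s kubeconfig (other files in ~/.kube are left untouched)
rm -f ~/.kube/config
# Verify cleanup (pgrep does not match itself, unlike ps aux | grep)
pgrep -af k3s || echo "No k3s processes running"
Rootless Mode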
# Stop k3s service
systemctl --user stop k3s-rootless
# Uninstall k3s
k3s-rootless-uninstall.sh
# Remove configuration
rm -rf ~/.config/k3s
# Remove data
rm -rf ~/.local/share/k3s
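# Remove the k3s kubeconfig (other files in ~/.kube are left untouched)
rm -f ~/.kube/config
# Verify cleanup (pgrep does not match itself, unlike ps aux | grep)
pgrep -af k3s || echo "No k3s processes running"
After Rollback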
- Review errors from previous installation
- Fix any issues (resources, configuration, etc.)
- Run installation script again
Resource Allocation Summary
After successful installation, the DGX Spark resources are allocated as follows:
| Component | CPU Reservation | Memory Reservation |
|---|---|---|
| System | 4 cores | 16GB |
| k3s (Kubernetes) | 2 cores | 8GB |
| Infrastructure | 6 cores (quota) | 32GB (quota) |
| CI Agents | 10 cores (quota) | 80GB (quota) |
| Monitoring | 2 cores (quota) | 8GB (quota) |
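Total Reserved: 6 cores / 24GB (System plus k3s). The namespace quotas are upper limits rather than reservations; together they add up to 18 cores / 120GB, which is more than the capacity left for workloads, so not every namespace can reach its quota at the same time.
Available for Workloads: ~14 cores / ~104GB (after system reservations)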
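The figures above can be reproduced with shell arithmetic, using the hardware totals from the Prerequisites section and the rows of the table. The variable names are illustrative only.

```shell
# Hardware totals (from Hardware Requirements)
total_cores=20
total_mem_gb=128

# Reservations (System and k3s rows of the table)
system_cores=4; system_mem_gb=16
k3s_cores=2;    k3s_mem_gb=8

# Capacity left for workloads after reservations
avail_cores=$(( total_cores - system_cores - k3s_cores ))
avail_mem_gb=$(( total_mem_gb - system_mem_gb - k3s_mem_gb ))

# Sum of the namespace quotas (Infrastructure + CI Agents + Monitoring)
quota_cores=$(( 6 + 10 + 2 ))
quota_mem_gb=$(( 32 + 80 + 8 ))

echo "available: ${avail_cores} cores / ${avail_mem_gb}GB"
echo "sum of quotas: ${quota_cores} cores / ${quota_mem_gb}GB"
```

This prints 14 cores / 104GB available against quotas totalling 18 cores / 120GB, which shows that the quotas are oversubscribed relative to the remaining capacity.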
Next Steps
After successful k3s installation:
- Deploy Redis (Job Queue)
cd ../redis
raibid-cli setup redis
- Deploy Gitea (Git Server + OCI Registry)
cd ../gitea
raibid-cli setup gitea
- Deploy KEDA (Autoscaling)
cd ../keda
raibid-cli setup keda
- Deploy Flux (GitOps)
cd ../flux
raibid-cli setup flux