Infrastructure Error Handling

This directory contains comprehensive error handling and recovery mechanisms for infrastructure operations.

Modules

`error.rs`

Defines comprehensive error types with detailed context and actionable suggestions.

Key Types:

InfraError: Main error enum with variants for all failure scenarios
InstallPhase: Installation phase tracking for detailed error reporting
HelmOperation: Helm operation types
ValidationError: Validation error details

Features:

Detailed error messages with context
Actionable recovery suggestions
Transient vs fatal error classification
Automatic retry delay calculation

`retry.rs`

Implements retry logic with exponential backoff for transient failures.

Key Types:

RetryConfig: Configurable retry behavior
retry_with_backoff: Synchronous retry function
retry_with_backoff_async: Asynchronous retry function
poll_until: Poll for condition with timeout
poll_until_async: Async polling

Features:

Exponential backoff with jitter
Configurable max attempts and delays
Automatic transient error detection
Quick, slow, and custom configurations

`preflight.rs`

Provides pre-flight validation checks before installation.

Key Types:

SystemRequirements: System prerequisites definition
PreFlightValidator: Validation executor
PreFlightResult: Validation results

Checks:

Disk space availability
Memory availability
Required commands in PATH
Optional commands
Required directories
Network connectivity

Predefined Requirements:

k3s_requirements()
gitea_requirements()
redis_requirements()
keda_requirements()
flux_requirements()

`rollback.rs`

Implements transaction-like rollback for infrastructure changes.

Key Types:

RollbackManager: Manages rollback actions
RollbackContext: Tracks installed resources
RollbackAction: Cleanup function type

Features:

Automatic rollback on failure (via Drop)
LIFO action execution
Resource tracking (files, directories, K8s resources, Helm releases)
Partial cleanup reporting
Commit to disable rollback on success

`healthcheck.rs`

Provides health checking with timeout support.

Key Types:

HealthStatus: Health status enum
HealthCheckResult: Health check results
K3sHealthChecker: K3s cluster health checker
HelmHealthChecker: Helm release health checker

Features:

Multiple check types
Configurable timeouts
Wait-until-healthy functionality
Detailed check results

Usage Examples

Basic Error Handling

use raibid_cli::infrastructure::{InfraError, InfraResult, InstallPhase};
 
fn install_component() -> InfraResult<()> {
    // Return detailed errors
    Err(InfraError::installation(
        "k3s",
        InstallPhase::Download,
        "Failed to download binary",
    ))
}

Retry Logic

use raibid_cli::infrastructure::{RetryConfig, retry_with_backoff};
 
let config = RetryConfig::quick();
let result = retry_with_backoff(&config, "download", || {
    download_file()
})?;

Pre-flight Validation

use raibid_cli::infrastructure::{PreFlightValidator, k3s_requirements};
 
let validator = PreFlightValidator::new(k3s_requirements());
validator.validate("k3s")?;

Rollback

use raibid_cli::infrastructure::RollbackManager;
 
let mut rollback = RollbackManager::new("k3s");
 
// Add cleanup actions
rollback.add_action("remove binary", Box::new(|| {
    std::fs::remove_file("/usr/local/bin/k3s")?;
    Ok(())
}));
 
// On success
rollback.commit();
 
// On failure, rollback is automatic

Health Checks

use raibid_cli::infrastructure::K3sHealthChecker;
use std::time::Duration;
 
let checker = K3sHealthChecker::new("/path/to/kubeconfig")
    .with_timeout(Duration::from_secs(300));
 
checker.wait_until_healthy()?;

Testing

Run the comprehensive error handling tests:

cargo test --test error_handling_test

Tests cover:

Error type creation and formatting
Retry logic with various scenarios
Pre-flight validation
Rollback execution
Health check evaluation
Integration workflows

Documentation

See error-recovery.md for:

Detailed error type documentation
Troubleshooting guide
Best practices
Recovery procedures
Complete examples

Design Principles

Fail Fast with Context: Provide immediate, actionable feedback
Automatic Recovery: Retry transient failures automatically
Clean Failures: Always clean up on failure
Observable Operations: Comprehensive logging and status reporting
Type Safety: Leverage Rust’s type system for correctness

Integration

All infrastructure installers support these error handling features:

k3s: Pre-flight checks, retry downloads, automatic rollback
Gitea: Helm operation retries, namespace cleanup
Redis: Health checks, connection validation
KEDA: CRD validation, operator health checks
Flux: Binary verification, bootstrap retries

Raibid Labs Documentation

Explorer

ERROR_HANDLING

Infrastructure Error Handling

Modules

`error.rs`

`retry.rs`

`preflight.rs`

`rollback.rs`

`healthcheck.rs`

Usage Examples

Basic Error Handling

Retry Logic

Pre-flight Validation

Rollback

Health Checks

Testing

Documentation

Design Principles

Integration

Graph View

Table of Contents