DGX-Pixels Orchestration and Parallelization Patterns - Comprehensive Analysis
Executive Summary
DGX-Pixels demonstrates a mature, enterprise-scale orchestration pattern combining:
- Multi-tier orchestration (Meta Orchestrator → Domain Orchestrators → Workstreams)
- Rust + Nushell + Justfile automation stack
- Parallel workstream coordination with dependency management
- Docker Compose microservices with GPU integration
- ZeroMQ IPC for inter-process communication
- Agent-based development workflow with GitHub integration
This document provides the complete blueprint for replicating these patterns in Sparky.
1. ORCHESTRATOR ARCHITECTURE
1.1 Multi-Tier Orchestration Model
DGX-Pixels uses a hierarchical orchestration pattern:
┌─────────────────────────────────────────┐
│ Meta Orchestrator (M0-M5) │
│ ─ Spawns domain orchestrators │
│ ─ Manages dependencies │
│ ─ Handles phase gates │
└──────┬──────────────────────────────────┘
│
┌───┴───┬──────────┬─────────────────┐
│ │ │ │
▼ ▼ ▼ ▼
┌──────┐┌──────┐┌──────────┐┌──────────┐
│Found-│Model │Interface │Integration
│ation │Orch │Orch │Orch
│Orch │ │ │
└──┬───┘└──┬───┘└────┬─────┘└────┬─────┘
│ │ │ │
┌──┴─┬────┴┬────────┴─┬──────────┴────┐
│WS-1│WS-2│WS-3 WS-4 WS-5... WS-18│
└────┴────┴─────────────────────────────┘
Key Characteristics:
- Foundation Orchestrator (M0)
  - Owns: WS-01, WS-02, WS-03
  - Duration: Weeks 1-2
  - Purpose: Hardware baselines, reproducibility, benchmarking
  - Blocks: ALL other phases
  - Pattern: Sequential execution (gates dependencies)
- Model Orchestrator (M1, M3)
  - Owns: WS-04, WS-05, WS-06, WS-07
  - Duration: Weeks 3-5
  - Purpose: ComfyUI, SDXL optimization, LoRA training
  - Blocks: Interface Orchestrator (needs ComfyUI working)
  - Pattern: Parallel where possible
- Interface Orchestrator (M2)
  - Owns: WS-08, WS-09, WS-10, WS-11, WS-12
  - Duration: Weeks 3-6
  - Purpose: Rust TUI, ZeroMQ backend, Sixel preview
  - Blocks: Integration Orchestrator (needs TUI + backend)
  - Pattern: Mixed sequential/parallel
- Integration Orchestrator (M4, M5)
  - Owns: WS-13, WS-14, WS-15, WS-16, WS-17, WS-18
  - Duration: Weeks 7-12
  - Purpose: Bevy MCP, observability, deployment
  - Blocks: Nothing (final phase)
  - Pattern: Sequential then parallel
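The ownership split above can be checked mechanically. The following is a minimal sketch, assuming the orchestrator keys and the helper names (`ownership_problems`, `owner_of`) are hypothetical; only the workstream assignments come from the list above.

```python
from collections import Counter

# Ownership as listed above (orchestrator -> workstream IDs).
# The dictionary keys are illustrative names, not identifiers from the project.
OWNERSHIP = {
    "foundation": ["WS-01", "WS-02", "WS-03"],
    "model": ["WS-04", "WS-05", "WS-06", "WS-07"],
    "interface": ["WS-08", "WS-09", "WS-10", "WS-11", "WS-12"],
    "integration": ["WS-13", "WS-14", "WS-15", "WS-16", "WS-17", "WS-18"],
}


def ownership_problems(ownership, total=18):
    """Return (unowned, multiply_owned) IDs for WS-01..WS-<total>."""
    expected = [f"WS-{n:02d}" for n in range(1, total + 1)]
    counts = Counter(ws for owned in ownership.values() for ws in owned)
    unowned = [ws for ws in expected if counts[ws] == 0]
    duplicated = sorted(ws for ws, n in counts.items() if n > 1)
    return unowned, duplicated


def owner_of(ownership, ws_id):
    """Return the orchestrator that owns ws_id, or None if nobody does."""
    for orchestrator, owned in ownership.items():
        if ws_id in owned:
            return orchestrator
    return None


unowned, duplicated = ownership_problems(OWNERSHIP)
print("unowned:", unowned, "duplicated:", duplicated)
print("WS-10 is owned by:", owner_of(OWNERSHIP, "WS-10"))
```

A check like this is cheap to run whenever a workstream is added or moved, so that no workstream ends up without an owner or with two.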
1.2 Orchestration Spawning Protocol
Location: docs/orchestration/meta-orchestrator.md
Phase Gates Control Progress:
Gate 1: Foundation → Model/Interface (End of Week 2)
✓ Hardware verification complete
✓ Baseline measurements recorded
✓ Reproducibility framework working
✓ Benchmark suite running
Gate 2: Model/Interface → Integration (End of Week 6)
✓ ComfyUI generating images (M1)
✓ Rust TUI functional with preview (M2)
✓ Python backend operational (M2)
✓ LoRA training pipeline working (M3)
Gate 3: Integration → Production (End of Week 11)
✓ Bevy MCP integration complete (M4)
✓ Asset deployment pipeline working (M4)
✓ Example game using generated sprites (M4)
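A phase gate is, in effect, a checklist that must be fully ticked before the next phase is spawned. The sketch below models that idea; the `PhaseGate` class and its methods are hypothetical, and the criterion values are example data rather than real project state.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseGate:
    """A named gate that opens only when every criterion is satisfied."""
    name: str
    criteria: dict = field(default_factory=dict)  # criterion text -> met?

    def missing(self):
        """Criteria that are still unmet, in declaration order."""
        return [text for text, met in self.criteria.items() if not met]

    def is_open(self):
        """True only when no criterion is outstanding."""
        return not self.missing()


gate1 = PhaseGate(
    "Gate 1: Foundation -> Model/Interface",
    {
        "Hardware verification complete": True,
        "Baseline measurements recorded": True,
        "Reproducibility framework working": True,
        "Benchmark suite running": False,  # example value only
    },
)

if not gate1.is_open():
    print(f"{gate1.name} is closed; waiting on: {', '.join(gate1.missing())}")
```

Keeping the unmet criteria as data, rather than a single boolean, lets the orchestrator report exactly what is holding a gate closed.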
Spawning Commands Pattern:
# Phase 1: Foundation Only (Week 1)
claude-flow spawn orchestrator foundation \
  --workstreams WS-01,WS-02,WS-03 \
  --phase sequential

# Phase 2: Models + Interface (Week 3, after Gate 1)
claude-flow spawn orchestrator models \
  --workstreams WS-04,WS-05,WS-06,WS-07 \
  --phase parallel \
  --depends-on foundation
claude-flow spawn orchestrator interface \
  --workstreams WS-08,WS-09,WS-10,WS-11,WS-12 \
  --phase parallel \
  --depends-on WS-04

# Phase 3: Integration (Week 7, after Gate 2)
claude-flow spawn orchestrator integration \
  --workstreams WS-13,WS-14,WS-15,WS-16,WS-17,WS-18 \
  --phase sequential-then-parallel \
  --depends-on interface,models

2. PARALLEL WORKSTREAM COORDINATION
2.1 Workstream Structure
Each workstream follows a standardized format:
Location: docs/orchestration/workstreams/
Structure:
ws01-hardware-baselines/
├── README.md # Workstream specification
├── COMPLETION_SUMMARY.md # Agent completion report
└── (sub-directories for domain-specific docs)
README.md Contains:
- Objective and deliverables
- Acceptance criteria (must-haves)
- Technical requirements
- Dependencies (blocks/unblocks)
- Estimated LOC
- Related issues/references
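Because every workstream README is expected to follow the same format, a small lint can flag specs that are missing a section. This is a sketch under assumptions: the required labels are the ones visible in the WS-01 example in this section, and `missing_sections` is a hypothetical helper, not part of the project.

```python
# Section labels taken from the WS-01 example; adjust to the template in use.
REQUIRED_SECTIONS = ("Objective", "Deliverables", "Acceptance Criteria", "Dependencies")


def missing_sections(readme_text, required=REQUIRED_SECTIONS):
    """Return the required labels that no line of the README starts with."""
    # Strip whitespace and any leading markdown '#' so headings also match.
    lines = [line.strip().lstrip("#").strip() for line in readme_text.splitlines()]
    return [
        label
        for label in required
        if not any(line.lower().startswith(label.lower()) for line in lines)
    ]


# Shortened sample spec (illustrative, with the Dependencies section left out).
sample = """\
# WS-01: Hardware Baselines
Objective: Document verified hardware specs
Deliverables:
1. repro/hardware_verification.sh
Acceptance Criteria:
- Verification script exits 0 on success
"""

print(missing_sections(sample))
```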
Example: WS-01
# WS-01: Hardware Baselines
Owner: Foundation Orchestrator
Agent Type: devops-automator
Duration: 3-4 days
Priority: P0 (critical path)
Objective: Document verified DGX-Spark GB10 hardware specs
Deliverables:
1. repro/hardware_verification.sh - Automated detection
2. bench/baselines/hardware_baseline.json - Recorded metrics
3. Updated docs/hardware.md with measurements
4. Topology diagrams
Acceptance Criteria:
✓ Script captures: GPU model, VRAM, CUDA, driver, CPU, RAM
✓ Baseline JSON recorded in bench/baselines/
✓ Docs match actual hardware (GB10, 128GB unified, ARM)
✓ Verification script exits 0 on success
Dependencies: None (blocks all other phases)

2.2 Workstream Plan Matrix
Total Workstreams: 18 across 12 weeks
| ID | Name | Orch | Milestone | Duration | Dependencies |
|---|---|---|---|---|---|
| WS-01 | Hardware Baselines | Foundation | M0 | 3-4d | None |
| WS-02 | Reproducibility | Foundation | M0 | 4-5d | WS-01 |
| WS-03 | Benchmark Suite | Foundation | M0 | 3-4d | WS-01 |
| WS-04 | ComfyUI Setup | Model | M1 | 4-5d | WS-01 |
| WS-05 | SDXL Optimization | Model | M1 | 5-7d | WS-04 |
| WS-06 | LoRA Training | Model | M3 | 7-10d | WS-05 |
| WS-07 | Dataset Tools | Model | M3 | 5-6d | WS-05 |
| WS-08 | Rust TUI Core | Interface | M2 | 6-8d | WS-01 |
| WS-09 | ZeroMQ IPC | Interface | M2 | 4-5d | WS-08 |
| WS-10 | Python Backend | Interface | M2 | 5-6d | WS-04, WS-09 |
| WS-11 | Sixel Preview | Interface | M2 | 3-4d | WS-08, WS-10 |
| WS-12 | Model Comparison | Interface | M2 | 4-5d | WS-10, WS-11 |
| WS-13 | FastMCP Server | Integration | M4 | 5-6d | WS-10 |
| WS-14 | Bevy Plugin | Integration | M4 | 6-7d | WS-13 |
| WS-15 | Asset Deployment | Integration | M4 | 4-5d | WS-13, WS-14 |
| WS-16 | DCGM Metrics | Integration | M5 | 5-6d | WS-05 |
| WS-17 | Docker Deployment | Integration | M5 | 4-5d | WS-10, WS-16 |
| WS-18 | CI/CD Pipeline | Integration | M5 | 6-8d | WS-17 |
Timeline: 90-110 days sequential → 60-70 days with proper parallelization
2.3 Dependency Management
Critical Dependencies:
M0 (Foundation) BLOCKS ALL
└─→ [Gate 1] ─→ M1 (Models) ─→╮
└─→ M2 (Interface) ┤
└─→ [Gate 2] ─→ M4/M5 (Integration)
└─→ M3 (Training)
Blocking Rules:
- Foundation Orchestrator must complete before Model/Interface start
- ComfyUI (WS-04) must complete before TUI integration (WS-10, WS-12)
- Python backend (WS-10) must complete before Bevy integration (WS-13, WS-14)
- All predecessors must complete before dependent workstreams
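The blocking rules amount to a dependency graph, so a valid execution order and a rough schedule can be derived from the workstream plan matrix. The sketch below copies durations and dependencies from that matrix; the function name `earliest_finish` is hypothetical. It deliberately ignores phase gates, agent capacity, and review time, so its critical-path figure is a lower bound and is not expected to match the 60-70 day estimate quoted above.

```python
from graphlib import TopologicalSorter

# (min_days, max_days, dependencies), copied from the workstream plan matrix.
PLAN = {
    "WS-01": (3, 4, []),
    "WS-02": (4, 5, ["WS-01"]),
    "WS-03": (3, 4, ["WS-01"]),
    "WS-04": (4, 5, ["WS-01"]),
    "WS-05": (5, 7, ["WS-04"]),
    "WS-06": (7, 10, ["WS-05"]),
    "WS-07": (5, 6, ["WS-05"]),
    "WS-08": (6, 8, ["WS-01"]),
    "WS-09": (4, 5, ["WS-08"]),
    "WS-10": (5, 6, ["WS-04", "WS-09"]),
    "WS-11": (3, 4, ["WS-08", "WS-10"]),
    "WS-12": (4, 5, ["WS-10", "WS-11"]),
    "WS-13": (5, 6, ["WS-10"]),
    "WS-14": (6, 7, ["WS-13"]),
    "WS-15": (4, 5, ["WS-13", "WS-14"]),
    "WS-16": (5, 6, ["WS-05"]),
    "WS-17": (4, 5, ["WS-10", "WS-16"]),
    "WS-18": (6, 8, ["WS-17"]),
}


def earliest_finish(plan, which):
    """Earliest finish day per workstream; which=0 uses min, 1 uses max durations."""
    graph = {ws: deps for ws, (_, _, deps) in plan.items()}
    finish = {}
    # static_order() yields every workstream after all of its predecessors.
    for ws in TopologicalSorter(graph).static_order():
        start = max((finish[dep] for dep in plan[ws][2]), default=0)
        finish[ws] = start + plan[ws][which]
    return finish


sequential_min = sum(days[0] for days in PLAN.values())
sequential_max = sum(days[1] for days in PLAN.values())
critical_min = max(earliest_finish(PLAN, 0).values())
critical_max = max(earliest_finish(PLAN, 1).values())

print(f"sequential: {sequential_min}-{sequential_max} days")
print(f"dependency-only critical path: {critical_min}-{critical_max} days")
```

With the matrix values as given, the raw duration sum is 83-106 days and the dependency-only critical path is 33-41 days, ending at WS-15. The gap between that lower bound and the quoted parallel estimate is where gate waits and limited agent capacity would show up.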
3. IMPLEMENTATION PATTERNS
3.1 Justfile Command Organization
Location: /home/beengud/raibid-labs/dgx-pixels/justfile
Structure Pattern:
# === Project Initialization ===
init:
    #!/usr/bin/env bash
    # Create directory structure
    mkdir -p rust/src/{ui,zmq_client}
    mkdir -p python/workers
    mkdir -p workflows
    mkdir -p models/{checkpoints,loras,configs}
    # Initialize virtual environments, etc.

# === Build Commands ===
build:
    cargo build --workspace

# === Development Commands ===
tui:
    cargo run --package dgx-pixels-tui

# Shebang recipe: without it, just runs each line in its own shell and the
# activated venv would not carry over to the python line
backend PORT="5555":
    #!/usr/bin/env bash
    source venv/bin/activate
    python python/workers/generation_worker.py --port {{PORT}}

# === Testing ===
test:
    cargo test --workspace
    pytest python/tests/ -v

# === Code Quality ===
fmt:
    cargo fmt --all

lint:
    cargo clippy --workspace -- -D warnings

ci: fmt lint test
    @echo "✅ All CI checks passed!"

# === Orchestration Commands ===
orch-foundation:
    @echo "🚀 Starting Foundation Orchestrator..."

# === Docker Commands ===
docker-setup:
    ./scripts/setup_docker.sh

docker-up:
    cd docker && docker compose up -d

# === Git Commands ===
branch WS_ID:
    #!/usr/bin/env nu
    use scripts/nu/modules/github.nu *
    gh-create-branch "{{WS_ID}}"

pr TITLE:
    #!/usr/bin/env nu
    use scripts/nu/modules/github.nu *
    gh-create-pr "{{TITLE}}"

Key Patterns:
- Recipes organized by functional area
- Bash/Nushell shebang for polyglot support
- Parameters using {{var}} syntax
- Recipes can chain with recipe1 recipe2 recipe3
- @ suppresses echo, ! is important
3.2 Nushell Automation Scripts
Location: scripts/nu/
Module Pattern:
#!/usr/bin/env nu
# File: scripts/nu/modules/github.nu
use ../config.nu [COLORS, log-success, log-error, log-warning, log-info]

# Export reusable functions
export def gh-create-branch [
    branch_name: string,
    base_branch: string = "main"
] {
    # Function body with error handling, logging
    try {
        log-info $"Creating branch: ($branch_name)"
        git checkout -b $branch_name
        log-success $"Created: ($branch_name)"
        return true
    } catch {|err|
        log-error $"Failed: ($err.msg)"
        return false
    }
}

export def gh-create-pr [
    title: string,
    --body: string,
    --base: string = "main",
    --draft,
    --labels: list<string> = []
] {
    # Full implementation with validation
}

Three-Layer Module Structure:
- config.nu - Core utilities
  - Color constants and logging functions
  - Project paths (project-root, docs-dir, models-dir)
  - File system utilities
  - Git utilities (current-branch, is-git-clean)
  - Hardware detection (has-nvidia-gpu, gpu-model)
  - Environment checks
- modules/github.nu - GitHub automation
  - gh-create-branch
  - gh-create-pr
  - gh-auto-merge
  - gh-rebase-main
  - gh-check-status
  - gh-list-prs
  - gh-request-review
- modules/dgx.nu - Hardware utilities
  - dgx-gpu-stats
  - dgx-validate-hardware
  - dgx-benchmark-memory
  - dgx-export-topology
Usage Pattern:
# In justfile
validate-gpu:
    #!/usr/bin/env nu
    use scripts/nu/config.nu *
    check-dgx-prerequisites

# Or directly
use scripts/nu/modules/github.nu *
let result = (gh-create-branch "feature/new-ui")
if $result {
    print "Branch created successfully"
}

3.3 Rust Project Structure
Location: rust/ with monorepo Cargo workspace
rust/
├── Cargo.toml # Workspace definition
├── Cargo.lock
├── src/
│ ├── main.rs # TUI entry point
│ ├── app.rs # Application state (Screen enum, JobStatus, App struct)
│ ├── zmq_client.rs # ZeroMQ communication
│ ├── messages.rs # Protocol messages (Request, Response, ProgressUpdate)
│ ├── comparison.rs # Side-by-side model comparison logic
│ ├── reports.rs # Report generation
│ ├── events/
│ │ ├── mod.rs
│ │ └── handler.rs # Event handling (keyboard, resize)
│ ├── ui/
│ │ ├── mod.rs # Main render function
│ │ ├── layout.rs # Grid layout for screens
│ │ ├── theme.rs # Colors and styles
│ │ └── screens/
│ │ ├── mod.rs
│ │ ├── generation.rs
│ │ ├── comparison.rs # NEW: Side-by-side UI
│ │ ├── gallery.rs
│ │ ├── models.rs
│ │ ├── queue.rs
│ │ ├── monitor.rs
│ │ ├── settings.rs
│ │ └── help.rs
│ └── sixel/
│ ├── mod.rs
│ ├── image_renderer.rs # Sixel encoding
│ ├── preview_manager.rs # Async image preview loading
│ └── terminal_detection.rs # Detect Sixel support
├── tests/
│ └── integration_test.rs
└── benches/
└── (benchmarks)
Cargo.toml Pattern:
[package]
name = "dgx-pixels-tui"
version = "0.1.0"
edition = "2021"
rust-version = "1.70"
[[bin]]
name = "dgx-pixels-tui"
path = "src/main.rs"
[dependencies]
# TUI framework
ratatui = "0.26"
crossterm = "0.27"
# Async runtime
tokio = { version = "1.35", features = ["full"] }
# Serialization
serde = { version = "1.0", features = ["derive"] }
# Error handling
anyhow = "1.0"
thiserror = "1.0"
# Logging
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
# IPC
zmq = "0.10.0"
rmp-serde = "1.3.0"
# Image processing
image = "0.24"
viuer = "0.7"
# Performance
dashmap = "5.5"
parking_lot = "0.12"
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
strip = true

3.4 Python Project Structure
python/
├── requirements.txt
├── requirements-base.txt
├── requirements-extra.txt
├── requirements-comfyui.txt
├── workers/
│ ├── __init__.py
│ ├── generation_worker.py # Main worker
│ └── zmq_server.py # ZeroMQ server
├── mcp_server/
│ ├── __init__.py
│ └── server.py # FastMCP server
├── training/
│ ├── lora_trainer.py
│ └── dataset_tools.py
├── tests/
│ ├── test_worker.py
│ └── test_mcp.py
└── pyproject.toml
Key Implementation Patterns:
- ZeroMQ Server (REQ-REP + PUB-SUB)
  - REQ-REP (Request-Reply): Job submission, status queries
  - PUB-SUB (Publish-Subscribe): Progress updates, notifications
- Worker Loop
  - Listen on the REP socket (server side of REQ-REP) for job requests
  - Publish progress on the PUB socket (the TUI subscribes with SUB)
  - Communicate with ComfyUI via HTTP
  - Return results and status
- MCP Server
  - FastMCP for MCP protocol
  - Tools for Bevy integration
  - Handles asset deployment
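The worker loop described above separates two concerns: answering one request with one reply, and streaming progress independently of replies. The sketch below shows that split using only the standard library. It is not the project's worker: `queue.Queue` stands in for the PUB socket, the message field names and the functions `handle_request` and `run_job` are assumptions, and the ComfyUI HTTP call is replaced by a loop that only emits progress.

```python
import queue
import uuid


def handle_request(jobs, request):
    """Map one request dict to one reply dict (the REQ-REP half)."""
    kind = request.get("type")
    if kind == "generate":
        job_id = uuid.uuid4().hex
        jobs[job_id] = "queued"
        return {"type": "job_queued", "job_id": job_id}
    if kind in ("get_status", "cancel"):
        job_id = request.get("job_id")
        if job_id not in jobs:
            return {"type": "error", "message": f"unknown job {job_id}"}
        if kind == "cancel" and jobs[job_id] != "complete":
            jobs[job_id] = "cancelled"
        return {"type": "status", "job_id": job_id, "status": jobs[job_id]}
    return {"type": "error", "message": f"unsupported request type {kind!r}"}


def run_job(jobs, job_id, updates, steps=4):
    """Pretend to run a job, pushing progress updates (the PUB-SUB half)."""
    jobs[job_id] = "running"
    for step in range(1, steps + 1):
        # A real worker would poll ComfyUI here and publish on the PUB socket.
        updates.put({"job_id": job_id, "stage": "sampling", "progress": step / steps})
    jobs[job_id] = "complete"


jobs = {}
updates = queue.Queue()

reply = handle_request(jobs, {"type": "generate", "prompt": "knight sprite", "model": "sdxl"})
run_job(jobs, reply["job_id"], updates)
final = handle_request(jobs, {"type": "get_status", "job_id": reply["job_id"]})
print(reply["type"], updates.qsize(), final["status"])
```

Keeping `handle_request` free of socket code makes the request handling testable without a running ZeroMQ context; the transport layer then only has to decode a message, call the handler, and encode the reply.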
4. DOCKER & CONTAINERIZATION STRATEGY
4.1 Docker Compose Architecture
Location: docker/docker-compose.yml
Service Stack (abbreviated summary, not literal Compose syntax):
services:
  comfyui:
    # AI inference engine
    build: docker/Dockerfile.comfyui
    depends_on: none
    ports: 8188
    volumes: models, outputs, workflows
  backend-worker:
    # Python ZeroMQ server + ComfyUI client
    build: docker/Dockerfile.backend
    depends_on: comfyui (service_healthy)
    ports: 5555 (REQ-REP), 5556 (PUB-SUB)
    volumes: workflows, outputs
  mcp-server:
    # FastMCP for Bevy integration
    build: docker/Dockerfile.mcp
    depends_on: backend-worker (service_healthy)
    ports: 3001
  dcgm-exporter:
    # GPU metrics (NVIDIA DCGM)
    image: nvidia/dcgm-exporter:3.1.7
    ports: 9400
  prometheus:
    # Time-series metrics database
    image: prom/prometheus:v2.48.0
    ports: 9090
    depends_on: dcgm-exporter, backend-worker
  grafana:
    # Metrics visualization
    image: grafana/grafana:10.2.2
    ports: 3000
    depends_on: prometheus
  node-exporter:
    # Host system metrics
    image: prom/node-exporter:v1.7.0
    ports: 9100
  dgx-pixels-dev:
    # Development container (optional, profile:dev)
    build: docker/Dockerfile
    depends_on: none
    profiles: [dev]

Network Configuration:
networks:
  dgx-pixels-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16

Volume Configuration:
volumes:
  # Persistent storage for models (shared across services)
  comfyui-models:
  # Persistent storage for outputs
  comfyui-outputs:
  backend-outputs:
  # Development bind mounts
  dgx-pixels-models:    # bind to host ./models
  dgx-pixels-outputs:   # bind to host ./outputs
  # Observability
  prometheus-data:
  grafana-data:

4.2 Dockerfile Patterns
Base Image Strategy:
# Main development container (ARM64 compatible)
FROM nvcr.io/nvidia/pytorch:24.11-py3
# Why NGC base:
# - PyTorch wheels for ARM+CUDA unavailable on PyPI
# - Pre-built with CUDA support
# - NVIDIA optimized
# - Includes Python 3.12 + PyTorch 2.6

Layer Optimization:
# 1. System packages (slow - do first)
RUN apt-get update && apt-get install -y --no-install-recommends \
    vim curl git ...
# 2. Python dependencies (medium - do middle)
COPY requirements.txt /tmp/
RUN pip install --no-cache-dir -r /tmp/requirements.txt
# 3. Application code (fast - do last)
COPY src/ /app/src/

Health Checks:
HEALTHCHECK --interval=30s --timeout=10s --retries=3 --start-period=60s \
    CMD curl -f http://localhost:8188/system_stats

4.3 Docker Setup Script
Location: scripts/setup_docker.sh
Checks Performed:
- Docker v20.10+
- Docker Compose v2+
- NVIDIA Container Toolkit
- NVIDIA drivers
- DGX-Spark GB10 hardware
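The version checks in this list come down to extracting a version number from tool output and comparing it against a minimum. The following sketch shows one way to do that comparison; it is not taken from scripts/setup_docker.sh, the helper names are hypothetical, and the sample strings only imitate the usual shape of `docker --version` and `docker compose version` output.

```python
import re


def parse_version(text):
    """Extract the first dotted version number from CLI output as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(text, minimum):
    """Compare numerically, so that 20.9 correctly sorts below 20.10."""
    return parse_version(text) >= tuple(minimum)


# Sample output strings (illustrative, not captured from a real machine).
docker_out = "Docker version 24.0.7, build afdd53b"
compose_out = "Docker Compose version v2.24.5"

print("docker ok:", meets_minimum(docker_out, (20, 10)))
print("compose ok:", meets_minimum(compose_out, (2,)))
```

Comparing integer tuples avoids the classic string-comparison mistake where "20.9" appears newer than "20.10". In a real script the text would come from something like `subprocess.run(["docker", "--version"], capture_output=True, text=True).stdout`.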
Creates:
- docker/.env configuration
- Directory structure
- Initial Docker images
- Networks
5. ZEROMQ IPC PATTERNS
5.1 Communication Architecture
Pattern: REQ-REP + PUB-SUB Hybrid
Rust TUI Python Backend
│ │
├──→ REQ (Request) ──────→┐ │
│ │ │
│ ┌────▼────┐ │
│ │ REQ-REP │ │
│ │ Socket │ │
│ │ Port 5555│ │
│ └────┬────┘ │
│ │ │
│ ┌──────────PUB/SUB───┤ │
│ │ Updates (Progress) │ │
│ │ Port 5556 │ │
│ │ │ │
└────┴──────────Sub────────────────┘
│
└─ Receives progress updates
5.2 Message Patterns
Rust Implementation (zmq_client.rs):
pub struct ZmqClient {
    req_sender: Sender<ClientRequest>,
    resp_receiver: Receiver<Response>,
    update_receiver: Receiver<ProgressUpdate>,
    _req_thread: thread::JoinHandle<()>,
    _sub_thread: thread::JoinHandle<()>,
}

impl ZmqClient {
    pub fn new(req_addr: &str, pub_addr: &str) -> Result<Self> {
        // Spawn two threads:
        // 1. REQ-REP thread for request/response
        // 2. PUB-SUB thread for updates
    }
}

Message Protocol (messages.rs):
pub enum Request {
    Generate { prompt: String, model: String },
    GetStatus { job_id: String },
    Cancel { job_id: String },
}

pub enum Response {
    JobQueued { job_id: String },
    Status { job_id: String, status: JobStatus },
    Result { image_path: PathBuf, metadata: ... },
    Error { message: String },
}

pub struct ProgressUpdate {
    job_id: String,
    stage: String,
    progress: f32,
    eta_s: f32,
}

Serialization: MessagePack (rmp-serde)
- Binary format (smaller than JSON)
- Fast serialization/deserialization
- Type-safe in both Rust and Python
6. PROJECT STRUCTURE TEMPLATE
6.1 Directory Organization
sparky/
├── README.md
├── CONTRIBUTING.md
├── CLAUDE.md # Claude Code guidance
├── justfile # Task automation
│
├── docs/
│ ├── orchestration/ # Orchestration patterns
│ │ ├── meta-orchestrator.md
│ │ ├── workstream-plan.md
│ │ ├── orchestrators/
│ │ │ ├── foundation.md
│ │ │ ├── model.md
│ │ │ ├── interface.md
│ │ │ └── integration.md
│ │ └── workstreams/
│ │ ├── start-here.md
│ │ ├── template.md
│ │ ├── ws01-xxx/README.md
│ │ └── ...
│ ├── adr/ # Architecture Decision Records
│ │ └── 0001-decision.md
│ └── (domain-specific docs)
│
├── rust/
│ ├── Cargo.toml # Workspace
│ ├── src/
│ │ ├── main.rs
│ │ ├── app.rs
│ │ ├── events/
│ │ ├── ui/
│ │ └── (domain modules)
│ ├── tests/
│ └── benches/
│
├── python/
│ ├── requirements.txt
│ ├── workers/
│ ├── mcp_server/
│ ├── training/
│ └── tests/
│
├── docker/
│ ├── docker-compose.yml
│ ├── Dockerfile
│ ├── Dockerfile.backend
│ ├── Dockerfile.mcp
│ ├── requirements-base.txt
│ └── requirements-comfyui.txt
│
├── scripts/
│ ├── setup_docker.sh
│ ├── docker_health_check.sh
│ ├── docker_cleanup.sh
│ └── nu/
│ ├── config.nu
│ └── modules/
│ ├── github.nu
│ ├── dgx.nu
│ └── (domain modules)
│
├── config/
│ ├── mcp_config.yaml
│ └── (service configs)
│
├── deploy/
│ ├── prometheus/
│ ├── grafana/
│ └── dcgm/
│
├── models/
│ ├── checkpoints/
│ ├── loras/
│ └── configs/
│
├── workflows/
│ └── (workflow templates)
│
└── examples/
└── (example implementations)
6.2 Configuration Files
docker/.env
# Ports
COMFYUI_PORT=8188
ZMQ_PORT=5555
GRAFANA_PORT=3000
# Paths
PROJECT_ROOT=/path/to/sparky
MODELS_DIR=./models
OUTPUTS_DIR=./outputs
# Credentials
GRAFANA_ADMIN_PASSWORD=admin

dgx-pixels.toml (project config)
[api]
port = 8000
zmq_req_port = 5555
zmq_pub_port = 5556
[comfyui]
url = "http://localhost:8188"
[models]
dir = "models"
[observability]
prometheus_url = "http://localhost:9090"
grafana_url = "http://localhost:3000"

7. CI/CD AND TESTING PATTERNS
7.1 Test-Driven Development (TDD)
Pattern Used:
- Write tests FIRST
- Implement code
- Run tests
- Code review with passing tests
Test Locations:
rust/tests/integration_test.rs
python/tests/test_*.py
Test Command:
just test               # Run all tests
just test-coverage      # With coverage report
just test-integration   # Integration tests only

7.2 Pre-commit Checks
Command Chain:
just ci                 # Runs: fmt + lint + test

Components:
- fmt - Code formatting
  cargo fmt --all        # Rust
  ruff format python/    # Python
- lint - Code quality checks
  cargo clippy --workspace -- -D warnings
- test - Unit + integration tests
7.3 GitHub Workflow
Standard PR Workflow:
# 1. Create branch for workstream
just branch WS-01
# 2. Implement with TDD
# (write tests, implement, run tests)
# 3. Run CI checks
just ci
# 4. Create PR
just pr "Implement WS-01: Title"
# 5. Enable auto-merge
gh-auto-merge --merge-method squash
# 6. Before next workstream, rebase
gh-rebase-main
# 7. Push with force-with-lease
git push --force-with-lease

8. MONITORING & OBSERVABILITY
8.1 Observability Stack
Components:
- DCGM Exporter - GPU metrics (NVIDIA Data Center GPU Manager)
- Prometheus - Time-series metrics collection
- Grafana - Visualization dashboards
- Node Exporter - Host system metrics
Metrics Collected:
- GPU utilization, memory, temperature
- Power draw, clock speeds
- Inference latency, throughput
- Queue depth, job completion rate
- Model accuracy metrics
URLs:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- DCGM metrics: http://localhost:9400/metrics
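GPU and host metrics come from the DCGM and node exporters, but application-level figures such as queue depth and job completion counts would presumably have to be exposed by the backend itself for Prometheus to scrape. The sketch below renders such values in the Prometheus text exposition format. The metric names and the `render_metrics` helper are hypothetical, and label values are not escaped, so it is only suitable for simple values.

```python
def render_metrics(samples):
    """Render (name, help, type, value, labels) tuples as Prometheus text format."""
    lines = []
    for name, help_text, metric_type, value, labels in samples:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        label_str = ""
        if labels:
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            label_str = "{" + pairs + "}"
        lines.append(f"{name}{label_str} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric names; the real backend may use different ones.
samples = [
    ("dgx_pixels_queue_depth", "Jobs waiting for a worker", "gauge", 3, {}),
    ("dgx_pixels_jobs_completed_total", "Jobs finished since start", "counter", 42,
     {"model": "sdxl"}),
]

text = render_metrics(samples)
print(text, end="")
```

In practice a client library would normally handle this formatting and the HTTP endpoint; the sketch is only meant to show what Prometheus expects to find when it scrapes a target.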
9. AGENT SPAWNING PATTERNS
9.1 Agent Types by Workstream
| Workstream Type | Agent Type | Rationale |
|---|---|---|
| Infrastructure/DevOps | devops-automator | Docker, shell, CI/CD |
| Rust TUI | rust-pro | TUI frameworks, async |
| Python/AI | ai-engineer + python-pro | ML models, optimization |
| Integration | backend-architect | API design, protocols |
9.2 Workstream Spawn Command
Standard Pattern:
npx claude-flow@alpha spawn agent {agent-type} \
  --workstream {WS-XX} \
  --spec docs/orchestration/workstreams/{ws-name}/README.md \
  --priority {P0-P3} \
  --depends {WS-YY} \
  --output {completion-path}

Example:
npx claude-flow@alpha spawn agent devops-automator \
  --workstream WS-01 \
  --spec docs/orchestration/workstreams/ws01-hardware-baselines/README.md \
  --priority P0 \
  --output docs/orchestration/workstreams/ws01-hardware-baselines/COMPLETION_SUMMARY.md

9.3 Orchestrator Coordination
Meta Orchestrator monitors:
- Workstream completion status
- Cross-domain dependencies
- Blocker resolution
- Phase gate readiness
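The monitoring duties above can be sketched as two small functions over orchestrator status reports. This assumes each domain orchestrator reports a dictionary shaped like the status update format in this section; the function names `summarize` and `gate_ready` are hypothetical.

```python
def summarize(report):
    """Group one orchestrator's workstreams by status and collect blocker reasons."""
    by_status = {}
    for ws_id, info in sorted(report["workstreams"].items()):
        by_status.setdefault(info["status"], []).append(ws_id)
    blockers = {
        ws_id: info.get("blocker", "unspecified")
        for ws_id, info in report["workstreams"].items()
        if info["status"] == "blocked"
    }
    return by_status, blockers


def gate_ready(reports, required):
    """True when every workstream in `required` is complete in some report."""
    complete = {
        ws_id
        for report in reports
        for ws_id, info in report["workstreams"].items()
        if info["status"] == "complete"
    }
    return set(required) <= complete


# Example report using the same fields as the status update format.
report = {
    "orchestrator": "Model Orchestrator",
    "status": "active",
    "workstreams": {
        "WS-04": {"status": "complete", "completion_time": "2025-11-12T14:30:00Z"},
        "WS-05": {"status": "in_progress", "progress": 0.65, "eta": "2025-11-13T10:00:00Z"},
        "WS-06": {"status": "blocked", "blocker": "WS-05 incomplete"},
    },
    "blockers": [],
    "decisions_needed": [],
}

by_status, blockers = summarize(report)
print(by_status)
print(blockers)
print("WS-04 and WS-05 complete:", gate_ready([report], ["WS-04", "WS-05"]))
```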
Status Update Format (JSON):
{
  "orchestrator": "Model Orchestrator",
  "status": "active",
  "workstreams": {
    "WS-04": {"status": "complete", "completion_time": "2025-11-12T14:30:00Z"},
    "WS-05": {"status": "in_progress", "progress": 0.65, "eta": "2025-11-13T10:00:00Z"},
    "WS-06": {"status": "blocked", "blocker": "WS-05 incomplete"}
  },
  "blockers": [],
  "decisions_needed": []
}

10. KEY TAKEAWAYS FOR SPARKY
10.1 Core Patterns to Adopt
- Multi-tier Orchestration
  - One Meta Orchestrator
  - Multiple Domain Orchestrators
  - Sequential → Parallel progression
- Workstream Structure
  - Standardized README format
  - Clear acceptance criteria
  - Tracked completion summaries
- Automation Stack
  - Justfile for command orchestration
  - Nushell modules for reusable automation
  - GitHub CLI for workflow automation
- Technology Stack
  - Rust for performance-critical components
  - Python for ML/AI components
  - ZeroMQ for inter-process communication
  - Docker Compose for deployment
- Phase Gates
  - Gating prevents out-of-order work
  - Clear acceptance criteria
  - Blocks dependent workstreams
10.2 Implementation Roadmap for Sparky
Week 1-2: Foundation
- Copy orchestration structure from dgx-pixels
- Adapt workstream templates
- Set up initial Docker Compose
- Establish Justfile and Nushell modules
Week 3-4: Domain Orchestrators
- Create domain orchestrator specs
- Populate workstream specs
- Establish automation scripts
- Set up GitHub workflows
Week 5+: Parallel Execution
- Spawn agents for first workstreams
- Monitor progress via status reports
- Manage phase gates
- Escalate blockers as needed
10.3 File Reference Summary
Must-Read Patterns:
- /home/beengud/raibid-labs/dgx-pixels/justfile - Task automation
- /home/beengud/raibid-labs/dgx-pixels/docs/orchestration/meta-orchestrator.md - Orchestration strategy
- /home/beengud/raibid-labs/dgx-pixels/docker/docker-compose.yml - Service architecture
- /home/beengud/raibid-labs/dgx-pixels/scripts/nu/config.nu - Nushell utilities
- /home/beengud/raibid-labs/dgx-pixels/scripts/nu/modules/github.nu - GitHub automation
- /home/beengud/raibid-labs/dgx-pixels/CONTRIBUTING.md - Development workflow
11. CRITICAL INSIGHTS
- Orchestration is Hierarchical
  - Don’t try to manage 18 workstreams flat
  - Group by domain (4 orchestrators)
  - Each orchestrator owns 3-6 workstreams
- Phase Gates Are Crucial
  - Prevent out-of-order work
  - Save rework and integration pain
  - Make dependencies explicit
- Automation Saves Repetition
  - Nushell modules provide reusable functions
  - Justfile provides task entry points
  - GitHub automation reduces manual work
- Docker is Non-Negotiable
  - Reproducible environments
  - GPU access through NVIDIA Container Toolkit
  - Network of interdependent services
- ZeroMQ + MessagePack is Optimal for IPC
  - Low latency (<1ms)
  - Binary format saves bandwidth
  - REQ-REP + PUB-SUB patterns are flexible
- Metrics-First Observability
  - DCGM for GPU metrics
  - Prometheus for time-series
  - Grafana for visualization
  - Track from day 1, not at end
- Test-Driven Development
  - Tests first, implementation second
  - Prevents integration surprises
  - CI gates enforce quality
- Documentation as Code
  - Markdown in docs/ alongside code
  - Workstream specs are contracts
  - Architecture decisions in ADRs
APPENDIX: Quick Command Reference
# Project initialization
just init # One-time setup
just validate-gpu # Verify hardware
# Development
just tui # Run Rust TUI
just backend # Run Python backend
just comfyui # Run ComfyUI server
# Testing & Quality
just test # Run tests
just ci # Format, lint, test
just pre-commit # Pre-commit checks
# Docker
docker compose up -d # Start all services
docker compose logs -f # Follow logs
./scripts/docker_health_check.sh
# Git workflow
just branch WS-01 # Create workstream branch
just pr "Title" # Create PR
gh-auto-merge --merge-method squash
gh-rebase-main # Rebase onto main
# Monitoring
just gpu-status # One-time GPU stats
just gpu-watch # Live GPU monitoring
just hw-info # All hardware info
open http://localhost:3000 # Grafana