DGX-Spark MCP Server

Persistent hardware context and intelligent Spark optimization for Claude Code on NVIDIA DGX systems

Problem

Claude Code forgets your DGX hardware specifications between sessions. This leads to:

  • ❌ Asking about GPU count/specs repeatedly
  • ❌ Generating sub-optimal Spark configurations
  • ❌ Missing DGX-specific optimization opportunities
  • ❌ No real-time GPU availability awareness

Solution

An MCP (Model Context Protocol) server that provides:

  • Persistent Hardware Context: Always knows your DGX specs
  • Real-Time GPU Status: Check availability before suggesting jobs
  • Intelligent Spark Configs: Generate optimal configs for your hardware
  • DGX Documentation: Instant access to DGX Spark best practices
  • Resource Estimation: Predict job requirements accurately

Quick Start

This MCP server is designed to be used as part of the raibid-labs/workspace.

Installation via Workspace

# Clone the workspace (includes this MCP server as a submodule)
git clone --recursive https://github.com/raibid-labs/workspace.git
cd workspace
 
# Follow workspace setup instructions
# The DGX Spark MCP server will be automatically configured

Standalone Installation (Advanced)

If you need to install this MCP server independently:

# Clone and build
git clone https://github.com/raibid-labs/dgx-spark-mcp.git
cd dgx-spark-mcp
npm install
npm run build
 
# Add to your Claude Code MCP settings:
{
  "mcpServers": {
    "dgx-spark": {
      "command": "node",
      "args": ["/path/to/dgx-spark-mcp/dist/index.js"]
    }
  }
}

Usage in Claude Code

Once configured via workspace, you can ask Claude:

  • “What GPUs are available right now?”
  • “Generate optimal Spark config for 1TB ETL job”
  • “How should I configure executors for ML training?”
  • “Search DGX documentation for best practices”

Features

MCP Resources (Static Context)

Claude Code can read these at any time:

  • dgx://hardware/specs - Complete hardware specifications
  • dgx://hardware/topology - GPU interconnect and system topology
  • dgx://system/capabilities - What your system can do
  • dgx://docs/spark/{topic} - DGX Spark documentation
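
The resource URIs above follow a simple `dgx://category/path` scheme. As a minimal sketch of how a server might route such URIs (the `parseDgxUri` helper and its return shape are illustrative assumptions, not the actual server API):

```typescript
// Hypothetical router sketch for the dgx:// resource scheme shown above.
interface DgxResource {
  category: string; // e.g. "hardware", "system", "docs"
  path: string[];   // remaining segments, e.g. ["spark", "shuffle-tuning"]
}

function parseDgxUri(uri: string): DgxResource | null {
  const match = uri.match(/^dgx:\/\/([^/]+)\/(.+)$/);
  if (!match) return null; // not a dgx:// URI
  return { category: match[1], path: match[2].split("/") };
}

// Example: the docs resource is parameterized by topic.
const res = parseDgxUri("dgx://docs/spark/shuffle-tuning");
// res → { category: "docs", path: ["spark", "shuffle-tuning"] }
```

A dispatcher can then switch on `category` to serve hardware specs, capabilities, or a documentation lookup for the remaining path segments.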

MCP Tools (Dynamic Operations)

Claude Code can invoke these tools:

  • check_gpu_availability - Current GPU utilization and availability
  • get_optimal_spark_config - Generate Spark config for workload
  • search_documentation - Search DGX Spark docs
  • estimate_resources - Estimate job resource requirements
  • get_system_health - Current system health status
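
To make the config-generation idea concrete, here is an illustrative heuristic only; the real `get_optimal_spark_config` logic is not shown here, and the `suggestSparkConfig` function, its inputs, and the one-executor-per-GPU rule are assumptions for the sketch:

```typescript
// Hypothetical heuristic: one Spark executor per GPU, CPU cores split
// evenly, and ~10% of system memory reserved for the OS and driver.
interface HardwareInfo {
  gpuCount: number;
  systemMemoryGb: number;
  cpuCores: number;
}

interface SparkConfig {
  "spark.executor.instances": number;
  "spark.executor.cores": number;
  "spark.executor.memory": string;
  "spark.rapids.sql.enabled": string;
}

function suggestSparkConfig(hw: HardwareInfo): SparkConfig {
  const executors = Math.max(1, hw.gpuCount); // one executor per GPU
  const coresPerExecutor = Math.max(1, Math.floor(hw.cpuCores / executors));
  const memPerExecutorGb = Math.floor((hw.systemMemoryGb * 0.9) / executors);
  return {
    "spark.executor.instances": executors,
    "spark.executor.cores": coresPerExecutor,
    "spark.executor.memory": `${memPerExecutorGb}g`,
    "spark.rapids.sql.enabled": "true", // GPU-accelerated SQL via RAPIDS
  };
}

// e.g. an 8-GPU system with 2 TB RAM and 256 cores:
const cfg = suggestSparkConfig({ gpuCount: 8, systemMemoryGb: 2048, cpuCores: 256 });
// cfg["spark.executor.memory"] → "230g"
```

The actual tool can weigh workload type (ETL vs. ML training) and current GPU availability as well, which is why it is exposed as a dynamic tool rather than a static resource.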

Architecture

Claude Code ←→ MCP Protocol ←→ DGX-Spark MCP Server
                                        ↓
                    ┌───────────────────┼───────────────────┐
                    ↓                   ↓                   ↓
           Hardware Detection      Intelligence       Documentation
          (nvidia-smi, /proc)   (Spark Optimizer)   (Search & Index)
                    ↓
              DGX Hardware

See Architecture Overview for details.
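
The hardware-detection layer above reads GPU state via `nvidia-smi`. A plausible sketch of that approach is to parse CSV output such as `nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`; the `GpuStatus` shape and the "available" thresholds below are illustrative assumptions, not the server's actual logic:

```typescript
// Parses lines like: "0, NVIDIA A100-SXM4-80GB, 3, 1024, 81920"
interface GpuStatus {
  index: number;
  name: string;
  utilizationPct: number;
  memoryUsedMb: number;
  memoryTotalMb: number;
}

function parseNvidiaSmiCsv(output: string): GpuStatus[] {
  return output
    .trim()
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => {
      const [index, name, util, used, total] = line.split(",").map((s) => s.trim());
      return {
        index: Number(index),
        name,
        utilizationPct: Number(util),
        memoryUsedMb: Number(used),
        memoryTotalMb: Number(total),
      };
    });
}

// Hypothetical availability rule: mostly idle and under half memory used.
function availableGpus(gpus: GpuStatus[]): GpuStatus[] {
  return gpus.filter((g) => g.utilizationPct < 10 && g.memoryUsedMb < g.memoryTotalMb * 0.5);
}

const sample =
  "0, NVIDIA A100-SXM4-80GB, 3, 1024, 81920\n" +
  "1, NVIDIA A100-SXM4-80GB, 97, 80000, 81920";
const idle = availableGpus(parseNvidiaSmiCsv(sample));
// idle → only GPU 0 (GPU 1 is busy)
```

Parsing structured CSV rather than the default human-readable table keeps the detection layer robust to `nvidia-smi` layout changes.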

Development

This project was developed using parallel workstreams:

| Workstream | Status | Description |
| --- | --- | --- |
| WS1: MCP Server Foundation | ✅ Complete | Core MCP protocol implementation |
| WS2: Hardware Detection | ✅ Complete | GPU and system introspection |
| WS3: MCP Resources & Tools | ✅ Complete | Resource and tool integration |
| WS4: Documentation System | ✅ Complete | Searchable docs with indexing |
| WS5: DGX Spark Intelligence | ✅ Complete | Workload analysis and optimization |
| WS6: Testing & DevOps | ✅ Complete | Comprehensive test suite and CI/CD |

See completion reports in docs/workstreams/ for detailed implementation notes.

Local Development

# Navigate to workspace
cd workspace/dgx-spark-mcp
 
# Install dependencies
npm install
 
# Run tests
npm test
 
# Build
npm run build
 
# Use justfile for common tasks
just build    # Build the project
just test     # Run tests
just lint     # Run linting

Standalone Development

# Clone repository
git clone https://github.com/raibid-labs/dgx-spark-mcp.git
cd dgx-spark-mcp
 
# Install dependencies
npm install
 
# Run tests
npm test
 
# Build
npm run build

Documentation

See the Architecture Overview and the workstream completion reports in docs/workstreams/.

Requirements

  • Node.js: 20+
  • NVIDIA Drivers: Latest
  • nvidia-smi: Must be in PATH
  • Operating System: Linux (tested on Ubuntu 22.04)
  • Hardware: NVIDIA DGX or compatible system

Contributing

This project uses multi-agent parallel development; see the workstream completion reports in docs/workstreams/ for how contributions are structured.

License

MIT License - see LICENSE file

Project Status

Production Ready

All core workstreams completed:

  • ✅ WS1: MCP Server Foundation
  • ✅ WS2: Hardware Detection System
  • ✅ WS3: MCP Resources & Tools Integration
  • ✅ WS4: Documentation System
  • ✅ WS5: DGX Spark Intelligence Engine
  • ✅ WS6: Testing & DevOps Infrastructure

Part of: raibid-labs/workspace - An integrated development environment for DGX systems