DGX Spark Configuration Guide
This guide covers essential configuration settings for running Apache Spark optimally on NVIDIA DGX systems with GPU acceleration.
Configuration Files
Main Configuration Files
- $SPARK_HOME/conf/spark-defaults.conf - Default Spark properties
- $SPARK_HOME/conf/spark-env.sh - Environment variables
- $SPARK_HOME/conf/log4j2.properties - Logging configuration
- $SPARK_HOME/conf/workers - Cluster worker nodes
Configuration Hierarchy
Configuration priority (highest to lowest):
- SparkConf in application code (see the sketch after this list)
- Command-line flags (--conf)
- spark-defaults.conf
- Environment variables
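As a quick illustration of the precedence rules, here is a minimal PySpark sketch (the property value is hypothetical) showing a programmatic setting winning over the lower layers:

```python
from pyspark.sql import SparkSession

# A value set on the builder overrides the same key supplied via
# --conf or spark-defaults.conf.
spark = (
    SparkSession.builder
    .appName("config-precedence-demo")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)

# The effective value comes from the application code, not the config file
print(spark.conf.get("spark.sql.shuffle.partitions"))  # -> 400
```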
GPU Configuration
RAPIDS Accelerator Settings
# Enable RAPIDS SQL Plugin
spark.plugins com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled true
# GPU Resource Allocation
spark.executor.resource.gpu.amount 1
spark.task.resource.gpu.amount 0.125
spark.executor.resource.gpu.discoveryScript /opt/spark/getGpusResources.sh
# RAPIDS SQL Configuration
spark.rapids.sql.concurrentGpuTasks 2
spark.rapids.sql.explain NOT_ON_GPU
spark.rapids.sql.incompatibleOps.enabled false
# Memory Management
spark.rapids.memory.pinnedPool.size 8g
spark.rapids.sql.batchSizeBytes 2147483647
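With those properties in place, a quick way to confirm the plugin actually loaded is to inspect the session configuration and a query plan. A minimal sketch, assuming the RAPIDS jar is on the driver and executor classpaths and the properties above are in spark-defaults.conf:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Both values should reflect spark-defaults.conf if the plugin loaded
print(spark.conf.get("spark.plugins"))             # com.nvidia.spark.SQLPlugin
print(spark.conf.get("spark.rapids.sql.enabled"))  # true

# A GPU-executed plan shows Gpu* operators (e.g. GpuHashAggregate)
spark.range(1_000_000).selectExpr("sum(id)").explain()
```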
GPU Task Scheduling
# Fine-grained GPU sharing
spark.task.resource.gpu.amount 0.125 # 8 tasks per GPU
# Coarse-grained (1 task per GPU)
spark.task.resource.gpu.amount 1.0
# Medium-grained (2 tasks per GPU)
spark.task.resource.gpu.amount 0.5
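The fraction is simply GPUs per executor divided by the number of concurrent tasks, which in turn is bounded by spark.executor.cores. A small sketch of that arithmetic (the helper name is ours, not a Spark API):

```python
# Derive spark.task.resource.gpu.amount so every task slot can claim a
# GPU slice (assumes spark.task.cpus=1, so tasks per executor = cores).
def task_gpu_amount(gpus_per_executor: int, cores_per_executor: int) -> float:
    return gpus_per_executor / cores_per_executor

print(task_gpu_amount(1, 8))  # 0.125 -> eight tasks share one GPU
print(task_gpu_amount(1, 2))  # 0.5   -> two tasks share one GPU
```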
Memory Configuration
Executor Memory Settings
For a DGX A100 (8x A100 80GB GPUs, 2TB system RAM):
# Executor Memory
spark.executor.memory 128g
spark.driver.memory 64g
# Memory Overhead (for off-heap, network buffers, etc.)
spark.executor.memoryOverhead 32g
spark.driver.memoryOverhead 16g
# On-heap memory fractions
spark.memory.fraction 0.6
spark.memory.storageFraction 0.5
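These two fractions follow Spark's unified memory model: usable memory is (heap minus ~300MB reserved) times spark.memory.fraction, and the storage half of that region can be evicted in favor of execution. A worked example for the 128g executor above:

```python
# Unified memory arithmetic for a 128 GB executor heap
heap_gb = 128
reserved_gb = 0.3  # Spark reserves roughly 300 MB

unified_gb = (heap_gb - reserved_gb) * 0.6  # spark.memory.fraction
storage_gb = unified_gb * 0.5               # spark.memory.storageFraction
print(f"unified: {unified_gb:.1f} GB, storage: {storage_gb:.1f} GB")
# unified: 76.6 GB, storage: 38.3 GB
```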
Off-Heap Memory
# Enable off-heap memory
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 32g
# For GPU operations
spark.rapids.memory.pinnedPool.size 16g
Memory Tuning Guidelines
- Total Executor Memory = executor.memory + executor.memoryOverhead (see the sizing sketch after this list)
- Per GPU: allocate 1-2GB of host RAM per 1GB of GPU memory
- Driver Memory: 1/4 to 1/2 of executor memory
- Storage Fraction: 0.3-0.5 for caching-heavy workloads
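Applying those guidelines to one DGX A100 node, a quick sanity check (assuming one executor per GPU) that the figures above fit in 2TB of host RAM:

```python
# Per-node memory budget check for the settings above
executor_memory_gb = 128
overhead_gb = 32
executors_per_node = 8  # one per A100 GPU

used_gb = executors_per_node * (executor_memory_gb + overhead_gb)
print(f"{used_gb} GB of 2048 GB")  # 1280 GB of 2048 GB -> comfortable headroom
```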
CPU Configuration
Core Allocation
# Cores per executor
spark.executor.cores 8
# Total executor instances (for YARN/K8s)
spark.executor.instances 16
# Dynamic allocation
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 4
spark.dynamicAllocation.maxExecutors 32
spark.dynamicAllocation.initialExecutors 8
CPU Guidelines for DGX A100
- 128 CPU cores total
- Recommend 8-16 cores per executor
- Leave 8-16 cores for system/driver
- Total executors = (Total Cores - System Cores) / Cores per Executor (worked example below)
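Worked example of that formula for a DGX A100:

```python
# Executors per DGX A100 node at 8 cores each, reserving 16 for the system
total_cores = 128
system_cores = 16
cores_per_executor = 8

executors = (total_cores - system_cores) // cores_per_executor
print(executors)  # 14
```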
Storage Configuration
Shuffle Configuration
# Shuffle partitions
spark.sql.shuffle.partitions 400
# Note: spark.sql.adaptive.shuffle.targetPostShuffleInputSize is Spark 2.x syntax;
# Spark 3.x uses spark.sql.adaptive.advisoryPartitionSizeInBytes (see the AQE section)
# Shuffle behavior
spark.shuffle.service.enabled true
spark.shuffle.compress true
spark.shuffle.spill.compress true
# RAPIDS shuffle
spark.rapids.shuffle.mode MULTITHREADED
spark.rapids.shuffle.multiThreaded.reader.threads 20
spark.rapids.shuffle.multiThreaded.writer.threads 20
Storage Levels
# Storage level (including replication, e.g. MEMORY_AND_DISK_2) is chosen
# per-DataFrame in application code; Spark has no spark.storage.level or
# spark.storage.replication property. See the persist() sketch below.
# Proactively replace cached blocks lost with a failed executor
spark.storage.replication.proactive true
# spark.cleaner.ttl was removed in Spark 2.0; rely on the context cleaner,
# optionally forcing periodic GC so it can run
spark.cleaner.periodicGC.interval 30min
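A minimal PySpark sketch of setting a storage level in code, as referenced in the comment above:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000)
df.persist(StorageLevel.MEMORY_AND_DISK)  # or MEMORY_AND_DISK_2 for replication
df.count()  # first action materializes the cache
```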
Local Directories
# Use NVMe drives for temp data
spark.local.dir /mnt/nvme0/spark-temp,/mnt/nvme1/spark-temp
# I/O encryption (optional)
spark.io.encryption.enabled false
Network Configuration
Basic Network Settings
# Timeouts
spark.network.timeout 800s
spark.executor.heartbeatInterval 60s
spark.rpc.askTimeout 600s
# Connection settings
spark.rpc.numRetries 5
spark.rpc.retry.wait 5s
# Network buffer
spark.network.maxRemoteBlockSizeFetchToMem 512m
High-Performance Networking (InfiniBand)
# Enable RDMA for shuffle
spark.shuffle.io.preferDirectBufs true
spark.shuffle.io.numConnectionsPerPeer 2
# UCX settings (for InfiniBand)
spark.executorEnv.UCX_TLS rc,cuda_copy,cuda_ipc
spark.executorEnv.UCX_MEMTYPE_CACHE n
Parallelism Configuration
Default Parallelism
# Default parallelism (2-3x total cores)
spark.default.parallelism 256
# SQL shuffle partitions
spark.sql.shuffle.partitions 400
# Adaptive Query Execution
spark.sql.adaptive.enabled true
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.skewJoin.enabled true
Partition Size Guidelines
- Small partitions: < 100MB - Decrease parallelism (coalesce into fewer, larger partitions)
- Optimal: 128MB - 1GB per partition
- Large partitions: > 2GB - Increase parallelism (see the sizing sketch below)
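Those guidelines translate into a simple rule of thumb for picking a partition count; a sketch (the byte figures are assumptions, not measurements):

```python
import math

# Pick spark.sql.shuffle.partitions so each partition hits the sweet spot
shuffle_input_bytes = 100 * 1024**3     # assume ~100 GB shuffled
target_partition_bytes = 256 * 1024**2  # aim mid-range: 256 MB

partitions = math.ceil(shuffle_input_bytes / target_partition_bytes)
print(partitions)  # 400 -> matches spark.sql.shuffle.partitions above
```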
Catalyst Optimizer Settings
Query Optimization
# Broadcast join threshold
spark.sql.autoBroadcastJoinThreshold 100m
# File scan parallelism
spark.sql.files.maxPartitionBytes 134217728 # 128MB
spark.sql.files.openCostInBytes 4194304 # 4MB
# Bucketing
spark.sql.sources.bucketing.enabled true
spark.sql.sources.bucketing.autoBucketedScan.enabled true
Adaptive Query Execution (AQE)
# Enable AQE
spark.sql.adaptive.enabled true
# Coalesce partitions
spark.sql.adaptive.coalescePartitions.enabled true
spark.sql.adaptive.coalescePartitions.minPartitionSize 1m
spark.sql.adaptive.advisoryPartitionSizeInBytes 64m
# Skew join optimization
spark.sql.adaptive.skewJoin.enabled true
spark.sql.adaptive.skewJoin.skewedPartitionFactor 5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes 256m
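The two skew-join settings combine as documented by Spark: a partition is treated as skewed only if it is larger than factor times the median partition size and larger than the absolute threshold. A sketch of that predicate with the values configured above:

```python
# AQE skew test with the values configured above
def is_skewed(size_bytes, median_bytes, factor=5, threshold=256 * 1024**2):
    return size_bytes > factor * median_bytes and size_bytes > threshold

median = 128 * 1024**2
print(is_skewed(2 * 1024**3, median))    # True: 2 GB beats 640 MB and 256 MB
print(is_skewed(300 * 1024**2, median))  # False: above the floor, under 5x median
```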
Logging Configuration
Log Levels
Edit conf/log4j2.properties:
# Root logger level
rootLogger.level = INFO
# Reduce verbosity for specific packages
logger.spark.name = org.apache.spark
logger.spark.level = WARN
logger.rapids.name = com.nvidia.spark.rapids
logger.rapids.level = INFO
# Event logging (these are Spark properties; set them in spark-defaults.conf,
# not in log4j2.properties)
spark.eventLog.enabled true
spark.eventLog.dir /opt/spark/logs/events
spark.eventLog.compress true
Security Configuration
Authentication
# Enable authentication
spark.authenticate true
spark.authenticate.secret <secret-key>
# SSL/TLS
spark.ssl.enabled true
spark.ssl.keyStore /path/to/keystore
spark.ssl.keyStorePassword <password>
Encryption
# Network encryption
spark.network.crypto.enabled true
spark.network.crypto.saslFallback false
# Local storage encryption
spark.io.encryption.enabled true
Environment Variables
Edit conf/spark-env.sh:
#!/usr/bin/env bash
# Java options
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=8"
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
# Memory settings
export SPARK_WORKER_MEMORY=512g
export SPARK_DRIVER_MEMORY=64g
# GPU settings
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# Logging
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark
# Python
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
Application-Specific Configuration
Batch Processing
spark.executor.memory 128g
spark.executor.cores 16
spark.sql.shuffle.partitions 800
spark.default.parallelism 800
spark.rapids.sql.concurrentGpuTasks 4
Streaming
spark.executor.memory 64g
spark.executor.cores 8
# Backpressure and rate limits below apply to the legacy DStream API;
# Structured Streaming uses source options such as maxOffsetsPerTrigger
spark.streaming.backpressure.enabled true
spark.streaming.receiver.maxRate 10000
spark.streaming.kafka.maxRatePerPartition 10000
Machine Learning
spark.executor.memory 256g
spark.executor.cores 32
spark.rapids.ml.uvm.enabled true
spark.rapids.sql.enabled true
spark.task.resource.gpu.amount 0.25
Configuration Templates
Template 1: Development (Single Node)
spark.master local[*]
spark.driver.memory 32g
spark.executor.memory 64g
spark.rapids.sql.enabled true
spark.executor.resource.gpu.amount 1
Template 2: Production Cluster
spark.master spark://dgx-master:7077
spark.executor.instances 16
spark.executor.memory 128g
spark.executor.cores 16
spark.executor.resource.gpu.amount 1
spark.rapids.sql.enabled true
spark.dynamicAllocation.enabled true
Template 3: Kubernetes
spark.master k8s://https://kubernetes.default.svc:443
spark.kubernetes.container.image nvcr.io/nvidia/spark:24.08
spark.kubernetes.namespace spark
spark.executor.instances 32
spark.kubernetes.executor.request.cores 8
spark.kubernetes.executor.limit.cores 16
Validation
Test Configuration
# Validate configuration (spark-submit has no --dry-run flag; --verbose
# prints the effective Spark properties before the job runs)
spark-submit \
--verbose \
--master local[*] \
--conf spark.rapids.sql.enabled=true \
--conf spark.executor.resource.gpu.amount=1 \
--class org.apache.spark.examples.SparkPi \
$SPARK_HOME/examples/jars/spark-examples*.jar
Monitor Configuration
# Check active configuration
spark-shell --conf spark.rapids.sql.enabled=true
scala> spark.conf.getAll.foreach(println)
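The PySpark equivalent of the spark-shell check above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Print every explicitly set property, sorted by key
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, value)
```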
Next Steps
- Performance Tuning Guide - Advanced optimization techniques
- Troubleshooting Guide - Common issues and solutions
- Best Practices - Production recommendations