GPU Acceleration
Complete guide to GPU acceleration in DBX using CUDA.
Table of contents
- Overview
- Requirements
- Installation
- Basic Usage
- GPU Operations
- Hash Strategies
- Sharding Strategies
- PTX Persistent Kernel
- CUDA Stream Management
- SQL Integration
- Performance Tuning
- Benchmarking
- Troubleshooting
- Advanced Features
- Best Practices
- Next Steps
Overview
DBX provides optional CUDA-based GPU acceleration for analytical queries, with speedups of roughly 2.8x to 4.6x on the 10M-row benchmarks below.
Performance Gains
| Operation | Dataset Size | CPU Time | GPU Time | Speedup |
|---|---|---|---|---|
| SUM | 1M rows | 456.66µs | 783.36µs | 0.58x |
| Filter (>500K) | 1M rows | 2.06ms | 673.38µs | 3.06x |
| SUM | 10M rows | 4.5ms | 1.2ms | 3.75x |
| Filter | 10M rows | 20ms | 4.4ms | 4.57x |
| GROUP BY | 10M rows | 35ms | 12ms | 2.92x |
| Hash Join | 10M rows | 50ms | 18ms | 2.78x |
Note: Gains depend on dataset size. At 1M rows, SUM runs slower on the GPU than on the CPU (0.58x), while every 10M-row operation in the table is faster on the GPU.
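The Speedup column is CPU time divided by GPU time, so values below 1.0 mean the GPU was slower. A minimal sketch of that arithmetic for the two SUM rows; the helper names `speedup` and `sum_speedups` are illustrative and not part of DBX:

```rust
use std::time::Duration;

/// Speedup as reported in the table: CPU time divided by GPU time.
/// Values below 1.0 mean the GPU run was slower than the CPU run.
fn speedup(cpu: Duration, gpu: Duration) -> f64 {
    cpu.as_secs_f64() / gpu.as_secs_f64()
}

/// The two SUM rows from the table, expressed as `Duration`s.
/// Returns (speedup at 1M rows, speedup at 10M rows).
fn sum_speedups() -> (f64, f64) {
    // 456.66µs on CPU vs 783.36µs on GPU
    let small = speedup(Duration::from_nanos(456_660), Duration::from_nanos(783_360));
    // 4.5ms on CPU vs 1.2ms on GPU
    let large = speedup(Duration::from_micros(4_500), Duration::from_micros(1_200));
    (small, large)
}
```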
Requirements
Hardware
- NVIDIA GPU with CUDA Compute Capability 6.0+
- Minimum 2GB VRAM (4GB+ recommended)
- PCIe 3.0 x16 or better
Software
- CUDA Toolkit 12.x or later
- NVIDIA Driver 525.60.13+ (Linux) or 528.33+ (Windows)
- Rust 1.70+ with CUDA support
Installation
1. Install CUDA Toolkit
Linux:
# Ubuntu/Debian
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-12-3
Windows: Download and install from NVIDIA CUDA Downloads
2. Verify CUDA Installation
nvcc --version
nvidia-smi
3. Enable GPU Features in Cargo.toml
[dependencies]
dbx-core = { version = "0.0.6-beta", features = ["gpu"] }
4. Build with GPU Support
cargo build --features gpu --release
Basic Usage
Initialize GPU Manager
use dbx_core::Database;
fn main() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// GPU manager is automatically initialized if available
if let Some(gpu) = db.gpu_manager() {
println!("GPU acceleration available!");
} else {
println!("GPU not available, using CPU");
}
Ok(())
}
Sync Data to GPU
use dbx_core::Database;
fn main() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// ... register table with data ...
// Sync table to GPU cache
db.sync_gpu_cache("users")?;
Ok(())
}
GPU Operations
Aggregations
SUM
use dbx_core::Database;
fn main() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// ... register table ...
db.sync_gpu_cache("orders")?;
if let Some(gpu) = db.gpu_manager() {
// GPU-accelerated SUM
let total = gpu.sum("orders", "amount")?;
println!("Total: {}", total);
}
Ok(())
}
COUNT
if let Some(gpu) = db.gpu_manager() {
let count = gpu.count("users")?;
println!("Count: {}", count);
}
MIN / MAX
if let Some(gpu) = db.gpu_manager() {
let min_age = gpu.min("users", "age")?;
let max_age = gpu.max("users", "age")?;
println!("Age range: {} - {}", min_age, max_age);
}
AVG
if let Some(gpu) = db.gpu_manager() {
let avg_price = gpu.avg("products", "price")?;
println!("Average price: {:.2}", avg_price);
}
Filtering
Greater Than
if let Some(gpu) = db.gpu_manager() {
// Filter rows where age > 30
let filtered = gpu.filter_gt("users", "age", 30)?;
println!("Found {} users", filtered.len());
}
Less Than
let filtered = gpu.filter_lt("products", "price", 100.0)?;
Equal
let filtered = gpu.filter_eq("users", "status", "active")?;
Range
let filtered = gpu.filter_range("orders", "amount", 100.0, 1000.0)?;
GROUP BY
use dbx_core::Database;
fn main() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// ... register table ...
db.sync_gpu_cache("orders")?;
if let Some(gpu) = db.gpu_manager() {
// GROUP BY city, SUM(amount)
let results = gpu.group_by_sum("orders", "city", "amount")?;
for (city, total) in results {
println!("{}: {}", city, total);
}
}
Ok(())
}
Hash Join
if let Some(gpu) = db.gpu_manager() {
// Hash join users and orders
let results = gpu.hash_join(
"users", "id",
"orders", "user_id"
)?;
}
Hash Strategies
DBX supports three GPU hash strategies for different performance characteristics:
Linear Probing (Default)
Characteristics:
- Stable performance
- Low memory overhead
- Best for small to medium groups
Usage:
use dbx_core::gpu::HashStrategy;
db.set_gpu_hash_strategy(HashStrategy::Linear)?;
Performance:
- Baseline performance
- Consistent across workloads
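To make the probing scheme concrete, here is a CPU-side sketch of linear probing used to accumulate a per-key sum, as a GROUP BY would. It is an illustration of the technique only, not DBX's GPU kernel; `LinearTable` and its methods are hypothetical names, and the hash is simply the key modulo the capacity.

```rust
/// Fixed-capacity open-addressing table that keeps a running sum per key.
struct LinearTable {
    slots: Vec<Option<(u64, i64)>>,
}

impl LinearTable {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { slots: vec![None; capacity] }
    }

    /// Adds `value` to the sum stored for `key`.
    /// Returns false if the table is full and `key` is not present.
    fn add(&mut self, key: u64, value: i64) -> bool {
        let cap = self.slots.len();
        let start = (key % cap as u64) as usize;
        for step in 0..cap {
            // On a collision, try the next slot, wrapping at the end.
            let idx = (start + step) % cap;
            match self.slots[idx] {
                Some((k, sum)) if k == key => {
                    self.slots[idx] = Some((k, sum + value));
                    return true;
                }
                Some(_) => {}
                None => {
                    self.slots[idx] = Some((key, value));
                    return true;
                }
            }
        }
        false
    }

    /// Looks up the sum for `key`, following the same probe sequence.
    fn get(&self, key: u64) -> Option<i64> {
        let cap = self.slots.len();
        let start = (key % cap as u64) as usize;
        for step in 0..cap {
            match self.slots[(start + step) % cap] {
                Some((k, sum)) if k == key => return Some(sum),
                Some(_) => {}
                // An empty slot ends the probe sequence: the key is absent.
                None => return None,
            }
        }
        None
    }
}
```

The table stores nothing beyond the slots themselves, which is consistent with the low memory overhead listed above; the cost is that long runs of occupied slots lengthen every probe.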
Cuckoo Hashing
Characteristics:
- Aggressive performance
- Higher memory overhead
- Best for large datasets with high collision rates
Usage:
use dbx_core::gpu::HashStrategy;
db.set_gpu_hash_strategy(HashStrategy::Cuckoo)?;
Performance:
- SUM: 73% faster than Linear
- Filtering: 32% faster than Linear
- GROUP BY: 45% faster than Linear
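Cuckoo hashing gives each key two candidate slots, one per table, so a lookup inspects at most two locations; inserts pay for this by evicting and relocating occupants. The sketch below shows the idea on the CPU under simplifying assumptions: `CuckooSet` is a hypothetical type, the two hash functions are deliberately simple, and a real table would rehash or grow instead of returning an error. It is not DBX's kernel code.

```rust
/// Two-table cuckoo hash set of `u64` keys with a bounded eviction chain.
struct CuckooSet {
    tables: [Vec<Option<u64>>; 2],
}

impl CuckooSet {
    fn new(capacity_per_table: usize) -> Self {
        assert!(capacity_per_table > 0, "capacity must be non-zero");
        Self {
            tables: [vec![None; capacity_per_table], vec![None; capacity_per_table]],
        }
    }

    /// Slot of `key` in table `which` (0 or 1). Each table uses its own hash.
    fn slot(&self, which: usize, key: u64) -> usize {
        let cap = self.tables[which].len() as u64;
        let hash = if which == 0 {
            key
        } else {
            key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32
        };
        (hash % cap) as usize
    }

    /// At most two probes: one candidate slot per table.
    fn contains(&self, key: u64) -> bool {
        (0..2).any(|which| self.tables[which][self.slot(which, key)] == Some(key))
    }

    /// Inserts `key`, moving evicted keys to their slot in the other table.
    /// Returns `Err(homeless_key)` if the chain exceeds `max_kicks`.
    fn insert(&mut self, key: u64, max_kicks: usize) -> Result<(), u64> {
        if self.contains(key) {
            return Ok(());
        }
        let mut current = key;
        let mut which = 0;
        for _ in 0..=max_kicks {
            let idx = self.slot(which, current);
            match self.tables[which][idx].replace(current) {
                None => return Ok(()),
                Some(evicted) => {
                    // The previous occupant must move to its other table.
                    current = evicted;
                    which = 1 - which;
                }
            }
        }
        Err(current)
    }
}
```

Keeping two tables with spare capacity is one source of the higher memory overhead noted above.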
Robin Hood Hashing
Characteristics:
- Balanced performance
- Moderate memory overhead
- Best for general-purpose workloads
Usage:
use dbx_core::gpu::HashStrategy;
db.set_gpu_hash_strategy(HashStrategy::RobinHood)?;
Performance:
- SUM: 7% faster than Linear
- Filtering: 10% faster than Linear
- GROUP BY: 12% faster than Linear
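Robin Hood hashing is linear probing with one extra rule: an incoming key that has travelled further from its home slot than the resident takes the resident's place, and the resident continues probing. This evens out probe lengths. A CPU-side sketch of the rule, assuming a hypothetical `RobinHoodSet` type with a key-modulo-capacity hash (not DBX's implementation):

```rust
/// Open-addressing set of `u64` keys using Robin Hood displacement.
struct RobinHoodSet {
    slots: Vec<Option<u64>>,
    len: usize,
}

impl RobinHoodSet {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { slots: vec![None; capacity], len: 0 }
    }

    fn home(&self, key: u64) -> usize {
        (key % self.slots.len() as u64) as usize
    }

    /// How far the entry stored at `idx` sits from its home slot.
    fn distance(&self, key: u64, idx: usize) -> usize {
        let cap = self.slots.len();
        (idx + cap - self.home(key)) % cap
    }

    fn contains(&self, key: u64) -> bool {
        let cap = self.slots.len();
        let mut idx = self.home(key);
        for dist in 0..cap {
            match self.slots[idx] {
                None => return false,
                Some(resident) if resident == key => return true,
                // A resident closer to home than we are would have been
                // displaced by `key`, so `key` cannot be further along.
                Some(resident) if self.distance(resident, idx) < dist => return false,
                Some(_) => {}
            }
            idx = (idx + 1) % cap;
        }
        false
    }

    /// Returns false only when the table is full and `key` is absent.
    fn insert(&mut self, key: u64) -> bool {
        if self.contains(key) {
            return true;
        }
        let cap = self.slots.len();
        if self.len == cap {
            return false;
        }
        let (mut current, mut idx, mut dist) = (key, self.home(key), 0);
        loop {
            match self.slots[idx] {
                None => {
                    self.slots[idx] = Some(current);
                    self.len += 1;
                    return true;
                }
                Some(resident) => {
                    let resident_dist = self.distance(resident, idx);
                    if resident_dist < dist {
                        // Take the slot; the resident carries on probing.
                        self.slots[idx] = Some(current);
                        current = resident;
                        dist = resident_dist;
                    }
                }
            }
            idx = (idx + 1) % cap;
            dist += 1;
        }
    }
}
```

With capacity 8, inserting 0, 1 and then 8 leaves 8 in slot 1 and moves 1 to slot 2, whereas plain linear probing would have left 1 in place and put 8 in slot 2.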
Strategy Selection Guide
| Workload | Recommended Strategy | Reason |
|---|---|---|
| Small datasets (<1M rows) | Linear | Lower overhead |
| Large datasets (>10M rows) | Cuckoo | Maximum performance |
| Mixed workloads | Robin Hood | Balanced |
| Memory-constrained | Linear | Lowest memory usage |
| High collision rate | Cuckoo | Best collision handling |
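The guide above can be encoded as a small helper. This is a sketch only: the local `Strategy` enum stands in for `HashStrategy`, the precedence between rows (memory constraints first) is an assumption, and the table does not cover 1M to 10M rows, so that range falls through to Robin Hood here.

```rust
/// Local stand-in for `dbx_core::gpu::HashStrategy`, so the sketch compiles alone.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Strategy {
    Linear,
    Cuckoo,
    RobinHood,
}

/// Hypothetical helper that applies the selection guide row by row.
fn recommend(row_count: u64, memory_constrained: bool, high_collision_rate: bool) -> Strategy {
    // Assumption: a memory constraint overrides every other consideration.
    if memory_constrained {
        return Strategy::Linear;
    }
    if high_collision_rate || row_count > 10_000_000 {
        return Strategy::Cuckoo;
    }
    if row_count < 1_000_000 {
        return Strategy::Linear;
    }
    // Not covered by the guide: treated as a mixed workload.
    Strategy::RobinHood
}
```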
Sharding Strategies
For multi-GPU environments, DBX provides three sharding strategies to distribute data across devices:
| Strategy | Behavior | Recommended For |
|---|---|---|
| RoundRobin | Distributes rows sequentially | Balanced workloads |
| Hash | Hash-based distribution on first column (ahash) | GROUP BY, JOIN queries |
| Range | Assigns contiguous row ranges | Sorted data, range scans |
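// ShardManager is assumed to be exported from the same module as ShardingStrategy.
use dbx_core::storage::gpu::{ShardManager, ShardingStrategy};
// One shard per GPU; `device_count` and `batch` come from the surrounding code.
let manager = ShardManager::new(device_count, ShardingStrategy::Hash);
let shards = manager.shard_batch(&batch)?;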
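The three strategies differ only in how a row is mapped to a shard index. The following standalone sketch shows one plausible mapping for each; `Sharding` and `assign` are hypothetical names, and the standard library's `DefaultHasher` is used in place of ahash so the example has no dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Local stand-in for the three strategies in the table.
#[derive(Clone, Copy)]
enum Sharding {
    RoundRobin,
    Hash,
    Range,
}

/// Returns the shard index for each row, given the row's first-column key.
fn assign(strategy: Sharding, keys: &[&str], shards: usize) -> Vec<usize> {
    assert!(shards > 0, "need at least one shard");
    // Rows per shard for Range, rounded up so every row gets a valid shard.
    let per_shard = ((keys.len() + shards - 1) / shards).max(1);
    keys.iter()
        .enumerate()
        .map(|(row, key)| match strategy {
            // Consecutive rows go to consecutive shards.
            Sharding::RoundRobin => row % shards,
            // Equal keys always land on the same shard.
            Sharding::Hash => {
                let mut hasher = DefaultHasher::new();
                key.hash(&mut hasher);
                (hasher.finish() % shards as u64) as usize
            }
            // Each shard receives one contiguous block of rows.
            Sharding::Range => row / per_shard,
        })
        .collect()
}
```

Hash sharding keeps all rows with the same key on one device, which is why the table recommends it for GROUP BY and JOIN: each shard can be aggregated or probed locally.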
PTX Persistent Kernel
Uses NVRTC to compile CUDA C kernels to PTX at runtime. The kernel persists on GPU, continuously processing work queue items until shutdown.
use dbx_core::storage::gpu::persistent::PersistentKernelManager;
let manager = PersistentKernelManager::new(device.clone());
manager.compile_kernel()?;
if let Some(func) = manager.get_kernel_function() {
// Execute kernel
}
Note: Only available with the gpu feature enabled. As of cudarc 0.19.2, Unified Memory and P2P access are not supported; host memory with explicit transfers is used instead.
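The persistent-kernel pattern (start once, then drain a work queue until told to stop) can be illustrated with a CPU analogy. In this sketch a long-lived thread plays the role of the resident kernel and a channel plays the role of the work queue; `Work` and `spawn_persistent_worker` are hypothetical names, and none of this reflects how `PersistentKernelManager` is implemented.

```rust
use std::sync::mpsc;
use std::thread;

/// Items placed on the work queue.
enum Work {
    Sum(Vec<i64>),
    Shutdown,
}

/// Starts one long-lived worker and returns the queue sender, the result
/// receiver, and a handle whose result is the number of items processed.
fn spawn_persistent_worker() -> (mpsc::Sender<Work>, mpsc::Receiver<i64>, thread::JoinHandle<usize>) {
    let (work_tx, work_rx) = mpsc::channel::<Work>();
    let (result_tx, result_rx) = mpsc::channel::<i64>();
    let handle = thread::spawn(move || {
        let mut processed = 0;
        // Started once, this loop keeps pulling from the queue instead of
        // being relaunched for every item.
        for item in work_rx {
            match item {
                Work::Sum(values) => {
                    result_tx.send(values.iter().sum()).expect("result receiver dropped");
                    processed += 1;
                }
                Work::Shutdown => break,
            }
        }
        processed
    });
    (work_tx, result_rx, handle)
}
```

The benefit being modelled is that per-item launch cost is paid once at startup rather than once per work item.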
CUDA Stream Management
Create separate streams for parallel GPU operations via fork_default_stream():
use dbx_core::engine::stream::GpuStreamContext;
let ctx = GpuStreamContext::new(device.clone())?;
// Execute async GPU work on separate stream
SQL Integration
GPU acceleration is automatically used for compatible SQL operations:
use dbx_core::Database;
fn main() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// ... register table ...
db.sync_gpu_cache("orders")?;
// These operations automatically use GPU if available:
// 1. Aggregations
let results = db.execute_sql(
"SELECT SUM(amount), AVG(amount) FROM orders"
)?;
// 2. Filtering
let results = db.execute_sql(
"SELECT * FROM orders WHERE amount > 1000"
)?;
// 3. GROUP BY
let results = db.execute_sql(
"SELECT city, SUM(amount) FROM orders GROUP BY city"
)?;
// 4. Joins
let results = db.execute_sql(
"SELECT u.name, o.amount
FROM users u
JOIN orders o ON u.id = o.user_id"
)?;
Ok(())
}
Performance Tuning
Memory Management
Allocate Sufficient VRAM
use dbx_core::gpu::GpuConfig;
let config = GpuConfig::default()
.max_memory_mb(2048) // 2GB VRAM
.cache_size(1000000); // 1M records
db.configure_gpu(config)?;
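When picking a value for max_memory_mb, a rough lower bound on the memory a column needs is rows times bytes per value. The sketch below does that arithmetic; the helper names are illustrative, 1 MB is taken as 1024 * 1024 bytes (matching 2048 MB = 2GB above), and the estimate ignores hash tables and intermediate buffers, so real usage will be higher.

```rust
const BYTES_PER_MB: u64 = 1024 * 1024;

/// Raw size of one fixed-width column, in bytes.
fn column_bytes(rows: u64, bytes_per_value: u64) -> u64 {
    rows * bytes_per_value
}

/// Whether the raw column data alone fits within a budget given in MB.
fn fits_budget(rows: u64, bytes_per_value: u64, budget_mb: u64) -> bool {
    column_bytes(rows, bytes_per_value) <= budget_mb * BYTES_PER_MB
}
```

For example, 10M rows of 8-byte values occupy 80,000,000 bytes (about 76 MB), well inside a 2048 MB budget, while 1 billion such rows would not fit.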
Monitor GPU Memory
if let Some(gpu) = db.gpu_manager() {
let stats = gpu.memory_stats()?;
println!("Used: {} MB / {} MB", stats.used_mb, stats.total_mb);
}
Batch Size Optimization
let config = GpuConfig::default()
.batch_size(10000); // Process 10k rows per batch
db.configure_gpu(config)?;
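To see what the batch size controls, the following CPU-side sketch processes a column in fixed-size batches and reports how many batches were needed. `batched_sum` is a hypothetical helper, not a DBX function; on a GPU each batch would correspond to one transfer and kernel launch, so fewer, larger batches mean less per-batch overhead at the cost of more memory per batch.

```rust
/// Sums `values` in batches of `batch_size` rows.
/// Returns the total and the number of batches processed.
fn batched_sum(values: &[i64], batch_size: usize) -> (i64, usize) {
    assert!(batch_size > 0, "batch size must be non-zero");
    let mut total = 0;
    let mut batches = 0;
    // `chunks` yields full batches followed by one shorter final batch.
    for batch in values.chunks(batch_size) {
        total += batch.iter().sum::<i64>();
        batches += 1;
    }
    (total, batches)
}
```

With a batch size of 10,000, a 25,000-row column is processed as three batches, the last one half full.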
Data Transfer Optimization
Minimize CPU-GPU Transfers
// Good: Sync once, query multiple times
db.sync_gpu_cache("orders")?;
for i in 0..100 {
let results = gpu.filter_gt("orders", "amount", i * 100)?;
}
// Avoid: Sync on every query
for i in 0..100 {
db.sync_gpu_cache("orders")?; // Too frequent!
let results = gpu.filter_gt("orders", "amount", i * 100)?;
}
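One way to make the "sync once" rule hard to violate is to route every sync through a small guard that remembers which tables are already on the GPU. This is a sketch under assumptions: `SyncTracker` is a hypothetical type, and the sync itself is passed in as a closure so the example does not depend on DBX.

```rust
use std::collections::HashSet;

/// Remembers which tables have been synced and counts actual transfers.
struct SyncTracker {
    synced: HashSet<String>,
    transfers: usize,
}

impl SyncTracker {
    fn new() -> Self {
        Self { synced: HashSet::new(), transfers: 0 }
    }

    /// Runs `sync` only if `table` has not been synced yet.
    /// Returns Ok(true) when a transfer actually happened.
    fn ensure_synced<E>(
        &mut self,
        table: &str,
        sync: impl FnOnce(&str) -> Result<(), E>,
    ) -> Result<bool, E> {
        if self.synced.contains(table) {
            return Ok(false);
        }
        sync(table)?;
        self.synced.insert(table.to_string());
        self.transfers += 1;
        Ok(true)
    }

    /// Call after the host-side table changes so the next query re-syncs.
    fn invalidate(&mut self, table: &str) {
        self.synced.remove(table);
    }
}
```

In DBX code the closure would presumably wrap the existing call, for example `|t| db.sync_gpu_cache(t)`.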
Async Transfers
// Async transfer (non-blocking)
db.sync_gpu_cache_async("orders")?;
// Continue with other work
// ...
// Wait for completion
db.wait_gpu_sync()?;
Benchmarking
Run GPU Benchmarks
cd testing/benchmarks
cargo bench --features gpu
Custom Benchmarks
use dbx_core::Database;
use std::time::Instant;
fn benchmark_gpu_vs_cpu() -> dbx_core::DbxResult<()> {
let db = Database::open_in_memory()?;
// ... register large dataset ...
// CPU benchmark
let start = Instant::now();
let cpu_result = db.execute_sql("SELECT SUM(amount) FROM orders")?;
let cpu_time = start.elapsed();
// GPU benchmark
db.sync_gpu_cache("orders")?;
let start = Instant::now();
let gpu_result = db.gpu_manager().unwrap().sum("orders", "amount")?;
let gpu_time = start.elapsed();
println!("CPU: {:?}", cpu_time);
println!("GPU: {:?}", gpu_time);
println!("Speedup: {:.2}x", cpu_time.as_secs_f64() / gpu_time.as_secs_f64());
Ok(())
}
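The benchmark above times a single run of each path, which makes it sensitive to one-off costs such as first-use initialization. A common refinement is to discard a few warm-up runs and report the median of several timed runs. The helper below is a generic sketch using only the standard library; `median_time` is a hypothetical name and is not part of DBX.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `warmup` times untimed, then `runs` times timed,
/// and returns the median sample (the upper median for even `runs`).
fn median_time<F: FnMut()>(warmup: usize, runs: usize, mut f: F) -> Duration {
    assert!(runs > 0, "need at least one timed run");
    for _ in 0..warmup {
        f();
    }
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect();
    samples.sort();
    samples[samples.len() / 2]
}
```

Each path in the benchmark would then be wrapped in a closure and passed to `median_time` instead of being timed once.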
Troubleshooting
GPU Not Detected
Problem: gpu_manager() returns None
Solutions:
- Verify CUDA installation: nvcc --version
- Check NVIDIA driver: nvidia-smi
- Rebuild with GPU features: cargo build --features gpu
- Check GPU compatibility (Compute Capability 6.0+)
Out of Memory Errors
Problem: CudaError: out of memory
Solutions:
- Reduce batch size:
let config = GpuConfig::default().batch_size(5000);
db.configure_gpu(config)?;
- Clear GPU cache:
db.clear_gpu_cache()?;
- Use smaller datasets or split queries
Slow Performance
Problem: GPU slower than CPU
Possible Causes:
- Dataset too small - GPU overhead dominates
- Frequent CPU-GPU transfers - Minimize syncs
- Wrong hash strategy - Try Cuckoo for large datasets
Solutions:
// Use GPU only for large datasets
if row_count > 1_000_000 {
db.sync_gpu_cache("table")?;
// Use GPU operations
} else {
// Use CPU operations
}
Advanced Features
Custom CUDA Kernels
For advanced users, DBX allows custom CUDA kernels:
use dbx_core::gpu::CudaKernel;
let kernel = CudaKernel::from_source(r#"
__global__ void custom_filter(int* data, int* result, int threshold, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
result[idx] = data[idx] > threshold ? 1 : 0;
}
}
"#)?;
db.register_kernel("custom_filter", kernel)?;
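When writing a custom kernel, it helps to have a CPU reference to compare the GPU output against, and to compute the launch grid explicitly. The sketch below mirrors the `custom_filter` kernel's logic in plain Rust; both function names are illustrative, and how DBX actually launches a registered kernel is not shown in this guide.

```rust
/// CPU reference for `custom_filter`: 1 where the value exceeds the
/// threshold, 0 otherwise, matching the kernel's per-element result.
fn custom_filter_reference(data: &[i32], threshold: i32) -> Vec<i32> {
    data.iter()
        .map(|&value| if value > threshold { 1 } else { 0 })
        .collect()
}

/// Number of blocks needed so that blocks * block_size covers all `n` elements.
fn grid_size(n: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block size must be non-zero");
    (n + block_size - 1) / block_size
}
```

Because the grid is rounded up, the last block usually contains threads whose index is past the end of the data, which is why the kernel checks `idx < n` before touching memory.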
Multi-GPU Support
use dbx_core::gpu::GpuConfig;
let config = GpuConfig::default()
.device_ids(vec![0, 1, 2, 3]); // Use 4 GPUs
db.configure_gpu(config)?;
Best Practices
1. Use GPU for Large Datasets
// Good: Large dataset (>1M rows)
if row_count > 1_000_000 {
db.sync_gpu_cache("table")?;
let result = gpu.sum("table", "column")?;
}
// Avoid: Small dataset
if row_count < 10_000 {
// CPU is faster for small datasets
}
2. Batch GPU Operations
// Good: Batch multiple operations
db.sync_gpu_cache("orders")?;
let sum = gpu.sum("orders", "amount")?;
let avg = gpu.avg("orders", "amount")?;
let count = gpu.count("orders")?;
// Avoid: Sync for each operation
db.sync_gpu_cache("orders")?;
let sum = gpu.sum("orders", "amount")?;
db.sync_gpu_cache("orders")?; // Redundant!
let avg = gpu.avg("orders", "amount")?;
3. Choose Appropriate Hash Strategy
// Large dataset with many groups
db.set_gpu_hash_strategy(HashStrategy::Cuckoo)?;
// General-purpose workload
db.set_gpu_hash_strategy(HashStrategy::RobinHood)?;
4. Monitor GPU Utilization
if let Some(gpu) = db.gpu_manager() {
let stats = gpu.utilization_stats()?;
println!("GPU Utilization: {}%", stats.utilization);
println!("Memory Used: {} MB", stats.memory_used_mb);
}
Next Steps
- SQL Reference — Use GPU with SQL queries
- Storage Layers — Understand data flow
- Performance Benchmarks — Optimize GPU performance
- Benchmarks — See detailed performance comparisons