A clean TPC-DS benchmark suite for Apache Spark on Kubernetes with native TPC-DS 4.0 support.
This project helps you measure how fast your Spark cluster can process data. Think of it as a standardized speed test for your big data infrastructure.
The Problem: When you set up Apache Spark on Kubernetes (especially on AWS EKS), you want to know: "How fast is my cluster? Is it performing well? How does it compare to other configurations?" Without a standard test, it's hard to answer these questions.
The Solution: This project runs TPC-DS benchmarks - an industry-standard test suite that simulates a retail company's data warehouse. It runs 104 realistic SQL queries against large datasets (from 1GB to 100TB) and measures how long each query takes. This gives you concrete numbers to:
- Compare configurations: CPU vs GPU (RAPIDS), different instance types, memory settings
- Validate your setup: Ensure Spark is running efficiently on Kubernetes
- Track performance over time: Catch regressions after upgrades or config changes
- Make informed decisions: Choose the right infrastructure for your workload
How to use it:
- Build the Docker image (one command)
- Deploy to your Kubernetes cluster using the Spark Operator
- Point it at your S3 bucket for data and results
- Get a CSV report showing how long each query took
The benchmark runs automatically and produces easy-to-read results - no manual SQL execution required.
- TPC-DS 4.0 Native Support: Uses official TPC-DS 4.0 queries
- Kubernetes-Native: Designed for Spark Operator on Amazon EKS
- RAPIDS GPU Support: Compatible with NVIDIA RAPIDS acceleration
- Clean Codebase: No legacy dependencies, minimal and maintainable
The recommended way to use this project is through the Docker image, which includes all dependencies:
cd docker
docker build -f Dockerfile-spark352-rapids25-tpcds4-cuda12-9 -t spark-k8s-benchmarks:latest .

The Docker build automatically:
- Builds spark-sql-perf (TPC-DS library) from source
- Builds this benchmark application
- Includes TPC-DS toolkit (dsdgen/dsqgen)
- Configures RAPIDS GPU support
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: tpcds-benchmark
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: your-registry/spark-k8s-benchmarks:latest
  mainClass: com.k8s.spark.benchmark.BenchmarkSQL
  mainApplicationFile: local:///opt/spark/examples/jars/spark-k8s-benchmarks-assembly-1.0.0.jar
  arguments:
    - "s3://your-bucket/TPCDS-DATA/"     # dataDir
    - "s3://your-bucket/TPCDS-RESULTS/"  # resultLocation
    - "/usr/local/bin"                   # dsdgenDir
    - "parquet"                          # format
    - "1000"                             # scaleFactor
    - "3"                                # iterations
    - "false"                            # optimizeQueries
    - ""                                 # filterQueries (empty = all)
    - "false"                            # onlyWarn
  sparkVersion: "3.5.2"
  sparkConf:
    # ==================== Additional JARs ====================
    # spark-sql-perf jar needed for TPCDSTables class (not bundled in assembly)
    "spark.jars": "local:///opt/spark/examples/jars/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar"
  driver:
    cores: 4
    memory: "8g"
  executor:
    cores: 4
    instances: 4
    memory: "16g"

Runs TPC-DS benchmark queries:
spark-submit \
--class com.k8s.spark.benchmark.BenchmarkSQL \
spark-k8s-benchmarks-assembly-1.0.0.jar \
<dataDir> <resultLocation> <dsdgenDir> <format> <scaleFactor> \
<iterations> <optimizeQueries> <filterQueries> <onlyWarn>

Arguments:
| Argument | Description | Example |
|---|---|---|
| dataDir | S3/HDFS path to TPC-DS data | s3://bucket/tpcds/ |
| resultLocation | Output path for results | s3://bucket/results/ |
| dsdgenDir | Path to dsdgen binary | /usr/local/bin |
| format | Data format | parquet |
| scaleFactor | Dataset size in GB | 1000 |
| iterations | Benchmark iterations | 3 |
| optimizeQueries | Enable the cost-based optimizer (CBO) | false |
| filterQueries | Comma-separated list of queries to run; empty runs all | "" or "q1,q2,q3" |
| onlyWarn | Restrict logging to WARN level | false |
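Because all nine arguments are positional, two of them are easy to swap by accident. One way to guard against that is a small wrapper that names each value and prints the assembled command before anything is submitted. The sketch below is a dry run only: the variable names and the `build_benchmark_cmd` function are hypothetical and not part of this project, and the values are the examples from the table above.

```shell
#!/bin/sh
# Dry-run sketch: assemble the BenchmarkSQL arguments in the documented order.
# Variable names and build_benchmark_cmd are illustrative assumptions,
# not part of the project.
set -eu

DATA_DIR="s3://bucket/tpcds/"
RESULT_LOCATION="s3://bucket/results/"
DSDGEN_DIR="/usr/local/bin"
FORMAT="parquet"
SCALE_FACTOR="1000"
ITERATIONS="3"
OPTIMIZE_QUERIES="false"
FILTER_QUERIES=""   # empty string = run all queries
ONLY_WARN="false"

build_benchmark_cmd() {
  # Print the fixed part of the command first.
  printf 'spark-submit --class com.k8s.spark.benchmark.BenchmarkSQL spark-k8s-benchmarks-assembly-1.0.0.jar'
  # Append each positional argument in single quotes so that an empty
  # filterQueries value stays visible as ''.
  for arg in "$DATA_DIR" "$RESULT_LOCATION" "$DSDGEN_DIR" "$FORMAT" \
             "$SCALE_FACTOR" "$ITERATIONS" "$OPTIMIZE_QUERIES" \
             "$FILTER_QUERIES" "$ONLY_WARN"; do
    printf " '%s'" "$arg"
  done
  printf '\n'
}

build_benchmark_cmd
```

The printed line can be reviewed and then pasted into a shell. Quoting every argument keeps the empty `filterQueries` slot from silently disappearing, which would shift `onlyWarn` into the wrong position.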
Generates TPC-DS dataset:
spark-submit \
--class com.k8s.spark.benchmark.DataGeneration \
spark-k8s-benchmarks-assembly-1.0.0.jar \
<outputLocation> <dsdgenDir> <scaleFactor> <format> \
<overwrite> <partitionTables> <numPartitions>

spark-k8s-benchmarks/
├── build.sbt                      # SBT build configuration
├── project/
│   ├── build.properties           # SBT version
│   └── plugins.sbt                # sbt-assembly plugin
├── src/main/scala/
│   └── com/k8s/spark/benchmark/
│       ├── BenchmarkSQL.scala     # TPC-DS query runner
│       └── DataGeneration.scala   # Data generator
├── docker/
│   └── Dockerfile-spark352-...    # Multi-stage Docker build
├── lib/                           # Local dev only (see lib/README.md)
└── build-spark-sql-perf.sh        # Helper script for local dev
For local development/testing (not required for Docker builds):
# 1. Build spark-sql-perf dependency
./build-spark-sql-perf.sh
# 2. Build benchmark JAR
sbt clean assembly
# Output: target/scala-2.12/spark-k8s-benchmarks-assembly-1.0.0.jar

See lib/README.md for details.
| Component | Version |
|---|---|
| Scala | 2.12.18 |
| Spark | 3.5.2 |
| SBT | 1.9.9 |
| Java | 17 |
| spark-sql-perf | 0.5.1-SNAPSHOT (TPC-DS 4.0 fork) |
Results are written as:
- JSON: Detailed execution metrics per iteration
- CSV: Summary with median/min/max per query
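The CSV summary is simple enough to inspect with standard command-line tools. As a minimal sketch, assuming the header layout `queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds` used in the sample rows in this section, the hypothetical helpers below rank queries by median runtime and sum the medians for a rough total. The function names and the temporary demo file are illustrative and not part of this project.

```shell
#!/bin/sh
# Sketch: inspect a CSV summary with the header
# queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds.
# slowest_queries and total_median_seconds are hypothetical helpers.
set -eu

slowest_queries() {
  # $1 = path to the summary CSV, $2 = number of rows to show.
  # Skip the header, sort by column 2 (median) numerically, descending.
  tail -n +2 "$1" | sort -t, -k2,2nr | head -n "$2"
}

total_median_seconds() {
  # Sum column 2 (medianRuntimeSeconds) over all data rows.
  awk -F, 'NR > 1 { sum += $2 } END { printf "%.2f\n", sum }' "$1"
}

# Demo using the two sample rows from this README.
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.19,3.96,8.31
q2-v4.0,25.56,23.06,31.28
EOF

slowest_queries "$tmp" 1
total_median_seconds "$tmp"
rm -f "$tmp"
```

With the two sample rows, the demo prints the `q2-v4.0` row followed by `29.75`. Running the same helpers against the summaries from two different cluster configurations gives a quick first comparison before looking at the per-iteration JSON.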
queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.19,3.96,8.31
q2-v4.0,25.56,23.06,31.28

- spark-sql-perf (TPC-DS 4.0 fork)
- TPC-DS Toolkit v4.0
- Apache Spark Documentation
- NVIDIA RAPIDS Accelerator
Apache License 2.0
This project stands on the shoulders of giants. A big thank you to the maintainers and contributors of these upstream projects:
- spark-sql-perf by Databricks - The foundation for TPC-DS benchmarking in Spark
- spark-sql-perf TPC-DS 4.0 fork by @heyujiao99 - Added support for TPC-DS 4.0 queries
- TPC-DS Toolkit v4.0 by @heyujiao99 - Updated data generation tools for TPC-DS 4.0
- Apache Spark - The distributed computing engine that powers it all
- TPC - For creating and maintaining the TPC-DS benchmark standard
Your work makes projects like this possible.