Spark K8s Benchmarks

A clean TPC-DS benchmark suite for Apache Spark on Kubernetes with native TPC-DS 4.0 support.

What is this project?

This project helps you measure how fast your Spark cluster can process data. Think of it as a standardized speed test for your big data infrastructure.

The Problem: When you set up Apache Spark on Kubernetes (especially on Amazon EKS), you want to know: "How fast is my cluster? Is it performing well? How does it compare to other configurations?" Without a standard test, these questions are hard to answer.

The Solution: This project runs TPC-DS benchmarks - an industry-standard test suite that simulates a retail company's data warehouse. It runs 104 realistic SQL queries against large datasets (from 1GB to 100TB) and measures how long each query takes. This gives you concrete numbers to:

Compare configurations: CPU vs GPU (RAPIDS), different instance types, memory settings
Validate your setup: Ensure Spark is running efficiently on Kubernetes
Track performance over time: Catch regressions after upgrades or config changes
Make informed decisions: Choose the right infrastructure for your workload

How to use it:

1. Build the Docker image (one command)
2. Deploy to your Kubernetes cluster using the Spark Operator
3. Point it at your S3 bucket for data and results
4. Get a CSV report showing how long each query took

The benchmark runs automatically and produces easy-to-read results - no manual SQL execution required.

Features

TPC-DS 4.0 Native Support: Uses official TPC-DS 4.0 queries
Kubernetes-Native: Designed for Spark Operator on Amazon EKS
RAPIDS GPU Support: Compatible with NVIDIA RAPIDS acceleration
Clean Codebase: No legacy dependencies, minimal and maintainable

Quick Start

Build Docker Image

The recommended way to use this project is via the Docker image, which includes all dependencies:

cd docker
docker build -f Dockerfile-spark352-rapids25-tpcds4-cuda12-9 -t spark-k8s-benchmarks:latest .

The Docker build automatically:

Builds spark-sql-perf (TPC-DS library) from source
Builds this benchmark application
Includes TPC-DS toolkit (dsdgen/dsqgen)
Configures RAPIDS GPU support

Deploy on Kubernetes

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: tpcds-benchmark
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: your-registry/spark-k8s-benchmarks:latest
  mainClass: com.k8s.spark.benchmark.BenchmarkSQL
  mainApplicationFile: local:///opt/spark/examples/jars/spark-k8s-benchmarks-assembly-1.0.0.jar
  arguments:
    - "s3://your-bucket/TPCDS-DATA/"      # dataDir
    - "s3://your-bucket/TPCDS-RESULTS/"   # resultLocation
    - "/usr/local/bin"                     # dsdgenDir
    - "parquet"                            # format
    - "1000"                               # scaleFactor
    - "3"                                  # iterations
    - "false"                              # optimizeQueries
    - ""                                   # filterQueries (empty = all)
    - "false"                              # onlyWarn
  sparkVersion: "3.5.2"
  sparkConf:
    # ==================== Additional JARs ====================
    # spark-sql-perf jar needed for TPCDSTables class (not bundled in assembly)
    "spark.jars": "local:///opt/spark/examples/jars/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar"
  driver:
    cores: 4
    memory: "8g"
  executor:
    cores: 4
    instances: 4
    memory: "16g"

Main Classes

BenchmarkSQL

Runs TPC-DS benchmark queries:

spark-submit \
  --class com.k8s.spark.benchmark.BenchmarkSQL \
  spark-k8s-benchmarks-assembly-1.0.0.jar \
  <dataDir> <resultLocation> <dsdgenDir> <format> <scaleFactor> \
  <iterations> <optimizeQueries> <filterQueries> <onlyWarn>

Arguments:

Argument	Description	Example
dataDir	S3/HDFS path to TPC-DS data	`s3://bucket/tpcds/`
resultLocation	Output path for results	`s3://bucket/results/`
dsdgenDir	Path to dsdgen binary	`/usr/local/bin`
format	Data format	`parquet`
scaleFactor	Dataset size in GB	`1000`
iterations	Benchmark iterations	`3`
optimizeQueries	Enable Spark's cost-based optimizer (CBO)	`false`
filterQueries	Comma-separated query names to run; empty runs all queries	`""` or `"q1,q2,q3"`
onlyWarn	Limit logging to WARN level	`false`
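
Because all nine arguments are positional, a misplaced value is easy to miss. The following is a minimal Python sketch, not part of this project (which is written in Scala), that checks an argument list against the order in the table before it is submitted. The `BenchmarkArgs` class, the helper names, and the type conversions (integers for scaleFactor and iterations, `true`/`false` strings for the flags) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Positional order taken from the arguments table above.
ARG_NAMES = [
    "dataDir", "resultLocation", "dsdgenDir", "format", "scaleFactor",
    "iterations", "optimizeQueries", "filterQueries", "onlyWarn",
]


@dataclass
class BenchmarkArgs:
    """Hypothetical container for the nine BenchmarkSQL arguments."""
    data_dir: str
    result_location: str
    dsdgen_dir: str
    data_format: str
    scale_factor: int
    iterations: int
    optimize_queries: bool
    filter_queries: List[str]
    only_warn: bool


def _to_bool(name: str, value: str) -> bool:
    # Assumption: flags are passed as the strings "true" or "false".
    lowered = value.strip().lower()
    if lowered not in ("true", "false"):
        raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
    return lowered == "true"


def parse_benchmark_args(argv: List[str]) -> BenchmarkArgs:
    """Validate a positional argument list and give each value a name."""
    if len(argv) != len(ARG_NAMES):
        raise ValueError(
            f"expected {len(ARG_NAMES)} arguments ({' '.join(ARG_NAMES)}), "
            f"got {len(argv)}"
        )
    (data_dir, result_location, dsdgen_dir, data_format,
     scale_factor, iterations, optimize, query_filter, only_warn) = argv
    # An empty filter means "run every query".
    queries = [q.strip() for q in query_filter.split(",") if q.strip()]
    return BenchmarkArgs(
        data_dir=data_dir,
        result_location=result_location,
        dsdgen_dir=dsdgen_dir,
        data_format=data_format,
        scale_factor=int(scale_factor),
        iterations=int(iterations),
        optimize_queries=_to_bool("optimizeQueries", optimize),
        filter_queries=queries,
        only_warn=_to_bool("onlyWarn", only_warn),
    )


# Values copied from the SparkApplication example in this README.
example = parse_benchmark_args([
    "s3://your-bucket/TPCDS-DATA/", "s3://your-bucket/TPCDS-RESULTS/",
    "/usr/local/bin", "parquet", "1000", "3", "false", "", "false",
])
print(example)
```

A check like this could run in a deployment script before the manifest is applied, so that a swapped scaleFactor and iterations value fails fast instead of starting a long benchmark run.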

DataGeneration

Generates TPC-DS dataset:

spark-submit \
  --class com.k8s.spark.benchmark.DataGeneration \
  spark-k8s-benchmarks-assembly-1.0.0.jar \
  <outputLocation> <dsdgenDir> <scaleFactor> <format> \
  <overwrite> <partitionTables> <numPartitions>
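
The argument order for DataGeneration differs from BenchmarkSQL (scaleFactor comes before format here), so assembling the command from named values can help avoid mix-ups. The sketch below is an illustrative Python helper, not part of this project. The function name, the example values for overwrite, partitionTables, and numPartitions, and the `true`/`false` rendering of the flags are assumptions; this README does not document the accepted values for those arguments.

```python
import shlex
from typing import List


def build_datagen_command(
    jar_path: str,
    output_location: str,
    dsdgen_dir: str,
    scale_factor: int,
    data_format: str,
    overwrite: bool,
    partition_tables: bool,
    num_partitions: int,
) -> List[str]:
    """Assemble the spark-submit argument list in the documented order."""

    def flag(value: bool) -> str:
        # Assumption: boolean arguments are passed as "true"/"false".
        return "true" if value else "false"

    return [
        "spark-submit",
        "--class", "com.k8s.spark.benchmark.DataGeneration",
        jar_path,
        # Order: outputLocation dsdgenDir scaleFactor format
        #        overwrite partitionTables numPartitions
        output_location,
        dsdgen_dir,
        str(scale_factor),
        data_format,
        flag(overwrite),
        flag(partition_tables),
        str(num_partitions),
    ]


command = build_datagen_command(
    jar_path="spark-k8s-benchmarks-assembly-1.0.0.jar",
    output_location="s3://your-bucket/TPCDS-DATA/",
    dsdgen_dir="/usr/local/bin",
    scale_factor=1000,
    data_format="parquet",
    overwrite=True,          # placeholder value
    partition_tables=True,   # placeholder value
    num_partitions=200,      # placeholder value
)
# shlex.join quotes each element so the line can be pasted into a shell.
print(shlex.join(command))
```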

Project Structure

spark-k8s-benchmarks/
├── build.sbt                 # SBT build configuration
├── project/
│   ├── build.properties      # SBT version
│   └── plugins.sbt           # sbt-assembly plugin
├── src/main/scala/
│   └── com/k8s/spark/benchmark/
│       ├── BenchmarkSQL.scala      # TPC-DS query runner
│       └── DataGeneration.scala    # Data generator
├── docker/
│   └── Dockerfile-spark352-...     # Multi-stage Docker build
├── lib/                      # Local dev only (see lib/README.md)
└── build-spark-sql-perf.sh   # Helper script for local dev

Local Development

For local development/testing (not required for Docker builds):

# 1. Build spark-sql-perf dependency
./build-spark-sql-perf.sh

# 2. Build benchmark JAR
sbt clean assembly

# Output: target/scala-2.12/spark-k8s-benchmarks-assembly-1.0.0.jar

See lib/README.md for details.

Dependencies

Component	Version
Scala	2.12.18
Spark	3.5.2
SBT	1.9.9
Java	17
spark-sql-perf	0.5.1-SNAPSHOT (TPC-DS 4.0 fork)

Output Format

Results are written as:

JSON: Detailed execution metrics per iteration
CSV: Summary with median/min/max per query
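
To make the relationship between the two outputs concrete, here is a small Python sketch of how a median/min/max summary could be derived from per-iteration runtimes. It is an illustration only: the input shape (a mapping from query name to a list of runtimes in seconds) is an assumption, not the actual JSON schema, and the three runtimes are chosen to be consistent with the q1 row of the sample CSV in this section.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(
    runtimes_by_query: Dict[str, List[float]],
) -> List[Tuple[str, float, float, float]]:
    """Return (queryName, median, min, max) rows, sorted by query name."""
    rows = []
    for name, runtimes in sorted(runtimes_by_query.items()):
        if not runtimes:
            raise ValueError(f"no runtimes recorded for {name}")
        rows.append(
            (name, statistics.median(runtimes), min(runtimes), max(runtimes))
        )
    return rows


# Hypothetical per-iteration runtimes (seconds) for three iterations.
sample = {"q1-v4.0": [8.31, 4.19, 3.96]}
rows = summarize(sample)
for name, median, fastest, slowest in rows:
    print(f"{name},{median},{fastest},{slowest}")
```

A median is less sensitive than a mean to a single slow iteration, which is one reason it is a common choice for summarising a small number of benchmark runs.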

queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.19,3.96,8.31
q2-v4.0,25.56,23.06,31.28
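
The CSV summary is convenient for comparing two configurations or for catching regressions after an upgrade. The following Python sketch compares the median runtimes of two summary files and flags queries that became slower by more than a tolerance. The function names and the 10% tolerance are assumptions for illustration; the baseline rows are the sample above, and the candidate rows are invented numbers, not measured results.

```python
import csv
import io
from typing import Dict, List, Tuple


def load_medians(csv_text: str) -> Dict[str, float]:
    """Map queryName to medianRuntimeSeconds from a summary CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        row["queryName"]: float(row["medianRuntimeSeconds"]) for row in reader
    }


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.10,
) -> List[Tuple[str, float, float, float, bool]]:
    """Return (query, baseline_s, candidate_s, ratio, regressed) per query.

    ratio = candidate / baseline, so values above 1.0 mean "slower".
    Only queries present in both runs are compared.
    """
    report = []
    for name in sorted(baseline.keys() & candidate.keys()):
        ratio = candidate[name] / baseline[name]
        report.append(
            (name, baseline[name], candidate[name], ratio, ratio > 1 + tolerance)
        )
    return report


baseline_csv = """queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.19,3.96,8.31
q2-v4.0,25.56,23.06,31.28
"""

# Invented candidate run, for illustration only.
candidate_csv = """queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.00,3.90,4.50
q2-v4.0,30.00,29.00,33.00
"""

report = compare_runs(load_medians(baseline_csv), load_medians(candidate_csv))
for name, base_s, cand_s, ratio, regressed in report:
    status = "REGRESSION" if regressed else "ok"
    print(f"{name}: {base_s:.2f}s -> {cand_s:.2f}s (x{ratio:.2f}) {status}")
```

In practice the two CSV texts would be read from the result locations of the runs being compared, and the tolerance would be chosen to sit above the run-to-run noise visible in the min and max columns.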

License

Apache License 2.0

Acknowledgments

This project stands on the shoulders of giants. A big thank you to the maintainers and contributors of these upstream projects:

spark-sql-perf by Databricks - The foundation for TPC-DS benchmarking in Spark
spark-sql-perf TPC-DS 4.0 fork by @heyujiao99 - Added support for TPC-DS 4.0 queries
TPC-DS Toolkit v4.0 by @heyujiao99 - Updated data generation tools for TPC-DS 4.0
Apache Spark - The distributed computing engine that powers it all
TPC - For creating and maintaining the TPC-DS benchmark standard

Your work makes projects like this possible.
