KubedAI/spark-k8s-benchmarks

Spark K8s Benchmarks

A clean TPC-DS benchmark suite for Apache Spark on Kubernetes with native TPC-DS 4.0 support.

What is this project?

This project helps you measure how fast your Spark cluster can process data. Think of it as a standardized speed test for your big data infrastructure.

The Problem: When you set up Apache Spark on Kubernetes (especially on AWS EKS), you want to know: "How fast is my cluster? Is it performing well? How does it compare to other configurations?" Without a standard test, it's hard to answer these questions.

The Solution: This project runs TPC-DS benchmarks - an industry-standard test suite that simulates a retail company's data warehouse. It runs 104 realistic SQL queries against large datasets (from 1GB to 100TB) and measures how long each query takes. This gives you concrete numbers to:

  • Compare configurations: CPU vs GPU (RAPIDS), different instance types, memory settings
  • Validate your setup: Ensure Spark is running efficiently on Kubernetes
  • Track performance over time: Catch regressions after upgrades or config changes
  • Make informed decisions: Choose the right infrastructure for your workload
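As a rough illustration of tracking performance over time, the sketch below compares two CSV summaries and lists queries whose median runtime grew by more than a threshold. It assumes the four-column CSV layout shown under Output Format; the function name `compare_runs` and the 10% default threshold are hypothetical choices, not part of this project.

```shell
# compare_runs BASELINE.csv CANDIDATE.csv [THRESHOLD_PERCENT]
# Hypothetical helper: prints "query,baselineMedian,candidateMedian,change"
# for every query whose median runtime grew by more than the threshold.
compare_runs() {
  awk -F, -v limit="${3:-10}" '
    FNR == 1 { next }                    # skip the header row of each file
    NR == FNR { base[$1] = $2; next }    # first file: remember baseline medians
    ($1 in base) && base[$1] > 0 {       # second file: compare against baseline
      change = ($2 - base[$1]) / base[$1] * 100
      if (change > limit)
        printf "%s,%.2f,%.2f,%+.1f%%\n", $1, base[$1], $2, change
    }
  ' "$1" "$2"
}
```

For example, `compare_runs baseline.csv candidate.csv 15` would report only queries that became more than 15% slower. Both file names are placeholders for summaries downloaded from your results location.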

How to use it:

  1. Build the Docker image (one command)
  2. Deploy to your Kubernetes cluster using the Spark Operator
  3. Point it at your S3 bucket for data and results
  4. Get a CSV report showing how long each query took

The benchmark runs automatically and produces easy-to-read results - no manual SQL execution required.
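The steps above can be sketched as a small script. This is a minimal outline under stated assumptions: the image name and the manifest file name `tpcds-benchmark.yaml` are placeholders, the manifest is assumed to contain a SparkApplication named `tpcds-benchmark` in the `spark` namespace, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
set -eu

# Placeholder values: replace with your own registry and manifest file.
IMAGE="${IMAGE:-your-registry/spark-k8s-benchmarks:latest}"
MANIFEST="${MANIFEST:-tpcds-benchmark.yaml}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command, and run it unless DRY_RUN=1.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" != "1" ]; then "$@"; fi
}

deploy_benchmark() {
  # Step 1: build the image, using docker/ as the build context.
  run docker build -f docker/Dockerfile-spark352-rapids25-tpcds4-cuda12-9 -t "$IMAGE" docker
  # Push the image so the cluster can pull it.
  run docker push "$IMAGE"
  # Steps 2-3: submit the SparkApplication; the S3 paths live in its arguments.
  run kubectl apply -f "$MANIFEST"
  # Check the application status reported by the Spark Operator.
  run kubectl get sparkapplication tpcds-benchmark -n spark
}

deploy_benchmark
```

Printing by default keeps the sketch safe to paste; once the placeholders are filled in, running it with `DRY_RUN=0` executes the same four commands.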

Features

  • TPC-DS 4.0 Native Support: Uses official TPC-DS 4.0 queries
  • Kubernetes-Native: Designed for Spark Operator on Amazon EKS
  • RAPIDS GPU Support: Compatible with NVIDIA RAPIDS acceleration
  • Clean Codebase: No legacy dependencies, minimal and maintainable

Quick Start

Build Docker Image

The recommended way to use this project is via the Docker image, which includes all dependencies:

cd docker
docker build -f Dockerfile-spark352-rapids25-tpcds4-cuda12-9 -t spark-k8s-benchmarks:latest .

The Docker build automatically:

  1. Builds spark-sql-perf (TPC-DS library) from source
  2. Builds this benchmark application
  3. Includes TPC-DS toolkit (dsdgen/dsqgen)
  4. Configures RAPIDS GPU support

Deploy on Kubernetes

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: tpcds-benchmark
  namespace: spark
spec:
  type: Scala
  mode: cluster
  image: your-registry/spark-k8s-benchmarks:latest
  mainClass: com.k8s.spark.benchmark.BenchmarkSQL
  mainApplicationFile: local:///opt/spark/examples/jars/spark-k8s-benchmarks-assembly-1.0.0.jar
  arguments:
    - "s3://your-bucket/TPCDS-DATA/"      # dataDir
    - "s3://your-bucket/TPCDS-RESULTS/"   # resultLocation
    - "/usr/local/bin"                     # dsdgenDir
    - "parquet"                            # format
    - "1000"                               # scaleFactor
    - "3"                                  # iterations
    - "false"                              # optimizeQueries
    - ""                                   # filterQueries (empty = all)
    - "false"                              # onlyWarn
  sparkVersion: "3.5.2"
  sparkConf:
    # ==================== Additional JARs ====================
    # spark-sql-perf jar needed for TPCDSTables class (not bundled in assembly)
    "spark.jars": "local:///opt/spark/examples/jars/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar"
  driver:
    cores: 4
    memory: "8g"
  executor:
    cores: 4
    instances: 4
    memory: "16g"

Main Classes

BenchmarkSQL

Runs TPC-DS benchmark queries:

spark-submit \
  --class com.k8s.spark.benchmark.BenchmarkSQL \
  spark-k8s-benchmarks-assembly-1.0.0.jar \
  <dataDir> <resultLocation> <dsdgenDir> <format> <scaleFactor> \
  <iterations> <optimizeQueries> <filterQueries> <onlyWarn>

Arguments:

| Argument | Description | Example |
|---|---|---|
| dataDir | S3/HDFS path to TPC-DS data | s3://bucket/tpcds/ |
| resultLocation | Output path for results | s3://bucket/results/ |
| dsdgenDir | Path to dsdgen binary | /usr/local/bin |
| format | Data format | parquet |
| scaleFactor | Dataset size in GB | 1000 |
| iterations | Benchmark iterations | 3 |
| optimizeQueries | Enable CBO | false |
| filterQueries | Query filter | "" or "q1,q2,q3" |
| onlyWarn | WARN log level | false |
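Because the arguments are positional, an empty filterQueries value still has to be passed as its own argument. The sketch below shows the shell quoting pitfall this implies; the variable names and the `count_args` helper are hypothetical and exist only for illustration, and the values are the placeholders from the Example column.

```shell
# Placeholder values copied from the Example column above.
DATA_DIR="s3://bucket/tpcds/"
RESULT_LOCATION="s3://bucket/results/"
DSDGEN_DIR="/usr/local/bin"
FORMAT="parquet"
SCALE_FACTOR="1000"
ITERATIONS="3"
OPTIMIZE_QUERIES="false"
FILTER_QUERIES=""            # empty string = no filter
ONLY_WARN="false"

# Helper for illustration only: reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted expansions keep the empty filter as its own (empty) argument.
count_args "$DATA_DIR" "$RESULT_LOCATION" "$DSDGEN_DIR" "$FORMAT" "$SCALE_FACTOR" \
  "$ITERATIONS" "$OPTIMIZE_QUERIES" "$FILTER_QUERIES" "$ONLY_WARN"   # prints 9

# Unquoted expansions drop the empty value, so onlyWarn shifts into slot 8.
count_args $DATA_DIR $RESULT_LOCATION $DSDGEN_DIR $FORMAT $SCALE_FACTOR \
  $ITERATIONS $OPTIMIZE_QUERIES $FILTER_QUERIES $ONLY_WARN           # prints 8
```

When wrapping spark-submit in a script, passing the same quoted list after the JAR name keeps all nine positions intact.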

DataGeneration

Generates TPC-DS dataset:

spark-submit \
  --class com.k8s.spark.benchmark.DataGeneration \
  spark-k8s-benchmarks-assembly-1.0.0.jar \
  <outputLocation> <dsdgenDir> <scaleFactor> <format> \
  <overwrite> <partitionTables> <numPartitions>

Project Structure

spark-k8s-benchmarks/
├── build.sbt                 # SBT build configuration
├── project/
│   ├── build.properties      # SBT version
│   └── plugins.sbt           # sbt-assembly plugin
├── src/main/scala/
│   └── com/k8s/spark/benchmark/
│       ├── BenchmarkSQL.scala      # TPC-DS query runner
│       └── DataGeneration.scala    # Data generator
├── docker/
│   └── Dockerfile-spark352-...     # Multi-stage Docker build
├── lib/                      # Local dev only (see lib/README.md)
└── build-spark-sql-perf.sh   # Helper script for local dev

Local Development

For local development/testing (not required for Docker builds):

# 1. Build spark-sql-perf dependency
./build-spark-sql-perf.sh

# 2. Build benchmark JAR
sbt clean assembly

# Output: target/scala-2.12/spark-k8s-benchmarks-assembly-1.0.0.jar

See lib/README.md for details.

Dependencies

| Component | Version |
|---|---|
| Scala | 2.12.18 |
| Spark | 3.5.2 |
| SBT | 1.9.9 |
| Java | 17 |
| spark-sql-perf | 0.5.1-SNAPSHOT (TPC-DS 4.0 fork) |

Output Format

Results are written as:

  • JSON: Detailed execution metrics per iteration
  • CSV: Summary with median/min/max per query

queryName,medianRuntimeSeconds,minRuntimeSeconds,maxRuntimeSeconds
q1-v4.0,4.19,3.96,8.31
q2-v4.0,25.56,23.06,31.28
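
A CSV in this layout is easy to post-process with standard tools. The following sketch, assuming the summary has already been copied to a local file, totals the median runtimes and reports the slowest query; the function name `summarize_results` is a hypothetical helper, not something this project ships.

```shell
# summarize_results SUMMARY.csv
# Hypothetical helper: prints the query count, the sum of median runtimes,
# and the query with the largest median runtime.
summarize_results() {
  awk -F, '
    NR == 1 { next }                          # skip the header row
    { total += $2; n++ }                      # accumulate median runtimes
    $2 + 0 > slowest_time { slowest_time = $2 + 0; slowest = $1 }
    END {
      printf "queries=%d total_median_seconds=%.2f slowest=%s\n", n, total, slowest
    }
  ' "$1"
}
```

Run against a file holding only the two sample rows above, this prints `queries=2 total_median_seconds=29.75 slowest=q2-v4.0`.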

License

Apache License 2.0

Acknowledgments

This project stands on the shoulders of giants. A big thank you to the maintainers and contributors of these upstream projects:

Your work makes projects like this possible.
