Spark and Spark SQL interfaces for Succinct. This library facilitates compressing RDDs in Spark and DataFrames in Spark SQL and enables queries directly on the compressed representation.
This library requires Spark 1.4+.
To build your application with Succinct-Spark, you can link against this library using Maven by adding the following dependency information to your pom.xml file:
<dependency>
    <groupId>amplab</groupId>
    <artifactId>succinct-spark</artifactId>
    <version>0.1.5</version>
</dependency>
Add the dependency to your SBT project by adding the following to build.sbt
(see the Spark Packages listing
for spark-submit and Maven instructions):
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies += "amplab" % "succinct" % "0.1.5"
The succinct-spark jar file can also be added to a Spark shell using the
--jars command line option. For example, to include it when starting the
spark shell:
$ bin/spark-shell --jars succinct-0.1.5.jar
The Succinct-Spark library exposes three APIs:
- A SuccinctRDD API that views an RDD as an unstructured "flat-file" and enables queries on its compressed representation.
- A SuccinctKVRDD API that provides a key-value abstraction for the data, and supports search and random-access over the values.
- A DataFrame API that integrates with the Spark SQL interface via Data Sources, and supports SQL queries on compressed structured data.
Note: The Spark SQL interface is experimental, and only efficient for selected SQL operators. We aim to make the Spark SQL integration more efficient in future releases.
We expose a SuccinctRDD that extends RDD[Array[Byte]]. Since each record is
represented as an array of bytes, SuccinctRDD can be used to encode a
collection of any type of records by providing a serializer/deserializer for
the record type.
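As a rough illustration of what such a serializer/deserializer pair might look like, here is a minimal sketch in plain Scala. The `Page` record type, the `PageSerializer` object and the `id|title` text layout are all hypothetical and are not part of Succinct-Spark; any encoding that round-trips through `Array[Byte]` would serve the same purpose.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical record type, used only for illustration.
case class Page(id: Int, title: String)

// Hypothetical helper; not part of the Succinct-Spark library.
object PageSerializer {
  // Encode a Page as ASCII text of the form "<id>|<title>".
  // Characters outside ASCII in the title would be replaced by '?'.
  def serialize(p: Page): Array[Byte] =
    s"${p.id}|${p.title}".getBytes(StandardCharsets.US_ASCII)

  // Decode bytes produced by serialize back into a Page.
  // The first '|' separates the id from the title.
  def deserialize(bytes: Array[Byte]): Page = {
    val text = new String(bytes, StandardCharsets.US_ASCII)
    val sep = text.indexOf('|')
    Page(text.substring(0, sep).toInt, text.substring(sep + 1))
  }
}
```

With helpers along these lines, an RDD of `Page` records could be mapped through `PageSerializer.serialize` to obtain the `RDD[Array[Byte]]` that the conversion to a SuccinctRDD operates on, and extracted bytes covering a whole record could be passed to `PageSerializer.deserialize`.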
SuccinctRDD can be used as follows:
import edu.berkeley.cs.succinct._
// Read text data from file; sc is the SparkContext
val wikiData = sc.textFile("/path/to/data").map(_.getBytes)
// Converts the wikiData RDD to a SuccinctRDD, serializing each record into an
// array of bytes. We persist the RDD in memory to perform in-memory queries.
val wikiSuccinctData = wikiData.succinct.persist()
// Count the number of occurrences of "Berkeley" in the RDD
val berkeleyOccCount = wikiSuccinctData.count("Berkeley")
println("# of times Berkeley appears in text = " + berkeleyOccCount)
// Find all offsets of occurrences of "Berkeley" in the RDD
val searchOffsets = wikiSuccinctData.search("Berkeley")
println("First 10 locations in the RDD where Berkeley occurs: ")
searchOffsets.take(10).foreach(println)
// Find all occurrences of the regular expression "(stanford|berkeley)\\.edu"
val regexOccurrences = wikiSuccinctData.regexSearch("(stanford|berkeley)\\.edu").collect()
println("# of matches for the regular expression (stanford|berkeley)\\.edu = " + regexOccurrences.length)
// Extract 10 bytes at offset 5 in the RDD
val extractedData = wikiSuccinctData.extract(5, 10)
println("Extracted data = [" + new String(extractedData) + "]")
We don't support non-ASCII characters in the input for now, since the algorithms depend on using certain non-ASCII characters as internal symbols.
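Given the ASCII-only constraint, it may be worth validating records before compressing them. The sketch below is plain Scala and assumes that "non-ASCII" means any byte outside the range 0 to 127; the `AsciiCheck` object and its method names are hypothetical helpers, not part of the library.

```scala
// Hypothetical helper; not part of the Succinct-Spark library.
object AsciiCheck {
  // True when every byte is 7-bit ASCII (0 to 127).
  // JVM bytes are signed, so bytes 0x80 and above appear as negative values.
  def isAscii(record: Array[Byte]): Boolean = record.forall(_ >= 0)

  // Index of the first non-ASCII byte, or -1 if the record is pure ASCII.
  def firstNonAscii(record: Array[Byte]): Int = record.indexWhere(_ < 0)
}
```

A predicate like `AsciiCheck.isAscii` could be used with `filter` on the byte-array RDD to drop or report offending records before the SuccinctRDD is constructed.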
Another constraint to consider is the construction time for Succinct data structures. As with any block compression scheme, Succinct requires a non-trivial amount of time to compress an input dataset. It is strongly advised that the SuccinctRDD be cached in memory (using RDD.cache()) and persisted on disk after construction completes, so that the constructed data structures can be reused without triggering reconstruction:
import edu.berkeley.cs.succinct._
// Read text data from file; sc is the SparkContext
val wikiData = sc.textFile("/path/to/data").map(_.getBytes)
// Construct the succinct RDD and save it as follows
wikiData.saveAsSuccinctFile("/path/to/data")
// Load into memory again as follows; sc is the SparkContext
val loadedSuccinctRDD = sc.succinctFile("/path/to/data")
The SuccinctKVRDD implements the RDD[(K, Array[Byte])] interface, where the key
is of a specified (ordered) type and the value is a serialized array of
bytes.
SuccinctKVRDD can be used as follows:
import edu.berkeley.cs.succinct.kv._
// sc is the SparkContext; dataPath and partitions are supplied by the caller
val wikiData = sc.textFile(dataPath, partitions).map(_.getBytes)
val wikiKVData = wikiData.zipWithIndex().map(t => (t._2, t._1))
val succinctKVRDD = wikiKVData.succinctKV
// Get the value for key 0
val value = succinctKVRDD.get(0)
println("Value corresponding to key 0 = " + new String(value))
// Fetch 3 bytes at offset 1 for the value corresponding to key = 0
val valueData = succinctKVRDD.extract(0, 1, 3)
println("Value data for key 0 at offset 1 and length 3 = " + new String(valueData))
// Count the number of occurrences of "Berkeley" across all values
val count = succinctKVRDD.count("Berkeley")
println("Number of times Berkeley occurs in the values: " + count)
// Get the individual occurrences of Berkeley as offsets into each value
val searchOffsets = succinctKVRDD.searchOffsets("Berkeley")
println("First 10 matches for Berkeley as (key, offset) pairs: ")
searchOffsets.take(10).foreach(println)
// Search for values containing "Berkeley", and fetch corresponding keys
val keys = succinctKVRDD.search("Berkeley")
println("First 10 keys matching the search query:")
keys.take(10).foreach(println)
// Regex search to find values containing matches of "(stanford|berkeley)\\.edu",
// and fetch the corresponding keys
val regexKeys = succinctKVRDD.regexSearch("(stanford|berkeley)\\.edu")
println("First 10 keys matching the regex query:")
regexKeys.take(10).foreach(println)
Similar to the flat-file interface, we suggest that the KV data be persisted to disk for repeated-use scenarios:
import edu.berkeley.cs.succinct.kv._
// Read data from file; sc is the SparkContext
val wikiData = sc.textFile("/path/to/data").map(_.getBytes)
val wikiKVData = wikiData.zipWithIndex().map(t => (t._2, t._1))
// Construct the SuccinctKVRDD and save it as follows
wikiKVData.saveAsSuccinctKV("/path/to/data")
// Load into memory again as follows; sc is the SparkContext
val loadedSuccinctKVRDD = sc.succinctKV("/path/to/data")
The DataFrame API for Succinct is experimental for now, and only supports selected data types and filters. The supported SparkSQL types include:
BooleanType
ByteType
ShortType
IntegerType
LongType
FloatType
DoubleType
DecimalType
StringType
The supported SparkSQL filters include:
StringStartsWith
StringEndsWith
StringContains
EqualTo
LessThan
LessThanOrEqual
GreaterThan
GreaterThanOrEqual
Note that certain SQL operations, like joins, might be inefficient on the DataFrame API for now. We plan on improving the performance for generic SQL operations in a future release.
The DataFrame API can be used as follows:
import edu.berkeley.cs.succinct.sql._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Create a schema
val citySchema = StructType(Seq(
  StructField("Name", StringType, false),
  StructField("Length", IntegerType, true),
  StructField("Area", DoubleType, false),
  StructField("Airport", BooleanType, true)))
// Create an RDD of Rows with some data; sc is the SparkContext
val cityRDD = sc.parallelize(Seq(
  Row("San Francisco", 12, 44.52, true),
  Row("Palo Alto", 12, 22.33, false),
  Row("Munich", 8, 3.14, true)))
// Create a data frame from the RDD and the schema
val cityDataFrame = sqlContext.createDataFrame(cityRDD, citySchema)
// Save the DataFrame in the "Succinct" format
cityDataFrame.write.format("edu.berkeley.cs.succinct.sql").save("/path/to/data")
// Read the Succinct DataFrame from the saved path
val succinctCities = sqlContext.succinctFile("/path/to/data")
// Filter and prune
val bigCities = succinctCities.filter("Area >= 22.0").select("Name").collect
// Alternately, use the DataFrameReader API:
cityDataFrame.write.format("edu.berkeley.cs.succinct.sql").save("/path/to/data")
val succinctCities2 = sqlContext.read.format("edu.berkeley.cs.succinct.sql").load("/path/to/data")
val smallCities = succinctCities2.filter("Area <= 10.0").select("Name").collect
The Succinct-Spark package includes a few examples that illustrate the
usage of its API. Scripts for running these examples are provided in the
bin/ directory. For instance, to execute the Wikipedia Search example using
SuccinctRDD, run:
./bin/wiki-search [num-partitions]
The num-partitions parameter is the number of partitions that the
original dataset should be divided into for creating Succinct data structures.
It defaults to 1. Note that due to Java constraints, we do not support
partitions larger than 2GB yet.
The KV Search and Table Search examples are executed similarly.
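Because of the 2GB partition limit mentioned above, the default of a single partition only works for small inputs. The following plain-Scala sketch estimates a lower bound for num-partitions from the input size. It assumes the per-partition cap is the largest Java array length (Int.MaxValue bytes, just under 2GB) and that data is split evenly; the `PartitionSizing` object is a hypothetical helper, not part of the library.

```scala
// Hypothetical helper; not part of the Succinct-Spark library.
object PartitionSizing {
  // Assumed per-partition cap: the largest Java array length, just under 2GB.
  val MaxPartitionBytes: Long = Int.MaxValue.toLong

  // Smallest partition count for which an even split keeps every
  // partition at or below maxBytes. Always returns at least 1.
  def minPartitions(totalBytes: Long, maxBytes: Long = MaxPartitionBytes): Int = {
    require(totalBytes >= 0, "totalBytes must be non-negative")
    require(maxBytes > 0, "maxBytes must be positive")
    val needed = (totalBytes + maxBytes - 1) / maxBytes // ceiling division
    math.max(1L, needed).toInt
  }
}
```

Since real partitions are rarely perfectly even, the returned value is best treated as a minimum, with some headroom added when choosing the actual num-partitions argument.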