Apache Spark, arguably the hottest open-source cluster computing project, recently released its major 2.0 upgrade. With the new changes and updates, the performance of its computing engine sees significant improvements.
To help people get started with or transition to this new version, I am starting a series of blog posts covering Spark 2.0. This article in particular talks about one of its major API updates: the introduction of a new interface, `Dataset`. (Well, technically, `Dataset` was introduced in Spark 1.6 as an API preview, but it only becomes stable in this new release.) With `Dataset` and its two predecessors, `RDD` and `DataFrame`, Spark now has three major APIs for operating on large datasets.
RDD
`RDD` (Resilient Distributed Dataset) has been the primary API since Spark 1.0. It is essentially an immutable distributed collection of elements of your data, partitioned across the nodes of the cluster. It is functional in style and emphasizes immutability, which provides a simple API with compile-time type safety, but may cause garbage-collection issues due to the many temporary objects created during computation.
`RDD` is lazily evaluated. It provides two types of operations: transformations and actions. Transformations, such as `map` and `filter`, only create a new `RDD` representing the transformed data and define the operations to be performed. The operations are not actually performed until an action, such as `reduce` or `collect`, is called. You can find a list of these operations in the Spark Programming Guide.
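Here is a minimal sketch of that laziness in action. It assumes Spark is on the classpath and runs in local mode; the variable names are mine, not from any particular example:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: transformations are lazy; actions trigger the actual job.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(1 to 4)   // RDD[Int]
val doubled = nums.map(_ * 2)       // transformation: only records the lineage, nothing runs yet
val total = doubled.reduce(_ + _)   // action: triggers the distributed computation
// total == 20
spark.stop()
```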
`RDD` will continue to be the building block of Spark, since `Dataset` and `DataFrame` are built on top of it. It remains the low-level API to use when you want more explicit control over operations or need to process unstructured data. But it does not enjoy many of the optimizations and performance benefits available to `Dataset` and `DataFrame`, because it lacks the structure information needed by Spark's internal Catalyst optimizer.
Dataset and DataFrame
`Dataset` is formally introduced in Spark 2.0 and is positioned to take over `DataFrame` (introduced in Spark 1.3) by bringing in some of the advantages of `RDD`, such as compile-time type safety. I am covering `Dataset` and `DataFrame` together here mainly because, starting from Spark 2.0, `DataFrame` is merely an alias for `Dataset[Row]`, where `Row` is a generic untyped object. Alternatively, for a typed object, such as a case class `Person`, you can create a `Dataset[Person]`, similar to `RDD[Person]`, to take advantage of the type safety.
`Dataset` optimizes data storage via encoders to reduce the cost of deserialization and garbage collection, and improves performance by using the Catalyst optimizer to generate optimized query plans. Its overall performance is improved significantly (see this example). The trade-off is that `Dataset` is limited to classes that extend the Scala `Product` trait, e.g. case classes, whereas `RDD` can be used with any objects.
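The `Product` requirement is easy to see in plain Scala, since every case class extends it automatically (the `Person` class below is my own example):

```scala
// Case classes automatically extend scala.Product,
// which is what Dataset encoders rely on to map fields to columns.
case class Person(name: String, age: Int)

val p = Person("Ann", 32)
val isProduct = p.isInstanceOf[Product]   // true
val arity = p.productArity                // 2, one per field: name and age
```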
Even better, a `Dataset` can be seamlessly converted to an `RDD` by calling its `.rdd` method.
```scala
// suppose you have personDS: Dataset[Person]
val personRDD: RDD[Person] = personDS.rdd
```
And, vice versa, with a little bit of extra work, you can also convert an `RDD` to a `Dataset`.
```scala
// suppose you have personRDD: RDD[Person] and spark: SparkSession
import spark.implicits._
val personDS: Dataset[Person] = personRDD.toDS()
```
Summary
Starting from Spark 2.0, I would expect `Dataset` to become the de facto API that users operate with on a daily basis, and `RDD` to be used only when lower-level functionality and control are needed. The space efficiency and performance gains with `Dataset` are very significant for most use cases - at least for me, dealing with structured or semi-structured data is the norm.
As always, thanks for your time, and I hope this post is helpful. If you have other Spark 2.0 related questions, check out the rest of this series or leave a comment below.