Apache Spark, arguably the hottest open-source cluster computing project, recently released its major 2.0 upgrade. With the new changes and updates, the performance of its computing engine sees significant improvements.
To help people get started with or transition to this new version, I am starting a series of blog posts covering Spark 2.0. This article in particular talks about one of its major API updates: the introduction of a new interface, `Dataset`. (Well, technically, `Dataset` was introduced in Spark 1.6 as an API preview, but it only becomes stable in this new release.) With `Dataset` and its two predecessors, `RDD` and `DataFrame`, Spark now has three major APIs for operating on large datasets.
RDD
`RDD` (Resilient Distributed Dataset) has been the primary API since Spark 1.0. It is essentially an immutable distributed collection of elements of your data, partitioned across the nodes of the cluster. It is functional in style and emphasizes immutability, which provides a simple API with compile-time type safety, but may cause garbage-collection issues due to the many temporary objects created during computation.
`RDD` is lazily evaluated. It provides two types of operations: transformations and actions. Transformations, such as `map` and `filter`, only create a new `RDD` representing the transformed data and define the operations to be performed. The operations are not actually performed until an action, such as `reduce` or `collect`, is called. You can find a list of these operations in the Spark Programming Guide.
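Here is a minimal sketch of that laziness in action. It assumes Spark is on the classpath and runs in local mode; the variable names are mine, not from any particular example:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: transformations are lazy; actions trigger the actual job.
val spark = SparkSession.builder().master("local[*]").appName("lazy-demo").getOrCreate()
val sc = spark.sparkContext

val nums = sc.parallelize(1 to 4)   // RDD[Int]
val doubled = nums.map(_ * 2)       // transformation: only records the lineage, nothing runs yet
val total = doubled.reduce(_ + _)   // action: triggers the distributed computation
// total == 20
spark.stop()
```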
`RDD` will continue to be the building block of Spark, since `Dataset` and `DataFrame` are built on top of it. It remains the low-level API to use when you want more explicit control over operations or need to process unstructured data. But it does not enjoy many of the optimizations and performance benefits available to `Dataset` and `DataFrame`, because it lacks the structure information needed by Spark's internal Catalyst optimizer.
Dataset and DataFrame
`Dataset` is formally introduced in Spark 2.0 and is positioned to take over `DataFrame` (introduced in Spark 1.3) by bringing in some of the advantages of `RDD`, such as compile-time type safety. I am covering `Dataset` and `DataFrame` together here mainly because, starting from Spark 2.0, `DataFrame` is merely an alias for `Dataset[Row]`, where `Row` is a generic untyped object. Alternatively, for a typed object, such as a case class `Person`, you can create a `Dataset[Person]`, similar to `RDD[Person]`, to take advantage of the type safety.
`Dataset` optimizes data storage via encoders to reduce the cost of deserialization and garbage collection, and improves performance by using the Catalyst optimizer to generate optimized query plans. Its overall performance is improved significantly (see this example). The trade-off is that `Dataset` is limited to classes that extend the Scala `Product` trait, e.g. case classes, whereas `RDD` can be used with any objects.
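The `Product` requirement is easy to see in plain Scala, since every case class extends it automatically (the `Person` class below is my own example):

```scala
// Case classes automatically extend scala.Product,
// which is what Dataset encoders rely on to map fields to columns.
case class Person(name: String, age: Int)

val p = Person("Ann", 32)
val isProduct = p.isInstanceOf[Product]   // true
val arity = p.productArity                // 2, one per field: name and age
```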
Even better, a `Dataset` can be seamlessly converted to an `RDD` by calling its `.rdd` method.
```scala
// suppose you have personDS: Dataset[Person]
val personRDD: RDD[Person] = personDS.rdd
```
And, vice versa, with a little bit of extra work, you can also convert an `RDD` to a `Dataset`.
```scala
// suppose you have personRDD: RDD[Person] and spark: SparkSession
import spark.implicits._
val personDS: Dataset[Person] = personRDD.toDS()
```
Summary
Starting from Spark 2.0, I would expect `Dataset` to become the de facto API that users operate with on a daily basis, and `RDD` to be used only when lower-level functionality and control are needed. The space efficiency and performance gains with `Dataset` are very significant for most use cases - at least for me, dealing with structured or semi-structured data is the norm.
As always, thanks for your time, and I hope this post is helpful. If you have other Spark 2.0 related questions, check out the rest of this series or leave a comment below.