Apark Spark, the allegedly hottest open source cluster computing project, recently released a major upgrade to its 2.0 version. With new changes and updates, the performance of its computing engine sees significant improvements.
To help people get started or transition to this new version, I started a series of blog posts to cover Spark 2.0 related stuff. This article in particular talks about one of its major API updates; that is, the introduction of a new interface
Dataset (Well, technically,
Dataset was introduced in Spark 1.6 as an API preview, but it really becomes stable in this new release). With
Dataset and its two predecessors
DataFrame, Spark now has three major APIs for operating large datasets.
RDD (Risilient Distributed Dataset) was the primary API introduced since Spark 1.0. It essentially is an immutable distributed collection of elements of your data that is partitioned across nodes in the cluster. It is functional-oriented and emphasizes on immutability, which provides a simple, OOP-style API with complie time type-safety, but may cause issues with garbabge collection due to too many temporary objects being created during computation.
RDD is lazily evaluated. It provides two types of operations, i.e. transformations and actions. The transformations, such as
reduce, only creates a new
RDD representing the transformed data and defines the operations to be performed . The operaions are not actually performed until an action is called. You can find a list of these operations on the Spark Programming Guide.
RDD will continue to be the building block of Spark, since
DataFrame are built on top of it. It will be the low-level API if you want more explicit control over operations or process unstructured data. But it would not enjoy many of the optimizations and performance benefits available with
DataFrame due to its lack of structure information and the help brought by Spark’s internal Catalyst Optimizer.
Dataset and DataFrame
Dataset is formally introduced in Spark 2.0 and is positioned to take over
DataFrame (introduced in Spark 1.3) by bringing in some of the advantages with
RDD, such as compile time type-safety. I am combining
DataFrame here together mainly because
DataFrame will merely be an alias for
Dataset[Row] starting from Spark 2.0, where
Row is a generic untyped object. Alternatively, for a typed object, such as a case class
Person, you can create a
Dataset[Person], similar to
RDD[Person], to take advantage of the type-safety.
Dataset optimizes data storage via encoders to eliminate the cost of deserialization and garbabge collection, and improves performance by using Catalyst Optimizer to generate optimized query plan. Its overall performance is improved significantly (see this example). The trade-off here is that
Dataset is limited to classes that extend the Scala
Product trait, e.g.
case class, whereas
RDD can be used with any objects.
Dataset can be seamlessly convert to
RDD by calling the
// suppose you have personDS: Dataset[Person]
And, vice versa, with a little bit extra work, you can also convert
// suppose you have personRDD: RDD[Person] and spark: SparkSession
Starting from Spark 2.0, I would imagine
Dataset to become de facto API for users to operate with on a daily basis and
RDD to be used only when lower level functionality and control are needed. The space efficiency and performance gains with
Dataset are very significant for most use cases - at least for me, dealing with structured or semi-structured data is the norm.
As always, thanks for your time and hope this post is helpful. If you have other Spark 2.0 related questions, you should check out this series of posts or leave comments below.