Apache Spark is arguably the fastest-growing open source project in the big data world over the past few years. It provides, as quoted from its website, “a fast and general engine for large-scale data processing”.
With the recent release of Spark 2.0, the project has become an even better choice for building big data applications, with consolidated APIs and improved performance. To help people get started or transition from 1.x, I am starting this series of posts to talk about some of Spark 2.0’s important concepts and applications.
Every Spark application needs an entry point that lets it connect to different data sources to read data from or write data to them. Historically (in Spark 1.x), we used SQLContext or HiveContext as the Spark SQL entry points on top of a SparkContext created for the application. SparkContext allows your application to access the cluster through a resource manager, but, as entry points, the difference between SQLContext and HiveContext can be a little confusing to users. My understanding is that HiveContext is a superset of SQLContext that you would need if you want to access Hive tables or use richer functionality such as window functions, and the trade-off is that HiveContext requires many more dependencies to run.
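For reference, here is a minimal sketch of the Spark 1.x style (the class names are the actual 1.x APIs; the application name and master URL are just for illustration):

// Spark 1.x style: build a SQLContext or HiveContext on top of a SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Spark1xExample").setMaster("local[*]")
val sc = new SparkContext(conf)

val sqlContext = new SQLContext(sc)    // basic Spark SQL entry point
val hiveContext = new HiveContext(sc)  // superset of SQLContext with Hive support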
In the new world of Spark 2.0, a new entry point called SparkSession is introduced to unify these old APIs, and it carries over all the features they provided.
SparkSession can be created using a builder pattern. Internally, it still requires a SparkContext, which is used to create RDDs and manage cluster resources, but you no longer need to provide one explicitly. Under the hood, the builder reuses an existing SparkContext, or creates a new one if none exists.
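Here is a minimal sketch of the builder pattern (the application name and master URL are illustrative; in practice the master is often set by spark-submit):

import org.apache.spark.sql.SparkSession

// getOrCreate() reuses an existing SparkContext or creates a new one
val spark = SparkSession
  .builder()
  .appName("Spark2Example")
  .master("local[*]")
  .getOrCreate()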
Note that it is sort of a convention to declare your SparkSession object with the name spark, as done above (the Spark 2.0 shell also pre-creates a session under that name).
The SparkSession created above is effectively a SQLContext, but, if you want to work with your old friend HiveContext instead, you just need to add one line to enable the Hive support.
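A minimal sketch of the Hive-enabled variant (assuming the same builder pattern as above):

// the extra line is enableHiveSupport()
val spark = SparkSession
  .builder()
  .enableHiveSupport()
  .getOrCreate()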
You can also directly access the underlying SparkContext from the session if you need the lower-level API.
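For example, continuing with the spark session built above:

// the SparkContext that backs this session
val sc = spark.sparkContext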
Similar to its predecessors, SparkSession can be used to read data from disk or execute SQL queries, with both operations returning their results as a DataFrame (an alias of Dataset[Row]; more on this new API in a separate post).
// read from a local json file (the path is illustrative)
val df = spark.read.json("people.json")
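And a minimal sketch of running a SQL query against it (the view name and the query are illustrative):

// register the DataFrame as a temporary view, then query it with SQL
df.createOrReplaceTempView("people")
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()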
Alternatively, you can convert a local collection or
RDD into a
Dataset. This can be achieved through the toDS method, which becomes available once you import SparkSession.implicits; the import defines encoders for Scala’s primitive types, as well as their products and collections.
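A minimal sketch of that conversion (the Person case class and the sample values are just for illustration):

// a small case class whose fields become the Dataset columns
case class Person(name: String, age: Int)

// the implicits of the session bring toDS (and the needed encoders) into scope
import spark.implicits._

// from a local collection
val people = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// from an RDD
val fromRdd = spark.sparkContext.parallelize(Seq(Person("Carol", 42))).toDS()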
There are many other methods associated with SparkSession that I will omit here, since they are, similar to what you have seen so far, mostly methods borrowed from SQLContext and HiveContext.
Backward compatibility is always a sensitive topic when API changes are involved. People freak out when things break. The good news is that Spark is, in general, good about backward compatibility. The old SQLContext, HiveContext, and SparkContext are all still available, although you should move away from them and use the handy
SparkSession in your next application.
In this post, I briefly introduced the new entry point
SparkSession for developing Spark SQL applications in Spark 2.0. It subsumes the old entry points from Spark 1.x, i.e. SQLContext and HiveContext, and provides a much less confusing API for accessing data sources.