Apache Spark is arguably the fastest-growing open source project in the big data world over the past few years. It provides, as its website puts it, “a fast and general engine for large-scale data processing”.
With the recent release of Spark 2.0, the project has become an even better platform for building big data applications, with consolidated APIs and improved performance. To help people get started or transition from 1.x, I am starting this series of posts about some of Spark 2.0’s important concepts and applications.
New entry point - SparkSession
Every Spark application needs an entry point that connects it to different data sources so that it can read and write data. In Spark 1.x, we used SQLContext or HiveContext as the Spark SQL entry points, built on top of a SparkContext created for the application.
Essentially, SparkContext allows your application to access the cluster through a resource manager. But, as entry points, the difference between SQLContext and HiveContext could be a bit confusing to users. My understanding is that HiveContext is a superset of SQLContext: you need it to access Hive tables, or to use richer functionality such as window operations, and the trade-off is that HiveContext requires many more dependencies to run.
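As a concrete illustration, here is a sketch of what a window operation looks like in the DataFrame API (in Spark 1.x this class of query needed a HiveContext; the employees DataFrame and its dept and salary columns are made up for the example):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}

// rank employees by salary within each department,
// assuming `employees` is a DataFrame with "dept" and "salary" columns
val byDept = Window.partitionBy("dept").orderBy(desc("salary"))
val ranked = employees.withColumn("rank", rank().over(byDept))
```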
In the new world of Spark 2.0, a new entry point called SparkSession is introduced to unify these old APIs; it contains all the features that were previously available.
Create SparkSession
A SparkSession can be created using a builder pattern. Internally, it requires a SparkContext, which is used to create RDDs and manage cluster resources, but you no longer need to provide one explicitly. Under the hood, the builder reuses an existing SparkContext, or creates a new one if none exists.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("my-spark-app") // pick any application name
  .getOrCreate()
```
Note that it is something of a convention to name your SparkSession object spark. The SparkSession created above is effectively a SQLContext, but, if you want to work with your old friend HiveContext instead, you just need to add one line to enable Hive support.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("my-spark-app")
  .enableHiveSupport() // provides the HiveContext-equivalent features
  .getOrCreate()
```
You can also directly access the underlying SparkContext through the SparkSession.
```scala
spark.sparkContext
```
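The SparkContext obtained this way works exactly as it did in Spark 1.x; a minimal sketch (the data and variable names are illustrative):

```scala
// assuming `spark` is the SparkSession built earlier
val sc = spark.sparkContext

// create an RDD from a local collection, just as in Spark 1.x
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val total = rdd.map(_ * 2).sum() // 20.0
```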
Process data
Similar to its predecessors, SparkSession can be used to read data from disk or to execute SQL queries. Both return results as a DataFrame (or Dataset[Row]; more on this new API in a separate post).
```scala
// read from a local json file (the path is illustrative)
val df = spark.read.json("examples/src/main/resources/people.json")

// or execute a SQL query against a registered view
df.createOrReplaceTempView("people")
val names = spark.sql("SELECT name FROM people")
```
Alternatively, you can convert a local collection or RDD into a Dataset. This is achieved through the toDS method, made available by importing spark.implicits._, which defines encoders for Scala’s primitive types as well as their products and collections.
```scala
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()
```
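The same import also derives encoders for products, i.e. case classes; a small sketch with a hypothetical Person class:

```scala
import spark.implicits._

case class Person(name: String, age: Long)

// the Person encoder is derived automatically by spark.implicits
val people = Seq(Person("Ann", 32), Person("Bob", 25)).toDS()
people.filter(_.age > 30).show()
```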
There are many other methods on SparkSession that I will omit here, since they are, like what you have seen so far, mostly methods borrowed from SQLContext or HiveContext.
Backward compatibility
Backward compatibility is always a sensitive topic when API changes are involved. People freak out when things break. The good news is that Spark is generally good about backward compatibility: the old SQLContext, HiveContext and SparkContext are all still available, although you should move away from them and use the handy SparkSession in your next application.
Summary
In this post, I briefly introduced SparkSession, the new entry point for developing Spark SQL applications in Spark 2.0. It subsumes the old entry points from Spark 1.x, i.e. SQLContext and HiveContext, and provides a much less confusing API for accessing data sources.
This article belongs to a series of posts on Spark 2.0 that I will be blogging in the next few weeks. If you are interested, please subscribe to the RSS feed or follow me on Twitter.