Apache Spark is arguably the fastest-growing open source project in the big data world over the past few years. It provides, as its website puts it, “a fast and general engine for large-scale data processing”.
With the recent release of Spark 2.0, the project has become an even better platform for building big data applications, with consolidated APIs and improved performance. To help people get started or transition from 1.x, I am starting this series of posts about some of Spark 2.0’s important concepts and applications.
New entry point - SparkSession
Every Spark application needs an entry point that connects it to different data sources so that it can read and write data. In Spark 1.x, we used SQLContext or HiveContext as the Spark SQL entry points, built on top of a SparkContext created for the application.
Essentially, SparkContext allows your application to access the cluster through a resource manager. But, as entry points, the difference between SQLContext and HiveContext could be a bit confusing to users. My understanding is that HiveContext is a superset of SQLContext: you need it to access Hive tables, or to use richer functionality such as window operations, and the trade-off is that HiveContext requires many more dependencies to run.
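As a concrete illustration, here is a sketch of what a window operation looks like in the DataFrame API (in Spark 1.x this class of query needed a HiveContext; the employees DataFrame and its dept and salary columns are made up for the example):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, desc}

// rank employees by salary within each department,
// assuming `employees` is a DataFrame with "dept" and "salary" columns
val byDept = Window.partitionBy("dept").orderBy(desc("salary"))
val ranked = employees.withColumn("rank", rank().over(byDept))
```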
In the new world of Spark 2.0, a new entry point called SparkSession is introduced to unify these old APIs; it contains all the features that were previously available.
Create SparkSession
A SparkSession can be created using a builder pattern. Internally, it requires a SparkContext, which is used to create RDDs and manage cluster resources, but you no longer need to provide one explicitly. Under the hood, the builder reuses an existing SparkContext, or creates a new one if none exists.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("my-spark-app") // pick any application name
  .getOrCreate()
```
Note that it is something of a convention to name your SparkSession object spark. The SparkSession created above is effectively a SQLContext, but, if you want to work with your old friend HiveContext instead, you just need to add one line to enable Hive support.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("my-spark-app")
  .enableHiveSupport() // provides the HiveContext-equivalent features
  .getOrCreate()
```
You can also directly access the underlying SparkContext through the SparkSession.
```scala
spark.sparkContext
```
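The SparkContext obtained this way works exactly as it did in Spark 1.x; a minimal sketch (the data and variable names are illustrative):

```scala
// assuming `spark` is the SparkSession built earlier
val sc = spark.sparkContext

// create an RDD from a local collection, just as in Spark 1.x
val rdd = sc.parallelize(Seq(1, 2, 3, 4))
val total = rdd.map(_ * 2).sum() // 20.0
```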
Process data
Similar to its predecessors, SparkSession can be used to read data from disk or to execute SQL queries. Both return results as a DataFrame (or Dataset[Row]; more on this new API in a separate post).
```scala
// read from a local json file (the path is illustrative)
val df = spark.read.json("examples/src/main/resources/people.json")

// or execute a SQL query against a registered view
df.createOrReplaceTempView("people")
val names = spark.sql("SELECT name FROM people")
```
Alternatively, you can convert a local collection or RDD into a Dataset. This is achieved through the toDS method, made available by importing spark.implicits._, which defines encoders for Scala’s primitive types as well as their products and collections.
```scala
import spark.implicits._

val ds = Seq(1, 2, 3).toDS()
```
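The same import also derives encoders for products, i.e. case classes; a small sketch with a hypothetical Person class:

```scala
import spark.implicits._

case class Person(name: String, age: Long)

// the Person encoder is derived automatically by spark.implicits
val people = Seq(Person("Ann", 32), Person("Bob", 25)).toDS()
people.filter(_.age > 30).show()
```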
There are many other methods on SparkSession that I will omit here, since they are, like what you have seen so far, mostly methods borrowed from SQLContext or HiveContext.
Backward compatibility
Backward compatibility is always a sensitive topic when API changes are involved. People freak out when things break. The good news is that Spark is generally good about backward compatibility: the old SQLContext, HiveContext and SparkContext are all still available, although you should move away from them and use the handy SparkSession in your next application.
Summary
In this post, I briefly introduced SparkSession, the new entry point for developing Spark SQL applications in Spark 2.0. It subsumes the old entry points from Spark 1.x, i.e. SQLContext and HiveContext, and provides a much less confusing API for accessing data sources.
This article belongs to a series of posts on Spark 2.0 that I will be blogging in the next few weeks. If you are interested, please subscribe to the RSS feed or follow me on Twitter.