Here is a summary of my key impressions from the Day 1 keynotes of the 2014 Spark Summit.
This year’s Spark Summit was one of the deepest, most interesting technical conferences I have attended, and I don’t say that lightly. It is easy to get caught up in the excitement of a conference filled with enthusiasts, but trust me when I say that conventional MapReduce-based Hadoop is over, and technologies like Spark will be part of the tipping point that turns Big Data hype into real applications and much more widespread deployment. Hadoop is legacy.
Spark, like Hadoop, is more than one “thing”. The base component of Spark is a cluster computation engine that is like MapReduce on steroids. Instead of the simple two-stage “map then reduce” computational model, Spark supports more general DAG-structured (directed acyclic graph) computational flows (Microsoft watchers will remember Dryad). This, in itself, is a big innovation, especially for analytics scenarios. Indeed, Spark has been shown to be 10, 100, even 1000 times faster than Hadoop on a number of real workloads. More important than this, in my view, is that Spark includes higher-level libraries for data access and analytics, surfaced in clean, consistent APIs that are available from three languages: Java, Scala, and Python. Three important Spark components are MLlib, a machine learning library; GraphX, a graph processing library; and Spark SQL, introduced at this conference. An analogue for those familiar with the Microsoft ecosystem is the .NET Framework, which provides languages, a runtime, and a set of libraries together. The integration of the pieces makes each much more useful.
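To make the contrast concrete, here is a minimal sketch in plain Python (deliberately not Spark code, and with made-up data) of the kind of multi-stage pipeline that classic MapReduce splits into separate jobs but Spark expresses as one DAG:

```python
from collections import Counter

# Illustrative input: a few lines of "log" text (hypothetical data).
lines = [
    "error disk full",
    "info user login",
    "error network timeout",
]

# Stage 1: tokenize (a "map" step).
words = [w for line in lines for w in line.split()]

# Stage 2: filter. In classic two-stage MapReduce this extra stage would
# typically mean chaining a second job; in Spark it is simply another
# node in the DAG, and intermediate results can stay in memory.
interesting = [w for w in words if w != "info"]

# Stage 3: count (the "reduce" step).
counts = Counter(interesting)

print(counts["error"])  # -> 2
```

In Spark the same shape of pipeline would be written as chained transformations on a distributed dataset, with the engine scheduling all three stages as one job.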
The Summit is organized and principally sponsored by Databricks (tagline: “making big data easy”), the company founded by the Berkeley-based creators of Spark. Ion Stoica, CEO of Databricks, kicked off Monday’s festivities by introducing Databricks Cloud, a web-based Spark workbench for doing big data analytics. You can find screenshots on the Databricks Cloud site, or on Twitter. Key points:
- Databricks Cloud is currently in a private beta.
- It’s a standalone web interface.
- It has a command-line “REPL” interface.
- The examples I saw were in Scala (which is kind of like a mix of F# and Java).
- You can bring in data from Amazon S3 or other sources using Spark SQL (more on that in future posts).
- It includes pre-canned datasets such as a Twitter snapshot/firehose (can’t tell which).
- You can do SQL queries right from the REPL.
- It has incredibly simple, clean-looking visualizations tied to results.
- You can drag and drop dashboard components that correspond to REPL / program outputs. You can configure how often these components are “refreshed”.
- We were shown a live 10-minute demo that created a dashboard to filter live tweets, based on a similarity model authored with the help of MLlib and trained on Wikipedia.
- Databricks Cloud would be quite useful even as a standalone, single-node analytics workbench, but recall that all of this is running on top of Spark, on a cluster, without the user doing any explicit “parallel programming”.
- Everything you create in the workbench is Spark 1.0 compliant, meaning you can move it over to any other Spark 1.0 distribution without changes.
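To give a flavor of the “SQL queries right from the REPL” workflow, here is a hedged analogue using Python’s built-in sqlite3 rather than Spark SQL itself; the table name, columns, and data are made up for illustration. In Databricks Cloud the same style of ad-hoc query would run against cluster-resident data instead of a local database:

```python
import sqlite3

# Hypothetical stand-in for a dataset you might register with Spark SQL;
# an in-memory SQLite table is used purely to illustrate the workflow.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user TEXT, retweets INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?)",
    [("alice", 12), ("bob", 3), ("alice", 7)],
)

# The kind of ad-hoc aggregation you could type straight into a REPL.
rows = conn.execute(
    "SELECT user, SUM(retweets) FROM tweets GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # -> [('alice', 19), ('bob', 3)]
```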
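The demo’s similarity model was built with MLlib; as a rough, self-contained sketch of the underlying idea only (cosine similarity between bag-of-words vectors — the documents, tokenizer, and scores here are illustrative, not the demo’s actual pipeline):

```python
import math
from collections import Counter

def bag_of_words(text):
    """Naive tokenizer: lowercase, split on whitespace."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical "reference" document (standing in for a model trained on
# a large corpus) and two incoming tweets to score against it.
reference = bag_of_words("spark cluster computing engine for big data")
tweet_a = bag_of_words("trying spark for big data computing")
tweet_b = bag_of_words("nice weather today")

# The on-topic tweet scores higher, so a dashboard could filter on this.
print(cosine_similarity(reference, tweet_a)
      > cosine_similarity(reference, tweet_b))  # -> True
```

A real MLlib pipeline would compute feature vectors and similarities across the cluster; the scoring idea is the same.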
The development plan is sound, and there is a ton of corporate support for Spark from Cloudera, Hortonworks, Databricks, SAP, IBM, and others. If time permits, I will summarize some of the other keynotes and sessions.