Everything That You Wanted to Know about Spark and Were Afraid to Ask


Last October, I had the opportunity to attend the Data Science Camp 2015 of the San Francisco Bay ACM. The keynote of the conference was given by Joseph Bradley, a software engineer at Databricks and an Apache Spark committer. Joseph's presentation about Spark is quite thorough and easy to understand, so it is well worth watching.

Below are the abstract and the video of his talk:

“Spark’s wide adoption largely stems from allowing fast, iterative analysis, both on a laptop and on large computing clusters. This interactivity has led many data scientists to adopt Spark for both exploratory analysis and production modeling and scoring.

In response, the Spark community has been working on key features to further improve the experience of data scientists. This talk will highlight some of these features, mention use cases, and discuss recent and ongoing work on optimizations and extended functionality. Spark DataFrames, introduced in Spark 1.3, allow manipulation of distributed data using a friendly API inspired by R and Python pandas. Machine Learning Pipelines, introduced in Spark 1.2, facilitate construction of ML workflows and model tuning. Spark R, shipped with Spark 1.4, provides an API for R users to work with distributed data, and we continue work towards feature parity for the R API. For each of these items, we are working on improving integrations with familiar data science tools such as R and Python dataframes and scikit-learn. Initial PMML support, added in Spark 1.4, allows users to export models to other tools and deployments. This talk will be accessible to new Spark users, and will also provide insights, references, and tips helpful for experienced users.”

Note: The picture above is the logo of Apache Spark.

Copyright © 2005-2016 by Serge-Paul Carrasco. All rights reserved.
Contact Us: asvinsider at gmail dot com.