Extensible Interfaces to Spark from R

Apache Spark

Apache Spark: "A fast and general engine for large-scale data processing"; "A fast and general-purpose cluster computing system"
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel
Distributed DataFrame API based on R and pandas data frames.
Open source and supported by many vendors, including Microsoft, IBM, Intel, Google, Cloudera, Hortonworks, and Databricks.
IBM announced last year that it would dedicate 3,500 people to Spark-related projects (it is now the #2 committer to Spark after Databricks).

Spark APIs

Scala/Java API (traditional object-oriented API)
pyspark: "The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, we recommend reading through the Scala programming guide first; it should be easy to follow even if you don’t know Scala. This guide will show how to use the Spark features described there in Python."
What should the R interface look like?

R as Interface Language

Optimized for interactive exploration: many language features are aimed at productive REPL usage.
Generic functions: S3 dispatch provides uniform interfaces to all objects for inspection, plotting, etc.
Functional language with immutable data: promotes trustworthy computing.
Non-standard evaluation for meta-programming: ideal for creating DSLs like the R formula interface, dplyr, etc.
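
To make the non-standard evaluation point concrete, here is a minimal base R sketch. The helper `keep_rows` and the sample data are hypothetical and only for illustration; the idea is the same one behind base R's `subset` and dplyr's `filter`, where a bare expression is captured and evaluated against the columns of a data frame.

```r
# Hypothetical helper: keep the rows of `data` for which `condition` is TRUE.
# `condition` is captured unevaluated, then evaluated with the columns of
# `data` in scope, falling back to the caller's variables.
keep_rows <- function(data, condition) {
  expr <- substitute(condition)             # capture the expression, not its value
  keep <- eval(expr, data, parent.frame())  # look up names in data first, then the caller
  data[!is.na(keep) & keep, , drop = FALSE]
}

cars <- data.frame(
  model = c("a", "b", "c"),
  mpg = c(21, 33, 18),
  stringsAsFactors = FALSE
)
threshold <- 20

# `mpg` resolves to a column, `threshold` to the variable defined above.
efficient <- keep_rows(cars, mpg > threshold)
print(efficient)
```

Because the expression is captured rather than evaluated, an interface could just as well translate it into a query for a remote system instead of running it locally.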

The R interface should provide high-level facades for the tasks users want to undertake with Spark that are consistent with base R semantics and take advantage of its strengths as an interface language.

Demo

Aspirations

Data Frame Interfaces:
- R data frame API, dplyr, data.table
- We want to use these interfaces for remote data and local data (i.e. please don't mask our local interface in the service of providing a remote interface!)
Distributed Machine Learning:
- High-level functional interfaces to distributed machine learning that play well with R generics like print, predict, summary, residuals, fitted, etc.
Distributed Parallel Execution:
- ddR package (unified R interface for writing parallel and distributed applications): https://github.com/vertica/ddR
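
As a sketch of what "playing well with R generics" could mean for the machine learning aspiration above, the following base R example defines a toy model class with S3 methods. Everything here is hypothetical: `toy_regression` fits a one-variable least-squares line locally and merely stands in for a model that would be fitted on the cluster. The point is only that the returned object responds to `print`, `predict`, `fitted`, and `residuals` like any other R model.

```r
# Hypothetical constructor: fits y = intercept + slope * x by least squares.
# A Spark-backed version would delegate the fit to the cluster and return
# an object of the same shape.
toy_regression <- function(x, y) {
  slope <- cov(x, y) / var(x)
  intercept <- mean(y) - slope * mean(x)
  structure(
    list(coefficients = c(intercept = intercept, slope = slope), x = x, y = y),
    class = "toy_regression"
  )
}

# S3 methods: the standard generics dispatch on the class attribute.
print.toy_regression <- function(x, ...) {
  cat("<toy_regression>\n")
  print(x$coefficients)
  invisible(x)
}

predict.toy_regression <- function(object, newdata = object$x, ...) {
  coefs <- object$coefficients
  unname(coefs["intercept"] + coefs["slope"] * newdata)
}

fitted.toy_regression <- function(object, ...) predict(object)

residuals.toy_regression <- function(object, ...) object$y - fitted(object)

model <- toy_regression(x = c(1, 2, 3, 4), y = c(3, 5, 7, 9))
print(model)
predictions <- predict(model, newdata = c(5, 6))
```

Users never call the class-specific functions directly; they use the generics they already know, which is what keeps the interface consistent with base R.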

Evolving SparkR

Break core RPC layer into new package: sparkapi
- Exposes the core R-to-Java RPC bridge publicly and makes it possible to write extensions that call arbitrary Spark APIs and packages
New package that provides a dplyr interface to Spark DataFrames: sparklyr
- Also provides Spark MLlib interface that works within dplyr pipelines
- Use of dplyr is optional so extensions that provide alternate data frame interfaces can still call MLlib functions.
Extensions would also be compatible with SparkR, provided SparkR supports sparkapi (this is not in our control).

Demo

sparkapi Package

Function	Description
spark_connection	Get the Spark connection associated with an object (S3)
spark_jobj	Get the Spark jobj associated with an object (S3)
spark_dataframe	Get the Spark DataFrame associated with an object (S3)
spark_context	Get the SparkContext for a `spark_connection`
hive_context	Get the HiveContext for a `spark_connection`
invoke	Call a method on an object
invoke_new	Create a new object by invoking a constructor
invoke_static	Call a static method on a class
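
The functions in the table might compose into an extension roughly as follows. This is a sketch under assumptions, not a definitive usage guide: the helper names (`count_lines`, `big_integer`, `hypotenuse`) are hypothetical, the argument order for `invoke`, `invoke_new`, and `invoke_static` is assumed to be target first, then class or method name, then arguments, and `sc` is assumed to be an already-open Spark connection (how it is created is not covered by the table). The JVM classes and methods named in the strings are standard Spark and Java APIs.

```r
# Assumes a package exporting the functions from the table is attached and
# that `sc` is an open Spark connection. Only definitions appear here, so
# nothing contacts Spark until one of the helpers is called.

# Hypothetical helper: count the lines of a text file via the SparkContext.
count_lines <- function(sc, path) {
  ctx <- spark_context(sc)                  # SparkContext for this connection
  rdd <- invoke(ctx, "textFile", path, 1L)  # SparkContext.textFile(path, minPartitions)
  invoke(rdd, "count")                      # RDD.count()
}

# Hypothetical helper: construct a JVM object, here new BigInteger(String).
big_integer <- function(sc, digits) {
  invoke_new(sc, "java.math.BigInteger", digits)
}

# Hypothetical helper: call a static JVM method, here Math.hypot(double, double).
hypotenuse <- function(sc, a, b) {
  invoke_static(sc, "java.lang.Math", "hypot", a, b)
}
```

With a live connection, a call such as `count_lines(sc, "README.md")` would run entirely through the RPC bridge, which is the mechanism extension packages would build on.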

Distributed Parallel Execution

Nothing (yet) in sparkapi for distributing R computations to cluster nodes
Need to ascertain what common infrastructure is required for various projects (ddR, Tessera, hmr, etc.)
Need help to define and build these interfaces

Next Steps

Community review of sparkapi package: is it possible to write the extensions we'd like to?
Apache Spark review of sparkapi: can we agree on a common extension API?
CRAN submissions of sparkapi and sparklyr