- Apache Spark: "A fast and general engine for large-scale data processing"; "A fast and general-purpose cluster computing system"
- Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel
- Spark also offers a distributed DataFrame API modeled on R and pandas data frames.
- Open source and supported by many vendors, including Microsoft, IBM, Intel, Google, Cloudera, Hortonworks, and Databricks.
- IBM announced last year that it would dedicate 3,500 people to Spark-related projects (it is now the #2 committer to Spark after Databricks).
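
To make the RDD idea above more concrete, here is a minimal pure-Python sketch of a partitioned collection that records its transformations lazily and can rebuild any single partition from that record. This is a toy model, not Spark's API: the `ToyRDD` class, its methods, and the round-robin partitioning are assumptions made for illustration, and a thread pool stands in for cluster executors.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """Hypothetical stand-in for an RDD: source partitions plus a lineage of steps."""

    def __init__(self, source_partitions, lineage=()):
        self._source = source_partitions  # list of lists, never mutated
        self._lineage = tuple(lineage)    # functions mapping a partition to a partition

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Round-robin split (an illustrative choice, not necessarily what Spark does).
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Lazy: only record the step; nothing is computed yet.
        step = lambda part: [fn(x) for x in part]
        return ToyRDD(self._source, self._lineage + (step,))

    def filter(self, pred):
        step = lambda part: [x for x in part if pred(x)]
        return ToyRDD(self._source, self._lineage + (step,))

    def compute_partition(self, index):
        # Replay the recorded lineage on one source partition. Because this
        # depends only on the source data and the lineage, a "lost" result
        # can be rebuilt without touching the other partitions.
        part = self._source[index]
        for step in self._lineage:
            part = step(part)
        return part

    def collect(self):
        # Partitions are independent, so they can be computed in parallel.
        with ThreadPoolExecutor() as pool:
            parts = pool.map(self.compute_partition, range(len(self._source)))
        return [x for part in parts for x in part]


numbers = ToyRDD.parallelize(range(10), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = sorted(even_squares.collect())
# Simulate losing partition 1 and rebuilding it from lineage alone.
rebuilt = even_squares.compute_partition(1)
```

In this sketch, fault tolerance comes from replaying the recorded steps against the unchanged source partition rather than from keeping copies of intermediate results, which is the general idea behind lineage-based recovery.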