Spark and Machine Learning

Apache Spark is an open-source cluster-computing framework. Spark improves on earlier MapReduce implementations through Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. To schedule and run jobs on the nodes in the cluster, Spark uses Akka, a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
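The core RDD idea above, lazy transformations whose recorded lineage lets a lost result be recomputed rather than restored from a checkpoint, can be sketched in plain Python. This is an illustrative toy, not Spark's actual API; the class and method names are invented for the example:

```python
# Toy illustration of the RDD concept (NOT Spark's real API):
# each "RDD" records its lineage (parent + transformation) and
# evaluates lazily, so results can always be recomputed from source.

class ToyRDD:
    def __init__(self, data=None, parent=None, transform=None):
        self._data = data            # only set on a source RDD
        self._parent = parent        # lineage: the parent RDD
        self._transform = transform  # lineage: how to derive from the parent

    def map(self, fn):
        # Lazy: nothing is computed yet, only lineage is recorded.
        return ToyRDD(parent=self, transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Walk the lineage back to the source data, then replay transforms.
        if self._parent is None:
            return list(self._data)
        return self._transform(self._parent.collect())

nums = ToyRDD(data=range(10))
evens_squared = nums.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
print(evens_squared.collect())  # → [0, 4, 16, 36, 64]
```

Because `collect()` can always replay the lineage from the source, a lost intermediate result is recomputed on demand; real Spark applies the same principle per partition across the cluster, keeping hot data in memory.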

Fusion manages a Spark cluster that is used for all signal aggregation processes. As of Fusion 2.4, this Spark cluster can also be used to train and compile machine learning models and to run experiment-management tasks via the Spark Jobs API.

The topics covered in this section are: