Spark Concepts and Terminology

The following schematic shows the Spark components available from Fusion:

Spark Processes in Fusion

Spark Fusion Terminology

  • application: an active SparkContext in Spark Master UI, which consists of a classpath and configuration. Jobs submitted to the cluster always run as classes in a specific application, ie. using the application’s classpath and configuration.

  • SparkDriver: JVM process launched by the Fusion API service to execute Fusion jobs in Spark. SparkDriver creates and manages SparkContext for the Fusion application, and stops SparkContext when it’s no longer needed.

  • spark-master: Agent-managed Fusion service that coordinates worker processes and applications in a Spark cluster. You should run at least 2 spark-master processes per cluster to achieve high-availability. ZooKeeper determines which spark-master process is the leader and handles fail-over.

  • spark-worker: Agent-managed Fusion service that launches executors for Spark applications. Spark-workers communicate with the master to launch executors for an application.

  • spark-sql-engine: Agent-managed Fusion service that runs Spark’s thrift-based SQL engine. Provides JDBC access to a Spark cluster.

  • spark-shell: Wrapper script provided with Fusion to launch the Spark Scala REPL shell with the correct master URL (pulled from Fusion’s API) and shaded Fusion JAR added.

  • CoarseGrainedExecutorBackend: Executor process(es) launched by a spark-worker to execute the tasks for a specific application, such as the spark-shell.

  • dynamic allocation: Spark configuration setting that allows the Spark master to reclaim allocated resources (cpu & memory) from an "application" if it is not using the resources.

  • shaded JAR: The Fusion API service creates an uber-jar containing all of the dependencies needed to use spark-solr and Fusion classes within a spark job. The classes that conflict with classes on spark’s classpath are shaded to ensure that Fusion classes use the correct version.

  • akka: Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM. Akka uses the Actor model to hide all the thread-related code and provides simple interfaces which allow you to more easily implement a scalable and fault-tolerant system. Spark is built on top of Akka.

Spark Fusion Directories

In Fusion 3.0, the following directories are used for Spark logs and components:

Fusion 3.0 Directories used for Spark