Spark Driver Processes

A Spark "driver" is an application that creates a SparkContext for executing one or more jobs in the Spark cluster. The following diagram depicts the driver’s role in a Spark cluster:

Spark driver

In the diagram above, the spark-master service in Fusion is the Cluster Manager.

If your Spark job performs any collect operations, then the result of the collect (or collectAsMap) is sent back to the driver from all the executors. Consequently, if the result of the collect is too big too fit into memory, you will encounter OOM issues (or other memory related problems) when running your job.

All Fusion jobs run on Spark using a driver process started by the API service. There are three types of drivers in Fusion:

  • Default driver: Executes built-in Fusion jobs, such as a signal aggregation job or a metrics rollup job.

  • Scripted job driver: Executes custom script jobs; a separate driver is needed to isolate the classpath for custom Scala scripts.

  • Spark-shell driver: Launch using fusion/bin/spark-shell.

Default Driver

Navigate to the Fusion UI and select the system metrics collection in the UI. Select one of the built-in Aggregation jobs, such as hourlyMetricsRollup-counters. In the diagram above, the spark-master service in Fusion is the Cluster Manager.

You must delete any existing driver applications before launching the job. Even if you haven’t started any jobs by hand, Fusion’s API service may have started one automatically, because Fusion ships with built-in jobs that run in the background which perform rollups of metrics in the system_metrics collection. Therefore, before you try to launch a job, you should run the following command:

> curl -XDELETE http://localhost:8764/api/spark/driver

Wait a few seconds and use the Spark UI to verify that no Fusion-spark application (e.g., "Fusion-20161005224611") is running.

In a terminal window or windows, set up a tail -f job on the api and spark-driver-default logs

$ cd /path/to/fusion
$ tail -f var/log/api/api.log var/log/api/spark-driver-default.log

Now, start any aggregation job from the UI. It doesn’t matter whether or not this job performs any work; the goal of this exercise is to show what happens in Fusion and Spark when you run an aggregation job. You should see activity in both logs related to starting the driver application and running the selected job. The Spark UI will now show a Fusion-spark app:

Spark default driver

Use the ps command to get more details on this process:

$ ps waux | grep SparkDriver

The output should show that the Fusion SparkDriver is a JVM process started by the API service; it is not managed by the Fusion agent. Within a few minutes, the Spark UI will update itself:

Spark default driver dynamic allocation

Notice that the application no longer has any cores allocated and that all of the memory available is not being used (0.0B Used of 2.0 GB Total). This is because we launch our driver applications with spark.dynamicAllocation.enabled=true. This setting allows the Spark master to reclaim CPU & memory from an application if it is not actively using the resources allocated to it.

Both driver processes (default and scripted) manage a SparkContext. For the default driver, the SparkContext will be shut down after waiting a configurable (fusion.spark.idleTime: default 5 mins) idle time. The scripted driver shuts down the SparkContext after every scripted job is run to avoid classpath pollution between jobs.

Scripted Driver

Script jobs require a separate driver to isolate the classpath for custom Scala scripts, as well as to isolate the classpath between the jobs, so that classes compiled from scripts don’t pollute the classpath for subsequent scripted jobs. For this reason the SparkContext that each scripted job uses is immediately shut down after the job is finished and recreated for new jobs. This adds some startup overhead for scripted jobs. Refer to the apachelogs lab in the Fusion Spark Bootcamp project for a complete example of a custom script job.

To troubleshoot problems with a script job, start by looking for errors in the script driver log fusion/var/log/api/spark-driver-scripted.log.

Spark Drivers in a Multi-node Cluster

To find out which node is running the Spark driver which node is running the driver when running a multi-node Fusion deployment which has several nodes running Fusion’s API services, you can query the driver status via the following call to the Spark Jobs API endpoint:

$ curl http://localhost:8764/api/spark/driver/status

This returns a status report:

  "/spark-drivers/15797426d56T537184c2" : {
    "id" : "15797426d56T537184c2",
    "hostname" : "",
    "port" : 8601,
    "scripted" : false
This endpoint only exists in 3.x and 2.4.3 and higher.