Spark Configuration

Spark has a number of configuration properties. In this section, we’ll cover some of the key settings you’ll need to use Fusion’s Spark integration.

For the full set of Fusion’s Spark-related configuration properties, see the Spark Jobs API.

Spark Master / Worker Resource Allocation

Note
If you co-locate Spark workers and Solr nodes on the same server, be sure to reserve some CPU for Solr, so that a compute-intensive Spark job cannot starve Solr of CPU resources.

Number of Cores Allocated

When a worker process joins the Spark cluster, it looks in the Fusion configuration (stored in Fusion’s ZooKeeper) for the property spark.worker.cores (Fusion 2.4) or fusion.spark.worker.cores (Fusion 3.x). If this setting is unspecified, the worker uses all available cores.
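To check the current value before changing it, you can read the property back from the Fusion configuration API (a sketch, assuming the configurations endpoint accepts GET requests; shown with the Fusion 3.x property name):

$ curl -u <name>:<password> http://localhost:8764/api/apollo/configurations/fusion.spark.worker.cores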

For example, in the screenshot below, we have a 3-node cluster, where each worker uses all available cores (8) for a total of 24 cores:

Spark cores

To change the CPU usage per worker, you need to use the Fusion configuration API to update this setting, as in the following examples.

Fusion 2.4:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '6' \
http://localhost:8764/api/apollo/configurations/spark.worker.cores

Fusion 3.0:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '6' \
http://localhost:8764/api/apollo/configurations/fusion.spark.worker.cores

You can also over-allocate cores to a Spark worker, which is usually recommended for hyper-threaded cores. To do so, set the property spark-worker.envVars to SPARK_WORKER_CORES=<number of cores> in the fusion.properties file on all nodes hosting a spark-worker. For example, an r4.2xlarge instance in EC2 has 8 CPU cores, but the following configuration will improve utilization and performance:

spark-worker.envVars=SPARK_WORKER_CORES=16

After making this change to your spark worker nodes, you must restart the spark-worker process on each node:

$ cd /path/to/fusion/3.0.x
$ bin/spark-worker restart
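If you have many worker nodes, a small shell loop saves repeating this step by hand (a sketch only: the host names are placeholders, and it assumes Fusion is installed at the same path on every node):

$ for host in spark-node1 spark-node2 spark-node3; do
    ssh "$host" 'cd /path/to/fusion/3.0.x && bin/spark-worker restart'
  done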

Memory Allocation

The amount of memory allocated to each worker process is controlled by the Fusion property fusion.spark.worker.memory, which specifies the total amount of memory available for all executors spun up by that Spark worker process.

Fusion 3.0:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
http://localhost:8764/api/apollo/configurations/fusion.spark.worker.memory

Fusion 2.4:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
http://localhost:8764/api/apollo/configurations/spark.worker.memory

The Spark worker process manages executors for multiple jobs running concurrently. For certain types of aggregation jobs, you can also configure the per-executor memory, but this may affect how many jobs you can run concurrently in your cluster. Unless it is explicitly specified using the parameter spark.executor.memory, Fusion calculates the amount of memory that can be allocated to each executor.

In Fusion 3.x, aggregation Spark jobs get half of the memory assigned to the workers. This is controlled by the fusion.spark.executor.memory.fraction property, which defaults to 0.5. For example, Spark workers in 3.x have 4 GB of memory by default, so the executors for aggregation Spark jobs are assigned 2 GB each. To let Fusion aggregation jobs use more of the workers' memory, increase the fusion.spark.executor.memory.fraction property to 1. Use this property instead of the Spark executor memory property.
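In other words, per-executor memory is simply the worker memory multiplied by the fraction. A quick sanity check of the default allocation described above (4 GB workers, 0.5 fraction), using plain shell arithmetic:

$ awk 'BEGIN { printf "%.1f GB per executor\n", 4 * 0.5 }'
2.0 GB per executor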

Fusion 3.0:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '1' \
http://localhost:8764/api/apollo/configurations/fusion.spark.executor.memory.fraction

In Fusion 2.4, you must set the per-executor memory directly via the property spark.executor.memory:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
http://localhost:8764/api/apollo/configurations/spark.executor.memory

After making these changes and restarting the workers, when we run a Fusion job, we get the following (screenshots taken from a running instance of Fusion 2.4):

Spark cores

Cores per Driver Allocation

The configuration property fusion.spark.cores.fraction allows you to limit the number of cores used by the Fusion driver applications (default and scripted). For instance, in the screenshot above, we see 18 total CPUs available.

We set the cores fraction property to 0.5 via the following command:

$ curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '0.5' \
http://localhost:8764/api/apollo/configurations/fusion.spark.cores.fraction

This cuts the number of available cores in half, as shown in the following screenshot:

Spark cores

Ports used by Spark in Fusion

The following table shows the default port numbers used by Spark processes. If a port number is not available, Spark uses the next available port by incrementing the assigned port number by one. For example, if 4040 is not available, Spark uses 4041 (if available; otherwise 4042, and so on).

Process                          Port number
Spark master web UI              8767
Spark worker web UI              8082
SparkContext web UI              4040
Spark master listening port      8766
Spark worker listening port      8769
Spark driver listening port      random (spark.driver.port)
Spark executor listening port    random (spark.executor.port)
Spark block manager port         random (spark.blockManager.port)
Spark file server port           random (spark.fileserver.port)
Spark REPL class server port     random (spark.replClassServer.port)
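To check whether one of these defaults is already bound on a host (and therefore whether Spark will shift to the next port up), you can look for an existing listener. A sketch using lsof, which must be installed:

$ for port in 8767 8082 4040 8766 8769; do
    echo "port $port:"; lsof -iTCP:$port -sTCP:LISTEN
  done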

Directories and Temporary Files

Shaded jar file

The shaded jar file is downloaded to the var/api/work directory. If one of the jars used by the API service has changed, a new shaded jar is created with an updated name.

Temporary work directories

Temporary work directories (spark-workDir-*) are created in var/ while an application is running. They are removed after the driver is shut down or closed.
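To see which applications currently have work directories on a node (a sketch; adjust the path to match your installation):

$ ls -d /path/to/fusion/3.0.x/var/spark-workDir-* 2>/dev/null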

Fusion 3.0 Directories used for Spark

Connection Configurations for an SSL-enabled Solr cluster

You’ll need to set these Java system properties used by SolrJ:

  • javax.net.ssl.trustStore

  • javax.net.ssl.trustStorePassword

  • javax.net.ssl.trustStoreType

Set them via the following Spark configuration properties:

  • spark.executor.extraJavaOptions

  • fusion.spark.driver.jvmArgs

  • spark.driver.extraJavaOptions

$ curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/spark.executor.extraJavaOptions"

$ curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/fusion.spark.driver.jvmArgs"

$ curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/spark.driver.extraJavaOptions"