- Spark Master / Worker Resource Allocation
  - Number of Cores Allocated
  - Memory Allocation
  - Cores per Driver Allocation
- Ports used by Spark in Fusion
- Directories and Temporary Files
  - Shaded jar file
  - Temporary work directories
- Connection Configurations for an SSL-enabled Solr cluster
Spark has a number of configuration properties. In this section, we’ll cover some of the key settings you need in order to use Fusion’s Spark integration.
For the full set of Fusion’s Spark-related configuration properties, see the Spark Jobs API.
Spark Master / Worker Resource Allocation
Note: If you co-locate Spark workers and Solr nodes on the same server, be sure to reserve some CPU for Solr so that a compute-intensive Spark job does not starve Solr of CPU resources.
Number of Cores Allocated
When a worker process joins the Spark cluster, it looks in the Fusion configuration (stored in Fusion’s ZooKeeper) for spark.worker.cores (Fusion 2.4) or fusion.spark.worker.cores (Fusion 3.x). If this setting is left unspecified, the worker uses all available cores.
For example, in the screenshot below, we have a 3-node cluster, where each worker uses all available cores (8) for a total of 24 cores:
To change the CPU usage per worker, you need to use the Fusion configuration API to update this setting, as in the following examples.
Fusion 2.4:
> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '6' \
  http://localhost:8764/api/apollo/configurations/spark.worker.cores

Fusion 3.x:
> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '6' \
  http://localhost:8764/api/apollo/configurations/fusion.spark.worker.cores
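To confirm the new value, you can read it back from the same configurations endpoint (a quick check, assuming the endpoint also supports GET; shown here for the Fusion 3.x property name):

> curl -u <name>:<password> http://localhost:8764/api/apollo/configurations/fusion.spark.worker.cores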
You can also over-allocate cores to a Spark worker, which is usually recommended for hyper-threaded cores, by setting the property SPARK_WORKER_CORES=<number of cores> in the fusion.properties file on all nodes hosting a Spark worker. For example, an r4.2xlarge instance in EC2 has 8 CPU cores, but the following configuration will improve utilization and performance:
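A setting of roughly twice the physical core count is one way to take advantage of hyper-threading (the value below is illustrative; adjust it for your hardware):

# illustrative over-allocation: about 2x the 8 cores of an r4.2xlarge
SPARK_WORKER_CORES=16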
After making this change on your Spark worker nodes, you must restart the spark-worker process on each one:

> cd $FUSION_HOME
> bin/spark-worker restart
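To confirm the worker came back up, the same script can report its status (assuming the standard start/stop/status/restart options of Fusion’s service scripts):

> bin/spark-worker status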
Memory Allocation
The amount of memory allocated to each worker process is controlled by the Fusion property fusion.spark.worker.memory, which specifies the total amount of memory available for all executors spun up by that Spark worker process.

Fusion 3.x:
> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
  http://localhost:8764/api/apollo/configurations/fusion.spark.worker.memory

Fusion 2.4:
> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
  http://localhost:8764/api/apollo/configurations/spark.worker.memory
The Spark worker process manages executors for multiple jobs running concurrently. For certain types of aggregation jobs you can also configure the per-executor memory, but this may impact how many jobs you can run concurrently in your cluster. Unless it is explicitly specified using the spark.executor.memory parameter, Fusion calculates the amount of memory that can be allocated to each executor.
In Fusion 3.x, aggregation Spark jobs always get half of the memory assigned to the workers. This is controlled by the fusion.spark.executor.memory.fraction property, which defaults to 0.5. For example, Spark workers in 3.x have 4 GB of memory by default, so the executors for aggregation Spark jobs are assigned 2 GB each.

To let Fusion aggregation jobs use more of the workers’ memory, increase fusion.spark.executor.memory.fraction to 1. Use this property instead of the Spark executor memory property (spark.executor.memory). With the fraction set to 1, for instance, an 8 GB worker gives its executor the full 8 GB rather than 4 GB.

> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '1' \
  http://localhost:8764/api/apollo/configurations/fusion.spark.executor.memory.fraction
In Fusion 2.4, you must set the per-executor memory directly using the spark.executor.memory property:

> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '8g' \
  http://localhost:8764/api/apollo/configurations/spark.executor.memory
After making these changes and restarting the workers, when we run a Fusion job, we get the following (screenshots taken from a running instance of Fusion 2.4):
Cores per Driver Allocation
The configuration property
fusion.spark.cores.fraction allows you to limit the number of cores used by the Fusion driver applications (default and scripted).
For instance, in the screenshot above, we see 18 total CPUs available.
We set the cores fraction property to 0.5 via the following command:
> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '0.5' \
  http://localhost:8764/api/apollo/configurations/fusion.spark.cores.fraction
This cuts the number of cores available to the driver applications in half (18 × 0.5 = 9), as shown in the following screenshot:
Ports used by Spark in Fusion
The following list shows the default ports used by Spark processes. If a port is not available, Spark uses the next port up from the assigned one. For example, if 4040 is not available, Spark uses 4041 (if available, or 4042, and so on).
- Spark master web UI
- Spark worker web UI
- SparkContext web UI
- Spark master listening port
- Spark worker listening port
- Spark driver listening port
- Spark executor listening port
- Spark blockmanager port
- Spark file server port
- Spark REPL class server port: random, can be overridden through spark.replClassServer.port
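If you prefer to pin a port rather than rely on the “+1” fallback, the standard Spark port properties can be set through the same configurations API shown earlier. As a sketch, assuming spark.ui.port is passed through to the driver like the other Spark properties above, this pins the SparkContext web UI to port 4050:

> curl -u <name>:<password> -H 'Content-type:application/json' -X PUT -d '4050' \
  http://localhost:8764/api/apollo/configurations/spark.ui.port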
Directories and Temporary Files
Shaded jar file
The shaded jar file is downloaded to the
If one of the jars in the API service has changed, a new shaded jar is created with an updated name.
Temporary work directories
Temporary work directories (spark-workDir-*) are created under the var/ directory while an application is running. They are removed after the driver is shut down or closed.
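If disk usage under var/ grows unexpectedly, you can check for work directories left behind by drivers that did not shut down cleanly (a simple check, assuming the default Fusion home layout):

> cd $FUSION_HOME
> ls -d var/spark-workDir-*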
Connection Configurations for an SSL-enabled Solr cluster
You’ll need to set the Java system properties used by SolrJ (javax.net.ssl.trustStore, javax.net.ssl.trustStorePassword, and javax.net.ssl.trustStoreType) for the following Spark configuration properties: spark.executor.extraJavaOptions, fusion.spark.driver.jvmArgs, and spark.driver.extraJavaOptions.
> curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/spark.executor.extraJavaOptions"

> curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/fusion.spark.driver.jvmArgs"

> curl -H 'Content-type:application/json' -X PUT \
  -d '-Djavax.net.ssl.trustStore=/opt/app/jobs/ssl/solrtrust.jks -Djavax.net.ssl.trustStorePassword=changeit -Djavax.net.ssl.trustStoreType=jks' \
  "http://localhost:8765/api/v1/configurations/spark.driver.extraJavaOptions"
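It can also help to confirm that the truststore referenced above is readable and contains the expected Solr certificates. For example, using the JDK’s keytool with the path and password from the commands above:

> keytool -list -keystore /opt/app/jobs/ssl/solrtrust.jks -storepass changeit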