Spark Administration in Kubernetes

In Fusion 5.0, Spark operates in native Kubernetes mode instead of the standalone mode used in Fusion 4.x. The sections below describe Spark operations in Fusion 5.0.

Cluster Mode

Fusion 5.0 ships with Spark 2.4.3 and operates in "cluster" mode on top of Kubernetes. In cluster mode, each Spark driver runs in a separate pod, so resources can be managed on a per-job basis. Each executor also runs in its own pod.
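Because each driver and executor is its own pod, you can observe a job's footprint directly with kubectl. A minimal sketch, assuming the standard spark-role labels that Spark on Kubernetes applies to the pods it creates:

    # List driver pods (one per running Spark job)
    kubectl get pods -l spark-role=driver

    # List the executor pods spawned for those jobs
    kubectl get pods -l spark-role=executor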

Spark config defaults

The tables below show the default configurations for Spark. These settings are stored in the job-launcher config map, accessible using kubectl get configmaps <release-name>-job-launcher. Some of these settings are also configurable via Helm.
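For example, to view the full contents of that config map, including the Spark defaults (replace <release-name> with your Helm release name):

    # Dump the job-launcher config map as YAML
    kubectl get configmaps <release-name>-job-launcher -o yaml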

Table 1. Spark Resource Configurations

| Spark Configuration                     | Default value | Helm Variable     |
|-----------------------------------------|---------------|-------------------|
| spark.driver.memory                     | 3g            |                   |
| spark.executor.instances                | 2             | executorInstances |
| spark.executor.memory                   | 3g            |                   |
| spark.executor.cores                    | 6             |                   |
| spark.kubernetes.executor.request.cores | 3             |                   |

Table 2. Spark Kubernetes Configurations

| Spark Configuration                                      | Default value                                   | Helm Variable          |
|----------------------------------------------------------|-------------------------------------------------|------------------------|
| spark.kubernetes.container.image.pullPolicy              | Always                                          | image.imagePullPolicy  |
| spark.kubernetes.container.image.pullSecrets             | [artifactory]                                   | image.imagePullSecrets |
| spark.kubernetes.authenticate.driver.serviceAccountName  | <name>-job-launcher-spark                       |                        |
| spark.kubernetes.driver.container.image                  | fusion-dev-docker.ci-artifactory.lucidworks.com | image.repository       |
| spark.kubernetes.executor.container.image                | fusion-dev-docker.ci-artifactory.lucidworks.com | image.repository       |
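Where a Helm variable exists, the corresponding default can be overridden at install or upgrade time. A minimal sketch, assuming the variables live under a job-launcher subchart of the Fusion chart (the chart name and key prefix here are assumptions, not confirmed by this page):

    # Override the executor count and image pull policy for the job-launcher
    helm upgrade <release-name> lucidworks/fusion \
      --set job-launcher.executorInstances=4 \
      --set job-launcher.image.imagePullPolicy=IfNotPresent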

Spark Job Resource Allocation

Number of Instances and Cores Allocated

To set the number of cores allocated for a job, add the following parameter keys and values in the Spark Settings field under the "advanced" job properties in the Fusion UI, or in the sparkConfig object if defining a job via the Fusion API (an example API call follows the table below).

If spark.kubernetes.executor.request.cores is not set (the default), Spark requests the same number of CPUs for the executor pod as spark.executor.cores. In that case, if spark.executor.cores is 3, Spark allocates 3 CPUs for the executor pod and runs 3 tasks in parallel. To under-allocate CPU for the executor pod while still running multiple tasks in parallel, set spark.kubernetes.executor.request.cores to a value lower than spark.executor.cores. The appropriate ratio of spark.kubernetes.executor.request.cores to spark.executor.cores depends on whether the job is CPU-bound or I/O-bound: an I/O-bound job can tolerate a lower ratio because its tasks spend much of their time waiting rather than computing. If more tasks run in parallel on a single executor pod, allocate more memory to the executor.

| Parameter Key                           | Parameter Example Value |
|-----------------------------------------|-------------------------|
| spark.executor.instances                | 3                       |
| spark.kubernetes.executor.request.cores | 3                       |
| spark.executor.cores                    | 6                       |
| spark.driver.cores                      | 1                       |
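As referenced above, here is a minimal sketch of setting these values through the Fusion API. The endpoint path, host, credentials, and the exact shape of the sparkConfig object (shown here as a list of key/value pairs) are illustrative assumptions, not confirmed by this page:

    # Hypothetical sketch: update a job definition with explicit core and instance settings
    curl -u USERNAME:PASSWORD -X PUT -H 'Content-Type: application/json' \
      'https://FUSION_HOST/api/spark/jobs/my-job' \
      -d '{
        "sparkConfig": [
          {"key": "spark.executor.instances", "value": "3"},
          {"key": "spark.kubernetes.executor.request.cores", "value": "3"},
          {"key": "spark.executor.cores", "value": "6"},
          {"key": "spark.driver.cores", "value": "1"}
        ]
      }'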

If these settings are left unspecified, the job launches with a driver that uses one core and 3GB of memory, plus two executors, each using one core and 1GB of memory.

Memory Allocation

The amount of memory allocated to the driver and executors is controlled on a per-job basis using the spark.executor.memory and spark.driver.memory parameters, set in the Spark Settings section of the job definition in the Fusion UI or within the sparkConfig object in the JSON definition of the job. To confirm what was actually requested for a running job's pods, see the sketch after the table below.

| Parameter Key         | Parameter Example Value |
|-----------------------|-------------------------|
| spark.executor.memory | 6g                      |
| spark.driver.memory   | 2g                      |
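As noted above, once a job is running you can verify the CPU and memory requests on its pods. A sketch, again assuming the standard spark-role labels applied by Spark on Kubernetes; note that the pod's memory request is typically larger than spark.executor.memory, because Spark adds off-heap memory overhead on top of the heap size:

    # Show the CPU and memory requests of each executor pod
    kubectl get pods -l spark-role=executor \
      -o custom-columns='NAME:.metadata.name,CPU:.spec.containers[0].resources.requests.cpu,MEM:.spec.containers[0].resources.requests.memory'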