Use the label specified for the node as the LABEL, and the label's value as the LABEL_VALUE. For example, if a node is labeled with `fusion_node_type=spark_only`, schedule Spark executor pods to run on that node using:
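A sketch of the setting, assuming Spark on Kubernetes' standard `spark.kubernetes.node.selector.[labelKey]` property (it applies to both driver and executor pods; newer Spark versions also offer an executor-only variant):

```
spark.kubernetes.node.selector.fusion_node_type=spark_only
```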
To view the default Spark settings, run `kubectl get configmaps <release-name>-job-launcher`. Some of these settings are also configurable via Helm.
Spark Resource Configurations
Spark Configuration | Default value | Helm Variable |
---|---|---|
spark.driver.memory | 3g | |
spark.executor.instances | 2 | executorInstances |
spark.executor.memory | 3g | |
spark.executor.cores | 6 | |
spark.kubernetes.executor.request.cores | 3 | |
spark.sql.caseSensitive | true | |
Spark Configuration | Default value | Helm Variable |
---|---|---|
spark.kubernetes.container.image.pullPolicy | Always | image.imagePullPolicy |
spark.kubernetes.container.image.pullSecrets | | image.imagePullSecrets |
spark.kubernetes.authenticate.driver.serviceAccountName | <name>-job-launcher-spark | |
spark.kubernetes.driver.container.image | fusion-dev-docker.ci-artifactory.lucidworks.com | image.repository |
spark.kubernetes.executor.container.image | fusion-dev-docker.ci-artifactory.lucidworks.com | image.repository |
Configure Spark Job Resource Allocation
Configure Spark job resource allocation in the Spark Settings section of the Fusion UI, or in the `sparkConfig` object if defining a job via the Fusion API.
Parameter Key | Example Value |
---|---|
spark.executor.instances | 3 |
spark.kubernetes.executor.request.cores | 3 |
spark.executor.cores | 6 |
spark.driver.cores | 1 |
If `spark.kubernetes.executor.request.cores` is unset (the default configuration), Spark sets the number of CPUs for the executor pod to the same number as `spark.executor.cores`. For example, if `spark.executor.cores` is 3, Spark allocates 3 CPUs for the executor pod and runs 3 tasks in parallel. To under-allocate the CPU for the executor pod and still run multiple tasks in parallel, set `spark.kubernetes.executor.request.cores` to a lower value than `spark.executor.cores`. The appropriate ratio of `spark.kubernetes.executor.request.cores` to `spark.executor.cores` depends on the type of job: either CPU-bound or I/O-bound. Allocate more memory to the executor if more tasks are running in parallel on a single executor pod. If these settings are not specified, the job launches with a driver using one core and 3GB of memory, plus two executors, each using one core with 1GB of memory.
To configure memory allocation, set the `spark.executor.memory` and `spark.driver.memory` parameters in the Spark Settings section of the job definition. This is found in the Fusion UI or within the `sparkConfig` object in the JSON definition of the job.
Parameter Key | Example Value |
---|---|
spark.executor.memory | 6g |
spark.driver.memory | 2g |
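For illustration, here is a hypothetical job definition fragment combining the CPU and memory settings above. The exact shape of the `sparkConfig` object can vary by Fusion version, so treat this as a sketch rather than a canonical schema:

```json
{
  "id": "my-spark-job",
  "sparkConfig": [
    { "key": "spark.executor.instances", "value": "3" },
    { "key": "spark.kubernetes.executor.request.cores", "value": "3" },
    { "key": "spark.executor.cores", "value": "6" },
    { "key": "spark.driver.memory", "value": "2g" },
    { "key": "spark.executor.memory", "value": "6g" }
  ]
}
```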
Configure Spark Jobs to Access Cloud Storage
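For GCS, the service account key is stored in a Kubernetes secret. A sketch of creating it, using the placeholders described in the table below:

```bash
kubectl create secret generic <key name> --from-file=<key file path>
```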
Placeholder | Description |
---|---|
<key name> | Name of the Solr GCS service account key. |
<key file path> | Path to the Solr GCS service account key. |
To enable access to GCS, add the `gcp-launcher` config map to `values.yaml` under `job-launcher`:
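The exact layout depends on your chart version; a minimal sketch, assuming the chart accepts the `configMaps` list referenced later in this section:

```yaml
job-launcher:
  configMaps:
    - gcp-launcher
```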
To enable access to S3, add the `aws-launcher` config map to `values.yaml` under `job-launcher`:
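Mirroring the GCS sketch above, and again assuming the `configMaps` list; the config map itself would typically carry the standard Hadoop S3A credential properties, such as `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key`:

```yaml
job-launcher:
  configMaps:
    - aws-launcher
```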
Mounting a custom `core-site.xml` via `configMaps` isn't possible yet. Instead, manually upload the `core-site.xml` file into the `job-launcher` pod at `/app/spark-dist/conf`. See below for an example `core-site.xml` file:
Placeholder | Description |
---|---|
<key name> | Name of the Solr GCS service account key. |
<key file path> | Path to the Solr GCS service account key. |
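A sketch of such a file, assuming the Hadoop GCS connector's standard service-account properties and the placeholders from the table above:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>google.cloud.auth.service.account.json.keyfile</name>
    <value><key file path></value>
  </property>
</configuration>
```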
Get Logs for a Spark Job
Description | Command |
---|---|
Retrieve the initial logs that contain information about the pod spin-up. | curl -X GET -u USERNAME:PASSWORD http://FUSION_HOST:FUSION_PORT/api/spark/driver/log/JOB_ID |
Retrieve the pod ID. | kubectl get pods -l spark-role=driver -l jobConfigId=JOB_ID |
Retrieve logs from failed jobs. | kubectl logs DRIVER_POD_NAME |
Tail logs from running containers by using the -f parameter. | kubectl logs -f POD_NAME |
While a job is running, the Spark UI is available on port 4040 of the driver pod; forward a local port to the driver and open localhost:4040 in your browser.
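A sketch, reusing the driver pod name retrieved with the commands above:

```bash
kubectl port-forward DRIVER_POD_NAME 4040:4040
```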
Clean Up Spark Driver Pods
Spark driver pods are cleaned up automatically when the `job-launcher` microservice is installed in the Fusion cluster. To clean up pods manually, run a command like the following:
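This sketch deletes driver pods by the `spark-role=driver` label used in the log commands above; verify the selector matches your pods before running it:

```bash
kubectl delete pods -l spark-role=driver
```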
Install the Spark History Server
Although you can view logs with `kubectl logs [POD_NAME]`, executor pods are deleted at the end of their execution, and driver pods are deleted by Fusion on a default schedule of every hour. To preserve and view Spark logs, install the Spark History Server into your Kubernetes cluster and configure Spark to write logs in a manner that suits your needs. The Spark History Server can be installed via its publicly available Helm chart. To do this, create a `values.yaml` file to configure it:
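A sketch of a GCS-backed configuration, based on the chart options described in the next section; the secret name `history-secrets`, key file `sparkhistory.json`, and bucket are placeholders:

```yaml
gcs:
  enableGCS: true
  secret: history-secrets
  key: sparkhistory.json
  logDirectory: gs://<log-bucket>
service:
  type: ClusterIP
nfs:
  enableExampleNFS: false
```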
Configure the Spark History Server
In the `values.yaml` file:
- The `key` and `secret` fields provide the Spark History Server with the details of where it can find an account with access to the Google Cloud Storage bucket given in `logDirectory`. Later examples show how to set up a new service account that's shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.
- The `service` key overrides the default: the Spark History Server is set up on an internal IP within your cluster only and is not exposed externally. Later examples show how to access the Spark History Server.
- The `nfs.enableExampleNFS` option turns off the unneeded default NFS server set up by the Spark History Server.
Use `gcloud` to create a new service account:
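For instance (the account name `sparkhistory` matches the name referenced later in this guide):

```bash
gcloud iam service-accounts create sparkhistory \
  --display-name "Spark History Server"
```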
If you use an existing service account instead, you can skip the `create` command, though you will still need to create the JSON key pair and ensure that the existing account can read and write to the log bucket. Use `keys create` to create a JSON key pair, and upload it to your cluster as a Kubernetes secret:
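A sketch; the key file and secret names line up with the `key` and `secret` fields in the `values.yaml` example above:

```bash
gcloud iam service-accounts keys create sparkhistory.json \
  --iam-account sparkhistory@<project-id>.iam.gserviceaccount.com
kubectl create secret generic history-secrets \
  --from-file=sparkhistory.json
```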
Grant the service account the `storage/admin` role, allowing it to perform "create" and "view" operations, and use the `gsutil` command to apply the service account to your chosen bucket:
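A sketch using `gsutil iam ch`; the member and bucket are placeholders, and `roles/storage.admin` is the IAM form of the storage/admin role named above:

```bash
gsutil iam ch \
  "serviceAccount:sparkhistory@<project-id>.iam.gserviceaccount.com:roles/storage.admin" \
  gs://<log-bucket>
```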
Now install the chart: `helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml`.
The history server can also be backed by Azure or S3 instead of GCS. For Azure, run `echo "your-azure-storage-account-key" >> azure-storage-account-key` to use a storage account key instead, and adjust the `values.yaml` file to match; for S3, the `values.yaml` file resembles the GCS one but uses the Hadoop `s3a://` link instead of `s3://`.
Next, configure Spark to write its event logs to the bucket accessible to the `sparkhistory` service account. Update the `spark` key with the new YAML settings below:
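The exact nesting under the `spark` key depends on your chart version; this sketch shows the standard Spark properties involved, reusing the bucket placeholder from the earlier examples:

```yaml
spark:
  # Write event logs to the bucket the history server reads from
  spark.eventLog.enabled: "true"
  spark.eventLog.dir: "gs://<log-bucket>/"
  # Filesystem implementations that let Spark talk to GCS
  spark.hadoop.fs.AbstractFileSystem.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
  spark.hadoop.fs.gs.impl: "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
```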
These settings configure where Spark writes its `eventLog`. The settings also inform Spark how to access GCS with the `spark.hadoop.fs.AbstractFileSystem.gs.impl` and `spark.hadoop.fs.gs.impl` keys.
Once the settings are in place, delete the currently running `job-launcher` pod. The new `job-launcher` pod will apply the new configuration to later jobs.
Access the Spark History Server
Because the Spark History Server is only exposed on an internal cluster IP, use `kubectl` to forward a local port to it, then open http://localhost:18080 in your browser.
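A sketch; the service name here is an assumption based on the Helm release name used in the install command above:

```bash
kubectl port-forward service/[namespace]-spark-history-server 18080:18080
```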
Run a Spark job and confirm that you can see the logs appear in the UI.