Configure the Spark History Server

Table of Contents

Recommended configuration
Other configurations
- Azure
- Amazon Web Services
Configuring Spark

For related topics, see Spark Operations.

Recommended configuration

For Fusion, configure the Spark History Server to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, set these keys in the values.yaml file:

gcs:
  enableGCS: true
  secret: history-secrets (1)
  key: sparkhistory.json (1)
  logDirectory: gs://[BUCKET_NAME]
service: (2)
  type: ClusterIP
  port:
     number: 18080

pvc:
  enablePVC: false
nfs:
  enableExampleNFS: false (3)

1	The `key` and `secret` fields provide the Spark History Server with the details of where it can find an account with access to the Google Cloud Storage bucket given in `logDirectory`. Later examples show how to set up a new service account that’s shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.
2	By default, the Spark History Server Helm chart creates an external LoadBalancer, exposing it to outside access. In this example, the `service` key overrides the default. The Spark History Server is set up on an internal IP within your cluster only and is not exposed externally. Later examples show how to access the Spark History Server.
3	The `nfs.enableExampleNFS` option turns off the unneeded default NFS server set up by the Spark History Server.

To give the Spark History Server access to the Google Cloud Storage bucket where the logs are kept:

Use gcloud to create a new service account:

export ACCOUNT_NAME=sparkhistory
export GCP_PROJECT_ID=[PROJECT_ID]
gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"

If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON key pair and ensure that the existing account can read and write to the log bucket.

Use keys create to create a JSON key pair, and upload it to your cluster as a Kubernetes secret:

gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${
ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"

Give the service account the storage/admin role, allowing it to perform "create" and "view" operations:

gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin

Run the gsutil command to apply the service account to your chosen bucket.

gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]

Upload the JSON key pair into the cluster as a secret:

kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json

The Spark History Server can now be installed with helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml.

Other configurations

Azure

The Azure configuration process is similar Google Kubernetes Engine. However, logs are stored in Azure Blob Storage, and you can use SAS token or key access.

echo "your-storage-account-name" >> azure-storage-account-name
echo "your-container-name" >> azure-blob-container-name
echo "your-azure-blob-sas-key" >> azure-blob-sas-key (1)
kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]

1	This line is used to authenticate with a SAS token. Replace the line with `echo "your-azure-storage-account-key" >> azure-storage-account-key` to use a storage account key instead.

To use SAS token access, the values.yaml file resembles the following:

wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyName: azure-blob-sas-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]

For non-SAS access, the values.yaml file resembles the following:

wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyMode: false
  storageAccountKeyName: azure-storage-account-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName:  azure-blob-container-name
  logDirectory: [BUCKET_NAME]

Amazon Web Services

In AWS, you can use IAM roles or an access/secret key pair. The use of AWS IAM roles is preferred over using an access/secret key pair, but both options are described.

aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
echo "your-aws-secret-key" >> aws-secret-key
kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key

For IAM, the values.yaml file resembles the following:

s3:
  enableS3: true
  logDirectory: s3a://[BUCKET_NAME]

The values.yaml file uses the Hadoop s3a:// link instead of s3://.

For an access/secret pair, add the secret:

s3:
  enableS3: true
  enableIAM: false
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://[BUCKET_NAME]

Configuring Spark

After starting the Spark History Server, update the config map for Fusion’s job-launcher service so it can write logs to the same location that Spark History Server is reading from.

In this example, Fusion is installed into a namespace called sparkhistory.

Before editing the config map, make a copy of the existing settings in case you need to revert the changes.
```
kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml
```
Edit the config map to write the logs to the same Google Cloud Storage bucket we configured the Spark History Server to read from.
```
kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher
```

Update the spark key with the new YAML settings below:

spark:
  hadoop:
    fs:
      AbstractFileSystem:
        gs:
          impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
      gs:
        impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    google:
      cloud:
        auth:
          service:
            account:
              json:
                keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
  eventLog:
    enabled: true
    compress: true
    dir: gs://[BUCKET_NAME]
  …
  kubernetes:
    driver:
      secrets:
        history-secrets: /etc/history-secrets
      container:
        …
    executor:
      secrets:
        history-secrets: /etc/history-secrets
      container:
        …
    …

The YAML settings inform Spark of the location of the secret and the settings that specify the location of the Spark eventLog. The settings also inform Spark how to access GCS with the spark.hadoop.fs.AbstractFileSystem.gs.impl and spark.hadoop.fs.gs.impl keys.

Delete the job-launcher pod. The new job-launcher pod will apply the new configuration to later jobs.