Configure The Spark History Server

For related topics, see Spark Operations.

Our recommended configuration for using the Spark History Server with Fusion is to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, we suggest setting these keys in the values.yaml:

gcs:
  enableGCS: true
  secret: history-secrets
  key: sparkhistory.json
  logDirectory: gs://[BUCKET_NAME]
service:
  type: ClusterIP
  port: 18080

pvc:
  enablePVC: false
nfs:
  enableExampleNFS: false

Note that, by default, the Spark History Server Helm chart creates an external LoadBalancer, exposing the server to outside access. This is usually undesirable. The service key above prevents it: the Spark History Server is only assigned an internal ClusterIP within your cluster and is not exposed externally. Later, we will show how to properly access the Spark History Server.
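In the meantime, a quick way to reach the UI from a workstation is kubectl port-forward. This is a sketch; the service name below assumes the Helm chart's default naming from a release called [RELEASE_NAME], so adjust it to match your installation.

```shell
# Forward local port 18080 to the Spark History Server's ClusterIP service.
# The service name depends on your Helm release name; adjust as needed.
kubectl -n [NAMESPACE] port-forward service/[RELEASE_NAME]-spark-history-server 18080:18080
# The UI is then available at http://localhost:18080 while the command runs.
```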

The key and secret fields provide the Spark History Server with the details of where it will find an account with access to the Google Cloud Storage bucket given in logDirectory. In the following example, we’re going to set up a new service account that will be shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.

The nfs.enableExampleNFS option turns off the NFS server that the Spark History Server sets up by default, as we won’t be needing it in our installation.

In order to give the Spark History Server access to the Google Cloud Storage bucket where the logs will be kept, we use gcloud to create a new service account, and then keys create to create a JSON keypair which we will shortly upload into our cluster as a Kubernetes secret.

$ export ACCOUNT_NAME=sparkhistory
$ export GCP_PROJECT_ID=[PROJECT_ID]
$ gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
$ gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"

We then give our service account the storage.admin role, allowing it to create and view objects, and the final gsutil command applies our service account to our chosen bucket. If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON keypair and ensure that the existing account can read and write to the log bucket.

$ gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
$ gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
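If you want to confirm the binding took effect before continuing, you can inspect the bucket's IAM policy; the new service account should be listed with the objectAdmin role.

```shell
# Print the bucket's IAM policy and check for the new service account
gsutil iam get gs://[BUCKET_NAME]
```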

We now need to upload the JSON keypair into the cluster as a secret:

kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json

With all this in place, the Spark History Server can now be installed:

$ helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml
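After the install completes, it is worth confirming that the pod is running and that the service is a ClusterIP rather than a LoadBalancer. This is a sketch; the label and service name below assume the chart's defaults and your release name, so adjust them to your installation.

```shell
# Check that the history server pod reached the Running state
kubectl -n [NAMESPACE] get pods -l app=spark-history-server

# Confirm the service TYPE column shows ClusterIP, not LoadBalancer
kubectl -n [NAMESPACE] get svc [namespace]-spark-history-server
```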

Other Configurations

Azure

The process on Azure is similar to Google Kubernetes Engine, except that the logs are stored in Azure Blob Storage, and access can be granted with either a SAS token or a storage account key.

$ echo "your-storage-account-name" >> azure-storage-account-name
$ echo "your-container-name" >> azure-blob-container-name
# to auth with sas token (if wasbs.sasKeyMode=true, which is the default)
$ echo "your-azure-blob-sas-key" >> azure-blob-sas-key
# or to auth with storage account key
$ echo "your-azure-storage-account-key" >> azure-storage-account-key
$ kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]

For SAS token access, values.yaml should look like:

wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyName: azure-blob-sas-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]

For non-SAS access:

wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyMode: false
  storageAccountKeyName: azure-storage-account-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]

AWS

The recommended approach for S3 access is to use AWS IAM roles, but you can also store an access/secret key pair as a Kubernetes secret:

$ aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
$ echo "your-aws-secret-key" >> aws-secret-key
$ kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key

For IAM, your values.yaml will be:

s3:
  enableS3: true
  logDirectory: s3a://[BUCKET_NAME]

(Note the Hadoop s3a:// scheme instead of s3://.)

With an access/secret key pair, you’ll need to add the secret:

s3:
  enableS3: true
  enableIAM: false
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://[BUCKET_NAME]

Configuring Spark

After starting the Spark History Server, we must update the config map for Fusion’s job-launcher so it can write logs to the same location that Spark History Server is reading from.

In this example, having installed Fusion into a namespace of sparkhistory, we will edit the config map to write the logs to the same Google Cloud Storage bucket we configured the Spark History Server to read from. Before editing the config map, make a copy of the existing settings in case you need to revert the changes.

kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml

kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher

Update the spark key with the new YAML settings below and then delete the job-launcher pod. The new job-launcher pod will apply the new configuration to subsequent jobs. In addition to the location of the secret and the settings that specify the location of the Spark event log, we also have to tell Spark how to access GCS via the spark.hadoop.fs.gs.impl and spark.hadoop.fs.AbstractFileSystem.gs.impl keys.

spark:
  hadoop:
    fs:
      AbstractFileSystem:
        gs:
          impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
      gs:
        impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    google:
      cloud:
        auth:
          service:
            account:
              json:
                keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
  eventLog:
    enabled: true
    compress: true
    dir: gs://[BUCKET_NAME]
  kubernetes:
    driver:
      secrets:
        history-secrets: /etc/history-secrets
    executor:
      secrets:
        history-secrets: /etc/history-secrets
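Once the edited config map is saved, deleting the current job-launcher pod causes its deployment to recreate it with the new configuration. This is a sketch; the grep pattern assumes the pod name contains "job-launcher", and [JOB_LAUNCHER_POD_NAME] is the name you find in the first command's output.

```shell
# Find the current job-launcher pod name
kubectl -n [NAMESPACE] get pods | grep job-launcher

# Delete it; the deployment recreates the pod with the updated config map
kubectl -n [NAMESPACE] delete pod [JOB_LAUNCHER_POD_NAME]
```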