
        Configure the Spark History Server

        For related topics, see Spark Operations.

        Our recommended configuration for using the Spark History Server with Fusion is to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, we suggest setting the following keys in values.yaml:

          gcs:
            enableGCS: true
            secret: history-secrets
            key: sparkhistory.json
            logDirectory: gs://[BUCKET_NAME]
          service:
            type: ClusterIP
            port: 18080
          pvc:
            enablePVC: false
          nfs:
            enableExampleNFS: false

        Note that, by default, the Spark History Server Helm chart creates an external LoadBalancer, exposing the server to outside access. This is usually undesirable. The service key above prevents it: the Spark History Server is assigned only an internal ClusterIP within your cluster and is not exposed externally. Later, we will show how to properly access the Spark History Server.

        The key and secret fields provide the Spark History Server with the details of where it will find an account with access to the Google Cloud Storage bucket given in logDirectory. In the following example, we’re going to set up a new service account that will be shared between the Spark History Server and the Spark driver/executors for both viewing and writing logs.

        The nfs.enableExampleNFS option turns off the NFS server that the Spark History Server sets up by default, as we won’t be needing it in our installation.

        In order to give the Spark History Server access to the Google Cloud Storage bucket where the logs will be kept, we use gcloud to create a new service account, and then gcloud iam service-accounts keys create to generate a JSON key, which we will shortly upload into our cluster as a Kubernetes secret.

        $ export ACCOUNT_NAME=sparkhistory
        $ export GCP_PROJECT_ID=[PROJECT_ID]
        $ gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
        $ gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"

        We then grant our service account the roles/storage.admin role, allowing it to create and view objects, and the final gsutil command grants the service account objectAdmin access on our chosen bucket. If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON key and ensure that the existing account can read and write to the log bucket.

        $ gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
        $ gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
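        Both commands above refer to the service account by its fully qualified email. As a quick sanity check before running the IAM commands, you can assemble and print that email yourself (the values below are illustrative, not from your project):

```shell
# Illustrative values -- substitute your real account name and project ID
ACCOUNT_NAME=sparkhistory
GCP_PROJECT_ID=my-project

# gcloud and gsutil both expect this fully qualified email form
SA_EMAIL="${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
echo "${SA_EMAIL}"
```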

        We now need to upload the JSON key into the cluster as a secret. Note that the file name must match the key value (sparkhistory.json) set in values.yaml:

        kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json
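        Optionally, you can confirm that the secret was created and that its data key matches the key value from values.yaml:

```shell
# List the data keys stored in the secret; expect an entry
# named sparkhistory.json
kubectl -n [NAMESPACE] get secret history-secrets -o jsonpath='{.data}'
```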

        With all this in place, the Spark History Server can now be installed with helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml.
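        Because the service is only reachable inside the cluster, one straightforward way to reach the UI is a kubectl port-forward. This is a sketch; the service name shown assumes the Helm release name used above, so check kubectl get services if yours differs:

```shell
# Forward local port 18080 to the Spark History Server service inside
# the cluster, then browse to http://localhost:18080
kubectl -n [NAMESPACE] port-forward service/[namespace]-spark-history-server 18080:18080
```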

        Other Configurations


        The process on Azure is similar to Google Kubernetes Engine, except that logs are stored in Azure Blob Storage, and authentication uses either a SAS token or a storage account key.

        $ echo "your-storage-account-name" >> azure-storage-account-name
        $ echo "your-container-name" >> azure-blob-container-name
        # to auth with sas token (if wasbs.sasKeyMode=true, which is the default)
        $ echo "your-azure-blob-sas-key" >> azure-blob-sas-key
        # or to auth with storage account key
        $ echo "your-azure-storage-account-key" >> azure-storage-account-key
        $ kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]

        For SAS token access, values.yaml should look like:

          wasbs:
            enableWASBS: true
            secret: azure-secrets
            sasKeyName: azure-blob-sas-key
            storageAccountNameKeyName: azure-storage-account-name
            containerKeyName: azure-blob-container-name
            logDirectory: [BUCKET_NAME]

        For non-SAS access:

          wasbs:
            enableWASBS: true
            secret: azure-secrets
            sasKeyMode: false
            storageAccountKeyName: azure-storage-account-key
            storageAccountNameKeyName: azure-storage-account-name
            containerKeyName: azure-blob-container-name
            logDirectory: [BUCKET_NAME]


        The recommended approach for S3 access is to use AWS IAM roles, but you can also supply an access/secret key pair as a Kubernetes secret:

        $ aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
        $ echo "your-aws-secret-key" >> aws-secret-key
        $ kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key

        For IAM, your values.yaml will be:

          s3:
            enableS3: true
            logDirectory: s3a://[BUCKET_NAME]

        (Note the Hadoop s3a:// scheme instead of s3://.)

        With an access/secret key pair, you’ll also need to reference the secret:

          s3:
            enableS3: true
            enableIAM: false
            accessKeyName: aws-access-key
            secretKeyName: aws-secret-key
            logDirectory: s3a://[BUCKET_NAME]

        Configuring Spark

        After starting the Spark History Server, we must update the config map for Fusion’s job-launcher so it can write logs to the same location that Spark History Server is reading from.

        In this example, having installed Fusion into a namespace of sparkhistory, we will edit the config map to write the logs to the same Google Cloud Storage bucket we configured the Spark History Server to read from. Before editing the config map, make a copy of the existing settings in case you need to revert the changes.

        kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml
        kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher

        Update the spark key with the new YAML settings below and then delete the job-launcher pod. The new job-launcher pod will apply the new configuration to subsequent jobs. In addition to the location of the secret and the settings that specify the location of the Spark eventLog, we also have to tell Spark how to access GCS via the spark.hadoop.fs.gs.impl and spark.hadoop.fs.AbstractFileSystem.gs.impl keys.

          spark:
            hadoop:
              fs:
                AbstractFileSystem:
                  gs:
                    impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
                gs:
                  impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
                  auth:
                    service:
                      account:
                        json:
                          keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
            eventLog:
              enabled: true
              compress: true
              dir: gs://[BUCKET_NAME]
            kubernetes:
              driver:
                secrets:
                  history-secrets: /etc/history-secrets
              executor:
                secrets:
                  history-secrets: /etc/history-secrets
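        To restart the job-launcher so it picks up the edited config map, you can delete its pod and let Kubernetes recreate it. This is a sketch; the pod name placeholder is hypothetical, so find the real name with kubectl get pods first:

```shell
# Find the job-launcher pod, then delete it; its deployment recreates it
# with the updated configuration applied
kubectl -n [NAMESPACE] get pods
kubectl -n [NAMESPACE] delete pod [JOB_LAUNCHER_POD_NAME]
```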