Configure the Spark History Server

For related topics, see Spark Operations.

Our recommended configuration for using the Spark History Server with Fusion is to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, we suggest setting these keys in the values.yaml:

gcs:
    enableGCS: true
    secret: history-secrets
    key: [SECRET_KEY_NAME].json
    logDirectory: gs://[BUCKET_NAME]
service:
    type: ClusterIP
    port: 18080

We override the chart's default service.type of LoadBalancer with ClusterIP so that the History Server is only reachable from inside the cluster (or via port-forwarding).

You will also need a service account with access to the log bucket. These commands create a GCP service account, generate a JSON key for it, and grant it access to the bucket:

$ export ACCOUNT_NAME=[SECRET_KEY_NAME]
$ export GCP_PROJECT_ID=[PROJECT_ID]
$ gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
$ gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
$ gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
$ gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
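The values.yaml above references a Kubernetes secret named history-secrets holding the key file. A minimal sketch of creating it from the JSON key generated above (the file name becomes the key name the chart looks up, matching key: [SECRET_KEY_NAME].json):

$ kubectl create secret generic history-secrets --from-file="${ACCOUNT_NAME}.json"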

The ClusterIP service type puts the history server on an internal IP within your cluster and does not create a LoadBalancer (the Spark History Server Helm chart's default), preventing the server from being exposed to outside access. One way to reach it is shown below.
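
With the values file and the history-secrets secret in place, you can install the chart and port-forward to reach the UI. A sketch, assuming the chart is published as stable/spark-history-server and the release is named spark-history-server (adjust both to your environment):

$ helm install spark-history-server stable/spark-history-server -f values.yaml
# the service is ClusterIP-only, so forward a local port to reach the UI
$ kubectl port-forward service/spark-history-server 18080:18080

The UI is then available at http://localhost:18080.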

Other Configurations

Azure

$ echo "your-storage-account-name" >> azure-storage-account-name
$ echo "your-container-name" >> azure-blob-container-name
# to auth with sas token (if wasbs.sasKeyMode=true, which is the default)
$ echo "your-azure-blob-sas-key" >> azure-blob-sas-key
# or to auth with storage account key
$ echo "your-azure-storage-account-key" >> azure-storage-account-key
$ kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]
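
If you don't already have a SAS token to put in azure-blob-sas-key, one way to generate one is with the Azure CLI; a sketch, with illustrative permissions and expiry (adjust both to your own policy):

$ az storage container generate-sas --account-name your-storage-account-name --name your-container-name --permissions racwdl --expiry 2030-01-01 -o tsv >> azure-blob-sas-key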

For SAS token access, values.yaml should look like:

wasbs:
    enableWASBS: true
    secret: azure-secrets
    sasKeyName: azure-blob-sas-key
    storageAccountNameKeyName: azure-storage-account-name
    containerKeyName: azure-blob-container-name
    logDirectory: [BUCKET_NAME]

For access with the storage account key instead of a SAS token:

wasbs:
    enableWASBS: true
    secret: azure-secrets
    sasKeyMode: false
    storageAccountKeyName: azure-storage-account-key
    storageAccountNameKeyName: azure-storage-account-name
    containerKeyName: azure-blob-container-name
    logDirectory: [BUCKET_NAME]
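
Whichever mode you choose, reapply the values to the running release. A sketch, assuming the same chart and release names as in the GCS example above:

$ helm upgrade --install spark-history-server stable/spark-history-server -f values.yaml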

AWS

The recommended approach for S3 access is to use AWS IAM roles, but you can also store an access/secret key pair in a Kubernetes secret:

$ aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
$ echo "your-aws-secret-key" >> aws-secret-key
$ kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key
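
Since the local files now contain live credentials, it's worth removing them once the secret exists. A quick sanity check plus cleanup:

$ kubectl describe secret aws-secrets   # lists key names and sizes, not values
$ rm aws-access-key aws-secret-key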

For IAM roles, values.yaml needs only:

s3:
    enableS3: true
    logDirectory: s3a://[BUCKET_NAME]

(Note the Hadoop s3a:// scheme rather than s3://.)
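
With IAM, the role attached to the history server's nodes (or service account) needs read access to the log bucket, and the Spark driver and executors need write access. A sketch of a minimal read-only policy; the policy name and bucket here are placeholders:

$ cat > spark-logs-read.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::[BUCKET_NAME]", "arn:aws:s3:::[BUCKET_NAME]/*"]
  }]
}
EOF
$ aws iam create-policy --policy-name spark-history-logs-read --policy-document file://spark-logs-read.json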

With an access/secret key pair, you'll need to reference the secret in values.yaml:

s3:
    enableS3: true
    enableIAM: false
    accessKeyName: aws-access-key
    secretKeyName: aws-secret-key
    logDirectory: s3a://[BUCKET_NAME]

Configuring Spark

Once the History Server is set up, update the application.yaml key in the Fusion job-launcher deployment's ConfigMap with these Spark settings so the driver and executors know where to write their logs. In this example, logs are written to a GCS bucket:

spark:
    eventLog:
        enabled: true
        compress: true
        dir: gs://[BUCKET_NAME]
    hadoop:
        google:
            cloud:
                auth:
                    service:
                        account:
                            json:
                                keyfile: /etc/secrets/[SECRET_KEY_NAME].json
    kubernetes:
        driver:
            secrets:
                history-secrets: /etc/secrets
        executor:
            secrets:
                history-secrets: /etc/secrets
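
One way to apply the change is to edit the ConfigMap in place; the ConfigMap name below is illustrative (kubectl get configmaps will show the actual name in your release), and pods typically pick up ConfigMap-driven application.yaml changes only after a restart:

$ kubectl edit configmap job-launcher
$ kubectl rollout restart deployment/job-launcher
# after the next Spark job runs, its event logs should appear in the bucket
$ gsutil ls gs://[BUCKET_NAME]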