Configure the Spark History Server
For related topics, see Spark Operations.
Recommended Configuration
Our recommended configuration for using the Spark History Server with Fusion is to store and read Spark logs in cloud storage. For installations on Google Kubernetes Engine, we suggest setting these keys in values.yaml:
gcs:
  enableGCS: true
  secret: history-secrets
  key: sparkhistory.json
  logDirectory: gs://[BUCKET_NAME]
service:
  type: ClusterIP
  port: 18080
pvc:
  enablePVC: false
nfs:
  enableExampleNFS: false
Note that, by default, the Spark History Server Helm chart creates an external LoadBalancer, exposing the server to outside access. This is usually undesirable. The service key above prevents this: the Spark History Server is set up only on an internal IP within your cluster and is not exposed externally. Later, we will show how to properly access the Spark History Server.
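If you need to reach the UI from a workstation in the meantime, one option is kubectl port-forward against the ClusterIP service. The service name below is an assumption and may differ depending on your release name, so confirm it first with kubectl get svc:

```shell
# Forward local port 18080 to the Spark History Server service inside the cluster.
# "spark-history-server" is an assumed service name - verify it with:
#   kubectl -n [NAMESPACE] get svc
kubectl -n [NAMESPACE] port-forward service/spark-history-server 18080:18080
# While the forward is running, browse to http://localhost:18080
```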
The key and secret fields provide the Spark History Server with the details of where it will find an account with access to the Google Cloud Storage bucket given in logDirectory. In the following example, we set up a new service account that is shared between the Spark History Server and the Spark driver/executors, for both viewing and writing logs.
The nfs.enableExampleNFS option turns off the example NFS server that the Spark History Server chart sets up by default, as we won't need it in this installation.
To give the Spark History Server access to the Google Cloud Storage bucket where the logs will be kept, we use gcloud to create a new service account, and then keys create to create a JSON key file, which we will shortly upload into our cluster as a Kubernetes secret.
$ export ACCOUNT_NAME=sparkhistory
$ export GCP_PROJECT_ID=[PROJECT_ID]
$ gcloud iam service-accounts create ${ACCOUNT_NAME} --display-name "${ACCOUNT_NAME}"
$ gcloud iam service-accounts keys create "${ACCOUNT_NAME}.json" --iam-account "${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com"
We then give our service account the roles/storage.admin role, allowing it to perform create and view operations, and the final gsutil command applies our service account to our chosen bucket. If you have an existing service account you wish to use instead, you can skip the create command, though you will still need to create the JSON key file and ensure that the existing account can read and write to the log bucket.
$ gcloud projects add-iam-policy-binding ${GCP_PROJECT_ID} --member "serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com" --role roles/storage.admin
$ gsutil iam ch serviceAccount:${ACCOUNT_NAME}@${GCP_PROJECT_ID}.iam.gserviceaccount.com:objectAdmin gs://[BUCKET_NAME]
We now need to upload the JSON keypair into the cluster as a secret:
kubectl -n [NAMESPACE] create secret generic history-secrets --from-file=sparkhistory.json
With all this in place, the Spark History Server can now be installed with helm install [namespace]-spark-history-server stable/spark-history-server --values values.yaml.
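A quick sanity check after installation is to confirm the pod is running and the secret exists. The label selector here is an assumption and may vary between chart versions:

```shell
# Check that the history server pod came up; the label selector is an assumption,
# so `kubectl -n [NAMESPACE] get pods` without a selector also works.
kubectl -n [NAMESPACE] get pods -l app=spark-history-server
# Confirm the JSON key file was uploaded as a secret
kubectl -n [NAMESPACE] get secret history-secrets
```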
Other Configurations
Azure
The process on Azure is similar to Google Kubernetes Engine, except that logs are stored in Azure Blob Storage, and we can authenticate with either a SAS token or a storage account key.
$ echo "your-storage-account-name" >> azure-storage-account-name
$ echo "your-container-name" >> azure-blob-container-name
# to auth with sas token (if wasbs.sasKeyMode=true, which is the default)
$ echo "your-azure-blob-sas-key" >> azure-blob-sas-key
# or to auth with storage account key
$ echo "your-azure-storage-account-key" >> azure-storage-account-key
$ kubectl create secret generic azure-secrets --from-file=azure-storage-account-name --from-file=azure-blob-container-name [--from-file=azure-blob-sas-key | --from-file=azure-storage-account-key]
For SAS token access, values.yaml should look like:
wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyName: azure-blob-sas-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]
For non-SAS access:
wasbs:
  enableWASBS: true
  secret: azure-secrets
  sasKeyMode: false
  storageAccountKeyName: azure-storage-account-key
  storageAccountNameKeyName: azure-storage-account-name
  containerKeyName: azure-blob-container-name
  logDirectory: [BUCKET_NAME]
AWS
The recommended approach for S3 access is to use AWS IAM roles, but you can also store an access/secret key pair as a Kubernetes secret:
$ aws iam list-access-keys --user-name your-user-name --output text | awk '{print $2}' >> aws-access-key
$ echo "your-aws-secret-key" >> aws-secret-key
$ kubectl create secret generic aws-secrets --from-file=aws-access-key --from-file=aws-secret-key
For IAM, your values.yaml will be:
s3:
  enableS3: true
  logDirectory: s3a://[BUCKET_NAME]
(Note the Hadoop s3a:// scheme instead of s3://.)
With an access/secret pair, you'll need to add the secret:
s3:
  enableS3: true
  enableIAM: false
  accessKeyName: aws-access-key
  secretKeyName: aws-secret-key
  logDirectory: s3a://[BUCKET_NAME]
Configuring Spark
After starting the Spark History Server, we must update the config map for Fusion’s job-launcher so it can write logs to the same location that Spark History Server is reading from.
In this example, having installed Fusion into a namespace of sparkhistory, we will edit the config map to write the logs to the same Google Cloud Storage bucket we configured the Spark History Server to read from. Before editing the config map, make a copy of the existing settings in case you need to revert the changes.
kubectl get cm -n [NAMESPACE] sparkhistory-job-launcher -o yaml > sparkhistory-job-launcher.yaml
kubectl edit cm -n [NAMESPACE] sparkhistory-job-launcher
Update the spark key with the new YAML settings below, then delete the job-launcher pod. The new job-launcher pod will apply the new configuration to subsequent jobs. In addition to the location of the secret and the settings that specify the location of the Spark event log, we also have to tell Spark how to access GCS via the spark.hadoop.fs.gs.impl and spark.hadoop.fs.AbstractFileSystem.gs.impl keys.
spark:
  hadoop:
    fs:
      AbstractFileSystem:
        gs:
          impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
      gs:
        impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
    google:
      cloud:
        auth:
          service:
            account:
              json:
                keyfile: /etc/history-secrets/[ACCOUNT_NAME].json
  eventLog:
    enabled: true
    compress: true
    dir: gs://[BUCKET_NAME]
  …
  kubernetes:
    driver:
      secrets:
        history-secrets: /etc/history-secrets
      container:
        …
    executor:
      secrets:
        history-secrets: /etc/history-secrets
      container:
        …
…
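For orientation, the nested YAML above flattens to the following Spark properties, shown here in spark-defaults.conf style as a reference sketch only; the config map itself should stay in the nested YAML form:

```
spark.hadoop.fs.AbstractFileSystem.gs.impl                   com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
spark.hadoop.fs.gs.impl                                      com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.google.cloud.auth.service.account.json.keyfile  /etc/history-secrets/[ACCOUNT_NAME].json
spark.eventLog.enabled                                       true
spark.eventLog.compress                                      true
spark.eventLog.dir                                           gs://[BUCKET_NAME]
spark.kubernetes.driver.secrets.history-secrets              /etc/history-secrets
spark.kubernetes.executor.secrets.history-secrets            /etc/history-secrets
```

Once the config map is saved, delete the current job-launcher pod (for example, kubectl -n [NAMESPACE] delete pod <job-launcher-pod-name>) so its replacement picks up the new settings.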