        Configure Spark Jobs to Access Cloud Storage

        For related topics, see Spark Operations.

        Supported jobs

        This procedure applies to the following Spark-based jobs:

        • ALS Recommender

        • Cluster Labeling

        • Co-occurrence Similarity

        • Collection Analysis

        • Create Seldon Core Model Deployment Job

        • Delete Seldon Core Model Deployment Job

        • Document Clustering

        • Ground Truth

        • Head/Tail Analysis

        • Item Similarity Recommender

        • Legacy Item Recommender

        • Legacy Item Similarity

        • Levenshtein Spell Checking

        • Logistic Regression Classifier Training

        • Matrix Decomposition-Based Query-Query Similarity

        • Outlier Detection

        • Parallel Bulk Loader

        • Parameterized SQL Aggregation

        • Phrase Extraction

        • Query-to-Query Session-Based Similarity

        • Query-to-Query Similarity

        • Random Forest Classifier Training

        • Ranking Metrics

        • SQL Aggregation

        • SQL-Based Experiment Metric (deprecated)

        • Statistically Interesting Phrases

        • Synonym Detection Jobs

        • Synonym and Similar Queries Detection Jobs

        • Token and Phrase Spell Correction

        • Word2Vec Model Training

        AWS and GCS credentials can be configured per cluster, using the procedures below, or per job; see Configuring credentials per job.

        Configuring GCS credentials for Spark jobs

        1. Create a secret containing the credentials JSON file.

          See https://cloud.google.com/iam/docs/creating-managing-service-account-keys for instructions on creating service account JSON key files.

          kubectl create secret generic solr-dev-gc-serviceaccount-key --from-file=/Users/kiranchitturi/creds/solr-dev-gc-serviceaccount-key.json
        2. Create an extra Kubernetes config map that sets the required properties for GCP:

          1. Create a properties file with GCP properties:

            cat gcp-launcher.properties
            spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
            spark.kubernetes.driver.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
            spark.kubernetes.executor.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
            spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
            spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          2. Create a config map based on the properties file:

            kubectl create configmap gcp-launcher --from-file=gcp-launcher.properties
        3. Add the gcp-launcher config map to values.yaml under job-launcher:

          configSources: gcp-launcher
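
        In context, this entry sits under the job-launcher section of values.yaml. A minimal sketch of where it lands; the exact surrounding structure depends on your Helm chart values, so treat this as illustrative:

          job-launcher:
            configSources: gcp-launcher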

        Configuring S3 credentials for Spark jobs

        AWS credentials cannot be set with a single file, so two environment variables must be set instead: one for the access key and one for the secret key.

        1. Create a secret pointing to the creds:

          kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
        2. Create an extra Kubernetes config map that sets the required properties for AWS:

          1. Create a properties file with AWS properties:

            cat aws-launcher.properties
            spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
            spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
            spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
            spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
          2. Create a config map based on the properties file:

            kubectl create configmap aws-launcher --from-file=aws-launcher.properties
        3. Add the aws-launcher config map to values.yaml under job-launcher:

          configSources: aws-launcher
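
        Before applying the change, you can confirm that the secret and config map were created with the expected contents. A quick check with standard kubectl commands (the first command lists the secret without printing its values):

          kubectl get secret aws-secret
          kubectl get configmap aws-launcher -o yaml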

        Configuring Azure Data Lake credentials for Spark jobs

        Configuring Azure credentials through environment variables or config maps is not currently supported. Instead, you must manually upload the core-site.xml file into the job-launcher pod at /app/spark-dist/conf.

        Currently, only Azure Data Lake Storage Gen1 is supported.

        Here’s what the core-site.xml file should look like:

        <configuration>
          <property>
            <name>dfs.adls.oauth2.access.token.provider.type</name>
            <value>ClientCredential</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.refresh.url</name>
            <value>Insert Your OAuth 2.0 Endpoint URL Value Here</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.client.id</name>
            <value>Insert Your Application ID Here</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.credential</name>
            <value>Insert the Secret Key Value Here</value>
          </property>
          <property>
            <name>fs.adl.impl</name>
            <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
          </property>
          <property>
            <name>fs.AbstractFileSystem.adl.impl</name>
            <value>org.apache.hadoop.fs.adl.Adl</value>
          </property>
        </configuration>
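
        One way to get the file into the running pod is kubectl cp. A minimal sketch, assuming the job-launcher pod runs in a namespace named fusion; substitute your actual namespace and pod name:

          # Find the job-launcher pod name (labels and namespace may differ in your install).
          kubectl get pods -n fusion | grep job-launcher

          # Copy the local core-site.xml into the Spark conf directory inside the pod.
          kubectl cp core-site.xml fusion/<job-launcher-pod-name>:/app/spark-dist/conf/core-site.xml

        Note that a file copied this way does not survive a pod restart unless the directory is backed by a persistent volume, so it may need to be re-uploaded if the pod is recreated.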

        Configuring credentials per job

        1. Create a Kubernetes secret containing the GCP or AWS credentials.

        2. Add the Spark configuration properties that make the secret available to the Spark driver and executors.

        GCS

        1. Create a secret containing the credentials JSON file.

          See https://cloud.google.com/iam/docs/creating-managing-service-account-keys for instructions on creating service account JSON key files.

          kubectl create secret generic solr-dev-gc-serviceaccount-key --from-file=/Users/kiranchitturi/creds/solr-dev-gc-serviceaccount-key.json
        2. Toggle the Advanced config in the job UI and add the following to the Spark configuration:

          spark.kubernetes.driver.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
          spark.kubernetes.executor.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
          spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
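
        Once the job is running, you can check that the secret was mounted into the Spark driver pod at the configured path. A quick sanity check, substituting the driver pod name that Spark generates for your job:

          kubectl exec <spark-driver-pod-name> -- ls /mnt/gcp-secrets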

        S3

        AWS credentials cannot be set with a single file, so two environment variables must be set instead: one for the access key and one for the secret key.

        1. Create a secret pointing to the creds:

          kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
        2. Toggle the Advanced config in the job UI and add the following to the Spark configuration:

          spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
          spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
          spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
          spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
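
        The value of each secretKeyRef property is the secret name and the key within that secret, separated by a colon. For example, if the secret had been created under different names, say my-aws-creds with keys access-key and secret-key (hypothetical names), the properties would become:

          spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=my-aws-creds:access-key
          spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=my-aws-creds:secret-key
          spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=my-aws-creds:access-key
          spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=my-aws-creds:secret-key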