        Configure Spark Jobs to Access Cloud Storage

        For related topics, see Spark Operations.

        Supported jobs

        This procedure applies to the following Spark-based jobs:

        • ALS Recommender

        • Cluster Labeling

        • Co-occurrence Similarity

        • Collection Analysis

        • Create Seldon Core Model Deployment Job

        • Delete Seldon Core Model Deployment Job

        • Document Clustering

        • Ground Truth

        • Head/Tail Analysis

        • Item Similarity Recommender

        • Legacy Item Recommender

        • Legacy Item Similarity

        • Levenshtein Spell Checking

        • Logistic Regression Classifier Training

        • Matrix Decomposition-Based Query-Query Similarity

        • Outlier Detection

        • Parallel Bulk Loader

        • Parameterized SQL Aggregation

        • Phrase Extraction

        • Query-to-Query Session-Based Similarity

        • Query-to-Query Similarity

        • Random Forest Classifier Training

        • Ranking Metrics

        • SQL Aggregation

        • SQL-Based Experiment Metric (deprecated)

        • Statistically Interesting Phrases

        • Synonym Detection Jobs

        • Synonym and Similar Queries Detection Jobs

        • Token and Phrase Spell Correction

        • Word2Vec Model Training

        AWS and GCS credentials can be configured per cluster, using the procedures below, or per job; see Configuring credentials per job.

        Configuring GCS credentials for Spark jobs

        1. Create a secret containing the credentials JSON file.

          See https://cloud.google.com/iam/docs/creating-managing-service-account-keys for instructions on creating service account JSON key files.

          kubectl create secret generic solr-dev-gc-serviceaccount-key --from-file=/Users/kiranchitturi/creds/solr-dev-gc-serviceaccount-key.json
        2. Create an extra Kubernetes config map that sets the required properties for GCP:

          1. Create a properties file with GCP properties:

            cat gcp-launcher.properties
            spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
            spark.kubernetes.driver.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
            spark.kubernetes.executor.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
            spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
            spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          2. Create a config map based on the properties file:

            kubectl create configmap gcp-launcher --from-file=gcp-launcher.properties
        3. Add the gcp-launcher config map to values.yaml under job-launcher:

          configSources: gcp-launcher
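
        In context, this entry sits under the job-launcher section of values.yaml. A minimal sketch of where it lands; the exact surrounding structure depends on your Helm chart values, so treat this as illustrative:

          job-launcher:
            configSources: gcp-launcher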

        Configuring S3 credentials for Spark jobs

        AWS credentials cannot be set with a single file, so two environment variables must be set instead: one for the access key and one for the secret key.

        1. Create a secret pointing to the creds:

          kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
        2. Create an extra Kubernetes config map that sets the required properties for AWS:

          1. Create a properties file with AWS properties:

            cat aws-launcher.properties
            spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
            spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
            spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
            spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
          2. Create a config map based on the properties file:

            kubectl create configmap aws-launcher --from-file=aws-launcher.properties
        3. Add the aws-launcher config map to values.yaml under job-launcher:

          configSources: aws-launcher
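
        Before applying the change, you can confirm that the secret and config map were created with the expected contents. A quick check with standard kubectl commands (the first command lists the secret without printing its values):

          kubectl get secret aws-secret
          kubectl get configmap aws-launcher -o yaml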

        Configuring Azure Data Lake credentials for Spark jobs

        Configuring Azure credentials through environment variables or config maps is not currently supported. Instead, you must manually upload the core-site.xml file into the job-launcher pod at /app/spark-dist/conf.

        Currently, only Azure Data Lake Storage Gen1 is supported.

        Here’s what the core-site.xml file should look like:

        <configuration>
          <property>
            <name>dfs.adls.oauth2.access.token.provider.type</name>
            <value>ClientCredential</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.refresh.url</name>
            <value>Insert Your OAuth 2.0 Endpoint URL Value Here</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.client.id</name>
            <value>Insert Your Application ID Here</value>
          </property>
          <property>
            <name>dfs.adls.oauth2.credential</name>
            <value>Insert the Secret Key Value Here</value>
          </property>
          <property>
            <name>fs.adl.impl</name>
            <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
          </property>
          <property>
            <name>fs.AbstractFileSystem.adl.impl</name>
            <value>org.apache.hadoop.fs.adl.Adl</value>
          </property>
        </configuration>
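
        One way to get the file into the running pod is kubectl cp. A minimal sketch, assuming the job-launcher pod runs in a namespace named fusion; substitute your actual namespace and pod name:

          # Find the job-launcher pod name (labels and namespace may differ in your install).
          kubectl get pods -n fusion | grep job-launcher

          # Copy the local core-site.xml into the Spark conf directory inside the pod.
          kubectl cp core-site.xml fusion/<job-launcher-pod-name>:/app/spark-dist/conf/core-site.xml

        Note that a file copied this way does not survive a pod restart unless the directory is backed by a persistent volume, so it may need to be re-uploaded if the pod is recreated.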

        Configuring credentials per job

        1. Create a Kubernetes secret containing the GCP or AWS credentials.

        2. Add the Spark configuration properties that make the secret available to the Spark driver and executors.

        GCS

        1. Create a secret containing the credentials JSON file.

          See https://cloud.google.com/iam/docs/creating-managing-service-account-keys for instructions on creating service account JSON key files.

          kubectl create secret generic solr-dev-gc-serviceaccount-key --from-file=/Users/kiranchitturi/creds/solr-dev-gc-serviceaccount-key.json
        2. Toggle the Advanced config in the job UI and add the following to the Spark configuration:

          spark.kubernetes.driver.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
          spark.kubernetes.executor.secrets.solr-dev-gc-serviceaccount-key = /mnt/gcp-secrets
          spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
          spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/solr-dev-gc-serviceaccount-key.json
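
        Once the job is running, you can check that the secret was mounted into the Spark driver pod at the configured path. A quick sanity check, substituting the driver pod name that Spark generates for your job:

          kubectl exec <spark-driver-pod-name> -- ls /mnt/gcp-secrets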

        S3

        AWS credentials cannot be set with a single file, so two environment variables must be set instead: one for the access key and one for the secret key.

        1. Create a secret pointing to the creds:

          kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
        2. Toggle the Advanced config in the job UI and add the following to the Spark configuration:

          spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
          spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
          spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
          spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
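
        The value of each secretKeyRef property is the secret name and the key within that secret, separated by a colon. For example, if the secret had been created under different names, say my-aws-creds with keys access-key and secret-key (hypothetical names), the properties would become:

          spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=my-aws-creds:access-key
          spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=my-aws-creds:secret-key
          spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=my-aws-creds:access-key
          spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=my-aws-creds:secret-key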