Fusion 5.11

    Configure Spark Jobs to Access Cloud Storage

    For related topics, see Spark Operations.

    Supported jobs

    This procedure applies to the following Spark-based jobs:

    • ALS Recommender

    • Cluster Labeling

    • Co-occurrence Similarity

    • Collection Analysis

    • Create Seldon Core Model Deployment Job

    • Delete Seldon Core Model Deployment Job

    • Document Clustering

    • Ground Truth

    • Head/Tail Analysis

    • Item Similarity Recommender

    • Legacy Item Recommender

    • Legacy Item Similarity

    • Levenshtein Spell Checking

    • Logistic Regression Classifier Training

    • Matrix Decomposition-Based Query-Query Similarity

    • Outlier Detection

    • Parallel Bulk Loader

    • Parameterized SQL Aggregation

    • Phrase Extraction

    • Query-to-Query Session-Based Similarity

    • Query-to-Query Similarity

    • Random Forest Classifier Training

    • Ranking Metrics

    • SQL Aggregation

    • SQL-Based Experiment Metric (deprecated)

    • Statistically Interesting Phrases

    • Synonym Detection Jobs

    • Synonym and Similar Queries Detection Jobs

    • Token and Phrase Spell Correction

    • Word2Vec Model Training

    Amazon Web Services (AWS) and Google Cloud Storage (GCS) credentials can be configured per cluster or per job.

    Configuring credentials per cluster

    GCS

    The examples in this subsection use placeholder values. See the table below for descriptions of the placeholders:

    Placeholder        Description
    <key name>         Name of the Solr GCS service account key.
    <key file path>    Path to the Solr GCS service account key.

    1. Create a secret containing the credentials JSON file:

      kubectl create secret generic <key name> --from-file=/<key file path>/<key name>.json
    2. Create an extra config map in Kubernetes that sets the required properties for GCP (a worked example with sample values follows these steps):

      1. Create a properties file with GCP properties:

        cat gcp-launcher.properties
        spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
        spark.kubernetes.driver.secrets.<key name> = /mnt/gcp-secrets
        spark.kubernetes.executor.secrets.<key name> = /mnt/gcp-secrets
        spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
        spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/<key name>.json
      2. Create a config map based on the properties file:

        kubectl create configmap gcp-launcher --from-file=gcp-launcher.properties
    3. Add the gcp-launcher config map to values.yaml under job-launcher:

      configSources:
       - gcp-launcher
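
    Putting the steps above together, here is a sketch with sample values filled in. The key name gcs-sa-key and the path /tmp/keys are hypothetical; substitute your own service account key name and location.

      # Step 1: create the secret from the downloaded service account key
      kubectl create secret generic gcs-sa-key --from-file=/tmp/keys/gcs-sa-key.json

      # Step 2: gcp-launcher.properties with <key name> replaced by gcs-sa-key
      spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/gcs-sa-key.json
      spark.kubernetes.driver.secrets.gcs-sa-key = /mnt/gcp-secrets
      spark.kubernetes.executor.secrets.gcs-sa-key = /mnt/gcp-secrets
      spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/gcs-sa-key.json
      spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/gcs-sa-key.json

      # Step 2 (continued): create the config map from that file
      kubectl create configmap gcp-launcher --from-file=gcp-launcher.properties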

    AWS S3

    AWS credentials can’t be set with a single file. Instead, set two environment variables that refer to the access key and secret key, using the steps below. A sketch with example values follows the steps.

    1. Create a secret pointing to the credentials:

      kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
    2. Create an extra config map in Kubernetes setting the required properties for AWS:

      1. Create a properties file with AWS properties:

        cat aws-launcher.properties
        spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
        spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
        spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
        spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
      2. Create a config map based on the properties file:

        kubectl create configmap aws-launcher --from-file=aws-launcher.properties
    3. Add the aws-launcher config map to values.yaml under job-launcher:

      configSources:
       - aws-launcher
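
    As a quick sketch of the steps above with example values (both credential values below are placeholders, not real credentials):

      # Create the secret from the access key pair
      kubectl create secret generic aws-secret --from-literal=key='AKIAIOSFODNN7EXAMPLE' --from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

      # Create the config map from the properties file shown above, then confirm it exists
      kubectl create configmap aws-launcher --from-file=aws-launcher.properties
      kubectl get configmap aws-launcher -o yaml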

    Azure Data Lake

    Configuring Azure credentials through environment variables or config maps isn’t possible yet. Instead, manually upload a core-site.xml file into the job-launcher pod at /app/spark-dist/conf (one way to copy the file is shown after the example). See below for an example core-site.xml file:

    <property>
      <name>dfs.adls.oauth2.access.token.provider.type</name>
      <value>ClientCredential</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.refresh.url</name>
      <value>Insert Your OAuth 2.0 Endpoint URL Value Here</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.client.id</name>
      <value>Insert Your Application ID Here</value>
    </property>
    <property>
      <name>dfs.adls.oauth2.credential</name>
      <value>Insert the Secret Key Value Here</value>
    </property>
    <property>
      <name>fs.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
    </property>
    <property>
      <name>fs.AbstractFileSystem.adl.impl</name>
      <value>org.apache.hadoop.fs.adl.Adl</value>
    </property>
    At this time, only Azure Data Lake Storage Gen1 is supported.
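
    One way to copy the file into the pod is kubectl cp. In this sketch the namespace fusion and the pod name job-launcher-5d9f7c6b8-abcde are hypothetical; use kubectl get pods to find the actual job-launcher pod name in your cluster.

      # Find the job-launcher pod, then copy core-site.xml into the Spark conf directory
      kubectl get pods -n fusion | grep job-launcher
      kubectl cp core-site.xml fusion/job-launcher-5d9f7c6b8-abcde:/app/spark-dist/conf/core-site.xml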

    Configuring credentials per job

    1. Create a Kubernetes secret with the GCP/AWS credentials.

    2. Add the Spark configuration properties that make the secret available to the Spark driver and executors.

    GCS

    The examples in this subsection use placeholder values. See the table below for descriptions of the placeholders:

    Placeholder        Description
    <key name>         Name of the Solr GCS service account key.
    <key file path>    Path to the Solr GCS service account key.

    1. Create a secret containing the credentials JSON file:

      kubectl create secret generic <key name> --from-file=/<key file path>/<key name>.json
    2. Toggle the Advanced configuration in the job UI, and add the following to the Spark configuration:

      spark.kubernetes.driver.secrets.<key name> = /mnt/gcp-secrets
      spark.kubernetes.executor.secrets.<key name> = /mnt/gcp-secrets
      spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
      spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/<key name>.json
      spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/<key name>.json
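
    For example, with the same hypothetical key name gcs-sa-key used earlier, the entries added to the job’s Spark configuration would read:

      spark.kubernetes.driver.secrets.gcs-sa-key = /mnt/gcp-secrets
      spark.kubernetes.executor.secrets.gcs-sa-key = /mnt/gcp-secrets
      spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/gcs-sa-key.json
      spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS = /mnt/gcp-secrets/gcs-sa-key.json
      spark.hadoop.google.cloud.auth.service.account.json.keyfile = /mnt/gcp-secrets/gcs-sa-key.json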

    AWS S3

    AWS credentials can’t be set with a single file. Instead, set two environment variables referring to the key and secret using the instructions below:

    1. Create a secret pointing to the credentials:

      kubectl create secret generic aws-secret --from-literal=key='<access key>' --from-literal=secret='<secret key>'
    2. Toggle the Advanced configuration in the job UI, and add the following to the Spark configuration:

      spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
      spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
      spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
      spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
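
    These secretKeyRef entries inject the secret’s key and secret values into the driver and executor pods as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. To sanity-check that they arrive, one option (a sketch only; pod names will differ in your cluster) is to inspect the driver pod’s environment while a job is running:

      # List running pods to find the Spark driver, then check its environment
      kubectl get pods | grep driver
      kubectl exec <driver pod name> -- env | grep AWS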