Fusion 5.9

    Built-in SQL Aggregation Jobs Using Cloud Storage Buckets

Built-in SQL aggregation jobs can be configured to read source files from cloud storage buckets.

This process can be used with the following file formats and cloud storage systems:

• File formats such as .parquet and .orc files

• Cloud storage systems such as Google Cloud Storage (GCS), Amazon Web Services (AWS) S3, and Azure Blob Storage

    Configure Parameters

    Google Cloud Storage (GCS)

    1. Create a Kubernetes secret with the necessary credentials. For more information about creating a secret containing the credentials JSON file, see Configuring credentials for Spark jobs.

    2. When the secret is successfully created, set the following parameters:

GENERAL PARAMETERS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| SOURCE COLLECTION | gs://<path_to_data>/*.parquet | URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the gs scheme accesses GCS data. |
| DATA FORMAT | parquet | File type of the input files. The other supported value is orc. |

SPARK SETTINGS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| spark.kubernetes.driver.secrets.{secret-name} | /mnt/gcp-secrets | Replace {secret-name} with the secret name obtained during configuration, for example example-serviceaccount-key. The value is the path where the secret is mounted on the driver. |
| spark.kubernetes.executor.secrets.{secret-name} | /mnt/gcp-secrets | Replace {secret-name} with the secret name obtained during configuration, for example example-serviceaccount-key. The value is the path where the secret is mounted on the executors. |
| spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS | /mnt/gcp-secrets/{secret-name}.json | The name of the .json file used to create the secret, for example example-serviceaccount-key.json. |
| spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS | /mnt/gcp-secrets/{secret-name}.json | The name of the .json file used to create the secret, for example example-serviceaccount-key.json. |
| spark.hadoop.google.cloud.auth.service.account.json.keyfile | /mnt/gcp-secrets/{secret-name}.json | The name of the .json file used to create the secret, for example example-serviceaccount-key.json. |
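To see how these settings fit together, here is a minimal PySpark sketch of the equivalent session configuration. It is illustrative only, not the Fusion job definition itself: the app name is made up, and the secret name and mount path mirror the example values above.

```python
from pyspark.sql import SparkSession

# Illustrative values; substitute the secret name you created.
SECRET_NAME = "example-serviceaccount-key"        # {secret-name}
KEYFILE = f"/mnt/gcp-secrets/{SECRET_NAME}.json"  # credentials JSON inside the mount

spark = (
    SparkSession.builder
    .appName("gcs-aggregation-sketch")
    # Mount the Kubernetes secret into the driver and executor pods.
    .config(f"spark.kubernetes.driver.secrets.{SECRET_NAME}", "/mnt/gcp-secrets")
    .config(f"spark.kubernetes.executor.secrets.{SECRET_NAME}", "/mnt/gcp-secrets")
    # Point the Google client libraries and the GCS Hadoop connector at the keyfile.
    .config("spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS", KEYFILE)
    .config("spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS", KEYFILE)
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", KEYFILE)
    .getOrCreate()
)

# SOURCE COLLECTION and DATA FORMAT from the general parameters table.
df = spark.read.format("parquet").load("gs://<path_to_data>/*.parquet")
df.printSchema()
```

Note that the spark.kubernetes.* properties only take effect when the job is submitted to a Kubernetes cluster, as is the case for Fusion's Spark jobs.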

    Amazon Web Services (AWS)

    1. Create a Kubernetes secret with the necessary credentials. For more information about creating a secret containing the credentials JSON file, see Configuring credentials for Spark jobs.

    2. When the secret is successfully created, set the following parameters:

GENERAL PARAMETERS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| SOURCE COLLECTION | s3a://<path_to_data>/*.parquet | URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the s3a scheme accesses AWS data. |
| DATA FORMAT | parquet | File type of the input files. The other supported value is orc. |

SPARK SETTINGS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID | {aws-secret-key} | The aws-secret:key value obtained during configuration. |
| spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY | {aws-secret-secret} | The aws-secret:secret value obtained during configuration. |
| spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID | {aws-secret-key} | The aws-secret:key value obtained during configuration. |
| spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY | {aws-secret-secret} | The aws-secret:secret value obtained during configuration. |
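As with GCS, a minimal PySpark sketch shows where these settings land. It assumes the Kubernetes secret is named aws-secret with entries key and secret, matching the placeholders above; the app name is made up.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3-aggregation-sketch")
    # Inject AWS credentials into the driver and executor pods from the
    # Kubernetes secret; the value format is <secret-name>:<secret-key>.
    .config("spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID", "aws-secret:key")
    .config("spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY", "aws-secret:secret")
    .config("spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID", "aws-secret:key")
    .config("spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY", "aws-secret:secret")
    .getOrCreate()
)

# SOURCE COLLECTION and DATA FORMAT: s3a is the Hadoop S3A connector scheme.
df = spark.read.format("parquet").load("s3a://<path_to_data>/*.parquet")
df.show(5)
```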

Microsoft Azure

GENERAL PARAMETERS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| SOURCE COLLECTION | wasbs://<path_to_data>/*.parquet | URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the wasbs scheme accesses Azure Blob Storage data. |
| DATA FORMAT | parquet | File type of the input files. The other supported value is orc. |

SPARK SETTINGS

| Parameter Name | Example Value | Notes |
| --- | --- | --- |
| spark.hadoop.fs.wasbs.impl | org.apache.hadoop.fs.azure.NativeAzureFileSystem | Makes the wasbs file system implementation available inside the Spark job. |
| spark.hadoop.fs.azure.account.key.{storage-account-name}.blob.core.windows.net | {access-key-value} | Obtain the values for {storage-account-name} and {access-key-value} from your Azure portal. |
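For completeness, here is the matching PySpark sketch for Azure Blob Storage. The storage account name and access key are placeholders to be filled in from the Azure portal, and the app name is made up.

```python
from pyspark.sql import SparkSession

# Placeholders; copy the real values from the Azure portal.
STORAGE_ACCOUNT = "{storage-account-name}"
ACCESS_KEY = "{access-key-value}"

spark = (
    SparkSession.builder
    .appName("azure-aggregation-sketch")
    # Register the wasbs:// file system implementation inside the Spark job.
    .config("spark.hadoop.fs.wasbs.impl",
            "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    # Authenticate to the Blob Storage account with its access key.
    .config(f"spark.hadoop.fs.azure.account.key.{STORAGE_ACCOUNT}.blob.core.windows.net",
            ACCESS_KEY)
    .getOrCreate()
)

df = spark.read.format("parquet").load("wasbs://<path_to_data>/*.parquet")
print(df.count())
```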