Fusion 5.12

    Built-in SQL Aggregation Jobs Using Cloud Storage Buckets

    Built-in SQL aggregation jobs can be configured to read source files from cloud storage buckets.

    This process can be used with the following data formats and cloud storage systems:

    • File formats: Parquet (.parquet) and ORC (.orc) files

    • Cloud storage systems: Google Cloud Storage (GCS), Amazon Web Services (AWS) S3, and Microsoft Azure Blob Storage. The URI scheme in the SOURCE COLLECTION parameter selects the storage system, as shown in the sketch below.
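
    As a quick reference, these are the URI schemes used in the sections that follow (the paths are illustrative placeholders):

        gs://<path_to_data>/*.parquet      # Google Cloud Storage
        s3a://<path_to_data>/*.parquet     # Amazon S3
        wasbs://<path_to_data>/*.parquet   # Azure Blob Storage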

    Configure Parameters

    Google Cloud Storage (GCS)

    1. Create a Kubernetes secret with the necessary credentials. For more information about creating a secret containing the credentials JSON file, see Configuring credentials for Spark jobs.

    2. When the secret is successfully created, set the following parameters. A worked example follows the tables below.

    GENERAL PARAMETERS

    • SOURCE COLLECTION
      Example value: gs://<path_to_data>/*.parquet
      Notes: URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the gs scheme is used to access GCS data.

    • DATA FORMAT
      Example value: parquet
      Notes: File type of the input files. The other supported value is orc.

    SPARK SETTINGS

    • spark.kubernetes.driver.secrets.{secret-name}
      Example value: /mnt/gcp-secrets
      Notes: Replace {secret-name} with the name of the secret obtained during configuration, for example example-serviceaccount-key. The value is the path where the secret is mounted on the driver.

    • spark.kubernetes.executor.secrets.{secret-name}
      Example value: /mnt/gcp-secrets
      Notes: Replace {secret-name} with the name of the secret obtained during configuration, for example example-serviceaccount-key. The value is the path where the secret is mounted on the executors.

    • spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS
      Example value: /mnt/gcp-secrets/{secret-name}.json
      Notes: Path to the .json file used to create the secret, for example /mnt/gcp-secrets/example-serviceaccount-key.json.

    • spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS
      Example value: /mnt/gcp-secrets/{secret-name}.json
      Notes: Path to the .json file used to create the secret, for example /mnt/gcp-secrets/example-serviceaccount-key.json.

    • spark.hadoop.google.cloud.auth.service.account.json.keyfile
      Example value: /mnt/gcp-secrets/{secret-name}.json
      Notes: Path to the .json file used to create the secret, for example /mnt/gcp-secrets/example-serviceaccount-key.json.
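
    For orientation, here is a minimal sketch of the GCS flow, assuming a service account key file named example-serviceaccount-key.json, a secret named example-serviceaccount-key, and a fusion namespace (all names are illustrative; substitute your own):

        # Create the secret from the service account key file
        # (secret name, file name, and namespace are assumptions).
        kubectl create secret generic example-serviceaccount-key \
          --from-file=example-serviceaccount-key.json \
          --namespace fusion

        # The Spark settings above, with the placeholders filled in:
        spark.kubernetes.driver.secrets.example-serviceaccount-key=/mnt/gcp-secrets
        spark.kubernetes.executor.secrets.example-serviceaccount-key=/mnt/gcp-secrets
        spark.kubernetes.driverEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/gcp-secrets/example-serviceaccount-key.json
        spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS=/mnt/gcp-secrets/example-serviceaccount-key.json
        spark.hadoop.google.cloud.auth.service.account.json.keyfile=/mnt/gcp-secrets/example-serviceaccount-key.json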

    Amazon Web Services (AWS)

    1. Create a Kubernetes secret with the necessary credentials. For more information about creating a secret containing the credentials, see Configuring credentials for Spark jobs.

    2. When the secret is successfully created, set the following parameters. A worked example follows the tables below.

    GENERAL PARAMETERS

    • SOURCE COLLECTION
      Example value: s3a://<path_to_data>/*.parquet
      Notes: URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the s3a scheme is used to access AWS data.

    • DATA FORMAT
      Example value: parquet
      Notes: File type of the input files. The other supported value is orc.

    SPARK SETTINGS

    • spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID
      Example value: {aws-secret-key}
      Notes: The aws-secret:key reference obtained during configuration, in secret-name:key format.

    • spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY
      Example value: {aws-secret-secret}
      Notes: The aws-secret:secret reference obtained during configuration, in secret-name:key format.

    • spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID
      Example value: {aws-secret-key}
      Notes: The aws-secret:key reference obtained during configuration, in secret-name:key format.

    • spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY
      Example value: {aws-secret-secret}
      Notes: The aws-secret:secret reference obtained during configuration, in secret-name:key format.
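
    A minimal sketch of the AWS flow, assuming a secret named aws-secret whose data keys key and secret hold the access key ID and secret access key, in a fusion namespace (all names are illustrative):

        # Create the secret holding the AWS credentials
        # (secret name, key names, and namespace are assumptions).
        kubectl create secret generic aws-secret \
          --from-literal=key=<access-key-id> \
          --from-literal=secret=<secret-access-key> \
          --namespace fusion

        # The Spark settings above; each value is secret-name:key-within-secret.
        spark.kubernetes.driver.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
        spark.kubernetes.driver.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret
        spark.kubernetes.executor.secretKeyRef.AWS_ACCESS_KEY_ID=aws-secret:key
        spark.kubernetes.executor.secretKeyRef.AWS_SECRET_ACCESS_KEY=aws-secret:secret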

    Microsoft Azure

    Set the following parameters. A worked example follows the tables below.

    GENERAL PARAMETERS

    • SOURCE COLLECTION
      Example value: wasbs://<path_to_data>/*.parquet
      Notes: URI path that contains the desired signal data files. The example value returns all parquet files in the directory; the wasbs scheme is used to access Azure Blob Storage data.

    • DATA FORMAT
      Example value: parquet
      Notes: File type of the input files. The other supported value is orc.

    SPARK SETTINGS

    • spark.hadoop.fs.wasbs.impl
      Example value: org.apache.hadoop.fs.azure.NativeAzureFileSystem
      Notes: Makes the wasbs file system implementation available inside the Spark job.

    • spark.hadoop.fs.azure.account.key.{storage-account-name}.blob.core.windows.net
      Example value: {access-key-value}
      Notes: Obtain the values for {storage-account-name} and {access-key-value} from the Azure portal.
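
    A minimal sketch with the placeholders filled in, assuming a storage account named examplestorage and a container named signals (both illustrative; wasbs URIs take the form wasbs://<container>@<account>.blob.core.windows.net/<path>):

        # The Spark settings above, with a hypothetical storage account:
        spark.hadoop.fs.wasbs.impl=org.apache.hadoop.fs.azure.NativeAzureFileSystem
        spark.hadoop.fs.azure.account.key.examplestorage.blob.core.windows.net=<access-key-value>

        # Matching SOURCE COLLECTION value:
        wasbs://signals@examplestorage.blob.core.windows.net/<path_to_data>/*.parquet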