Fusion 5.11

    Parallel Bulk Loader

    The Parallel Bulk Loader (PBL) job enables bulk ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro.

    For more information, see Import Data with the Parallel Bulk Loader.

    Use this job to load data into Fusion from a Spark SQL-compliant data source and send it either directly to Solr or to an index pipeline for additional ETL processing.

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_).

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$
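
    The constraints above can be checked before a job is submitted. The following is a minimal sketch that applies the documented match pattern and length limit; the function name and the sample IDs are hypothetical and not part of Fusion.

```python
import re

# Pattern and length limit copied from the id field description above.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-]+$")
MAX_JOB_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented pattern and length limit."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    # fullmatch rejects strings with a trailing newline, which "$" alone would allow.
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


# Hypothetical IDs used only for illustration.
print(is_valid_job_id("load_parquet-2024"))
print(is_valid_job_id("load parquet"))
```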

    format - string, required

    Specifies the input data source format; common examples include: parquet, json,

    path - string

    Path to load. For data sources that support multiple paths, separate the paths with commas.
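
    As an illustration of the comma-separated form, the sketch below joins several input locations into a single path value. The helper name and the paths are hypothetical; whether a given data source accepts multiple paths depends on that data source.

```python
def join_paths(paths: list) -> str:
    """Join input locations into one comma-separated path value."""
    for p in paths:
        # A comma inside a single path would be read as a separator.
        if "," in p:
            raise ValueError(f"path contains a comma: {p!r}")
    return ",".join(paths)


# Hypothetical input locations.
path = join_paths(["/data/events/2023", "/data/events/2024"])
print(path)
```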

    readOptions - array[object]

    Options passed to the data source to configure the read operation. Options differ by data source, so refer to the documentation for the data source you are reading from.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }
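
    Each entry in the array is an object with a key and a value, both strings. A minimal sketch of building that shape from a plain dictionary follows; the helper name is hypothetical, and the two option names are assumed examples for a CSV source rather than options listed on this page.

```python
def to_key_value_list(options: dict) -> list:
    """Convert a dict into the list of {key, value} objects described above."""
    return [{"key": str(k), "value": str(v)} for k, v in options.items()]


# Assumed example options; check the data source documentation for real ones.
read_options = to_key_value_list({"header": "true", "inferSchema": "true"})
print(read_options)
```

    The writeOptions, sparkConfig, shellOptions, and envOptions fields use the same key/value object shape.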

    outputCollection - string, required

    Solr collection to which the documents loaded from the input data source are sent.

    outputIndexPipeline - string

    Index pipeline to which the documents loaded from the input data source are sent, instead of sending them directly to Solr.

    outputParser - string

    Parser to use when sending documents to the index pipeline. Defaults to the parser with the same name as the index pipeline.

    defineFieldsUsingInputSchema - boolean

    If true, define fields in Solr using the input schema; if a SQL transform is defined, the fields to define are based on the transformed DataFrame schema instead of the input.

    Default: true

    atomicUpdates - boolean

    Send documents to Solr as atomic updates; only applies if sending directly to Solr and not an index pipeline.

    Default: false

    timestampFieldName - string

    Name of the field that holds a timestamp for each document; only required if using timestamps to filter new rows from the input source.

    clearDatasource - boolean

    If true, delete any documents indexed in Solr by previous runs of this job.

    Default: false

    outputPartitions - integer

    Repartition the input DataFrame into the specified number of partitions before writing to Solr or Fusion.

    optimizeOutput - integer

    Optimize the Solr collection down to the specified number of segments after writing to Solr.

    writeOptions - array[object]

    Options used when writing output to Solr directly or to an index profile.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    transformScala - string

    Optional Scala script used to transform the results returned by the data source before indexing. You must define your transform script in a method with the signature: def transform(inputDF: Dataset[Row]) : Dataset[Row]

    mlModelId - string

    The ID of the Spark ML PipelineModel stored in the Fusion blob store.

    transformSql - string

    Optional SQL used to transform the results returned by the data source before indexing. The input DataFrame returned from the data source will be registered as a temp table named '_input'. The Scala transform is applied before the SQL transform if both are provided, which allows you to define custom UDFs in the Scala script for use in your transformation SQL.
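
    The interplay between the two transforms can be sketched as a configuration fragment. In the sketch below, the Scala script registers a UDF and the SQL statement calls it against the _input temp table. The UDF name, the column names, and the Scala import line are assumptions for illustration; this page does not state whether imports are needed in the script.

```python
import json

# Hypothetical Scala transform. It follows the required method signature and
# registers a UDF (assumed name: to_upper) for the SQL transform to call.
TRANSFORM_SCALA = """\
import org.apache.spark.sql.{Dataset, Row}

def transform(inputDF: Dataset[Row]) : Dataset[Row] = {
  inputDF.sparkSession.udf.register(
    "to_upper", (s: String) => if (s == null) null else s.toUpperCase)
  inputDF
}
"""

# Hypothetical SQL transform. The columns id and name are assumed to exist
# in the input; _input is the temp table name given in the description above.
TRANSFORM_SQL = "SELECT id, to_upper(name) AS name_upper FROM _input"

transform_fragment = {
    "transformScala": TRANSFORM_SCALA,
    "transformSql": TRANSFORM_SQL,
}
print(json.dumps(transform_fragment, indent=2))
```

    Because the Scala transform runs first, the UDF is already registered by the time the SQL statement is evaluated.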

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    shellOptions - array[object]

    Additional options to pass to the Spark shell when running this job.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    envOptions - array[object]

    Additional environment variables to set for the Spark driver before running this job.

    object attributes: {
     key (required): {
      display name: Parameter Name
      type: string
     }
     value: {
      display name: Parameter Value
      type: string
     }
    }

    type - string, required

    Default: parallel-bulk-loader

    Allowed values: parallel-bulk-loader
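
    Putting the fields together, a minimal job configuration might look like the sketch below. It sets the four fields marked required on this page plus a few optional ones at their documented defaults. The ID, path, and collection name are hypothetical, and the helper that checks for missing required fields is an illustration, not part of Fusion.

```python
import json

# Fields marked "required" in the descriptions above.
REQUIRED_FIELDS = ("id", "format", "outputCollection", "type")


def missing_required(config: dict) -> list:
    """Return the required field names that are absent or empty in config."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


job_config = {
    "id": "load_parquet_products",       # hypothetical job ID
    "type": "parallel-bulk-loader",      # the only allowed value
    "format": "parquet",
    "path": "/data/products",            # hypothetical input path
    "outputCollection": "products",      # hypothetical Solr collection
    "defineFieldsUsingInputSchema": True,  # documented default
    "atomicUpdates": False,                # documented default
    "clearDatasource": False,              # documented default
}

print(missing_required(job_config))
print(json.dumps(job_config, indent=2))
```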