Fusion 5.12

    Parallel Bulk Loader job configuration specifications

    The Parallel Bulk Loader (PBL) job enables bulk ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro.

    Use this job to load data into Managed Fusion from a SparkSQL-compliant datasource, and then send the data to any Spark-supported datasource such as Solr, an index pipeline, etc.

    To create a Parallel Bulk Loader job, sign in to Managed Fusion and click Collections > Jobs. Then click Add+ and, in the Custom and Others Jobs section, select Parallel Bulk Loader. You can enter basic and advanced parameters to configure the job. If a field has a default value, it is populated when you add the job.

    Basic parameters

    To enter advanced parameters in the UI, click Advanced. Those parameters are described in the advanced parameters section.
    • Spark job ID. The unique ID for the Spark job, used in the API to reference this job. This is the id field in the configuration file. Required field.

    • Format. The format of the input datasource. For example, Parquet or JSON. This is the format field in the configuration file. Required field.

    • Path. The path to the datasource to load. If the datasource supports multiple paths, separate the paths with commas. This is the path field of the configuration file. Optional field.

    • Streaming. This is the streaming field in the configuration file. Optional field. If this checkbox is selected (set to true), the following fields are available:

      • Enable streaming. If this checkbox is selected (set to true), the job streams the data from the input datasource to an output Solr collection. This is the enableStreaming field in the configuration file. Optional field.

      • Output mode. This field specifies how the output is processed. Values include append, complete, and update. This is the outputMode field in the configuration file. Optional field.

    • Read Options. This section lets you enter parameter name:parameter value options to use when reading input from datasources. Options differ for every datasource, so refer to the documentation for that datasource for more information. This is the readOptions field in the configuration file. See the sketch after this list for how these options fit into a Spark read.

    • Output collection. The Solr collection where the documents loaded from the input datasource are stored. This is the outputCollection field in the configuration file. Optional field.

    • Send to index pipeline. The index pipeline to which the documents loaded from the input datasource are sent, instead of being loaded directly into Solr. This is the outputIndexPipeline field in the configuration file. Optional field.

    • Spark ML pipeline model ID. The identifier of the Spark machine learning (ML) pipeline model that is stored in the Managed Fusion blob store. This is the mlModelId field in the configuration file. Optional field.
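
    Taken together, the Format, Path, and Read Options fields describe a Spark DataFrame read. The sketch below only illustrates that mapping; the format, path, and option values are hypothetical examples, not defaults.

    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("pbl-read-sketch").getOrCreate()

    // Format = parquet, Path = s3a://example-bucket/data,
    // Read Options = [mergeSchema -> true]  (all values hypothetical)
    val inputDF: DataFrame = spark.read
      .format("parquet")
      .option("mergeSchema", "true")
      .load("s3a://example-bucket/data")

    inputDF.printSchema()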

    Advanced parameters

    If you click the Advanced toggle, the following optional fields are displayed in the UI.

    • Spark Settings. This section lets you enter parameter name:parameter value options to use for Spark configuration. This is the sparkConfig field in the configuration file.

    • Send to parser. The parser to which documents are sent when they are sent to an index pipeline. The default is the value in the Send to index pipeline field. This is the outputParser field in the configuration file.

    • Define fields in Solr? If this checkbox is selected (set to true), the job defines fields in Solr using the input schema. However, if a SQL transform is defined, the fields to define are based on the transformed DataFrame schema instead of the input. This is the defineFieldsUsingInputSchema field in the configuration file.

    • Send as Atomic updates? If this checkbox is selected (set to true), the job sends documents to Solr as atomic updates. An atomic update allows changes to one or more fields of a document without having to reindex the whole document. This feature only applies if sending directly to Solr and not to an index pipeline. This is the atomicUpdates field in the configuration file.

    • Timestamp field name. The field name that contains the timestamp value for each document. This field is only required if timestamps are used to filter new rows from the input source. This is the timestampFieldName field in the configuration file.

    • Clear existing documents. If this checkbox is selected (set to true), the job deletes any documents indexed in Solr by previous runs of this job. The default is false. This is the clearDatasource field in the configuration file.

    • Output partitions. The number of partitions into which the input DataFrame is repartitioned before the data is written to Solr or Managed Fusion. This is the outputPartitions field in the configuration file.

    • Optimize. The number of segments into which the Solr collection is optimized after data is written to Solr. This is the optimizeOutput field in the configuration file.

    • Write Options. This section lets you enter parameter name:parameter value options to use when writing output to sources other than Solr or the index pipeline. This is the writeOptions field in the configuration file.

    • Transform Scala. The Scala script used to transform the results returned by the datasource before indexing. Define the transform script in a method with signature: def transform(inputDF: Dataset[Row]) : Dataset[Row]. This is the transformScala field in the configuration file. See the example after this list.

    • Transform SQL. The SQL script used to transform the results returned by the datasource before indexing. The input DataFrame returned from the datasource is registered as a temp table named _input. The Scala transform is applied before the SQL transform if both are provided, which lets you define custom user-defined functions (UDFs) in the Scala script for use in your transformation SQL. This is the transformSql field in the configuration file.

    • Spark shell options. This section lets you enter parameter name:parameter value options to send to the Spark shell when the job is run. This is the shellOptions field in the configuration file.

    • Interpreter params. This section lets you enter parameter name:parameter value options to bind the key:value pairs to the script interpreter. This is the templateParams field in the configuration file.

    • Continue after index failure. If this checkbox is selected (set to true), the job skips over a document that fails when it is sent through an index pipeline, and continues to the next document without failing the job. This is the continueAfterFailure field in the configuration file.
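
    As a minimal sketch of a Transform Scala script, the method below uses the signature given above; the column names are hypothetical and the functions used are standard Spark SQL functions, not part of the Parallel Bulk Loader API.

    import org.apache.spark.sql.{Dataset, Row}
    import org.apache.spark.sql.functions._

    def transform(inputDF: Dataset[Row]): Dataset[Row] = {
      inputDF
        .filter(col("title").isNotNull)                  // drop rows with no title (hypothetical field)
        .withColumn("indexed_at", current_timestamp())   // add an ingest timestamp
        .withColumnRenamed("body", "body_t")             // rename to a Solr text dynamic field
    }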

    Use this job when you want to load data into Fusion from a SparkSQL-compliant datasource and send this data to any Spark-supported datasource (Solr/Index Pipeline/S3/GCS/...).

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
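
    To illustrate the constraints above, the following sketch checks a candidate ID against the documented pattern and length limit; the helper name is hypothetical.

    // Validate a candidate job ID against the documented pattern and 63-character limit.
    def isValidJobId(id: String): Boolean =
      id.length <= 63 && id.matches("[a-zA-Z][_\\-a-zA-Z0-9]*[a-zA-Z0-9]?")

    isValidJobId("my-bulk-loader_01")    // true
    isValidJobId("1-starts-with-digit")  // false: must start with a letter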

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    format - string (required)

    Specifies the input data source format; common examples include parquet, json, and textinputformat.

    path - string

    Path to load; for data sources that support multiple paths, separate by commas

    streaming - Streaming

    enableStreaming - boolean

    Stream data from input source to output Solr collection

    outputMode - string

    Specifies the output mode for streaming. E.g., append (default), complete, update

    Default: append

    Allowed values: append, complete, update
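
    For reference, these streaming settings correspond conceptually to Spark Structured Streaming, where the output mode controls how results are emitted. A minimal sketch, assuming a hypothetical JSON source; the Parallel Bulk Loader job itself writes to Solr or an index pipeline rather than the console.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("pbl-streaming-sketch").getOrCreate()

    // File-based streaming sources require an explicit schema (hypothetical fields).
    val schema = new StructType().add("id", StringType).add("title", StringType)

    val streamDF = spark.readStream
      .format("json")
      .schema(schema)
      .load("s3a://example-bucket/incoming/")

    // outputMode maps to the outputMode setting: append, complete, or update.
    val query = streamDF.writeStream
      .outputMode("append")
      .format("console")
      .start()

    query.awaitTermination()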

    readOptions - array[object]

    Options passed to the data source to configure the read operation; options differ for every data source so refer to the documentation for more information.

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    outputCollection - string

    Solr Collection to send the documents loaded from the input data source.

    outputIndexPipeline - string

    Send the documents loaded from the input data source to an index pipeline instead of going directly to Solr.

    outputParser - string

    Parser to send the documents to while sending to index pipeline. (Defaults to same as index pipeline)

    defineFieldsUsingInputSchema - boolean

    If true, define fields in Solr using the input schema; if a SQL transform is defined, the fields to define are based on the transformed DataFrame schema instead of the input.

    Default: true

    atomicUpdates - boolean

    Send documents to Solr as atomic updates; only applies if sending directly to Solr and not an index pipeline.

    Default: false

    timestampFieldName - string

    Name of the field that holds a timestamp for each document; only required if using timestamps to filter new rows from the input source.

    clearDatasource - boolean

    If true, delete any documents indexed in Solr by previous runs of this job. Default is false.

    Default: false

    outputPartitions - integer

    Repartition the input DataFrame into this number of partitions before writing out to Solr or Fusion.

    optimizeOutput - integer

    Optimize the Solr collection down to the specified number of segments after writing to Solr.

    writeOptions - array[object]

    Options used when writing output. For output formats other than solr or index-pipeline, format and path options can be specified here

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }
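
    For outputs other than Solr or an index pipeline, the write can be pictured as a Spark DataFrame write, with format and path supplied through writeOptions. A sketch under that assumption; the format, path, and compression values are hypothetical.

    import org.apache.spark.sql.DataFrame

    // writeOptions = [format -> parquet, path -> s3a://example-bucket/out, compression -> snappy]
    def writeOutput(outputDF: DataFrame): Unit =
      outputDF.write
        .format("parquet")
        .option("compression", "snappy")
        .save("s3a://example-bucket/out")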

    transformScala - string

    Optional Scala script used to transform the results returned by the data source before indexing. You must define your transform script in a method with signature: def transform(inputDF: Dataset[Row]) : Dataset[Row]

    mlModelId - string

    The ID of the Spark ML PipelineModel stored in the Fusion blob store.

    transformSql - string

    Optional SQL used to transform the results returned by the data source before indexing. The input DataFrame returned from the data source will be registered as a temp table named '_input'. The Scala transform is applied before the SQL transform if both are provided, which allows you to define custom UDFs in the Scala script for use in your transformation SQL.
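
    Because the Scala transform runs before the SQL transform, a transformScala script can register a UDF for the SQL to call. A minimal sketch; the UDF name and column names are hypothetical.

    import org.apache.spark.sql.{Dataset, Row}

    def transform(inputDF: Dataset[Row]): Dataset[Row] = {
      // Register a UDF that the SQL transform can call by name.
      inputDF.sparkSession.udf.register(
        "normalize_title",
        (s: String) => if (s == null) null else s.trim.toLowerCase)
      inputDF   // return the input unchanged; the SQL transform reshapes it
    }

    // transformSql, applied afterwards against the temp table '_input':
    //   SELECT id, normalize_title(title) AS title_s FROM _input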

    shellOptions - array[object]

    Additional options to pass to the Spark shell when running this job.

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    templateParams - array[object]

    Bind the key/values to the script interpreter

    object attributes: {
      key (required): {
        display name: Parameter Name
        type: string
      }
      value: {
        display name: Parameter Value
        type: string
      }
    }

    continueAfterFailure - boolean

    If set to true, when a document fails while being sent through an index pipeline, the job continues on to the next document instead of failing.

    Default: false

    type - string (required)

    Default: parallel-bulk-loader

    Allowed values: parallel-bulk-loader
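
    Putting these properties together, a minimal configuration might look like the sketch below. It is assembled only from the field names documented above and assumes the array-of-key/value shape shown for readOptions; the ID, path, collection, and option values are placeholders. It is written as a Scala string literal for consistency with the other examples.

    // A hedged sketch of a minimal Parallel Bulk Loader configuration; all values are placeholders.
    val exampleConfig: String =
      """{
        |  "id": "load-example-parquet",
        |  "type": "parallel-bulk-loader",
        |  "format": "parquet",
        |  "path": "s3a://example-bucket/data",
        |  "outputCollection": "example_collection",
        |  "readOptions": [
        |    { "key": "mergeSchema", "value": "true" }
        |  ],
        |  "clearDatasource": false
        |}""".stripMargin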