Parallel Bulk Loader job configuration specifications
The Parallel Bulk Loader (PBL) job enables bulk ingestion of structured and semi-structured data from big data systems, NoSQL databases, and common file formats like Parquet and Avro.
Use this job to load data into Managed Fusion from a SparkSQL-compliant datasource, and then send the data to Solr, an index pipeline, or any other Spark-supported datasource.
To create a Parallel Bulk Loader job, sign in to Managed Fusion and click Collections > Jobs. Then click Add+ and, in the Custom and Others Jobs section, select Parallel Bulk Loader. You can enter basic and advanced parameters to configure the job. If a field has a default value, it is populated when you add the job.
Basic parameters
To enter advanced parameters in the UI, click Advanced. Those parameters are described in the Advanced parameters section. A sample configuration file that uses the basic fields is shown after this list.
- Spark job ID. The unique ID for the Spark job that references this job in the API. This is the `id` field in the configuration file. Required field.
- Format. The format of the input datasource. For example, Parquet or JSON. This is the `format` field in the configuration file. Required field.
- Path. The path from which to load the datasource. If the datasource has multiple paths, separate the paths with commas. This is the `path` field in the configuration file. Optional field.
- Streaming. This is the `streaming` field in the configuration file. Optional field. If this checkbox is selected (set to `true`), the following fields are available:
  - Enable streaming. If this checkbox is selected (set to `true`), the job streams the data from the input datasource to an output Solr collection. This is the `enableStreaming` field in the configuration file. Optional field.
  - Output mode. This field specifies how the output is processed. Values include `append`, `complete`, and `update`. This is the `outputMode` field in the configuration file. Optional field.
- Read Options. This section lets you enter `parameter name:parameter value` options to use when reading input from datasources. Options differ for every datasource, so refer to the documentation for that datasource for more information. This is the `readOptions` field in the configuration file.
- Output collection. The Solr collection where the documents loaded from the input datasource are stored. This is the `outputCollection` field in the configuration file. Optional field.
- Send to index pipeline. The index pipeline through which the documents loaded from the input datasource are sent, instead of being loaded directly to Solr. This is the `outputIndexPipeline` field in the configuration file. Optional field.
- Spark ML pipeline model ID. The identifier of the Spark machine learning (ML) pipeline model that is stored in the Managed Fusion blob store. This is the `mlModelId` field in the configuration file. Optional field.
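To make the mapping between UI fields and configuration-file fields concrete, here is a minimal sketch of a job configuration that uses only the basic fields described above. The job ID, path, read option, and collection name are illustrative placeholders, and the shape shown for `readOptions` (a list of key/value pairs, mirroring how the UI presents them) is an assumption; an exported job configuration from your own Managed Fusion instance is the authoritative reference.

```json
{
  "id": "load-products-parquet",
  "format": "parquet",
  "path": "s3a://example-bucket/products/",
  "readOptions": [
    { "key": "mergeSchema", "value": "true" }
  ],
  "outputCollection": "products"
}
```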
Advanced parameters
If you click the Advanced toggle, the following optional fields are displayed in the UI. A second sample configuration that combines several of these fields follows this list.
- Spark Settings. This section lets you enter `parameter name:parameter value` options to use for the Spark configuration. This is the `sparkConfig` field in the configuration file.
- Send to parser. The parser to which documents are sent when sending to the index pipeline. The default is the value in the Send to index pipeline field. This is the `outputParser` field in the configuration file.
- Define fields in Solr? If this checkbox is selected (set to `true`), the job defines fields in Solr using the input schema. However, if a SQL transform is defined, the fields to define are based on the transformed DataFrame schema instead of the input schema. This is the `defineFieldsUsingInputSchema` field in the configuration file.
- Send as atomic updates? If this checkbox is selected (set to `true`), the job sends documents to Solr as atomic updates. An atomic update allows changes to one or more fields of a document without reindexing the whole document. This feature only applies when sending directly to Solr, not to an index pipeline. This is the `atomicUpdates` field in the configuration file.
- Timestamp field name. The name of the field that contains the timestamp value for each document. This field is only required if timestamps are used to filter new rows from the input source. This is the `timestampFieldName` field in the configuration file.
- Clear existing documents. If this checkbox is selected (set to `true`), the job deletes any documents indexed in Solr by previous runs of this job. The default is `false`. This is the `clearDatasource` field in the configuration file.
- Output partitions. The number of partitions to create in the input DataFrame where data is stored before it is written to Solr or Managed Fusion. This is the `outputPartitions` field in the configuration file.
- Optimize. The number of segments into which the Solr collection is optimized after data is written to Solr. This is the `optimizeOutput` field in the configuration file.
- Write Options. This section lets you enter `parameter name:parameter value` options to use when writing output to sources other than Solr or the index pipeline. This is the `writeOptions` field in the configuration file.
- Transform Scala. The Scala script used to transform the results returned by the datasource before indexing. Define the transform script in a method with the signature `def transform(inputDF: Dataset[Row]): Dataset[Row]`. This is the `transformScala` field in the configuration file.
- Transform SQL. The SQL script used to transform the results returned by the datasource before indexing. The input DataFrame returned from the datasource is registered as a temp table named `_input`. The Scala transform is applied before the SQL transform if both are provided, which lets you define custom user-defined functions (UDFs) in the Scala script for use in your transformation SQL. This is the `transformSql` field in the configuration file.
- Spark shell options. This section lets you enter `parameter name:parameter value` options to send to the Spark shell when the job is run. This is the `shellOptions` field in the configuration file.
- Interpreter params. This section lets you enter `parameter name:parameter value` options to bind the `key:value` pairs to the script interpreter. This is the `templateParams` field in the configuration file.
- Continue after index failure. If this checkbox is selected (set to `true`), the job skips over a document that fails when it is sent through an index pipeline and continues to the next document without failing the job. This is the `continueAfterFailure` field in the configuration file.
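As a second sketch, the configuration below combines several advanced fields with a SQL transform and sends documents through an index pipeline instead of directly to Solr. As before, the field names are the configuration-file names from the lists above, while the job ID, path, pipeline name, SQL column names, and partition count are illustrative placeholders rather than defaults.

```json
{
  "id": "load-events-json",
  "format": "json",
  "path": "s3a://example-bucket/events/",
  "outputIndexPipeline": "events-pipeline",
  "timestampFieldName": "event_time",
  "clearDatasource": false,
  "outputPartitions": 4,
  "transformSql": "SELECT id, event_time, upper(event_type) AS event_type_s FROM _input",
  "continueAfterFailure": true
}
```

Because only `transformSql` is set here, the SQL runs directly against the `_input` temp table; if a `transformScala` script were also provided, it would run first, so any UDFs it registers could be referenced in the SQL.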