Normally, signals are indexed as streaming data during the natural activity of users. This topic describes how to load historical signals data in batches, in Parquet format, using the Spark shell.
Fusion’s performance may be affected during this resource-intensive operation. Be sure to allocate sufficient memory for the Spark, Solr, and connectors services.
Customize the code below by replacing the following strings:
path_of_folder - The absolute path to the folder containing your Parquet files.
collection_name_signals - The name of the signals collection where you want to load these signals.
localhost:9983/lwfusion/4.2.1/solr - You can verify the correct path by going to the Solr console at http://fusion_host:8983/solr/#/ and looking for the value of
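// Folder containing the Parquet files to load
val parquetFilePath = "path_of_folder"
val signals = spark.read.parquet(parquetFilePath)

// Target signals collection and the ZooKeeper connect string for Solr
val collectionName = "collection_name_signals"
val zkhostName = "localhost:9983/lwfusion/4.2.1/solr"

// Write options for the Solr data source
val connectionMap = Map("collection" -> collectionName, "zkhost" -> zkhostName, "commit_within" -> "5000", "batch_size" -> "10000")

// Index the signals into the collection
signals.write.format("solr").options(connectionMap).save()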
For information about commit_within and batch_size, see https://github.com/lucidworks/spark-solr#commit_within.
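If you load several folders or tune the write settings between runs, it can help to build the options map in one place. The following is a minimal sketch, not part of spark-solr or Fusion: solrWriteOptions is a hypothetical helper name, and the defaults simply repeat the commit_within and batch_size values from the code above. It uses plain Scala only, so it can be pasted into the Spark shell along with the load code.

```scala
// Hypothetical helper (an assumption for illustration, not a spark-solr API):
// builds the options map passed to signals.write.format("solr").options(...).
def solrWriteOptions(
    collection: String,
    zkhost: String,
    commitWithinMs: Int = 5000, // same value as "commit_within" above
    batchSize: Int = 10000      // same value as "batch_size" above
): Map[String, String] = {
  require(collection.nonEmpty, "collection must not be empty")
  require(zkhost.contains(":"), "zkhost should look like host:port/path")
  require(commitWithinMs > 0 && batchSize > 0, "settings must be positive")
  Map(
    "collection"    -> collection,
    "zkhost"        -> zkhost,
    "commit_within" -> commitWithinMs.toString,
    "batch_size"    -> batchSize.toString
  )
}

// Same placeholders as in the load code; replace them in the same way.
val connectionMap = solrWriteOptions(
  "collection_name_signals",
  "localhost:9983/lwfusion/4.2.1/solr"
)
```

The resulting map can be passed to signals.write.format("solr").options(connectionMap).save() exactly as in the code above. The require checks only catch obviously malformed input before the write starts; they do not verify that the collection or the ZooKeeper path exists.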
Launch the Spark shell:
At the scala> prompt, enter paste mode by typing :paste and pressing Enter.
Paste your modified code from step 1.
Exit paste mode by pressing Ctrl+D.
When the operation is finished, navigate to Collections > Collections Manager to verify that the number of documents in the specified signals collection has increased as expected.
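As an alternative to checking the count in the UI, the same comparison can be sketched from the Spark shell. This example assumes the shell session is still open (so spark is defined), that the Solr data source used for the write can also read the collection with the same collection and zkhost options, and that the placeholders are replaced as before.

```scala
// For the Spark shell, where `spark` (a SparkSession) is predefined.
// Placeholders match the load code; replace them in the same way.
val collectionName = "collection_name_signals"
val zkhostName = "localhost:9983/lwfusion/4.2.1/solr"
val readOptions = Map("collection" -> collectionName, "zkhost" -> zkhostName)

// Number of rows in the Parquet source folder.
val parquetCount = spark.read.parquet("path_of_folder").count()

// Number of documents currently visible in the signals collection.
val solrCount = spark.read.format("solr").options(readOptions).load().count()

println(s"Parquet rows: $parquetCount, documents in $collectionName: $solrCount")
```

The two numbers are not expected to match exactly if the collection already held signals before the load or if some rows share a document ID. With commit_within set to 5000, newly written documents may also take a few seconds to appear in the count.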