Importing Signals

Normally, signals are indexed as streaming data generated by users' natural activity. This topic describes how to batch-load historical signals data in Parquet format using the Spark shell.

Note
Fusion’s performance may be affected during this resource-intensive operation. Be sure to allocate sufficient memory for the Spark, Solr, and connectors services.
How to load Parquet files in the Spark shell
  1. Customize the code below by replacing the following strings:

    • path_of_folder - The absolute path to the folder containing your Parquet files.

    • collection_name_signals - The name of the signals collection where you want to load these signals.

    • localhost:9983/lwfusion/4.2.1/solr - The ZooKeeper connect string for your Fusion deployment. You can verify the correct value by going to the Solr console at http://fusion_host:8983/solr/#/ and looking for the value of zkHost.

    // Read the Parquet files into a DataFrame
    val parquetFilePath = "path_of_folder"
    val signals = spark.read.parquet(parquetFilePath)
    val collectionName = "collection_name_signals"
    val zkhostName = "localhost:9983/lwfusion/4.2.1/solr"
    // Write the DataFrame to Solr via the spark-solr connector
    val connectionMap = Map(
      "collection"    -> collectionName,
      "zkhost"        -> zkhostName,
      "commit_within" -> "5000",   // soft commit within 5 seconds
      "batch_size"    -> "10000")  // documents per update request
    signals.write.format("solr").options(connectionMap).save()

    For information about commit_within and batch_size, see https://github.com/lucidworks/spark-solr#commit_within.
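    The batch_size option controls how many documents the connector buffers into each update request it sends to Solr. As a rough illustration (plain Scala, not connector code; the doc1…doc10 names are made up), grouping ten documents with a batch size of 3 works like this:

    ```scala
    // Illustrative only: spark-solr performs this batching internally.
    // With batch_size = 3, ten documents become four update requests.
    val docIds = (1 to 10).map(i => s"doc$i").toList
    val batchSize = 3
    val batches = docIds.grouped(batchSize).toList
    println(batches.length)        // 4 requests
    println(batches.last.length)   // final partial batch of 1
    ```

    Larger batches mean fewer round trips to Solr; commit_within (in milliseconds) bounds how long indexed documents can wait before becoming searchable.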

  2. Launch the Spark shell:

    fusion/4.2.x/bin/spark-shell
  3. At the scala> prompt, enter paste mode:

    :paste
  4. Paste your modified code from step 1.

  5. Exit paste mode by pressing CTRL-d.

  6. When the operation is finished, navigate to Collections > Collections Manager to verify that the number of documents in the specified signals collection has increased as expected.
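    As an alternative check, you can count the documents from the same spark-shell session by reading the collection back through the spark-solr connector. This is a sketch that reuses the placeholder collection name and zkhost from step 1 and assumes Fusion is running:

    ```scala
    // Sketch: count documents in the signals collection after the load.
    // Run inside the spark-shell session from step 2.
    val readOptions = Map(
      "collection" -> "collection_name_signals",
      "zkhost"     -> "localhost:9983/lwfusion/4.2.1/solr")
    val signalCount = spark.read.format("solr").options(readOptions).load().count()
    println(s"Documents in collection: $signalCount")
    ```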