Import Data with the Parallel Bulk Loader

| Setting | Description |
| --- | --- |
| format | Unique identifier of the data source provider. Spark scans the job's classpath for a class named DefaultSource in the <format> package. For example, for the solr format, the solr.DefaultSource class is provided in the spark-solr repository. |
| path (optional) | Comma-delimited list of paths to load. Some data sources, such as Parquet, require a path; others, such as Solr, do not. Refer to the documentation for your data source to determine whether you need to provide a path. |
| readOptions | Options passed to the Spark SQL data source to configure the read operation. Options differ for every data source; refer to the specific data source documentation for more information. |
| sparkConfig (optional) | List of Spark configuration settings needed to run the Parallel Bulk Loader job. |
| shellOptions | Behind the scenes, the Parallel Bulk Loader job submits a Scala script to the Fusion Spark shell. The shellOptions setting lets you pass any additional options needed by the Spark shell. The two most common options are --packages, a comma-separated list of Maven coordinates (groupId:artifactId:version) of JAR files to include on the driver and executor classpaths (Spark searches the local Maven repository, then Maven Central, then any additional remote repositories given in the configuration), and --repositories, a comma-separated list of additional remote Maven repositories to search for the coordinates given in --packages. The Index HBase tables example below demonstrates using both options to load the com.hortonworks:shc-core:1.1.1-2.1-s_2.11 package from the Hortonworks repository. Tip: use https://spark-packages.org/ to find packages to add to your Parallel Bulk Loader jobs. |
| timestampFieldName | For data sources that support time-based filters, the Parallel Bulk Loader computes the timestamp of the last document written to Solr and the current timestamp of the Parallel Bulk Loader job. For example, the HBase data source lets you bound the read between MIN_STAMP and MAX_STAMP: val timeRangeOpts = Map(HBaseRelation.MIN_STAMP -> minStamp.toString, HBaseRelation.MAX_STAMP -> maxStamp.toString). This lets Parallel Bulk Loader jobs run on schedules and pull only the newest rows from the underlying data source. To support timestamp-based filtering, the Parallel Bulk Loader provides two macros: $lastTimestamp(format) and $nowTimestamp(format). The format argument is optional; if not supplied, an ISO-8601 date/time string is used. The timestampFieldName setting determines the value of lastTimestamp, using a top-1 query to Solr to get the maximum timestamp. You can also pass $lastTimestamp(EPOCH) or $lastTimestamp(EPOCH_MS) to get the timestamp in seconds or milliseconds. See the Index HBase tables example below for an example of using this setting. |
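To make the MIN_STAMP/MAX_STAMP filter above more concrete, here is a minimal sketch of a time-range-filtered HBase read through the Hortonworks shc-core connector mentioned in the shellOptions row. The catalog JSON, table name, and column family are hypothetical, and the connector API may differ by version; treat this as an illustration, not the documented job configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.{HBaseRelation, HBaseTableCatalog}

val spark = SparkSession.builder().appName("hbase-time-range-sketch").getOrCreate()

// Hypothetical shc catalog describing an HBase table with one column family.
val catalog =
  """{
    |  "table": {"namespace": "default", "name": "fusion_nums"},
    |  "rowkey": "key",
    |  "columns": {
    |    "id":    {"cf": "rowkey", "col": "key",   "type": "string"},
    |    "value": {"cf": "lw",     "col": "value", "type": "string"}
    |  }
    |}""".stripMargin

// Only read cells written between minStamp and maxStamp; in a real job,
// minStamp would come from the last timestamp written to Solr.
val minStamp = 0L
val maxStamp = System.currentTimeMillis()
val timeRangeOpts = Map(
  HBaseRelation.MIN_STAMP -> minStamp.toString,
  HBaseRelation.MAX_STAMP -> maxStamp.toString)

val df = spark.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog) ++ timeRangeOpts)
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()
```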
| Setting | Description |
| --- | --- |
| transformScala | Sometimes a small script is enough to transform input data into the correct form for indexing; at other times you need the full power of the Spark API. The transformScala option lets you filter and/or transform the input DataFrame any way you like, and you can even define UDFs to use during the transformation. For an example of using Scala to transform the input DataFrame before indexing to Solr, see the Read from Parquet example. Another powerful use of transformScala is pulling in advanced libraries, such as Spark NLP (from John Snow Labs), to do NLP work on your content before indexing; see the Use NLP during indexing example. Your Scala script can do other things, but at a minimum it must define the function that the Parallel Bulk Loader invokes (see below this table). |
| transformSql | The transformSql option lets you write a SQL query to transform the input DataFrame. The SQL is executed after the transformScala script, if both are defined. The input DataFrame is exposed to your SQL as the _input view. See the Clean up data with SQL transformations example below for an example of using SQL to transform the input before indexing to Solr. This option also lets you leverage the UDF/UDAF functions provided by Spark SQL. |
| mlModelId | If you have a Spark ML PipelineModel loaded into the blob store, you can supply its blob ID to the Parallel Bulk Loader, which will: (1) load the model from the blob store, (2) use it to transform the input DataFrame (after the Scala transform but before the SQL transform), and (3) add the predicted output field (specified in the model metadata stored in the blob store) to the projected fields list. |
The transformScala function:
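The function body itself is not included in this extract. As an assumption based on common spark-solr usage, the script defines a transform function over the input DataFrame roughly like the sketch below; the column names in the body are hypothetical.

```scala
import org.apache.spark.sql.{Dataset, Row}

// Assumed entry point: the Parallel Bulk Loader calls this with the DataFrame
// it read from the data source and indexes whatever DataFrame it returns.
def transform(inputDF: Dataset[Row]): Dataset[Row] = {
  // Example transformation: drop rows without an id and rename a column.
  // "id" and "raw_title" are hypothetical column names.
  inputDF
    .filter(inputDF("id").isNotNull)
    .withColumnRenamed("raw_title", "title_t")
}
```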
| Setting | Description |
| --- | --- |
| outputCollection | Name of the Fusion collection to write to. The Parallel Bulk Loader uses the Collections API to resolve the underlying Solr collection at runtime. |
| outputIndexPipeline | Name of a Fusion index pipeline to send documents to, instead of indexing directly to Solr. This option lets you perform additional ETL (extract, transform, and load) processing on the documents before they are indexed in Solr. If you need to write to time-partitioned indexes, you must use an index pipeline, because writing directly to Solr is not partition aware. |
| defineFieldsUsingInputSchema | Flag that indicates whether the Parallel Bulk Loader should use the input schema to create fields in Solr after applying the Scala and/or SQL transformations. If false, the Parallel Bulk Loader relies on the Fusion index pipeline and/or Solr field guessing to create the fields. If true, only fields that do not already exist in Solr are created; consequently, if there is a type mismatch between an existing field in Solr and the input schema, you need to use a transformation to rename the field in the input schema. |
| clearDatasource | If checked, the Parallel Bulk Loader deletes any existing documents in the output collection that match the query _lw_loader_id_s:<JOB>. To support this, the Parallel Bulk Loader adds two metadata fields to each row: _lw_loader_id_s and _lw_loader_job_id_s. |
| atomicUpdates | Flag to send documents directly to Solr as atomic updates instead of as new documents. This option is not supported when using an index profile. Also note that the Parallel Bulk Loader tracking fields _lw_loader_id_s and _lw_loader_job_id_s are not sent when using atomic updates, so the clear datasource option does not work with documents created using atomic updates. |
| outputOptions | Options used when writing directly to Solr; see the spark-solr index parameters at https://github.com/lucidworks/spark-solr#index-parameters. For example, if your documents are relatively small, you might want to increase the batch_size (default 2000). |
| outputPartitions | Coalesce the DataFrame into N partitions before writing to Solr. This can help spread the indexing work across more of the executors available in Spark, or limit the parallelism when writing to Solr. |
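For orientation on what outputOptions map to, here is a minimal sketch of writing a DataFrame to Solr with the spark-solr library. The zkhost, collection, and option values are placeholders; option names are the ones documented in the spark-solr README.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("solr-write-sketch").getOrCreate()
val df: DataFrame = spark.read.parquet("/path/to/input")  // placeholder input

// Write directly to Solr; batch_size and commit_within are spark-solr
// index parameters of the kind outputOptions exposes.
df.write
  .format("solr")
  .option("zkhost", "localhost:9983")    // placeholder ZooKeeper connection string
  .option("collection", "my_collection") // placeholder collection
  .option("batch_size", "5000")          // default is 2000; larger batches suit small docs
  .option("commit_within", "10000")      // commit within 10 seconds
  .mode("overwrite")
  .save()
```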
Commonly used Spark shell options include the following:

| Parameter Name | Description and Default |
| --- | --- |
| --driver-cores | Cores for the driver. Default: 1 |
| --driver-memory | Memory for the driver (for example, 1000M or 2G). Default: 1024M |
| --executor-cores | Cores per executor. Default: 1 in YARN mode, or all available cores on the worker in standalone mode |
| --executor-memory | Memory per executor (for example, 1000M or 2G). Default: 1G |
| --total-executor-cores | Total cores for all executors. Default: if not set, the total is the number of executors in YARN mode, or all available cores on all workers in standalone mode. |
To use NLP during indexing, load the JohnSnowLabs:spark-nlp:1.4.2 package using the Spark Shell Options.

To read from S3, add your AWS credentials to core-site.xml in the apps/spark-dist/conf directory. The path to the data uses the s3a protocol, for example s3a://sstk-dev/data/u.user. If you are running a Fusion cluster, each instance of Fusion needs a core-site.xml file. S3a is the preferred protocol for reading data into Spark because it uses Amazon's libraries to read from S3 instead of the legacy Hadoop libraries. If you need other S3 protocols (for example, s3 or s3n), add the equivalent properties to core-site.xml.
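The core-site.xml snippet is not reproduced in this extract. As an alternative sketch only (not the core-site.xml approach described above), the same s3a credentials can be set programmatically on the Hadoop configuration before reading; the property names are the standard hadoop-aws keys, and the credential values and environment variables here are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-read-sketch").getOrCreate()

// Equivalent to setting fs.s3a.access.key / fs.s3a.secret.key in core-site.xml.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

// Read a delimited text file from the example s3a path.
val df = spark.read.csv("s3a://sstk-dev/data/u.user")
df.printSchema()
```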
In addition, add the org.apache.hadoop:hadoop-aws:2.7.3 package to the job using the --packages Spark option, and exclude the com.fasterxml.jackson.core:jackson-core and joda-time:joda-time packages using the --exclude-packages option.

Reading from Google Cloud Storage similarly requires configuration in core-site.xml; see Installing the Cloud Storage connector.
The Read from Parquet example uses the transformScala option to filter and transform the input DataFrame into a better form for indexing with a Scala script.

When reading from a JDBC data source, the read in this example is partitioned on the emp_no column (an int); behind the scenes, Spark sends four separate queries to the database and processes the result sets in parallel. Upload your JDBC driver JAR to the Fusion blob store with resourceType=spark:jar; Fusion adds JARs with resourceType=spark:jar from the blob store to the appropriate classpaths before running a Parallel Bulk Loader job. Notice the $MIN(emp_no) and $MAX(emp_no) macros in the read options. These macros are offered by the Parallel Bulk Loader to help configure parallel reads of JDBC tables: behind the scenes they are translated into SQL queries that get the MAX and MIN values of the specified field, which Spark uses to compute splits for partitioned queries. As mentioned above, the field must be numeric and must have a relatively balanced distribution of values between MIN and MAX; otherwise, you are unlikely to see much performance benefit from partitioning.
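For orientation, the macros above correspond to Spark's standard JDBC partitioning options. A minimal sketch follows; the connection URL, table, credentials, and bounds are placeholders, and in a Parallel Bulk Loader job these values would live in readOptions rather than in code.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-partitioned-read-sketch").getOrCreate()

// Spark splits the read into numPartitions range queries over partitionColumn;
// the $MIN(emp_no)/$MAX(emp_no) macros compute the bounds for you.
val employees = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/employees") // placeholder URL
  .option("dbtable", "employees")
  .option("user", "user")          // placeholder credentials
  .option("password", "password")
  .option("partitionColumn", "emp_no")
  .option("lowerBound", "10001")   // would come from $MIN(emp_no)
  .option("upperBound", "499999")  // would come from $MAX(emp_no)
  .option("numPartitions", "4")    // four parallel queries, as described above
  .load()
```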
For the Index HBase tables example, copy hbase-site.xml (and possibly core-site.xml) to apps/spark-dist/conf in Fusion. The example uses an HBase table named fusion_nums with a single column family named lw. Notice the $lastTimestamp macro in the read options: it lets us filter rows read from HBase using the timestamp of the last document the Parallel Bulk Loader wrote to Solr, that is, pull only the newest updates from HBase (incremental updates). Most Spark data sources provide a way to filter results based on timestamp.
For Elasticsearch, using the org.elasticsearch:elasticsearch-spark-20_2.11:6.2.1 package, a short Scala script run in bin/spark-shell can index some test data into Elasticsearch for the Parallel Bulk Loader to read.
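The original spark-shell script is not reproduced in this extract. As a rough sketch only, assuming the elasticsearch-spark connector named above, indexing a small test DataFrame into Elasticsearch could look like this; the index name, node address, and test rows are placeholders.

```scala
// Run inside bin/spark-shell, started with:
//   --packages org.elasticsearch:elasticsearch-spark-20_2.11:6.2.1
import org.elasticsearch.spark.sql._

val testData = Seq(
  (1, "blue", 10.0),
  (2, "red", 20.0)
).toDF("id", "color", "price")

// saveToEs writes the DataFrame to the given index/type.
testData.saveToEs(
  "pbl_test/doc",                       // placeholder index/type
  Map("es.nodes" -> "localhost:9200"))  // placeholder Elasticsearch node
```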
To prepare some test data in Couchbase, connect with cbq -e=http://<host>:8091 -u <user> -p <password>, ensuring the provided user is authorized for the test bucket. Create a primary index with CREATE PRIMARY INDEX 'test-primary-index' ON 'test' USING GSI;, insert a document with INSERT INTO 'test' ( KEY, VALUE ) VALUES ( "1", { "id": "01", "field1": "a value", "field2": "another value"} ) RETURNING META().id as docid, *;, and verify it with select * from 'test';.

The Couchbase data source class is com.couchbase.spark.sql.DefaultSource. Specify the com.couchbase.client:spark-connector_2.11:2.2.0 package as the Spark shell --packages option, along with a few Spark settings that point the connector at a particular Couchbase server and bucket and supply the credentials. See the Couchbase Spark connector documentation for all of the available Spark configuration settings.
For this data source, specify the format and any required --packages; in addition, you must specify the file path in the readOptions section.
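To make the relationship between these job settings and Spark concrete, the sketch below shows the Spark read that a format/path/readOptions combination corresponds to. The csv format, options, and file path here are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-options-sketch").getOrCreate()

// format      -> .format("csv")
// readOptions -> .option(...) calls
// path        -> .load(path)
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/example/data.csv")   // hypothetical path
df.show(5)
```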
Import Data with Hive

The Hive SerDe jar, lucidworks-hive-serde-v2.2.6.jar, is found in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.
Note: install Helm, as it is required to install Fusion for any K8s platform; on MacOS, you can install it with Homebrew. Your Helm version must be at least 3.0.0; check it by running helm version --short. To grant the required permissions, an administrator can apply the role.yaml and cluster-role.yaml files in the install-roles directory to that namespace using kubectl, and the helm install command can then be run as the <install_user>. Run these steps from the fusion-cloud-native-master directory.
hive-solr supports Hive 3.0.0; for Hive 1.x support, see the hive_1x branch. hive-solr should only be used with Solr 5.0 and higher.

The build depends on the solr-hadoop-common submodule (contained in a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common), which must be initialized before building the SerDe jar. Running git submodule update fetches all the data from that project and checks out the appropriate commit listed in the superproject; you must initialize and update the submodule before attempting to build the SerDe jar, and ensure that solr-hadoop-common is pointing to the correct SHA. (See https://github.com/blog/2104-working-with-submodules for more details.)

The build produces solr-hive-serde/build/libs/{packageUser}-hive-serde-{connectorVersion}.jar, which can be used with Hive v3.0. Other Hive versions (such as v2.x) may work with this jar, but have not been tested.
To use the SerDe jar, it must be added to Hive with the hive.aux.jars.path capability. There are several options for this, described below. It is considered a best practice to use a single directory for all auxiliary jars you may want to add to Hive, so you only need to define a single path; however, you must then copy any jars you want to use to that path. The examples use /usr/hive/auxlib; if you use another path, update the path in the examples accordingly.

One option is to edit the file where HIVE_AUX_JARS_PATH is defined and add the path to each line that starts with export. Another option is to define the path in hive/conf/hive-site.xml. A third option is to start Hive with the auxpath variable:

hive --auxpath /usr/hive/auxlib

There are also other approaches that could be used. Keep in mind, though, that the jar must be loaded into the classpath; adding it with the ADD JAR function is not sufficient.
In this example, we have created an external table named "solr" and defined a set of fields and types for the data we will store in the table; see the section Defining Fields for Solr below for best practices when naming fields. The STORED BY clause defines a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'), which is one of the classes included with the Hive SerDe jar. The LOCATION indicates the location in HDFS where the table data will be stored; in this example, we have chosen to use /tmp/solr. In the TBLPROPERTIES section, we define several parameters for Solr so the data can be indexed to the right Solr installation and collection; see the section Table Properties below for details about these parameters.

If the table needs to be dropped at a later time, you can use the DROP TABLE command in Hive. This removes the metadata stored in the table in Hive, but does not modify the underlying data (in this case, the Solr index).
If the field types defined in your Hive table do not match the field types in Solr, you may get a ClassCastException in response to queries. To avoid this problem, you can use a Solr feature called dynamic fields. Dynamic fields direct Solr to use specific field types based on a prefix or suffix found on an incoming field name, which overrides Solr guessing at the type. Solr includes by default dynamic field rules for nearly all types it supports, so you only need to use the matching suffix on the field names in your Hive tables for the correct type to be defined.

To illustrate this, note the field names in the table example above:

CREATE EXTERNAL TABLE solr (id string, field1_s string, field2_i int)

In this example, we have defined the id field as a string, field1_s as a string, and field2_i as an integer. In Solr's default schema, there is a dynamic field rule that any field with an _s suffix should be a string; similarly, there is another rule that any field with an _i suffix should be an integer. This allows us to make sure the field types match.

An alternative is to disable Solr's field guessing altogether, but this would require you to create all of your fields in Solr before indexing any content from Hive. For more information about these features and options, see the Apache Solr Reference Guide.

Table Properties
- solr.zkhost: The location of the ZooKeeper quorum if using LucidWorks in SolrCloud mode. If this property is set along with the solr.server.url property, the solr.server.url property takes precedence.
- solr.server.url: The location of the Solr instance if not using LucidWorks in SolrCloud mode. If this property is set along with the solr.zkhost property, this property takes precedence.
- solr.collection: The Solr collection for this table. If not defined, an exception will be thrown.
- solr.query: The specific Solr query to execute to read this table. If not defined, a default of *:* is used. This property is not needed when loading data to a table, but is needed when defining the table so Hive can later read the table.
- lww.commit.on.close: If true, inserts are automatically committed when the connection is closed. True is the default.
- lww.jaas.file: Used only when indexing to or reading from a Solr cluster secured with Kerberos. This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Solr and Hive. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (that is, every node where map/reduce tasks are executed). The relevant JAAS file section defines the name of the section (used with the lww.jaas.appname parameter), the location of the keytab file, and the service principal name; the principal should be different from the one used for Solr, but must have access to both Solr and Hive.
- lww.jaas.appname: Used only when indexing to or reading from a Solr cluster secured with Kerberos. This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
In Fusion, the Hive SerDe is included in the ./apps/connectors/resources/lucid.hadoop/jobs directory. The SerDe jar that supports Fusion is v2.2.4 or higher; this was released with Fusion 3.0. If you are using Fusion 3.1.x or higher, download the Hive SerDe from http://lucidworks.com/connectors/; choose the proper Hadoop distribution, and the resulting .zip file will include the Hive SerDe. A 2.2.4 or higher jar built from this repository will also work with Fusion 2.4.x releases.

To index data to Fusion, the table definition uses a different storage handler (STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'), which is a class included with the Hive SerDe jar designed for use with Fusion. Note that all of the same caveats about field types discussed in the section Defining Fields for Solr apply to Fusion as well; in Fusion, however, you have the option of using an index pipeline to perform specific field mapping instead of using dynamic fields. The LOCATION indicates the location in HDFS where the table data will be stored; in this example, we have chosen to use /tmp/fusion. In the TBLPROPERTIES section, we define several properties for Fusion so the data can be indexed to the right Fusion installation and collection:
- fusion.endpoints: The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection the data will be indexed to.
- fusion.fail.on.error: If true, indexing stops when an error is encountered, such as a row that could not be parsed. This is false by default.
- fusion.buffer.timeoutms: The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents are sent to Fusion when either this value or fusion.batchSize is met.
- fusion.batchSize: The number of documents to batch before sending the batch to Fusion. The default is 500. Documents are sent to Fusion when either this value or fusion.buffer.timeoutms is met.
- fusion.realm: Used with fusion.user and fusion.password to authenticate to Fusion for indexing data. Two options are supported, KERBEROS or NATIVE. Kerberos authentication requires the additional definition of a JAAS file; the properties java.security.auth.login.config and fusion.jaas.appname define the location of the JAAS file and the section of the file to use. Native authentication uses a Fusion-defined username and password; this user must exist in Fusion and have the proper permissions to index documents.
- fusion.user: The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), the fusion.password must also be supplied.
- fusion.password: This property is not shown in the example above. The password for the fusion.user when the fusion.realm is NATIVE.
- java.security.auth.login.config: Defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Fusion and Hive. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (that is, every node where map/reduce tasks are executed). The relevant JAAS file section defines the name of the section (used with the fusion.jaas.appname parameter), the location of the keytab file, and the service principal name; the principal should be different from the one used for Fusion, but must have access to both Fusion and Hive, and is used with the fusion.user parameter described above.
- fusion.jaas.appname: Used only when indexing to or reading from Fusion when it is secured with Kerberos. Provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
- fusion.query.endpoints: The full URL to a query pipeline in Fusion, including the pipeline name and the collection the data will be read from. You should also specify the request handler to be used. If you do not intend to query your Fusion data from Hive, you can skip this parameter.
- fusion.query: The query to run in Fusion to select records to be read into Hive. This is *:* by default, which selects all records in the index. If you do not intend to query your Fusion data from Hive, you can skip this parameter.
A sample CSV file of book data is included with Solr at $SOLR_HOME/example/exampledocs/books.csv. Using the sample books.csv file, we can see a detailed example of creating a table, loading data to it, and indexing that data to Solr:

1. Define the table books, and provide the field names and field types that will make up the table.
2. Load the data from the books.csv file.
3. Create an external table named solr, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr.
4. Define the custom storage handler provided by the {packageUser}-hive-serde-{connectorVersion}.jar.
5. Define the storage location in HDFS.
6. Define the query to run in Solr to read records from Solr for use in Hive.
7. Define the location of Solr (or ZooKeeper if using SolrCloud), the collection in Solr to index the data to, and the query to use when reading the table. This example also refers to a JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.
Import Data with Pig

The Pig functions are provided in the lucidworks-pig-functions-v2.2.6.jar file found in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.

The build depends on the solr-hadoop-common submodule (contained in a separate GitHub repository, https://github.com/lucidworks/solr-hadoop-common), which must be initialized before building the Functions jar. To initialize the submodule, pull the repository, then run git submodule update, which fetches all the data from that project and checks out the appropriate commit listed in the superproject. You must initialize and update the submodule before attempting to build the Functions jar, and ensure that solr-hadoop-common is pointing to the correct SHA. (See https://github.com/blog/2104-working-with-submodules for more details.)

To build, run ./gradlew clean shadowJar --info. This produces a jar file at solr-pig-functions/build/libs/{packageUser}-pig-functions-{connectorVersion}.jar, which is required to use the Pig functions.
Included in the {packageUser}-pig-functions-{connectorVersion}.jar are three user-defined functions (UDFs) and two store functions:

- com/lucidworks/hadoop/pig/SolrStoreFunc.class
- com/lucidworks/hadoop/pig/FusionIndexPipelinesStoreFunc.class
- com/lucidworks/hadoop/pig/EpochToCalendar.class
- com/lucidworks/hadoop/pig/Extract.class
- com/lucidworks/hadoop/pig/Histogram.class

To use these functions in Pig, you can either REGISTER them in the script or load them with your Pig command-line request. If using REGISTER, the Pig function jars must be put in HDFS in order to be used by your Pig script; they can be located anywhere in HDFS, and you can either supply the path in your script or use a variable defined with a -p property definition. The example below uses the second approach, loading the jars with the -Dpig.additional.jars system property when launching the script. With this approach, the jars can be located anywhere on the machine where the script will be run.

The following properties can be set in your Pig script:
- solr.zkhost: The ZooKeeper connection string if using Solr in SolrCloud mode. This should be in the form of server:port,server:port,server:port/chroot. If you are not using SolrCloud, use the solr.server.url parameter instead.
- solr.server.url: The location of the Solr instance when Solr is running in standalone mode. This should be in the form of http://server:port/solr.
- solr.collection: The name of the Solr collection where documents will be indexed.
- lww.jaas.file: The path to the JAAS file that includes a section for the service principal who will write to the Solr indexes. For example, to use this property in a Pig script: set lww.jaas.file '/path/to/login.conf';. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (that is, every node where map/reduce tasks are executed). The relevant JAAS file section defines the name of the section (used with the lww.jaas.appname parameter), the location of the keytab file, and the service principal name; the principal should be different from the one used for Solr, but must have access to both Solr and Pig.
- lww.jaas.appname: The name of the section in the JAAS file that includes the correct service principal and keytab path. For example, to use this property in a Pig script: set lww.jaas.appname 'Client';

When Solr is secured with SSL, also set the keystore and truststore with their respective passwords:

set lww.keystore '/path/to/solr-ssl.keystore.jks'
set lww.keystore.password 'secret'
set lww.truststore '/path/to/solr-ssl.truststore.jks'
set lww.truststore.password 'secret'

When indexing to Fusion instead of Solr, the following properties are used:
- fusion.endpoints: The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection the data will be indexed to.
- fusion.fail.on.error: If true, indexing stops when an error is encountered, such as a row that could not be parsed. This is false by default.
- fusion.buffer.timeoutms: The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents are sent to Fusion when either this value or fusion.batchSize is met.
- fusion.batchSize: The number of documents to batch before sending the batch to Fusion. The default is 500. Documents are sent to Fusion when either this value or fusion.buffer.timeoutms is met.
- fusion.realm: Used with fusion.user and fusion.pass to authenticate to Fusion for indexing data. Two options are supported, KERBEROS or NATIVE. Kerberos authentication requires the additional definition of a JAAS file; the properties java.security.auth.login.config and fusion.jaas.appname define the location of the JAAS file and the section of the file to use, and are described in more detail below. Native authentication uses a Fusion-defined username and password; this user must exist in Fusion and have the proper permissions to index documents.
- fusion.user: The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), the fusion.password must also be supplied.
- fusion.pass: This property is not shown in the example above. The password for the fusion.user when the fusion.realm is NATIVE. When using Kerberos, a valid ticket must instead be obtained with kinit.
- java.security.auth.login.config: Defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to write to Fusion. The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (that is, every node where map/reduce tasks are executed). The relevant JAAS file section defines the name of the section (used with the fusion.jaas.appname parameter), the location of the keytab file, and the service principal name; the principal should be different from the one used for Fusion, but must have access to both Fusion and Pig, and is used with the fusion.user parameter described above.
- fusion.jaas.appname: Used only when indexing to or reading from Fusion when it is secured with Kerberos. Provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
The example Pig script works as follows:

1. The first two lines define parameters that are needed by SolrStoreFunc to know where Solr is. SolrStoreFunc needs the properties solr.zkhost and solr.collection, and these lines map the zkhost and collection parameters we will pass when invoking Pig to the required properties.
2. Load the CSV file, using the path and name we will pass with the csv parameter. We also define the field names for each column in the CSV file, and their types.
3. For each item in the CSV file, generate a document id from the first field ($0) and then define each field name and value in name, value pairs.
4. Load the documents into Solr using the SolrStoreFunc. While we don't need to define the location of Solr here, the function uses the zkhost and collection properties that we will pass when we invoke the Pig script.

Note that when using SolrStoreFunc, the document ID must be the first field.
The parameters are passed on the command line with the -p option:

- csv: The path and name of the CSV file we want to process.
- zkhost: The ZooKeeper connection string for a SolrCloud cluster, in the form of zkhost1:port,zkhost2:port,zkhost3:port/chroot. In the script, we mapped this to the solr.zkhost property, which is required by the SolrStoreFunc to know where to send the output documents.
- collection: The Solr collection to index into. In the script, we mapped this to the solr.collection property, which is required by the SolrStoreFunc to know the Solr collection the documents should be indexed to.

The zkhost parameter above is only used if you are indexing to a SolrCloud cluster, which uses ZooKeeper to route indexing and query requests. If, however, you are not using SolrCloud, you can use the solrUrl parameter, which takes the location of a standalone Solr instance in the form of http://host:port/solr. In the script, you would change the line that maps solr.zkhost to the zkhost property to instead map solr.server.url to the solrUrl property. For example:

set solr.server.url '$solrUrl';
Import Data with the REST API

To turn off the indexing response in the terminal, you can append ?echo=false to the URL. Be sure to set the content type header properly for the content being sent. Some frequently used content types are:

- application/json
- application/xml
- application/pdf
- application/vnd.openxmlformats-officedocument.wordprocessingml.document
- application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
- application/vnd.openxmlformats-officedocument.presentationml.presentation

In $FUSION_HOME/apps/solr-dist/example/exampledocs you can find a few sample documents. This example uses one of these, books.json. To push the JSON data in books.json to an index profile under an app, send the file to the index endpoint, substituting your values for username, password, and index profile name.
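The original curl command is not reproduced in this extract. As a rough, hedged sketch of the same request, the snippet below POSTs books.json to the index-pipeline endpoint format given later in this section, using only JDK HTTP classes; the host, port, pipeline, collection, and credentials are placeholders.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.nio.file.Paths
import java.util.Base64

object PushBooks {
  def main(args: Array[String]): Unit = {
    // Placeholder endpoint, following the documented pattern:
    // https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index
    val endpoint = "https://localhost:6764/api/index-pipelines/my-pipeline/collections/my-collection/index?echo=false"
    val auth = Base64.getEncoder.encodeToString("USERNAME:PASSWORD".getBytes("UTF-8"))

    val request = HttpRequest.newBuilder(URI.create(endpoint))
      .header("Content-Type", "application/json")   // match the content type to the payload
      .header("Authorization", s"Basic $auth")
      .POST(HttpRequest.BodyPublishers.ofFile(Paths.get("books.json")))
      .build()

    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(s"${response.statusCode}: ${response.body}")
  }
}
```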
You can then query the collection to verify the data was indexed, for example with the query *:*, or by searching on specific fields such as author and name.

You can also delete documents that were indexed from books.json. To delete "The Lightning Thief" and "The Sea of Monsters" from the index, use their id values from the JSON file in a del-json-data.json file that deletes the two books. As with indexing, ?echo=false can be used to turn off the response to the terminal.

To specify a parser, use https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index?parserId=PARSER. If you do not specify a parser and you are indexing outside of an app (https://FUSION_HOST:FUSION_PORT/api/index-pipelines/...), then the _system parser is used. If you do not specify a parser and you are indexing in an app context (https://FUSION_HOST:FUSION_PORT/api/apps/APP_NAME/index-pipelines/...), then the parser with the same name as the app is used.
Import Signals

You can load signals stored as Parquet files into a signals collection by running a short script in the Spark shell. In the script, replace the following values:

- path_of_folder: The absolute path to the folder containing your Parquet files.
- collection_name_signals: The name of the signals collection where you want to load these signals.
- localhost:9983/lwfusion/4.2.2/solr: You can verify the correct path by going to the Solr console at http://fusion_host:8983/solr/#/ and looking for the value of DzkHost.
- For commit_within and batch_size, see https://github.com/lucidworks/spark-solr#commit_within.

At the scala> prompt, enter paste mode, paste the script, and then press CTRL-d to run it.
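The script itself is not reproduced in this extract. As a rough sketch only, assuming the spark-solr data source and using the placeholder values listed above, the paste-mode script could look like the following:

```scala
// Paste at the scala> prompt after entering paste mode (:paste),
// then press CTRL-d to run. All values below are placeholders.
val signals = spark.read.parquet("path_of_folder")

signals.write
  .format("solr")
  .option("zkhost", "localhost:9983/lwfusion/4.2.2/solr")
  .option("collection", "collection_name_signals")
  .option("commit_within", "5000")   // see the spark-solr docs for commit_within
  .option("batch_size", "10000")     // and batch_size
  .mode("overwrite")                 // SaveMode as used in the spark-solr README example
  .save()
```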