Upload a JDBC Driver to Fusion
Upload the driver jar to Fusion using the blob store /blobs/{id} endpoint. Specify an arbitrary blob ID, and a resourceType value of plugin:connector, as in this example:
curl -u USERNAME:PASSWORD -X PUT -H 'Content-Type: application/java-archive' --data-binary @/path/to/driver.jar "https://FUSION_HOST:FUSION_PORT/api/blobs/mydriver?resourceType=plugin:connector"
Here, "mydriver" is the blob ID. In the requests that follow, BLOB_ID is the name specified during upload, such as "mydriver" above. A success response returns metadata describing the stored blob.
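As a quick check, you can list the blobs in Fusion and confirm that the driver appears under its BLOB_ID:

curl -u USERNAME:PASSWORD https://FUSION_HOST:FUSION_PORT/api/blobs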
Import Data with Hive
The Hive SerDe is provided in the lucidworks-hive-serde-v2.2.6.jar file, found in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.
In order for Hive to work with Fusion, the SerDe jar must be added to Hive's classpath, for example through the hive.aux.jars.path capability. There are several options for this, described below.

It is considered a best practice to use a single directory for all auxiliary jars you may want to add to Hive, so you only need to define a single path; however, you must then copy any jars you want to use into that path. The examples below use /usr/hive/auxlib; if you use another path, update the path in the examples accordingly.

One option is to edit the file where HIVE_AUX_JARS_PATH is defined (typically hive-env.sh), and add the path to each line which starts with export; the result is shown in the first sketch below. Another option is to set the hive.aux.jars.path property in hive/conf/hive-site.xml, as in the second sketch below.
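As a minimal sketch, assuming the /usr/hive/auxlib directory used above, the export line in hive-env.sh would end up looking like:

export HIVE_AUX_JARS_PATH=/usr/hive/auxlib

and the property in hive/conf/hive-site.xml would look like:

<property>
  <name>hive.aux.jars.path</name>
  <value>/usr/hive/auxlib</value>
</property>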
A third option is to launch Hive with the auxpath variable:
hive --auxpath /usr/hive/auxlib

There are also other approaches that could be used. Keep in mind, though, that the jar must be loaded into the classpath; adding it with the ADD JAR function is not sufficient.

The Hive SerDe jar is included with Fusion in the $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs directory. The SerDe jar that supports Fusion is v2.2.4 or higher, which was released with Fusion 3.0. If you are using Fusion 3.1.x or higher, you will need to download the Hive SerDe from http://lucidworks.com/connectors/; choose the proper Hadoop distribution, and the resulting .zip file will include the Hive SerDe. A 2.2.4 or higher jar built from this repository will also work with Fusion 2.4.x releases.
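A sketch of a Fusion-backed table definition, using the storage handler, HDFS location, and TBLPROPERTIES described next; the table name, field list, URL placeholders, principal, and JAAS paths are illustrative assumptions:

CREATE EXTERNAL TABLE fusion (id STRING, title_s STRING, author_s STRING)
  STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
  LOCATION '/tmp/fusion'
  TBLPROPERTIES (
    'fusion.endpoints' = 'https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index',
    'fusion.fail.on.error' = 'false',
    'fusion.buffer.timeoutms' = '1000',
    'fusion.batchSize' = '500',
    'fusion.realm' = 'KERBEROS',
    'fusion.user' = 'fusion-indexer@FUSIONSERVER.COM',
    'java.security.auth.login.config' = '/data/fusion-jaas.conf',
    'fusion.jaas.appname' = 'FusionClient'
  );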
The STORED BY clause (STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler') names a class included with the Hive SerDe jar that is designed for use with Fusion. Note that all of the same caveats about field types discussed in the section <<Defining Fields for Solr>> apply to Fusion as well. In Fusion, however, you have the option of using an index pipeline to perform specific field mapping instead of using dynamic fields.

The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/fusion.
In the TBLPROPERTIES section, we define several properties for Fusion so the data can be indexed to the right Fusion installation and collection:

fusion.endpoints:
The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection the data will be indexed to.

fusion.fail.on.error:
If true, indexing will stop when an error is encountered, such as a row that could not be parsed. This is false by default.

fusion.buffer.timeoutms:
The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or fusion.batchSize is met.

fusion.batchSize:
The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or fusion.buffer.timeoutms is met.

fusion.realm:
This is used with fusion.user and fusion.password to authenticate to Fusion for indexing data. Two options are supported, KERBEROS or NATIVE.
Kerberos authentication is supported with the additional definition of a JAAS file. The properties java.security.auth.login.config and fusion.jaas.appname are used to define the location of the JAAS file and the section of the file to use.
Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.

fusion.user:
The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), the fusion.password must also be supplied.

fusion.password:
This property is not shown in the example above. It is the password for the fusion.user when the fusion.realm is NATIVE.

java.security.auth.login.config:
This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Fusion and Hive.
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
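A sketch of such a section, with a hypothetical section name, keytab path, and principal, and callout comments matching the notes that follow:

FusionClient {                                      // <1>
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/fusion-indexer.keytab"              // <2>
  storeKey=true
  useTicketCache=true
  principal="fusion-indexer@FUSIONSERVER.COM";      // <3>
};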
<1> The name of this section of the JAAS file. This name will be used with the fusion.jaas.appname parameter.
<2> The location of the keytab file.
<3> The service principal name. This should be a different principal than the one used for Fusion, but must have access to both Fusion and Hive. This name is used with the fusion.user parameter described above.

fusion.jaas.appname:
Used only when indexing to or reading from Fusion when it is secured with Kerberos. This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.

fusion.query.endpoints:
The full URL to a query pipeline in Fusion. The URL should include the pipeline name and the collection the data will be read from. You should also specify the request handler to be used. If you do not intend to query your Fusion data from Hive, you can skip this parameter.

fusion.query:
The query to run in Fusion to select records to be read into Hive. This is *:* by default, which selects all records in the index. If you do not intend to query your Fusion data from Hive, you can skip this parameter.

A sample CSV file is available at $SOLR_HOME/example/exampledocs/books.csv.
Using the sample books.csv file, we can see a detailed example of creating a table, loading data into it, and indexing that data to Solr.
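A sketch of the annotated script that the callouts below describe, with the column list taken from books.csv; the storage handler class (LWStorageHandler), the solr.* table properties, and the lww.jaas.file property are assumptions based on the Lucidworks hive-solr SerDe and should be verified against the jar you are using:

CREATE TABLE books (id STRING, cat STRING, name STRING, price FLOAT,          -- <1>
                    in_stock BOOLEAN, author STRING, series STRING,
                    seq INT, genre STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

LOAD DATA LOCAL INPATH '/path/to/exampledocs/books.csv'                       -- <2>
  OVERWRITE INTO TABLE books;

CREATE EXTERNAL TABLE solr (id STRING, cat STRING, name STRING, price FLOAT,  -- <3>
                            in_stock BOOLEAN, author STRING, series STRING,
                            seq INT, genre STRING)
  STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler'                     -- <4>
  LOCATION '/tmp/solr'                                                        -- <5>
  TBLPROPERTIES ('solr.query' = '*:*',                                        -- <6>
                 'solr.zkhost' = 'zknode1:2181,zknode2:2181,zknode3:2181/solr',  -- <7>
                 'solr.collection' = 'collection1',
                 'lww.jaas.file' = '/data/jaas-client.conf');

-- Index the data from the local table into Solr.
INSERT INTO TABLE solr SELECT * FROM books;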
<1> Define the table books, and provide the field names and field types that will make up the table.
<2> Load the data from the books.csv file.
<3> Create an external table named solr, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr.
<4> Define the custom storage handler provided by the lucidworks-hive-serde-v2.2.6.jar.
<5> Define the storage location in HDFS.
<6> The query to run in Solr to read records from Solr for use in Hive.
<7> Define the location of Solr (or ZooKeeper if using SolrCloud), the collection in Solr to index the data to, and the query to use when reading the table. This example also refers to a JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.

Import Data with Pig
The Pig functions are provided in the lucidworks-pig-functions-v2.2.6.jar file found in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.

Included in lucidworks-pig-functions-v2.2.6.jar are three User Defined Functions (UDFs) and two Store functions. These functions are:

com/lucidworks/hadoop/pig/SolrStoreFunc.class
com/lucidworks/hadoop/pig/FusionIndexPipelinesStoreFunc.class
com/lucidworks/hadoop/pig/EpochToCalendar.class
com/lucidworks/hadoop/pig/Extract.class
com/lucidworks/hadoop/pig/Histogram.class
To use these functions, you will need to REGISTER them in the script or load them with your Pig command line request.

If using REGISTER, the Pig function jar must be put in HDFS in order to be used by your Pig script. It can be located anywhere in HDFS; you can either supply the path in your script, or use a variable and define the variable with a -p property definition.

The example below uses the second approach, loading the jars with the -Dpig.additional.jars system property when launching the script. With this approach, the jars can be located anywhere on the machine where the script will be run.

When indexing to Fusion, several properties are defined so the data can be indexed to the right Fusion installation and collection:
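A rough sketch of how these properties might be supplied in a Pig script with set statements; the property-passing style, URL placeholders, principal, and paths are assumptions:

-- Sketch only: adjust the values for your Fusion installation.
set fusion.endpoints 'https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index';
set fusion.fail.on.error 'false';
set fusion.buffer.timeoutms '1000';
set fusion.batchSize '500';
set fusion.realm 'KERBEROS';
set fusion.user 'fusion-indexer@FUSIONSERVER.COM';
set java.security.auth.login.config '/data/fusion-jaas.conf';
set fusion.jaas.appname 'FusionClient';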
fusion.endpoints:
The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection the data will be indexed to.

fusion.fail.on.error:
If true, indexing will stop when an error is encountered, such as a row that could not be parsed. This is false by default.

fusion.buffer.timeoutms:
The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or fusion.batchSize is met.

fusion.batchSize:
The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or fusion.buffer.timeoutms is met.

fusion.realm:
This is used with fusion.user and fusion.password to authenticate to Fusion for indexing data. Two options are supported, KERBEROS or NATIVE.
Kerberos authentication is supported with the additional definition of a JAAS file. The properties java.security.auth.login.config and fusion.jaas.appname are used to define the location of the JAAS file and the section of the file to use. These are described in more detail below.
Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.

fusion.user:
The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), the fusion.pass must also be supplied.

fusion.pass:
This property is not shown in the example above. It is the password for the fusion.user when the fusion.realm is NATIVE. When the realm is KERBEROS, no password is used; authentication instead relies on the Kerberos credentials (for example, a ticket obtained with kinit) and the JAAS configuration described below.

java.security.auth.login.config:
This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to write to Fusion.
The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:
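As in the Hive section, a sketch of such a JAAS section, with a hypothetical section name, keytab path, and principal, and callout comments matching the notes below:

FusionClient {                                      // <1>
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/data/fusion-indexer.keytab"              // <2>
  storeKey=true
  useTicketCache=true
  principal="fusion-indexer@FUSIONSERVER.COM";      // <3>
};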
<1> The name of this section of the JAAS file. This name will be used with the fusion.jaas.appname parameter.
<2> The location of the keytab file.
<3> The service principal name. This should be a different principal than the one used for Fusion, but must have access to both Fusion and Pig. This name is used with the fusion.user parameter described above.

fusion.jaas.appname:
Used only when indexing to or reading from Fusion when it is secured with Kerberos. This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.
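A sketch of a CSV-to-Solr indexing script using SolrStoreFunc, with callout comments matching the notes that follow; the field names, the 'SOLR' output name, and the exact GENERATE expression are illustrative assumptions:

-- Sketch only: a CSV-to-Solr indexing script using parameters passed with -p.
set solr.zkhost '$zkhost';
set solr.collection '$collection';                                        -- <1>

A = LOAD '$csv' USING PigStorage(',')
      AS (id:chararray, name:chararray, author:chararray);                -- <2>

B = FOREACH A GENERATE $0 AS id, 'name', $1, 'author', $2;                -- <3>

STORE B INTO 'SOLR' USING com.lucidworks.hadoop.pig.SolrStoreFunc();      -- <4>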
<1> This and the line above define parameters that are needed by SolrStoreFunc to know where Solr is. SolrStoreFunc needs the properties solr.zkhost and solr.collection, and these lines map the zkhost and collection parameters we will pass when invoking Pig to the required properties.
<2> Load the CSV file, using the path and name we will pass with the csv parameter. We also define the field names for each column in the CSV file, and their types.
<3> For each item in the CSV file, generate a document id from the first field ($0) and then define each field name and value in name, value pairs.
<4> Load the documents into Solr, using the SolrStoreFunc. While we don't need to define the location of Solr here, the function will use the zkhost and collection properties that we will pass when we invoke our Pig script.

Note that when using SolrStoreFunc, the document ID must be the first field.

When you run the script, pass each parameter with the -p option, such as in this command:
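A sketch of such a command, with a placeholder jar path, CSV path, and script name:

pig -Dpig.additional.jars=/path/to/lucidworks-pig-functions-v2.2.6.jar -p csv=/path/to/data.csv -p zkhost=zknode1:2181,zknode2:2181,zknode3:2181/solr -p collection=collection1 index-to-solr.pig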
csv:
The path and name of the CSV file we want to process.

zkhost:
The ZooKeeper connection string for a SolrCloud cluster, in the form of zkhost1:port,zkhost2:port,zkhost3:port/chroot. In the script, we mapped this to the solr.zkhost property, which is required by the SolrStoreFunc to know where to send the output documents.

collection:
The Solr collection to index into. In the script, we mapped this to the solr.collection property, which is required by the SolrStoreFunc to know the Solr collection the documents should be indexed to.
The zkhost parameter above is only used if you are indexing to a SolrCloud cluster, which uses ZooKeeper to route indexing and query requests. If, however, you are not using SolrCloud, you can use the solrUrl parameter, which takes the location of a standalone Solr instance, in the form of http://host:port/solr.

In the script, you would change the line that maps solr.zkhost to the zkhost property so that it instead maps solr.server.url to the solrUrl property. For example:

set solr.server.url '$solrUrl';
Import Data with the REST API
When pushing JSON data, use application/json as the content type. If your JSON file is a list or array of many items, the endpoint operates in a streaming way and indexes the docs as necessary. To prevent the response from echoing the indexed documents back to the terminal, append ?echo=false to the URL.

Be sure to set the content type header properly for the content being sent. Some frequently used content types are:

application/json, application/xml
application/pdf
application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.presentation

In $FUSION_HOME/apps/solr-dist/example/exampledocs you can find a few sample documents. This example uses one of these, books.json.
To push JSON data to an index profile under an app using books.json, enter the following, substituting your values for username, password, and index profile name:
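A sketch of such a request, assuming the app-scoped index profile endpoint api/apps/APP_NAME/index/INDEX_PROFILE; verify the endpoint against the documentation for your Fusion version:

curl -u USERNAME:PASSWORD -X POST -H 'Content-Type: application/json' --data-binary @books.json "https://FUSION_HOST:FUSION_PORT/api/apps/APP_NAME/index/INDEX_PROFILE"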
Once the documents are indexed, a query for *:* returns them all, and you can see fields from the source data such as author and name.

The next example deletes two of the documents indexed from books.json. To delete "The Lightning Thief" and "The Sea of Monsters" from the index, use their id values in the JSON file. Send the del-json-data.json file to delete the two books, appending ?echo=false to turn off the response to the terminal.
To specify a parser, use a URL of the form https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index?parserId=PARSER.

If you do not specify a parser, and you are indexing outside of an app (https://FUSION_HOST:FUSION_PORT/api/index-pipelines/...), then the _system parser is used. If you do not specify a parser, and you are indexing in an app context (https://FUSION_HOST:FUSION_PORT/api/apps/APP_NAME/index-pipelines/...), then the parser with the same name as the app is used.
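Putting the pieces together, a sketch of a request that names a parser explicitly and suppresses the echoed response (pipeline, collection, and parser names are placeholders):

curl -u USERNAME:PASSWORD -X POST -H 'Content-Type: application/json' --data-binary @books.json "https://FUSION_HOST:FUSION_PORT/api/index-pipelines/INDEX_PIPELINE/collections/COLLECTION_NAME/index?parserId=PARSER&echo=false"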