    Import Data with Hive

    Fusion ships with a Serializer/Deserializer (SerDe) for Hive, included in the distribution as lucidworks-hive-serde-v2.2.6.jar in $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs.

    For Fusion 4.1.x and 4.2.x, the preferred method of importing data with Hive is to use the Parallel Bulk Loader. The import procedure does not apply to Fusion 5.x.

    Features

    • Index Hive table data to Solr.

    • Read Solr index data to a Hive table.

    • Kerberos support for securing communication between Hive and Solr.

    • As of v2.2.4 of the SerDe, integration with Lucidworks Fusion is supported.

      • Fusion’s index pipelines can be used to index data to Fusion.

      • Fusion’s query pipelines can be used to query Fusion’s Solr instance for data to insert into a Hive table.

    Add the SerDe Jar to Hive Classpath

    In order for the Hive SerDe to work with Solr, the SerDe jar must be added to Hive’s classpath using the hive.aux.jars.path capability. There are several options for this, described below.

    It’s considered a best practice to use a single directory for all auxiliary jars you may want to add to Hive, so you only need to define a single path. However, you must then copy any jars you want to use into that directory.
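
    For example, a minimal setup might look like the following sketch; the source path assumes the SerDe jar shipped with Fusion, so adjust both paths for your environment:

      # Create a single directory for auxiliary jars and copy the SerDe jar into it
      mkdir -p /usr/hive/auxlib
      cp $FUSION_HOME/apps/connectors/resources/lucid.hadoop/jobs/lucidworks-hive-serde-v2.2.6.jar /usr/hive/auxlib/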

    The following options all assume you have created such a directory at /usr/hive/auxlib; if you use another path, update the path in the examples accordingly.
    1. If you use Hive with Ambari (as with the Hortonworks HDP distribution), go to Hive > Configs > Advanced and scroll down to Advanced hive-env > hive-env template. Find the section where HIVE_AUX_JARS_PATH is defined, and add the path to each line that starts with export. The result should look like this:

      # Folder containing extra libraries required for hive compilation/execution can be controlled by:
      if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then
        if [ -f "${HIVE_AUX_JARS_PATH}" ]; then
          export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH},/usr/hive/auxlib
        elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
          export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar,/usr/hive/auxlib
        fi
      elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
        export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar,/usr/hive/auxlib
      fi
    2. If you are not using Ambari or a similar cluster management tool, you can add the jar location to hive/conf/hive-site.xml:

      <property>
        <name>hive.aux.jars.path</name>
        <value>/usr/hive/auxlib</value>
      </property>
    3. Another option is to launch Hive with the path passed via the --auxpath option:

      hive --auxpath /usr/hive/auxlib

    There are other approaches that could be used as well. Keep in mind, though, that the jar must be loaded into the classpath; adding it with the ADD JAR function is not sufficient.

    Indexing Data to Fusion

    If you use Lucidworks Fusion, you can index data from Hive to Solr via Fusion’s index pipelines. These pipelines allow you several options for further transforming your data.

    If you are using Fusion 3.0.x, you already have the Hive SerDe in Fusion’s ./apps/connectors/resources/lucid.hadoop/jobs directory. The SerDe jar that supports Fusion is v2.2.4 or higher, which was first released with Fusion 3.0.

    If you are using Fusion 3.1.x or higher, you will need to download the Hive SerDe from http://lucidworks.com/connectors/. Choose the proper Hadoop distribution, and the resulting .zip file will include the Hive SerDe.

    A 2.2.4 or higher jar built from this repository will also work with Fusion 2.4.x releases.

    This is an example Hive command to create an external table to index documents in Fusion and to query the table later.

    hive> CREATE EXTERNAL TABLE fusion (id string, field1_s string, field2_i int)
          STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
          LOCATION '/tmp/fusion'
          TBLPROPERTIES('fusion.endpoints' = 'http://localhost:8764/api/apollo/index-pipelines/<pipeline>/collections/<collection>/index',
                        'fusion.fail.on.error' = 'false',
                        'fusion.buffer.timeoutms' = '1000',
                        'fusion.batchSize' = '500',
                        'fusion.realm' = 'KERBEROS',
                        'fusion.user' = 'fusion-indexer@FUSIONSERVER.COM',
                        'java.security.auth.login.config' = '/path/to/JAAS/file',
                        'fusion.jaas.appname' = 'FusionClient',
                        'fusion.query.endpoints' = 'http://localhost:8764/api/apollo/query-pipelines/pipeline-id/collections/collection-id',
                        'fusion.query' = '*:*');

    In this example, we have created an external table named "fusion" and defined a custom storage handler (STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'), a class included in the Hive SerDe jar that is designed for use with Fusion.

    Note that all of the same caveats about field types discussed in the section [Defining Fields for Solr] apply to Fusion as well. In Fusion, however, you have the option of using an index pipeline to perform specific field mapping instead of using dynamic fields.

    The LOCATION indicates the location in HDFS where the table data will be stored. In this example, we have chosen to use /tmp/fusion.

    In the section TBLPROPERTIES, we define several properties for Fusion so the data can be indexed to the right Fusion installation and collection:

    fusion.endpoints

    The full URL to the index pipeline in Fusion. The URL should include the pipeline name and the collection that data will be indexed to.

    fusion.fail.on.error

    If true, indexing will stop when an error is encountered, such as a row that could not be parsed. The default is false.

    fusion.buffer.timeoutms

    The amount of time, in milliseconds, to buffer documents before sending them to Fusion. The default is 1000. Documents will be sent to Fusion when either this value or fusion.batchSize is met.

    fusion.batchSize

    The number of documents to batch before sending the batch to Fusion. The default is 500. Documents will be sent to Fusion when either this value or fusion.buffer.timeoutms is met.

    fusion.realm

    This is used with fusion.user and fusion.password to authenticate to Fusion for indexing data. Two options are supported: KERBEROS or NATIVE. A sketch of a native-authentication configuration appears after this property list.

    Kerberos authentication is supported with the additional definition of a JAAS file. The properties java.security.auth.login.config and fusion.jaas.appname are used to define the location of the JAAS file and the section of the file to use.

    Native authentication uses a Fusion-defined username and password. This user must exist in Fusion, and have the proper permissions to index documents.

    fusion.user

    The Fusion username or Kerberos principal to use for authentication to Fusion. If a Fusion username is used ('fusion.realm' = 'NATIVE'), fusion.password must also be supplied.

    fusion.password

    The password for fusion.user when fusion.realm is NATIVE. This property is not shown in the example above.

    java.security.auth.login.config

    This property defines the path to a JAAS file that contains a service principal and keytab location for a user who is authorized to read from and write to Fusion and Hive.

    The JAAS configuration file must be copied to the same path on every node where a Node Manager is running (i.e., every node where map/reduce tasks are executed). Here is a sample section of a JAAS file:

    Client { (1)
      com.sun.security.auth.module.Krb5LoginModule required
      useKeyTab=true
      keyTab="/data/fusion-indexer.keytab" (2)
      storeKey=true
      useTicketCache=true
      debug=true
      principal="fusion-indexer@FUSIONSERVER.COM"; (3)
    };
    1 The name of this section of the JAAS file. This name will be used with the fusion.jaas.appname parameter.
    2 The location of the keytab file.
    3 The service principal name. This should be a different principal than the one used by the Fusion service itself, but it must have access to both Fusion and Hive. This name is used with the fusion.user parameter described above.
    fusion.jaas.appname

    Used only when indexing to or reading from Fusion when it is secured with Kerberos.

    This property provides the name of the section in the JAAS file that includes the correct service principal and keytab path.

    fusion.query.endpoints

    The full URL to a query pipeline in Fusion. The URL should include the pipeline name and the collection that data will be read from. You should also specify the request handler to be used.

    If you do not intend to query your Fusion data from Hive, you can skip this parameter.

    fusion.query

    The query to run in Fusion to select records to be read into Hive. This is *:* by default, which selects all records in the index.

    If you do not intend to query your Fusion data from Hive, you can skip this parameter.
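
    As a concrete sketch of the native-authentication and query properties described above, consider the following table definition. This is an illustration only: the endpoint URLs, username, password, and query value are placeholders, and the pipeline and collection names must match your installation.

    hive> CREATE EXTERNAL TABLE fusion_native (id string, field1_s string, field2_i int)
          STORED BY 'com.lucidworks.hadoop.hive.FusionStorageHandler'
          LOCATION '/tmp/fusion_native'
          TBLPROPERTIES('fusion.endpoints' = 'http://localhost:8764/api/apollo/index-pipelines/<pipeline>/collections/<collection>/index',
                        'fusion.realm' = 'NATIVE',
                        'fusion.user' = 'fusion-indexer',
                        'fusion.password' = 'example-password',
                        'fusion.query.endpoints' = 'http://localhost:8764/api/apollo/query-pipelines/<pipeline>/collections/<collection>/select',
                        'fusion.query' = 'field2_i:[10 TO *]');

    Because fusion.realm is NATIVE, no JAAS properties are required; with KERBEROS you would instead supply java.security.auth.login.config and fusion.jaas.appname, as in the earlier example. The restricted fusion.query here would read back only records whose field2_i value is 10 or greater.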

    Query and Insert Data to Hive

    Once the table is configured, any syntactically correct Hive query will be able to query the index.

    For example, to select three fields named "id", "field1_s", and "field2_i" from the "solr" table, you would use a query such as:

    hive> SELECT id, field1_s, field2_i FROM solr;

    Replace the table name as appropriate to use this example with your data.

    To join data from two tables, you can use a query such as:

    hive> SELECT l.id, l.field1_s, l.field2_i FROM solr l
          JOIN sometable r
          ON l.id = r.id;

    Finally, to insert data into the table, simply use the Solr table as the target of a Hive INSERT statement, such as:

    hive> INSERT INTO solr
          SELECT id, field1_s, field2_i FROM sometable;

    Example Indexing Hive to Solr

    Solr includes a small number of sample documents for use when getting started. One of these is a CSV file containing book metadata. This file is found in your Solr installation, at $SOLR_HOME/example/exampledocs/books.csv.

    Using the sample books.csv file, we can see a detailed example of creating a table, loading data to it, and indexing that data to Solr.

    CREATE TABLE books (id STRING, cat STRING, title STRING, price FLOAT, in_stock BOOLEAN, author STRING, series STRING, seq INT, genre STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; (1)
    
    LOAD DATA LOCAL INPATH '/solr/example/exampledocs/books.csv' OVERWRITE INTO TABLE books; (2)
    
    CREATE EXTERNAL TABLE solr (id STRING, cat_s STRING, title_s STRING, price_f FLOAT, in_stock_b BOOLEAN, author_s STRING, series_s STRING, seq_i INT, genre_s STRING) (3)
         STORED BY 'com.lucidworks.hadoop.hive.LWStorageHandler' (4)
         LOCATION '/tmp/solr' (5)
         TBLPROPERTIES('solr.zkhost' = 'zknode1:2181,zknode2:2181,zknode3:2181/solr', (6)
                       'solr.collection' = 'gettingstarted',
                       'solr.query' = '*:*', (7)
                       'lww.jaas.file' = '/data/jaas-client.conf'); (8)
    
    
    INSERT OVERWRITE TABLE solr SELECT b.* FROM books b;
    1 Define the table books, and provide the field names and field types that will make up the table.
    2 Load the data from the books.csv file.
    3 Create an external table named solr, and provide the field names and field types that will make up the table. These will be the same field names as in your local Hive table, so we can index all of the same data to Solr.
    4 Define the custom storage handler provided by the lucidworks-hive-serde-v2.2.6.jar.
    5 Define the storage location in HDFS.
    6 Define the location of Solr (the ZooKeeper connection string, when using SolrCloud) and the collection in Solr to index the data to.
    7 The query to run in Solr to read records from Solr for use in Hive.
    8 The JAAS configuration file that will be used to authenticate to the Kerberized Solr cluster.
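
    Once the INSERT completes, the indexed data can be read back from Hive through the same external table. A minimal sketch, assuming the solr table and fields defined above:

    hive> SELECT title_s, author_s FROM solr WHERE in_stock_b = true;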