Getting the Package

HDP Search packages are hosted in the Hortonworks repositories, and your cluster must be able to access the package repositories for successful installation.

Installation instructions for supported operating systems are available from the Hortonworks documentation at Installing HDP Search.

The RPM package is signed. To validate the package, you should add the GPG key to your server:

# Get the key
gpg --keyserver pgp.mit.edu --recv-keys 7884ED70
# Export the key
gpg -a --export 7884ED70 > RPM-GPG-KEY-lucidworks
# Import the key to RPM
rpm --import RPM-GPG-KEY-lucidworks

Please see the HDP documentation for details on how to set up your system to access the Hortonworks package repositories.
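Once the repositories are configured, the package can be installed with the operating system's package manager. A minimal sketch for a yum-based system, assuming the package is named lucidworks-hdpsearch (verify the exact package name in the Hortonworks documentation for your HDP version):

# Install HDP Search from the configured Hortonworks repository
sudo yum install lucidworks-hdpsearch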

After installation, the HDP Search files will be found in the /opt/lucidworks-hdpsearch directory. See the section HDP Search Directory Layout for details about the directory layout after installation.

Initial Solr Setup

When using Solr with HDP Search, you should run Solr in SolrCloud mode, which provides central configuration for a cluster of Solr servers, automatic load balancing and fail-over for queries, and distributed index replication. This mode is set when starting Solr.

SolrCloud relies on Apache ZooKeeper to coordinate requests between the nodes of the cluster. It’s recommended to use the ZooKeeper ensemble running with HDP 2.5.x for this purpose.

The guide Getting Started with Solr reviews some basics to help you get started. More information about the concepts behind SolrCloud is available in the Apache Solr Reference Guide section on SolrCloud.

When manually installing Solr with HDP 2.5.x, you need to make a few modifications to Solr-specific configuration files before starting Solr.

Prepare Solr For HDFS

In a new installation, Solr includes three sets of sample configuration files that you can use when getting started with Solr. If you choose one of these configsets when working in SolrCloud mode, the configuration files are uploaded to ZooKeeper and managed centrally for all Solr nodes.

It’s best to modify the configset before starting Solr (and before the configuration files are uploaded to ZooKeeper), when possible. However, if you do need to make changes later, the Config API is available to modify the configurations.
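A minimal sketch of such a Config API call is shown below, assuming Solr is listening on localhost:8983 and that a collection named SolrCollection already exists (collection creation is covered later in this guide); it overrides the autoCommit maxTime setting in the live configuration without editing solrconfig.xml directly.

# Override a solrconfig.xml property through the Config API (hypothetical collection name)
curl http://localhost:8983/solr/SolrCollection/config -H 'Content-type:application/json' -d '{
  "set-property": {"updateHandler.autoCommit.maxTime": 15000}
}'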

Available Configsets

The configset files are found in /opt/lucidworks-hdpsearch/solr/server/solr/configsets. If you want to modify any of the default configsets, simply make a copy of the directory with whatever name you choose.

The default configsets available are:

basic_configs

This configset is meant to provide a minimally viable Solr instance.

data_driven_schema_configs

This configset is also very minimal, but includes configurations to allow Solr to run in "schemaless" mode.

sample_techproducts_configs

This configset is designed to support sample documents available in Solr’s exampledocs directory (/opt/lucidworks-hdpsearch/solr/example/exampledocs).

You can copy one of these configsets and customize the configuration files within the copy, or you can customize the default configsets directly for your own needs.
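For example, to copy the data_driven_schema_configs configset to a new configset before customizing it (the name my_hdfs_configs is just a placeholder):

# Make a copy of a default configset to customize
cd /opt/lucidworks-hdpsearch/solr/server/solr/configsets
cp -r data_driven_schema_configs my_hdfs_configs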

HDFS-Specific Changes

The following changes only need to be completed for the first Solr node that is started. After the first node is running, all additional nodes will get their configuration information from ZooKeeper.

Solr’s <directoryFactory>, found in solrconfig.xml, defines how indexes will be stored on disk. The default is usually fine for most filesystem-based indexes, but with HDP 2.5.x we would like to store the indexes in HDFS. To support that, we need to modify the default directoryFactory definition and point it to our HDP 2.5.x cluster.

If Solr has been installed on a server that is not already a node of the Hadoop cluster, a client for HDP 2.5.x must be installed on each Solr node. The client must be configured to be able to communicate with the Hadoop cluster with the appropriate settings in core-site.xml and hdfs-site.xml.
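One simple way to provide those files, assuming the HDFS client software is already installed on the Solr node and one of your existing HDP hosts is reachable (hadoop-node.example.com is a placeholder), is to copy the cluster's client configuration:

# Copy the Hadoop client configuration from an existing cluster node
mkdir -p /etc/hadoop/conf
scp -r hadoop-node.example.com:/etc/hadoop/conf/* /etc/hadoop/conf/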

Before starting Solr, find the solrconfig.xml file in the configset you will customize for your first collection. Within that file, find the section for <directoryFactory>. It will most likely look like this:

<directoryFactory name="DirectoryFactory"
   class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
</directoryFactory>

We will want to replace this with a different class, and define several additional properties.

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://<host:port>/user/solr</str>
      <str name="solr.hdfs.confdir">/etc/hadoop/conf</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
      <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
      <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
      <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
    </directoryFactory>

You can copy the above configuration and use it without any problem. However, if you would prefer to customize any of the properties, they are described below:

Index and Config Locations
solr.hdfs.home

The location for the Solr indexes in HDFS. The example above uses /user/solr as the location, but it can be any dedicated path in HDFS (an example of creating this directory is shown after these settings).

solr.hdfs.confdir

The location of HDFS client configuration files. This feature is needed when Hadoop is running in High Availability mode.
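The solr.hdfs.home directory should exist in HDFS and be writable by the user running Solr before Solr starts. A minimal sketch, assuming Solr runs as the solr user and the /user/solr path from the example above:

# Create the Solr index directory in HDFS and give ownership to the solr user
sudo -u hdfs hdfs dfs -mkdir -p /user/solr
sudo -u hdfs hdfs dfs -chown solr /user/solr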

Block Cache settings

The block cache allows Solr to maintain expected performance even though the indexes may be distributed across the nodes that make up the HDFS system. The cache stores HDFS index blocks in JVM direct memory. The following settings allow customization of the block cache:

solr.hdfs.blockcache.enabled

By default, Solr’s implementation will cache HDFS blocks, which allows Solr to replace its standard file caching mechanisms when the indexes are stored in HDFS. This, however, has the side effect that the cache will be allocated off heap, and will require that the off heap memory settings for the JVM be raised to ensure proper performance.

solr.hdfs.blockcache.global

This enables a single global block cache instead of defining separate block caches for each collection or core. The default is true, and the cache settings used are those of the first HdfsDirectoryFactory that is created.

solr.hdfs.blockcache.slab.count

The number of memory slabs to allocate. Each slab is 128MB. The default is 1.

This is the primary setting for the size of your block cache. If, for example, you want to allocate 4GB of RAM to your block cache, you would set solr.hdfs.blockcache.slab.count to 32, since 32 x 128MB = 4096MB. An example of raising the JVM direct memory limit to match a cache of this size is shown after these block cache settings.

solr.hdfs.blockcache.direct.memory.allocation

When this is true, direct memory allocation will be used. When false, heap is used. The default is true.

solr.hdfs.blockcache.blocksperbank

This setting defines how many blocks to store in each slab. Each block in the cache is 8KB. The default is 16384 blocks.

solr.hdfs.blockcache.read.enabled

The block cache has a read-side cache and a write-side cache. If true, this will enable the read cache. The default is true.
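Because the block cache is allocated from JVM direct (off-heap) memory, the JVM's direct memory limit must be large enough to hold it. A minimal sketch, assuming the -a option of the bin/solr script is used to pass additional JVM arguments and the 4GB cache (32 slabs) sizing discussed above, with some headroom:

# Start Solr with a raised direct memory limit to accommodate the block cache
bin/solr start -c -z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 -a "-XX:MaxDirectMemorySize=5g"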

NRT Caching Directory Settings

When using Near-Real-Time (NRT) searching in Solr, the NRT Cache is used to provide suitable near-real time performance when the indexes are stored in HDFS. If you are not using NRT, you can disable this cache.

solr.hdfs.nrtcachingdirectory.enable

If true, NRTCachingDirectory will be enabled.

solr.hdfs.nrtcachingdirectory.maxmergesizemb

Configures the maximum segment size for index merges. The default is 16MB.

solr.hdfs.nrtcachingdirectory.maxcachedmb

Configures the maximum NRTCachingDirectory cache size. The default is 192MB.

Kerberos Settings

If your cluster is secured with Kerberos, there are a few additional properties to add to the DirectoryFactory configuration to allow Solr access to the cluster.

<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
   ...
   <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
   <str name="solr.hdfs.security.kerberos.keytabfile">/etc/krb5.keytab</str>
   <str name="solr.hdfs.security.kerberos.principal">solr/admin@KERBEROS.COM</str>
</directoryFactory>

These properties are explained in detail below:

solr.hdfs.security.kerberos.enabled

Defines whether Kerberos should be used. The default is false; set this to true to enable Kerberos.

solr.hdfs.security.kerberos.keytabfile

A keytab file contains pairs of Kerberos principals and encrypted keys to allow password-less authentication. Enter the path to the keytab file, wherever it is located on the server where Solr is running.

This file must be present on each node that is running Solr when in SolrCloud mode.

More information about getting a ticket for the Solr user is available in the section Kerberos Support.

solr.hdfs.security.kerberos.principal

The Kerberos principal that Solr should use to authenticate. The format of a typical Kerberos V5 principal is primary/instance@realm. For example, krbtgt/HADOOP-VM1@HADOOP-VM1.

Start Solr

If your cluster uses Kerberos, please see additional information in the section Starting Solr with Kerberos below before starting Solr.

Solr includes a robust start script that can handle a broad range of common customizations. The script can additionally be used to stop Solr, check the status and health of a node, and create collections.
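For example, a few common invocations, assuming a node running on the default port and the ZooKeeper addresses used elsewhere in this guide:

# Check the status of Solr nodes running on this machine
bin/solr status
# Check the health of a collection (hypothetical collection name)
bin/solr healthcheck -c SolrCollection -z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181
# Stop all Solr nodes running on this machine
bin/solr stop -all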

Starting Solr with the script is quite simple. For example:

bin/solr start -c (1)
   -z 10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 (2)
   -Dsolr.directoryFactory=HdfsDirectoryFactory (3)
   -Dsolr.lock.type=hdfs (4)
   -Dsolr.hdfs.home=hdfs://host:port/path (5)
1 The start command for the bin/solr script. The -c parameter tells Solr to start in SolrCloud mode.
2 The connect string for the ZooKeeper ensemble. We give the addresses for each node of the ZooKeeper ensemble in case one is down; we will still be able to connect as long as there is a quorum.
3 The Solr index implementation you will use; this parameter defines how the indexes are stored on disk. In this case, we are telling Solr all indexes should be stored in HDFS.
4 The index lock type to use. Again, we have defined hdfs to indicate the indexes will be stored in HDFS.
5 The path to the location of the Solr indexes in HDFS.
If you do not specify a ZooKeeper connect string with the -z property, Solr will launch its embedded ZooKeeper instance. The embedded instance runs a single ZooKeeper node, so it provides no failover and is not meant for production use.

Note we have not defined a collection name, a configuration set, how many shards or nodes we want, etc. Those properties are defined at the collection level, and we’ll define those when we create a collection.

It is possible to use the -e property to start Solr using an example collection. This will launch an interactive session and allow you to define the collection name, the configset to use, as well as the number of shards and replicas. These options are also described in the next section.
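For example, the following launches the interactive SolrCloud example; note that this starts local example nodes with default (non-HDFS) settings and is intended for experimentation rather than for the HDFS-backed setup described in this guide:

# Launch the interactive SolrCloud example; Solr prompts for the number of nodes,
# ports, collection name, configset, shards, and replicas
bin/solr start -e cloud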

More information about options for starting Solr with the bin/solr script is available in the "Starting and Stopping" section of the Apache Solr Reference Guide at Solr Start Script Reference.

Starting Solr with Kerberos

When starting Solr in a cluster secured with Kerberos, check that Solr is configured for Kerberos as explained in the section HDFS-Specific Changes.

Before starting Solr, ensure you have a ticket for the Solr user as described in the section Kerberos Support. The user referenced in the keytab file should be the same user that is defined for the solr.hdfs.security.kerberos.principal property in solrconfig.xml.
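For example, reusing the keytab path and principal shown in the Kerberos Settings section (adjust both to match your environment), a ticket can be obtained with kinit:

# Obtain a Kerberos ticket for the Solr principal using the keytab file
kinit -kt /etc/krb5.keytab solr/admin@KERBEROS.COM
# Verify the ticket
klist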

If you are using an HDFS client because the Solr server is not itself a node of the HDP 2.5.x cluster, ensure that the dfs.namenode.kerberos.principal property in hdfs-site.xml is configured with the principal of the cluster's NameNode.

Once a ticket exists for the Solr user and solrconfig.xml has been customized as described in the section HDFS-Specific Changes, the start parameters shown above can be used with no need for any additional parameters.

Create a Collection

We can use the same bin/solr script to create and delete collections. This time we use it with the create command and define several new properties.

bin/solr create -c SolrCollection (1)
   -d data_driven_schema_configs (2)
   -n mySolrConfigs (3)
   -s 2 (4)
   -rf 2 (5)
   -p 8983 (6)
1 The create command for the bin/solr script. In this case, the -c parameter provides the name of the collection to create.
2 The configset to use. In this case, we’ve used the data_driven_schema_configs configset. If you modified a configset to support storing Solr indexes in HDFS, as above, you should instead use the name of the configset you modified.
3 The name of the configset in ZooKeeper. This allows the same configset to be reused and very similar configsets to be differentiated easily.
4 The number of shards to split the collection into. The shards are physical sections of the collection’s index on nodes of the cluster.
5 The number of replicas of each shard for the collection. Replicas are copies of the index which are used for failover and backup in case of failure of one of the main shards.
6 The port of the Solr instance where the collection should be created. This port should always be defined, to prevent the wrong instance from being used.
The concepts of shards and replicas are described in more detail in the Apache Solr Reference Guide section on SolrCloud.

More information about the collection options for the bin/solr script is available in the "Collections and Cores" section of the Apache Solr Reference Guide at Solr Start Script Reference.
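To confirm that the new collection was created and that its shards and replicas are active, one option is the Collections API. A minimal sketch, assuming Solr is listening on localhost:8983 and the collection name used above:

# Show the cluster status for the new collection
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=SolrCollection&wt=json"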

Index Content

To index content from your HDFS system to Solr, you will need to use the Hadoop job jar provided by Lucidworks. For details on how to use the job jar, see the Job Jar Guide.

Because you have a fully functional installation of Solr, you can also use any of the other approaches to indexing documents supported by Solr. For more details, review the options in the Getting Started with Solr section on Indexing Documents.
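For example, the bin/post tool included with Solr can index the sample documents from the exampledocs directory mentioned earlier. A minimal sketch, assuming the collection created above and Solr running on the default port on this machine:

# Index Solr's sample documents into the collection
cd /opt/lucidworks-hdpsearch/solr
bin/post -c SolrCollection example/exampledocs/*.xml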