First Run Tutorial
- Fusion Key Concepts
- Download and Install Fusion
- Create a Collection
- Document Indexing with Datasources and Pipelines
- Lucene and Solr
- Further Reading
This section is an introductory, step-by-step guide to using the Fusion UI for basic search and indexing. This guide helps you to:
Verify that the Fusion components are installed and communicating properly.
Use the Fusion UI to create and search a small dataset.
Understand a few underlying principles of search and indexing over structured and semi-structured text.
Fusion supports several approaches to indexing content. In this section we use Fusion’s web crawler to index a website which contains Shakespeare’s sonnets. This dataset is chosen precisely because it is small and simple. Web data in general is challenging to parse because a page of HTML contains many different kinds of elements: content, metadata, scripts, layout directives, and links, all of which can be processed using Fusion pipelines. By the end of this exercise, you should understand the Fusion UI and workflow and be ready to create custom pipelines to process your own data.
Fusion Key Concepts
Fusion uses the Solr/Lucene search engine to evaluate search requests and return results in the form of a ranked list of document ids. Fusion extends Solr/Lucene functionality via a REST-API and a UI built on top of that REST-API.
The Fusion UI is organized around the following key concepts:
Collections store your data.
Documents are the items that are returned as search results.
Fields are the named values that make up a document; they are what is actually stored in a collection.
Datasources are the conduit between your data repository and Fusion.
Pipelines encapsulate a sequence of processing steps, called stages.
Indexing Pipelines process the raw data received from a datasource into fielded documents for indexing into a Fusion collection.
Query Pipelines process search requests and return an ordered list of matching documents.
Relevancy is the metric used to order search results. It is a non-negative real number which indicates the similarity between a search request and a document.
Download and Install Fusion
Fusion is distributed as a gzipped tar file or as a compressed zip file which can be used directly to run a single-node local Fusion instance.
The Fusion distribution for Linux and Mac unpacks into a directory named "fusion". This is the Fusion home directory, written as $FUSION in the commands below.
If you are installing an evaluation copy of Fusion on Windows, you must use the freely available 7zip file archiver to unzip the archive, because 7zip is more robust than the standard Windows utility. The Fusion archive contains a large number of 3rd party jarfiles, and the standard Windows utility cannot reliably deal with all of them. Visit the 7zip download page for the latest version. The archive unzips as a directory (folder) with the basename of the zip file, e.g., fusion-2.1.0, which contains a single directory named "fusion".
The script $FUSION/bin/fusion is used to start and stop Fusion via the command line arguments "start" and "stop", respectively. To start Fusion from a terminal window (Linux or Mac), run "$FUSION/bin/fusion start".
Successful startup results in several lines of output to the terminal window, listing the Fusion components and the ports they are listening on:
2015-10-12 23:13:01Z Starting Fusion ZooKeeper on port 9983
2015-10-12 23:13:11Z Starting Fusion Solr on port 8983
2015-10-12 23:13:36Z Starting Fusion Spark Master on port 8766
2015-10-12 23:13:36Z Starting Fusion Spark Worker on port 8769
2015-10-12 23:13:36Z Starting Fusion API Services on port 8765
2015-10-12 23:13:42Z Starting Fusion Connectors on port 8984
2015-10-12 23:13:47Z Starting Fusion UI on port 8764
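To check this output programmatically, the following Python sketch parses startup lines in the format shown above and tests whether each listed port accepts TCP connections. The function names parse_startup_log and port_is_open are illustrative and are not part of Fusion, and the default host "localhost" assumes a single-node local install.

```python
import re
import socket

# Matches lines such as "... Starting Fusion ZooKeeper on port 9983".
LINE_RE = re.compile(r"Starting Fusion (?P<name>.+?) on port (?P<port>\d+)")


def parse_startup_log(text):
    """Return {component name: port} for every startup line found in text."""
    return {m.group("name"): int(m.group("port")) for m in LINE_RE.finditer(text)}


def port_is_open(port, host="localhost", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    sample = (
        "2015-10-12 23:13:11Z Starting Fusion Solr on port 8983\n"
        "2015-10-12 23:13:47Z Starting Fusion UI on port 8764\n"
    )
    for name, port in parse_startup_log(sample).items():
        status = "listening" if port_is_open(port) else "not reachable"
        print(f"{name} (port {port}): {status}")
```

A successful TCP connection only shows that something is listening on the port; it does not confirm that the component behind it is healthy.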
Create a Collection
The next step in getting started is to create a Fusion collection, the repository which stores your data. In this example, we create a collection called "sonnets" which will store Shakespeare’s sonnets.
Once Fusion is running, open a web browser and access the server and port that the Fusion UI is listening on. For a default single-node Fusion installation, the Fusion UI runs on port 8764 (as shown above), so the URL is: http://localhost:8764/.
Upon initial install, when you first access the Fusion UI, it will present a sequence of panels:
The initial login panel, URL: "http://localhost:8764/initial-login". Set the Fusion admin password. Remember to check the box "Agree to License Terms".
The registration panel, URL: "http://localhost:8764/registration". Registration is optional. You can opt out by clicking the "Skip" link at the bottom of this panel. Please see System Usage Monitor for information about how Fusion collects and uses this information.
The welcome panel, URL: "http://localhost:8764/welcome". This panel prompts you to create a collection. When running a single-server developer instance of Fusion, the "Advanced" options don’t apply, so just enter the collection name. Here, we use the name "sonnets".
Note: Although Fusion will recognize collections named "TesT" and "tESt" as different collections, the filesystem on which the underlying Solr collection is stored may not. Therefore, avoid using letter case as the sole distinguishing feature for a collection name.
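As a small illustration of this naming advice, the Python sketch below groups candidate collection names that differ only by letter case. The function case_collisions is a hypothetical helper for checking names before you create collections; it is not a Fusion feature.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that are identical once letter case is ignored."""
    groups = defaultdict(list)
    for name in names:
        # casefold() normalizes case more aggressively than lower().
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]


print(case_collisions(["sonnets", "TesT", "tESt"]))  # → [['TesT', 'tESt']]
```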
Upon successful creation of the collection "sonnets", Fusion displays the main collections panel.
As the collection is created, the UI generates and stores a series of notifications. The rightmost icon on the Fusion top menu bar toggles display of these notifications.
Document Indexing with Datasources and Pipelines
The target dataset for this example is the website "Shakespeare: Sonnets", URL: "http://poetry.eserver.org/sonnets/".
To index this website, we need to configure and run a Fusion datasource for this collection. From the main collections panel (shown in the previous screenshot), click on the name of the collection "sonnets", so that the UI displays the sonnets collection home, URL "http://localhost:8764/panels/sonnets". The default initial display for a collection has the collection "Home" panel on the left and the collection search panel next to it.
At the top of the search panel are the search query input box and a gear icon which toggles the search results display controls. The search results display panel is underneath these elements. As collection "sonnets" doesn’t yet contain any documents, the search results panel is empty.
The narrow panel on the left is the collection "Home" panel. The Home panel contains the list of collection admin tools which are used for indexing, search, and system administration. It’s possible to have multiple home panels open at once, as needed to view configuration information. Clicking the Home icon at the top of the home panel will open a new home panel.
The collections admin tools under the "Index" heading are the controls for defining and using datasources and pipelines. Click the "Datasources" tool at the top of the Index section.
After clicking on "Datasources", the admin panel displays a choice of connector types.
Because we want to get data from a website, we select Web > Anda Web.
In configuring the web datasource, we only need to specify a datasource name, here "ds-1", and a starting URL, which for this example is the URL of the sonnets website given above.
Note: Configuration choices are only set/updated if you click on the red "Save" button at the bottom of this panel. Always scroll to the bottom of the panel and make sure that you have saved your work! Fusion displays a notification to confirm the datasource save.
Running the datasource will cause Fusion to fire up the connector, which will retrieve documents from the eserver.org website. Each webpage is treated as a separate document. The datasource hands off each document to the indexing pipeline. Here we ran the Anda Web datasource with indexing pipeline "Documents_Parsing", which is the default pipeline for this datasource. The "Documents_Parsing" pipeline consists of the following processing stages:
Apache Tika Parser - recognizes and parses most common document formats, including HTML
Field Mapping - transforms field names to valid Solr field names, as needed
Detect Language - transforms text field names based on language of field contents
Solr Indexer - transforms Fusion index pipeline document into Solr document and adds (or updates) document to collection.
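The idea of a pipeline as a sequence of stages can be sketched in a few lines of Python. This is a conceptual toy, not Fusion's implementation or API: every function name here is hypothetical, the parser stage is a stand-in built on Python's standard html.parser rather than Apache Tika, the language detection stage is omitted, and the example URL is made up.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML page (toy stand-in for a real parser)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def parse_stage(doc):
    # Stand-in for the parser stage: extract plain text from raw HTML.
    extractor = _TextExtractor()
    extractor.feed(doc["raw"])
    extractor.close()
    return {"url": doc["url"], "content": " ".join(extractor.chunks)}


def field_mapping_stage(doc):
    # Stand-in for field mapping: rename "content" to "content_txt".
    doc = dict(doc)
    doc["content_txt"] = doc.pop("content")
    return doc


def make_index_stage(collection):
    # Stand-in for the indexer: add or update the document, keyed by URL.
    def index_stage(doc):
        collection[doc["url"]] = doc
        return doc
    return index_stage


def run_pipeline(doc, stages):
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc


collection = {}
stages = [parse_stage, field_mapping_stage, make_index_stage(collection)]
page = {
    "url": "http://example.com/sonnet-18",  # hypothetical URL
    "raw": "<html><body><h1>Sonnet 18</h1><p>Shall I compare thee</p></body></html>",
}
result = run_pipeline(page, stages)
print(result["content_txt"])  # → Sonnet 18 Shall I compare thee
```

Each stage takes a document and returns a document, which is what makes stages easy to reorder, remove, or replace.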
To run the datasource, click the "Start" button on the Datasource panel. While the job is running, this control changes to "Stop / Abort". When the job is finished, the control changes back to a "Start" button and the job status is displayed directly below it. To inspect the results, click on the control which toggles "show/hide" details. This opens an adjoining panel which lists each run of the datasource job. Each job has a small control that expands the job listing into a detailed listing of the results of each pipeline processing stage:
Once a datasource has been configured and the indexing job is complete, the collection can be searched using the search results tool. The wildcard query "*" matches all documents in the collection. Here is the result of running this search, showing all fields in one document:
In order to use the search results tool to examine the documents in this collection, we need to configure the search results tool to show only those fields we care about. In this case the field "url" shows the sonnet number, and the field "content_txt" shows the raw text contents extracted from the HTML page.
Clicking on the gear icon next to the search box toggles the Search Results Configuration Tool open and closed. To get a compact display of search results over this dataset, you should:
toggle the configuration tool open via the gear icon
choose tab "Documents"
select display "Primary"
uncheck all fields above field "URL"
select display "Secondary"
uncheck all fields above field "content_txt"
scroll down to the bottom of the control and hit "save"
toggle the configuration tool closed by clicking on the gear icon
With this configuration in place, a search on the word "love" returns the following result:
The most relevant document is Sonnet 40:
It contains eight instances of the word "love", and one instance of "loves", more than in any other sonnet. To understand what we mean by relevancy, it is necessary to understand Lucene and Solr.
Lucene and Solr
Under the hood, Fusion collections are Solr collections, and Solr collections are composed of Lucene indexes.
Lucene itself is a search API. Solr wraps Lucene in a web platform. Search and indexing are carried out via HTTP requests and responses. Solr generalizes the notion of a Lucene index to a Solr collection, a uniquely named, managed, and configured index which can be distributed ("sharded") and replicated across servers, allowing for scalability and high availability.
Lucene started out as a search engine, designed to perform the following information retrieval task: given a set of query terms and a set of documents, find the subset of documents which are relevant for that query. Lucene provides a rich query language which allows for writing complicated logical conditions. Lucene now encompasses much of the functionality of a traditional DBMS, both in the kinds of data it can handle and the transactional security it provides.
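To make the HTTP interface concrete, here is a minimal Python sketch that builds a request URL for Solr's standard /select handler and, optionally, sends it. It rests on several assumptions: that Solr is reachable directly on port 8983 (the port shown in the startup output), that the Solr collection has the same name as the Fusion collection ("sonnets"), and that the fields "url" and "content_txt" exist as described earlier. The helper names are illustrative. Querying Solr directly like this bypasses Fusion's query pipelines.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def solr_select_url(collection, query, host="localhost", port=8983,
                    rows=10, fields=None):
    """Build a URL for Solr's /select handler, asking for a JSON response."""
    params = {"q": query, "rows": rows, "wt": "json"}
    if fields:
        # "fl" restricts which stored fields are returned for each document.
        params["fl"] = ",".join(fields)
    return f"http://{host}:{port}/solr/{collection}/select?{urlencode(params)}"


def search(collection, query, **kwargs):
    """Send the query and return the list of matching documents."""
    with urlopen(solr_select_url(collection, query, **kwargs), timeout=10) as resp:
        return json.load(resp)["response"]["docs"]


# Only the URL is built here; call search() against a running instance.
print(solr_select_url("sonnets", "content_txt:love"))
# → http://localhost:8983/solr/sonnets/select?q=content_txt%3Alove&rows=10&wt=json
```

In Solr's own query syntax, the match-all query is written "*:*".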
Lucene maps discrete pieces of information, e.g., words, dates, numbers, to the documents in which they occur. This map is called an inverted index because the keys are document elements and the values are document ids, in contrast to other kinds of datastores where document ids are used as a key and the values are the document contents. This indexing strategy means that search requires just one lookup on an inverted index, as opposed to a document oriented search which would require a large number of lookups, one per document. Lucene treats a document as a list of named, typed fields. For each document field, Lucene builds an inverted index that maps field values to documents.
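The structure described above can be illustrated with a toy inverted index in Python. This is a simplified sketch for a single text field, not Lucene's actual data structure; the tokenizer and the document ids are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


docs = {
    "sonnet-18": "Shall I compare thee to a summer's day",
    "sonnet-40": "Take all my loves, my love, yea take them all",
    "sonnet-116": "Love is not love which alters when it alteration finds",
}
index = build_inverted_index(docs)

# One lookup on the term returns every matching document id.
print(sorted(index["love"]))  # → ['sonnet-116', 'sonnet-40']
```

Note that "loves" and "love" are distinct keys here; mapping such variants to one term is the job of text analysis, which this sketch leaves out.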
With this inverted index, it is easy to compute relevancy based on the overall frequency of terms in documents. An extremely relevant document for a query is a document which contains more of the terms in the query relative to all the other documents in the collection, as seen in the above search example.
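As a rough illustration of frequency-based ranking, the Python sketch below scores each document by how many times the query terms occur in it. This is raw term counting only, under the assumption of a simple word tokenizer; Lucene's actual scoring is more elaborate and also takes into account factors such as how rare a term is across the collection and how long each document is.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs, highest score first, omitting zero scores."""
    query_terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "sonnet-18": "Shall I compare thee to a summer's day",
    "sonnet-40": "Take all my loves, my love, yea take them all; "
                 "no love, my love, that thou mayst true love call",
    "sonnet-116": "Love is not love which alters when it alteration finds",
}
print(rank("love", docs))  # → [('sonnet-40', 4), ('sonnet-116', 2)]
```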
Further Reading
An expanded version of this getting started page is available on the Lucidworks blog as a two-part series.
The Lucidworks blog also has a series of articles on getting started with Signals in Fusion.