From Ingest to Search:
Part One - Default Indexing

In Part One, we’ll see what we get with the default index and query pipelines. You can use Fusion 2.4 or 3.0 for this tutorial. Here’s an overview of the process:

Overview

Before You Begin

For our dataset, we’ll use a simple CSV file containing abstracts from Wikipedia articles that contain either of the words "film" or "movie". Each row in the CSV file contains one document’s worth of data. Fusion will process each row into a fielded Solr document whose field names are derived from the CSV column names.

  1. Download Fusion.

  2. tar -xf fusion-2.4.x.tar.gz

  3. fusion/2.4.x/bin/fusion start

  4. Go to http://localhost:8764/.

  5. Download the dataset and unzip it.

  6. Completeness check: Does the file contain the right number of lines? Your copy of this file should be 155,299 lines long.

    Tip
    You should always check that your data is in the format that you expect and that it is complete and correct before you import it into Fusion.
    > wc -l cinema-short-abstracts.csv
    155299 cinema-short-abstracts.csv
  7. Correctness check: View the first several lines of the file:

    > head -4 cinema-short-abstracts.csv | cat -n
         1	DBpediaURL_s,WikipediaURL_s,abstractShort_txt
         2	"<http://dbpedia.org/resource/!Women_Art_Revolution>","<http://en.wikipedia.org/wiki/!Women_Art_Revolution?oldid=498606361>","!Women Art Revolution is a 2010 documentary film directed by Lynn Hershman Leeson and distributed by Zeitgeist Films. It was released theatrically in the United States on June 1, 2011."
         3	"<http://dbpedia.org/resource/$100_Film_Festival>","<http://en.wikipedia.org/wiki/$100_Film_Festival?oldid=540809587>","The $100 Film Festival is an independent film festival that runs for three days every March at the Globe Cinema in downtown Calgary, Alberta. The festival showcases films in all genres by local and international independent artists who enjoy working with traditional film. Created in 1992 by the Calgary Society of Independent Filmmakers (CSIF), the $100 Film Festival started as a challenge for area filmmakers to a make low budget movie using Super8 film for less than $100."
         4	"<http://dbpedia.org/resource/$30_Film_School>","<http://en.wikipedia.org/wiki/$30_Film_School?oldid=498969852>","$30 Film School is a book written by Michael W. Dean instructing on filmmaking on a limited budget, and is part of the $30 School book series which includes $30 Music School and $30 Writing School. Like the other books of this series, $30 Film School advocates a start-to-finish DIY ethic, and includes interviews with professionals in the given field, as well as a CD or DVD of extras. Published by Muska & Lipman in 2003, the first edition sold 30,000 copies."
    • The first line is the CSV header.

    • The column names are suffixed by the Solr data type for each column; we’ll look at those data types later.

    • In the Solr index, the column names from the file become the field names for each entry.

    • Fusion will process each non-header and non-comment row into a Solr document.

  8. Correctness check: Verify the CSV format with csvlint.

    This particular data file was derived from a file downloaded from the DBpedia project. The original dataset contains UTF-8 characters outside of the ISO Latin 1 character set. The abstracts contain many quoted strings which must be properly escaped for CSV. Did the person who created the CSV file format it correctly?

    The answer is yes. To check for yourself, you may run the "csvlint" utility, which can be downloaded from https://github.com/Clever/csvlint

    > csvlint cinema-short-abstracts.csv
    cinema-short-abstracts.csv is VALID
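
One more quick check that will pay off later: count the data rows by themselves, skipping the header line. This uses only standard Unix tools, and the number it reports should match the document count we see after indexing.

    > tail -n +2 cinema-short-abstracts.csv | wc -l
    155298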

1. Create a Collection

Fusion stores each dataset in a Fusion collection, which is backed by a Solr index.

We’ll create a collection called "cinema_1" for our data:

  1. Log in to Fusion via your browser (http://localhost:8764/ if you installed Fusion locally). We support Chrome, Firefox, and recent versions of Internet Explorer.

    When you log in, the Collections application is displayed by default.

  2. Click Add a Collection:

    collections application

  3. For the Collection name, enter "cinema_1".

    save collection button

  4. Click Save Collection.

    Now you have a "cinema_1" collection that contains no documents or datasources:

    collection created
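
If you prefer the command line, you can confirm that the new collection exists through Fusion's REST API. The call below is a sketch: it assumes a local Fusion 2.4/3.0 install, which serves its API under the "apollo" prefix, and an admin account whose password you set at first login. Adjust credentials, host, and port for your environment.

    > curl -u admin:YOUR_PASSWORD http://localhost:8764/api/apollo/collections/cinema_1

The response is a JSON description of the "cinema_1" collection, including the Solr parameters behind it.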

The next step is to define a datasource for this collection.

2. Define a Datasource

A datasource is an ingest configuration that’s associated with a collection. Ingested data belongs to the same collection as its datasource.

Define a datasource

To configure Fusion to ingest the "cinema-short-abstracts.csv" file, you must define a datasource:

  1. In the collections list, click cinema_1.

    The collection’s Home toolbar appears.

  2. Click Datasource:

    add datasources

    This opens a new datasource configuration panel:

    add datasources

    There are no datasources in this collection yet.

  3. Click the Add button.

  4. Select the appropriate connector for our local CSV file by navigating to Filesystem > Local Filesystem:

    filesystem datasource

  5. Enter configuration values using the example shown below, changing collection names and pathnames as needed:

    local filesystem config

    1. Enter a recognizable name for the datasource.

    2. Select the pipeline ID "cinema_1-default".

    3. For the start link, enter the full path to the unzipped data file.

      Enter additional configuration values for processing CSV-formatted data:

    4. Click "Complex Documents".

    5. Select "Include". When you select "Include", additional CSV options appear.

    6. Verify that "CSV has header row" is selected.

  6. Click the Save button.
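
If you want to keep a record of this configuration (or script it later), you can read the saved datasource definition back through the connectors API. This sketch assumes the Fusion 2.x/3.0 "apollo" API prefix and that you named the datasource "cinema_csv"; substitute your own datasource name and credentials.

    > curl -u admin:YOUR_PASSWORD http://localhost:8764/api/apollo/connectors/datasources/cinema_csv

The JSON response includes the pipeline ID, the start link, and the CSV options you just selected.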

3. Crawl The Datasource and Check The Job History

Now we can import the dataset into our collection by running the datasource.

When it finishes, we’ll check the job history to verify that the datasource job succeeded.

Run the datasource

  1. Click the Start Crawl button to begin processing the data file.

    datasource run controls

    The current job status is displayed below the job controls.

    Note
    Datasources keep a run history, and they won’t re-run jobs over the same dataset. Therefore, if you have already tried to process this file, you may need to click the Clear Datasource button to clear the run history.
  2. Once the current job status is "finished", click the Job History button.

    This displays a new panel showing all runs of this datasource, from the most recent run back to the first.

  3. Click the most recent job to view the details:

    datasource job history

    Notice the input is 155,299 documents but the output is 155,298. This is because the first line of the CSV file is the header row, consisting of column names. Fusion reads the header row to get the column names but does not include that row in its indexing output. Every data row in the CSV file was processed into a document.

    The job history section "Pipelines" shows the set of processing stages used to transform the raw data into a document for indexing. Some pipeline stages report an input count of 0. This is a known issue, resulting from the way that the CSV file is processed. Not all index stages report input and output counts.
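
You can also verify the document count outside the UI. Fusion proxies Solr requests for each collection, so a quick sketch (again assuming the "apollo" API prefix and a local default install) is to ask Solr how many documents match a match-all query:

    > curl -u admin:YOUR_PASSWORD "http://localhost:8764/api/apollo/solr/cinema_1/select?q=*:*&rows=0&wt=json"

Look for "numFound":155298 in the response; it should agree with the output count reported in the job history.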

4. Check the Collection Fields

Fusion processes each row in the input file into a fielded Solr document. Let’s verify that the correct fields were created, that each has the correct data type, and that each contains the correct number of documents.

Check the fields

Since document field names correspond to the CSV file column headers, we expect our "cinema_1" collection to contain these three fields:

DBpediaURL_s,WikipediaURL_s,abstractShort_txt

Notice the suffixes on the field names. These suffixes indicate the data type of each field.

Because Solr indexes and stores each field separately, we expect the document count for each of those three fields to be 155,298.

Let’s check the collection’s fields:

  1. Navigate to Home > Fields.

    home index pipelines

    This opens the Fields panel:

    fields browser

    The display shows "Static" and "Dynamic" fields.

    • Static fields are explicitly defined by the underlying Solr schema configuration for the collection.

    • Dynamic fields are defined by name suffixes; any field whose name matches a suffix gets that suffix’s data type.

      We’ll find DBpediaURL_s, WikipediaURL_s, and abstractShort_txt among the dynamic fields.

      • *_s fields are String fields.

      • *_txt fields are Text fields.

  2. Scroll down to the *_s field and click the Play icon.

    This expands the view to show the actual fields created, including DBpediaURL_s and WikipediaURL_s:

    fields star_s

  3. Scroll down to the *_txt field and click the Play icon:

    fields star_txt

For all three of our document fields, we can see that the correct number of documents (155,298) is present. We can also see that our fields have the correct data types.
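
To double-check those per-field counts from the command line, you can run a Solr range query on each field; it counts the documents in which that field has a value. Here is a sketch using the same Solr passthrough as before (credentials and API prefix are assumptions about a default local install):

    > curl -u admin:YOUR_PASSWORD -G "http://localhost:8764/api/apollo/solr/cinema_1/select" \
          --data-urlencode 'q=abstractShort_txt:[* TO *]' --data-urlencode 'rows=0' --data-urlencode 'wt=json'

Repeat with DBpediaURL_s and WikipediaURL_s as the field name; each query should report "numFound":155298.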

5. Search the Collection

Now we can take a look at the shape of our indexed data by searching it.

Search the collection

In the collection’s Home toolbar, click Search.

This opens the Search UI. This is a search interface for Fusion users; in the real world, you’ll develop a custom search interface just for your end users.

The first time this panel is opened it displays the results of the default wildcard search, which matches all documents in the collection:

default search

By default, Fusion displays the id field as the primary field in the collection. In the case of CSV data, the id field consists of the full pathname of the file and the row number in which the document was found.
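
Behind the scenes, the Search UI runs its searches through a query pipeline. You can issue the same wildcard search yourself with a sketch like the one below, which assumes the Fusion 2.x/3.0 endpoint layout and the "cinema_1-default" query pipeline that Fusion created along with the collection:

    > curl -u admin:YOUR_PASSWORD "http://localhost:8764/api/apollo/query-pipelines/cinema_1-default/collections/cinema_1/select?q=*:*&wt=json"

The response is standard Solr JSON: "numFound" gives the total number of matching documents, and the "docs" array holds the first page of results.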

Configuring The Display Fields

Next we’ll configure the Search UI to display fields that are most meaningful to us.

  1. Click on the little gear icon next to the search box at the top of the page to open the Search UI Configuration panel:

    default search

  2. Click "Documents".

    The Documents control allows you to choose which fields to display as the "primary", "secondary", and "additional" fields. First we’ll configure the primary field.

  3. Scroll down the field list to find the field named DBpediaURL_s.

  4. Click on the drag control next to the field name and drag the field to the top of the field list.

  5. Select Secondary from the dropdown menu.

  6. Drag the abstractShort_txt field to the top.

  7. Select Additional from the dropdown menu.

  8. Drag the score field to the top, unchecking all other fields.

  9. To save your changes and close the Search UI Configuration window, click the little gear icon or anywhere outside the configuration window.

    The display for the default wildcard search now looks like this:

    wildcard search

    Because all documents match the wildcard query exactly, all documents have a score of "1" for this search. When documents have the same score, they are displayed together in random order. Therefore, your search results may begin with a different set of documents than the ones shown above.

Search for "Star Wars"

Now let’s see how well search works.

  1. Search for "Star Wars".

    star wars search

    This search returns documents related to the Star Wars universe. To verify that our collection contains an entry for the original "Star Wars" movie, let’s add some more terms to our search query.

  2. Search for "Star Wars A New Hope 1977 George Lucas".

    star wars search again

    This targeted search returns the correct entry.

    However, this places a burden on the user to formulate a detailed, expert-level query. Unless the user knows what to look for - the original movie director, release date, and episode name - they may not find the correct item.
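
Both of these searches can also be reproduced through the query API sketched earlier; only the q parameter changes. For example, with the spaces URL-encoded:

    > curl -u admin:YOUR_PASSWORD "http://localhost:8764/api/apollo/query-pipelines/cinema_1-default/collections/cinema_1/select?q=Star+Wars+A+New+Hope+1977+George+Lucas&wt=json"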

This is a very basic search experience that’s not ideal for end users. In the next parts of this tutorial, we’ll see how to improve it.

Lessons Learned

  • Data validation - check your work early and often

    • The larger your dataset, the more often you should check your work

  • Fusion Collections - the underlying datastore

    • Creating a new collection

    • Viewing the collection fields

  • Fusion Datasources - how to ingest data into a collection

    • Creating and configuring a datasource

    • Running the datasource

  • Fusion’s Search UI - a tool for investigating and customizing search results

    • Running a search in the UI

    • Customizing the results display

Next Steps

The next parts of this tutorial show you how Fusion provides tools to improve the user search experience:

  • Part Two shows you how to configure the index pipeline to provide better indexed data for search purposes.

  • Part Three explains how to configure the query pipeline to provide better processing for user queries.