From Ingest to Search:
Part Two - Better Indexing

Before you begin, be sure to complete Part One.

In this part of the tutorial, we’ll use Fusion to extract more information from the dataset and use it to improve the search experience. Better search starts with better indexing.

In Part One, we created a collection called "cinema_1" and used the Index Workbench to preview the data using the default parser and index pipeline configurations. We saw that our data includes a movie abstract and two URL fields, each of which contains the movie title. Since users expect to search by title as well as by keyword, we want to create a searchable field containing a properly formatted movie title.

To do this, we will modify the default index pipeline using the Index Workbench.

Working with an Index Pipeline

The index pipeline development workflow happens in the Index Workbench like this:

  1. The Simulated Results pane shows you what your indexed data currently looks like.

    Use it to determine whether your data needs modification in order to be useful. It shows you a preview of how your data will be indexed while you reconfigure the parser and index pipeline.

    When you save your changes here, they’re only applied to incoming data. Existing data is not affected until you re-index it.

  2. The Datasource configuration pane is where you re-index your data.

    You must always re-index your data in order to apply an updated index pipeline.

  3. Return to the Search UI pane to verify that your index pipeline changes produced the desired results.

Index pipeline workflow

You can also view and configure your index pipelines by navigating to Home > Index Pipelines, though you won’t get a preview of the output until you go back into the Index Workbench. Regardless of which tool you use to develop your pipeline, the data in your index will not be modified until you re-index it.

The default index pipeline for any collection is called collection-default. So, the default pipeline for the collection we created in Part One is called cinema_1-default. A default index pipeline consists of just three stages:

  • The Field Mapping stage provides field-level document cleanup. Fields can be renamed, combined, or deleted from the document.

  • The Solr Dynamic Field Name Mapping stage maps pipeline document fields to Solr fields.

  • The Solr Indexer stage sends the document to Solr for indexing into the collection.
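
Conceptually, an index pipeline is an ordered list of stages, each of which receives a document and hands a (possibly modified) document to the next stage. The sketch below models that idea in plain JavaScript. It is not Fusion's implementation or API: the stage functions, the `runPipeline` helper, the plain-object document, and the sample field values are all hypothetical stand-ins.

```javascript
// Conceptual model only: a pipeline is an ordered list of stages, and each
// stage is a function that takes a document and returns a document.
// All names below are hypothetical; none of them are Fusion APIs.

// Stand-in for field mapping: delete one field from the document.
function fieldMappingStage(doc) {
  var out = {};
  Object.keys(doc).forEach(function (name) {
    if (name !== "scratch") { out[name] = doc[name]; }
  });
  return out;
}

// Stand-in for dynamic field name mapping: give un-suffixed string fields
// a "_s" suffix. The real stage's rules are not covered in this tutorial.
function dynamicFieldNameStage(doc) {
  var out = {};
  Object.keys(doc).forEach(function (name) {
    var hasSuffix = /_[a-z]+$/.test(name);
    var isString = typeof doc[name] === "string";
    out[!hasSuffix && isString ? name + "_s" : name] = doc[name];
  });
  return out;
}

// Stand-in for the indexer: append the document to an in-memory array.
function makeIndexerStage(index) {
  return function (doc) {
    index.push(doc);
    return doc;
  };
}

// Run the stages in order, feeding each stage's output to the next.
function runPipeline(stages, doc) {
  return stages.reduce(function (current, stage) {
    return stage(current);
  }, doc);
}

var index = [];
var stages = [fieldMappingStage, dynamicFieldNameStage, makeIndexerStage(index)];
runPipeline(stages, {
  abstract: "A film.",
  scratch: "tmp",
  DBpediaURL_s: "<http://dbpedia.org/resource/Star_Wars>"
});
console.log(JSON.stringify(index[0]));
```

The point of the model is the ordering: a stage only sees what the stages before it produced, which is why stage order matters when we add new stages to the pipeline.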

index workbench initial config

Now we’ll begin modifying the pipeline to get search results that include a title field.

Using regular expressions to create new fields from (parts of) existing ones

Our pipeline needs a Regex Field Extraction stage to take the contents of a source field, match it against a regular expression, and copy any matches into a target field. We’ll craft a regular expression that extracts the title string from the DBpedia URL.

  1. If it is not already loaded, select the cinema_1-default pipeline in the Index Pipeline section.

  2. From the Add a Stage menu, select the Regex Field Extraction stage.

    The Regex Field Extraction configuration screen appears.

  3. Under Regex Rules, click the green add (+) icon.

  4. In the new line, select the edit button under Source Fields.

    The Source Fields window appears.

  5. Click the green add (+) icon.

  6. Enter "DBpediaURL_s".

  7. Close the Source Fields window.

  8. Under Target Field, enter "title_txt".

    IndexWorkbench RegexRules

  9. Leave Write Mode as "append".

  10. Under Regex Pattern, enter the following:

    ^<http://dbpedia.org/resource/(.*)>$

    Based on this regex, the entire URL will be matched, with a capturing group that matches just the entry title.

  11. Scroll all the way to the right to find Regex Capture Group and enter "1".

  12. Click the Apply button.

  13. After you click Apply, the Simulated Results pane reflects the regex that you just defined in the new Regex Field Extraction pipeline stage.

    IndexWorkbench RegexRules NewField

(Since the Index Workbench uses a random sample from your data, you might see different field contents than those shown above.)

You can see that the new title_txt field is created, and its contents match the last segment of the DBpediaURL_s field.
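
To see what capture group 1 holds, the pattern from step 10 can be tried outside Fusion. This is a minimal sketch in plain JavaScript, assuming the stage's regex engine treats these simple constructs the same way; the `extractTitle` helper is hypothetical. The dots in "dbpedia.org" are left unescaped, as in the tutorial's pattern, so they match any character, which is harmless here.

```javascript
// The pattern from step 10, written as a JavaScript regex literal.
// Forward slashes are escaped only because "/" delimits the literal.
var pattern = /^<http:\/\/dbpedia.org\/resource\/(.*)>$/;

// Return capture group 1 (the entry title), or null when nothing matches.
function extractTitle(url) {
  var match = pattern.exec(url);
  return match === null ? null : match[1];
}

console.log(extractTitle("<http://dbpedia.org/resource/%C3%81lvaro_de_la_Quadra>"));
// prints: %C3%81lvaro_de_la_Quadra

console.log(extractTitle("http://dbpedia.org/resource/Star_Wars"));
// prints: null (the surrounding angle brackets are missing)
```

Because the pattern is anchored with ^ and $, a value that lacks the surrounding angle brackets does not match at all.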

But it’s not formatted as an end user would expect. We need to change the underscore characters to spaces. We can also see that non-ASCII characters are URI-encoded (percent-encoded). For example, in the above screenshot, the string "%C3%81lvaro_de_la_Quadra" would be more readable to end users as "Álvaro de la Quadra".
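
The two fixes can be sketched as one small function in plain JavaScript. The `cleanTitle` helper and the "100%_Pure" sample value are hypothetical. The try/catch is a defensive assumption on our part: decodeURIComponent throws a URIError when a value contains a malformed escape sequence, and falling back to the undecoded text is one reasonable way to handle that.

```javascript
// Decode percent-encoded characters, then turn underscores into spaces.
// If the value contains a malformed escape sequence, decodeURIComponent
// throws a URIError; in that case fall back to the undecoded text.
function cleanTitle(raw) {
  var decoded;
  try {
    decoded = decodeURIComponent(raw);
  } catch (e) {
    decoded = raw;
  }
  return decoded.replace(/_/g, " ");
}

console.log(cleanTitle("%C3%81lvaro_de_la_Quadra")); // prints: Álvaro de la Quadra
console.log(cleanTitle("100%_Pure"));                // prints: 100% Pure
```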

Using JavaScript to modify field contents

For a custom string processing task like this one, we need to add a JavaScript index pipeline stage.

  1. Click the Add new stage button in the Index Pipeline configuration again.

  2. Select JavaScript from the advanced stages.

    The JavaScript stage configuration screen appears.

  3. In the Script Body field, paste the following:

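    function(doc) {
      var orig = "";
      var decoded = "";
      var clean = "";
      // Only process documents where the regex stage produced a title.
      if (doc.hasField("title_txt")) {
        orig = doc.getFirstFieldValue("title_txt");
        // Decode URI-encoded characters, e.g. "%C3%81" becomes "Á".
        decoded = decodeURIComponent(orig);
        // Replace every underscore with a space.
        clean = decoded.replace(/_/g, " ");
        doc.setField("title_txt", clean);
      }
      return doc;
    }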

    It will look like this:

    IndexWorkbench Javascript

  4. Click the Apply button.

  5. In the Stages list, re-order the stages so that the JavaScript stage is after the Regex Field Extraction stage.

    IndexWorkbench ArrangeIndexingStages

    The results simulator automatically re-loads the index preview using the modified pipeline:

    simulator results

  6. Click Save pipeline and start import.

    The Save Pipeline window appears.

  7. Verify that Save over existing pipeline is selected.

  8. Click Save pipeline.

  9. Click Go to datasource.
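
If you want to check the script's logic before pasting it into the stage, it can be exercised outside Fusion with a stand-in document. This is a sketch under stated assumptions: `MockDoc` is a hypothetical object that mimics only the three methods the script calls (hasField, getFirstFieldValue, setField), and it says nothing about how Fusion's real pipeline document behaves beyond that.

```javascript
// Hypothetical stand-in for a pipeline document. It implements only the
// three methods that the script in step 3 calls, with one value per field.
function MockDoc(fields) {
  this.fields = fields;
}
MockDoc.prototype.hasField = function (name) {
  return Object.prototype.hasOwnProperty.call(this.fields, name);
};
MockDoc.prototype.getFirstFieldValue = function (name) {
  return this.fields[name];
};
MockDoc.prototype.setField = function (name, value) {
  this.fields[name] = value;
};

// Same logic as the script in step 3, bound to a name so it can be called.
// A global regex replace is used so the sketch runs on any JavaScript engine.
var titleStage = function (doc) {
  if (doc.hasField("title_txt")) {
    var orig = doc.getFirstFieldValue("title_txt");
    var decoded = decodeURIComponent(orig);
    doc.setField("title_txt", decoded.replace(/_/g, " "));
  }
  return doc;
};

var withTitle = titleStage(new MockDoc({ title_txt: "%C3%81lvaro_de_la_Quadra" }));
console.log(withTitle.getFirstFieldValue("title_txt")); // prints: Álvaro de la Quadra

// A document without title_txt passes through unchanged.
var withoutTitle = titleStage(new MockDoc({ DBpediaURL_s: "not a DBpedia URL" }));
console.log(withoutTitle.hasField("title_txt")); // prints: false
```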

Re-index the dataset

In order to apply our index pipeline changes to the dataset, it’s necessary to clear the datasource and re-run the ingest job.

  1. Navigate to the Home menu.

  2. Select Datasources.

  3. Click Clear Datasource.

    You will be prompted to confirm that you really want to do this. Clearing a datasource cannot be undone.

  4. Click Yes, clear.

  5. Verify that the collection is now empty:

    1. Navigate back to Home > Index Workbench.

    2. Notice that the default wildcard search for the cinema_1 collection now returns no results in the Simulator.

      The collection is empty.

  6. Navigate to Home > Datasources.

  7. Select the "ds1" datasource.

  8. Click Start Crawl.

  9. When the Status indicator displays "Finished", click Job History.

  10. Click the most recent job to display its details.

  11. Verify that the output field contains the correct number: output: 155298

  12. Click Close to exit the job history window.

Search over entry titles

Now let’s test the search results using our new title_txt field.

  1. Navigate to Home > Query Workbench.

  2. Enter title_txt:"Star Wars" in the search box.

search title field phrase star wars
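
The query in step 2 combines a field name with a quoted phrase. If your search app builds such queries from user input, quotes inside the phrase need escaping. The sketch below assumes the query is handled by a Lucene-style query parser, where a backslash escapes special characters; the `phraseQuery` helper is hypothetical and not part of Fusion.

```javascript
// Build a fielded phrase query such as title_txt:"Star Wars".
// Backslashes and double quotes inside the phrase are backslash-escaped so
// that they cannot end the quoted phrase early.
function phraseQuery(field, phrase) {
  var escaped = phrase.replace(/(["\\])/g, "\\$1");
  return field + ':"' + escaped + '"';
}

console.log(phraseQuery("title_txt", "Star Wars")); // prints: title_txt:"Star Wars"
```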

We’re getting closer to a useful search experience for this dataset.

Lessons Learned

Improved indexing leads to better search. Adding a field containing the Wikipedia entry title to all documents in the collection allows users to search for articles by title.

  • Building Index Pipelines

    • Using regular expressions to extract embedded information from existing fields in a document

    • Using custom JavaScript processing to modify field contents

  • Pipeline Simulator

  • Workflow - clear collection, re-index data

  • Search UI - field search, phrase search

Next Steps

Part Three of this tutorial shows how to apply logic in the Fusion Search UI to a custom search pipeline for use by your search app.