From Ingest to Search:
Part Two - Better Indexing

Before you begin, be sure to complete Part One.

In this part of the tutorial, we’ll use Fusion to extract more information from the dataset and use it to improve the search experience. Better search starts with better indexing.

In Part One, we configured a collection called "cinema_1", ingested the cinema-short-abstracts.csv file, and explored the indexed data, all using the default configurations. Part One showed us that the default search experience is suboptimal for users who don’t already know the shape of the dataset. The defaults are a foundation on which to build well-tuned search applications customized for your unique dataset.

Better search through better indexing

Our search for "Star Wars" in Part One implied a search by title. The default search results couldn’t be ranked by title because no title information was stored in the index. However, the index does include a DBpediaURL_s field. Here are some examples of its values:

<http://dbpedia.org/resource/Star_Wars:_Clone_Wars_(2003_TV_series)>
<http://dbpedia.org/resource/Star_Wars:_Droids>
<http://dbpedia.org/resource/Star_Wars:_Ewoks>

Notice that the URLs directly encode the entry title. Fusion can extract the entry title from these values and add it into the indexed Solr document to support search by title over the documents in the collection. The most efficient way to do this is to add processing stages to the index pipeline.

This part of the tutorial shows you how to add and configure index pipeline stages which extract information present in an existing field and store the result in a new field. We’ll add two stages:

  • The Regex Field Extraction stage will extract the title string from the DBpediaURL_s field and add it as a new field called title_txt.

  • The JavaScript stage will reformat the contents of the title_txt field for easier searching and better display.

Using an index pipeline to manipulate fields

Working with an Index Pipeline

The index pipeline development workflow goes like this:

  1. The Search UI shows you what your indexed data currently looks like.

    Use it to determine whether your data needs modification in order to be useful.

  2. The Index Pipeline Simulator shows you a preview of how your data will be indexed, while you reconfigure the index pipeline.

    When you save your changes here, they’re only applied to incoming data. Existing data is not affected until you re-index it.

  3. The Datasource configuration page is where you re-index your data.

    You must always re-index your data in order to apply an updated index pipeline.

  4. Return to the Search UI to verify that your index pipeline changes produced the desired results.

Index pipeline workflow

You can also view and configure your index pipelines by navigating to Home > Index Pipelines, though you won’t get a preview of the output until you go to Home > Index Pipeline Simulator. Regardless of which tool you use to develop your pipeline, the data in your index will not be modified until you re-index it.

The default index pipeline for any collection is named after the collection, with "-default" appended. So, the default pipeline for the collection we created in Part One is called cinema_1-default. A default index pipeline consists of just two stages:

  • The Field Mapping stage provides field-level document cleanup. Fields can be renamed, combined, or deleted from the document.

  • The Solr Indexer stage sends the document to Solr for indexing into the collection.

default index pipeline initial config

Now we’ll begin modifying the pipeline to get search results that include a title field.

Using regular expressions to create new fields from (parts of) existing ones

Our pipeline needs a Regex Field Extraction stage to take the contents of a source field, match it against a regular expression, and copy any matches into a target field. We’ll craft a regular expression which will extract the title string from the DBpedia URL.

  1. Navigate to Home > Index Pipelines, if you haven’t already.

  2. Select the cinema_1-default pipeline.

  3. From the Add a new pipeline stage menu, select Regex Field Extraction.

    The Regex Field Extraction configuration screen appears.

  4. Under Regex Rules, click the green add (+) icon.

  5. Under Source Fields, click the edit icon.

    The Source Fields window appears.

  6. Click the green add (+) icon.

  7. Enter "DBpediaURL_s".

  8. Close the Source Fields window.

  9. Under Target Field, enter "title_txt".

  10. Under Regex Pattern, enter the following:

    ^<http://dbpedia.org/resource/(.*)>$

    This regex matches the entire URL, with a capturing group that matches just the entry title.

  11. Under Regex Capture Group, enter "1".

    You may need to scroll to the right in order to see the Regex Capture Group field.

  12. Click the Save button.
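
The effect of this rule can be tried outside Fusion. The following is a minimal sketch in plain JavaScript, not Fusion's own implementation: it applies the same pattern to the sample DBpediaURL_s values shown earlier and keeps capture group 1. The helper name extractTitle is hypothetical, and the sketch assumes that this particular pattern behaves the same way in JavaScript as in the regex engine Fusion uses.

```javascript
// Pattern copied verbatim from the Regex Pattern setting in step 10.
var TITLE_PATTERN = new RegExp("^<http://dbpedia.org/resource/(.*)>$");

// Hypothetical helper (not a Fusion API): returns capture group 1,
// or null when the value does not match the pattern at all.
function extractTitle(dbpediaUrl) {
  var match = TITLE_PATTERN.exec(dbpediaUrl);
  return match ? match[1] : null;
}

// Sample field values taken from the tutorial text.
var samples = [
  "<http://dbpedia.org/resource/Star_Wars:_Clone_Wars_(2003_TV_series)>",
  "<http://dbpedia.org/resource/Star_Wars:_Droids>",
  "<http://dbpedia.org/resource/Star_Wars:_Ewoks>"
];

var titles = samples.map(extractTitle);
console.log(titles);
```

Each extracted value still contains underscores, which is why a second stage is needed to clean it up.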

Using the Pipeline Simulator to Preview Results

Our data remains unchanged because we haven’t re-indexed it. But we can use the Index Pipeline Simulator to preview the result of the change we made above.

  1. Click Home > Index Pipeline Simulator.

    simulator initial display

  2. Select your "ds1" datasource.

  3. Click Start Simulation.

    Check the results:

    simulator results

    (Since the Pipeline Simulator uses a random sample from your data, you might see different field contents than those shown above.)

You can see that the new title_txt field is created, and its contents match the last segment of the DBpediaURL_s field.

But it’s not formatted as an end user would expect. We need to change underscore characters to spaces. We can also see that non-ASCII characters are URI-encoded. For example, in the above screenshot, the string "%C3%81lvaro_de_la_Quadra" would be more readable to end users as "Álvaro de la Quadra".
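
As a quick illustration of the cleanup we want, here is a small standalone sketch in plain JavaScript. The helper name prettifyTitle is hypothetical and is not part of Fusion; it only shows the two string operations involved, using the example value from the paragraph above.

```javascript
// Hypothetical helper (not a Fusion API): decode URI escapes first,
// then replace every underscore with a space.
function prettifyTitle(raw) {
  var decoded = decodeURIComponent(raw);
  return decoded.replace(/_/g, " ");
}

var pretty = prettifyTitle("%C3%81lvaro_de_la_Quadra");
console.log(pretty);
```

Note that decodeURIComponent throws a URIError when it meets a malformed escape sequence, so a production script may want to guard against that case.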

Using JavaScript to modify field contents

For a custom string processing task like this one, we need to add a JavaScript index pipeline stage. We can do this in the Index Pipeline Simulator.

  1. Click the Add new stage button.

  2. Select JavaScript.

    The JavaScript stage configuration screen appears.

  3. In the Script Body field, paste the following:

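    function(doc) {
      // Leave documents without a title_txt field untouched.
      if (doc.hasField("title_txt")) {
        var orig = doc.getFirstFieldValue("title_txt");
        // Turn URI escapes such as %C3%81 back into the characters they encode.
        var decoded = decodeURIComponent(orig);
        // Replace every underscore with a space. A global regex replace
        // works in any JavaScript engine.
        var clean = decoded.replace(/_/g, " ");
        doc.setField("title_txt", clean);
      }
      return doc;
    }

  4. Click the Save button.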

  5. In the Stages list, re-order the stages so that the JavaScript stage is after the Regex Field Extraction stage.

    The Pipeline Simulator automatically re-loads the index preview using the modified pipeline:

    simulator results

  6. Click Save pipeline and start import.

    The Save Pipeline window appears.

  7. Verify that Save over existing pipeline is selected.

  8. Click Save pipeline.

  9. Click Go to datasource.
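
Step 5 matters because the JavaScript stage only acts on documents that already have a title_txt field. The sketch below models this with plain JavaScript. Everything in it is a stand-in for illustration: MockDoc, regexStage, javascriptStage, and runPipeline are hypothetical names, and MockDoc mimics only the three document methods used by the script in step 3, not the real Fusion document object.

```javascript
// Minimal stand-in for a pipeline document (assumption: only these three
// methods are mimicked; the real Fusion object offers more).
function MockDoc(fields) {
  this.fields = fields;
}
MockDoc.prototype.hasField = function (name) {
  return Object.prototype.hasOwnProperty.call(this.fields, name);
};
MockDoc.prototype.getFirstFieldValue = function (name) {
  return this.fields[name];
};
MockDoc.prototype.setField = function (name, value) {
  this.fields[name] = value;
};

// Stand-in for the Regex Field Extraction stage configured earlier.
function regexStage(doc) {
  var pattern = new RegExp("^<http://dbpedia.org/resource/(.*)>$");
  var match = pattern.exec(doc.getFirstFieldValue("DBpediaURL_s"));
  if (match) {
    doc.setField("title_txt", match[1]);
  }
  return doc;
}

// Same logic as the JavaScript stage script from step 3.
function javascriptStage(doc) {
  if (doc.hasField("title_txt")) {
    var decoded = decodeURIComponent(doc.getFirstFieldValue("title_txt"));
    doc.setField("title_txt", decoded.replace(/_/g, " "));
  }
  return doc;
}

// Hypothetical runner: apply each stage in list order to one document.
function runPipeline(stages, fields) {
  var doc = new MockDoc(fields);
  stages.forEach(function (stage) {
    doc = stage(doc);
  });
  return doc.fields;
}

var url = "<http://dbpedia.org/resource/Star_Wars:_Droids>";
var correctOrder = runPipeline([regexStage, javascriptStage], { DBpediaURL_s: url });
var wrongOrder = runPipeline([javascriptStage, regexStage], { DBpediaURL_s: url });

console.log(correctOrder.title_txt); // cleaned title
console.log(wrongOrder.title_txt);   // still contains underscores
```

With the stages in the wrong order, the JavaScript stage runs before title_txt exists and does nothing, so the field keeps its raw, underscore-separated value.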

Re-index the dataset

In order to apply our index pipeline changes to the dataset, it’s necessary to clear the datasource and re-run the ingest job.

  1. In the datasource configuration window, click Clear Datasource.

    You will be prompted to confirm that you really want to do this. Clearing a datasource cannot be undone.

  2. Click Yes, clear.

  3. Verify that the collection is now empty:

    1. Navigate to Home > Search.

    2. Notice that the default wildcard search now returns no results.

      The collection is empty.

  4. Navigate to Home > Datasources.

  5. Select the "ds1" datasource.

  6. Click Start Crawl.

  7. When the Status indicator displays "Finished", click Job History.

  8. Click the most recent job to display its details.

  9. Verify that the output field contains the correct number: output: 155298

  10. Click Close to close the job history window.

Search over entry titles

Now let’s test the search results using our new title_txt field.

  1. Navigate to Home > Search.

  2. Enter title_txt:"Star Wars".

search title field phrase star wars

We’re getting closer to a useful search experience for this dataset.
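
The query above combines a field name with a quoted phrase. Assuming the standard Solr query syntax, the quotes are what keep both words tied to title_txt; without them, only the first word would be matched against that field. If your search app builds such queries from user input, a helper along these lines could assemble them. This is a sketch only, and fieldPhraseQuery is a hypothetical name, not a Fusion or Solr API.

```javascript
// Hypothetical helper: build a field-scoped phrase query string.
function fieldPhraseQuery(field, phrase) {
  // Escape backslashes and double quotes so user input cannot
  // close the phrase early.
  var escaped = phrase.replace(/(["\\])/g, "\\$1");
  return field + ':"' + escaped + '"';
}

var query = fieldPhraseQuery("title_txt", "Star Wars");
console.log(query);
```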

Lessons Learned

Improved indexing leads to better search. Adding a field containing the Wikipedia entry title to all documents in the collection allows users to search for articles by title.

  • Building Index Pipelines

    • Using regular expressions to extract embedded information from existing fields in a document

    • Using custom JavaScript processing to reformat field contents

  • Pipeline Simulator

  • Workflow - clear collection, re-index data

  • Search UI - field search, phrase search

Next Steps

Part Three of this tutorial shows how to use the Fusion Search UI to configure a custom search pipeline for use by your search app.