Getting Started with Fusion:
Part One - Getting Data In

Fusion’s Index Workbench provides the tools to configure datasources, parsers, and index pipelines, then preview the results before you load your data into the index.

This tool first sets up the necessary data extraction configuration, then retrieves a small number of documents as sample data against which to test and refine your index pipeline.

The Index Workbench only simulates processing against the sample data; no actual data ingestion takes place. Once you have a complete configuration, the Index Workbench saves it as a Fusion datasource. To load your data into Fusion, you use the Fusion Datasource tool to run the resulting configuration.

Part 1 takes you through configuring a datasource using the Index Workbench. In Part 2, we’ll load the data into Fusion and view it using the Query Workbench.

Before you begin

  1. Download Fusion.

  2. Unpack the archive: tar -xf fusion-3.1.x.tar.gz

  3. Start Fusion: fusion/3.1.x/bin/fusion start

  4. Go to http://localhost:8764/.

  5. Download the dataset.

    This is a MovieLens dataset created by the GroupLens research lab.

  6. Unpack the ml-latest-small.zip file.

    Fusion can parse .zip files, but for simplicity we’ll index just one file from the archive (movies.csv).

The movies.csv file contains a list of 9,125 movie titles, plus a header row. Here is a truncated listing:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
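
To get a feel for the file's shape before loading it, the listing above can be read with Python's standard csv module. This is a minimal sketch that runs outside Fusion; the in-memory `sample` string is an illustrative stand-in for movies.csv and is not part of the tutorial's workflow.

```python
import csv
import io

# A few rows copied from the listing above; in practice you would open movies.csv.
sample = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# DictReader uses the header row for field names.
rows = list(csv.DictReader(io.StringIO(sample)))

for row in rows:
    # A CSV parser sees genres as one string; the pipes are not CSV delimiters.
    print(row["movieId"], row["title"], row["genres"], sep=" | ")
```

Note that each row yields a single genres string, which is why the tutorial later splits that field explicitly.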

1. Create a collection

  1. Log in to Fusion as the admin user.

    The Launcher appears, showing you the available workflow contexts:

    [Screenshot: Launcher]

    Note
    If this is your first login, the Quickstart appears. Dismiss it by clicking Exit the Quickstart.
  2. From the pull-down menu in the upper left, select Manage Collections:

    [Screenshot: Manage Collections]

    The Collections Manager appears.

  3. Click the New button:

    [Screenshot: Collections Manager]

  4. For your collection name, enter "ml-movies".

  5. Click Save collection.

    [Screenshot: Collections Manager, saving the ml-movies collection]

2. Configure the datasource

A collection includes one or more datasources. A datasource is a configuration that manages the import, parsing, and indexing of data into the collection. We’ll use the Index Workbench to configure a datasource for our movie data.

  1. In the upper left, click the Launcher button and select Search.

    Make sure that "ml-movies" is selected in the collections menu at the top of the screen.

  2. In the Home menu, click Index Workbench.

    The Index Workbench appears. Initially, no data preview is displayed because no datasource has been configured. When we configure a datasource, Fusion will sample the data and display a preview of how it would be formatted in the index using the default parsing and index pipeline configurations.

  3. Click New.

    [Screenshot: Index Workbench, adding a datasource]

  4. Select Or, upload a file.

  5. Click Choose File.

  6. Navigate to the movies.csv file and select it.

  7. Click Add New Datasource.

    The Fileupload datasource configuration panel appears, with a default datasource ID and a file ID that should match the movies.csv filename:

    [Screenshot: Index Workbench, movies datasource configuration]

  8. Click Apply.

    The Index Workbench reads up to 20 documents into memory from the movies.csv file, then displays a preview of how they would be indexed:

    [Screenshot: Index Workbench preview with the default configuration]

    In the lower right, you can select the number of documents to preview.

3. Analyze the default output

  1. Notice that Fusion has already made some assumptions about two of our original fields:

    • genres became genres_t (the "text_general" field type) and genres_s (the "string" field type). String fields are useful for faceting and sorting, while text fields are for full-text search. At this point, Fusion doesn’t know whether we intend to use this field for faceting and sorting, for full-text search, or for both.

    • title became title_t and title_s for the same reason.

    These fields are created by the Solr Dynamic Field Mapping stage in the default index pipeline. This stage attempts to automatically detect field types, and renames the fields accordingly. For this tutorial, we’ll manually configure our fields instead.

  2. Turn off the Solr Dynamic Field Mapping stage by clicking the green circle next to it.

    Our data’s original fields reappear: genres, movieId, and title.

4. Configure the pipeline

First we’ll configure the field mappings in our index pipeline so that each field has the correct data type. Then we’ll split the genres field into multiple values so that each value can be used as a facet in Part 2 of this tutorial.

4.1. Configure field mappings

The mapping from CSV fields to Fusion document field types can be controlled by field name conventions. Fusion uses field name suffixes to determine field types. When a field name has no suffix, Fusion stores it as a string field and treats it as an unanalyzed whole.

For more precise analysis and search, most fields need suffixes to indicate their specific types. We’ll see how this relates to the fields in our dataset.
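
The suffixes that appear in this tutorial can be summarized as a small lookup. This is an illustrative sketch only: the `SUFFIX_TYPES` table and the `describe_field` helper are hypothetical names, the descriptions paraphrase this tutorial, and other suffixes exist that are not listed here.

```python
# Suffix meanings as described in this tutorial (not an exhaustive list).
SUFFIX_TYPES = {
    "_s": "string",
    "_ss": "multi-valued string",
    "_t": "text_general",
    "_txt": "full-text searchable text",
    "_ti": "integer",
}


def describe_field(name):
    """Return the type implied by a field name's suffix (hypothetical helper)."""
    for suffix, field_type in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return field_type
    # Per the text above, an unsuffixed field is stored as an unanalyzed string.
    return "string (no suffix)"


for field in ["genres_ss", "title_txt", "year_ti", "movieId"]:
    print(field, "->", describe_field(field))
```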

  1. Click Field Mapping to open the Field Mapping stage configuration panel.

  2. Click the Add button to create a new field mapping rule.

  3. Under Source Field, enter genres.

  4. Under Target Field, enter genres_ss.

    The field suffix _ss means that this field is a multi-valued string field.

    Note
    Notice that Fusion currently interprets this field as having a single value. We can see that it’s actually a pipe-delimited array of values. We’ll fix this after we finish configuring our field mappings.
  5. Under Operation, select "move".

    The "move" action means that the resulting document no longer has a genres field, only genres_ss.

    [Screenshot: field mapping rule for genres]

  6. Click Apply.

    Applying the new configuration re-runs the simulation and updates the contents of the preview panel:

    [Screenshots: preview before and after the genres mapping]

  7. Add more field mapping rules as follows:

    • The movieId field is a unique document identifier and should be copied into the document’s "id" field.

    • The title should be searchable as a text field, so we move it to a field with suffix "_txt".

    Your field mappings should look like this:

    [Screenshot: completed field mappings]

  8. Click Apply.

    Once we have specified these explicit mapping rules, we can browse the resulting document in the preview panel to check our work.

    [Screenshots: preview before and after the id and title mappings]

Now our document ID is more useful, and our movie titles are full-text searchable.
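
The effect of the three mapping rules can be sketched in plain Python. This is a simplified model under stated assumptions, not Fusion's implementation: `apply_mappings` is a hypothetical helper, and documents are represented as ordinary dictionaries.

```python
def apply_mappings(doc, rules):
    """Apply (source, target, operation) rules to a copy of doc."""
    result = dict(doc)
    for source, target, operation in rules:
        if source not in result:
            continue
        result[target] = result[source]
        if operation == "move":
            # "move" drops the source field; "copy" keeps it.
            del result[source]
    return result


# The rules configured in this section.
rules = [
    ("genres", "genres_ss", "move"),
    ("movieId", "id", "copy"),
    ("title", "title_txt", "move"),
]

doc = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}

mapped = apply_mappings(doc, rules)
print(mapped)
```

After mapping, the sketch's document keeps movieId (because it was copied), gains id, genres_ss, and title_txt, and no longer has genres or title.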

Tip
Because the input documents in this tutorial are simple, with a fixed number of known fields, it is easy to configure the Field Mapping stage to ensure the correct document structure for Fusion. For situations where the documents have large numbers of fields, the Solr Dynamic Field Mapping stage can minimize the work required to configure the pipeline.

4.2. Split a multi-value field

As we noticed above, the genres_ss field has been parsed as a single-value field, but we can see that it’s really a pipe-delimited array of values. To split this field into its constituent values, we’ll add a Regex Field Extraction stage to our pipeline. This stage uses regular expressions to extract data from specific fields. It can append or overwrite existing fields with the extracted data, or use the data to populate new fields.

  1. Click Add a stage.

  2. Scroll down and select Regex Field Extraction.

    The Regex Field Extraction stage configuration panel appears.

  3. Under Regex Rules, click the Add icon.

  4. Under Source Fields, click the Edit icon.

    [Screenshot: Source Fields edit icon]

    The Source Fields window opens.

  5. Click the Add icon.

  6. Enter "genres_ss".

  7. Click Apply to close the Source Fields window.

  8. Under Target Field, enter "genres_ss".

  9. In the Write Mode field, select "overwrite".

  10. In the Regex Pattern field, enter this expression:

    [^|\s][^\|]*[^|\s]*
    Tip
    You may need to scroll horizontally to see this field.
  11. In the Return If No Match field, select "input_string".

  12. Click Apply.

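    Initially, our data doesn’t change, because the new stage runs before the Field Mapping stage, at which point the genres_ss field doesn’t exist yet.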

  13. In the pipeline, drag the Regex Field Extraction stage down so that it comes after the Field Mapping stage:

    [Screenshot: pipeline stage reordering]

    Now the preview indicates multiple values for the genres_ss field:

    [Screenshots: preview before and after splitting genres_ss]

    Tip
    If the preview panel doesn’t update automatically, try selecting a different number of documents to view. This forces the preview to update.
  14. Expand the genres_ss field to view its values:

    [Screenshot: expanded genres_ss values]

These field values will be useful for faceting, which we’ll explore in Part 2 of this tutorial.
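
To see what the pattern from step 10 matches, it can be tried outside Fusion. The sketch below uses Python's re module, on the assumption that it treats this particular pattern the same way Fusion's regex engine does; `split_genres` is a hypothetical helper, and its fallback mirrors the "input_string" setting from step 11.

```python
import re

# The pattern from step 10; each match is one genre value.
GENRE_PATTERN = re.compile(r"[^|\s][^\|]*[^|\s]*")


def split_genres(value):
    """Return every match of the pattern, or the input itself if none match."""
    matches = GENRE_PATTERN.findall(value)
    # Mirrors "Return If No Match" = "input_string": keep the input unchanged.
    return matches if matches else [value]


print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
print(split_genres("Comedy"))
```

Each match starts at a character that is neither a pipe nor whitespace and runs up to the next pipe, so every genre becomes its own value.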

4.3. Create a new field from part of an existing one

Notice that the title_txt field also contains the year in which the movie was released. Instead of including the year in our full-text search field, it would be more useful as a separate field that we can use for faceting. This is another job for the Regex Field Extraction stage.

  1. In the Regex Field Extraction configuration panel, under Regex Rules, click the Add icon.

  2. In the new line, click the edit button under Source Fields.

    The Source Fields window appears.

  3. Click the Add icon.

  4. Enter "title_txt".

  5. Close the Source Fields window.

  6. Under Target Field, enter "year_ti".

    The _ti suffix indicates an integer value. Fusion will create this new field whenever the regular expression matches.

    Tip
    When you use the Regex Field Extraction stage to create a new field, the value of Write Mode makes no difference.
  7. In the Regex Pattern field, enter this expression to match the digits inside the parentheses at the end of the title_txt value:

    \(([0-9]+)\)$
  8. In the Regex Capture Group field, enter "1".

    Tip
    Scroll all the way to the right to see this field.
  9. Click Apply.

    Now the preview includes the new year_ti field:

    [Screenshots: preview before and after adding year_ti]
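
The year extraction rule can be checked the same way. This is a minimal sketch using Python's re module rather than Fusion itself; `extract_year` is a hypothetical helper, and converting the captured digits to an integer is an assumption that stands in for the _ti field type.

```python
import re

# The pattern from step 7: digits inside parentheses at the end of the value.
YEAR_PATTERN = re.compile(r"\(([0-9]+)\)$")


def extract_year(title):
    """Return the trailing year as an int, or None when the pattern doesn't match."""
    match = YEAR_PATTERN.search(title)
    # Capture group 1 holds only the digits, without the parentheses.
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("Untitled"))
```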

4.4. Trim a field’s value

The title_txt field still includes the year of the film’s release, which we’ve extracted into its own field, year_ti. Let’s trim that information from the title_txt values so that only the title text remains.

  1. In the Regex Field Extraction configuration panel, under Regex Rules, click the Add icon.

  2. In the new line, click the edit button under Source Fields.

    The Source Fields window appears.

  3. Click the Add icon.

  4. Enter "title_txt".

  5. Close the Source Fields window.

  6. Under Target Field, enter "title_txt".

  7. In the Write Mode field, select "overwrite".

  8. In the Regex Pattern field, enter this expression to match the title text that precedes the parenthesized year at the end of the title_txt value:

    ^(.+)\s\(([0-9]+)\)$
  9. In the Regex Capture Group field, enter "1".

  10. Click Apply.

    Now the preview pane shows the title_txt field with only the title string:

    [Screenshots: preview before and after trimming title_txt]
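
The trimming rule works because capture group 1 holds everything before the final space and parenthesized year. Here is a minimal sketch in Python's re module; `trim_title` is a hypothetical helper, and keeping the original value when the pattern doesn't match is an assumption, since this section doesn't set Return If No Match.

```python
import re

# The pattern from step 8: group 1 is the title, group 2 is the year.
TITLE_PATTERN = re.compile(r"^(.+)\s\(([0-9]+)\)$")


def trim_title(title):
    """Return the title without its trailing year."""
    match = TITLE_PATTERN.search(title)
    # Assumption: leave the value untouched if there is no trailing year.
    return match.group(1) if match else title


print(trim_title("Toy Story (1995)"))
print(trim_title("Father of the Bride Part II (1995)"))
```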

5. Save the configuration and run the datasource

Now we have a correctly configured processing pipeline appropriate to our data. We are ready to finalize our changes and index the data.

  1. Click the Save button in the upper right.

    This saves the new index pipeline configuration.

  2. Click Start job in the upper left.

    This launches a datasource job that imports and indexes the complete contents of our movies.csv file, using the configuration we just saved.

6. Inspect the collection fields

To verify that we have correctly ingested the data, we use the Fields Manager tool, which lists all document fields in the Fusion collection. Before you do this, make sure that the status of the index job is "finished".

  1. Navigate to Home > Fields Manager.

    We see that the collection contains the correct number of entries (9,125):

    [Screenshot: Fields Manager for ml-movies]

  2. In the search field, enter "genres".

    Fields are grouped by their suffixes, so you’ll find this field in the *_ss group. Click the arrow to expand the group and view the genres_ss field.

    [Screenshot: Fields Manager, genres_ss field]

Tip
If you find that your data is incorrectly indexed, you can navigate to Home > Datasources, click the datasource name, then click Clear Datasource. The data and its index history are removed from the index so that you can re-index it.

Fusion only re-indexes data that is not found in the index history. In other words, Fusion won’t overwrite indexed data; it will only re-write it after you clear the datasource.

What’s next

Now we have 9,125 movie listings from the MovieLens database in Fusion’s index, customized to indicate the data type for each field. We also split a multi-valued field so that its values can be treated individually.

Let’s compare our original data with the indexed data:

[Screenshots: original data before configuration and indexed data after]

In Part 2 we’ll use the Query Workbench to get search results from our collection and configure the query pipeline that customizes those results. We’ll add faceting using the genres_ss and year_ti fields so that users can easily filter their search results.