Get Started with Fusion Server:
Get Data In

Fusion’s Index Workbench provides the tools to configure datasources, parsers, and index pipelines. It lets you preview the results of indexing before you load your data into the actual index.

Index Workbench first sets up the necessary data extraction configuration, and then retrieves a small number of documents as sample data. You can use the sample documents to test and refine your index pipeline. All processing is simulated processing of the test data. No actual data ingestion takes place.

After you have a complete configuration, Index Workbench saves this as a Fusion datasource. To load your data into Fusion, use the Fusion Datasource tool to run the resulting configuration.

Part 2 takes you through configuring a datasource using Index Workbench. In Part 3, you’ll load the data into Fusion and view it using Query Workbench.

Before you begin

To proceed with this part of the tutorial, you must first complete Part 1, which gives you a running instance of Fusion and a Fusion app.

1. Download the MovieLens dataset

  1. Download the dataset.

    This is a MovieLens dataset created by the GroupLens research lab.

  2. Unpack the ml-latest-small.zip file.

    Fusion can parse .zip files, but for simplicity we’ll index just one file from the archive (movies.csv).

The movies.csv file contains a list of 9,125 movie titles, plus a header row. Here is a truncated listing:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
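
Before you index the file, you can sanity-check its structure outside Fusion. The following is a minimal sketch that uses Python’s standard csv module on the first rows of the listing above. The names SAMPLE and load_movies are illustrative and are not part of Fusion.

```python
import csv
import io

# First rows of movies.csv, copied from the listing above.
SAMPLE = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""


def load_movies(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


movies = load_movies(SAMPLE)
print(len(movies))         # number of data rows; the header row is not counted
print(movies[0]["title"])  # title of the first movie
```

To check the whole file instead of the sample, read movies.csv from disk and pass its text to the same helper; the row count should then match the number of movie titles stated above.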

2. Open the Movie Search app

If the Fusion UI isn’t already open, then open it.

  1. In a browser window, open localhost:8764.

  2. Enter the password for the user admin, and then click Log in.

    Welcome

    The Fusion launcher appears. You see the Movie Search app that you created in Part 1:

    Movie Search app in launcher

  3. In the Fusion launcher, click the Movie Search app.

    The Fusion workspace appears. It has controls along the left and top sides.

    Fusion workspace

  4. In the upper right, hover over Apps. You can see that Movie Search is the currently selected app.

    Also, the user collection Movie_Search is selected in the collection picker. This is the default collection for the Movie Search app, and where Fusion will place index data.

3. Configure the datasource

A collection includes one or more datasources. A datasource is a configuration that manages the import, parsing, and indexing of data into a collection. You’ll use Index Workbench to configure a datasource for the movie data.

  1. In the collection picker, verify that the collection Movie_Search is selected.

    Collection Movie_Search is selected

  2. Open Index Workbench. Navigate to Indexing > Index Workbench.

    Initially, no data preview appears because no datasource has been configured. When you configure a datasource, Fusion samples the data and displays a preview of how it would be formatted in the index using the default parsing and index pipeline configurations.

  3. In the upper right, click New.

  4. Select Or, upload a file.

  5. Click Choose File.

  6. Navigate to the movies.csv file, select it, and then click Open.

    New datasource

  7. Click Add New Datasource.

    The Datasource (File Upload) configuration panel appears, with the default datasource ID movies_csv-Movie_Search and the default file ID movies.csv. These default values are fine.

  8. Enter the Description Movies CSV file.

    Configure datasource

  9. Click Apply.

    Index Workbench reads up to 20 documents into memory from the movies.csv file, and then displays a preview of how they would be indexed.

    You have finished configuring the datasource. At the bottom of the page, click Cancel.

    First preview of index

    In the lower right, you can select the number of documents to preview.

4. Analyze the default output

  1. Notice that Fusion made some assumptions about your original fields:

    • genres became genres_t (the text_general field type) and genres_s (the string field type). String fields are useful for faceting and sorting, while text fields are for full-text search. At this point, Fusion doesn’t know whether you intend to use this field for faceting and sorting, for full-text search, or for both.

    • title became title_t and title_s for the same reason.

    • movieId became movieId_t and movieId_s for the same reason. This might seem odd, because the original field contains numbers. At this stage, however, Fusion creates only text_general and string fields. To use the contents of this field as an integer, you would map the field to an integer field.

    You also see fields that begin with _lw. These fields contain data that Fusion creates for its own housekeeping. You can ignore them.

    The suffixed fields, such as genres_t and genres_s, are created by the Solr Dynamic Field Mapping stage in the default index pipeline. This stage attempts to automatically detect field types, and renames fields accordingly. For this tutorial, you’ll manually configure the fields instead.

  2. Turn off the Solr Dynamic Field Mapping stage by clicking the green circle next to it.

    Your data’s original fields reappear: genres, movieId, and title.

    Stage disabled
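
The naming outcome described in step 1 can be sketched in a few lines of Python. This only mimics the field names shown in the preview (one _t and one _s field per original field); it is not how the Solr Dynamic Field Mapping stage is implemented, and the function name default_field_names is hypothetical.

```python
def default_field_names(doc):
    """Return a new dict in which each field appears as <name>_t and <name>_s."""
    out = {}
    for name, value in doc.items():
        out[f"{name}_t"] = value  # text_general: intended for full-text search
        out[f"{name}_s"] = value  # string: intended for faceting and sorting
    return out


preview = default_field_names({"movieId": "2", "title": "Jumanji (1995)"})
print(sorted(preview))  # field names as they would appear in the preview
```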

5. Configure the index pipeline

First you’ll configure the field mappings in the index pipeline so each field has the correct data type. Then you’ll split the genres field into multiple values so each value can be used as a facet in Part 3 of this tutorial.

5.1. Configure field mappings

Configure field mappings to control the field types of Fusion documents. Fusion uses field name suffixes to determine field types. When a field name has no suffix, Fusion stores it as a string field and treats it as an unanalyzed whole. For precise analysis and search, most fields need suffixes to indicate their specific types. You’ll see how this relates to the fields in the dataset.
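
As a quick reference, the suffix convention can be expressed as a lookup table. The sketch below covers only the suffixes that appear in this tutorial, with the field types the tutorial text gives for them; SUFFIX_TYPES and describe_field are illustrative names, not Fusion APIs.

```python
# Suffixes used in this tutorial, with the field types the text gives for them.
SUFFIX_TYPES = {
    "_ss": "multi-valued string",
    "_s": "string",
    "_t": "text_general",
    "_txt": "text",
    "_i": "pint (integer point)",
}


def describe_field(name):
    """Return the field type implied by a field name's suffix."""
    for suffix, field_type in SUFFIX_TYPES.items():
        if name.endswith(suffix):
            return field_type
    # No suffix from the table: the tutorial says such a field is stored as a string.
    return "string (no recognized suffix)"


for field in ("genres_ss", "title_txt", "year_i", "genres"):
    print(field, "->", describe_field(field))
```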

  1. In the list of index pipeline stages on the left, click Field Mapping to open the Field Mapping stage configuration panel.

  2. Click Add to create a new field mapping rule.

  3. Under Source Field, enter genres.

  4. Under Target Field, enter genres_ss.

    The field suffix _ss means that this field is a multi-valued string field.

    Note
    Fusion currently interprets this field as having a single value. You can see that the field actually contains a pipe-delimited array of values. You’ll fix this after you finish configuring field mappings.

  5. Under Operation, select move.

    The move operation means that the resulting document no longer has a genres field; it only has genres_ss.

    Field mapping of genres field

  6. Click Apply.

    Applying the new configuration re-runs the simulation and updates the contents of the preview panel. Notice the change in the field name from genres to genres_ss:

    Before: Simulation results 1
    After: Simulation results 2

  7. Click Add to add more field mapping rules as follows:

    • The movieId field is a unique document identifier. It should be copied into the document’s id field.

    • The title should be searchable as a text field, so you move it to the field title_txt.

    Your field mappings should look like this:

    All field mappings

  8. Click Apply.

    After you have specified these explicit field mapping rules, you can browse the resulting documents in the preview panel to check your work.

    Before: Simulation results 2
    After: Simulation results 3

  9. In the upper right, click Save. This saves your modified index pipeline. Get in the habit of saving your work as you go.

Now your document ID is more useful, and your movie titles are full-text searchable.
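
The effect of the three mapping rules on a single document can be sketched as follows. This is an illustration of the move and copy semantics described in the steps above, not Fusion’s Field Mapping stage; apply_mappings and RULES are hypothetical names.

```python
def apply_mappings(doc, rules):
    """Apply (source, target, operation) rules to a copy of doc."""
    out = dict(doc)
    for source, target, operation in rules:
        if source not in out:
            continue  # nothing to map for this document
        if operation == "move":
            out[target] = out.pop(source)  # the source field is removed
        elif operation == "copy":
            out[target] = out[source]      # the source field is kept
        else:
            raise ValueError(f"unsupported operation: {operation}")
    return out


# The three rules configured in this section.
RULES = [
    ("genres", "genres_ss", "move"),
    ("movieId", "id", "copy"),
    ("title", "title_txt", "move"),
]

doc = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}
mapped = apply_mappings(doc, RULES)
print(sorted(mapped))  # field names after mapping
```

Because movieId is copied rather than moved, the mapped document keeps movieId and gains id, while genres and title exist only under their new names.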

Tip
Because the input documents in this tutorial are simple documents with a fixed number of known fields, it’s easy to configure the Field Mapping stage to ensure the correct document structure for Fusion. When documents have large numbers of fields, the Solr Dynamic Field Mapping stage can reduce the work required to configure the index pipeline.

5.2. Split a multi-value field

The genres_ss field has been parsed as a single-value field, but you can see that it’s really a pipe-delimited array of values. To split this field into its constituent values, you’ll add a Regex Field Extraction stage to your index pipeline. This stage uses regular expressions to extract data from specific fields. It can append or overwrite existing fields with the extracted data, or use the data to populate new fields.

  1. Click Add a stage.

  2. Scroll down and select Regex Field Extraction.

    The Regex Field Extraction stage configuration panel appears.

    Regex Field Extraction stage

  3. Under Regex Rules, click Add.

  4. On the new line, hover over the […] under Source Fields, and then click Edit.

    The Source Fields window opens.

  5. Click Add.

  6. Enter genres_ss, and then click Apply.

  7. Under Target Field, enter genres_ss.

  8. In the Write Mode field, select overwrite.

  9. In the Regex Pattern field, enter this expression:

    [^|\s][^\|]*[^|\s]*

    Tip
    You might need to scroll horizontally to see this field.

    The first bracketed term in the regex matches one character that is neither a vertical bar nor whitespace. The second term matches any character that is not a vertical bar, zero or more times. The last term matches any character that is neither a vertical bar nor whitespace, zero or more times.

  10. In the Return If No Match field, select input_string.

  11. Click Apply.

    Initially, your data doesn’t change.

  12. In the list of index pipeline stages, drag the Regex Field Extraction stage down so that it comes after the Field Mapping stage:

    Index pipeline stage reordering

    Now the preview shows multiple values for the genres_ss field:

    Before: Simulation results 3
    After: Simulation results 4

    Tip
    If the preview panel doesn’t update automatically, select a different number of documents to view. This forces the preview to update.

  13. To view the values of the genres_ss field, expand it and the values under it by clicking the triangles:

    Simulation results 4 expanded

    These field values are useful for faceting, which you’ll explore in Part 3 of this tutorial.

  14. In the upper right, click Save. This saves your modified index pipeline.
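
You can try the pattern from step 9 outside Fusion to see how it splits a pipe-delimited value. The sketch below uses Python’s re module, whose behavior matches for this particular pattern. It assumes that the stage collects every match as a separate value, as the preview suggests, and it imitates the Return If No Match setting input_string by returning the input unchanged when nothing matches. The name split_genres is hypothetical.

```python
import re

# The pattern entered in the Regex Pattern field.
GENRE_PATTERN = re.compile(r"[^|\s][^\|]*[^|\s]*")


def split_genres(value):
    """Return every non-overlapping match, or the input itself if nothing matches."""
    matches = GENRE_PATTERN.findall(value)
    return matches if matches else [value]


print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
print(split_genres("Comedy"))
```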

5.3. Create a new field from part of an existing one

Notice that the title_txt field also contains the year in which the movie was released. Instead of including the year in your full-text search field, it would be more useful as a separate field that you can use for faceting. This is another job for the Regex Field Extraction stage.

  1. In the list of index pipeline stages, click Regex Field Extraction.

  2. In the Regex Field Extraction configuration panel, under Regex Rules, click Add.

  3. On the new line, hover over the […] under Source Fields, and then click Edit.

    The Source Fields window appears.

  4. Click Add.

  5. Enter title_txt, and then click Apply.

  6. Under Target Field, enter year_i.

    The _i suffix indicates an integer point field (specifically, that the field is a dynamic field with a pint field type). Fusion will create this new field whenever the regular expression matches the contents of the source field.

    Tip
    When you use the Regex Field Extraction stage to create a new field, the value of Write Mode makes no difference.

  7. In the Regex Pattern field, enter this expression to match the digits inside the parentheses at the end of the title_txt value:

    \(([0-9]+)\)$

  8. In the Regex Capture Group field, enter 1. This lets the index pipeline stage transfer the year into the year_i field.

    Tip
    Scroll all the way to the right to see this field.

  9. Click Apply.

    Now the preview includes the new year_i field:

    Before: Simulation results 4
    After: Simulation results 5

  10. In the upper right, click Save. This saves your modified index pipeline.
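
To see what capture group 1 of this pattern yields, you can run it against a few titles from the dataset. This is a minimal Python sketch, not the stage’s implementation; extract_year is a hypothetical name, and returning None when there is no match is a choice made for this sketch.

```python
import re

# The pattern entered in the Regex Pattern field; group 1 is the digits.
YEAR_PATTERN = re.compile(r"\(([0-9]+)\)$")


def extract_year(title):
    """Return the parenthesized number at the end of title as an int, or None."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("A title without a year"))
```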

5.4. Trim a field’s value

The title_txt field still includes the year of the film’s release, which you’ve extracted into its own field, year_i. Let’s trim that information from the title_txt values so that only the title text remains.

  1. In the list of index pipeline stages, click Regex Field Extraction.

  2. In the Regex Field Extraction configuration panel, under Regex Rules, click Add.

  3. On the new line, hover over Source Fields, and then click Edit.

    The Source Fields window appears.

  4. Click Add.

  5. Enter title_txt, and then click Apply.

  6. Under Target Field, enter title_txt.

  7. In the Write Mode field, select overwrite.

  8. In the Regex Pattern field, enter this expression. Its first capture group matches the title text that precedes the parenthesized year at the end of the title_txt value:

    ^(.+)\s\(([0-9]+)\)$

  9. In the Regex Capture Group field, enter 1.

  10. Click Apply.

    Now the preview pane shows the title_txt field with only the title string:

    Before: Simulation results 5
    After: Simulation results 6

  11. In the upper right, click Save. This saves your modified index pipeline.
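
The trimming pattern can be checked the same way. In this Python sketch, capture group 1 holds the title text and group 2 holds the year. The function name trim_title is hypothetical, and leaving the title unchanged when the pattern does not match is an assumption of the sketch, not a documented behavior of the stage.

```python
import re

# The pattern entered in the Regex Pattern field; group 1 is the title text.
TITLE_PATTERN = re.compile(r"^(.+)\s\(([0-9]+)\)$")


def trim_title(title):
    """Return the title without its trailing parenthesized year."""
    match = TITLE_PATTERN.search(title)
    return match.group(1) if match else title


print(trim_title("Toy Story (1995)"))
print(trim_title("Father of the Bride Part II (1995)"))
```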

6. Run the datasource job

Now you have a correctly configured index pipeline appropriate to your data. You’re ready to index the data.

  1. In the upper left, click Start job.

    Start job

    This launches a datasource job that imports and indexes the complete contents of your movies.csv file, using the configuration you just saved.

  2. Confirm that the datasource job has finished running.

    1. In the upper left, click System > Scheduler.

    2. Click the job movies_csv-Movie_Search. To the right, you should see the status Success, and the Last Runtime should be close to now.

      Scheduler shows job success

7. Confirm the indexing

To verify that Fusion has correctly imported and indexed the data, view the number of documents in the Movie_Search collection:

  1. Use the Collections Manager. Navigate to Collections > Collections Manager.

    You see that the collection Movie_Search contains the correct number of documents (9,125):

    Collections Manager

  2. Alternatively, use the Fields Manager tool, which lists all document fields in a Fusion collection. Navigate to Collections > Fields.

    In the Filter field, enter genres.

    Fields are grouped by their suffixes, so you’ll find this field in the *_ss group. Click the arrow to expand the group and view the genres_ss field. You can see that 9,125 documents have this field.

    Genres field

Tip
If you find that your data is incorrectly indexed, you can navigate to Indexing > Datasources, click the datasource name, and then click Clear Datasource. The data and its index history are removed from the index so that you can re-index it.

Fusion only re-indexes data that is not found in the index history. In other words, Fusion won’t overwrite indexed data; it will only re-write existing data after you clear the datasource.

8. Close panels you no longer need open

Fusion opens each new panel beside the panels that are already open. Close all open panels by clicking Close.

What’s next

Now you have 9,125 movie listings from the MovieLens database in Fusion’s index, customized to indicate the data type for each field. You also split a multi-valued field so that its values can be treated individually, created a new field to contain partial contents of a different field, and trimmed that content from the original field.

Let’s compare the initial indexing of your data with the indexing after field mappings and extractions:

Before: Simulation results 1
After: Simulation results 6

In Part 3, you’ll use Query Workbench to get search results from your collection and configure the query pipeline that customizes those results. You’ll add faceting using the genres_ss and year_i fields so that users can easily filter their search results.