Index Data - Getting Started with Managed Fusion

Table of Contents

General information
Before you begin
Download the MovieLens dataset
Open the Movie Search app
Configure the datasource
Analyze the default output
Configure the index pipeline
Run the datasource job
Close panels you no longer need open
Reindex the datasource
Summary
Next steps
Additional resources

This topic details how to configure a datasource using Index Workbench.

General information

Managed Fusion’s Index Workbench provides the tools to configure datasources, parsers, and index pipelines. It lets you preview the results of indexing before you load your data into the actual index.

When you enter the necessary data extraction configuration in Index Workbench, it retrieves a small number of documents as sample data.

Since this processing is simulated, and actual data is not yet ingested, you can preview the sample documents to test and refine the index pipeline before all of the data is loaded into the actual index.

When you complete and save the configuration, it is saved in Index Workbench as a Managed Fusion datasource. To load your data into Managed Fusion, use the Datasource tool to run the resulting configuration.

Before you begin

To perform the steps in this part of the tutorial, you must complete Part 1 - Create a Managed Fusion application.

Throughout these tutorials, it is important to save your work regularly. The steps include instructions to save, but you can save your work more frequently if needed. When you configure datasources, pipelines, and other settings on your own site, saving your changes regularly is essential.

Download the MovieLens dataset

Download the dataset.

This is a MovieLens dataset created by the Grouplens research lab.

Unpack the ml-latest-small.zip file.

Managed Fusion can parse .zip files, but in this tutorial, we will index just one file from the archive (movies.csv).

The movies.csv file contains a list of 9,125 movie titles, plus a header row. Here is a truncated listing:

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
7,Sabrina (1995),Comedy|Romance
8,Tom and Huck (1995),Adventure|Children
9,Sudden Death (1995),Action
10,GoldenEye (1995),Action|Adventure|Thriller
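
To get a feel for the file's structure before indexing it, a short Python sketch like the one below can read the first few rows. It uses an inline sample copied from the listing above rather than the real file, and the variable names are illustrative only; none of this is part of Managed Fusion.

```python
import csv
import io

# Inline sample copied from the truncated listing. With the real file you
# would open movies.csv instead (its location on disk is an assumption).
SAMPLE = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""

# DictReader takes the field names from the header row.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

for row in rows:
    print(row["movieId"], row["title"], row["genres"])
```

Each row comes back as three string values, which matches what the tutorial later observes: the genres value is still a single pipe-delimited string, and the title still carries the release year.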

Open the Movie Search app

Sign in to Managed Fusion if it is not currently open.
In the Managed Fusion launcher, click the Movie Search app.
To verify the Movie Search app is selected to display in the workspace:
- Hover over Apps. Movie Search is the currently selected app.
- Review the collection picker selection at the top of the screen. Movie_Search is selected as the default collection for the Movie Search app, and is where Managed Fusion will place index data.

Configure the datasource

A collection includes one or more datasources. A datasource is a configuration that manages the import, parsing, and indexing of data into a collection.

Click Indexing > Index Workbench.
Click New.
In the Add A New Datasource section, click Or, upload a file.
Click Choose File.
Navigate to the movies.csv file on your computer, select it, and click Open. The file name displays on the screen.
Click Add New Datasource.

The Datasource (File Upload) configuration panel displays the default datasource ID movies_csv-Movie_Search and the default file ID movies.csv. You do not have to change these values.
Enter the Description Movies CSV file.
Click Apply.

Index Workbench reads up to 20 documents into memory from the movies.csv file, and then displays a preview of how they would be indexed based on current parameter and field settings.

You have finished configuring the datasource. At the bottom of the page, click Cancel.

The View Documents field in the lower right lets you select the number of documents to preview.

Analyze the default output

Review the preview to inspect how Managed Fusion interpreted the original fields:
- genres became genres_t (the text_general field type) and genres_s (the string field type). String fields are useful for faceting and sorting, while text fields are for full-text search. At this point, Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both.
- Similarly, title became title_t and title_s because Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both.
- Like the other fields, movieId became movieId_t and movieId_s because Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both. This might seem odd, because the original field contains numbers. But, at this stage, Managed Fusion creates text_general and string fields. To use the contents of this field as an integer, you would map the field to an integer field.
- Fields that begin with _lw contain data that Managed Fusion creates for its own housekeeping. You can disregard these entries.
These fields are created by the Solr Dynamic Field Name Mapping stage in the default index pipeline. This stage attempts to automatically detect field types, and renames fields accordingly. For this tutorial, you will manually configure the fields instead.
Click the green circle next to the Solr Dynamic Field Name Mapping stage to turn off the stage.

Your data’s original fields display: genres, movieId, and title.

Configure the index pipeline

In this section, you will:

Configure the field mappings in the index pipeline so each field has the correct data type.
Split the genres field into multiple values so each value can be used as a facet in Part 3 - Query Data.
Create a new field from part of an existing one.
Trim a field’s value.

Configure field mappings

Field mappings control the data types of document fields. Managed Fusion uses field name suffixes to determine field types. If the field name:

Contains a suffix, Managed Fusion analyzes and searches the field according to the type that the suffix indicates.
Does not contain a suffix, Managed Fusion stores the data as a string field and treats it as an unanalyzed whole.

This section provides examples of both instances.
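The suffix convention can be pictured as a simple lookup. The sketch below lists only the suffixes that this tutorial mentions, with the descriptions the tutorial gives for them; the `describe_field` helper is hypothetical and is not a Managed Fusion API.

```python
# Suffix conventions as described in this tutorial (not an exhaustive list).
SUFFIX_TYPES = {
    "_ss": "multi-valued string",
    "_s": "string",
    "_txt": "text (searchable)",
    "_t": "text_general",
    "_i": "point integer (pint)",
}


def describe_field(name):
    """Return the type description implied by a field name's suffix."""
    # Check longer suffixes first so "_ss" is not mistaken for "_s".
    for suffix in sorted(SUFFIX_TYPES, key=len, reverse=True):
        if name.endswith(suffix):
            return SUFFIX_TYPES[suffix]
    # Per the tutorial, a name without a suffix is stored as a string.
    return "string (no suffix, unanalyzed)"


for field in ("genres_ss", "title_txt", "year_i", "genres"):
    print(field, "->", describe_field(field))
```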

In the list of index pipeline stages, click Field Mapping to open the Field Mapping stage configuration panel.
In Field Translations, click Add to create a new field mapping rule.
In the Source Field, enter genres.

In the Target Field, enter genres_ss.

The field suffix _ss means that this field is a multi-valued string field.

Managed Fusion currently interprets this field as having a single value. The field actually contains a pipe-delimited array of values. When you finish configuring field mappings, subsequent steps will guide you to change the value type.

In Operation, select move.

The move operation means that the resulting document contains genres_ss instead of genres.
Click Apply. The new configuration runs the simulation again and updates the preview panel contents, changing the field name to genres_ss.

Click Add to add more field mapping rules as follows:
- The movieId field is a unique document identifier. Use the copy operation to copy it into the document’s id field.
- The title should be searchable as a text field, so use the move operation to move it to the title_txt field.
  
  The field mappings display as:
Click Apply. The results using those field mappings display in the preview panel.

In the upper right, click Save. The changes to the index pipeline make the document ID more useful and the full text of the movie titles searchable.

Because the input documents in this tutorial are simple documents with a fixed number of known fields, it is easy to configure the Field Mapping stage to ensure the correct document structure. When documents have large numbers of fields, the Solr Dynamic Field Name Mapping stage can reduce the work required to configure the index pipeline.
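As a rough mental model of the three rules configured in this section, the following Python sketch applies move and copy operations to one sample document. It only imitates the behavior described here; `apply_field_mappings` and the rule tuples are hypothetical names, not the Field Mapping stage's actual implementation or configuration format.

```python
def apply_field_mappings(doc, rules):
    """Apply (source, target, operation) rules to a copy of doc."""
    result = dict(doc)
    for source, target, operation in rules:
        if source not in result:
            continue  # nothing to map for this rule
        result[target] = result[source]
        if operation == "move":
            # move: the document ends up with the target instead of the source
            del result[source]
        # copy: the source field is kept alongside the target
    return result


# The rules from this section: two moves and one copy.
RULES = [
    ("genres", "genres_ss", "move"),
    ("movieId", "id", "copy"),
    ("title", "title_txt", "move"),
]

doc = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}
mapped = apply_field_mappings(doc, RULES)
print(mapped)
```

After the rules run, the sample document has genres_ss and title_txt in place of genres and title, and carries both movieId and id.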

Split a multivalue field

The genres_ss field has been parsed as a single value field, but it is really a pipe-delimited array of values. To split this field into its constituent values, add a Regex Field Extraction stage to your index pipeline. This stage uses regular expressions to extract data from specific fields. It can append or overwrite existing fields with the extracted data, or use the data to populate new fields.

Click Add a stage.
Scroll down to Field Transformation and select Regex Field Extraction.
In Regex Rules, click Add.
On the new line, hover over the […] under Source Fields, and click Edit.
In the Source Fields screen, click Add.
Enter genres_ss and click Apply.
In Target Field, enter genres_ss.
In the Write Mode field, select overwrite.
In the Regex Pattern field, paste this expression:
```
[^|\s][^\|]*[^|\s]*
```
You might need to scroll horizontally to see this field on the screen.

The first bracketed term in the regex matches any single character that is not a vertical bar or a whitespace character. The second term matches any character that is not a vertical bar, zero or more times. The last term matches any character that is not a vertical bar or a whitespace character, zero or more times.
In Return If No Match, select input_string.
Click Apply.

Initially, your data does not change.
In the list of index pipeline stages, click and drag the Regex Field Extraction stage so it processes after the Field Mapping stage:

Now the preview shows multiple values for the genres_ss field:

If the preview panel does not update automatically, select a different number of documents to view using the dropdown in the bottom right of the screen. This forces the preview to update.
To view the values of the genres_ss field, click the right-pointing triangle to expand the field and display the values under it:

These field values are useful for faceting, which is detailed in Part 3 - Query Data.
In the upper right, click Save to save the changes to the index pipeline.
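To see what the genre pattern does outside of Index Workbench, you can try it in Python. This is only an approximation: the stage is not running Python's `re` module, and collecting every match with `findall` is an assumption about how the stage produces multiple values. The `split_genres` helper is hypothetical.

```python
import re

# The pattern from the Regex Pattern field, unchanged.
GENRE_PATTERN = re.compile(r"[^|\s][^\|]*[^|\s]*")


def split_genres(value):
    """Return every non-overlapping match, one per genre."""
    return GENRE_PATTERN.findall(value)


print(split_genres("Action|Crime|Thriller"))
print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
```

Each match starts at the first character that is not a vertical bar or whitespace and runs until the next vertical bar, so the vertical bars themselves never appear in the extracted values.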

Create a new field from part of an existing one

Currently, the title_txt field also contains the year in which the movie was released. To make the field more useful for faceting, the year needs to be a separate field. The Regex Field Extraction stage will separate the data.

In the list of index pipeline stages, click Regex Field Extraction.
In the Regex Field Extraction configuration panel, under Regex Rules, click Add.
On the new line, hover over the […] under Source Fields, and then click Edit.
In the Source Fields screen, click Add.
Enter title_txt and click Apply.
In Target Field, enter year_i.

The _i suffix indicates an integer point field (specifically, that the field is a dynamic field with a point integer, pint, field type). Managed Fusion creates this new field when the regular expression matches the contents of the source field.

When you use the Regex Field Extraction stage to create a new field, the value of Write Mode does not affect the data.
In the Regex Pattern field, paste this expression to match the digits inside the parentheses at the end of the title_txt value:
```
\(([0-9]+)\)$
```
In the Regex Capture Group field, enter 1. This lets the index pipeline stage transfer the year into the year_i field.

Scroll to the right to see this field on the screen.
Click Apply.

Now the preview includes the new year_i field:

In the upper right, click Save to save the changes to the index pipeline.
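The year extraction can be checked the same way in Python. This sketch assumes the intended pattern is an escaped pair of parentheses around a captured run of digits, anchored at the end of the value, which is what the step's description calls for. Python's `re` module stands in for the stage's regex engine, and `extract_year` is a hypothetical helper.

```python
import re

# Escaped parentheses around capture group 1 (the digits), anchored at the end.
YEAR_PATTERN = re.compile(r"\(([0-9]+)\)$")


def extract_year(title):
    """Return the year from capture group 1 as an int, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("A Title Without A Year"))
```

Capture group 1 holds only the digits, which is why the step sets Regex Capture Group to 1: the parentheses in the title are matched but not copied into year_i.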

Trim a field’s value

The title_txt field still includes the year of the film’s release, which you have extracted into its own field, year_i. To refine the field, trim the year from the title_txt values so only the title text remains.

In the list of index pipeline stages, click Regex Field Extraction.
In the Regex Field Extraction configuration panel, under Regex Rules, click Add.
On the new line, hover over Source Fields and click Edit.
In the Source Fields screen, click Add.
Enter title_txt and click Apply.
In Target Field, enter title_txt.
In the Write Mode field, select overwrite.
In the Regex Pattern field, paste this expression to match the title text that precedes the year in parentheses at the end of the title_txt value:
```
^(.+)\s\(([0-9]+)\)$
```
In the Regex Capture Group field, enter 1.
Click Apply.

The preview pane displays the title_txt field with only the title string:

In the upper right, click Save to save the changes to the index pipeline.
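A matching Python sketch for the trimming rule is shown below. It assumes the intended pattern captures the title in group 1, followed by a whitespace character and the year in escaped parentheses at the end of the value. Returning the input unchanged when nothing matches is an assumption made for this sketch, and `trim_title` is a hypothetical helper.

```python
import re

# Group 1 is the title text; group 2 is the year, which is discarded here.
TITLE_PATTERN = re.compile(r"^(.+)\s\(([0-9]+)\)$")


def trim_title(title):
    """Return the title without its trailing '(year)' suffix."""
    match = TITLE_PATTERN.match(title)
    # Assumption: leave the value untouched when the pattern does not match.
    return match.group(1) if match else title


print(trim_title("Toy Story (1995)"))
print(trim_title("Father of the Bride Part II (1995)"))
```

Because the pattern is anchored at both ends, only a parenthesized number at the very end of the value is removed.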

Run the datasource job

In the upper left, click Start job to index the data using the configured index pipeline.

This launches a datasource job that imports and indexes the complete contents of your movies.csv file using the configuration you just saved.

Your datasource job is finished when the Index Workbench displays Status: success in the upper left. If the status does not change, click to return to the launcher and relaunch your app to refresh the status.

Close panels you no longer need open

If you do not manually close each panel, Managed Fusion opens panels beside already open panels. Click Close to close all of the open panels.

Reindex the datasource

Documents are associated with a collection through the name of the datasource, which is stored as a value in the _lw_data_source_s field.

For various reasons, you may wish to remove all documents associated with a datasource from a collection before using CrawlDB to add relevant documents back to the collection. This process is known as reindexing.

Navigate to Indexing > Datasources.
Select the datasource name.
Click Clear Datasource. This removes all documents with the selected datasource name in the _lw_data_source_s field.
When the documents are removed, repeat the steps in Configure the index pipeline to reindex the data.

Do not use the name of an existing datasource if you change the name of a datasource or if you create a new datasource. If an identical name is used, all document associations will be shared between the datasource names.
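The effect of clearing a datasource can be pictured with a small in-memory model. This is purely illustrative: the list of dictionaries stands in for a collection, `clear_datasource` is a hypothetical helper rather than a Managed Fusion API, and the second datasource name is made up for the example.

```python
def clear_datasource(collection, datasource_id):
    """Return the documents that are NOT associated with datasource_id."""
    return [
        doc for doc in collection
        if doc.get("_lw_data_source_s") != datasource_id
    ]


collection = [
    {"id": "1", "_lw_data_source_s": "movies_csv-Movie_Search"},
    {"id": "2", "_lw_data_source_s": "movies_csv-Movie_Search"},
    # Hypothetical document from a different datasource in the same collection.
    {"id": "a", "_lw_data_source_s": "other_datasource"},
]

remaining = clear_datasource(collection, "movies_csv-Movie_Search")
print(remaining)
```

Because the association is based only on the name stored in _lw_data_source_s, two datasources that share a name would be indistinguishable in this model, which is the situation the warning above describes.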

Summary

The parts of this tutorial so far have guided you to:

Move 9,125 movie listings from the MovieLens dataset into Managed Fusion
Customize the data type for each field
Split a multivalued field so its values can be treated individually
Create a new field that contains partial contents of a different field
Trim the content of an existing field

The example displays the initial index versus the results after the field mappings and extractions:

Figures: Before (Simulation results 1) and After (Simulation results 6).

Next steps

In Part 3 - Query Data, you will use Query Workbench to get search results from your collection and configure the query pipeline that customizes those results. You will also add faceting using the genres_ss and year_i fields so users can easily filter their search results.

Additional resources

Lucidworks offers free training to help you get started.

The Course for Indexing Data focuses on how to ingest and store your data in a format that’s optimized for search.

Visit the LucidAcademy to see the full training catalog.