Fusion 5.9

    Index Data

    Getting Started with Managed Fusion

    This topic details how to configure a datasource using Index Workbench.

    General information

    Managed Fusion’s Index Workbench provides the tools to configure datasources, parsers, and index pipelines. It lets you preview the results of indexing before you load your data into the actual index.

    When you enter the necessary data extraction configuration in Index Workbench, it retrieves a small number of documents as sample data.

    Since this processing is simulated, and actual data is not yet ingested, you can preview the sample documents to test and refine the index pipeline before all of the data is loaded into the actual index.

    When you complete and save the configuration, it is saved in Index Workbench as a Managed Fusion datasource. To load your data into Managed Fusion, use the Datasource tool to run the resulting configuration.

    Before you begin

    To perform the steps in this part of the tutorial, you must complete Part 1 - Create a Managed Fusion application.

    Throughout these tutorials, it is important to save your work regularly. The steps include instructions to save, but you can save your work more frequently if needed. When you configure datasources, pipelines, and other settings on your own site, saving your changes regularly is essential.

    Download the MovieLens dataset

    1. Download the dataset.

      This is a MovieLens dataset created by the Grouplens research lab.

    2. Unpack the ml-latest-small.zip file.

      Managed Fusion can parse .zip files, but in this tutorial, we will index just one file from the archive (movies.csv).

      The movies.csv file contains a list of 9,125 movie titles, plus a header row. Here is a truncated listing:

      movieId,title,genres
      1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
      2,Jumanji (1995),Adventure|Children|Fantasy
      3,Grumpier Old Men (1995),Comedy|Romance
      4,Waiting to Exhale (1995),Comedy|Drama|Romance
      5,Father of the Bride Part II (1995),Comedy
      6,Heat (1995),Action|Crime|Thriller
      7,Sabrina (1995),Comedy|Romance
      8,Tom and Huck (1995),Adventure|Children
      9,Sudden Death (1995),Action
      10,GoldenEye (1995),Action|Adventure|Thriller
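
    To get a feel for the file's structure before uploading it, you can parse a few rows locally. The following is a minimal sketch that uses Python's standard csv module on rows copied from the listing above; the name read_rows is hypothetical and is not part of Managed Fusion.

```python
import csv
import io

# Sample rows copied from the truncated listing above.
SAMPLE = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(SAMPLE)
for row in rows:
    # Each row has exactly the three fields named in the header.
    print(row["movieId"], row["title"], row["genres"])
```

    Note that every value, including movieId, is read as a string, and that the genres column arrives as a single pipe-delimited value. Both points come up again later in this tutorial.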

    Open the Movie Search app

    1. Sign in to Managed Fusion if it is not currently open.

    2. In the Managed Fusion launcher, click the Movie Search app.

    3. To verify the Movie Search app is selected to display in the workspace:

      • Hover over Apps. Movie Search is the currently selected app.

      • Review the collection picker selection at the top of the screen. Movie_Search is selected as the default collection for the Movie Search app, and is where Managed Fusion will place index data.

        Collection Movie_Search is selected

    Configure the datasource

    A collection includes one or more datasources. A datasource is a configuration that manages the import, parsing, and indexing of data into a collection.

    1. Click Indexing > Index Workbench.

    2. Click New.

    3. In the Add A New Datasource section, click Or, upload a file.

    4. Click Choose File.

    5. Navigate to the movies.csv file on your computer, select it, and click Open. The file name displays on the screen.

      New datasource

    6. Click Add New Datasource.

      The Datasource (File Upload) configuration panel displays the default datasource ID movies_csv-Movie_Search and the default file ID movies.csv. You do not have to change these values.

    7. Enter the Description Movies CSV file.

      Configure datasource

    8. Click Apply.

      Index Workbench reads up to 20 documents into memory from the movies.csv file, and then displays a preview of how they would be indexed based on current parameter and field settings.

      You have finished configuring the datasource. At the bottom of the page, click Cancel.

      First preview of index

      The View Documents field in the lower right lets you select the number of documents to preview.
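
    The idea of previewing a bounded sample rather than the whole file can be sketched as follows. This is only an illustration of the sampling behavior described above, not how Index Workbench is implemented; preview_rows and the generated stand-in data are hypothetical.

```python
import csv
import io
from itertools import islice


def preview_rows(lines, limit=20):
    """Return at most `limit` parsed rows without reading the rest of the input."""
    reader = csv.DictReader(lines)
    return list(islice(reader, limit))


# Hypothetical stand-in for movies.csv: a header row plus 50 generated rows.
fake_file = io.StringIO(
    "movieId,title,genres\n"
    + "".join(f"{i},Movie {i} (2000),Drama\n" for i in range(1, 51))
)

sample = preview_rows(fake_file)
print(len(sample))  # number of rows held in the preview
```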

    Analyze the default output

    1. Review the preview to inspect how Managed Fusion interpreted the original fields:

      • genres became genres_t (the text_general field type) and genres_s (the string field type). String fields are useful for faceting and sorting, while text fields are for full-text search. At this point, Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both.

      • Similarly, title became title_t and title_s because Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both.

      • Like the other fields, movieId became movieId_t and movieId_s because Managed Fusion cannot determine whether you intend to use this field for faceting and sorting, for full-text search, or for both. This might seem odd, because the original field contains numbers. But, at this stage, Managed Fusion creates text_general and string fields. To use the contents of this field as an integer, you would map the field to an integer field.

      • Fields that begin with _lw contain data that Managed Fusion creates for its own housekeeping. You can disregard these entries.

      These fields are created by the Solr Dynamic Field Name Mapping stage in the default index pipeline. This stage attempts to automatically detect field types, and renames fields accordingly. For this tutorial, you will manually configure the fields instead.

    2. Click the green circle next to the Solr Dynamic Field Name Mapping stage to turn off the stage.

      Your data’s original fields display: genres, movieId, and title.

      Stage disabled
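
    The naming pattern seen in the default preview can be mimicked with a few lines of Python. This is a simplified illustration of the observed output only (each original field showing up as a _t and an _s variant); it does not reproduce the actual logic of the Solr Dynamic Field Name Mapping stage, and the function name is hypothetical.

```python
def dynamic_name_mapping_preview(doc):
    """Mimic the observed preview: each field appears as <name>_t and <name>_s."""
    out = {}
    for name, value in doc.items():
        out[f"{name}_t"] = value  # text_general variant, for full-text search
        out[f"{name}_s"] = value  # string variant, for faceting and sorting
    return out


doc = {"movieId": "1", "title": "Toy Story (1995)", "genres": "Adventure|Animation"}
mapped = dynamic_name_mapping_preview(doc)
print(sorted(mapped))
```

    Turning the stage off, as in step 2, corresponds to skipping this renaming so that the original field names pass through unchanged.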

    Configure the index pipeline

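    In this section, you will:

    • Configure field mappings

    • Split a multivalue field

    • Create a new field from part of an existing one

    • Trim a field’s value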

    Configure field mappings

    Field mappings control the data types of document fields. Managed Fusion uses field name suffixes to determine field types. If the field name:

    • Contains a suffix, Managed Fusion analyzes and searches the field according to the field type that the suffix indicates.

    • Does not contain a suffix, Managed Fusion stores the data as a string field and treats it as an unanalyzed whole.

    This section provides examples of both instances.
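
    As a rough aid, the suffix convention can be written down as a lookup. The sketch below covers only the suffixes that this tutorial mentions, with the types as this tutorial describes them; the table and the field_type helper are hypothetical and are not a Managed Fusion API.

```python
# Suffixes named in this tutorial and the field types it associates with them.
SUFFIX_TYPES = {
    "_t": "text_general",
    "_s": "string",
    "_ss": "multi-valued string",
    "_txt": "text",
    "_i": "point integer (pint)",
}


def field_type(field_name):
    """Return the type implied by the field name's suffix, if it has one."""
    for suffix, type_name in SUFFIX_TYPES.items():
        if field_name.endswith(suffix):
            return type_name
    # No recognized suffix: stored as a string and treated as an unanalyzed whole.
    return "string (no suffix, unanalyzed)"


for name in ["genres", "genres_ss", "title_txt", "year_i"]:
    print(name, "->", field_type(name))
```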

    1. In the list of index pipeline stages, click Field Mapping to open the Field Mapping stage configuration panel.

    2. In Field Translations, click Add to create a new field mapping rule.

    3. In the Source Field, enter genres.

    4. In the Target Field, enter genres_ss.

      The field suffix _ss means that this field is a multi-valued string field.

      Managed Fusion currently interprets this field as having a single value. The field actually contains a pipe-delimited array of values. When you finish configuring field mappings, subsequent steps will guide you to change the value type.

    5. In Operation, select move.

      The move operation means that the resulting document contains genres_ss instead of genres.

      Field mapping of genres field

    6. Click Apply. The new configuration runs the simulation again and updates the preview panel contents, changing the field name to genres_ss.

      Before After

      Simulation results 1

      Simulation results 2

    7. Click Add to add more field mapping rules as follows:

      • The movieId field is a unique document identifier. Select to copy it into the document’s id field.

      • The title should be searchable as a text field, so select to move it to the title_txt field.

        The field mappings display as:

        All field mappings

    8. Click Apply. The results using those field mappings display in the preview panel.

      Before After

      Simulation results 2

      Simulation results 3

    9. In the upper right, click Save. The changes to the index pipeline make the document ID more useful and the full text of the movie titles searchable.

    Because the input documents in this tutorial are simple documents with a fixed number of known fields, it is easy to configure the Field Mapping stage to ensure the correct document structure. When documents have large numbers of fields, the Solr Dynamic Field Name Mapping stage can reduce the work required to configure the index pipeline.
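
    The effect of the three mapping rules configured above (move genres, copy movieId, move title) can be sketched in Python. This is a simplified model of the move and copy operations as this tutorial describes them, not the Field Mapping stage itself; apply_field_mappings and RULES are hypothetical names.

```python
def apply_field_mappings(doc, rules):
    """Apply (source, target, operation) rules; operation is "move" or "copy"."""
    out = dict(doc)
    for source, target, operation in rules:
        if source not in out:
            continue  # nothing to map for this document
        out[target] = out[source]
        if operation == "move":
            del out[source]  # move: the source field no longer appears
    return out


# The rules configured in the steps above.
RULES = [
    ("genres", "genres_ss", "move"),
    ("movieId", "id", "copy"),
    ("title", "title_txt", "move"),
]

doc = {"movieId": "1", "title": "Toy Story (1995)", "genres": "Adventure|Animation"}
result = apply_field_mappings(doc, RULES)
print(result)
```

    After a copy, both movieId and id are present; after a move, only the target field remains.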

    Split a multivalue field

    The genres_ss field has been parsed as a single value field, but it is really a pipe-delimited array of values. To split this field into its constituent values, add a Regex Field Extraction stage to your index pipeline. This stage uses regular expressions to extract data from specific fields. It can append or overwrite existing fields with the extracted data, or use the data to populate new fields.

    1. Click Add a stage.

    2. Scroll down to Field Transformation and select Regex Field Extraction.

      Regex Field Extraction stage

    3. In Regex Rules, click Add.

    4. On the new line, hover over the […] under Source Fields, and click Edit.

    5. In the Source Fields screen, click Add.

    6. Enter genres_ss and click Apply.

    7. In Target Field, enter genres_ss.

    8. In the Write Mode field, select overwrite.

    9. In the Regex Pattern field, paste this expression:

      [^|\s][^\|]*[^|\s]*

      You might need to scroll horizontally to see this field on the screen.

      The first bracketed term in the regex matches any single character that is not a vertical bar or whitespace. The second term matches any character that is not a vertical bar, zero or more times. The last term matches any character that is not a vertical bar or whitespace, zero or more times.

    10. In Return If No Match, select input_string.

    11. Click Apply.

      Initially, your data does not change.

    12. In the list of index pipeline stages, click and drag the Regex Field Extraction stage so it processes after the Field Mapping stage:

      Index pipeline stage reordering

      Now the preview shows multiple values for the genres_ss field:

      Before After

      Simulation results 3

      Simulation results 4

      If the preview panel does not update automatically, select a different number of documents to view using the dropdown in the bottom right of the screen. This forces the preview to update.

    13. To view the values of the genres_ss field, click the right-pointing triangle to expand the field and display the values under it:

      Simulation results 4 expanded

      These field values are useful for faceting, which is detailed in Part 3 - Query Data.

    14. In the upper right, click Save to save the changes to the index pipeline.
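
    To see what the pattern from step 9 does to a genres value, you can try it outside Managed Fusion. The sketch below uses Python's re module as a stand-in, on the assumption that the pattern behaves the same way there for this data; split_genres is a hypothetical helper, and the fallback to the original value imitates the input_string choice in step 10.

```python
import re

# The pattern from step 9, unchanged.
GENRE_PATTERN = re.compile(r"[^|\s][^\|]*[^|\s]*")


def split_genres(value):
    """Return every match of the pattern, or the input itself if nothing matches."""
    matches = GENRE_PATTERN.findall(value)
    return matches or [value]  # imitates Return If No Match = input_string


print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
print(split_genres("Comedy"))
```

    Each match stops at the next vertical bar, so a pipe-delimited value yields one entry per genre, and a value without any bar is returned as a single entry.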

    Create a new field from part of an existing one

    Currently, the title_txt field also contains the year in which the movie was released. To make the field more useful for faceting, the year needs to be a separate field. The Regex Field Extraction stage will separate the data.

    1. In the list of index pipeline stages, click Regex Field Extraction.

    2. In the Regex Field Extraction configuration panel, under Regex Rules, click Add.

    3. On the new line, hover over the […] under Source Fields, and then click Edit.

    4. In the Source Fields screen, click Add.

    5. Enter title_txt and click Apply.

    6. In Target Field, enter year_i.

      The _i suffix indicates an integer point field (specifically, that the field is a dynamic field with a point integer, pint, field type). Managed Fusion creates this new field when the regular expression matches the contents of the source field.

      When you use the Regex Field Extraction stage to create a new field, the value of Write Mode does not affect the data.

    7. In the Regex Pattern field, paste this expression to match the digits inside the parentheses at the end of the title_txt value:

      \(([0-9]+)\)$

    8. In the Regex Capture Group field, enter 1. This lets the index pipeline stage transfer the year into the year_i field.

      Scroll to the right to see this field on the screen.

    9. Click Apply.

      Now the preview includes the new year_i field:

      Before After

      Simulation results 4

      Simulation results 5

    10. In the upper right, click Save to save the changes to the index pipeline.
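
    The year extraction rule can be checked against a sample title in the same way. This is a minimal sketch using Python's re module as a stand-in for the stage; extract_year is a hypothetical helper, and returning None stands for the case where the pattern does not match and no year_i field is created.

```python
import re

# The pattern from step 7; capture group 1 holds the digits inside the parentheses.
YEAR_PATTERN = re.compile(r"\(([0-9]+)\)$")


def extract_year(title):
    """Return the year at the end of the title as an int, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("A title without a year"))
```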

    Trim a field’s value

    The title_txt field still includes the year of the film’s release, which you have extracted into its own field, year_i. To refine the field, trim the year from the title_txt values so only the title text remains.

    1. In the list of index pipeline stages, click Regex Field Extraction.

    2. In the Regex Field Extraction configuration panel, under Regex Rules, click Add.

    3. On the new line, hover over Source Fields and click Edit.

    4. In the Source Fields screen, click Add.

    5. Enter title_txt and click Apply.

    6. In Target Field, enter title_txt.

    7. In the Write Mode field, select overwrite.

    8. In the Regex Pattern field, paste this expression to match the title text that precedes the parenthesized year at the end of the title_txt value:

      ^(.+)\s\(([0-9]+)\)$

    9. In the Regex Capture Group field, enter 1.

    10. Click Apply.

      The preview pane displays the title_txt field with only the title string:

      Before After

      Simulation results 5

      Simulation results 6

    11. In the upper right, click Save to save the changes to the index pipeline.
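
    The trimming rule works because capture group 1 holds everything before the space and the parenthesized year. The sketch below demonstrates this with Python's re module as a stand-in for the stage; trim_title is a hypothetical helper, and leaving a non-matching title unchanged is an assumption made for this example.

```python
import re

# The pattern from step 8; group 1 is the title text, group 2 is the year.
TITLE_PATTERN = re.compile(r"^(.+)\s\(([0-9]+)\)$")


def trim_title(title):
    """Return the title without its trailing "(year)", if it has one."""
    match = TITLE_PATTERN.match(title)
    # Assumption: a title that does not match is kept as it is.
    return match.group(1) if match else title


print(trim_title("Toy Story (1995)"))
print(trim_title("Father of the Bride Part II (1995)"))
```

    Because this rule overwrites title_txt, it needs to run after the rule that reads the year from the same field, which is the order in which the rules were added above.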

    Run the datasource job

    In the upper left, click Start job to index the data using the configured index pipeline.

    Start job

    This launches a datasource job that imports and indexes the complete contents of your movies.csv file using the configuration you just saved.

    Your datasource job is finished when the Index Workbench displays Status: success in the upper left. If the status does not change, click to return to the launcher and relaunch your app to refresh the status.

    Close panels you no longer need open

    If you do not manually close each panel, Managed Fusion opens panels beside already open panels. Click Close to close all of the open panels.

    Reindex the datasource

    Documents are associated with a collection through the name of the datasource, which is stored as a value in the _lw_data_source_s field.

    For various reasons, you may wish to remove all documents associated with a datasource from a collection before using CrawlDB to add relevant documents back to the collection. This process is known as reindexing.

    1. Navigate to Indexing > Datasources.

    2. Select the datasource name.

    3. Click Clear Datasource. This removes all documents with the selected datasource name in the _lw_data_source_s field.

    4. When the documents are removed, repeat the steps in Configure the index pipeline to reindex the data.

    Do not use the name of an existing datasource if you change the name of a datasource or if you create a new datasource. If an identical name is used, all document associations will be shared between the datasource names.
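
    The association between documents and a datasource can be pictured as a filter on the _lw_data_source_s field. The following is a conceptual sketch only: it models a collection as a Python list, and clear_datasource and the second datasource name are hypothetical. It is not how Managed Fusion performs the operation.

```python
def clear_datasource(collection, datasource_name):
    """Return the documents that remain after removing one datasource's documents."""
    return [
        doc for doc in collection
        if doc.get("_lw_data_source_s") != datasource_name
    ]


collection = [
    {"id": "1", "_lw_data_source_s": "movies_csv-Movie_Search"},
    {"id": "2", "_lw_data_source_s": "movies_csv-Movie_Search"},
    {"id": "a", "_lw_data_source_s": "other_source"},  # hypothetical second datasource
]

remaining = clear_datasource(collection, "movies_csv-Movie_Search")
print(remaining)
```

    Since the match is made on the name alone, documents from two datasources that share a name cannot be told apart, which is why the warning above advises against reusing a name.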

    Summary

    The parts of this tutorial so far have guided you to:

    • Move 9,125 movie listings from the MovieLens database into Managed Fusion

    • Customize the data type for each field

    • Split a multivalued field to treat its values individually

    • Create a new field that contains partial contents of a different field

    • Trim the content of an existing field

    The example displays the initial index versus the results after the field mappings and extractions:

    Before After

    Simulation results 1

    Simulation results 6

    Next steps

    In Part 3 - Query Data, you will use Query Workbench to get search results from your collection and configure the query pipeline that customizes those results. You will also add faceting using the genres_ss and year_i fields so users can easily filter their search results.

    Additional resources

    Lucidworks offers free training to help you get started.

    The Course for Indexing Data focuses on how to ingest and store your data in a format that’s optimized for search.

    Visit the LucidAcademy to see the full training catalog.