Get Data In
ml-latest-small.zip
file.
Fusion can parse .zip files, but for simplicity we will index just one file from the archive (movies.csv).
movies.csv
file contains a list of 9,125 movie titles, plus a header row. Here is a truncated listing:
admin
, and then click Log in.
Movie_Search
is selected in the collection picker. This is the default collection for the Movie Search app, and where Fusion will place index data.
movies.csv
file, select it, and then click Open.
movies_csv-Movie_Search
and the default file ID movies.csv
. These default values are fine.
Movies CSV file
.
movies.csv
file, and then displays a preview of how they would be indexed.
You have finished configuring the datasource. At the bottom of the page, click Cancel.
genres became genres_t (the text_general field type) and genres_s (the string field type). String fields are useful for faceting and sorting, while text fields are for full-text search. At this point, Fusion does not know whether you intend to use this field for faceting and sorting, for full-text search, or for both.
title became title_t and title_s for the same reason.
movieId became movieId_t and movieId_s for the same reason. This might seem odd, because the original field contains numbers. But, at this stage, Fusion creates text_general and string fields. To use the contents of this field as an integer, you would map the field to an integer field.
_lw
. These fields contain data that Fusion creates for its own housekeeping. You can ignore them.
These fields are created by the Solr Dynamic Field Name Mapping stage in the default index pipeline. This stage attempts to automatically detect field types, and renames fields accordingly. For this tutorial, you will manually configure the fields instead.
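The renaming described above can be pictured with a small Python sketch. It only illustrates the naming pattern (each source field appearing once with a _t suffix and once with an _s suffix); the function name is hypothetical and this is not Fusion's implementation of the stage.

```python
def expand_field_names(doc: dict) -> dict:
    """Return a new dict in which every field appears as <name>_t and <name>_s."""
    mapped = {}
    for name, value in doc.items():
        mapped[f"{name}_t"] = value  # text_general: intended for full-text search
        mapped[f"{name}_s"] = value  # string: intended for faceting and sorting
    return mapped


# Sample row in the shape of movies.csv (movieId, title, genres).
row = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}

for field_name in expand_field_names(row):
    print(field_name)
```

Note that movieId stays a string in both variants, which matches the point above that nothing is treated as an integer at this stage.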
genres
, movieId
, and title
.
genres
field into multiple values so each value can be used as a facet in Part 3 of this tutorial.
genres
.
genres_ss
.
The field suffix _ss means that this field is a multi-valued string field.
genres
field; it only has genres_ss
.
genres
to genres_ss
:
(Before and after screenshots.)
movieId
field is a unique document identifier. It should be copied into the document’s id field.
title should be searchable as a text field, so you move it to the field title_txt.
(Before and after screenshots.)
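The difference between moving and copying a field can be sketched in Python as follows. This is a simplified model under stated assumptions: the rule format and the map_fields helper are hypothetical, and only the three mappings named in this section (genres to genres_ss, movieId to id, title to title_txt) come from the tutorial.

```python
def map_fields(doc: dict, rules: list) -> dict:
    """Apply (source, target, operation) rules; operation is "move" or "copy"."""
    out = dict(doc)
    for source, target, operation in rules:
        if source not in out:
            continue  # nothing to map for this document
        value = out[source]
        if operation == "move":
            del out[source]  # a move removes the original field
        out[target] = value  # a copy keeps the original alongside the target
    return out


# Hypothetical rule list mirroring the mappings described above.
RULES = [
    ("genres", "genres_ss", "move"),
    ("movieId", "id", "copy"),
    ("title", "title_txt", "move"),
]

doc = {
    "movieId": "1",
    "title": "Toy Story (1995)",
    "genres": "Adventure|Animation|Children|Comedy|Fantasy",
}
print(map_fields(doc, RULES))
```

After the rules run, the document keeps movieId and gains id, while genres and title exist only under their new names.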
genres_ss
field has been parsed as a single-value field, but you can see that it is really a pipe-delimited array of values. To split this field into its constituent values, you will add a Regex Field Extraction stage to your index pipeline. This stage uses regular expressions to extract data from specific fields. It can append or overwrite existing fields with the extracted data, or use the data to populate new fields.
[...]
under Source Fields, and then click Edit genres_ss
, and then click Apply.
genres_ss
.
input_string
.
genres_ss
field:
(Before and after screenshots.)
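To see what splitting a pipe-delimited value with a regular expression looks like, here is a minimal Python sketch. The pattern is an assumption chosen for illustration (one or more characters that are not a pipe) and is not necessarily the pattern you enter in the stage; Python's re module stands in for the stage's own regex engine.

```python
import re

# Assumed pattern: each match is a run of characters containing no pipe.
GENRE_PATTERN = re.compile(r"[^|]+")


def split_genres(value: str) -> list:
    """Return every non-empty, pipe-separated piece of the value."""
    return GENRE_PATTERN.findall(value)


print(split_genres("Adventure|Animation|Children|Comedy|Fantasy"))
# ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy']
```

Each match becomes one value of the multi-valued field, which is what makes the individual genres usable as facet values later.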
genres_ss
field, expand it and values
under it by clicking the right triangle
title_txt
field also contains the year in which the movie was released. Instead of including the year in your full-text search field, it would be more useful as a separate field that you can use for faceting. This is another job for the Regex Field Extraction stage.
[...]
under Source Fields, and then click Edit title_txt
, and then click Apply.
year_i
.
The _i suffix indicates an integer point field (specifically, that the field is a dynamic field with a pint field type). Fusion will create this new field whenever the regular expression matches the contents of the source field.
title_txt
value:
1
. This lets the index pipeline stage transfer the year into the year_i
field.
year_i
field:
(Before and after screenshots.)
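The role of capture group 1 can be illustrated with a short Python sketch. The pattern below (a four-digit number in parentheses) is an assumption for illustration, not necessarily the expression used in the stage, and extract_year is a hypothetical helper.

```python
import re

# Assumed pattern: group 1 captures four digits enclosed in parentheses.
YEAR_PATTERN = re.compile(r"\((\d{4})\)")


def extract_year(title: str):
    """Return capture group 1 as an int, or None when the pattern does not match."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))   # 1995
print(extract_year("Untitled Project"))   # None
```

Returning None for a non-matching title mirrors the behavior described above, where the new field is created only when the regular expression matches the source field.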
title_txt
field still includes the year of the film’s release, which you have extracted into its own field, year_i. Let us trim that information from the title_txt values so that only the title text remains.
title_txt
, and then click Apply.
title_txt
.
overwrite
.
title_txt
value:
1
.
title_txt
field with only the title string:
(Before and after screenshots.)
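Overwriting a field with capture group 1 can be sketched in Python like this. The pattern is an assumption (everything before a trailing four-digit year in parentheses), and leaving the value unchanged when nothing matches is also an assumption about how an overwrite should behave, not a statement about the stage itself.

```python
import re

# Assumed pattern: group 1 is the title text before a trailing "(YYYY)".
TITLE_PATTERN = re.compile(r"^(.+?)\s*\(\d{4}\)\s*$")


def trim_title(title: str) -> str:
    """Return the title without its trailing year; unchanged if there is no match."""
    match = TITLE_PATTERN.match(title)
    return match.group(1) if match else title


doc = {"title_txt": "Toy Story (1995)", "year_i": 1995}
doc["title_txt"] = trim_title(doc["title_txt"])  # overwrite the source field
print(doc)
# {'title_txt': 'Toy Story', 'year_i': 1995}
```

Because the pattern anchors the year at the end of the value, parenthesized text earlier in a title is kept as part of the title.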
movies.csv
file, using the configuration you just saved.
Your datasource job is finished when the Index Workbench displays Status: success in the upper left. If the status does not change, go back to the launcher and relaunch your app.
_lw_data_source_s
field. For various reasons, you may wish to remove all documents associated with a datasource from a collection before using CrawlDB to add relevant documents back to the collection. This process is known as reindexing.
To accomplish this, navigate to Indexing _lw_data_source_s
field. After the documents are removed from the collection, you can repeat the steps above to reindex the data.
(Before and after screenshots.)
genres_ss
and year_i
fields so that users can easily filter their search results.
Getting Started with Fusion Server
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
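The extra backslash in API values follows from JSON string escaping, which a short Python sketch can show. The "delimiter" key is a hypothetical configuration property used only for illustration.

```python
import json

# What you would type in a UI text box: a backslash followed by the letter t.
ui_value = r"\t"

# Serializing that value for a JSON request body escapes the backslash.
payload = json.dumps({"delimiter": ui_value})
print(payload)
# {"delimiter": "\\t"}

# Decoding the payload recovers the two-character value typed in the UI.
assert json.loads(payload)["delimiter"] == ui_value
```

The same doubling applies to any backslash sequence sent in a JSON body, for example in regular expressions.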