From Ingest to Search:
Part One - Default Indexing

In Part One, we’ll see what we get with the default index and query pipelines.

Before You Begin

For our dataset, we’ll use a simple CSV file containing abstracts from Wikipedia articles that contain either of the words "film" or "movie". Each row in the CSV file contains one document’s worth of data. Fusion will process each row into a fielded Solr document whose field names are derived from the CSV column names.
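
As a rough illustration of that row-to-document mapping (not Fusion's actual parser), the Python sketch below reads CSV text with the standard csv module and turns each data row into a dictionary keyed by the header names. The function name rows_to_documents and the sample row are hypothetical, invented here for illustration.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Turn CSV text into a list of dicts, one per data row.

    The header row supplies the field names, loosely mirroring how each
    CSV row becomes a fielded document. Illustration only, not Fusion's parser.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Hypothetical sample in the same shape as the tutorial file (not real data).
sample = (
    "DBpediaURL_s,WikipediaURL_s,abstractShort_txt\n"
    '"<http://dbpedia.org/resource/Example_Film>",'
    '"<http://en.wikipedia.org/wiki/Example_Film>",'
    '"Example Film is a made-up entry, used here for illustration."\n'
)

docs = rows_to_documents(sample)
for field, value in docs[0].items():
    print(f"{field}: {value}")
```

Note that the comma inside the quoted abstract does not split the field, which is why proper CSV quoting matters for this dataset.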

  1. Download Fusion.

  2. Unpack the archive: tar -xf fusion-3.0.x.tar.gz

  3. Start Fusion: fusion/bin/fusion start

  4. Go to http://localhost:8764/.

  5. Download the dataset and unzip it.

  6. Completeness check: Does the file contain the right number of lines? Your copy of this file should be 155,299 lines long.

    > wc -l cinema-short-abstracts.csv
    155299 cinema-short-abstracts.csv

    Tip
    You should always check that your data is in the format that you expect and that it is complete and correct before you import it into Fusion.

  7. Correctness check: View the first several lines of the file:

    > head -4 cinema-short-abstracts.csv | cat -n
         1	DBpediaURL_s,WikipediaURL_s,abstractShort_txt
         2	"<http://dbpedia.org/resource/!Women_Art_Revolution>","<http://en.wikipedia.org/wiki/!Women_Art_Revolution?oldid=498606361>","!Women Art Revolution is a 2010 documentary film directed by Lynn Hershman Leeson and distributed by Zeitgeist Films. It was released theatrically in the United States on June 1, 2011."
         3	"<http://dbpedia.org/resource/$100_Film_Festival>","<http://en.wikipedia.org/wiki/$100_Film_Festival?oldid=540809587>","The $100 Film Festival is an independent film festival that runs for three days every March at the Globe Cinema in downtown Calgary, Alberta. The festival showcases films in all genres by local and international independent artists who enjoy working with traditional film. Created in 1992 by the Calgary Society of Independent Filmmakers (CSIF), the $100 Film Festival started as a challenge for area filmmakers to a make low budget movie using Super8 film for less than $100."
         4	"<http://dbpedia.org/resource/$30_Film_School>","<http://en.wikipedia.org/wiki/$30_Film_School?oldid=498969852>","$30 Film School is a book written by Michael W. Dean instructing on filmmaking on a limited budget, and is part of the $30 School book series which includes $30 Music School and $30 Writing School. Like the other books of this series, $30 Film School advocates a start-to-finish DIY ethic, and includes interviews with professionals in the given field, as well as a CD or DVD of extras. Published by Muska & Lipman in 2003, the first edition sold 30,000 copies."
    • The first line is the CSV header.

    • The column names are suffixed by the Solr data type for each column; we’ll look at those data types later.

    • In the Solr index, the column names from the file become the field names for each entry.

    • Fusion will process each non-header and non-comment row into a Solr document.

  8. Correctness check: Verify the CSV format with csvlint

    This particular data file was derived from a file downloaded from the DBpedia project. The original dataset contains UTF-8 characters outside of the ISO Latin 1 character set. The abstracts contain many quoted strings which must be properly escaped for CSV. Did the person who created the CSV file format it correctly?

    The answer is yes. To check for yourself, you can run the "csvlint" utility, which can be downloaded from https://github.com/Clever/csvlint.

    > csvlint cinema-short-abstracts.csv
    cinema-short-abstracts.csv is VALID
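
If you prefer to script the completeness and correctness checks, the following Python sketch counts the lines in a file and reports any records whose column count differs from the expected three. It is a minimal sketch, assuming a UTF-8 file; the function name check_csv is hypothetical, and Python's csv module is more lenient than csvlint, so this is a weaker check than the one above.

```python
import csv


def check_csv(path, expected_columns=3):
    """Return (line_count, bad_rows) for the CSV file at `path`.

    line_count is the number of physical lines (a final line without a
    trailing newline is still counted, unlike `wc -l`).
    bad_rows lists the 1-based record numbers whose column count differs
    from `expected_columns`.
    """
    # Completeness: count physical lines in binary mode.
    with open(path, "rb") as f:
        line_count = sum(1 for _ in f)

    # Correctness: parse as CSV and compare each record's column count.
    bad_rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for record_number, row in enumerate(csv.reader(f), start=1):
            if len(row) != expected_columns:
                bad_rows.append(record_number)
    return line_count, bad_rows
```

For the tutorial file, you would call check_csv("cinema-short-abstracts.csv") and compare the first value against the expected 155,299 lines; a non-empty second value would point to records that need a closer look.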

1. Create a Collection

Fusion stores each dataset in a collection in the Solr index.

We’ll create a collection called "cinema_1" for our data:

  1. Log in to Fusion in your browser (http://localhost:8764/ if you installed Fusion locally). Fusion supports Chrome, Firefox, and recent versions of IE.

    The Launcher appears.

  2. From the pull-down menu in the upper left, select Manage Collections.

    The Collections Manager appears.

  3. Click the New button.

  4. For the Collection name, enter "cinema_1".

  5. Click Save Collection.

    Now you have a "cinema_1" collection that contains no documents or datasources.

Now that we have a collection to work with, we’ll use the Index Workbench to define a datasource for the collection, then configure the parser and index pipeline that determine how the data is formatted for storage.

2. Define a Datasource

A datasource is an ingest configuration that’s associated with a collection. Ingested data belongs to the same collection as its datasource.


To configure Fusion to ingest the "cinema-short-abstracts.csv" file, you must define a datasource:

  1. In the collections list, click cinema_1.

    The collection’s Home toolbar appears.

  2. Click Index Workbench.

    The Index Workbench opens.

    In the "cinema_1" collection that you just created, there are no datasources configured yet, so you will configure a new one.

  3. Click the Add data from file button.

  4. Select the "cinema-short-abstracts.csv" file.

    The datasource configuration panel appears. The File ID field displays the name of the file you selected.

  5. Enter a name in the Datasource ID field, such as "ds1".

  6. Click Apply.

    The Index Workbench reads the file, then displays simulated results that show how the data would be indexed using the default parser and index pipeline.

You can see the DBpediaURL_s, WikipediaURL_s, and abstractShort_txt fields in the simulation, as well as additional fields created by Fusion. Because the simulation is created using default configurations for the parser and the index pipeline, the original fields and their values are unmodified. At this point, our data has not been imported or indexed.

We can see that our data has a text field containing an abstract, and two fields containing URLs. There’s no title field, but both of the URL fields contain the movie title. We know that users will want to search by title as well as by keywords, but the default configuration does not result in a searchable title field.

Before we index our data, we will create a new field for the movie title, derived from one of the URL fields.
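
To make the idea concrete, here is a minimal Python sketch of how a title could be derived from a DBpediaURL_s value. It assumes the value has the form "<http://dbpedia.org/resource/Some_Title>", as in the sample rows shown earlier. The function name title_from_dbpedia_url is hypothetical, and this is not how the Fusion index pipeline is configured; it only illustrates the string transformation involved.

```python
from urllib.parse import unquote


def title_from_dbpedia_url(value):
    """Derive a readable title from a DBpediaURL_s value.

    Assumption: the value looks like
    "<http://dbpedia.org/resource/Some_Title>", so the title is whatever
    follows "/resource/", with underscores standing in for spaces.
    """
    # Drop the surrounding angle brackets.
    url = value.strip().lstrip("<").rstrip(">")
    # Keep everything after the first "/resource/" (titles may contain "/").
    segment = url.split("/resource/", 1)[-1]
    # Decode any percent-escapes, then turn underscores into spaces.
    return unquote(segment).replace("_", " ")


print(title_from_dbpedia_url("<http://dbpedia.org/resource/$100_Film_Festival>"))
# prints: $100 Film Festival
```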

3. Lessons Learned

  • Data validation - check your work early and often

    • The bigger your data, the more often you should check

  • Fusion Collections - the underlying datastore

    • Creating a new collection

    • Viewing the collection fields

  • Index Workbench - the tool for ingesting, simulating and completing indexing

    • Setting up the datasource

    • Configuring parsing options

    • Setting up the index pipeline

    • Simulating indexing results

4. Next Steps

The next parts of this tutorial show you how Fusion provides tools to improve the user search experience:

  • Part Two shows you how to configure the index pipeline to provide better indexed data for search purposes.

  • Part Three explains how to configure the query pipeline to provide better processing for user queries.