From Ingest to Search:
Part One - Default Indexing

In Part One, we’ll see what we get with the default index and query pipelines. You can use Fusion 2.4 or 3.0 for this tutorial.

Before You Begin

For our dataset, we’ll use a simple CSV file containing abstracts from Wikipedia articles that contain either of the words "film" or "movie". Each row in the CSV file contains one document’s worth of data. Fusion will process each row into a fielded Solr document whose field names are derived from the CSV column names.
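The row-to-document mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Fusion's actual CSV parser: the function name `rows_to_documents` is hypothetical, and the sample row uses made-up example.org URLs under the same header as the real file.

```python
import csv
import io

# Tiny stand-in for cinema-short-abstracts.csv: the real header plus one
# made-up row (the example.org URLs are placeholders, not real data).
SAMPLE = (
    'DBpediaURL_s,WikipediaURL_s,abstractShort_txt\n'
    '"<http://example.org/resource/A>","<http://example.org/wiki/A>",'
    '"A short, quoted abstract."\n'
)


def rows_to_documents(csv_text):
    """Turn each data row into a dict keyed by the header's column names."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


docs = rows_to_documents(SAMPLE)
for field_name, value in docs[0].items():
    print(f"{field_name}: {value}")
```

Each dictionary plays the role of one fielded document: its keys are the column names from the header, and the comma inside the quoted abstract stays part of the field value.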

  1. Download Fusion.

  2. Unpack the archive: tar -xf fusion-3.1.x.tar.gz

  3. Start Fusion: fusion/3.1.x/bin/fusion start

  4. Go to http://localhost:8764/.

  5. Download the dataset and unzip it.

  6. Completeness check: Does the file contain the right number of lines? Your copy of this file should be 155,299 lines long.

    > wc -l cinema-short-abstracts.csv
    155299 cinema-short-abstracts.csv

    Tip
    You should always check that your data is complete, correct, and in the format that you expect before you import it into Fusion.

  7. Correctness check: View the first several lines of the file:

    > head -4 cinema-short-abstracts.csv | cat -n
         1	DBpediaURL_s,WikipediaURL_s,abstractShort_txt
         2	"<http://dbpedia.org/resource/!Women_Art_Revolution>","<http://en.wikipedia.org/wiki/!Women_Art_Revolution?oldid=498606361>","!Women Art Revolution is a 2010 documentary film directed by Lynn Hershman Leeson and distributed by Zeitgeist Films. It was released theatrically in the United States on June 1, 2011."
         3	"<http://dbpedia.org/resource/$100_Film_Festival>","<http://en.wikipedia.org/wiki/$100_Film_Festival?oldid=540809587>","The $100 Film Festival is an independent film festival that runs for three days every March at the Globe Cinema in downtown Calgary, Alberta. The festival showcases films in all genres by local and international independent artists who enjoy working with traditional film. Created in 1992 by the Calgary Society of Independent Filmmakers (CSIF), the $100 Film Festival started as a challenge for area filmmakers to a make low budget movie using Super8 film for less than $100."
         4	"<http://dbpedia.org/resource/$30_Film_School>","<http://en.wikipedia.org/wiki/$30_Film_School?oldid=498969852>","$30 Film School is a book written by Michael W. Dean instructing on filmmaking on a limited budget, and is part of the $30 School book series which includes $30 Music School and $30 Writing School. Like the other books of this series, $30 Film School advocates a start-to-finish DIY ethic, and includes interviews with professionals in the given field, as well as a CD or DVD of extras. Published by Muska & Lipman in 2003, the first edition sold 30,000 copies."
    • The first line is the CSV header.

    • Each column name ends with a suffix (_s or _txt) that indicates the Solr data type for that column; we’ll look at those data types later.

    • In the Solr index, the column names from the file become the field names of each document.

    • Fusion will process every row that is neither the header nor a comment into a Solr document.

  8. Correctness check: Verify the CSV format with csvlint.

    This particular data file was derived from a file downloaded from the DBpedia project. The original dataset contains UTF-8 characters outside of the ISO Latin 1 character set. The abstracts contain many quoted strings which must be properly escaped for CSV. Did the person who created the CSV file format it correctly?

    The answer is yes. To check for yourself, you can run the "csvlint" utility, which can be downloaded from https://github.com/Clever/csvlint.

    > csvlint cinema-short-abstracts.csv
    cinema-short-abstracts.csv is VALID
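The completeness and correctness checks in steps 6 through 8 can also be approximated with Python's standard csv module. The sketch below is a minimal stand-in under stated assumptions, not a reimplementation of csvlint: the function name `check_csv` and the sample data are hypothetical, and it checks only the newline count, strict quoting, and a consistent number of fields per record.

```python
import csv
import io


def check_csv(csv_text, expected_lines=None):
    """Return (line_count, record_count, problems) for a CSV string."""
    problems = []

    # Count newline characters, which is what `wc -l` reports.
    line_count = csv_text.count("\n")
    if expected_lines is not None and line_count != expected_lines:
        problems.append(f"expected {expected_lines} lines, found {line_count}")

    # strict=True makes the parser raise csv.Error on malformed quoting.
    reader = csv.reader(io.StringIO(csv_text), strict=True)
    record_count = 0
    try:
        header = next(reader, None)
        if header is None:
            problems.append("file is empty")
            return line_count, 0, problems
        for row in reader:
            record_count += 1
            if len(row) != len(header):
                problems.append(
                    f"record {record_count}: {len(row)} fields, "
                    f"expected {len(header)}"
                )
    except csv.Error as err:
        problems.append(f"malformed CSV near line {reader.line_num}: {err}")
    return line_count, record_count, problems


# Hypothetical two-line sample with an escaped quote inside a field.
GOOD = 'a_s,b_txt\n"x","say ""hi"", ok"\n'
print(check_csv(GOOD))  # (2, 1, [])
```

To run the same check on the tutorial file, read it with `open("cinema-short-abstracts.csv", newline="", encoding="utf-8").read()` and pass `expected_lines=155299`. An empty problems list means the file passed these particular checks, which are narrower than what csvlint verifies.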

1. Create a Collection

Fusion stores each dataset in a collection, which resides in the Solr index.

We’ll create a collection called "cinema_1" for our data:

  1. Log in to Fusion in your browser (http://localhost:8764/ if you installed Fusion locally). We support Chrome, Firefox, and recent versions of IE.

2. Define a Datasource

A datasource is an ingest configuration that’s associated with a collection. Ingested data belongs to the same collection as its datasource.

To configure Fusion to ingest the "cinema-short-abstracts.csv" file, you must define a datasource:

  1. In the collections list, click cinema_1.

    The collection’s Home toolbar appears.

3. Lessons Learned

  • Data validation - check your work early and often

    • The larger your dataset, the more often you should check your work

  • Fusion Collections - the underlying datastore

    • Creating a new collection

    • Viewing the collection fields

4. Next Steps

The next parts of this tutorial show you how Fusion provides tools to improve the user search experience:

  • Part Two shows you how to configure the index pipeline to provide better indexed data for search purposes.

  • Part Three explains how to configure the query pipeline to provide better processing for user queries.