Indexing Data

When a connector ingests data, it passes the data through an index pipeline for transformation prior to indexing. The format of your indexed data depends on the index pipeline, which is part of the datasource configuration.

An index pipeline consists of one or more configurable index pipeline stages, each performing a different type of transformation on the incoming data. Each connector has a default index pipeline; you can modify the defaults or create new pipelines of your own.

The last stage in any index pipeline should be the Solr Indexer stage, which submits the documents to Solr for indexing.
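As an illustration, the sketch below creates a simple two-stage pipeline through Fusion's REST API, ending with a Solr Indexer stage. The base URL, credentials, pipeline ID, and stage type names are placeholders, and the exact endpoint layout and stage schemas vary by Fusion version, so treat this as a sketch rather than a definitive configuration.

    # A minimal sketch of defining an index pipeline over Fusion's REST API.
    # Assumptions: base URL, credentials, pipeline id, and stage type names
    # ("field-mapping", "solr-index") are illustrative and version-dependent.
    import json
    import requests

    FUSION = "http://localhost:8764/api/apollo"   # assumed base URL
    AUTH = ("admin", "password123")               # assumed credentials

    pipeline = {
        "id": "my-index-pipeline",
        "stages": [
            {   # map incoming fields to the fields the collection expects
                "type": "field-mapping",
                "mappings": [
                    {"source": "plaintext", "target": "body_t", "operation": "move"}
                ],
            },
            {   # final stage: submit the transformed documents to Solr
                "type": "solr-index",
                "enforceSchema": True,
            },
        ],
    }

    resp = requests.post(f"{FUSION}/index-pipelines", auth=AUTH,
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(pipeline))
    resp.raise_for_status()
    print("Created pipeline:", resp.json().get("id"))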

The following pages provide details on how to configure and use Fusion’s index pipelines:

  • Fusion PipelineDocument Objects describes the datatype on which pipelines operate.

  • Pushing Documents to a Pipeline is a brief overview of how to quickly index content by sending it to a pipeline directly (instead of using a datasource); a minimal example appears after this list.

  • Index Profiles explains how to use a single pipeline with multiple collections.

  • Entity-Extraction covers Natural Language Processing (NLP) tools and techniques for finding names of people, places, and things.

  • Blob Storage describes how to upload large binary objects to Fusion. The language model files required by the entity extraction pipeline stages are loaded into Fusion through this mechanism.

  • Time-Based Partitioning explains how to configure a collection to automatically index data by time slice.

  • The Index Pipeline Simulator is a tool for previewing the output of a pipeline while you configure it.
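As a quick illustration of pushing documents directly to a pipeline, the sketch below posts two sample documents through the REST API. The base URL, credentials, pipeline ID, and collection name are placeholders, and the endpoint layout shown reflects older Fusion releases, so adjust it for your version.

    # A minimal sketch of pushing documents straight to an index pipeline.
    # Assumptions: base URL, credentials, pipeline id ("my-index-pipeline"),
    # and collection name ("my-collection") are placeholders.
    import requests

    FUSION = "http://localhost:8764/api/apollo"
    AUTH = ("admin", "password123")

    docs = [
        {"id": "doc-1", "title_t": "Hello, Fusion", "body_t": "A first test document."},
        {"id": "doc-2", "title_t": "Second document", "body_t": "More sample text."},
    ]

    url = f"{FUSION}/index-pipelines/my-index-pipeline/collections/my-collection/index"
    resp = requests.post(url, auth=AUTH, json=docs)
    resp.raise_for_status()
    print("Submitted", len(docs), "documents; status", resp.status_code)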