Pushing Documents to a Pipeline

Documents can be sent directly to an index pipeline using the Index-Pipelines REST API. The request path is:

/api/apollo/index-pipelines/<id>/collections/<collectionName>/index

where <id> is the name of a specific pipeline and <collectionName> is the name of a specific collection.

These requests are sent as POST requests. The Content-Type request header specifies the format of the request body.

To stream a list of JSON documents into the index pipeline, send the JSON file that holds these objects to the endpoint above with "application/json" as the content type. If the JSON file is a list/array of many items, the pipeline operates in a streaming fashion, indexing the documents as it goes.

Example:

The JSON file "myJsonDoc.json" holds 4.3M entries. Send it to the index pipeline with the following command:

curl -u user:password -X POST -H 'Content-Type: application/json' -d@myJsonDoc.json "http://localhost:8764/api/apollo/index-pipelines/<id>/collections/<collectionName>/index"
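
For a quick test without a file, the same endpoint accepts an inline JSON array. A minimal sketch; the field names here are illustrative, not required:

curl -u user:password -X POST -H 'Content-Type: application/json' -d '[{"id": "1", "title": "First example"}, {"id": "2", "title": "Second example"}]' "http://localhost:8764/api/apollo/index-pipelines/<id>/collections/<collectionName>/index"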

Documents can be created on the fly using the PipelineDocument JSON notation. See Fusion PipelineDocument Objects for details and an example of how to do this.
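
For illustration, a minimal sketch of such a request; the exact notation, and the content type used here, should be checked against Fusion PipelineDocument Objects:

curl -u user:password -X POST -H 'Content-Type: application/vnd.lucidworks-document' -d '[{"id": "doc1", "fields": [{"name": "title", "value": "Example title"}]}]' "http://localhost:8764/api/apollo/index-pipelines/<id>/collections/<collectionName>/index"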

Indexing PDFs and MS Office Documents

If you can access the filesystem where the PDF or MS Office documents reside, you can index them using a properly configured datasource with the appropriate connector for that filesystem type. See Connectors and Datasources Reference for a list of all Fusion connectors.

If, however, there are obstacles to using the connectors, it may be simpler to index these types of documents with an index pipeline. Pipelines can only be used via REST API calls; complete documentation is in the section Index Pipelines API.

When sending documents, it’s important to set the content type header to match the content being sent. This is not a complete list, but here are some frequently used content types:

  • PDF documents: application/pdf

  • MS Office:

    • .docx: application/vnd.openxmlformats-officedocument.wordprocessingml.document

    • .xlsx: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

    • .pptx: application/vnd.openxmlformats-officedocument.presentationml.presentation

    • More types: http://filext.com/faq/office_mime_types.php

  • Text: text/json, text/xml, text/csv, etc.

Examples

Index a PDF document through the 'conn_solr' index pipeline to a collection named 'docs'. The pre-configured 'conn_solr' pipeline includes stages to parse documents with Tika, map fields, and index the documents to Solr (in that order).

curl -u user:pass -X POST -H "Content-Type: application/pdf" --data-binary @/solr/core/src/test-files/mailing_lists.pdf http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index

Index one of the example Solr XML documents found in the ./example/exampledocs directory of Solr. For this example to work well, the default pipeline was modified to include a field mapping stage in addition to indexing the documents to Solr. In this example, the custom pipeline is named 'docs-default' and the collection is 'docs'.

curl -u user:pass -X POST -H "Content-Type: text/xml" --data-binary @/Applications/solr-4.10.0/example/exampledocs/hd.xml http://localhost:8764/api/apollo/index-pipelines/docs-default/collections/docs/index
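
An MS Office document can be indexed the same way by setting the matching content type from the list above. A sketch, assuming the 'conn_solr' pipeline and 'docs' collection from the first example; the file path is illustrative:

curl -u user:pass -X POST -H "Content-Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" --data-binary @/path/to/report.docx http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index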

Indexing CSV Files

In the usual case, a CSV (or TSV) file is split into records, one per row, and each row is indexed as a separate document. Datasources that use crawlers based on either the lucid.anda or lucid.fs framework can do the CSV splitting as part of the connector process.

Alternatively, the index pipeline can include a CSV Parsing stage.
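
A sketch of pushing a CSV file through such a pipeline, assuming a pipeline named 'docs-default' that includes a CSV Parsing stage and a collection named 'docs'; the file path is illustrative:

curl -u user:pass -X POST -H "Content-Type: text/csv" --data-binary @/path/to/data.csv http://localhost:8764/api/apollo/index-pipelines/docs-default/collections/docs/index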