Index Pipelines API

The Index Pipelines API provides methods for managing a set of named index pipelines. Every pipeline is made up of one or more stages. Stages can be defined when a pipeline is created, or they can be defined separately and included in one or more pipelines. For details of the REST API for index stages, see Index Stages API.

Document processing proceeds stage by stage in a linear fashion. The order of the stages in a pipeline is the order in which they were defined. At installation, Fusion includes several pre-configured pipelines. See Index Pipelines for details on these default pipelines.

For more information about structuring documents for indexing, see Pushing Documents to a Pipeline.

Manage Index Pipelines: List, Create, Update, Delete

The path for this request is:

/api/apollo/index-pipelines/<id>

where <id> is the name of a specific pipeline.

  • GET - returns the definition of a specific index pipeline, or all defined pipelines, if the pipeline name is omitted.

  • POST - creates a new pipeline. The pipeline name is included in the definition in the request body.

  • PUT - modifies an existing pipeline. Used to reorder pipeline stages.

  • DELETE - removes the pipeline.
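As a rough illustration of how these four methods map onto the path, the following Python sketch builds the HTTP method and URL for each operation without sending anything. The host and port come from the examples later in this document; the helper name pipeline_request and its action names are hypothetical and not part of the API.

```python
from typing import Optional, Tuple
from urllib.parse import quote

# Host and port as used in the curl examples in this document
BASE_URL = "http://localhost:8764/api/apollo/index-pipelines"

# Hypothetical action names mapped to the HTTP methods listed above
METHODS = {"list": "GET", "get": "GET", "create": "POST",
           "update": "PUT", "delete": "DELETE"}


def pipeline_request(action: str, pipeline_id: Optional[str] = None) -> Tuple[str, str]:
    """Return (HTTP method, URL) for a pipeline management action.

    Illustration only: this builds the request line and sends nothing.
    """
    if action not in METHODS:
        raise ValueError(f"unknown action: {action}")
    # Listing omits the pipeline name; creating carries the name in the body
    if action in ("list", "create"):
        return METHODS[action], BASE_URL
    if not pipeline_id:
        raise ValueError(f"{action} requires a pipeline id")
    return METHODS[action], f"{BASE_URL}/{quote(pipeline_id, safe='')}"
```

For example, pipeline_request("delete", "my-index-pipeline") yields the DELETE method together with the path for that pipeline.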

Index Pipeline Definition Properties

Property Description

id
Required

A unique ID for the pipeline.

stages
Optional

A JSON array that lists the index pipeline stages that make up this pipeline. Each stage entry should include:

  • id: the ID of the stage to include.

  • type: the stage type. You can define a stage directly in a pipeline, or you can use a pre-existing stage. To use a pre-existing stage, set the type to "ref", which marks the entry as a reference to an existing stage.

  • skip: whether the stage should be skipped during pipeline processing. The default is false.

  • Finally, if a stage has not already been defined, the appropriate stage properties should be included. See Index Pipeline Stages for details of the properties for each stage type.
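The properties above can be sketched in Python as follows. This is a minimal example that assembles a pipeline definition mixing one stage defined in place with two references to existing stages, mirroring the 'my-index-pipeline' example later in this document. The helper names (ref_stage, inline_stage, pipeline_definition) are hypothetical.

```python
import json
from typing import Any, Dict, List


def ref_stage(stage_id: str, skip: bool = False) -> Dict[str, Any]:
    """Entry that points at a stage which already exists (type "ref")."""
    return {"id": stage_id, "type": "ref", "skip": skip}


def inline_stage(stage_id: str, stage_type: str, skip: bool = False,
                 **properties: Any) -> Dict[str, Any]:
    """Entry that defines a new stage directly in the pipeline."""
    stage = {"id": stage_id, "type": stage_type, "skip": skip}
    # Stage-specific properties, e.g. includeImages for a tika-parser stage
    stage.update(properties)
    return stage


def pipeline_definition(pipeline_id: str, stages: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Stages are processed in the order given in the list."""
    return {"id": pipeline_id, "stages": stages}


definition = pipeline_definition("my-index-pipeline", [
    inline_stage("tika", "tika-parser", includeImages=True),
    ref_stage("conn_mapping"),
    ref_stage("solr-default"),
])
body = json.dumps(definition)  # request body for the POST that creates the pipeline
```

Keeping the stages in a list rather than a map preserves their order, which matters because documents pass through the stages in sequence.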

Refresh an Index Pipeline

When changes are made to a pipeline, the pipeline needs to be refreshed (reloaded) via a PUT request to the endpoint:

/api/apollo/index-pipelines/<id>/refresh

A Fusion restart also refreshes all pipelines. If the refresh is successful, no response is returned.
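A refresh request might be prepared in Python along these lines. This sketch only constructs the request object using the standard library and does not send it; authentication is left out, and the helper name refresh_request and the default base URL (taken from the examples in this document) are assumptions.

```python
from urllib.parse import quote
from urllib.request import Request


def refresh_request(pipeline_id: str, base: str = "http://localhost:8764") -> Request:
    """Build (but do not send) the PUT request that reloads one pipeline."""
    url = f"{base}/api/apollo/index-pipelines/{quote(pipeline_id, safe='')}/refresh"
    # No body is attached; the endpoint is addressed by path alone
    return Request(url, method="PUT")


req = refresh_request("my-index-pipeline")
```

Passing the request to urllib.request.urlopen would send it, provided credentials are added as your installation requires.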

Submit a Set of Documents to an Index Pipeline

The path for this request is:

/api/apollo/index-pipelines/<id>/collections/<collectionName>/index

where <id> is the name of a specific pipeline and <collectionName> is the name of a specific collection.

Documents are indexed through a POST request. The documents to be indexed are sent in the request body.

This request takes the following optional query parameters:

Parameter Description

simulate

If true, documents won’t be sent to Solr for indexing, i.e., the solr-index stage is skipped. The default is false.

echo

If true, the default, the list of documents indexed will be returned. If this is false, no output will be returned when the pipeline has finished.

bufferDocsForSolr

If true, documents will be buffered before sending to Solr. This is an asynchronous mode and may give faster performance when indexing a large number of documents. The default is false.

eventsCollection

Required for event processing: a string containing the name of the target collection in which link events are indexed.

eventsPipeline

Required for event processing: a string containing the name of the index pipeline that processes the link events.

eventTypes

Optional: a comma-separated list of event types to be processed by the index pipeline specified by the eventsPipeline parameter. Currently, the only event type is "links".
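To show how these query parameters attach to the path, here is a small Python sketch that builds the index URL. The helper name index_url is hypothetical, and rendering booleans as lowercase true/false is an assumption based on how the values are written in the table above.

```python
from typing import Any
from urllib.parse import quote, urlencode


def index_url(pipeline_id: str, collection: str,
              base: str = "http://localhost:8764", **params: Any) -> str:
    """Build the URL for submitting documents, with optional query parameters."""
    path = (f"{base}/api/apollo/index-pipelines/{quote(pipeline_id, safe='')}"
            f"/collections/{quote(collection, safe='')}/index")
    query = {}
    for name, value in params.items():
        if value is None:
            continue  # leave the parameter out so the server default applies
        # Assumption: booleans are sent as lowercase "true" / "false"
        query[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{path}?{urlencode(query)}" if query else path
```

For instance, index_url("conn_solr", "my-docs", simulate=True) produces a URL that runs the pipeline without sending the documents to Solr.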

Input

The index pipelines can take any document format that is supported by Tika (assuming the pipeline contains a tika-parser stage). The document format is indicated by a content type declaration in the header of the REST call.

Documents can be submitted using the PipelineDocument JSON notation. The content type header to use for this format is:

application/vnd.lucidworks-document

Parameter Description

id
Optional

The id of the document.

If the document does not have an ID, one will be automatically generated during processing.

fields
Optional

The content of the document arranged into fields. The fields are expressed as a JSON array of objects, each containing a "name" property and a "value" property, as in:

"fields":[{"name":"fieldName","value":"fieldValue"}]

For multiple values of a field, repeat the object with the same "name" and a different "value", as in:

"fields":[{"name":"fieldName","value":"fieldValue1"},{"name":"fieldName","value":"fieldValue2"}]

commands
Optional

Commands can be added to documents to tell the index what to do with the document.

Documents can consist entirely of a command in order to issue a commit, or delete documents based on a query.

The valid commands are:

  • add: add a document to the index. If the document already exists, it will be removed and the new document added.

  • delete: remove a document from the index.

  • commit: issue a commit to the index. This command would be used as the last document in a set of documents to commit them all to the index.

  • query:

  • delete_by:

metadata
Optional

This property is used by stages to record how field data was retrieved from a document, and for other purposes.
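The document properties above can be assembled programmatically. The following Python sketch builds PipelineDocument JSON from (name, value) pairs, repeating the name for multi-valued fields; the command shape follows the "commands" example later in this document. The helper name pipeline_document and the sample field values are illustrative only.

```python
import json
from typing import Any, Dict, Iterable, Optional, Tuple

CONTENT_TYPE = "application/vnd.lucidworks-document"

Pairs = Iterable[Tuple[str, Any]]


def pipeline_document(doc_id: Optional[str] = None,
                      fields: Optional[Pairs] = None,
                      commands: Optional[Pairs] = None) -> Dict[str, Any]:
    """Build one document; every property is optional, so absent ones are omitted."""
    doc: Dict[str, Any] = {}
    if doc_id is not None:
        doc["id"] = doc_id
    if fields:
        # A multi-valued field is expressed by repeating the same name
        doc["fields"] = [{"name": n, "value": v} for n, v in fields]
    if commands:
        doc["commands"] = [{"name": n, "value": v} for n, v in commands]
    return doc


docs = [
    pipeline_document("myDoc1", fields=[("title", "My first document"),
                                        ("tag", "simple"), ("tag", "example")]),
    pipeline_document("myDoc2", commands=[("delete", "myDoc2"), ("commit", "true")]),
]
payload = json.dumps(docs)  # send with the Content-Type header set to CONTENT_TYPE
```

The documents are always wrapped in a JSON array, even when only one is submitted, to match the request bodies shown in the examples.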

Output

By default, the output is a JSON representation of each document, including all of its fields after the pipeline has completed processing. This may include the original fields of the document as well as any fields added by the pipeline stages.

If the query parameter 'echo' has been set to false, no output will be returned.
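Because each returned field is an object with "name" and "value" properties, it can be convenient to group the values by field name. The Python sketch below does that for a single returned document. It assumes, based only on the example output in this document, that a two-element list whose first item is a Java class name (such as "java.lang.Long") is a type-tagged value; the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def unwrap(value: Any) -> Any:
    """Strip a Java type tag such as ["java.lang.Long", 7] down to 7.

    Assumption drawn from the example output: tagged values are two-element
    lists whose first element is a string starting with "java.".
    """
    if (isinstance(value, list) and len(value) == 2
            and isinstance(value[0], str) and value[0].startswith("java.")):
        return value[1]
    return value


def fields_by_name(doc: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Group field values by name; repeated names become multi-element lists."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for field in doc.get("fields", []):
        grouped[field["name"]].append(unwrap(field["value"]))
    return dict(grouped)


sample = {
    "id": "myDoc1",
    "fields": [
        {"name": "title", "value": "My first document", "metadata": {}, "annotations": []},
        {"name": "parsing_time_l", "value": ["java.lang.Long", 7], "metadata": {}, "annotations": []},
    ],
}
flat = fields_by_name(sample)
```

Values are kept in lists even for single-valued fields so that callers handle both cases the same way.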

Debug a Pipeline

The path for this request is:

/api/apollo/index-pipelines/<id>/collections/<collectionName>/debug

where <id> is the name of a specific pipeline and <collectionName> is the name of a specific collection.

Debugging is done via a POST request. The documents to be debugged are sent in the request body. The output shows the state of the document after each indexing stage. By default, a debug request indexes the documents to the system, but you can prevent this by setting the query parameter simulate to true.

The request takes two optional query parameters:

Parameter Description

simulate

If true, documents won’t be sent to Solr for indexing, i.e., the solr-index stage is skipped. The default is false.

bufferDocsForSolr

If true, documents will be buffered before sending to Solr. This is an asynchronous mode and may give faster performance when indexing a large number of documents.

The output will include details of each stage of the pipeline and a JSON representation of each document as it passed through each stage, including all fields of the document (original fields of the document and any fields added by the pipeline stages).
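The debug response can be long, so a condensed view is often easier to read. The Python sketch below walks the "output" array and reports, for each stage, which field names each document carried after that stage. The key names (output, stageId, docs, fields) are taken from the example debug output in this document; the helper name summarize_debug is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def summarize_debug(debug_response: Dict[str, Any]) -> List[Tuple[str, Dict[str, List[str]]]]:
    """Return [(stageId, {document id: sorted field names}), ...] in stage order."""
    summary = []
    for entry in debug_response.get("output", []):
        docs = {
            doc["id"]: sorted(field["name"] for field in doc.get("fields", []))
            for doc in entry.get("docs", [])
        }
        summary.append((entry["stageId"], docs))
    return summary


# Tiny hand-written response in the same shape as the example debug output
sample = {
    "stages": [],
    "output": [
        {"stageType": "tika-parser", "stageId": "conn_tika",
         "docs": [{"id": "d1", "fields": [{"name": "parsing_s", "value": "no_raw_data"},
                                          {"name": "attr_id_", "value": "myDoc1"}]}]},
        {"stageType": "solr-index", "stageId": "conn_solr",
         "docs": [{"id": "d1", "fields": [{"name": "attr_id_", "value": "myDoc1"}]}]},
    ],
}
report = summarize_debug(sample)
```

Comparing the field lists of consecutive stages shows which stage added, renamed, or dropped a field.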

Examples

List the 'default' pipeline:

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/index-pipelines/default

RESPONSE

{
  "id" : "default",
  "stages" : [ {
    "type" : "solr-index",
    "id" : "solr-default",
    "skip" : false
  } ]
}

Create an index pipeline named 'my-index-pipeline' with three stages, one of which does not yet exist:

REQUEST

curl -u user:pass -X POST -H 'Content-type: application/json' -d '{"id":"my-index-pipeline","stages":[{"id":"tika","type":"tika-parser","includeImages":true},{"id":"conn_mapping","type":"ref"},{"id":"solr-default","type":"ref"}]}' http://localhost:8764/api/apollo/index-pipelines

RESPONSE

{
  "id" : "my-index-pipeline",
  "stages" : [ {
    "type" : "tika-parser",
    "id" : "tika",
    "includeImages" : true,
    "flattenCompound" : false,
    "addFailedDocs" : false,
    "addOriginalContent" : true,
    "contentField" : "_raw_content_",
    "skip" : false,
    "label" : "tika-parser"
  }, {
    "type" : "ref",
    "id" : "conn_mapping",
    "skip" : false,
    "label" : "ref"
  }, {
    "type" : "ref",
    "id" : "solr-default",
    "skip" : false,
    "label" : "ref"
  } ]
}

Reload the 'my-index-pipeline' pipeline:

INPUT

curl -u user:pass -X PUT http://localhost:8764/api/apollo/index-pipelines/my-index-pipeline/refresh

Index two JSON documents through a pipeline named 'conn_solr' into a collection named 'my-docs':

INPUT

curl -u user:pass -X POST -H "Content-Type: application/vnd.lucidworks-document" -d '[{"id": "myDoc1","fields": [{"name":"title", "value": "My first document"},{"name":"body", "value": "This is a simple document."}]}, {"id": "myDoc2","fields": [{"name":"title","value": "My second document"},{"name":"body", "value": "This is another simple document."}]}]' http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/my-docs/index

OUTPUT

[ {
  "id" : "myDoc1",
  "fields" : [ {
    "name" : "content",
    "value" : "This is a simple document.",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "title",
    "value" : "My first document",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "parsing_s",
    "value" : "no_raw_data",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "parsing_time_l",
    "value" : [ "java.lang.Long", 7 ],
    "metadata" : { },
    "annotations" : [ ]
  } ],
  "metadata" : { },
  "commands" : [ ]
}, {
  "id" : "myDoc2",
  "fields" : [ {
    "name" : "content",
    "value" : "This is another simple document.",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "title",
    "value" : "My second document",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "parsing_s",
    "value" : "no_raw_data",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "parsing_time_l",
    "value" : [ "java.lang.Long", 0 ],
    "metadata" : { },
    "annotations" : [ ]
  } ],
  "metadata" : { },
  "commands" : [ ]
} ]

Index a PDF document with the 'conn_solr' pipeline:

INPUT

curl -u user:pass -X POST -H "Content-Type: application/pdf" --data-binary @/solr/core/src/test-files/mailing_lists.pdf http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/my-docs/index

OUTPUT

[ {
  "id" : "d6c7757e-33d9-4fbb-aa38-eef84d679ca9",
  "fields" : [ {
    "name" : "fileSize_l",
    "value" : "8582",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "parsing_s",
    "value" : "no_raw_data",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "pageCount_i",
    "value" : "2",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "parsing_time_l",
    "value" : [ "java.lang.Long", 1171 ],
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "parsing_time_l",
    "value" : [ "java.lang.Long", 0 ],
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "attr_pdf:encrypted_",
    "value" : "false",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "attr_X-Parsed-By_",
    "value" : "org.apache.tika.parser.pdf.PDFParser",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "attr_pdf:PDFVersion_",
    "value" : "1.3",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "attr_producer_",
    "value" : "FOP 0.20.5",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "content",
    "value" : "\nSolr Mailing Lists\n\nTable of contents\n1 ",
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "attr_dc:format_",
    "value" : "application/pdf; version=1.3",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "mimeType_s",
    "value" : "application/pdf",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  } ],
  "metadata" : { },
  "commands" : [ ]
} ]

Send documents through the 'conn_solr' pipeline to a collection called 'docs', using the "commands" property to delete two documents and then commit:

INPUT

curl -u user:pass -X POST -H "Content-Type: application/vnd.lucidworks-document" -d '[{"id": "myDoc2","commands": [{"name":"delete","value": "myDoc2"}]},{"id": "myDoc1","commands": [{"name":"delete","value": "myDoc1"},{"name":"commit","value": "true"}]}]' http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index

OUTPUT

[ {
  "id" : "myDoc2",
  "fields" : [ ],
  "commands" : [ {
    "name" : "delete",
    "params" : { }
  } ]
}, {
  "id" : "myDoc1",
  "fields" : [ ],
  "commands" : [ {
    "name" : "delete",
    "params" : { }
  }, {
    "name" : "commit",
    "params" : { }
  } ]
} ]

Index two simple documents through a pipeline named 'conn_solr' into a collection named 'my-docs' and get detailed output of the pipeline process:

INPUT

curl -u user:pass -X POST -H "Content-Type: application/vnd.lucidworks-document" -d '[{"id": "myDoc1","fields": [{"name":"title", "value": "My first document"},{"name":"body", "value": "This is a simple document."}]}, {"id": "myDoc2","fields": [{"name":"title","value": "My second document"},{"name":"body", "value": "This is another simple document."}]}]' http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/my-docs/debug

OUTPUT

The output will include how each document passed through each stage. (In the example output below, we have truncated the 'field-mapping' stage for space.)

{
  "stages" : [ {
    "type" : "tika-parser",
    "id" : "conn_tika",
    "includeImages" : true,
    "flattenCompound" : false,
    "addFailedDocs" : true,
    "addOriginalContent" : true,
    "contentField" : "_raw_content_",
    "skip" : false
  }, {
    "type" : "field-mapping",
    "id" : "conn_mapping",
    "mappings" : [
...
],
    "unmapped" : {
      "source" : "/(.*)/",
      "target" : "attr_$1_",
      "operation" : "move"
    },
    "skip" : false
  }, {
    "type" : "multivalue-resolver",
    "id" : "conn_multivalue_resolver",
    "typeStrategy" : [ {
      "fieldName" : "string",
      "resolverStrategy" : "pick_last"
    } ],
    "skip" : false
  }, {
    "type" : "solr-index",
    "id" : "conn_solr",
    "enforceSchema" : true,
    "skip" : false
  } ],
  "output" : [ {
    "stageType" : "tika-parser",
    "stageId" : "conn_tika",
    "context" : {
      "simulate" : false,
      "stageIndex" : 0,
      "collection" : "docs",
      "async" : false
    },
    "docs" : [ {
      "id" : "6b5c10f1-d941-41a6-957f-f677f5ad0fd5",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc1",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My first document"
        }, {
          "name" : "body",
          "value" : "This is a simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    }, {
      "id" : "4dac3c4e-d7f5-4cbd-96dc-e2eae69711e3",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc2",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My second document"
        }, {
          "name" : "body",
          "value" : "This is another simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    } ]
  }, {
    "stageType" : "field-mapping",
    "stageId" : "conn_mapping",
    "context" : {
      "simulate" : false,
      "stageIndex" : 1,
      "collection" : "docs",
      "async" : false
    },
    "docs" : [ {
      "id" : "6b5c10f1-d941-41a6-957f-f677f5ad0fd5",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc1",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My first document"
        }, {
          "name" : "body",
          "value" : "This is a simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    }, {
      "id" : "4dac3c4e-d7f5-4cbd-96dc-e2eae69711e3",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc2",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My second document"
        }, {
          "name" : "body",
          "value" : "This is another simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    } ]
  }, {
    "stageType" : "multivalue-resolver",
    "stageId" : "conn_multivalue_resolver",
    "context" : {
      "simulate" : false,
      "stageIndex" : 2,
      "collection" : "docs",
      "async" : false
    },
    "docs" : [ {
      "id" : "6b5c10f1-d941-41a6-957f-f677f5ad0fd5",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc1",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My first document"
        }, {
          "name" : "body",
          "value" : "This is a simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    }, {
      "id" : "4dac3c4e-d7f5-4cbd-96dc-e2eae69711e3",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc2",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My second document"
        }, {
          "name" : "body",
          "value" : "This is another simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    } ]
  }, {
    "stageType" : "solr-index",
    "stageId" : "conn_solr",
    "context" : {
      "simulate" : false,
      "stageIndex" : 3,
      "collection" : "docs",
      "async" : false
    },
    "docs" : [ {
      "id" : "6b5c10f1-d941-41a6-957f-f677f5ad0fd5",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc1",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My first document"
        }, {
          "name" : "body",
          "value" : "This is a simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    }, {
      "id" : "4dac3c4e-d7f5-4cbd-96dc-e2eae69711e3",
      "fields" : [ {
        "name" : "attr_id_",
        "value" : "myDoc2",
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "parsing_s",
        "value" : "no_raw_data",
        "metadata" : {
          "creator" : "tika-parser"
        },
        "annotations" : [ ]
      }, {
        "name" : "parsing_time_l",
        "value" : [ "java.lang.Long", 0 ],
        "metadata" : { },
        "annotations" : [ ]
      }, {
        "name" : "attr_fields_",
        "value" : [ "java.util.ArrayList", [ {
          "name" : "title",
          "value" : "My second document"
        }, {
          "name" : "body",
          "value" : "This is another simple document."
        } ] ],
        "metadata" : { },
        "annotations" : [ ]
      } ],
      "metadata" : { },
      "commands" : [ ]
    } ]
  } ]
}