Fusion PipelineDocument Objects

A PipelineDocument organizes the contents of each document submitted to the pipeline, together with document-level metadata and processing commands, into a list of fields. Each field has a string name, a value, an associated metadata object, and a list of annotations. A Solr Indexer stage transforms a PipelineDocument into a Solr document and submits it to Solr for indexing.

The PipelineDocument Java Object

Under the Fusion hood, a PipelineDocument is a Java object; see the PipelineDocument javadocs for details.
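
As a rough illustration, here is a minimal Java sketch that builds a document and adds a field. The package path, constructor, and method names are assumptions drawn from typical Fusion usage, not verified signatures; consult the PipelineDocument javadocs for the actual API.

import com.lucidworks.apollo.common.pipeline.PipelineDocument;

public class PipelineDocumentSketch {
    public static void main(String[] args) {
        // Assumption: a constructor taking the document id exists.
        PipelineDocument doc = new PipelineDocument("/Users/demo/test_email.eml");

        // Assumption: addField(name, value) adds a field with empty
        // metadata and annotations.
        doc.addField("subject", "this is the subject of email message");

        System.out.println(doc);
    }
}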

JSON representation of a PipelineDocument

The JSON representation of a PipelineDocument object has four fields (a minimal example follows this list):

  • id : value is a string identifier

  • commands : value is a list of processing commands for the index (optional)

  • metadata : value is a single object containing a name : value pair (optional)

  • fields : value is a list of field objects, where a field object consists of four fields:

    • name : value is a string containing the field name

    • value : value is a string containing the field value

    • metadata : value is a single object containing a name : value pair (optional)

    • annotations : value is a list of annotations (optional)
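
Putting these pieces together, a minimal PipelineDocument might look like the following. This is a schematic sketch, not output from a real pipeline; the id, field name, and values are placeholders.

{ "id" : "doc-1",
  "metadata" : { },
  "commands" : [ ],
  "fields" : [
    { "name" : "title",
      "value" : "An example title",
      "metadata" : { },
      "annotations" : [ ] }
  ]
}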

Pipeline stages add, remove, and update the fields of the PipelineDocument. The Solr Indexer stage transforms the list of PipelineDocument fields into a set of Solr document fields.
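
Schematically, each PipelineDocument field becomes a Solr field of the same name, and the document id typically maps to the Solr id field. Using the placeholder document above, the resulting Solr document would look roughly like:

{ "id" : "doc-1",
  "title" : "An example title" }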

The commands field can be used to issue a commit at the end of document processing or to delete documents that match an included query.
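
For example, a command-only document along the following lines would request a delete-by-query. This is a sketch: the command structure matches the commit example shown later, but the parameter name for the query is an assumption, so check the documentation for your Fusion version.

{ "fields" : [ ],
  "commands" : [ {
    "name" : "delete",
    "params" : { "query" : "title:obsolete" }
  } ]
}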

If a pipeline includes a logging stage, the PipelineDocument will be pretty-printed to the Fusion connectors logfile (default location $FUSION/var/log/connectors/connectors.log). To see how this works, we set up a pipeline consisting of an initial logging stage, followed by an Apache Tika Parser stage, followed by another logging stage, followed by a Field Mapping stage.

We define a datasource named "email" configured with the lucid.anda filesystem connector, which submits the contents of a file named "test_email.eml" to the pipeline.

The output of the initial logging stage, as captured in the connectors logfile, is shown below. The raw bytes from the file are encoded as a BASE64 string in the field "_raw_content_". After the initial logging stage (before Tika parsing), the PipelineDocument object is:

{ "id" : "/Users/demo/test_email.eml",
  "metadata" : { },
  "commands" : [ ],
  "fields" : [
    { "name" : "_lw_batch_id_s",
      "value" : "14982991ada04c62a77cbb9ee4b32439",
      "metadata" : { },
      "annotations" : [ ] },
    { "name" : "_lw_data_source_s",
      "value" : "email",
      "metadata" : { },
      "annotations" : [ ] },
    { "name" : "_lw_data_source_collection_s",
      "value" : "foo",
      "metadata" : { },
      "annotations" : [ ] },
    { "name" : "_lw_data_source_pipeline_s",
      "value" : "conn_logging",
      "metadata" : { },
      "annotations" : [ ] },
    { "name" : "_lw_data_source_type_s",
      "value" : "lucid.anda/file",
      "metadata" : { },
      "annotations" : [ ] },
    { "name" : "lastModified_dt",
      "value" : "2015-02-28T14:14:32Z",
      "metadata" : { "creator" : "lucid.anda" },
      "annotations" : [ ] },
    { "name" : "_raw_content_",
      "value" : "TUlNRS1WZXJzaW9uOiAxLjAKUmVjZWl2ZWQ6IGJ5IDEwLjIwMi4yMjguMTk3IHdpdGggSFRUUDsgRnJpLCAyNyBGZWIgMjAxNSAxOToxMzo0MSAtMDgwMCAoUFNUKQpEYXRlOiBTYXQsIDI4IEZlYiAyMDE1IDE0OjEzOjQxICsxMTAwCkRlbGl2ZXJlZC1UbzogbWl0emkubW9ycmlzQGx1Y2lkd29ya3MuY29tCk1lc3NhZ2UtSUQ6IDxDQU03UFJDVjVuYzQxMTJ2Ym5hdEtKNk05RFVEU0prVzFpeXcxUHhhLWJaWFVCZ1FlV3dAbWFpbC5nbWFpbC5jb20+ClN1YmplY3Q6IHRoaXMgaXMgdGhlIHN1YmplY3Qgb2YgZW1haWwgbWVzc2FnZQpGcm9tOiBNaXR6aSBNb3JyaXMgPG1pdHppLm1vcnJpc0BsdWNpZHdvcmtzLmNvbT4KVG86IE1pdHppIE1vcnJpcyA8bWl0emkubW9ycmlzQGx1Y2lkd29ya3MuY29tPgpDb250ZW50LVR5cGU6IG11bHRpcGFydC9hbHRlcm5hdGl2ZTsgYm91bmRhcnk9MDQ3ZDdiZDc1ZjJhM2YzNGYzMDUxMDFkNWY5OQoKLS0wNDdkN2JkNzVmMmEzZjM0ZjMwNTEwMWQ1Zjk5CkNvbnRlbnQtVHlwZTogdGV4dC9wbGFpbjsgY2hhcnNldD1VVEYtOAoKdGhpcyBpcyB0aGUgZmlyc3QgbGluZSBvZiB0aGUgYm9keSBvZiBhbiBlbWFpbCBtZXNzYWdlLgoKYW5kIHRoaXMgaXMgdGhlIHNlY29uZCBsaW5lLgoKYW5kIHRoaXMgaXMgdGhlIGNsb3NpbmcsIGNoZWVycywKCi0tMDQ3ZDdiZDc1ZjJhM2YzNGYzMDUxMDFkNWY5OQpDb250ZW50LVR5cGU6IHRleHQvaHRtbDsgY2hhcnNldD1VVEYtOAoKPGRpdiBkaXI9Imx0ciI+dGhpcyBpcyB0aGUgZmlyc3QgbGluZSBvZiB0aGUgYm9keSBvZiBhbiBlbWFpbCBtZXNzYWdlLjxkaXY+PGJyPjwvZGl2PjxkaXY+YW5kIHRoaXMgaXMgdGhlIHNlY29uZCBsaW5lLjwvZGl2PjxkaXY+PGJyPjwvZGl2PjxkaXY+YW5kIHRoaXMgaXMgdGhlIGNsb3NpbmcsIGNoZWVycyw8L2Rpdj48ZGl2Pjxicj48L2Rpdj48L2Rpdj4KCi0tMDQ3ZDdiZDc1ZjJhM2YzNGYzMDUxMDFkNWY5OS0tCg==" ,
      "metadata" : { },
      "annotations" : [ ] },
  ]
}

After processing by the Tika parser and the Field Mapping stage, the document contains additional fields, e.g.:

{ "id" : "/Users/mitzimorris/tmp/test_email.eml",
  "fields" : [ {
    "name" : "parsing_time_l",
    "value" : [ "java.lang.Long", 157 ],
    "metadata" : { },
    "annotations" : [ ]
  }, {
    "name" : "subject",
    "value" : "this is the subject of email message",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, {
    "name" : "dcterms:created",
    "value" : "2015-02-28T03:13:41Z",
    "metadata" : {
      "creator" : "tika-parser"
    },
    "annotations" : [ ]
  }, ...

The following is an example of an empty PipelineDocument that issues a commit command on the index:

{ "fields" : [ ],
  "metadata" : { },
  "commands" : [ {
    "name" : "commit",
    "params" : { }
  } ]
}

Submitting PipelineDocuments Directly to a Pipeline via the REST-API

PipelineDocuments can be submitted to a pipeline via a POST request to the Fusion REST-API path:

/api/apollo/index-pipelines/<id>/collections/<collectionName>/index

where <id> is the name of a specific pipeline and <collectionName> is the name of a specific collection. The Content-Type header to use for this format is:

application/vnd.lucidworks-document

Example

Send two documents to the collection named "docs" using the "conn_solr" pipeline:

curl -u user:pass -X POST \
  -H "Content-Type: application/vnd.lucidworks-document" \
  -d '[{"id":"myDoc1", "fields":[{"name":"title", "value":"My first document"}, {"name":"body", "value":"This is a simple document."}]}, {"id":"myDoc2", "fields":[{"name":"title", "value":"My second document"}, {"name":"body", "value":"This is another simple document."}]}]' \
  http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index
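
Because a command-only document is itself a PipelineDocument, the commit example shown earlier should be acceptable at the same endpoint. A sketch, reusing the pipeline and collection from above:

curl -u user:pass -X POST \
  -H "Content-Type: application/vnd.lucidworks-document" \
  -d '[{"fields":[], "commands":[{"name":"commit", "params":{}}]}]' \
  http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index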

Limitations of using the index-pipelines REST-API

All fields of a PipelineDocument sent to a pipeline in an HTTP POST request are treated as plain-text strings. If you encode binary data as a BASE64-encoded string, you must add a stage that decodes that data before the Tika Parser stage; otherwise, Tika will parse the encoded string as plain text.
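
For example, the following request submits a document whose "_raw_content_" field holds BASE64-encoded bytes (the value here is a truncated placeholder). Without a decoding stage ahead of the Tika Parser stage, Tika would parse the BASE64 string itself as plain text:

curl -u user:pass -X POST \
  -H "Content-Type: application/vnd.lucidworks-document" \
  -d '[{"id":"myDoc3", "fields":[{"name":"_raw_content_", "value":"TUlNRS1WZXJzaW9uOiAxLjAK..."}]}]' \
  http://localhost:8764/api/apollo/index-pipelines/conn_solr/collections/docs/index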