Lucid.fs (V1) Connector Framework

This connector is used to crawl filesystems and filesystem-based data repositories.

Options for splitting very large files

The connector framework provides several splitter implementations. All splitters can handle files compressed with GZip, BZip2, Unix compress(1), or XZ. Compressed data is decompressed on the fly from the input stream, without creating temporary files.

Smaller content can usually be processed by the index pipeline without being split first; most splitters use the "min_size" property to make this determination. A splitter splits the content before sending it to the pipeline whenever the data size is unknown (size == -1) or exceeds the "min_size" threshold.
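
For instance, a minimal splitter configuration that only sets this threshold might look like the following (the 1 MiB value is illustrative, not a default):

{
  "splitter" : {
    "min_size" : 1048576
  }
}

Content whose known size is below this threshold is passed to the pipeline whole; content that is larger, or whose size is unknown, is split first.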

Splitter implementations are registered using a Guice multibinder and are identified by symbolic names. By default, all implementations found on the classpath are registered and available for use. The order in which the different splitters attempt splitting is undefined by default, with one exception: the binary splitter is always registered as the last to try. This ordering can be customized using the "splitters" property, a comma- or space-separated list of symbolic names. If this property is not empty, only the splitters listed there are tried, in the specified order (see the example below). Each available splitter is tried in turn until one of them succeeds. The binary splitter always succeeds if it is present in the configuration and the data size exceeds the "min_size" threshold.
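
For example, the following fragment (the values are illustrative) restricts splitting to the "text" and "binary" splitters, tried in that order:

{
  "splitter" : {
    "splitters" : "text binary"
  }
}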

Portions of the split content (records) are either passed on as a raw binary field, to be processed by other stages in the pipeline, or turned directly into PipelineDocuments.

Available splitters and their options

  • "archive" - For common archive formats (zip, tar, 7z, jar / war / ear, ar, arj, cpio, dump). Observes the "min_size" property. Produces raw binary data that corresponds to individual entries in an archive.

  • "csv" - For CSV and TSV files, and it supports the same options as CSV/TSV options in TikaParsingStage. The "min_size" property is observed. Produces PipelineDocuments with the content of single CSV/TSV records.

  • "tika" - For Unix mailbox (.mbox) files and MS Outlook PST files, regardless of their size. Produces PipelineDocuments with the content of single emails.

  • "text" - For record-oriented text files - either large plain text files such as logs, or any other content identified as "text/plain" by Tika content type detector. Observes the "min_size" property. Records are defined by a delimiter, which is an arbitrary string of characters (NOT a regular expression!). Produces raw binary data that corresponds to the collected portion of the input data. It supports the following options:

    • "text_delimiter" - A string of characters to delimit records. Default value is new-line ("\n"). If delimiter is set to an empty string ("") then the splitter applies only the "text_max_length" limit where the input text is divided into equally-sized chunks of "text_max_length" size.

    • "text_count" - Number of delimited records to collect into one output value. Default value is 1. When this is greater than 1, the collected records are available both as a single raw binary field (joined with delimiters) and as multiple "txt_records" fields with textual content of each record.

    • "text_charset" - Character used when converting bytes to characters. Default is `"UTF-8".

    • "text_max_length" - Maximum length (in characters) of any individual record. If record size exceeds this threshold, then data is discarded until the next delimiter is encountered. Default value is 32768.

    • "text_skip_empty" - Discard empty records. Default value is true.

  • "binary" - For any type data, used as a last resort. Splits the input on delimiters. It supports the following options:

    • "bin_delimiter" - The binary delimiter used to split, expressed as a string. If the value starts with "0x" then it is treated as a hexadecimal representation of bytes, otherwise it is treated as a string of characters to be converted to bytes using UTF-8 encoding. Default value is "" (empty string), which means splitting only by "bin_max_size" (see below).

    • "bin_count" - See "text_count" above.

    • "bin_max_length" - See "text_max_length" above. This limit is expressed in the number of bytes.

Example configuration

Splitting configuration is provided in a datasource configuration, under the /properties/splitter JSON path.

{
  "id" : "ds1",
  "connector" : "lucid.fs",
  ...
  "properties" : {
    "splitter" : {
      "splitters" : "archive tika csv text binary",
      "min_size" : 10485760,
      "csv_format" : "default",
      "header_line" : false,
      "csv_delimiter" : "|",
      "text_delimiter" : "\n",
      "text_charset" : "UTF-8",
      "text_max_length" : 32768,
      "text_skip_empty" : true,
      "bin_max_length" : 32768
    },
    ...
  }
}
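
As a variant illustrating the hexadecimal form of "bin_delimiter" described above, this fragment (the CRLF delimiter is only an example) splits the input on the byte pair 0x0D 0x0A:

"splitter" : {
  "splitters" : "binary",
  "bin_delimiter" : "0x0D0A",
  "bin_max_length" : 32768
}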

Plugins for the lucid.fs Connector

The following plugins are available:

  • ftp - traverses FTP sites.

  • s3 - traverses an Amazon S3 native bucket.

  • s3h - traverses an Amazon S3 bucket stored as blocks (as in HDFS).

  • smb - traverses a Windows Samba share.

  • hdfs - traverses Hadoop filesystems (not MapReduce-enabled).

    Note
    The HDFS connector is available in Fusion 4.x only.