Fusion 5.9

    Use Tika Asynchronous Parsing

    This document describes how to set up your application to use Tika asynchronous parsing.

    Unlike synchronous Tika parsing, which uses a parser stage, asynchronous Tika parsing is configured in the datasource and index pipeline. For more information, see Asynchronous Tika Parsing.

    Field names change with asynchronous Tika parsing.

    In contrast to synchronous parsing, asynchronous Tika parsing prepends parser_ to the names of fields that it adds to a document. System fields, which start with _lw_, do not get the parser_ prefix.

    If you are migrating to asynchronous Tika parsing, and your search application configuration relies on specific field names, update your search application to use the new fields.
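
    The renaming rule described above can be sketched as a small helper for planning a migration. This is a minimal illustration, not part of Fusion: the function name and the sample field names (author, _lw_example_s) are hypothetical, and the helper assumes every name passed in belongs to a field that the parser adds to the document.

```python
# Sketch of the field renaming rule; not a Fusion API.
PARSER_PREFIX = "parser_"   # prefix added by asynchronous Tika parsing
SYSTEM_PREFIX = "_lw_"      # system fields keep their original names


def async_field_name(field: str) -> str:
    """Return the expected name of a parser-added field under async parsing."""
    if field.startswith(SYSTEM_PREFIX):
        return field  # system fields are not prefixed
    return PARSER_PREFIX + field


def rename_map(fields):
    """Map each old field name to its expected new name."""
    return {name: async_field_name(name) for name in fields}


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    for old, new in rename_map(["author", "_lw_example_s"]).items():
        print(f"{old} -> {new}")
```

    A map like this can be used to search a query pipeline or UI configuration for references to the old names before switching the datasource over.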

    Configure the connectors datasource

    1. Navigate to your datasource.

    2. Enable the Advanced view.

    3. Enable the Async Parsing option.

      In Fusion 5.12 and later, asynchronous parsing uses your parser configuration.

      The asynchronous parsing service performs Tika parsing using Apache Tika Server.

      In Fusion 5.8 through 5.11, the asynchronous parsing service does not support other parsers, such as HTML and JSON. When you enable asynchronous parsing, the parser configuration linked to your datasource is ignored.

      In Fusion 5.12 and later, the asynchronous parsing service supports other parsers, such as HTML and JSON. When you enable asynchronous parsing, the parser configuration linked to your datasource is used.

    4. Save the datasource configuration.
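
    The version-dependent behavior noted in step 3 can be summarized as a simple check. This is only a sketch of the rule stated above. The function name is hypothetical, the version string is assumed to have the form major.minor or major.minor.patch, and versions earlier than 5.8 are rejected because this document does not describe them.

```python
# Sketch of the version rule from step 3; not a Fusion API.
def parser_config_used(fusion_version: str) -> bool:
    """Return True if async parsing uses the datasource's parser configuration."""
    major, minor = (int(part) for part in fusion_version.split(".")[:2])
    if (major, minor) < (5, 8):
        # Behavior before 5.8 is not covered by this document.
        raise ValueError(f"Fusion {fusion_version} is not covered here")
    # 5.8 through 5.11: parser configuration is ignored.
    # 5.12 and later: parser configuration is used.
    return (major, minor) >= (5, 12)


if __name__ == "__main__":
    for version in ("5.9", "5.11.1", "5.12"):
        print(version, parser_config_used(version))
```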

    Configure the parser stage

    This section is required in Fusion 5.12 and later. It does not apply to Fusion 5.11 or earlier.

    1. Navigate to Parsers.

    2. Select the parser, or create a new parser.

    3. From the Add a parser stage menu, select Apache Tika Container Parser.

    4. (Optional) Enter a label for this stage. The label replaces the default stage name, Apache Tika Container Parser, with the value you enter in this field.

    5. If the Apache Tika Container Parser stage is not already the first stage, drag and drop the stage to the top of the stage list so it is the first stage that runs.

    Configure the index pipeline

    1. Go to the Index Pipeline screen.

    2. Add the Solr Partial Update Indexer stage.

    3. Turn off the Reject Update if Solr Document is not Present option and turn on the Process All Pipeline Doc Fields option.

    4. Include an extra update field in the stage configuration. Any update type and field name will work. For example, add an increment field named docs_counter_i with an increment value of 1.

    5. Enable the Allow reserved fields option.

    6. Click Save.

    7. Turn off or remove the Solr Indexer stage, and move the Solr Partial Update Indexer stage to be the last stage in the pipeline.
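
    To make the partial update configured above more concrete, the sketch below builds a document in Solr's atomic update JSON syntax, where "set" replaces a field value and "inc" increments a numeric field. It is an assumption that the stage produces an update of roughly this shape; the exact request Fusion sends is not shown in this document. The function name, document ID, and parser_title field are hypothetical, and docs_counter_i is the example field from step 4.

```python
# Sketch of a Solr atomic update document; the mapping from the
# Fusion stage settings to this payload is an assumption.
import json


def build_partial_update(doc_id: str, parsed_fields: dict) -> dict:
    """Build an atomic update that sets parsed fields and bumps a counter."""
    update = {"id": doc_id}
    for name, value in parsed_fields.items():
        update[name] = {"set": value}      # overwrite the field value
    update["docs_counter_i"] = {"inc": 1}  # extra update field from step 4
    return update


if __name__ == "__main__":
    # Hypothetical document ID and field, for illustration only.
    doc = build_partial_update("doc-1", {"parser_title": "Hello"})
    print(json.dumps(doc, indent=2))
```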

    Asynchronous Tika parsing setup is now complete. Run the datasource indexing job and monitor the results.