Fusion 5.11

    Asynchronous Tika Parsing

    In synchronous Tika parsing, parsing and indexing are performed concurrently. Because the parser and indexer must share resources, indexing can be slow when there are many documents.

    Asynchronous Tika parsing, on the other hand, performs parsing in the background. This allows Fusion to continue indexing documents while the parser is processing others, resulting in improved indexing performance for large numbers of documents.

    How does the configuration differ from synchronous Tika parsing?

    Asynchronous parsing does not use a Fusion parser stage. Instead, the configuration is made in the datasource and index pipeline.

    By default, the asynchronous parsing service is deployed with a single instance, or pod. Depending on the number and size of your documents, consider scaling the service up or down. You can also configure:

    • Cron expression: The cron expression that controls when the task is executed.

    • Number of datasources: The number of datasources to select per task.

    • Number of documents: The number of documents to select per datasource per task.

    • Task execution timeout: The maximum amount of time that a task is allowed to run before it is terminated.
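
    The two selection limits above jointly bound how much work a single task run can pick up. The following is a minimal Python sketch of that arithmetic, not Fusion's implementation. The class name, field names, and sample values are hypothetical and are not Fusion property names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsyncParsingTaskSettings:
    """Hypothetical container for the per-task limits described above."""
    datasources_per_task: int       # "Number of datasources"
    documents_per_datasource: int   # "Number of documents"

    def max_documents_per_task(self) -> int:
        # Upper bound on the documents selected by one task run.
        return self.datasources_per_task * self.documents_per_datasource


def task_runs_needed(pending_documents: int, settings: AsyncParsingTaskSettings) -> int:
    """Lower bound on task runs needed to drain a backlog,
    assuming every run selects its maximum number of documents."""
    per_run = settings.max_documents_per_task()
    if per_run <= 0:
        raise ValueError("both limits must be positive")
    return -(-pending_documents // per_run)  # ceiling division


# Illustrative values only.
settings = AsyncParsingTaskSettings(datasources_per_task=2, documents_per_datasource=500)
print(settings.max_documents_per_task())   # 1000
print(task_runs_needed(2500, settings))    # 3
```

    Combined with the cron expression, which sets how often a task runs, this bound gives a rough sense of how quickly a backlog of unparsed documents can drain.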

    Your parser configuration is ignored when using asynchronous parsing.

    The asynchronous parsing service performs Tika parsing using Apache Tika Server. Other parsers, such as HTML and JSON, are not supported by the asynchronous parsing service.

    Although your datasource is linked to a parser configuration, this link is ignored when asynchronous parsing is used.

    Requirements

    Asynchronous parsing works only with V2 connectors, which use the Java SDK framework. To enable it, toggle the Advanced view in the datasource configuration and turn on the Async Parsing option.

    Additionally, the index pipeline must include a Solr Partial Update Indexer stage. This stage replaces the Solr indexer stage, which should be removed or turned off. The replacement is required because each source document produces two index documents: the connector plugin generates one from the fetching process, and the asynchronous parsing service generates another from the parsing process. The two must be merged into a single document. The Solr Partial Update Indexer stage merges them, whereas the Solr indexer stage overwrites the existing document.
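
    The difference between merging and overwriting can be illustrated with plain Python dictionaries standing in for the index. This is a conceptual sketch only, not Fusion or Solr code. The field names (`url_s`, `body_t`), the sample values, and the helper functions are hypothetical.

```python
from typing import Dict

Index = Dict[str, Dict[str, str]]


def overwrite(index: Index, doc: Dict[str, str]) -> None:
    """Full replace: the incoming document replaces any stored document with the same id."""
    index[doc["id"]] = dict(doc)


def partial_update(index: Index, doc: Dict[str, str]) -> None:
    """Merge: incoming fields are added to, or replace, fields of the stored document."""
    index.setdefault(doc["id"], {}).update(doc)


# One document from the fetching process and one from the parsing process, same id.
fetched = {"id": "doc-1", "url_s": "file:///reports/q1.pdf"}
parsed = {"id": "doc-1", "body_t": "Quarterly results"}

overwritten: Index = {}
overwrite(overwritten, fetched)
overwrite(overwritten, parsed)      # the fetch metadata is lost

merged: Index = {}
partial_update(merged, fetched)
partial_update(merged, parsed)      # both sets of fields survive

print(sorted(overwritten["doc-1"]))  # ['body_t', 'id']
print(sorted(merged["doc-1"]))       # ['body_t', 'id', 'url_s']
```

    With overwrite semantics, whichever document arrives last wins, so either the fetched metadata or the parsed content is dropped. Merge semantics keep both regardless of arrival order.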

    For more information on asynchronous parsing setup, see Use Tika Asynchronous Parsing.