Migrate to Tika Asynchronous Parsing
This document describes how to switch from regular Tika parsing to asynchronous Tika parsing.
Considerations
Asynchronous parsing works only for connectors using the Java SDK framework, known as V2 connectors.
By default, the asynchronous parsing service is deployed as a single instance (pod). Depending on the number and size of the documents, consider scaling the service up or down.
It is also possible to configure:
- The cron expression to control the scheduled task.
- The number of data sources to select per task.
- The number of documents to select per data source per task.
- The task execution timeout.
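To illustrate how the two selection limits interact, the following is a minimal sketch, not the service's actual implementation. The function name `select_batch`, the data source names, and the selection order are all hypothetical. It only shows that one task run handles at most (data sources per task) × (documents per data source) documents.

```python
def select_batch(pending, max_sources, max_docs_per_source):
    """Pick at most max_sources data sources, and at most
    max_docs_per_source documents from each of them.

    `pending` maps a (hypothetical) data source name to its queued
    document IDs. Taking sources in insertion order is an assumption.
    """
    batch = {}
    for source, docs in pending.items():
        if len(batch) == max_sources:
            break
        if docs:  # skip data sources with nothing queued
            batch[source] = docs[:max_docs_per_source]
    return batch


# Hypothetical queue state before one scheduled task run.
pending = {
    "ds-a": ["a1", "a2", "a3"],
    "ds-b": ["b1"],
    "ds-c": ["c1", "c2"],
}

batch = select_batch(pending, max_sources=2, max_docs_per_source=2)
print(batch)  # {'ds-a': ['a1', 'a2'], 'ds-b': ['b1']}
```

With these limits, "ds-c" and document "a3" wait for a later run, which is why larger backlogs may call for higher limits, a more frequent schedule, or more service instances.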
The asynchronous parsing service performs only Tika parsing, using Apache Tika Server. Other parsers, such as HTML and JSON, are not supported by the asynchronous parsing service. If these other parsers are required, they must be configured in the parser configuration and cannot be used with the asynchronous parsing service.
When using the asynchronous parsing service, the index pipeline must include a Solr Partial Update Indexer stage, and the Solr Indexer stage should be removed or turned off. This is required because the connector plugin and the asynchronous parsing service each generate one document: one from the fetching process and one from the parsing process. These two documents must be merged into a single document. The Solr Partial Update Indexer stage merges them, whereas the Solr Indexer stage overwrites the existing document.
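The difference between merging and overwriting can be sketched with plain dictionaries standing in for the index. This is a simplified model, not how Solr or the pipeline stages are implemented, and the field names (`url`, `size`, `body`) are hypothetical.

```python
def overwrite(index, doc):
    """Model of a full index operation: the new document replaces the old one."""
    index[doc["id"]] = dict(doc)


def partial_update(index, doc):
    """Model of a partial update: new fields are merged into the stored
    document. The document is created if it does not exist yet."""
    stored = index.setdefault(doc["id"], {"id": doc["id"]})
    stored.update(doc)


# One document from the fetching process, one from the parsing process.
fetched = {"id": "doc-1", "url": "file:///report.pdf", "size": 1024}
parsed = {"id": "doc-1", "body": "extracted text"}

overwritten = {}
overwrite(overwritten, fetched)
overwrite(overwritten, parsed)

merged = {}
partial_update(merged, fetched)
partial_update(merged, parsed)

print(overwritten["doc-1"])  # {'id': 'doc-1', 'body': 'extracted text'}
print(merged["doc-1"])
# {'id': 'doc-1', 'url': 'file:///report.pdf', 'size': 1024, 'body': 'extracted text'}
```

In the overwrite case the fetch metadata is lost as soon as the parsed document arrives. In the partial-update case both sets of fields survive regardless of which document arrives first.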
Regular Tika parsing configuration
Regular Tika parsing is configured in the Parser section.
Asynchronous parsing configuration
Using asynchronous parsing takes two steps:
- Configure the index pipeline.
- Configure the data source.
Configure the index pipeline
- Go to the Index Pipeline screen.
- Add the Solr Partial Update Indexer stage.
- Deselect the Reject Update if Solr Document is not Present option.
- Enable the Process All Pipeline Doc Fields option.
- Enable the Allow reserved fields option.
- Include an extra update field in the stage configuration using any update type and field name.
- Click Save.
- Move the Solr Partial Update Indexer stage to be the last one in the pipeline, and turn off or remove the Solr Indexer stage.
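For background on what an "update type" means, the sketch below builds a Solr atomic update document in JSON. The modifier names (`set`, `add`, `add-distinct`, `remove`, `removeregex`, `inc`) are Solr's standard atomic update modifiers. The helper `atomic_update`, the document ID, and the field name `parse_status_s` are hypothetical, and whether the stage sends exactly this payload is an assumption.

```python
import json

# Standard Solr atomic update modifiers.
UPDATE_TYPES = {"set", "add", "add-distinct", "remove", "removeregex", "inc"}


def atomic_update(doc_id, field, update_type, value):
    """Build one Solr atomic update document for a single field."""
    if update_type not in UPDATE_TYPES:
        raise ValueError(f"unsupported update type: {update_type}")
    return {"id": doc_id, field: {update_type: value}}


# Hypothetical extra update field; any field name and update type would do.
doc = atomic_update("doc-1", "parse_status_s", "set", "parsed")
payload = json.dumps([doc])
print(payload)
# [{"id": "doc-1", "parse_status_s": {"set": "parsed"}}]
```

A body of this shape is what Solr's JSON update handler accepts for an atomic update. Fields that are not mentioned in the update keep their stored values.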
Configure the connector data source
- In the data source configuration, toggle Advanced to on and enable the Async Parsing option.
- Save the data source configuration and run the jobs.