The LWAI Chunker Index Stage asynchronously breaks down large text documents into smaller, semantically meaningful chunks, vectorizes those chunks for Neural Hybrid Search, and stores those vectors in Solr. For more information, see the Lucidworks AI Async Chunking API.

Prerequisites

To use this stage, non-admin Fusion users must be granted the PUT,POST,GET:/LWAI-ACCOUNT-NAME/** permission in Fusion, where LWAI-ACCOUNT-NAME is the Lucidworks AI API Account Name defined in Lucidworks AI Gateway when this stage is configured.
Additional requirements for the stage are:
  • Use a V2 connector. Only V2 connectors support this task; other options, such as PBL or V1 connectors, do not.
  • Remove the Apache Tika stage from your parser because it can cause datasource failures with the following error: “The following components failed: [class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Only Tika Container parser can support Async Parsing.]”

Strategies

Choose one of these chunking strategies.

Strategy descriptions

dynamic-newline
Splits on newlines, then merges lines up to maxChunkSize. The default for maxChunkSize is 512 tokens.
dynamic-sentence
Joins sentences until maxChunkSize is reached. Supports overlap with overlapSize.
sentence
Uses a fixed number of sentences per chunk. The default for chunkSize is 5. Supports overlap with overlapSize.
regex-splitter
Splits on the regular expression set in regex, using Python re conventions.
semantic
Groups semantically similar sentences up to maxChunkSize. This strategy is the slowest but most precise. Supports overlap with overlapSize.
These chunker names and keys are defined in the Async Chunking API.
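The strategies above can be sketched in Python. This is an illustration of the splitting logic only, not the service's actual implementation; in particular, it counts whitespace-separated words as a stand-in for real tokens:

```python
import re

def chunk_newlines(text, max_chunk_size=512):
    """dynamic-newline sketch: split on newlines, then merge
    consecutive lines until adding the next line would exceed
    max_chunk_size (words used here as a token stand-in)."""
    chunks, current, size = [], [], 0
    for line in filter(None, (l.strip() for l in text.splitlines())):
        n = len(line.split())
        if current and size + n > max_chunk_size:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_sentences(text, chunk_size=5, overlap_size=0):
    """sentence sketch: a fixed number of sentences per chunk,
    with overlap_size sentences shared between adjacent chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = max(chunk_size - overlap_size, 1)
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), step)
            if sentences[i:i + chunk_size]]
```

The regex-splitter strategy behaves like `re.split(pattern, text)` with the pattern you supply in regex, following Python re conventions.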

How asynchronous results return

The LWAI Chunker Index stage submits text to the Async Chunking API, which returns a chunkingId. Later, the results are fetched and written back through the same index pipeline using the Solr Partial Update Indexer stage. This means the same pipeline is visited twice: once for the original document and once to apply the chunk fields and vectors.

What this stage writes

  • Vector field (required): in Destination Field Name & Context Output, use a dense vector field and include chunk_vector in the field name, for example, body_chunk_vector_384v.
  • Text chunks field (recommended): set Destination Field Name for Text Chunks, for example, body_chunks_ss.
  • Doctype marker (required for chunking queries): add _lw_chunk_doctype_s with a marker for use in the Chunking Neural Hybrid Query stage.
    Markers:
    • _lw_chunk_root on the root document
    • The vector field name, such as body_chunk_vector_384v, on child documents
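Putting these fields together, a root document and one of its child chunk documents might look like the following sketch. The marker values and field names follow the conventions above; the id values, chunk text, and vector numbers are illustrative:

```json
{
  "id": "doc1",
  "_lw_chunk_doctype_s": "_lw_chunk_root",
  "body_chunks_ss": ["First chunk of text.", "Second chunk of text."]
}

{
  "id": "doc1-chunk-0",
  "_lw_chunk_doctype_s": "body_chunk_vector_384v",
  "body_chunk_vector_384v": [0.012, -0.044, 0.091]
}
```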

Example setup for this stage

  1. Add LWAI Chunker Index Stage to your index pipeline:
    • Chunking Strategy: for example, sentence.
    • Model for Vectorization: the embedding model to use.
    • Input context variable: the field or context variable containing the text to chunk.
    • Destination Field Name & Context Output: body_chunk_vector_384v. This must contain chunk_vector and be a dense vector field.
    • Destination Field Name for Text Chunks: body_chunks_ss.
    • (Optional) In Chunker Configuration, set options such as chunkSize=5 or overlapSize=1.
  2. In the same pipeline, add Solr Partial Update Indexer:
    • Uncheck Map to Solr Schema
    • Uncheck Enable Concurrency Control
    • Uncheck Reject Update if Solr Document is not Present
    • Check Process All Pipeline Doc Fields
    • Check Allow reserved fields
  3. Save the pipeline so that the asynchronous results can be written back through it.
  4. Index a sample and verify:
    • The original doc is present.
    • After async processing completes, the doc contains the body_chunk_vector_384v vector field, the body_chunks_ss text chunks field, and any _lw_chunk_doctype_s markers for root and child documents.
Fusion truncates text sent for chunking to ~50,000 characters, so plan chunking inputs accordingly.

What to use for querying

Use Chunking Neural Hybrid Query to combine lexical and vector search over parent and child chunks. It expects a vector field like body_chunk_vector_384v and the _lw_chunk_doctype_s markers described above.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
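For example, to use a tab character as the value of the regex parameter for the regex-splitter strategy, the API request body would carry the escaped form (the parameter name comes from the Strategies section above; the surrounding JSON shape is illustrative):

```json
{ "regex": "\\t" }
```

In the UI, you would type \t directly into the field instead.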