The LWAI Chunker Index Stage asynchronously breaks down large text documents into smaller, semantically meaningful chunks, vectorizes those chunks for Neural Hybrid Search, and stores those vectors in Solr. For more information, see the Lucidworks AI Async Chunking API.

Prerequisites

To use this stage, non-admin Fusion users must be granted the PUT,POST,GET:/LWAI-ACCOUNT-NAME/** permission in Fusion, where LWAI-ACCOUNT-NAME is the Lucidworks AI API Account Name defined in Lucidworks AI Gateway when this stage is configured.
Additional requirements for the stage are:
  • Use a V2 connector. Only V2 connectors support this task; other options, such as PBL or V1 connectors, do not.
  • Remove the Apache Tika stage from your parser because it can cause datasource failures with the following error: “The following components failed: [class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Only Tika Container parser can support Async Parsing.]”

Strategies

Choose one of these chunking strategies.

Strategy descriptions

dynamic-newline
Splits on newlines, then merges lines up to maxChunkSize. The default for maxChunkSize is 512 tokens.
dynamic-sentence
Joins sentences until maxChunkSize is reached. Supports overlap with overlapSize.
sentence
Uses a fixed number of sentences per chunk. The default for chunkSize is 5. Supports overlap with overlapSize.
regex-splitter
Splits on the regular expression set in regex, using Python re conventions.
semantic
Groups semantically similar sentences up to maxChunkSize. This strategy is the slowest but most precise. Supports overlap with overlapSize.
These chunker names and keys are defined in the Async Chunking API.
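The strategies above can be sketched in Python. This is an illustration of the splitting logic only, not the service's actual implementation; in particular, it counts whitespace-separated words as a stand-in for real tokens:

```python
import re

def chunk_newlines(text, max_chunk_size=512):
    """dynamic-newline sketch: split on newlines, then merge
    consecutive lines until adding the next line would exceed
    max_chunk_size (words used here as a token stand-in)."""
    chunks, current, size = [], [], 0
    for line in filter(None, (l.strip() for l in text.splitlines())):
        n = len(line.split())
        if current and size + n > max_chunk_size:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def chunk_sentences(text, chunk_size=5, overlap_size=0):
    """sentence sketch: a fixed number of sentences per chunk,
    with overlap_size sentences shared between adjacent chunks."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = max(chunk_size - overlap_size, 1)
    return [" ".join(sentences[i:i + chunk_size])
            for i in range(0, len(sentences), step)
            if sentences[i:i + chunk_size]]
```

The regex-splitter strategy behaves like `re.split(pattern, text)` with the pattern you supply in regex, following Python re conventions.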

How asynchronous results return

The LWAI Chunker Index stage submits text to the Async Chunking API, which returns a chunkingId. Later, the results are fetched and written back through the same index pipeline using the Solr Partial Update Indexer stage. This means the same pipeline is visited twice: once for the original document and once to apply the chunk fields and vectors.

What this stage writes

  • Vector field (required): in Destination Field Name & Context Output, use a dense vector field and include chunk_vector in the field name, for example, body_chunk_vector_384v.
  • Text chunks field (recommended): set Destination Field Name for Text Chunks, for example, body_chunks_ss.
  • Doctype marker (required for chunking queries): add _lw_chunk_doctype_s with a marker for use in the Chunking Neural Hybrid Query stage.
    Markers:
    • _lw_chunk_root on the root document
    • The vector field name, such as body_chunk_vector_384v, on child documents
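Putting these fields together, a root document and one of its child chunk documents might look like the following sketch. The marker values and field names follow the conventions above; the id values, chunk text, and vector numbers are illustrative:

```json
{
  "id": "doc1",
  "_lw_chunk_doctype_s": "_lw_chunk_root",
  "body_chunks_ss": ["First chunk of text.", "Second chunk of text."]
}

{
  "id": "doc1-chunk-0",
  "_lw_chunk_doctype_s": "body_chunk_vector_384v",
  "body_chunk_vector_384v": [0.012, -0.044, 0.091]
}
```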

Example setup for this stage

  1. Add LWAI Chunker Index Stage to your index pipeline:
    • Chunking Strategy: for example, sentence.
    • Model for Vectorization: the embedding model to use.
    • Input context variable: the field or context variable containing the text to chunk.
    • Destination Field Name & Context Output: body_chunk_vector_384v. This must contain chunk_vector and be a dense vector field.
    • Destination Field Name for Text Chunks: body_chunks_ss.
    • (Optional) In Chunker Configuration, set options such as chunkSize=5 or overlapSize=1.
  2. In the same pipeline, add Solr Partial Update Indexer:
    • Uncheck Map to Solr Schema
    • Uncheck Enable Concurrency Control
    • Uncheck Reject Update if Solr Document is not Present
    • Check Process All Pipeline Doc Fields
    • Check Allow reserved fields
  3. Save the pipeline so that the asynchronous results can be written back through it.
  4. Index a sample and verify:
    • The original doc is present.
    • After async processing completes, the doc contains the body_chunk_vector_384v vector field, the body_chunks_ss text chunks field, and any _lw_chunk_doctype_s markers for root and child documents.
Fusion truncates text sent for chunking to ~50,000 characters, so plan chunking inputs accordingly.

What to use for querying

Use Chunking Neural Hybrid Query to combine lexical and vector search over parent and child chunks. It expects a vector field like body_chunk_vector_384v and the _lw_chunk_doctype_s markers described above.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
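For example, to use a tab character as the value of the regex parameter for the regex-splitter strategy, the API request body would carry the escaped form (the parameter name comes from the Strategies section above; the surrounding JSON shape is illustrative):

```json
{ "regex": "\\t" }
```

In the UI, you would type \t directly into the field instead.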