Prerequisites
To use this stage, non-admin Fusion users must be granted the PUT,POST,GET:/LWAI-ACCOUNT-NAME/** permission in Fusion, where LWAI-ACCOUNT-NAME is the Lucidworks AI API Account Name defined in Lucidworks AI Gateway when this stage is configured.
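For example, if the account name defined in Lucidworks AI Gateway is my-lwai-account (a hypothetical name), the permission to grant is: PUT,POST,GET:/my-lwai-account/**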
Additional requirements for the stage are:
- Use a V2 connector. Only V2 connectors work with this stage; other options, such as PBL or V1 connectors, do not.
- Remove the Apache Tika stage from your parser, because it can cause datasource failures with the following error: “The following components failed: [class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Only Tika Container parser can support Async Parsing.]”
Strategies
Choose one of these chunking strategies:
- dynamic-newline: Splits on newlines, then merges lines up to maxChunkSize. The default for maxChunkSize is 512 tokens.
- dynamic-sentence: Joins sentences until maxChunkSize is reached. Supports overlap using overlapSize.
- sentence: Uses a fixed number of sentences per chunk. The default for chunkSize is 5. Supports overlap using overlapSize.
- regex-splitter: Set regex to the pattern to split on, using Python re conventions.
- semantic: Groups semantically similar sentences up to maxChunkSize. This strategy is the slowest but most precise. Supports overlap using overlapSize.
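The sketch below shows, in plain Python, what the sentence strategy does conceptually: a fixed number of sentences per chunk, with overlapSize sentences shared between consecutive chunks. This is a simplified illustration, not the stage's actual implementation; the service's sentence splitting and token counting differ.

```python
import re

def sentence_chunks(text, chunk_size=5, overlap_size=1):
    """Illustrative sketch of the 'sentence' strategy: chunk_size
    sentences per chunk, with overlap_size sentences repeated between
    consecutive chunks. Naive splitting; not the real implementation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = max(chunk_size - overlap_size, 1)
    chunks, i = [], 0
    while i < len(sentences):
        chunks.append(" ".join(sentences[i:i + chunk_size]))
        if i + chunk_size >= len(sentences):
            break  # last chunk reached the end of the text
        i += step
    return chunks

print(sentence_chunks("One. Two. Three. Four. Five. Six. Seven.",
                      chunk_size=3, overlap_size=1))
# ['One. Two. Three.', 'Three. Four. Five.', 'Five. Six. Seven.']
```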
How asynchronous results return
The LWAI Chunker Index stage submits text to the Async Chunking API, which returns a chunkingId.
Later, the results are fetched and written back to the same index pipeline using the Solr Partial Update Indexer.
This means the same pipeline is visited twice: once for the original document and once to apply chunk fields and vectors.
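Fusion handles this round trip for you, but the general submit-then-poll pattern looks roughly like the sketch below. The endpoint path, response fields, and status value here are hypothetical placeholders for illustration of the async pattern only; they are not the documented Async Chunking API.

```python
import time
import requests

# Hypothetical endpoint -- consult the Lucidworks AI API docs for real paths.
BASE = "https://example-gateway/ai/async-chunking"

def chunk_async(text, headers):
    # 1. Submit the text; the service replies immediately with an ID.
    resp = requests.post(BASE, json={"text": text}, headers=headers)
    chunking_id = resp.json()["chunkingId"]

    # 2. Poll until the chunking job finishes, then return the results.
    while True:
        result = requests.get(f"{BASE}/{chunking_id}", headers=headers).json()
        if result.get("status") != "SUBMITTED":  # hypothetical status value
            return result
        time.sleep(1)
```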
What this stage writes
- Vector field (required): In Destination Field Name & Context Output, use a dense vector field and include chunk_vector in the field name, for example, body_chunk_vector_384v.
- Text chunks field (recommended): Set Destination Field Name for Text Chunks, for example, body_chunks_ss.
- Doctype marker (required for chunking queries): The stage adds _lw_chunk_doctype_s with a marker for use in the Chunking Neural Hybrid Query stage. The markers are:
  - _lw_chunk_root on the root document
  - The vector field name, such as body_chunk_vector_384v, on child documents
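After the async results are applied, the indexed documents might look roughly like the sketch below. Only the field names come from the configuration in this article; the document shapes and values are illustrative assumptions (chunk text on the root, vectors on children), and exact field placement can differ in your index.

```python
# Illustrative shapes only; real documents carry many more fields.
root_doc = {
    "id": "doc-1",
    "_lw_chunk_doctype_s": "_lw_chunk_root",  # marker on the root document
    "body_chunks_ss": ["First chunk of text.", "Second chunk of text."],
}
child_doc = {
    "id": "doc-1#chunk-0",
    # On children, the marker is the vector field name itself.
    "_lw_chunk_doctype_s": "body_chunk_vector_384v",
    "body_chunk_vector_384v": [0.12, -0.03, 0.44],  # truncated; really 384 dims
}
```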
Example setup for this stage
- Add the LWAI Chunker Index stage to your index pipeline and configure it:
  - Chunking Strategy: for example, sentence.
  - Model for Vectorization: pick your embedding model.
  - Input context variable: the field or context variable containing the text to chunk.
  - Destination Field Name & Context Output: body_chunk_vector_384v. This must contain chunk_vector and be a dense vector field.
  - Destination Field Name for Text Chunks: body_chunks_ss.
  - (Optional) In Chunker Configuration, set chunkSize=5 or overlapSize=1.
- In the same pipeline, add the Solr Partial Update Indexer stage:
  - Uncheck Map to Solr Schema.
  - Uncheck Enable Concurrency Control.
  - Uncheck Reject Update if Solr Document is not Present.
  - Check Process All Pipeline Doc Fields.
  - Check Allow reserved fields.
- Save the pipeline so the async results can return to the same pipeline.
- Index a sample document and verify (see the query sketch after this list):
  - The original doc is present.
  - After the async job completes, the doc has body_chunk_vector_384v for the vector, body_chunks_ss for the text chunks, and the _lw_chunk_doctype_s markers for root and children.
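One way to spot-check the async results is to query Solr directly for the marker and chunk fields. The host and collection name below are placeholders; adjust them, and add authentication, for your deployment.

```python
import requests

SOLR = "http://localhost:8983/solr/my_collection/select"  # placeholder

params = {
    "q": "*:*",
    "fq": "_lw_chunk_doctype_s:*",  # only docs the chunker has marked
    "fl": "id,_lw_chunk_doctype_s,body_chunks_ss",
    "rows": 5,
}
for doc in requests.get(SOLR, params=params).json()["response"]["docs"]:
    print(doc["id"], doc.get("_lw_chunk_doctype_s"))
```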
Fusion truncates text sent for chunking to ~50,000 characters, so plan chunking inputs accordingly.
What to use for query
Use the Chunking Neural Hybrid Query stage to combine lexical and vector search over parent and child chunks. It expects a vector field like body_chunk_vector_384v and the _lw_chunk_doctype_s markers described above.
Configuration
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
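The difference is ordinary JSON string escaping: the API request body is JSON, so a literal backslash must itself be escaped. A quick illustration in Python, using the regex property from the regex-splitter strategy:

```python
import json

ui_value = r"\t"  # what you type in the UI: a backslash followed by t
print(json.dumps({"regex": ui_value}))  # -> {"regex": "\\t"}, the form the API body needs
```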