# LWAI Chunker Stage

> Index pipeline stage configuration specifications

export const schema = {
  "type": "object",
  "title": "LWAI Chunker Stage",
  "description": "Pass in large text, get back smaller chunks and associated vectors.",
  "required": ["accountName", "chunkingStrategy", "modelName", "inputContextVariable", "outputContextVariable"],
  "properties": {
    "skip": {
      "type": "boolean",
      "title": "Skip This Stage",
      "description": "Set to true to skip this stage.",
      "default": false,
      "hints": ["advanced"]
    },
    "label": {
      "type": "string",
      "title": "Label",
      "description": "A unique label for this stage.",
      "hints": ["advanced"],
      "maxLength": 255
    },
    "condition": {
      "type": "string",
      "title": "Condition",
      "description": "Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.",
      "hints": ["code", "code/javascript", "advanced"]
    },
    "accountName": {
      "type": "string",
      "title": "Account Name",
      "description": "Lucidworks AI API Account Name as defined in AI Gateway Service.  This entry should match the account name set in the AI Gateway.",
      "hints": ["enumUrl:/api/query-stages/lwai-accounts"]
    },
    "chunkingStrategy": {
      "type": "string",
      "title": "Chunking Strategy",
      "description": "Chunking strategy to use",
      "hints": ["enumUrl:/api/query-stages/lwai-chunking-strategies?account=${accountName}"]
    },
    "modelName": {
      "type": "string",
      "title": "Model for Vectorization",
      "description": "Lucidworks AI Model as defined in documentation",
      "hints": ["enumUrl:/api/query-stages/lwai-model?account=${accountName}&useCase=embedding"]
    },
    "inputContextVariable": {
      "type": "string",
      "title": "Input context variable",
      "description": "Name of the variable in context to be used as input. Supports template expressions."
    },
    "outputContextVariable": {
      "type": "string",
      "title": "Destination Field Name & Context Output",
      "description": "Note:  MUST contain '*_chunk_vector_*' and must be a dense vector field type.  The name here is used to populate two things with the prediction results:  1) The field name in the document that will contain the prediction, and 2) The name of the context variable that will contain the prediction."
    },
    "outputTextSpans": {
      "type": "string",
      "title": "Destination Field Name for Text Spans",
      "description": "For example, body_spans_ss.  This field will contain the spans ([start,stop] positions) that are generated by the chunker."
    },
    "outputTextChunks": {
      "type": "string",
      "title": "Destination Field Name for Text Chunks (not the vectors)",
      "description": "For example, body_chunks_ss.  This field will contain the text chunks that are generated by the chunker."
    },
    "chunkerConfig": {
      "type": "array",
      "title": "Chunker Configuration",
      "description": "Additional Chunker keys and values to be sent to Lucidworks AI",
      "minItems": 0,
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "modelConfig": {
      "type": "array",
      "title": "Model Configuration",
      "description": "Additional Model configuration parameters to be sent to Lucidworks AI",
      "minItems": 0,
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "maxTries": {
      "type": "integer",
      "title": "Maximum Asynchronous Call Tries",
      "description": "The maximum number of attempts to issue an asynchronous Lucidworks AI API call",
      "default": 1,
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "failOnError": {
      "type": "boolean",
      "title": "Fail on Error",
      "description": "Flag to indicate if this stage should throw an exception if an error occurs while generating a prediction for a document.",
      "default": false
    }
  },
  "category": "AI",
  "categoryPriority": 10,
  "unsafe": false
};

export const SchemaParamFields = ({schema}) => {
  const sanitize = str => {
    if (typeof str !== "string") return str;
    return str.replace(/^"(.*)"$/s, "$1").replace(/\\/g, "").replace(/"/g, "'");
  };
  const formatDescription = str => {
    const s = sanitize(str);
    return (/[.!?]\)*$/).test(s) ? s : `${s}.`;
  };
  const {description, properties = {}, required: requiredProps = []} = schema;
  const visibleProps = useMemo(() => Object.entries(properties).filter(([, prop]) => !prop.hints?.includes("hidden")), [properties]);
  return <div>
      {description && <p>{formatDescription(description)}</p>}

      {visibleProps.map(([name, prop]) => {
    const isRequired = requiredProps.includes(name);
    const hasDefault = prop.default !== undefined;
    const rawDefault = prop.default;
    const isComplexDefault = hasDefault && (typeof rawDefault === "object" || typeof rawDefault === "string" && (rawDefault.length > 20 || rawDefault.includes('"')));
    const fieldProps = {
      key: name,
      body: prop.title || name,
      type: prop.type,
      ...prop.title && ({
        post: [<><span className="text-stone-400 dark:text-stone-500">API property: </span>{name}</>]
      }),
      ...isRequired && ({
        required: true
      }),
      ...!isComplexDefault && hasDefault ? {
        default: sanitize(String(rawDefault))
      } : {}
    };
    const isObject = prop.type === "object" && prop.properties;
    const isArrayOfObjects = prop.type === "array" && prop.items?.type === "object" && prop.items.properties;
    return <ParamField {...fieldProps}>
            {prop.description && <p>{formatDescription(prop.description)}</p>}

            {isComplexDefault && <div className="flex">
                <p>
                  <strong>Default:</strong>
                </p>
                <pre className="!my-0">
                  <code>
                    {JSON.stringify(rawDefault, null, 2)}
                  </code>
                </pre>
              </div>}

            {isArrayOfObjects && <div className="flex">
              <p>
                <strong>Object attributes:</strong>
              </p>
              <pre className="!my-0">
                <code>
                  {'{\n'}
                  {Object.entries(prop.items.properties).map(([iname, iprop]) => <>
                      {`  ${iname}`}
                      {prop.items?.required?.includes(iname) && <span style={{
      color: 'red'
    }}> required</span>}
                      {`: {\n    display name: ${sanitize(iprop.title || '')}\n    type: ${iprop.type}\n  }\n`}
                    </>)}
                  {'}'}
                </code>
              </pre>
              </div>}

            {isObject && <Expandable title="properties">
                <SchemaParamFields schema={{
      properties: prop.properties,
      required: prop.required
    }} />
              </Expandable>}
          </ParamField>;
  })}
    </div>;
};

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

Lucidworks Search 5.9.12 and later integrates with Lucidworks AI to perform [chunking](/docs/lucidworks-search/11-vector-search/chunking).
When you include chunking in your index pipeline, Lucidworks AI automatically splits large documents into smaller, more focused segments.
This approach is especially powerful when paired with Neural Hybrid Search to surface the most relevant chunks instead of entire documents.
Chunking also improves the accuracy of AI assistants by delivering semantically rich training data in precise, context-aware pieces.

This stage performs chunking [asynchronously](/docs/lucidworks-search/09-developer-documentation/config-specs/index-pipeline-stages/overview#run-stages-asynchronously) and stores the resulting chunk vectors in Solr.

Click your use case below to see examples of how chunking can enhance the search experience:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    * Break product descriptions into focused chunks so customers can find relevant details faster.
    * Reduce support tickets by training AI assistants on semantically-segmented help articles for more accurate answers.
    * Split multimedia transcripts (like product videos or webinars) into meaningful chunks so customers can find answers in content they wouldn't normally read.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    * Break down long technical specs so buyers can validate requirements without reading full documents.
    * Improve Request For Quote (RFQ) matching by extracting key details so sales teams can respond faster and more accurately.
    * Surface relevant regulatory or compliance info within dense documents for more efficient legal reviews.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    * Chunk large case studies, policies, or contracts into semantically useful segments so employees get faster answers.
    * Boost AI assistant quality by providing smaller, context-rich units of information so responses are more precise.
    * Improve onboarding by connecting new hires with task-relevant excerpts from training content, runbooks, and policy documents.
  </Tab>
</Tabs>

<Note>
  This feature is available starting in Lucidworks Search 5.9.12 and in all subsequent Lucidworks Search 5.9 releases.
</Note>

<LwTemplate />

## Prerequisites

To use this stage, non-admin Fusion users must be granted the `PUT,POST,GET:/LWAI-ACCOUNT-NAME/**` permission in Fusion, where `LWAI-ACCOUNT-NAME` is the Lucidworks AI API Account Name that is defined in [Lucidworks AI Gateway](/docs/lw-platform/lw-ai/lw-ai-gateway) and selected when this stage is configured.

Click **Get Started** below to see how to enable chunking in Fusion:

<iframe src="https://app.supademo.com/embed/cmfzg6uw4009oxx0i1ptmac82?embed_v=2&utm_source=embed" loading="lazy" title="Enable chunking in Fusion" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style={{  width: '100%', height: '500px' }} />

<Note>
  Additional requirements for the stage are:

  * Use a V2 connector. Only V2 connectors work with this stage. Other options, such as PBL or V1 connectors, are not supported.
  * Remove the `Apache Tika` stage from your parser because it can cause datasource failures with the following error: "The following components failed: \[class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Only Tika Container parser can support Async Parsing.]"
</Note>

## Strategies

Choose one of these chunking strategies.

<Accordion title="Strategy descriptions" defaultOpen="true">
  <ParamField path="dynamic-newline">
    Splits on newlines, then merges lines up to `maxChunkSize`. The default for `maxChunkSize` is 512 tokens.
  </ParamField>

  <ParamField path="dynamic-sentence">
    Joins sentences until `maxChunkSize` is reached. Chunks can overlap using `overlapSize`.
  </ParamField>

  <ParamField path="sentence">
    Uses a fixed number of sentences per chunk. The default for `chunkSize` is 5. Chunks can overlap using `overlapSize`.
  </ParamField>

  <ParamField path="regex-splitter">
    Splits on a regular expression. Set `regex` to the pattern, using Python `re` conventions.
  </ParamField>

  <ParamField path="semantic">
    Groups semantically similar sentences up to `maxChunkSize`. This strategy is the slowest but most precise. Chunks can overlap using `overlapSize`.
  </ParamField>
</Accordion>

Additional information about these chunker names and keys is available in the [Async Chunking API](/docs/lw-platform/lw-ai/lw-ai-apis/lw-ai-async-chunking-api#chunkerconfig).
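
To build intuition for how `chunkSize`, `overlapSize`, and `regex` shape the output, the following is a minimal local sketch of the `sentence` and `regex-splitter` ideas. It is not the Lucidworks AI implementation: the naive sentence splitter, the function names, and the assumption that `overlapSize` counts sentences are all illustrative.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_chunks(text: str, chunk_size: int = 5, overlap_size: int = 0) -> list[str]:
    """Group a fixed number of sentences per chunk, repeating `overlap_size`
    sentences from the end of one chunk at the start of the next."""
    if chunk_size < 1 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size >= 1 and 0 <= overlap_size < chunk_size")
    sentences = split_sentences(text)
    step = chunk_size - overlap_size
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + chunk_size]))
        if start + chunk_size >= len(sentences):
            break  # the last sentence is already covered
    return chunks


def regex_chunks(text: str, pattern: str) -> list[str]:
    """Split on a Python `re` pattern and drop empty pieces."""
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


if __name__ == "__main__":
    sample = "One. Two. Three. Four. Five."
    print(sentence_chunks(sample, chunk_size=2, overlap_size=1))
    print(regex_chunks("intro\n\nbody\n\n\nsummary", r"\n{2,}"))
```

Running a few representative documents through a sketch like this can help you pick starting values before configuring **Chunker Configuration** in the stage.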

## How asynchronous results return

The LWAI Chunker Index stage submits text to the Async Chunking API, which returns a `chunkingId`.
Later, results are fetched and written back to the same index pipeline using the [Solr Partial Update Indexer](/docs/lucidworks-search/09-developer-documentation/config-specs/index-pipeline-stages/solr-partial-update-indexer) stage.
This means the same pipeline is visited twice: once for the original document and once to apply chunk fields and vectors.
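
The two-visit flow can be pictured with a small in-memory simulation. Everything in this sketch is hypothetical: `FakeAsyncChunker`, `index_document`, and `apply_results` are stand-ins invented for illustration, the `body_t` input field is a placeholder, and no real Lucidworks AI or Fusion endpoint is called.

```python
import itertools


class FakeAsyncChunker:
    """Hypothetical in-memory stand-in for an asynchronous chunking service."""

    def __init__(self, polls_until_ready: int = 2):
        self._ids = itertools.count(1)
        self._jobs = {}
        self._polls_until_ready = polls_until_ready

    def submit(self, text: str) -> str:
        """Accept text and immediately return an identifier for the job."""
        chunking_id = f"chunk-{next(self._ids)}"
        self._jobs[chunking_id] = {"text": text, "polls": 0}
        return chunking_id

    def fetch(self, chunking_id: str):
        """Return None until the job is 'ready', then return the chunks."""
        job = self._jobs[chunking_id]
        job["polls"] += 1
        if job["polls"] < self._polls_until_ready:
            return None
        return job["text"].split("\n")  # toy chunking: one chunk per line


def index_document(doc: dict, service: FakeAsyncChunker, index: dict) -> str:
    """First visit: store the original document and submit its text."""
    index[doc["id"]] = dict(doc)
    return service.submit(doc["body_t"])


def apply_results(doc_id: str, chunking_id: str, service: FakeAsyncChunker,
                  index: dict, max_polls: int = 5) -> bool:
    """Second visit: once results are ready, partially update the stored doc."""
    for _ in range(max_polls):
        chunks = service.fetch(chunking_id)
        if chunks is not None:
            index[doc_id]["body_chunks_ss"] = chunks
            return True
    return False


if __name__ == "__main__":
    service, index = FakeAsyncChunker(), {}
    cid = index_document({"id": "1", "body_t": "first line\nsecond line"}, service, index)
    print(index["1"])                       # original doc only
    apply_results("1", cid, service, index)
    print(index["1"])                       # chunk field added on the second visit
```

The point of the simulation is the ordering: the original document is indexed first, and the chunk fields only appear after the asynchronous results are applied as a partial update.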

## What this stage writes

* Vector field (required): in **Destination Field Name & Context Output**, use a **dense vector** field and include `chunk_vector` in the field name, for example, `body_chunk_vector_384v`.
* Text chunks field (recommended): set **Destination Field Name for Text Chunks**, for example, `body_chunks_ss`.
* Doctype marker (required for chunking queries): add `_lw_chunk_doctype_s` with a marker for use in [Chunking Neural Hybrid Query stage](/docs/lucidworks-search/09-developer-documentation/config-specs/query-pipeline-stages/chunking-neural-hybrid-query-stage).\
  Markers:
  * `_lw_chunk_root` on the root document
  * The vector field name, such as `body_chunk_vector_384v`, on child documents
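
The naming rule and the marker values above can be expressed as a short sketch. The helper names are hypothetical, and the substring check is an assumption based on the `*_chunk_vector_*` pattern in the field description. The stage itself may validate the name differently.

```python
DOCTYPE_FIELD = "_lw_chunk_doctype_s"
ROOT_MARKER = "_lw_chunk_root"


def check_vector_field_name(name: str) -> str:
    """Return `name` if it contains `_chunk_vector_`, else raise ValueError."""
    if "_chunk_vector_" not in name:
        raise ValueError(f"{name!r} must contain '_chunk_vector_'")
    return name


def doctype_marker(vector_field: str, is_root: bool) -> str:
    """Root documents get `_lw_chunk_root`; child documents get the vector field name."""
    return ROOT_MARKER if is_root else check_vector_field_name(vector_field)


if __name__ == "__main__":
    vector_field = "body_chunk_vector_384v"
    print({DOCTYPE_FIELD: doctype_marker(vector_field, is_root=True)})
    print({DOCTYPE_FIELD: doctype_marker(vector_field, is_root=False)})
```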

## Example setup for this stage

1. Add **LWAI Chunker Index Stage** to your index pipeline:
   * **Chunking Strategy:** for example, you can use `sentence`.
   * **Model for Vectorization:** pick your embedding model.
   * **Input context variable:** the field or context variable containing the text to chunk.
   * **Destination Field Name & Context Output:** `body_chunk_vector_384v`. This must contain `chunk_vector` and be a dense vector field.
   * **Destination Field Name for Text Chunks:** `body_chunks_ss`.
   * (Optional) In **Chunker Configuration**, set `chunkSize=5` or `overlapSize=1`.
2. In the same pipeline, add **Solr Partial Update Indexer**:
   * Uncheck **Map to Solr Schema**
   * Uncheck **Enable Concurrency Control**
   * Uncheck **Reject Update if Solr Document is not Present**
   * Check **Process All Pipeline Doc Fields**
   * Check **Allow reserved fields**
3. Save the pipeline so that the async results can return to the same pipeline.
4. Index a sample and verify:
   * The original doc is present.
   * After the async processing completes, the doc has the vector field `body_chunk_vector_384v`, the text chunks field `body_chunks_ss`, and the `_lw_chunk_doctype_s` markers for the root and child documents.
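
If you configure the stage through the API instead of the UI, the settings from step 1 map to the API property names listed under **Configuration**. The sketch below assembles those properties and checks the required ones. The account name, model name, and input variable are placeholders, and the stage `id` and `type` fields that a full pipeline definition needs are omitted because they are not covered here.

```python
import json

# Required property names, taken from the stage schema.
REQUIRED = [
    "accountName",
    "chunkingStrategy",
    "modelName",
    "inputContextVariable",
    "outputContextVariable",
]

stage_properties = {
    "accountName": "my-lwai-account",         # placeholder: your AI Gateway account
    "chunkingStrategy": "sentence",
    "modelName": "my-embedding-model",        # placeholder: your embedding model
    "inputContextVariable": "body_t",         # placeholder: text to chunk
    "outputContextVariable": "body_chunk_vector_384v",
    "outputTextChunks": "body_chunks_ss",
    "chunkerConfig": [                        # values are strings in the schema
        {"key": "chunkSize", "value": "5"},
        {"key": "overlapSize", "value": "1"},
    ],
    "maxTries": 1,
    "failOnError": False,
}

missing = [name for name in REQUIRED if not stage_properties.get(name)]
if missing:
    raise ValueError(f"missing required properties: {missing}")

print(json.dumps(stage_properties, indent=2))
```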

<Tip>
  Fusion truncates text sent for chunking to \~50,000 characters, so plan chunking inputs accordingly.
</Tip>
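
A quick pre-check can flag documents whose text is likely to be truncated. This is a sketch only: the 50,000 figure is the approximate limit stated above, not an exact cutoff, and `chars_over_limit` is a hypothetical helper.

```python
APPROX_CHAR_LIMIT = 50_000  # approximate limit from the tip above


def chars_over_limit(text: str, limit: int = APPROX_CHAR_LIMIT) -> int:
    """Return how many characters of `text` fall beyond `limit` (0 if none)."""
    return max(0, len(text) - limit)


if __name__ == "__main__":
    body = "x" * 62_000
    excess = chars_over_limit(body)
    if excess:
        print(f"About {excess} characters may be dropped before chunking.")
```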

## What to use for query

Use **Chunking Neural Hybrid Query** to combine lexical and vector search over parent and child chunks.
It expects a vector field like `body_chunk_vector_384v` and the `_lw_chunk_doctype_s` markers described above.

## Configuration

<Tip>
  When entering configuration values in the UI, use *unescaped* characters, such as `\t` for the tab character. When entering configuration values in the API, use *escaped* characters, such as `\\t` for the tab character.
</Tip>
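
The difference comes from JSON string escaping, assuming the API request body is JSON. In this sketch, the value typed in the UI is the two characters backslash and `t`, and serializing it produces the escaped form. The `regex` key is used only as an illustrative **Chunker Configuration** entry.

```python
import json

ui_value = r"\t"  # what you type in the UI: a backslash followed by the letter t

api_body = json.dumps({"key": "regex", "value": ui_value})
print(api_body)  # {"key": "regex", "value": "\\t"}

# Parsing the body recovers the same two-character value.
round_tripped = json.loads(api_body)["value"]
print(round_tripped == ui_value)  # True
```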

<SchemaParamFields schema={schema} />