# LWAI Chunker Index Stage

> Lucidworks AI

export const schema = {
  "type": "object",
  "title": "LWAI Chunker Stage",
  "description": "Pass in large text, get back smaller chunks and associated vectors.",
  "required": ["accountName", "chunkingStrategy", "modelName", "inputContextVariable", "outputContextVariable"],
  "properties": {
    "skip": {
      "type": "boolean",
      "title": "Skip This Stage",
      "description": "Set to true to skip this stage.",
      "default": false,
      "hints": ["advanced"]
    },
    "label": {
      "type": "string",
      "title": "Label",
      "description": "A unique label for this stage.",
      "hints": ["advanced"],
      "maxLength": 255
    },
    "condition": {
      "type": "string",
      "title": "Condition",
      "description": "Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.",
      "hints": ["code", "code/javascript", "advanced"]
    },
    "accountName": {
      "type": "string",
      "title": "Account Name",
      "description": "Lucidworks AI API Account Name as defined in AI Gateway Service.  This entry should match the account name set in the AI Gateway.",
      "hints": ["enumUrl:/api/query-stages/lwai-accounts"]
    },
    "chunkingStrategy": {
      "type": "string",
      "title": "Chunking Strategy",
      "description": "Chunking strategy to use",
      "hints": ["enumUrl:/api/query-stages/lwai-chunking-strategies?account=${accountName}"]
    },
    "modelName": {
      "type": "string",
      "title": "Model for Vectorization",
      "description": "Lucidworks AI Model as defined in documentation",
      "hints": ["enumUrl:/api/query-stages/lwai-model?account=${accountName}&useCase=embedding"]
    },
    "inputContextVariable": {
      "type": "string",
      "title": "Input context variable",
      "description": "Name of the variable in context to be used as input. Supports template expressions."
    },
    "outputContextVariable": {
      "type": "string",
      "title": "Destination Field Name & Context Output",
      "description": "Note:  MUST contain '*_chunk_vector_*' and must be a dense vector field type.  The name here is used to populate two things with the prediction results:  1) The field name in the document that will contain the prediction, and 2) The name of the context variable that will contain the prediction."
    },
    "outputTextSpans": {
      "type": "string",
      "title": "Destination Field Name for Text Spans",
      "description": "For example, body_spans_ss.  This field will contain the spans ([start,stop] positions) that are generated by the chunker."
    },
    "outputTextChunks": {
      "type": "string",
      "title": "Destination Field Name for Text Chunks (not the vectors)",
      "description": "For example, body_chunks_ss.  This field will contain the text chunks that are generated by the chunker."
    },
    "chunkerConfig": {
      "type": "array",
      "title": "Chunker Configuration",
      "description": "Additional Chunker keys and values to be sent to Lucidworks AI",
      "minItems": 0,
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "modelConfig": {
      "type": "array",
      "title": "Model Configuration",
      "description": "Additional Model configuration parameters to be sent to Lucidworks AI",
      "minItems": 0,
      "items": {
        "type": "object",
        "required": ["key"],
        "properties": {
          "key": {
            "type": "string",
            "title": "Parameter Name"
          },
          "value": {
            "type": "string",
            "title": "Parameter Value"
          }
        }
      }
    },
    "maxTries": {
      "type": "integer",
      "title": "Maximum Asynchronous Call Tries",
      "description": "The maximum number of attempts to issue an asynchronous Lucidworks AI API call",
      "default": 1,
      "minimum": 1,
      "exclusiveMinimum": false
    },
    "failOnError": {
      "type": "boolean",
      "title": "Fail on Error",
      "description": "Flag to indicate if this stage should throw an exception if an error occurs while generating a prediction for a document.",
      "default": false
    }
  },
  "category": "AI",
  "categoryPriority": 10,
  "unsafe": false
};

export const SchemaParamFields = ({schema}) => {
  const sanitize = str => {
    if (typeof str !== "string") return str;
    return str.replace(/^"(.*)"$/s, "$1").replace(/\\/g, "").replace(/"/g, "'");
  };
  const formatDescription = str => {
    const s = sanitize(str);
    return (/[.!?]\)*$/).test(s) ? s : `${s}.`;
  };
  const {description, properties = {}, required: requiredProps = []} = schema;
  const visibleProps = useMemo(() => Object.entries(properties).filter(([, prop]) => !prop.hints?.includes("hidden")), [properties]);
  return <div>
      {description && <p>{formatDescription(description)}</p>}

      {visibleProps.map(([name, prop]) => {
    const isRequired = requiredProps.includes(name);
    const hasDefault = prop.default !== undefined;
    const rawDefault = prop.default;
    const isComplexDefault = hasDefault && (typeof rawDefault === "object" || typeof rawDefault === "string" && (rawDefault.length > 20 || rawDefault.includes('"')));
    const fieldProps = {
      key: name,
      body: prop.title || name,
      type: prop.type,
      ...prop.title && ({
        post: [<><span className="text-stone-400 dark:text-stone-500">API property: </span>{name}</>]
      }),
      ...isRequired && ({
        required: true
      }),
      ...!isComplexDefault && hasDefault ? {
        default: sanitize(String(rawDefault))
      } : {}
    };
    const isObject = prop.type === "object" && prop.properties;
    const isArrayOfObjects = prop.type === "array" && prop.items?.type === "object" && prop.items.properties;
    return <ParamField {...fieldProps}>
            {prop.description && <p>{formatDescription(prop.description)}</p>}

            {isComplexDefault && <div className="flex">
                <p>
                  <strong>Default:</strong>
                </p>
                <pre className="!my-0">
                  <code>
                    {JSON.stringify(rawDefault, null, 2)}
                  </code>
                </pre>
              </div>}

            {isArrayOfObjects && <div className="flex">
              <p>
                <strong>Object attributes:</strong>
              </p>
              <pre className="!my-0">
                <code>
                  {'{\n'}
                  {Object.entries(prop.items.properties).map(([iname, iprop]) => <>
                      {`  ${iname}`}
                      {prop.items?.required?.includes(iname) && <span style={{
      color: 'red'
    }}> required</span>}
                      {`: {\n    display name: ${sanitize(iprop.title || '')}\n    type: ${iprop.type}\n  }\n`}
                    </>)}
                  {'}'}
                </code>
              </pre>
              </div>}

            {isObject && <Expandable title="properties">
                <SchemaParamFields schema={{
      properties: prop.properties,
      required: prop.required
    }} />
              </Expandable>}
          </ParamField>;
  })}
    </div>;
};

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};

When you include chunking in your index pipeline, Lucidworks AI automatically splits large documents into smaller, more focused segments.
This approach is especially powerful when paired with [Neural Hybrid Search](/docs/5/fusion/hybrid-search/chunking) to surface the most relevant chunks instead of entire documents.
Chunking also improves the accuracy of AI assistants by delivering semantically rich training data in precise, context-aware pieces.

This stage performs chunking [asynchronously](/docs/lw-platform/lw-ai/lw-ai-stages/overview#run-stages-asynchronously) and stores the resulting chunk vectors in Solr.

Click your use case below to see examples of how chunking can enhance the search experience:

<Tabs>
  <Tab title="Business-to-Consumer" icon="cart-shopping" iconType="sharp-solid">
    * Break product descriptions into focused chunks so customers can find relevant details faster.
    * Reduce support tickets by training AI assistants on semantically-segmented help articles for more accurate answers.
    * Split multimedia transcripts (like product videos or webinars) into meaningful chunks so customers can find answers in content they wouldn't normally read.
  </Tab>

  <Tab title="Business-to-Business" icon="briefcase" iconType="sharp-solid">
    * Break down long technical specs so buyers can validate requirements without reading full documents.
    * Improve Request For Quote (RFQ) matching by extracting key details so sales teams can respond faster and more accurately.
    * Surface relevant regulatory or compliance info within dense documents for more efficient legal reviews.
  </Tab>

  <Tab title="Knowledge Management" icon="lightbulb" iconType="sharp-solid">
    * Chunk large case studies, policies, or contracts into semantically useful segments so employees get faster answers.
    * Boost AI assistant quality by providing smaller, context-rich units of information so responses are more precise.
    * Improve onboarding by connecting new hires with task-relevant excerpts from training content, runbooks, and policy documents.
  </Tab>
</Tabs>

For more details about configuring this feature, see these topics:

* [Fusion Chunking](/docs/5/fusion/hybrid-search/chunking).
* [Lucidworks Search Chunking](/docs/lucidworks-search/11-vector-search/chunking).
* [Lucidworks AI Async Chunking API](/docs/lw-platform/lw-ai/lw-ai-apis/lw-ai-async-chunking-api).

<LwTemplate />

## Prerequisites

To use this stage, non-admin Fusion users must be granted the `PUT,POST,GET:/LWAI-ACCOUNT-NAME/**` permission in Fusion. Replace `LWAI-ACCOUNT-NAME` with the Lucidworks AI API Account Name that is defined in [Lucidworks AI Gateway](/docs/lw-platform/lw-ai/lw-ai-gateway) and selected when this stage is configured.

Click **Get Started** below to see how to enable chunking in Fusion:

<iframe src="https://app.supademo.com/embed/cmfzg6uw4009oxx0i1ptmac82?embed_v=2&utm_source=embed" loading="lazy" title="Enable chunking in Fusion" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style={{  width: '100%', height: '500px' }} />

<Note>
  Additional requirements for the stage are:

  * Use a V2 connector. Other options, such as PBL or V1 connectors, do not work for this task.
  * Remove the `Apache Tika` stage from your parser because it can cause datasource failures with the following error: "The following components failed: \[class com.lucidworks.connectors.service.components.job.processor.DefaultDataProcessor : Only Tika Container parser can support Async Parsing.]"
</Note>

## Strategies

Choose one of these chunking strategies.

<Accordion title="Strategy descriptions" defaultOpen="true">
  <ParamField path="dynamic-newline">
    Splits on newlines, then merges lines up to `maxChunkSize`. The default for `maxChunkSize` is 512 tokens.
  </ParamField>

  <ParamField path="dynamic-sentence">
    Joins sentences until `maxChunkSize` is reached. Chunks can overlap using `overlapSize`.
  </ParamField>

  <ParamField path="sentence">
    Uses a fixed number of sentences per chunk. The default for `chunkSize` is 5. Chunks can overlap using `overlapSize`.
  </ParamField>

  <ParamField path="regex-splitter">
    Splits text on a regular expression. Set `regex` to the pattern, using Python `re` conventions.
  </ParamField>

  <ParamField path="semantic">
    Groups semantically similar sentences up to `maxChunkSize`. This strategy is the slowest but most precise. Chunks can overlap using `overlapSize`.
  </ParamField>
</Accordion>

Additional information about these chunker names and keys is available in the [Async Chunking API](/docs/lw-platform/lw-ai/lw-ai-apis/lw-ai-async-chunking-api#chunkerconfig).
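
To make the `chunkSize` and `overlapSize` parameters concrete, the following sketch imitates the `sentence` strategy in plain JavaScript. It is an illustration only: the `splitSentences` and `chunkBySentence` helpers are hypothetical, the sentence splitter is a naive regex, and the way overlap is applied here is an assumption. Lucidworks AI performs the real chunking and may behave differently.

```javascript
// Illustrative only: a naive imitation of the `sentence` strategy.
// Lucidworks AI performs the real chunking; these helpers are hypothetical.

// Split text into sentences on ".", "!", or "?" followed by whitespace.
function splitSentences(text) {
  return text
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Group sentences into chunks of `chunkSize`, repeating the last
// `overlapSize` sentences of each chunk at the start of the next one.
function chunkBySentence(text, { chunkSize = 5, overlapSize = 0 } = {}) {
  if (chunkSize < 1) throw new RangeError("chunkSize must be at least 1");
  if (overlapSize < 0 || overlapSize >= chunkSize) {
    throw new RangeError("overlapSize must be between 0 and chunkSize - 1");
  }
  const sentences = splitSentences(text);
  const step = chunkSize - overlapSize;
  const chunks = [];
  for (let start = 0; start < sentences.length; start += step) {
    chunks.push(sentences.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= sentences.length) break; // last chunk reached
  }
  return chunks;
}

const sample = "One. Two. Three. Four. Five.";
console.log(chunkBySentence(sample, { chunkSize: 2, overlapSize: 1 }));
```

In this sketch, a larger `overlapSize` yields more chunks for the same text because each chunk advances by only `chunkSize - overlapSize` sentences.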

## How asynchronous results return

The LWAI Chunker Index stage submits text to the Async Chunking API, which returns a `chunkingId`.
Once chunking completes, the results are fetched and written back to the same index pipeline using the [Solr Partial Update Indexer](/docs/5/fusion/reference/config-ref/pipeline-stages/index-stages/solr-partial-update-indexer-stage) stage.
This means the same pipeline is visited twice: once for the original document and once to apply chunk fields and vectors.
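
The two visits can be pictured with a small in-memory simulation. Everything in it is a stand-in: `fakeChunkingService`, `firstPass`, and `secondPass` are hypothetical names, the field `body_t` is a placeholder, the vectors are dummy values, and no real Lucidworks AI or Solr calls are made.

```javascript
// Illustrative simulation of the two pipeline visits. No real services are called.

// Hypothetical stand-in for the Async Chunking API: accepts text, returns an id,
// and serves the finished result when asked for that id.
const fakeChunkingService = {
  results: new Map(),
  nextId: 1,
  submit(text) {
    const chunkingId = `chunking-${this.nextId++}`;
    const chunks = text.split("\n").filter((line) => line.length > 0);
    // Dummy one-dimensional "vectors" stand in for real embeddings.
    const vectors = chunks.map((chunk) => [chunk.length]);
    this.results.set(chunkingId, { chunks, vectors });
    return chunkingId;
  },
  fetch(chunkingId) {
    return this.results.get(chunkingId);
  },
};

const index = new Map(); // stand-in for the Solr collection

// Visit 1: index the original document and submit its text for chunking.
function firstPass(doc) {
  index.set(doc.id, { ...doc });
  return fakeChunkingService.submit(doc.body_t);
}

// Visit 2: apply the chunk fields to the existing document as a partial update.
function secondPass(docId, chunkingId) {
  const { chunks, vectors } = fakeChunkingService.fetch(chunkingId);
  const existing = index.get(docId);
  index.set(docId, {
    ...existing,
    body_chunks_ss: chunks,
    body_chunk_vector_384v: vectors,
  });
}

const chunkingId = firstPass({ id: "doc-1", body_t: "First line.\nSecond line." });
secondPass("doc-1", chunkingId);
console.log(index.get("doc-1"));
```

The point of the sketch is that the second visit adds fields to a document that already exists, which is why the partial update must not overwrite the original fields.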

## What this stage writes

* Vector field (required): in **Destination Field Name & Context Output**, use a **dense vector** field and include `chunk_vector` in the field name, for example, `body_chunk_vector_384v`.
* Text chunks field (recommended): set **Destination Field Name for Text Chunks**, for example, `body_chunks_ss`.
* Doctype marker (required for chunking queries): add `_lw_chunk_doctype_s` with a marker for use in [Chunking Neural Hybrid Query stage](/docs/5/fusion/reference/config-ref/pipeline-stages/query-stages/chunking-neural-hybrid-query).\
  Markers:
  * `_lw_chunk_root` on the root document
  * The vector field name, such as `body_chunk_vector_384v`, on child documents
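
As a rough illustration of how the markers distinguish documents, the sketch below classifies documents by their `_lw_chunk_doctype_s` value. The `chunkRole` helper and the document IDs are hypothetical, and real chunk documents carry more fields than shown here.

```javascript
// Illustrative document shapes; only the field and marker names come from this page.
const VECTOR_FIELD = "body_chunk_vector_384v";
const DOCTYPE_FIELD = "_lw_chunk_doctype_s";
const ROOT_MARKER = "_lw_chunk_root";

// Classify a document by its doctype marker.
function chunkRole(doc, vectorField = VECTOR_FIELD) {
  const marker = doc[DOCTYPE_FIELD];
  if (marker === ROOT_MARKER) return "root";
  if (marker === vectorField) return "child";
  return "unmarked";
}

const docs = [
  { id: "doc-1", [DOCTYPE_FIELD]: ROOT_MARKER },
  { id: "doc-1-chunk-0", [DOCTYPE_FIELD]: VECTOR_FIELD }, // hypothetical child id
  { id: "doc-2" }, // no marker: not usable by chunking queries
];
console.log(docs.map((doc) => `${doc.id}: ${chunkRole(doc)}`));
```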

## Example setup for this stage

1. Add **LWAI Chunker Index Stage** to your index pipeline:
   * **Chunking Strategy:** choose a strategy, for example `sentence`.
   * **Model for Vectorization:** pick your embedding model.
   * **Input context variable:** the field or context variable containing the text to chunk.
   * **Destination Field Name & Context Output:** `body_chunk_vector_384v`. This name must contain `chunk_vector`, and the field must be a dense vector field.
   * **Destination Field Name for Text Chunks:** `body_chunks_ss`.
   * (Optional) In **Chunker Configuration**, set parameters such as `chunkSize=5` or `overlapSize=1`.
2. In the same pipeline, add **Solr Partial Update Indexer**:
   * Uncheck **Map to Solr Schema**
   * Uncheck **Enable Concurrency Control**
   * Uncheck **Reject Update if Solr Document is not Present**
   * Check **Process All Pipeline Doc Fields**
   * Check **Allow reserved fields**
3. Save the pipeline so that the async results return to the same pipeline.
4. Index a sample and verify:
   * The original doc is present.
   * After the async call completes, the doc has `body_chunk_vector_384v` containing the vectors, `body_chunks_ss` containing the text chunks, and the `_lw_chunk_doctype_s` markers for root and child documents.
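
The stage settings from step 1 can be written down as a plain object keyed by the API property names listed under Configuration. This is a sketch, not a complete pipeline definition: the account name, model name, and input field are placeholders, and `validateChunkerStage` is a hypothetical helper that only mirrors the required properties and the `chunk_vector` naming rule described on this page.

```javascript
// Illustrative stage settings keyed by the API property names from the Configuration section.
const chunkerStage = {
  accountName: "my-lwai-account", // placeholder
  chunkingStrategy: "sentence",
  modelName: "my-embedding-model", // placeholder
  inputContextVariable: "body_t", // placeholder for the text to chunk
  outputContextVariable: "body_chunk_vector_384v",
  outputTextChunks: "body_chunks_ss",
  chunkerConfig: [
    { key: "chunkSize", value: "5" },
    { key: "overlapSize", value: "1" },
  ],
};

const REQUIRED_PROPERTIES = [
  "accountName",
  "chunkingStrategy",
  "modelName",
  "inputContextVariable",
  "outputContextVariable",
];

// Return a list of human-readable problems; an empty list means the checks passed.
function validateChunkerStage(stage) {
  const problems = [];
  for (const name of REQUIRED_PROPERTIES) {
    if (typeof stage[name] !== "string" || stage[name].trim() === "") {
      problems.push(`Missing required property: ${name}`);
    }
  }
  const output = stage.outputContextVariable;
  if (typeof output === "string" && output !== "" && !output.includes("chunk_vector")) {
    problems.push("outputContextVariable must contain 'chunk_vector'");
  }
  return problems;
}

console.log(validateChunkerStage(chunkerStage));
```

Note that `chunkerConfig` values are strings, matching the key/value pairs entered in **Chunker Configuration**.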

<Tip>
  Fusion truncates text sent for chunking to \~50,000 characters, so plan chunking inputs accordingly.
</Tip>
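
One way to plan for the limit is to measure inputs before indexing. The helper below is a hypothetical sketch: it treats 50,000 as an approximate threshold because the exact cutoff is not specified here, and it counts JavaScript string length (UTF-16 code units), which may not match how Fusion counts characters.

```javascript
// Approximate limit mentioned in the tip; the exact cutoff is an assumption.
const APPROX_CHUNKING_LIMIT = 50000;

// Report whether a text is likely to be truncated before chunking.
function checkChunkingInput(text, limit = APPROX_CHUNKING_LIMIT) {
  const overBy = Math.max(0, text.length - limit);
  return { length: text.length, willLikelyTruncate: overBy > 0, overBy };
}

console.log(checkChunkingInput("x".repeat(60000)));
```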

## What to use for query

Use **Chunking Neural Hybrid Query** to combine lexical and vector search over parent and child chunks.
It expects a vector field like `body_chunk_vector_384v` and the `_lw_chunk_doctype_s` markers described above.

## Configuration

<Tip>
  When entering configuration values in the UI, use *unescaped* characters, such as `\t` for the tab character. When entering configuration values in the API, use *escaped* characters, such as `\\t` for the tab character.
</Tip>
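
The difference between the two forms is ordinary JSON string escaping, which the short sketch below demonstrates. The `regex` key is used only as an example parameter; the request shape is illustrative, not a documented payload.

```javascript
// The value as typed in the UI: a backslash followed by the letter t (two characters).
const uiValue = String.raw`\t`;

// Serializing the same value for an API request body escapes the backslash.
const apiBody = JSON.stringify({ key: "regex", value: uiValue });
console.log(apiBody);

// Parsing the body restores the original two-character value.
console.log(JSON.parse(apiBody).value === uiValue);
```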

<SchemaParamFields schema={schema} />
