The Lucidworks AI Async Chunking API asynchronously separates large pieces of text into smaller pieces, called chunks, and then returns the chunks and their associated vectors. Currently, the maximum text size allowed for input is approximately 1 MB.

Chunking can produce a significant number of chunks, especially with overlapping chunks or small chunk sizes, so there are limits on how many chunks and vectors can be generated. These limits depend on factors such as the dimension size of the embedding model and whether vector quantization is used.

The Async Chunking API contains two requests:
  • POST Request. This request submits text for a chunking strategy and model. Upon submission, the API responds with the following information:
    • chunkingId. A unique UUID for the submitted chunking task, which can be used later to retrieve the results.
    • status. The current state of the chunking task.
  • GET Request. This request retrieves the results of a previously submitted chunking request. You must provide the unique chunkingId received from the POST response. The API then returns the results of the chunking request associated with that chunkingId.
For more information, see the API specification.

Chunking strategies (chunkers)

There are five chunking strategies (chunkers) available in the Async Chunking API. Each chunker splits and processes submitted text differently.

Prerequisites

To use this API, you need:
  • The unique APPLICATION_ID for your Lucidworks AI application. For more information, see credentials to use APIs.
  • A bearer token generated with a scope value of machinelearning.predict. For more information, see Authentication API.
  • The CHUNKER and MODEL_ID fields for the use case request. The path is: /ai/async-chunking/CHUNKER/MODEL_ID. A list of supported models is returned in the Lucidworks AI Use Case API.

Common parameters and fields

Some parameters in the /ai/async-chunking/CHUNKER/MODEL_ID request are common to all of the Async Chunking API requests, such as the modelConfig parameter. Also referred to as hyperparameters, these fields set certain controls on the response. Refer to the API spec for more information.

Vector quantization

To process large chunks of text efficiently, Lucidworks recommends entering the appropriate value in the "modelConfig": "vectorQuantizationMethod" field to ensure that as much of the text as possible is chunked, even for large inputs. Quantized vectors are less resource-intensive to store and compute, which decreases index and query processing time. Because each quantized vector is smaller, more quantized vectors fit in the same amount of memory than full-precision vectors. For example, compare quantized vectors such as [1,0,2], [2,3,1], [6,0,0], [0,0,2] with full-precision vectors such as [0.012341,0.23434,0.01334], [0.5434,0.02134,0.05434], [0.76534,0.0953,0.1334], [0.398,0.38574,0.01384]. In a 5 MB memory budget, you might store 5000 quantized vectors but only 500 full-precision vectors, because each full-precision component takes more memory to store. (The figures 5 MB, 5000, and 500 are illustrative only.)

The following table specifies the number of chunks returned for a single request, based on vector dimension and the vector quantization setting.
Vector Dimension Size | Maximum Chunks Returned (Quantized Vector = true) | Maximum Chunks Returned (Quantized Vector = false)
32   | 40000 | 11000
64   | 22500 | 5800
128  | 12000 | 3000
256  | 6500  | 1500
384  | 4500  | 1000
512  | 3250  | 750
768  | 2250  | 500
1024 | 1700  | 380
1536 | 250   | 250
2048 | 850   | 190

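The memory arithmetic behind quantization can be sketched as follows. This is an illustrative calculation only: it assumes quantized vectors are stored as int8 (1 byte per dimension) and full-precision vectors as float32 (4 bytes per dimension); the actual storage format used by Lucidworks AI may differ.

```python
def vectors_per_budget(budget_bytes: int, dims: int, bytes_per_dim: int) -> int:
    """Return how many vectors of the given dimension fit in a memory budget."""
    return budget_bytes // (dims * bytes_per_dim)

budget = 5 * 1024 * 1024  # a hypothetical 5 MB budget
dims = 512

quantized = vectors_per_budget(budget, dims, 1)  # int8: 1 byte per dimension
full = vectors_per_budget(budget, dims, 4)       # float32: 4 bytes per dimension

# Under these assumptions, four times as many quantized vectors fit
# in the same budget as full-precision vectors.
print(quantized, full)
```
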
useCaseConfig

The "useCaseConfig": "dataType": "string" parameter is common to all of the Async Chunking API chunkers in the /ai/async-chunking/CHUNKER/MODEL_ID request. If you do not enter a value, the default of query is used. This optional parameter enables model-specific handling in the Async Chunking API to help improve model accuracy. Choose the dataType value that best aligns with the text sent to the Async Chunking API. The string values to use are:
  • "dataType": "query" for query text.
  • "dataType": "passage" for text in fields searched at query time.
The syntax example is:
"useCaseConfig": {
  "dataType": "query"
}

Unique parameters and fields

chunkerConfig

The parameters to configure each chunker are as follows:

dynamic-newline chunker

The dynamic-newline chunker splits the provided text on all newline characters, then merges the split pieces that fall under the maxChunkSize limit. If no chunkerConfig is passed, the default values shown below are used.
  • "chunkerConfig": "maxChunkSize" - This integer field defines the maximum token limit for a chunker. The default is 512 tokens, which matches the maximum context size of the Lucidworks-hosted embedding models.
    "chunkerConfig": {
      "maxChunkSize": 512
    }

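As a rough sketch of the strategy described above (not the service's implementation), the following splits on newlines and greedily merges pieces while staying under maxChunkSize. Token counting is approximated here by whitespace-separated words; the hosted service counts tokens with the model's tokenizer.

```python
def dynamic_newline_chunks(text: str, max_chunk_size: int = 512) -> list[str]:
    """Split text on newlines, then greedily merge adjacent pieces while
    the merged chunk stays within max_chunk_size tokens (approximated
    as whitespace-separated words)."""
    pieces = [p for p in text.split("\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = (current + "\n" + piece) if current else piece
        if len(candidate.split()) <= max_chunk_size:
            current = candidate  # still fits: keep merging
        else:
            if current:
                chunks.append(current)
            current = piece      # start a new chunk
    if current:
        chunks.append(current)
    return chunks

text = "Down came the rain.\nAnd washed the spider out.\nOut came the sun."
# With an 8-word budget no two pieces fit together, so each line is its own chunk.
print(dynamic_newline_chunks(text, max_chunk_size=8))
```

Raising the budget changes the result: with max_chunk_size=12, the first two lines merge into one chunk.
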
dynamic-sentence chunker

The dynamic-sentence chunker splits the provided text into sentences. Sentences are joined until they reach the maxChunkSize. If overlapSize is provided, adjacent chunks overlap by that many sentences. Example:
  • Chunk 1: Sentence 1, Sentence 2, Sentence 3
  • Chunk 2: Sentence 3, Sentence 4, Sentence 5
  • Chunk 3: Sentence 5, Sentence 6, Sentence 7
If no chunkerConfig is passed, the default values shown below are used.
  • "chunkerConfig": "maxChunkSize" - This integer field defines the maximum token limit for a chunker. The default is 512 tokens, which matches the maximum context size of the Lucidworks-hosted embedding models.
    {
      "chunkerConfig": {
        "maxChunkSize": 512
      }
    }
    
  • "chunkerConfig": "overlapSize" - This integer field sets the number of sentences that can overlap between consecutive chunks. The default is 1 sentence for most configurations.
    "chunkerConfig": {
      "overlapSize": 1
    }

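The overlap pattern in the Chunk 1/2/3 example above can be sketched as follows. This is an illustrative sketch only, with tokens approximated as whitespace-separated words; the service uses the model's tokenizer.

```python
def dynamic_sentence_chunks(sentences: list[str],
                            max_chunk_size: int = 512,
                            overlap_size: int = 1) -> list[list[str]]:
    """Join sentences until adding another would exceed max_chunk_size
    tokens (approximated as whitespace words), then start the next chunk
    with the last overlap_size sentences of the previous chunk."""
    chunks: list[list[str]] = []
    i = 0
    while i < len(sentences):
        chunk: list[str] = []
        tokens = 0
        j = i
        while j < len(sentences):
            n = len(sentences[j].split())
            if chunk and tokens + n > max_chunk_size:
                break  # chunk is full; stop before exceeding the budget
            chunk.append(sentences[j])
            tokens += n
            j += 1
        chunks.append(chunk)
        if j >= len(sentences):
            break
        # Step back by overlap_size sentences, but always make progress.
        i = j - overlap_size if j - overlap_size > i else j
    return chunks

# Seven 2-word sentences with a 6-token budget reproduce the example:
# [S1,S2,S3], [S3,S4,S5], [S5,S6,S7].
sents = [f"Sentence {k}." for k in range(1, 8)]
print(dynamic_sentence_chunks(sents, max_chunk_size=6, overlap_size=1))
```
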
regex-splitter chunker

The regex-splitter chunker splits the submitted text based on the specified regex (regular expression), following the conventions of the Python re package. If no chunkerConfig is passed, the default configuration is used. For more information about re operations, see https://docs.python.org/3/library/re.html.
  • "chunkerConfig": "regex" - This field sets the regular expression used to split the provided text. For example, \\n.
    "chunkerConfig": {
      "regex": "\\n"
    }

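A minimal illustration of the splitting convention using Python's re module, which the regex-splitter chunker follows. Note that the "\\n" in the JSON payload reaches the service as the two characters \n, which re interprets as a newline.

```python
import re

text = "Down came the rain.\nAnd washed the spider out.\nOut came the sun."

# "\\n" in Python source is a backslash followed by n, just like the JSON
# payload; re.split interprets it as a newline and drops empty pieces.
chunks = [c for c in re.split("\\n", text) if c]
print(chunks)
```
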
semantic chunker

The semantic chunker creates chunks based on semantic similarity. Using the model defined in the request URL, the semantic chunker splits text into sentences, encodes each sentence, and then compares the sentence vector to the vector of the chunk being built to determine whether they are similar enough to group together. After merging a semantically similar sentence into a chunk, the semantic chunker re-encodes the merged chunk to get the vector it compares with the next sentence vector. This chunker is the slowest of all of the chunkers, even if you set the approximate field to true. If no chunkerConfig is passed, the default values shown below are used.
  • "chunkerConfig": "maxChunkSize" - This integer field defines the maximum token limit for a chunker. The default is 512 tokens, which matches the maximum context size of the Lucidworks-hosted embedding models.
    "chunkerConfig": {
      "maxChunkSize": 512
    }
  • "chunkerConfig": "overlapSize" - This integer field sets the number of sentences that can overlap between consecutive chunks. The default is 1 sentence for most configurations.
    "chunkerConfig": {
      "overlapSize": 1
    }
  • "chunkerConfig": "cosineThreshold" - This decimal field controls how similar a sentence must be to a chunk (based on cosine similarity), in order for the sentence to be merged into the chunk. This value is a decimal between 0 and 1. The default threshold is 0.5.
    "chunkerConfig": {
      "cosineThreshold": 0.5
    }
    
  • "chunkerConfig": "approximate" - If this boolean field is set to true, the semantic chunker does not re-encode the merged chunk to get its vector to compare with the next sentence vector. This greatly decreases processing time with little or no loss in result quality. However, even with the approximate field set to true, the semantic chunker is the slowest of all the chunkers. If this field is set to false, semantic chunking is, on average, 5 times slower than when set to true, with very minimal or no precision increase.
    "chunkerConfig": {
      "approximate": true
    }
    

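The merge decision described above can be sketched with a toy example. This is an illustrative sketch only: the toy_embed function is a hypothetical stand-in for the embedding model named in the request URL, and real vectors have hundreds of dimensions rather than two.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embed, cosine_threshold=0.5, approximate=True):
    """Merge each sentence into the chunk being built when its vector is
    similar enough to the chunk's vector. With approximate=True the chunk
    keeps its existing vector instead of re-encoding the merged text after
    every merge, which is what makes exact mode slower."""
    chunks = [[sentences[0]]]
    chunk_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(chunk_vec, vec) >= cosine_threshold:
            chunks[-1].append(sent)
            if not approximate:
                # Exact mode: re-encode the merged chunk (one extra
                # embedding call per merge).
                chunk_vec = embed(" ".join(chunks[-1]))
        else:
            chunks.append([sent])
            chunk_vec = vec
    return chunks

# Hypothetical 2-d "embeddings" keyed on a topic word, for illustration.
def toy_embed(s: str) -> list[float]:
    return [1.0, 0.0] if "spider" in s else [0.0, 1.0]

sents = ["The spider climbed up.", "The spider fell down.", "The sun came out."]
print(semantic_chunks(sents, toy_embed, cosine_threshold=0.5))
```

The two spider sentences merge into one chunk; the sun sentence falls below the threshold and starts a new chunk.
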
sentence chunker

The sentence chunker splits text into sentences. If no chunkerConfig is passed, the default values shown below are used.
  • "chunkerConfig": "chunkSize" - This integer field sets the maximum number of sentences per chunk. The default is 5.
    "chunkerConfig": {
      "chunkSize": 5
    }
  • "chunkerConfig": "overlapSize" - This integer field sets the number of sentences that can overlap between consecutive chunks. The default is 1 sentence for most configurations.
    "chunkerConfig": {
      "overlapSize": 1
    }

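The sentence chunker's fixed-size window can be sketched as follows (an illustrative sketch, not the service's implementation): each chunk holds chunkSize sentences and begins overlapSize sentences before the end of the previous chunk.

```python
def sentence_chunks(sentences: list[str],
                    chunk_size: int = 5,
                    overlap_size: int = 1) -> list[list[str]]:
    """Fixed-size sliding window over sentences: each chunk holds
    chunk_size sentences and overlaps the previous chunk by overlap_size."""
    step = chunk_size - overlap_size
    chunks: list[list[str]] = []
    i = 0
    while i < len(sentences):
        chunks.append(sentences[i:i + chunk_size])
        if i + chunk_size >= len(sentences):
            break  # the final chunk reached the end of the text
        i += step
    return chunks

# Nine sentences, chunkSize 5, overlapSize 1: the second chunk starts at S5.
print(sentence_chunks([f"S{k}" for k in range(1, 10)]))
```
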
POST request

The following is an example of the POST request used by every chunker. Fields and values unique to each chunker are detailed in Unique parameters and fields. The possible values of the status of a request are:
  • SUBMITTED. The POST request was successful, and the response returned the chunkingId and status used by the GET request.
  • ERROR. An error occurred while processing the chunking request.
  • READY. The results associated with the chunkingId are available and ready to be retrieved.
  • RETRIEVED. The results associated with the chunkingId were returned successfully by a GET request.
curl --request POST \
  --url https://APPLICATION_ID.applications.lucidworks.com/ai/async-chunking/{CHUNKER}/{MODEL_ID} \
  --header 'Authorization: Bearer ACCESS_TOKEN' \
  --header 'Content-Type: application/json' \
  --data '{
    "batch": [
      {
        "text": "The itsy bitsy spider climbed up the waterspout.\nDown came the rain.\nAnd washed the spider out.\nOut came the sun.\nAnd dried up all the rain.\nAnd the itsy bitsy spider climbed up the spout again."
      }
    ],
    "useCaseConfig": {
      "dataType": "query"
    },
    "modelConfig": {
      "vectorQuantizationMethod": "max-scale"
    }
  }'

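The request body above can also be assembled programmatically before sending. This is a sketch only: the helper name is hypothetical, and the field names are taken from the example request; check the API specification for the full schema.

```python
import json

def build_chunking_payload(texts, data_type="query", quantization=None):
    """Assemble the JSON body for a POST to
    /ai/async-chunking/CHUNKER/MODEL_ID (hypothetical helper; field names
    follow the example request above)."""
    payload = {
        "batch": [{"text": t} for t in texts],
        "useCaseConfig": {"dataType": data_type},
    }
    if quantization:
        payload["modelConfig"] = {"vectorQuantizationMethod": quantization}
    return payload

body = build_chunking_payload(
    ["Down came the rain.\nAnd washed the spider out."],
    quantization="max-scale",
)
print(json.dumps(body, indent=2))
```
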
GET request

To retrieve the chunked results, use the chunkingId from the POST response in a GET request. The following is an example of the GET request used by every chunker.
curl --request GET \
  --url https://APPLICATION_ID.applications.lucidworks.com/ai/async-chunking/{CHUNKING_ID} \
  --header 'Authorization: Bearer ACCESS_TOKEN' \
  --header 'Content-type: application/json'