
# Detect Language Index Stage

export const schema = {
  "type": "object",
  "title": "Detect Language",
  "description": "Detect the language of the input source fields using https://github.com/optimaize/language-detector. If the output is stored to the document, there will be a new field created for each source field with the language",
  "required": ["source"],
  "properties": {
    "skip": {
      "type": "boolean",
      "title": "Skip This Stage",
      "description": "Set to true to skip this stage.",
      "default": false,
      "hints": ["advanced"]
    },
    "label": {
      "type": "string",
      "title": "Label",
      "description": "A unique label for this stage.",
      "hints": ["advanced"],
      "maxLength": 255
    },
    "condition": {
      "type": "string",
      "title": "Condition",
      "description": "Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.",
      "hints": ["code", "code/javascript", "advanced"]
    },
    "source": {
      "type": "array",
      "title": "Source",
      "description": "The fields/context keys to detect on.  May be a String Template. See https://github.com/antlr/stringtemplate4/blob/master/doc/index.md",
      "minItems": 1,
      "items": {
        "type": "string"
      }
    },
    "outputKey": {
      "type": "string",
      "title": "Output Key",
      "description": "The name of the key to insert into the context if the output type is 'context'.  The value is a map of source name to language.  May be a String Template. See https://github.com/antlr/stringtemplate4/blob/master/doc/index.md",
      "default": "languages"
    },
    "documentPostfix": {
      "type": "string",
      "title": "Document Postfix",
      "description": "The postfix to add to the source name when storing the results on the document (via the output type).",
      "default": "_lang"
    },
    "outputType": {
      "type": "string",
      "title": "Output Type",
      "description": "Select whether the flag should be set on the document or in the Pipeline Context.",
      "enum": ["document", "context"]
    }
  },
  "category": "Document Filtering and Enrichment",
  "categoryPriority": 7,
  "unsafe": false
};

export const SchemaParamFields = ({schema}) => {
  const sanitize = str => {
    if (typeof str !== "string") return str;
    return str.replace(/^"(.*)"$/s, "$1").replace(/\\/g, "").replace(/"/g, "'");
  };
  const formatDescription = str => {
    const s = sanitize(str);
    return (/[.!?]\)*$/).test(s) ? s : `${s}.`;
  };
  const {description, properties = {}, required: requiredProps = []} = schema;
  const visibleProps = useMemo(() => Object.entries(properties).filter(([, prop]) => !prop.hints?.includes("hidden")), [properties]);
  return <div>
      {description && <p>{formatDescription(description)}</p>}

      {visibleProps.map(([name, prop]) => {
    const isRequired = requiredProps.includes(name);
    const hasDefault = prop.default !== undefined;
    const rawDefault = prop.default;
    const isComplexDefault = hasDefault && (typeof rawDefault === "object" || typeof rawDefault === "string" && (rawDefault.length > 20 || rawDefault.includes('"')));
    const fieldProps = {
      key: name,
      body: prop.title || name,
      type: prop.type,
      ...prop.title && ({
        post: [<><span className="text-stone-400 dark:text-stone-500">API property: </span>{name}</>]
      }),
      ...isRequired && ({
        required: true
      }),
      ...!isComplexDefault && hasDefault ? {
        default: sanitize(String(rawDefault))
      } : {}
    };
    const isObject = prop.type === "object" && prop.properties;
    const isArrayOfObjects = prop.type === "array" && prop.items?.type === "object" && prop.items.properties;
    return <ParamField {...fieldProps}>
            {prop.description && <p>{formatDescription(prop.description)}</p>}

            {isComplexDefault && <div className="flex">
                <p>
                  <strong>Default:</strong>
                </p>
                <pre className="!my-0">
                  <code>
                    {JSON.stringify(rawDefault, null, 2)}
                  </code>
                </pre>
              </div>}

            {isArrayOfObjects && <div className="flex">
              <p>
                <strong>Object attributes:</strong>
              </p>
              <pre className="!my-0">
                <code>
                  {'{\n'}
                  {Object.entries(prop.items.properties).map(([iname, iprop]) => <>
                      {`  ${iname}`}
                      {prop.items?.required?.includes(iname) && <span style={{
      color: 'red'
    }}> required</span>}
                      {`: {\n    display name: ${sanitize(iprop.title || '')}\n    type: ${iprop.type}\n  }\n`}
                    </>)}
                  {'}'}
                </code>
              </pre>
              </div>}

            {isObject && <Expandable title="properties">
                <SchemaParamFields schema={{
      properties: prop.properties,
      required: prop.required
    }} />
              </Expandable>}
          </ParamField>;
  })}
    </div>;
};

export const LwTemplate = ({title = "Key questions to get you started", icon = "sparkles", cta = "Powered by Agent Studio", linkHref = "https://lucidworks.com/demo/?utm_source=docs&utm_medium=referral&utm_campaign=docs_cta_ai"}) => {
  const [isLoaded, setIsLoaded] = useState(false);
  useEffect(() => {
    const timer = setTimeout(() => {
      setIsLoaded(true);
    }, 500);
    return () => clearTimeout(timer);
  }, []);
  return <div className="lw-template-container">
      <Card title={title} icon={icon}>
        {isLoaded && <span dangerouslySetInnerHTML={{
    __html: `<lw-template id="a029c1a9-28be-427e-b0e1-5d918920246a"></lw-template
            >`
  }} />}
        <Link href={linkHref} className="agent-studio-link text-left text-gray-600 gap-2 dark:text-gray-400 text-sm font-medium flex flex-row items-center hover:text-primary dark:hover:text-primary-light group-hover:text-primary group-hover:dark:text-primary-light">Powered by Lucidworks Agent Studio</Link>
      </Card>
    </div>;
};


The Detect Language index stage (called the Language Detection stage in versions earlier than 3.0) operates on one or more fields in the Pipeline Document.
The contents of each field are analyzed using the
[Language Detection Library for Java](https://github.com/optimaize/language-detector),
an open source project hosted on GitHub.
The analyzer returns the ID of the language that best matches the contents of that field, if any.
These IDs can be stored either as an annotation in the Pipeline Document context or as an annotation on each analyzed field.

The language identification algorithm breaks the text in each source field into n-grams and compares them to sets of n-grams compiled from the different language versions of Wikipedia.
The library only produces reasonable results for document fields that are comparable in length, vocabulary, and style to the known texts compiled from Wikipedia.
Caveats are discussed below.
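
As a rough illustration of the n-gram idea, the sketch below splits text into character trigrams and measures how many of them appear in a small reference profile. This is a toy example, not the algorithm used by the Language Detection Library for Java; the function names and the sample profiles are hypothetical.

```javascript
// Toy illustration of n-gram matching; NOT the library's implementation.
// All function names and sample data here are hypothetical.

// Split text into overlapping character n-grams.
function charNgrams(text, n = 3) {
  const normalized = text.toLowerCase();
  const grams = [];
  for (let i = 0; i + n <= normalized.length; i++) {
    grams.push(normalized.slice(i, i + n));
  }
  return grams;
}

// Fraction of the text's distinct n-grams that also occur in the profile.
function overlapScore(text, profile, n = 3) {
  const grams = new Set(charNgrams(text, n));
  if (grams.size === 0) return 0;
  let hits = 0;
  for (const gram of grams) {
    if (profile.has(gram)) hits++;
  }
  return hits / grams.size;
}

// Return the ID of the best-scoring profile, or null if nothing matches.
function guessLanguage(text, profiles) {
  let best = null;
  let bestScore = 0;
  for (const [lang, profile] of Object.entries(profiles)) {
    const score = overlapScore(text, profile);
    if (score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return best;
}

// Tiny made-up profiles; real profiles are compiled from large corpora.
const toyProfiles = {
  en: new Set(charNgrams("the quick brown fox jumps over the lazy dog")),
  de: new Set(charNgrams("der schnelle braune fuchs springt über den faulen hund")),
};

console.log(guessLanguage("the lazy fox", toyProfiles));
```

Even in this toy version, a very short input yields few n-grams to compare, which hints at why short fields are hard to classify.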

If a language is positively identified, that information is added to the Pipeline Document
according to the configuration property "Output Type".
If Output Type is "context", the language annotation is added to the PipelineDocument context object, and the name of the context key is
specified by the configuration property "Output Key".
If Output Type is "document", per-field language annotations are added to the document
using a parallel naming convention: the name of each language identification field is the name of the analyzed field
followed by the suffix set in "Document Postfix", which defaults to "\_lang".
For example, if a document contains fields named "plot\_summary\_txt" and "user\_reviews\_txt" to be analyzed
and the software can detect their language, it adds the fields "plot\_summary\_txt\_lang" and "user\_reviews\_txt\_lang".
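
The naming convention above can be sketched as follows. The configuration object uses only property names from this stage's schema (`source`, `outputType`, `documentPostfix`); the helper `languageFieldNames` is hypothetical and only illustrates how the output field names are derived.

```javascript
// Example stage properties, limited to names that appear in this stage's schema.
// Other stage attributes (such as its ID or type) are intentionally omitted.
const stageConfig = {
  source: ["plot_summary_txt", "user_reviews_txt"],
  outputType: "document",
  documentPostfix: "_lang",
};

// Hypothetical helper: derive the language field name for each source field.
function languageFieldNames(config) {
  const postfix = config.documentPostfix ?? "_lang"; // schema default
  return config.source.map((field) => `${field}${postfix}`);
}

console.log(languageFieldNames(stageConfig));
// [ 'plot_summary_txt_lang', 'user_reviews_txt_lang' ]
```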

The stage can also report multiple detected languages. To enable this, set "Return all detected languages and their confidence scores." to true.
In this case, the detected languages are stored either as document fields of the form "Field Name\_Document Postfix.Language:Confidence",
or as a Context entry named by "Output Key" whose value is a dictionary of the form "\{ "language":"probability" }".
For example, if the languages pl and en are detected, the document fields could look like this: "plot\_summary\_txt\_lang.pl\_: \[0.99]", "plot\_summary\_txt\_lang.en\_: \[0.99]".
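
A downstream script might reduce such a dictionary to a single best guess. The sketch below assumes a flat map of the form `{ "language": "probability" }`; the helper name and the sample values are hypothetical, and the exact structure stored by the stage may differ.

```javascript
// Hypothetical helper: pick the most probable language from a map of the
// form { "language": "probability" }. Values are converted with Number()
// because they may arrive as strings.
function mostProbableLanguage(languageMap) {
  let best = null;
  let bestProbability = -Infinity;
  for (const [language, probability] of Object.entries(languageMap)) {
    const p = Number(probability);
    if (!Number.isNaN(p) && p > bestProbability) {
      best = language;
      bestProbability = p;
    }
  }
  return best;
}

// Made-up sample data, for illustration only.
const sampleLanguages = { pl: "0.62", en: "0.38" };
console.log(mostProbableLanguage(sampleLanguages)); // pl
```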

<LwTemplate />

## Languages

The Language Detection Library for Java has built-in profiles for [many languages](https://github.com/optimaize/language-detector/blob/master/README.md#71-built-in-language-profiles). These are the language profiles that can be used as object attributes in the `languages` array. If a set of Wikipedia entries exists in a given language, the Language Detection Library can likely identify texts written in that language.

## Caveats

This library should produce reasonable results on document fields that are comparable in length, vocabulary, and style to the known texts compiled from Wikipedia.

The documentation lists the following [challenges](https://github.com/optimaize/language-detector/blob/master/README.md#challenges):

* This software does not work as well when the input text to analyze is short or unclean, for example tweets.
* When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the individual parts. Running the language guesser on the whole text will just tell you the language that is most dominant, in the best case.
* This software does not handle input text well when it is in none of the expected (and supported) languages.
* The stage might detect unwanted languages, for example a language that is not used in the input data at all, because some languages are similar. By default, the stage uses the full array of available languages for detection ([list here](https://github.com/optimaize/language-detector)). To use only selected languages, this can be configured.
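
The suggestion to split mixed-language text and detect the parts individually can be sketched as below. This is a minimal illustration, not part of the stage: `detect` is a caller-supplied function that stands in for whatever language detector is available, and both helper names are hypothetical.

```javascript
// Hypothetical sketch: split text into paragraphs and detect each one.
// `detect` is a caller-supplied function (text -> language ID or null).
function detectPerParagraph(text, detect) {
  return text
    .split(/\n\s*\n/) // blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0)
    .map((paragraph) => ({ paragraph, language: detect(paragraph) }));
}

// Count paragraphs per detected language; null results are skipped.
function languageCounts(results) {
  const counts = {};
  for (const { language } of results) {
    if (language !== null) counts[language] = (counts[language] ?? 0) + 1;
  }
  return counts;
}
```

Splitting by paragraph keeps each part reasonably long. Splitting by sentence gives finer granularity, but each part is shorter, which runs into the short-text limitation listed above.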

## Configuration

<Tip>
  When entering configuration values in the UI, use *unescaped* characters, such as `\t` for the tab character. When entering configuration values in the API, use *escaped* characters, such as `\\t` for the tab character.
</Tip>
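
One way to read this tip, assuming API requests are sent as JSON: a backslash typed as-is in the UI must be doubled when the same value is written inside a JSON string. The sketch below shows that relationship; the variable names are illustrative only.

```javascript
// The two-character sequence backslash + "t", as it would be typed in the UI.
const uiValue = String.raw`\t`;

// Serializing it as a JSON string doubles the backslash (the escaped form).
const apiValue = JSON.stringify(uiValue);

console.log(uiValue);  // \t
console.log(apiValue); // "\\t"
```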

<SchemaParamFields schema={schema} />
