The Detect Language index stage operates over one or more fields in the Pipeline Document. The contents of each field are analyzed using the Language Detection Library for Java, an open source project hosted on GitHub. The analyzer returns the ID of the language that best matches the contents of that field, if any. The stage can return these IDs as annotations on the Pipeline Document context, or as an annotation on each field analyzed.
The language identification algorithm breaks the text in each source field into n-grams and compares them to sets of n-grams compiled from the different language versions of Wikipedia. The library only produces reasonable results for document fields that are comparable in length, vocabulary, and style to the known texts compiled from Wikipedia; see Caveats below.
If a positive language identification is made, the stage writes the detected language information to the Pipeline Document based on the Output Type configuration property. If the language annotation is added to the Pipeline Document context object, the name of the context key is specified by the Output Key configuration property.
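
The n-gram comparison described above can be illustrated with a small sketch. It is not the library's implementation: the profiles here are built from a few hand-written sentences rather than Wikipedia, cosine similarity stands in for the library's actual scoring, and every function name is made up for illustration.

```python
from collections import Counter
from math import sqrt


def ngrams(text, n=3):
    """Count the character n-grams in text (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def build_profile(samples, n=3):
    """Merge the n-gram counts of several sample texts into one profile."""
    profile = Counter()
    for sample in samples:
        profile.update(ngrams(sample, n))
    return profile


def similarity(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def detect(text, profiles):
    """Return the best-matching language ID, or None if nothing overlaps."""
    grams = ngrams(text)
    scores = {lang: similarity(grams, profile) for lang, profile in profiles.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None


# Toy profiles (assumption): stand-ins for the profiles compiled from Wikipedia.
PROFILES = {
    "en": build_profile(["the quick brown fox jumps over the lazy dog",
                         "this is the house that they have"]),
    "pl": build_profile(["szybki brązowy lis przeskakuje nad leniwym psem",
                         "to jest dom w którym oni mieszkają"]),
}

print(detect("the dog is in the house", PROFILES))
print(detect("to jest leniwy pies", PROFILES))
```

Because the score depends on how many n-grams the input shares with a profile, very short inputs give the comparison little to work with, which is one reason the caveats on short text apply.
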
Although documentPostfix is primarily associated with document output, it must also be set when outputType is set to context. If documentPostfix is not specified, the detected language information may not be captured in the context output, even when outputKey is configured.
When Output Type is set to Document, per-field language annotations are added to the document using a parallel naming convention: the name of the language identification field is the name of the analyzed field plus a suffix string, which defaults to _lang. For example, if a document contains the fields plot_summary_txt and user_reviews_txt to be analyzed and the software can detect their languages, it adds the fields plot_summary_txt_lang and user_reviews_txt_lang.
The stage can also detect multiple languages. To enable this, set Return all detected languages and their confidence scores to true. In this case, the detected languages are either set as document fields in the form Field Name_Document Postfix.Language:Confidence, or as a context field named by Output Key whose value is a dictionary of the form { "language":"probability" }. For example, when the languages pl and en are detected, the document fields could look like this: plot_summary_txt_lang.pl_: [0.99], plot_summary_txt_lang.en_: [0.99].
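
The single-language naming convention and the context dictionary can be mimicked in a few lines. This is a sketch over plain dictionaries, not the stage's code: the function names, the example field values, and the context key detected_languages are hypothetical, and the multi-language document field format is left out.

```python
def annotate_document(doc, detected, postfix="_lang"):
    """Return a copy of doc with a <field><postfix> entry per detected language.

    detected maps an analyzed field name to a language ID, or to None when
    no language could be identified for that field.
    """
    out = dict(doc)
    for field, lang in detected.items():
        if field in doc and lang is not None:
            out[field + postfix] = lang
    return out


def annotate_context(context, scores, output_key):
    """Return a copy of context with {language: probability} under output_key."""
    out = dict(context)
    out[output_key] = dict(scores)
    return out


doc = {
    "plot_summary_txt": "A detective returns to her home town.",
    "user_reviews_txt": "!!!",
}
annotated = annotate_document(
    doc, {"plot_summary_txt": "en", "user_reviews_txt": None}
)
print(sorted(annotated))

# "detected_languages" is a made-up Output Key value for this example.
context = annotate_context({}, {"pl": 0.99}, "detected_languages")
print(context)
```

A field with no positive identification gets no companion field, which mirrors the statement that annotations are written only when a language is detected.
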

Languages

The Language Detection Library for Java has built-in profiles for many languages. These are the language profiles that can be used as object attributes in the languages array. If there is a set of Wikipedia entries written in a language, it is likely that the Language Detection Library can identify texts written in this language.

Caveats

This library should produce reasonable results on document fields which are comparable in length, vocabulary, and style to the known texts compiled from the Wikipedia. The documentation lists the following challenges:
  • This software does not work as well when the input text is short or unclean, such as tweets.
  • When a text is written in multiple languages, the default algorithm of this software is not appropriate. You can try to split the text (by sentence or paragraph) and detect the language of each part. Running the language guesser on the whole text will, at best, tell you only the most dominant language.
  • This software does not handle input text well when it is in none of the expected (and supported) languages.
  • Unwanted languages may be detected: the stage might report a language that is not used in the input data at all, because of similarities between languages. By default, the stage uses the full array of available languages for detection. To use only selected languages, configure the languages array.
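
The suggestion to split mixed-language text and detect each part separately could look like the following sketch. The detector is passed in as a plain callable because the stage's internal API is not described here; detect_by_paragraph, the min_chars threshold, and the stub detector are all assumptions made for illustration.

```python
import re


def detect_by_paragraph(text, detect, min_chars=20):
    """Split text on blank lines and run detect() on each paragraph.

    Paragraphs shorter than min_chars are reported with language None,
    since detection on very short text is unreliable.
    """
    results = []
    for part in re.split(r"\n\s*\n", text):
        part = part.strip()
        if not part:
            continue
        lang = detect(part) if len(part) >= min_chars else None
        results.append((part, lang))
    return results


def stub_detect(paragraph):
    """Stand-in detector for the example; a real one would be plugged in."""
    return "pl" if "jest" in paragraph else "en"


mixed = "This paragraph is written in English.\n\nTo jest akapit napisany po polsku.\n\nOK"
for paragraph, lang in detect_by_paragraph(mixed, stub_detect):
    print(lang, "|", paragraph)
```

Splitting by paragraph keeps each part reasonably long; splitting by sentence gives finer granularity but runs into the short-text caveat more often.
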

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
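
The difference between the two forms can be seen by serializing a value to JSON. This sketch assumes the API accepts JSON request bodies; the property name delimiter is hypothetical and used only to show the escaping.

```python
import json

# The two characters typed into the UI: a backslash followed by "t".
# (In Python source, "\\t" is how those two characters are written.)
ui_value = "\\t"

# Serializing to JSON escapes the backslash, producing the API form \\t.
payload = json.dumps({"delimiter": ui_value})
print(payload)

# Parsing the payload gives back the same two characters.
assert json.loads(payload)["delimiter"] == ui_value
```
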