Couchbase Connector and Datasource Configuration

The Couchbase connector uses the Cross-Datacenter Replication (XDCR) feature of Couchbase to retrieve data from Couchbase continuously, in real time. For more information about Couchbase, see http://www.couchbase.com/.

This connector has been tested for compatibility with Couchbase Server 2.5.1 Enterprise Edition.

Indexing and Commits

Because this connector retrieves data continuously, two properties control how often documents are committed to Solr. A commit is what makes the documents available for user queries. The properties define the maximum number of documents to queue for a commit (50,000 by default) and the maximum amount of time to wait between commits (120 seconds, or 2 minutes, by default). A commit occurs as soon as either threshold is reached. For example, if 2 minutes have passed and only 20,000 documents are queued, a commit occurs. Similarly, if only 1 minute has passed but 50,000 documents are queued, a commit occurs. Adjust these properties to suit your own requirements.
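The "whichever threshold comes first" rule can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the connector's implementation; the function name and parameter names are hypothetical, and only the default values (50,000 documents, 120 seconds) come from this page.

```python
MAX_DOCS = 50_000        # default document threshold described above
MAX_WAIT_SECONDS = 120   # default time threshold described above

def should_commit(queued_docs, seconds_since_last_commit,
                  max_docs=MAX_DOCS, max_wait=MAX_WAIT_SECONDS):
    """Return True as soon as either threshold has been reached.

    Hypothetical helper for illustration only.
    """
    return queued_docs >= max_docs or seconds_since_last_commit >= max_wait

print(should_commit(20_000, 120))  # time threshold reached
print(should_commit(50_000, 60))   # document threshold reached
print(should_commit(20_000, 60))   # neither threshold reached
```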

Note
This connector retrieves data continuously. You can limit the number of documents it fetches during testing by setting the maximum number of documents retrieved, or you can manually stop the connector with the Fusion UI or Connector Datasources API.

Splitting Couchbase Documents

Because Couchbase has a flexible data model, documents may have a nested JSON structure. It is possible to split nested documents with a splitpath property, which uses an XPath-style path to the element to split on. These paths do not accept wildcards.

For example, if you have a document that looks like this:

{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}

If we want to split this document on the 'exams' element and create two documents each with a different subject, we would define "splitpath":"/exams" in our datasource definition (if using the Fusion UI to configure the datasource, you would enter the path without quotes).

The output from retrieving the document will look like this:

{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    }
  ]
},
{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}
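To make the splitting behavior concrete, here is a minimal Python sketch that reproduces the output above for a splitpath of /exams. It is not the connector's code: the function split_document is hypothetical, and the handling of a missing, empty, or non-array element (return the document unchanged) is an assumption, since this page does not describe those cases.

```python
import copy

def split_document(doc, splitpath):
    """Return one copy of doc per element of the array found at splitpath.

    Hypothetical helper; wildcards are not supported, matching the text.
    """
    keys = [k for k in splitpath.split("/") if k]
    # Walk down to the parent of the element to split on.
    node = doc
    for key in keys[:-1]:
        node = node[key]
    items = node.get(keys[-1])
    # Assumption: anything that is not a non-empty array is left unsplit.
    if not isinstance(items, list) or not items:
        return [copy.deepcopy(doc)]
    results = []
    for item in items:
        clone = copy.deepcopy(doc)
        parent = clone
        for key in keys[:-1]:
            parent = parent[key]
        # Each output document keeps exactly one element of the array.
        parent[keys[-1]] = [copy.deepcopy(item)]
        results.append(clone)
    return results

student = {
    "first": "John", "last": "Doe", "grade": 8,
    "exams": [
        {"subject": "Maths", "test": "term1", "marks": 90},
        {"subject": "Biology", "test": "term1", "marks": 86},
    ],
}
for part in split_document(student, "/exams"):
    print(part["first"], part["exams"][0]["subject"])
```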

Field Mapping with Couchbase

The Couchbase connector has built-in field mapping, which maps Couchbase fields to fields in your schema. Each mapping defines a field from your schema and an XPath-style path to the field in the Couchbase JSON document.

The field mapping accepts wildcards (*) and double wildcards (**) to map fields automatically. A wildcard can appear only at the end of the path definition.

  • field_name="" and field_path=/docs/* - maps all the fields under docs to the same name as given in JSON.

  • field_name="" and field_path=/docs/** - maps all the fields under docs and their children fields to the same name as given in JSON.

  • field_name=searchField and field_path=/docs/* - maps all the fields under /docs to a single field named 'searchField'.

  • field_name=searchField and field_path=/docs/** - maps all the fields under /docs and their children fields to a single field named 'searchField'.

If mapping is not defined, a default mapping will be assigned, in the format of the second example above, i.e., field_name="" and field_path=/docs/**.
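The difference between the four wildcard forms above can be sketched in Python. This is an illustrative reading of the rules, not the connector's implementation: the helper names are hypothetical, and two details are assumptions made for the sketch (array indices are ignored when building paths, and values that land on the same target field are collected into a list).

```python
def leaf_paths(node, prefix=""):
    """Yield (path, value) for every scalar leaf in a nested JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from leaf_paths(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        for item in node:
            yield from leaf_paths(item, prefix)  # assumption: indices ignored
    else:
        yield prefix, node

def path_matches(field_path, path):
    """Match a leaf path against a field_path with an optional trailing wildcard."""
    if field_path.endswith("/**"):
        return path.startswith(field_path[:-2])        # any depth below prefix
    if field_path.endswith("/*"):
        base = field_path[:-1]
        return path.startswith(base) and "/" not in path[len(base):]
    return path == field_path

def apply_mapping(doc, field_name, field_path):
    """Map matching leaves to field_name, or to their own name if it is empty."""
    mapped = {}
    for path, value in leaf_paths(doc):
        if path_matches(field_path, path):
            target = field_name or path.rsplit("/", 1)[-1]
            mapped.setdefault(target, []).append(value)
    return mapped

sample = {"docs": {"title": "Intro", "meta": {"author": "Ann"}}}
print(apply_mapping(sample, "", "/docs/*"))
print(apply_mapping(sample, "", "/docs/**"))
print(apply_mapping(sample, "searchField", "/docs/*"))
print(apply_mapping(sample, "searchField", "/docs/**"))
```

In this sketch, the single wildcard picks up only title, while the double wildcard also reaches author, one level further down.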

Example

This example shows some simple field mapping, using a single document such as this:

{
  "first": "John",
  "last": "Doe",
  "grade": 8,
  "exams": [
    {
      "subject": "Maths",
      "test": "term1",
      "marks": 90
    },
    {
      "subject": "Biology",
      "test": "term1",
      "marks": 86
    }
  ]
}

When we configure the datasource, we can define our field mapping as follows:

"field_mapping": [
  {
    "field_name": "points_i",
    "field_path": "/exams/marks"
  },
  {
    "field_name": "",
    "field_path": "/**"
  }
]

Two mappings are defined. The first will map the '/exams/marks' field from Couchbase to the 'points_i' field in Solr. The second maps all top-level and child fields from Couchbase to either the same field name in Solr or to a dynamic field rule.

After retrieving the document, it will look like this:

{
  "first_s": "John",
  "last_s": "Doe",
  "grade_i": 8,
  "exams": [
    {
      "subject_s": "Maths",
      "test_s": "term1",
      "points_i": 90
    },
    {
      "subject_s": "Biology",
      "test_s": "term1",
      "points_i": 86
    }
  ]
}

The 'marks' field from the original document has been mapped to the 'points_i' field; most of the other fields have been mapped to appropriate dynamic field rules.

Note that the representation above shows the document after it has been retrieved from Couchbase, but before it has been processed by the index pipelines. Because the index pipelines contain several stage types that can further transform the document, such as the Apache Tika Parser stage and the Field Mapping stage, the document that is finally indexed to Solr may differ from this representation. Running a few small test crawls is recommended to confirm that documents are indexed as required.

Configuration

Tip
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
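The extra backslash in API values comes from JSON string escaping, which a JSON serializer applies automatically. The short Python sketch below shows the value you would type in the UI and the form it takes inside a JSON request body. The property name "delimiter" is hypothetical and is used only for illustration.

```python
import json

# What you would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# Serializing to JSON escapes the backslash, giving the form used in the API.
api_body = json.dumps({"delimiter": ui_value})  # "delimiter" is a made-up name
print(api_body)

# Parsing the body gives back the original two-character value.
print(json.loads(api_body)["delimiter"] == ui_value)
```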