Fusion 5.9
    AlfrescoREST V2 connector

    The Alfresco recipe is used to retrieve data from the Alfresco platform for information management.

    The configuration details and JSON recipe are added here for convenience, but you can also view them directly at the public REST configuration repository on GitHub.

    Alfresco REST Configuration

    This documentation describes aspects of the Alfresco REST configuration file alfresco-v1.json, such as the authentication, the data fetched, the requests configured (endpoints, query params, pagination) and the variables needed. Terminology is also provided as a reference.

    • The alfresco-v1.json is configured to retrieve the initial folders from the root folder -root- (Company Home) and then retrieve the folders, nestedFolders and files.

    • The following objects are indexed as separate Solr documents:

      • Folders (this also includes the 'Sites' folder)

      • Files

    Authentication

    The Alfresco REST configuration supports:

    • Basic Authentication using the username and password from an Alfresco account.

    Supported crawl options

    The Alfresco REST configuration supports the following crawl options:

    • Full crawl:

      • All the content from the source is fetched.

    • Re-Crawl:

      • On each re-crawl, all the content from the source is retrieved as if it were a full crawl.

      • Orphan files and folders (items deleted in the Alfresco source and therefore not retrieved by the current crawl) are deleted from the index by the strayContentDeletion feature of connectors-service, which runs when a crawl finishes.
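
    The orphan cleanup described above amounts to a set difference between what is already indexed and what the latest crawl saw. The following is a minimal sketch of that idea only; the function name find_stray_ids and the sample IDs are hypothetical, and the actual strayContentDeletion implementation in connectors-service is not described in this document.

```python
def find_stray_ids(indexed_ids: set, crawled_ids: set) -> set:
    """Return the IDs that are in the index but were not seen by the latest crawl."""
    return indexed_ids - crawled_ids


# Hypothetical document IDs, for illustration only.
previously_indexed = {"folder-1", "file-1", "file-2"}
seen_this_crawl = {"folder-1", "file-2", "file-3"}

# file-1 was deleted in the source, so it is the only stray document.
stray = find_stray_ids(previously_indexed, seen_this_crawl)
print(sorted(stray))
```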

    Parser

    The default parser is set to _system but can be changed to any parser based on index needs.

    Variables used

    The Alfresco REST configuration uses the following variables, which are created with the rest-connector:

    • ${LW_BATCH_SIZE} - Used with the pagination feature. Sets the maxItems query parameter, which controls the number of entries (folders/files) returned in the response.

    • ${LW_INDEX_START} - Used with the pagination feature. Sets the skipCount query parameter, which is used to traverse the pages.

    • ${LW_PARENT_DATA_KEY} - Used with the Child Request Configurations. At crawl time, this variable is replaced with the parent object ID value, which is extracted by setting the property 'Parent Data Key'. Note: The parent object is retrieved with a previous request (parent-request).
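
    To make the substitution concrete, here is a minimal sketch that fills the three variables into an endpoint and a query string. It uses Python's string.Template only because the ${...} placeholder syntax happens to match; the helper name resolve is hypothetical and this is not how the rest-connector is implemented internally.

```python
from string import Template

# Endpoint and query string as they appear in the request configuration.
ENDPOINT = "/alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/children"
QUERY = "include=path,properties&skipCount=${LW_INDEX_START}&maxItems=${LW_BATCH_SIZE}&where=(isFolder=true)"


def resolve(template: str, index_start: int, batch_size: int, parent_id: str = "") -> str:
    """Replace the LW_* placeholders with concrete crawl-time values."""
    return Template(template).safe_substitute(
        LW_INDEX_START=index_start,
        LW_BATCH_SIZE=batch_size,
        LW_PARENT_DATA_KEY=parent_id,
    )


# First page (skipCount=0, maxItems=50) of the children of one parent folder.
parent_id = "c2bf6e7d-db3e-4f21-850a-90389fe1d2e1"
print(resolve(ENDPOINT, 0, 50, parent_id) + "?" + resolve(QUERY, 0, 50))
```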

    Pagination Setup

    Pagination by Batch Size is configured per request. It requires configuring two properties: 'Query Params' and 'Pagination By BatchSize'.

    Configure the 'Pagination By BatchSize' properties:

    • IndexStart: The starting point. It replaces the variable ${LW_INDEX_START}

    • BatchSize: The number of elements to retrieve. Set to 50 by default. It replaces the variable ${LW_BATCH_SIZE}

    • Stop Condition Key: The key in the response whose value is checked to decide when to stop the pagination. For the Alfresco configuration, use “list.entries”.

    • Stop Condition Value: The value that the Stop Condition Key must have for the pagination to stop. For the Alfresco configuration, pagination stops when the list of retrieved objects is empty, so the stop condition value should be [].

    Query Params:

    • maxItems=${LW_BATCH_SIZE}, where ${LW_BATCH_SIZE} is replaced with the value of property BatchSize. For more information about maxItems, see Alfresco documentation for limiting-result-items

    • skipCount=${LW_INDEX_START}, where ${LW_INDEX_START} is replaced with the value of property IndexStart to request the first page, then internally replaced with 'IndexStart + BatchSize' to request next pages. For more information about skipCount, see Alfresco documentation for skipping-result-items
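
    The pagination behavior described above can be sketched as a loop that advances skipCount by BatchSize until the stop condition key holds the stop condition value. This is an illustrative sketch under stated assumptions: fake_fetch and NODES are hypothetical stand-ins for the Alfresco endpoint, and get_by_path handles plain dotted keys only, not full JsonPath.

```python
from typing import Callable


def get_by_path(obj: dict, dotted_key: str):
    """Resolve a plain dotted key such as 'list.entries' (a simplification of JsonPath)."""
    for part in dotted_key.split("."):
        obj = obj[part]
    return obj


def paginate(fetch_page: Callable[[int, int], dict], index_start: int = 0,
             batch_size: int = 50, stop_key: str = "list.entries", stop_value=None) -> list:
    """Request pages until the value found at stop_key equals stop_value."""
    stop_value = [] if stop_value is None else stop_value
    collected = []
    skip_count = index_start
    while True:
        response = fetch_page(skip_count, batch_size)
        entries = get_by_path(response, stop_key)
        if entries == stop_value:
            break
        collected.extend(entries)
        skip_count += batch_size  # next page starts at IndexStart + BatchSize
    return collected


# Hypothetical stand-in for the Alfresco endpoint: 120 fake nodes served in pages.
NODES = [{"entry": {"id": f"node-{i}"}} for i in range(120)]
calls = []


def fake_fetch(skip_count: int, max_items: int) -> dict:
    calls.append(skip_count)
    return {"list": {"entries": NODES[skip_count:skip_count + max_items]}}


all_nodes = paginate(fake_fetch)
print(len(all_nodes), calls)
```

    In this sketch the last request returns an empty list.entries, which is what ends the loop, so one extra request is made after the final non-empty page.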

    Endpoints Configuration with Alfresco REST connector

    • The following table describes the Alfresco REST connector endpoints needed to retrieve the files and folders, and how those are configured with the rest-connector.

    • Each request is configured under the property List of Requests Configuration (requestConfigurations in the alfresco-v1.json file).

    | Request type | ObjectType | Parent ObjectType | Endpoint | HTTP operation | Query parameters | Description |
    | --- | --- | --- | --- | --- | --- | --- |
    | Root Request | INITIAL_FOLDER | | /alfresco/api/-default-/public/alfresco/versions/1/nodes/-root-/children | GET | include=path,properties&skipCount=${LW_INDEX_START}&maxItems=${LW_BATCH_SIZE}&where=(isFolder=true) | Returns the folders from the -root- folder (Company Home). |
    | Child Request | FOLDER | INITIAL_FOLDER | /alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/children | GET | include=path,properties&skipCount=${LW_INDEX_START}&maxItems=${LW_BATCH_SIZE}&where=(isFolder=true) | Returns the child folders of each parent folder retrieved with the previous request 'INITIAL_FOLDER'. Internally, the variable ${LW_PARENT_DATA_KEY} is replaced with the 'id' of the parent folder, which is extracted by setting the property Response Handling → parentDataKey=entry.id. This request enables the property 'Recursive Request'. |
    | Child Request | FILE | FOLDER | /alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/children | GET | include=path,properties&skipCount=${LW_INDEX_START}&maxItems=${LW_BATCH_SIZE}&where=(isFile=true) | Returns the child files of each parent folder retrieved with the previous request 'FOLDER'. Internally, the variable ${LW_PARENT_DATA_KEY} is replaced with the 'id' of the parent folder, which is extracted by setting the property Response Handling → parentDataKey=entry.id. This request enables the property 'Skip Indexation'. |
    | Child Request | FILE_DOWNLOAD | FILE | /alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/content | GET | | Downloads the content of each file retrieved with the previous request 'FILE'. Internally, the variable ${LW_PARENT_DATA_KEY} is replaced with the 'id' of the file, which is extracted by setting the property Response Handling → parentDataKey=entry.id. |
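
    The chain of requests above can be pictured with a small in-memory walk. This is only a sketch that follows the ObjectType / Parent ObjectType links literally: TREE, CONTENT, list_children and crawl are hypothetical names, no HTTP calls are made, and the way the connector actually schedules requests may differ.

```python
from collections import deque

# Hypothetical repository: node id -> list of (child id, kind).
TREE = {
    "-root-": [("sites", "folder"), ("shared", "folder")],
    "sites": [("sample1", "folder")],
    "sample1": [("docs", "folder"), ("plan.ppt", "file")],
    "docs": [("readme.txt", "file")],
    "shared": [],
}
CONTENT = {"plan.ppt": b"slides", "readme.txt": b"hello"}


def list_children(node_id: str, kind: str) -> list:
    """Stand-in for GET .../nodes/{id}/children with where=(isFolder=true) or (isFile=true)."""
    return [child for child, child_kind in TREE.get(node_id, []) if child_kind == kind]


def crawl() -> list:
    indexed = []  # (object type, node id) pairs that would become Solr documents

    # Root request INITIAL_FOLDER: folders directly under -root-.
    initial_folders = list_children("-root-", "folder")
    indexed += [("INITIAL_FOLDER", folder) for folder in initial_folders]

    # Child request FOLDER, with 'Recursive Request' enabled: keep descending.
    folders = []
    pending = deque(initial_folders)
    while pending:
        parent = pending.popleft()
        for folder in list_children(parent, "folder"):
            folders.append(folder)
            indexed.append(("FOLDER", folder))
            pending.append(folder)

    # Child request FILE ('Skip Indexation' enabled), then FILE_DOWNLOAD per file.
    for folder in folders:
        for file_id in list_children(folder, "file"):
            if file_id in CONTENT:
                indexed.append(("FILE_DOWNLOAD", file_id))
    return indexed


for object_type, node_id in crawl():
    print(object_type, node_id)
```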

    Response Parsing Configuration

    Per request, configure the property Response Handling to set up how to parse the response (responseConfiguration in the alfresco-v1.json file).

    Plugin Parsing:

    • This parsing happens by default. The responses are parsed as a JSON Object structure using JsonPath.

    • Plugin Parsing will happen for requests: INITIAL_FOLDER, FOLDER, FILE

    • The properties Response Handling → Data ID, Data Path and Parent Data Key can be configured to extract certain information from the parsed objects (see the Terminology section for more information).

    Binary Parsing:

    • Enable by setting the property Response Handling → Parse Binary Data (binaryResponse in the alfresco-v1.json file). When enabled, the whole response is sent to the Fusion parsers. If disabled (default), the response is parsed as a JSON object.

    • This parsing is configured for request: FILE_DOWNLOAD
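
    The two parsing modes can be contrasted with a short sketch. It is an approximation under stated assumptions: parse_response and get_by_path are hypothetical helpers, dotted keys stand in for JsonPath, and the UUID fallback follows the Data ID description in the Terminology section.

```python
import json
import uuid


def get_by_path(obj, dotted_key: str):
    """Resolve a plain dotted key such as 'list.entries' (a simplification of JsonPath)."""
    for part in dotted_key.split("."):
        obj = obj[part]
    return obj


def parse_response(body: bytes, data_path=None, data_id=None, binary_response=False) -> list:
    if binary_response:
        # Binary parsing: the whole body is handed over; Data Path and Data ID are ignored.
        return [{"raw": body}]
    payload = json.loads(body)
    # Without a Data Path, the entire response body is treated as one object.
    items = get_by_path(payload, data_path) if data_path else [payload]
    documents = []
    for item in items:
        # Without a Data ID, a random UUID is used instead.
        doc_id = get_by_path(item, data_id) if data_id else str(uuid.uuid4())
        documents.append({"id": doc_id, "object": item})
    return documents


sample = json.dumps({"list": {"entries": [{"entry": {"id": "a"}}, {"entry": {"id": "b"}}]}}).encode()
docs = parse_response(sample, data_path="list.entries", data_id="entry.id")
print([doc["id"] for doc in docs])
```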

    Skip Indexation of Objects

    When enabled, the response is not indexed. This is useful when objects are requested solely to discover their child objects, without needing to index the parent object itself.

    • For the Alfresco configuration:

      • The parent request FILE retrieves a list of file metadata. This request is needed to discover the IDs of the files to be downloaded in a following request.

      • The child request FILE_DOWNLOAD downloads the binary content of the files found previously.

      • Without further configuration, the two requests would index two Solr documents that represent the same file: 1) a document with the file metadata only, 2) a document with the file metadata joined with the file content.

    1) Document with the file metadata only (request FILE)

    id: "serverURL_/<parent-request>/fileID"
    size_i: 10
    author_s: "any"
    _lw_rest_object_type_s: "file"

    2) Document with the file metadata joined with the file content (request FILE_DOWNLOAD)

    id: "serverURL_/<child-request>/fileID_binary"
    size_i: 10
    author_s: "any"
    body_s: "body of txt"
    _lw_rest_object_type_s: "file_download"

    • There is no need to index the first Solr document. To avoid indexing it, the property 'Skip Indexation' is enabled for the request FILE in the 'alfresco-v1.json' file.

    • To avoid indexing other objects, enable the property 'Skip Indexation' in the corresponding request configuration.

    Limit Documents

    Exclude by RegEx:

    Allows specifying a list of key-value pairs to exclude objects from indexing:

    • Key: The field name of the object to check. It also accepts JsonPath expressions for navigating through nested objects, e.g. objects.nested.path

    • Value: A regular expression that is matched against the field value in the object. If the match succeeds, the entire object is excluded.

    For the Alfresco REST configuration, this property can be used to exclude objects whose field values match a given pattern.

    • For example, exclude objects based on the field 'entry.path.name'. Consider the Alfresco object named "testFolder", which belongs to the path /Company Home/Sites/sample1.

    {
        "entry": {
            "path": {
                "name": "/Company Home/Sites/sample1",
                "isComplete": true,
                "elements": [...]
            },
            "isFolder": true,
            "isFile": false,
            "name": "testFolder",
            "id": "c2bf6e7d-db3e-4f21-850a-90389fe1d2e1",
            ...
        }
    }

    In order to exclude all objects under the path '/Company Home/Sites/sample1', add a key-value pair to the exclusion list:

    • Key: entry.path.name (points to the field containing the path)

    • Value: /Company Home/Sites/sample1 (regular expression to match the path)

    More key-value pairs can be added using different keys, or the same key.

    [Image: ExcludeByRegexSample]
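
    As a rough illustration of the exclusion rule above, the following sketch checks an object against a list of key-value pairs. The helper names are hypothetical, dotted keys stand in for JsonPath, and the use of re.search (a match anywhere in the field value) is an assumption; the connector's exact matching semantics are not stated in this document.

```python
import re


def get_by_path(obj, dotted_key: str):
    """Resolve a plain dotted key; return None when any part is missing."""
    for part in dotted_key.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj


def is_excluded(obj: dict, rules: list) -> bool:
    """Return True if any (key, regex) rule matches the object's field value."""
    for key, pattern in rules:
        value = get_by_path(obj, key)
        # Assumption: a partial match anywhere in the value is enough to exclude.
        if value is not None and re.search(pattern, str(value)):
            return True
    return False


RULES = [("entry.path.name", r"/Company Home/Sites/sample1")]

test_folder = {"entry": {"path": {"name": "/Company Home/Sites/sample1"}, "name": "testFolder"}}
shared_note = {"entry": {"path": {"name": "/Company Home/Shared"}, "name": "notes"}}

print(is_excluded(test_folder, RULES))  # True: the path matches the rule
print(is_excluded(shared_note, RULES))  # False: the path does not match
```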

    Exclude by File Size:

    Allows specifying minimum and maximum file sizes. Files whose size falls outside that range are excluded.

    • Key: Reference the field name of the object with the file size. This property also accepts JsonPath expressions e.g. objects.nested.path

    • Minimum File Size: Used for excluding files with sizes smaller than the configured value.

    • Maximum File Size: Used for excluding files with sizes larger than the configured value. Set to -1 when there is no limit.

    For the Alfresco REST configuration, this property can be used to exclude files that are above or below a certain size.

    • For example, consider this sample snippet from an Alfresco response:

                    "entry": {
                        "createdAt": "2011-03-03T10:31:30.596+0000",
                        "isFolder": false,
                        "isFile": true,
                        "createdByUser": {
                            "id": "mjackson",
                            "displayName": "Mike Jackson"
                        },
                        "modifiedAt": "2011-03-03T10:31:31.651+0000",
                        "modifiedByUser": {
                            "id": "mjackson",
                            "displayName": "Mike Jackson"
                        },
                        "name": "Project Objectives.ppt",
                        "id": "5515d3e1-bb2a-42ed-833c-52802a367033",
                        "nodeType": "cm:content",
                        "content": {
                            "mimeType": "application/vnd.ms-powerpoint",
                            "mimeTypeName": "Microsoft PowerPoint",
                            "sizeInBytes": 2117632,
                            "encoding": "UTF-8"
                        },
                        "parentId": "38745585-816a-403f-8005-0a55c0aec813"
                    }

    You can specify the path to the property that holds the size, as well as the minimum and maximum sizes. Therefore:

    • Set key = entry.content.sizeInBytes

    • Then set the minimum. Files smaller than the minimum are excluded.

    • Then set the maximum. Files larger than the maximum are excluded (set to -1 when there should be no limit).

    Example of configuration based on the above snippet to only index files of 2117632 bytes and below.

    [Image: ExcludeByFileSizeSample]
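
    The size rule can be sketched in a few lines. This is an illustration, not the connector's code: excluded_by_size and get_by_path are hypothetical helpers, the minimum of 0 is an assumed value, and keeping objects that lack the size field is also an assumption.

```python
def get_by_path(obj, dotted_key: str):
    """Resolve a plain dotted key; return None when any part is missing."""
    for part in dotted_key.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return None
        obj = obj[part]
    return obj


def excluded_by_size(obj: dict, key: str, minimum: int, maximum: int) -> bool:
    """Exclude files smaller than minimum or larger than maximum (-1 means no upper limit)."""
    size = get_by_path(obj, key)
    if size is None:
        return False  # assumption: objects without a size field are kept
    if size < minimum:
        return True
    return maximum != -1 and size > maximum


KEY = "entry.content.sizeInBytes"
ppt = {"entry": {"name": "Project Objectives.ppt", "content": {"sizeInBytes": 2117632}}}

print(excluded_by_size(ppt, KEY, 0, 2117632))  # False: exactly at the maximum, so it is kept
print(excluded_by_size(ppt, KEY, 0, 2117631))  # True: one byte over the maximum
print(excluded_by_size(ppt, KEY, 0, -1))       # False: no upper limit
```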

    Terminology

    The following terms are provided as a reference.

    | Term | Description |
    | --- | --- |
    | List of Requests Configuration | Configures the list of requests used to extract data from the REST source. Requests are linked hierarchically by using the properties ObjectType and ParentObjectType. |
    | Object Type | The unique name that identifies the request. |
    | Parent Object Type | References an existing Object Type. Creates a parent-child hierarchy, where the current request becomes the child of the specified Parent Object Type. If blank, the current request is considered a Root Request. |
    | Root Request | The request that retrieves the initial objects. |
    | Child Request | A request that retrieves additional information for the root data objects. The child requests are performed for each root data object. |
    | Recursive Request | When enabled, extra requests are performed to retrieve nested objects within the objects found with the current request. For example, the request ObjectType=FOLDER enables this property, so an extra request is made for each folder found in order to retrieve its nested folders. This process continues until no more nested folders are found. |
    | Skip Indexation | When enabled, the response is not indexed. Useful when objects are requested only to discover their child objects, without needing to index the object itself. |
    | Response Handling | Defines the mapping between the response and the data objects to be indexed (responseConfiguration in the alfresco-v1.json file). |
    | Data Path | The path to access a specific data object within a response. For example, to access a list of elements stored under the key objects, the Data Path would be objects. If not provided, the entire response body is indexed. This property accepts JsonPath expressions, e.g. objects, objects[*], or list.entries to extract the list of Alfresco objects. |
    | Data ID | The identifier key for the data objects extracted with 'Data Path'. This value is used to build the Solr document's ID. If not provided, a random UUID is used. This property accepts JsonPath expressions, e.g. entry.id to extract the ID of the Alfresco file/folder. |
    | Parent Data Key | Only configure with Child Requests. Sets the 'key' used to extract the ID from the root/parent response, whose value replaces the ${LW_PARENT_DATA_KEY} variable in the child request configuration (endpoint, query params or body). For example, /alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/content |
    | Parse Binary Data | Enable to send the whole response to the Fusion parsers. If enabled, the properties Data Path and Data ID are ignored and pagination does not happen. |

    Recipe

    {
      "parserId": "_system",
      "coreProperties": {},
      "id": "rest-alfresco",
      "type": "lucidworks.rest",
      "properties": {
        "serviceURL": "https://{add alfresco url}",
        "authenticationMode": {
          "basicAuth": {
            "password": "xXx-Redacted-xXx",
            "user": "{add username here!!!}"
          }
        },
        "requestConfigurations": [
          {
            "request": {
              "recursiveRequest": false,
              "linkRequest": {
                "objectType": "INITIAL_FOLDER"
              },
              "requestConfiguration": {
                "endpoint": "/alfresco/api/-default-/public/alfresco/versions/1/nodes/-root-/children",
                "pagination": {
                  "paginationByBatchSize": {
                    "paginationStopConditionValue": "[]",
                    "paginationStopConditionKey": "list.entries",
                    "batchSize": 50,
                    "indexStart": 0
                  }
                },
                "httpMethod": "GET",
                "queries": [
                  {
                    "queryKey": "include",
                    "queryValue": "path,properties"
                  },
                  {
                    "queryKey": "where",
                    "queryValue": "(isFolder=true)"
                  },
                  {
                    "queryKey": "skipCount",
                    "queryValue": "${LW_INDEX_START}"
                  },
                  {
                    "queryKey": "maxItems",
                    "queryValue": "${LW_BATCH_SIZE}"
                  }
                ]
              },
              "responseConfiguration": {
                "dataId": "entry.id",
                "binaryResponse": false,
                "dataPath": "list.entries",
                "parentIdKey": ""
              }
            }
          },
          {
            "request": {
              "recursiveRequest": true,
              "linkRequest": {
                "parentObjectType": "INITIAL_FOLDER",
                "objectType": "FOLDER"
              },
              "requestConfiguration": {
                "endpoint": "/alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/children",
                "pagination": {
                  "paginationByBatchSize": {
                    "paginationStopConditionValue": "[]",
                    "paginationStopConditionKey": "list.entries",
                    "batchSize": 50,
                    "indexStart": 0
                  }
                },
                "httpMethod": "GET",
                "queries": [
                  {
                    "queryKey": "include",
                    "queryValue": "path,properties"
                  },
                  {
                    "queryKey": "where",
                    "queryValue": "(isFolder=true)"
                  },
                  {
                    "queryKey": "skipCount",
                    "queryValue": "${LW_INDEX_START}"
                  },
                  {
                    "queryKey": "maxItems",
                    "queryValue": "${LW_BATCH_SIZE}"
                  }
                ]
              },
              "responseConfiguration": {
                "dataId": "entry.id",
                "binaryResponse": false,
                "dataPath": "list.entries",
                "parentIdKey": "entry.id"
              }
            }
          },
          {
            "request": {
              "recursiveRequest": false,
              "linkRequest": {
                "parentObjectType": "FOLDER",
                "objectType": "FILE"
              },
              "requestConfiguration": {
                "endpoint": "/alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/children",
                "pagination": {
                  "paginationByBatchSize": {
                    "paginationStopConditionValue": "[]",
                    "paginationStopConditionKey": "list.entries",
                    "batchSize": 50,
                    "indexStart": 0
                  }
                },
                "httpMethod": "GET",
                "queries": [
                  {
                    "queryKey": "where",
                    "queryValue": "(isFile=true)"
                  },
                  {
                    "queryKey": "include",
                    "queryValue": "path,properties"
                  },
                  {
                    "queryKey": "skipCount",
                    "queryValue": "${LW_INDEX_START}"
                  },
                  {
                    "queryKey": "maxItems",
                    "queryValue": "${LW_BATCH_SIZE}"
                  }
                ]
              },
              "skipIndexation": true,
              "responseConfiguration": {
                "dataId": "entry.id",
                "binaryResponse": false,
                "dataPath": "list.entries",
                "parentIdKey": "entry.id"
              }
            }
          },
          {
            "request": {
              "recursiveRequest": false,
              "linkRequest": {
                "parentObjectType": "FILE",
                "objectType": "FILE_DOWNLOAD"
              },
              "requestConfiguration": {
                "endpoint": "/alfresco/api/-default-/public/alfresco/versions/1/nodes/${LW_PARENT_DATA_KEY}/content",
                "httpMethod": "GET"
              },
              "responseConfiguration": {
                "binaryResponse": true,
                "parentIdKey": "entry.id"
              }
            }
          }
        ],
        "serviceEndpoints": [],
        "collection": "{add collection name here}"
      },
      "pipeline": "{add pipeline name here}",
      "connector": "lucidworks.rest"
    }
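
    Before importing an edited recipe, it can help to check that the ObjectType / Parent ObjectType links still form a valid hierarchy. The following sketch is a hypothetical pre-flight check, not part of Fusion: validate_links is an invented helper, and LINKS is a trimmed copy of the recipe that keeps only the linkRequest entries.

```python
def validate_links(request_configurations: list) -> list:
    """Return a list of problems found in the objectType / parentObjectType links."""
    links = [rc["request"]["linkRequest"] for rc in request_configurations]
    object_types = [link["objectType"] for link in links]
    problems = []

    # Object Type is described as the unique name of the request.
    for name in sorted(set(object_types)):
        if object_types.count(name) > 1:
            problems.append(f"duplicate objectType: {name}")

    # Parent Object Type must reference an existing Object Type.
    for link in links:
        parent = link.get("parentObjectType")
        if parent and parent not in object_types:
            problems.append(f"{link['objectType']}: unknown parentObjectType {parent}")

    # A request without a parent is a root request; expect at least one.
    if not any(not link.get("parentObjectType") for link in links):
        problems.append("no root request found")
    return problems


# Trimmed copy of the recipe above: only the linkRequest parts are kept.
LINKS = [
    {"request": {"linkRequest": {"objectType": "INITIAL_FOLDER"}}},
    {"request": {"linkRequest": {"parentObjectType": "INITIAL_FOLDER", "objectType": "FOLDER"}}},
    {"request": {"linkRequest": {"parentObjectType": "FOLDER", "objectType": "FILE"}}},
    {"request": {"linkRequest": {"parentObjectType": "FILE", "objectType": "FILE_DOWNLOAD"}}},
]

print(validate_links(LINKS))
```

    With the full recipe loaded through json.load, the same function could be called on recipe["properties"]["requestConfigurations"].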