Connectors Crawl Database API

Some connectors use a crawl database to track documents seen by prior crawls. With this information they can tell which documents are new, updated, or removed, and take the appropriate action in the index. The connectors that currently support this are lucid.fs and lucid.solrxml.

This API allows looking into the crawl database and dropping tables or clearing the database.

Note:

This API can be used only with connectors which support it, which at the current time are the 'lucid.fs' and 'lucid.solrxml' connectors.

The 'lucid.anda' connector also uses a crawl database, but it is not the same database, and does not have a REST API or other interface to access it.

Get Statistics for a Datasource Database or Drop Database

The path for this request is:

/api/apollo/connectors/datasources/<id>/db

where <id> is the name of the datasource.

A GET request returns statistics from the crawl database associated with a specific datasource. A DELETE request drops the tables, which removes the history of all crawls; every document found on the next crawl is then treated as a brand new document and submitted to the indexing pipeline.

Input

None.

Output

The output from a GET request will include several sections detailing the database structure:

  • counters: the counters section reports the document counts of database activities, such as table inserts.

  • ops: the ops section reports on database operations that have occurred, such as initializing tables, retrieving items, processing items and dropping tables.

  • tables: the tables section lists the tables in the database with a count of the number of items in each table. Inspecting the items is described in the next section.

The output from a DELETE request will be empty. When dropping the database, note that no documents will be removed from the index. However, the crawl database will be empty, so on the next datasource run, all documents will be treated as though they were never seen by the connectors.

Examples

Note
Use port 8765 in local development environments only. In production, use port 8764.

Get the crawl database statistics for the datasource named "SolrXML":

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/connectors/datasources/SolrXML/db

RESPONSE

{
  "counters" : {
    "new" : 14,
    "processed.insert" : 14
  },
  "ops" : {
    "initTable" : 4,
    "dropTable" : 7,
    "flush" : 1,
    "getItem" : 28,
    "renameTable" : 2,
    "commitUpdates" : 1,
    "listTables" : 2,
    "finishProcessing" : 14,
    "beginUpdates" : 1,
    "insertItem" : 14
  },
  "tables" : {
    "deleted" : {
      "count" : 0
    },
    "discarded" : {
      "count" : 0
    },
    "errors" : {
      "count" : 0
    },
    "items" : {
      "count" : 14
    }
  }
}
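The statistics response can also be inspected programmatically. The following is a minimal Python sketch, not part of the API itself: it parses an abridged copy of the response shown above and reads the per-table counts. The helper names table_counts and has_crawl_history are hypothetical and exist only for illustration.

```python
import json

# Abridged copy of the example statistics response shown above.
SAMPLE = """
{
  "counters": {"new": 14, "processed.insert": 14},
  "ops": {"initTable": 4, "dropTable": 7, "insertItem": 14},
  "tables": {
    "deleted": {"count": 0},
    "discarded": {"count": 0},
    "errors": {"count": 0},
    "items": {"count": 14}
  }
}
"""


def table_counts(stats):
    """Map each table name to its item count (hypothetical helper)."""
    tables = stats.get("tables", {})
    return {name: info.get("count", 0) for name, info in tables.items()}


def has_crawl_history(stats):
    """Return True if the 'items' table tracks at least one document."""
    return table_counts(stats).get("items", 0) > 0


stats = json.loads(SAMPLE)
print(table_counts(stats))
print(has_crawl_history(stats))
```

An empty 'items' table, for example after a DELETE request, would make has_crawl_history return False, which matches the state in which the next crawl treats every document as new.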

Get Table Statistics or Drop the Table

The path for this request is:

/api/apollo/connectors/datasources/<id>/db/<table>

where <id> is the name of the datasource and <table> is the name of a database table.

A GET request will return the table statistics. A DELETE request will drop the table and clear its data.

Input

None.

Output

The output from a GET request will be the statistics for the named table. This is usually the item count.

The output from a DELETE request will be empty.

When dropping tables, be aware that dropping the 'items' table does not delete documents from the index; it only changes the database so that the database considers them new documents. Dropping other tables, such as the 'errors' table, merely clears out old error messages.

Examples

Get the statistics for the 'items' table in the 'SolrXML' datasource’s connector database:

REQUEST

curl -u user:pass http://localhost:8764/api/apollo/connectors/datasources/SolrXML/db/items

RESPONSE

{
  "count" : 14
}
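The GET and DELETE requests for the database and table paths can be sketched in Python as follows. This is an illustration under assumptions, not a client shipped with the product: the helper name crawl_db_request is hypothetical, the host, port, and user:pass credentials are copied from the curl examples, and the Basic authentication header mirrors what curl -u sends. The request is built but not sent.

```python
import base64
import urllib.request
from urllib.parse import quote

# Base path taken from the curl examples in this document.
BASE = "http://localhost:8764/api/apollo/connectors/datasources"


def crawl_db_request(datasource, table=None, method="GET",
                     user="user", password="pass"):
    """Build (but do not send) a request for a crawl database or one table.

    Hypothetical helper: with table=None it targets <id>/db,
    otherwise <id>/db/<table>.
    """
    url = f"{BASE}/{quote(datasource, safe='')}/db"
    if table is not None:
        url += "/" + quote(table, safe="")
    req = urllib.request.Request(url, method=method)
    # Same credentials scheme as `curl -u user:pass` (HTTP Basic).
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


# Example: a request that would clear old error messages.
drop_errors = crawl_db_request("SolrXML", table="errors", method="DELETE")
print(drop_errors.get_method(), drop_errors.full_url)
# To actually send it: urllib.request.urlopen(drop_errors)
```

Keeping request construction separate from sending makes it easy to review a DELETE before it runs, which is useful because dropping the 'items' table causes every document to be treated as new on the next crawl.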

Get or Delete Table Items

The path for this request is:

/api/apollo/connectors/datasources/<id>/db/items/<item>

where <id> is the name of the datasource and <item> is the name of a specific item in the table. If no item name is specified, the request will get all items.

A GET request retrieves information about an item or items.

A DELETE request removes the information from the crawl database only. It does not affect the Solr index.

A request takes two optional parameters:

Parameter   Description

start       The starting key, which is the document ID. If empty, the response will start at the first row of the table. Used with a GET request only.

rows        The number of rows to return. The default is to return all records. Used with a GET request only.
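The two parameters can be appended to an items URL as an ordinary query string. The sketch below is illustrative only: the helper name with_paging is hypothetical, and the URL is the one used in the example request of this section. Because the starting key is a document ID that may contain slashes, the sketch URL-encodes it; whether the server requires this encoding is an assumption.

```python
from urllib.parse import urlencode


def with_paging(items_url, start=None, rows=None):
    """Append the optional 'start' and 'rows' parameters (hypothetical helper)."""
    params = {}
    if start:
        params["start"] = start  # document ID to start from
    if rows is not None:
        params["rows"] = rows    # maximum number of rows to return
    return f"{items_url}?{urlencode(params)}" if params else items_url


# URL taken from the example request in this section.
ITEMS_URL = ("http://localhost:8764/api/apollo/connectors/"
             "datasources/SolrXML/db/items/items")

print(with_paging(ITEMS_URL, rows=5))
print(with_paging(
    ITEMS_URL,
    start="/Applications/solr-4.8.0/example/exampledocs/hd.xml",
    rows=5,
))
```

Omitting both parameters leaves the URL unchanged, which corresponds to the default of returning all records from the first row.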

Input

None.

Output

The output of a GET request will include information on when the document was fetched, if it contained any links to other documents, and the size of the document.

The output of a DELETE request will be empty. Note that this does not delete a document from the index; it only changes the database so that, if the document is crawled again, the database considers it a new document.

Examples

Get up to five items from the crawl database of the 'SolrXML' datasource:

REQUEST

curl -u user:pass "http://localhost:8764/api/apollo/connectors/datasources/SolrXML/db/items/items?rows=5"

RESPONSE

{
  "/Applications/solr-4.8.0/example/exampledocs/gb18030-example.xml" : {
    "timestamp" : "1398117855000",
    "fetchedUri" : null,
    "fetchTime" : 1402503143632,
    "docsCount" : 1,
    "outlinks" : null,
    "discarded" : false,
    "discardMessage" : null,
    "byteSize" : 1331,
    "exception" : null,
    "fetched" : true
  },
  "/Applications/solr-4.8.0/example/exampledocs/hd.xml" : {
    "timestamp" : "1398117855000",
    "fetchedUri" : null,
    "fetchTime" : 1402503143691,
    "docsCount" : 1,
    "outlinks" : null,
    "discarded" : false,
    "discardMessage" : null,
    "byteSize" : 2241,
    "exception" : null,
    "fetched" : true
  }
}
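A response like the one above can be summarized with a few lines of Python. This sketch is illustrative only: the helper name summarize is hypothetical, the sample is an abridged copy of the response above, and interpreting fetchTime as milliseconds since the Unix epoch is an assumption based on the magnitude of the values.

```python
import json
from datetime import datetime, timezone

# Abridged copy of the example items response shown above.
SAMPLE = """
{
  "/Applications/solr-4.8.0/example/exampledocs/gb18030-example.xml": {
    "fetchTime": 1402503143632, "byteSize": 1331,
    "discarded": false, "fetched": true
  },
  "/Applications/solr-4.8.0/example/exampledocs/hd.xml": {
    "fetchTime": 1402503143691, "byteSize": 2241,
    "discarded": false, "fetched": true
  }
}
"""


def summarize(items):
    """Return (document ID, fetch time, size in bytes) per item.

    Assumes fetchTime is epoch milliseconds (not confirmed by the API docs).
    """
    rows = []
    for doc_id, info in items.items():
        fetched_at = datetime.fromtimestamp(info["fetchTime"] / 1000,
                                            tz=timezone.utc)
        rows.append((doc_id, fetched_at, info["byteSize"]))
    return rows


items = json.loads(SAMPLE)
for doc_id, fetched_at, size in summarize(items):
    print(f"{fetched_at.isoformat()}  {size:>6}  {doc_id}")
print("total bytes:", sum(size for _, _, size in summarize(items)))
```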