The Javascript connector executes a JavaScript program that is compiled by the JDK. This program returns a content item which is handed off to the fetcher. The JavaScript program must be standard ECMAScript.
You can use any Java class available to the connector's JDK ClassLoader to manipulate that object within a function.
As in Java, to access Java classes by their simple names instead of their fully specified class names (for example, to write String instead of java.lang.String), these classes must be imported.
The java.lang package is not imported by default, because its classes would conflict with Object, Boolean, Math, and other built-in JavaScript objects.
To import a Java class, use the JavaImporter object and the with statement, which limits the scope of the imported Java packages and classes.
var imports = new JavaImporter(java.lang.String);
...
with (imports) {
var name = new String("foo"); ...
}
For global variables, you can reference these objects using the Java.type API extension.
See this tutorial for details: http://winterbe.com/posts/2014/04/05/java8-nashorn-tutorial/
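As a minimal sketch of the Java.type approach, the snippet below resolves a Java class once at the top of the script and uses it inside a function. The choice of java.util.ArrayList and the helper name makeList are illustrative assumptions, and the plain-JavaScript fallback exists only so the sketch also runs outside the JDK script engine.

```javascript
// Resolve the Java class once, as a global, when Java.type is available.
var hasJavaType = typeof Java !== "undefined" && typeof Java.type === "function";
var ArrayList = hasJavaType ? Java.type("java.util.ArrayList") : null;

// Hypothetical helper: copies a JavaScript array into a list.
function makeList(items) {
  if (ArrayList === null) {
    // Fallback for engines without Java interop: return a plain copy.
    return items.slice();
  }
  var list = new ArrayList();
  for (var i = 0; i < items.length; i++) {
    list.add(items[i]);
  }
  return list;
}

var sampleList = makeList(["hi", "bye"]);
```

Unlike the JavaImporter form, the class reference here is an ordinary variable, so it stays visible to every function in the script.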
The JavaScript Program
The JavaScript context provides the following variables:
| Variable | Type | Description |
|---|---|---|
| id | java.lang.String | The ID of the object to fetch. This is almost always the URI of the datasource to connect to and fetch content. |
| lastModified | long | The time since the epoch from which the item was last touched. |
| signature | java.lang.String | An optional string meant to be used to compare versions of the ID being fetched, e.g. an ETag in a web-crawl. |
|  | crawler.common.MutableObject | A Content object that can be modified and returned, for fine grained control over the return. |
| _fetcher | Fetcher | The current Fetcher instance (usually type |
| _context | java.util.Map | A map used to store data to persist across calls to |
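The persistent map listed above (referenced as _context in the examples on this page) can be used as a simple cache. The sketch below is one way to do that, not the connector's own API: the helper name getOrCreate and the stand-in object fakeContext are hypothetical, and fakeContext only imitates the get and put methods of java.util.Map so the snippet can run outside the connector.

```javascript
// Hypothetical helper: returns the value stored under key, creating it on first use.
function getOrCreate(context, key, factory) {
  var value = context.get(key);
  if (value == null) {            // matches both null and undefined
    value = factory();
    context.put(key, value);
  }
  return value;
}

// Stand-in for the connector-provided map; it mimics java.util.Map's get/put.
var fakeContext = {
  data: {},
  get: function (k) {
    return Object.prototype.hasOwnProperty.call(this.data, k) ? this.data[k] : null;
  },
  put: function (k, v) {
    this.data[k] = v;
  }
};

var calls = 0;
function expensiveSetup() {
  calls++;
  return { createdOnCall: calls };
}

var first = getOrCreate(fakeContext, "helper", expensiveSetup);
var second = getOrCreate(fakeContext, "helper", expensiveSetup);
```

Inside a real script the first argument would be the map supplied by the connector rather than the stand-in, so the expensive setup runs once and later calls reuse the stored object.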
The program must return one of the following kinds of objects:
| Object | Description |
|---|---|
| String | A string object. This is converted to UTF-8 bytes and added as the raw content on a |
| byte[] | A byte array. This array is set on a |
|  | If you want to have complete control over the return from |
| An array of Objects | The array is converted to Embedded Content. The Fetcher returns a parent Content object that has a "Container" discardMessage. The Embedded Content on that container is generated by calling |
| A JavaScript Map | The map is converted to fields on the Content item returned. |
If the script is implemented as a function, its return statement must return one of the types above. If the script is not function-based, the last line of the script must evaluate to one of these types.
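The two forms can be sketched together as follows. This is an illustration under assumptions: the function name buildFields, its parameter list, the field names, and the sample values are all hypothetical, since the exact function signature the connector expects is not stated here.

```javascript
// Function-based form: the return statement yields a JavaScript map.
function buildFields(id, lastModified, signature) {
  var fields = { "source_id": String(id) };
  if (lastModified > 0) {
    fields["last_modified_ms"] = Number(lastModified);
  }
  if (signature) {
    fields["signature"] = String(signature);
  }
  return fields;
}

// Sample values standing in for the variables the connector provides.
var result = buildFields("http://example.com/a", 1500000000000, null);

// Non-function form: the last line is an expression that evaluates to the map.
result;
```

Keeping the logic in a function with a single return makes it easier to confirm that every code path produces one of the supported types.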
Examples
Return content as a java.lang.String
var str = new java.lang.String("Java");
str;
Return content as a byte array
var str = new java.lang.String("Java");
str.getBytes("UTF-8");
Return content as a JavaScript array
var strings = ["hi", "bye"];
strings;
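A slightly larger sketch of the same idea builds the array from a block of text, so that each non-empty line becomes one element of the returned array. The helper name toRecords and the sample text are hypothetical and only illustrate the shape of the return value.

```javascript
// Hypothetical helper: splits text into trimmed, non-empty lines.
function toRecords(text) {
  var lines = String(text).split(/\r?\n/);
  var records = [];
  for (var i = 0; i < lines.length; i++) {
    var line = lines[i].replace(/^\s+|\s+$/g, "");  // trim both ends
    if (line.length > 0) {
      records.push(line);
    }
  }
  return records;
}

var records = toRecords("alpha\n\n  beta  \r\ngamma\n");

// Last expression of the script: a JavaScript array of strings.
records;
```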
Return content as a JavaScript map
var map = {"hi": "bye", "bye": "hi", "number":1};
map;
Leverage the Fetcher
// Reuse the web fetcher stored in _context, creating it on the first call.
var webFetcher = _context.get("webFetcher");
if (null == webFetcher) {
  webFetcher = _fetcher.getWebFetcher();
  // it's possible to pass config options to getWebFetcher() as a map as well, e.g.:
  // _fetcher.getWebFetcher({"f.discardLinkURLQueries" : false });
  _context.put("webFetcher", webFetcher);
}
// Fetch the item, then optionally rewrite its parsed document.
var webContent = webFetcher.fetch(id, lastModified, signature);
var jsoupDoc = webContent.getDocument();
if (null !== jsoupDoc) {
  // modify the Jsoup document or web-content as-needed here, adding new links, removing sections etc.
  // ...
  // ...
  webContent.setRawContent(jsoupDoc.toString().getBytes("UTF-8"));
}
webContent;