Collections

Fusion collections are Solr collections which are managed by Fusion. A Solr collection is a distributed index defined by a named configuration stored in ZooKeeper, with these properties:

  • number of shards

    Documents are distributed across this number of partitions.

  • document routing strategy

    How documents are assigned to shards.

  • replication factor

    How many copies of each document are kept in the collection.

  • replica placement strategy

    Where to place replicas in the cluster.

When you first install Fusion, a collection called "default" is created automatically. If you haven’t modified the default collection yet, you can view the simplest collection configuration by sending a request to the Collections API endpoint at http://localhost:8765/api/v1/collections/default/.
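As a minimal sketch of such a request, the Python below builds the endpoint URL quoted above and, when called, fetches the configuration as JSON. The helper names collection_url and fetch_collection are illustrative only and are not part of Fusion. The sketch assumes the endpoint returns JSON and that no authentication is needed; adjust it for your installation.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the endpoint quoted in the text above.
FUSION_API = "http://localhost:8765/api/v1"


def collection_url(name, base=FUSION_API):
    """Build the Collections API URL for a single collection."""
    return f"{base}/collections/{urllib.parse.quote(name, safe='')}/"


def fetch_collection(name, base=FUSION_API, timeout=10):
    """GET the collection configuration and decode the response as JSON.

    Assumption: the endpoint is reachable without authentication and
    returns a JSON body.
    """
    with urllib.request.urlopen(collection_url(name, base), timeout=timeout) as resp:
        return json.load(resp)
```

Calling fetch_collection("default") against a running Fusion instance should return the configuration of the default collection as a Python dictionary.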

Solr is the underlying engine which indexes, stores, and searches your data. Fusion manages Solr collections, manipulates data and queries before passing them to Solr, and provides analytics and monitoring features.

If your data is already stored in a Solr instance or cluster, you can manage this collection in Fusion by creating a Fusion collection which is configured to import the existing Solr collection. See Installation with an existing Solr instance or cluster.

Primary and Auxiliary Collections

In Fusion, the "Primary" collection is the collection which contains your application data, that is, the set of documents over which search and indexing happens. Fusion registers the collection name and information about the Solr cluster that manages this collection.

Note
All collection names should be considered to be case-insensitive, even though Fusion preserves case in referring to these collections.

If your application uses Fusion’s signals, analytics, or monitoring services, then Fusion will create a set of auxiliary collections in which to store signals, query logs, and other log data. Naming conventions relate auxiliary collections to the primary collection: the name of an auxiliary collection is the name of the primary collection plus a suffix which indicates the kind of auxiliary collection. For example, the suffix for a query logs auxiliary collection is "_logs", so for a primary collection named "COLL", Fusion creates an auxiliary collection named "COLL_logs". These auxiliary collections include:

  • A search query logs collection, suffix "_logs".

  • A pair of associated collections for signals and aggregated signals, suffixes "_signals" and "_signals_aggr", respectively.

Note
Do not create primary collections with names that end in the suffix "_logs", "_signals", or "_signals_aggr". Such names can only be used for Fusion auxiliary collections, which are created and managed by Fusion directly.

Fusion maintains a set of Solr collections which store Fusion’s own logfiles and other internal information. These are called System Collections, described below.

Note
Do not create primary collections named "logs", or with names which begin with "system_". These names are reserved for Fusion system collections.
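The naming rules in the notes above can be sketched as a small check to run before creating a collection. This is an illustration, not a Fusion API: the function and constant names are hypothetical, and the comparison is done in lowercase on the assumption that case-insensitive names should be checked that way.

```python
# Suffixes and reserved names are taken from the notes above.
AUX_SUFFIXES = ("_logs", "_signals", "_signals_aggr")
RESERVED_NAMES = ("logs",)
RESERVED_PREFIXES = ("system_",)


def auxiliary_names(primary):
    """Return the auxiliary collection names Fusion derives from a primary name."""
    return [primary + suffix for suffix in AUX_SUFFIXES]


def is_allowed_primary_name(name):
    """Return False if the name collides with an auxiliary or system collection name."""
    lowered = name.lower()  # assumption: compare case-insensitively
    if lowered in RESERVED_NAMES:
        return False
    if lowered.startswith(RESERVED_PREFIXES):
        return False
    return not lowered.endswith(AUX_SUFFIXES)
```

For example, auxiliary_names("COLL") yields "COLL_logs", "COLL_signals", and "COLL_signals_aggr", while is_allowed_primary_name rejects "logs", "system_blobs", and "COLL_logs".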

Fusion uses ZooKeeper to register information about all collections, and the Fusion components and services related to a collection. The Fusion components associated with a collection include:

  • Datasources

  • Pipelines

  • Profiles

  • Signals and aggregations

  • Analytics Dashboards

System Collections

Fusion includes several collections that are used for internal purposes:

  • logs indexes the log messages from the Fusion API services.

  • audit_logs indexes all HTTP requests sent to the Fusion API services.

  • system_banana stores configurations used by Fusion Dashboards.

  • system_blobs stores blobs in Solr. This is used to store model files for the NLP components and other binary files used by Fusion components.

  • system_messages is used by Fusion’s Messaging Services.

  • system_metrics stores information about the running process itself, such as the amount of memory in the system, the average response time for services, Solr heap size, etc. The data is polled at regular intervals according to the internal configuration variable com.lucidworks.apollo.metrics.poll.seconds. This collection doesn’t appear until after the first set of metrics is collected.

Collection Configuration Properties

Collections have three configurable properties which are set to default values in the Fusion UI. They can be configured as appropriate for your application by creating the collection using the Collections API, one of the Fusion API services.

  • signals

    Determines whether or not to create the auxiliary collections with suffixes "_signals" and "_signals_aggr". When creating a collection in the Fusion UI, this property defaults to true. When creating a collection using Fusion’s API services, this property defaults to false.

  • searchLogs

    Determines whether or not to create an auxiliary search query logs collection with suffix "_logs". When creating a collection in the Fusion UI, this property defaults to true. When creating a collection using Fusion’s API services, this property defaults to false.

  • dynamicSchema

    Always defaults to false. When dynamicSchema is true, Fusion and Solr use schemaless mode to administer search and indexing over that collection.
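The different defaults for the UI and the API are easy to mix up, so here is a small sketch that models them and shows which auxiliary collections each combination would produce. It only restates the defaults described above; the function names are hypothetical, and it does not show the actual Collections API request format, which should be taken from the API reference.

```python
# Defaults as described above: the UI enables signals and search logs,
# the API services leave them off; dynamicSchema is off in both cases.
UI_DEFAULTS = {"signals": True, "searchLogs": True, "dynamicSchema": False}
API_DEFAULTS = {"signals": False, "searchLogs": False, "dynamicSchema": False}


def effective_properties(created_via, **overrides):
    """Merge explicit settings over the defaults for "ui" or "api" creation."""
    defaults = {"ui": UI_DEFAULTS, "api": API_DEFAULTS}[created_via]
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown properties: {sorted(unknown)}")
    return {**defaults, **overrides}


def expected_auxiliary_collections(primary, props):
    """List the auxiliary collections implied by the property values."""
    names = []
    if props["searchLogs"]:
        names.append(primary + "_logs")
    if props["signals"]:
        names += [primary + "_signals", primary + "_signals_aggr"]
    return names
```

Under these assumptions, a collection created through the API with no explicit settings gets no auxiliary collections, while effective_properties("api", signals=True) implies the "_signals" and "_signals_aggr" pair only.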

Signals are events with timestamps that can be used to improve search results. For more information about signals in Fusion, see the section Signals.

Search logs data is used for Search Query Reporting. The set of reports available includes most popular documents, queries that generated fewer than a minimum number of results, and search histograms.

The name schemaless mode is misleading: Solr always uses a schema when managing a collection. In schemaless mode, if a document contains a field not currently in the Solr schema, Solr processes the field value to determine what the field type should be, and then adds a new field to the schema with that field name and field type. This behavior may be convenient during preliminary application development, but it is rarely appropriate in a production environment, which is why the default is false.
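A toy model can make this behavior concrete. The sketch below is not Solr's type-guessing logic: the type labels and the function names are invented for illustration. It only mimics the pattern described above, where the first value seen for an unknown field fixes that field's type in the schema.

```python
def guess_type(value):
    """Pick an illustrative type label from a Python value (not Solr's rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def add_unknown_fields(schema, doc):
    """Add a schema entry for every field in doc that the schema lacks."""
    for field, value in doc.items():
        if field not in schema:
            schema[field] = guess_type(value)
    return schema


schema = {"id": "string"}
add_unknown_fields(schema, {"id": "1", "price": 9.5, "in_stock": True})
# A later document with a differently typed value does not change the schema.
add_unknown_fields(schema, {"id": "2", "price": "n/a"})
```

After both calls, "price" is still typed from the first document, which illustrates why guessed field types can be a poor fit for production data.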

Collection Profiles

Profiles are used to create pipeline aliases for a specific collection. In Fusion, index and query pipelines are not connected to a specific collection by default, so that a pipeline can be created once and reused across several collections. This complicates the way that pipelines are used with collections. Profiles provide a shortcut.