Collections

Fusion collections are Solr collections managed by Fusion. A Solr collection is a distributed index defined by a named configuration stored in ZooKeeper, with these properties:

  • Number of shards

    Documents are distributed across this number of partitions.

  • Document routing strategy

    How documents are assigned to shards.

  • Replication factor

    How many copies of each document are stored in the collection.

  • Replica placement strategy

    Where to place replicas in the cluster.
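
To make these four properties concrete, the following sketch maps them onto parameters of Solr's own Collections API CREATE action (numShards, replicationFactor, router.name, createNodeSet). It only builds the request URL and sends nothing. The Solr base URL, the collection name "products", and the CollectionSpec class are assumptions made for illustration; in a Fusion deployment you would normally create collections through Fusion rather than by calling Solr directly.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class CollectionSpec:
    """Hypothetical holder for the four properties listed above."""
    name: str
    num_shards: int                        # number of shards (partitions)
    replication_factor: int                # copies of each document
    router: str = "compositeId"            # document routing strategy
    create_node_set: Optional[str] = None  # replica placement; None lets Solr choose

    def total_replicas(self) -> int:
        # Every shard is copied replication_factor times.
        return self.num_shards * self.replication_factor


def solr_create_url(spec: CollectionSpec,
                    base: str = "http://localhost:8983/solr") -> str:
    """Build (but do not send) a Solr Collections API CREATE request URL."""
    params = {
        "action": "CREATE",
        "name": spec.name,
        "numShards": spec.num_shards,
        "replicationFactor": spec.replication_factor,
        "router.name": spec.router,
    }
    if spec.create_node_set:
        params["createNodeSet"] = spec.create_node_set
    return f"{base}/admin/collections?{urlencode(params)}"


spec = CollectionSpec(name="products", num_shards=2, replication_factor=3)
print(solr_create_url(spec))
print("replicas in the cluster:", spec.total_replicas())
```

With two shards and a replication factor of three, the cluster hosts six replicas in total, which is why replica placement matters as a collection grows.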

When you first install Fusion, it automatically creates a collection called "default". If you haven’t modified the default collection yet, you can view the simplest collection configuration by sending a request to the Collections API endpoint at http://localhost:8764/api/collections/default/.
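
As a minimal sketch, the request to that endpoint could be made from Python's standard library as shown here. The use of HTTP basic authentication and the helper names (collection_url, build_request, fetch_collection) are assumptions for illustration; adjust the authentication to match how your Fusion instance is secured.

```python
import base64
import json
import urllib.request

# Base URL taken from the endpoint shown above.
FUSION_API = "http://localhost:8764/api"


def collection_url(name: str, base: str = FUSION_API) -> str:
    """Return the Collections API URL for one collection."""
    return f"{base}/collections/{name}/"


def build_request(name: str, user: str, password: str) -> urllib.request.Request:
    """Prepare a GET request with a basic-auth header (assumed auth scheme)."""
    req = urllib.request.Request(collection_url(name))
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req.add_header("Authorization", f"Basic {token}")
    return req


def fetch_collection(name: str, user: str, password: str) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_request(name, user, password), timeout=10) as resp:
        return json.load(resp)
```

Against a running instance, fetch_collection("default", user, password) should return the collection's configuration as a dictionary; the exact fields depend on your Fusion version.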

Solr is the underlying engine that indexes, stores, and searches your data. Fusion manages Solr collections, manipulates data and queries before passing them to Solr, and provides analytics and monitoring features.

If your data is already stored in a Solr instance or cluster, you can manage this collection in Fusion by creating a Fusion collection that imports the existing Solr collection. See Installation with an existing Solr instance or cluster.

Primary and Auxiliary Collections

In Fusion, the primary collection is the collection that contains your application data, that is, the set of documents over which search and indexing happens. Fusion registers the collection name and information about the Solr cluster that manages this collection.

Note
Treat all collection names as case-insensitive, even though Fusion preserves case when referring to these collections.

If your application uses Fusion’s signals, analytics, or monitoring services, then Fusion creates a set of auxiliary collections in which to store signals, query logs, and other log data. Naming conventions tie each auxiliary collection to its primary collection: the name of an auxiliary collection is the name of the primary collection plus a suffix indicating the kind of auxiliary collection. For example, the suffix for the auxiliary collection containing query logs and signals is "_signals", so for a primary collection named "COLL", Fusion creates an auxiliary collection named "COLL_signals". These auxiliary collections include:

  • A search query logs and signals collection, suffix "_signals".

  • An associated collection for aggregated signals, suffix "_signals_aggr".

Note
Do not create primary collections with names that end in "_signals" or "_signals_aggr". Such names can be used only for Fusion auxiliary collections, which are created and managed by Fusion directly.

Fusion also maintains a set of Solr collections that store Fusion’s own log files and other internal information. These are called system collections and are described below.

Note
Do not create primary collections named "logs", or beginning with "system_". These names are reserved for Fusion system collections.
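
The naming rules above can be sketched as a small client-side check. This is an illustrative helper, not part of Fusion: the function names are hypothetical, and comparing names in lowercase is an assumption based on the note that collection names should be treated as case-insensitive.

```python
# Suffixes and reserved names taken from the notes above.
AUX_SUFFIXES = ("_signals", "_signals_aggr")
RESERVED_NAMES = ("logs",)
RESERVED_PREFIXES = ("system_",)


def auxiliary_names(primary: str) -> list:
    """Names of the auxiliary collections derived from a primary collection."""
    return [primary + suffix for suffix in AUX_SUFFIXES]


def check_primary_name(name: str) -> list:
    """Return a list of problems with a proposed primary collection name.

    An empty list means the name does not collide with the reserved
    patterns described above.
    """
    lowered = name.lower()  # assumption: compare case-insensitively
    problems = []
    if any(lowered.endswith(suffix) for suffix in AUX_SUFFIXES):
        problems.append("ends with a suffix reserved for auxiliary collections")
    if lowered in RESERVED_NAMES:
        problems.append("is reserved for a system collection")
    if any(lowered.startswith(prefix) for prefix in RESERVED_PREFIXES):
        problems.append("begins with a prefix reserved for system collections")
    return problems


print(auxiliary_names("COLL"))
for candidate in ("products", "shop_signals", "system_reports"):
    print(candidate, check_primary_name(candidate))
```

Running a check like this before creating a collection avoids a name clash with collections that Fusion creates and manages itself.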

Fusion uses ZooKeeper to register information about all collections, and the Fusion components and services related to a collection. The Fusion components associated with a collection include:

  • Datasources

  • Pipelines

  • Profiles

  • Signals and aggregations

  • Analytics dashboards

System Collections

Fusion includes several collections that are used for internal purposes:

  • logs indexes the log messages from the Fusion API services.

  • audit_logs indexes all HTTP requests sent to the Fusion API services.

  • recommender_models stores recommendation models (when configured), plus intermediate data which Fusion can use to update recommendations in near-real-time.

  • system_blobs stores blobs in Solr. This is used to store model files for the NLP components and other binary files used by Fusion components.

  • system_history keeps a record of configuration changes, start and stop times for services and experiments, and more.

  • system_jobs_history keeps a record of Fusion jobs, including start/stop times and status.

  • system_messages is used by Fusion’s messaging services.

  • system_metrics stores information about the running process itself, such as the amount of memory in the system, the average response time for services, and Solr heap size. The data is polled at regular intervals set by the internal configuration variable com.lucidworks.apollo.metrics.poll.seconds. This collection doesn’t appear until after the first set of metrics is collected.

Collection Configuration Properties

Collections have three properties that you can configure only when you are creating a collection using the Collections API.

  • signals*

    Description: The signals property determines whether to create auxiliary collections with suffixes _signals and _signals_aggr.

    Default behavior: When you create a collection in the Fusion UI, signals defaults to true. When you create a collection using the Fusion API, this property defaults to false.

  • searchLogs

    Description: The searchLogs property determines whether to create an auxiliary search query logs collection with suffix _logs.

    Default behavior: When you create a collection in the Fusion UI, this property defaults to true. When you create a collection using the Fusion API, this property defaults to false.

  • dynamicSchema

    Description: When dynamicSchema is true, Fusion and Solr use schemaless mode to administer search and indexing over that collection.**

    Default behavior: Property dynamicSchema always defaults to false.

*Signals are events with timestamps that can be used to improve search results. For more information about signals in Fusion, see Signals in the Fusion AI documentation.

**In schemaless mode, if a document contains a field not currently in the Solr schema, Solr processes the field value to determine what the field type should be defined as, and then adds a new field to the schema with the field name and field type. This behavior can be convenient during preliminary application development, but it is rarely appropriate in a production environment.
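
Because the defaults for signals and searchLogs differ between the Fusion UI and the Fusion API, it can help to set all three properties explicitly. The sketch below only models the default behavior described in this section; the function names are hypothetical, and how the resulting values are passed to the Collections API is not shown, because the request format is not specified here.

```python
def default_properties(created_via: str) -> dict:
    """Defaults described above for a collection created via 'ui' or 'api'."""
    if created_via not in ("ui", "api"):
        raise ValueError("created_via must be 'ui' or 'api'")
    via_ui = created_via == "ui"
    return {
        "signals": via_ui,       # true in the UI, false through the API
        "searchLogs": via_ui,    # true in the UI, false through the API
        "dynamicSchema": False,  # always defaults to false
    }


def explicit_properties(created_via: str, **overrides: bool) -> dict:
    """Start from the defaults and apply explicit overrides."""
    props = default_properties(created_via)
    unknown = set(overrides) - set(props)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    props.update(overrides)
    return props


print(default_properties("ui"))
print(default_properties("api"))
# An API-created collection that should still get signals collections:
print(explicit_properties("api", signals=True))
```

Since these properties can be set only at creation time, deciding on them explicitly up front avoids having to recreate a collection later.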

Collection Profiles

Profiles are used to create pipeline aliases for a specific collection. In Fusion, index and query pipelines are not connected to a specific collection by default, so that a pipeline can be created once and reused across several collections. The drawback is that every use of a pipeline must also name the collection it applies to. Profiles provide a shortcut by giving a particular pipeline and collection pair a single name.
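
Conceptually, a profile can be pictured as a named pairing of a pipeline with a collection. The following sketch is a mental model only, not Fusion's implementation or API; the Profile type, the registry, and all pipeline, collection, and profile names are hypothetical.

```python
from typing import Dict, NamedTuple


class Profile(NamedTuple):
    """Hypothetical model: an alias for one pipeline applied to one collection."""
    pipeline: str
    collection: str


profiles: Dict[str, Profile] = {}


def add_profile(alias: str, pipeline: str, collection: str) -> None:
    """Register an alias for a pipeline and collection pair."""
    profiles[alias] = Profile(pipeline, collection)


def resolve(alias: str) -> Profile:
    """Look up which pipeline and collection an alias stands for."""
    return profiles[alias]


# One query pipeline reused across two collections through two profiles.
add_profile("products-search", pipeline="shared-query", collection="products")
add_profile("articles-search", pipeline="shared-query", collection="articles")

print(resolve("products-search"))
print(resolve("articles-search"))
```

The pipeline is defined once, while each profile records where it is applied, so a client needs to know only the profile name.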