Apache Pulsar

Basics

  • Pulsar gives you the ability to subscribe a pipeline to a topic

  • Topics belong to namespaces

    • At the namespace level, you can specify retention policies and schema enforcement

  • Namespaces belong to tenants

    • Tenants can specify authorization/authentication

Highlights

  • Supports native streaming transformations through Pulsar Functions

  • Uses Apache Zookeeper for configuration management

  • Uses Apache BookKeeper for durable storage

  • Each document processed has a result object persisted to another topic

Overview

Apache Pulsar is a message bus that combines queuing with publishing-subscribing features for server-to-server messaging. A producer publishes a message to a topic and a consumer subscribes to that topic using a subscription type to receive the topic messages.

Thanks to Pular’s messaging guarantees (and Apache Bookeeper), messages that have not been acknowledged by a consumer will be persisted and re-published. This means saves to crawl-db collection can be done once when an item is completed (versus stored in Solr first, queried for, and then fetched).

Apache Pulsar is a multi-tenant messaging system. Several tenants can use the message bus concurrently. Each tenant can have multiple namespaces and each namespace can have multiple topics.

Apache Pulsar also supports geo-replication. Data published to a cluster in Datacenter-A can be replicated to a cluster in Datacenter-B, and can be consumed by consumers listening to the same topic in Datacenter-B pulsar cluster.

Types of server-to-plugin calls:

  • Broadcast (pub/sub): The server publishes a message to a topic. Each plugin instance consumes the message through its own unique, exclusive subscription. In this scenario, all plugin instances for a given configId consume the same message. Examples include job actions (start, stop) and health status requests.

  • Queueing: The server publishes messages to a topic, where plugins consume through a shared subscription. Subscriptions that are shared allow multiple consumers to subscribe and consume messages one by one. Shared subscriptions are needed for distributed fetching.

  • Failover: The connector job service processes messages from a plugin output topic. This is just like exclusive (queueing), but allows another server instance to take over if the current instance fails.

Fusion Integration

Fusion will create a Pulsar namespace for every app that is created. The default tenant name for Pulsar is {kubernetes-namespace}. Pulsar does not store messages if there are no active subscriptions for the topic. Due to this, namespaces are configured for message retention with a default time of 1 day and a default size of 5GB.

Messages for topics that have subscriptions but no active consumers will be stored in the backlog. Backlog quota is managed per topic.

Pulsar is used by default for Signals. _signals_ingest pipeline has a subscription to the _signals_ingest topic in the _system namespace. Signals belong to the _system app that is created in Fusion when the API service boots up.

Tip
Subscriptions are integrated with the Fusion UI. See Subscriptions UI for more information.