Use Virtual Tables with a Common Join Key

With Solr, you can index different document types into the same shard by using the composite ID router with a common route key field. For example, a customer 360 application can index different customer-related document types (contacts, apps, support requests, and so forth) into the same collection, each with a common customer_id field. Because composite ID routing ensures that all documents with the same route key value land in the same shard, Solr can perform optimized joins between the document types using that field. See Document Routing.
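With the composite ID router, each document ID is prefixed with its route key value, separated by an exclamation mark; Solr hashes the prefix to choose the shard. For example, these hypothetical document IDs for a customer acme would all hash to the same shard:

```
acme!contact-1001
acme!ticket-42
acme!contract-7
```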

Providing a compositeIdSpec for the Fusion collection

Before you index documents, provide a compositeIdSpec when creating the Fusion collection. For example:

curl -u $FUSION_USER:$FUSION_PASS -X POST -H "Content-type:application/json" \
  -d '{"id":"customer","solrParams":{"replicationFactor":1,"numShards":1,"maxShardsPerNode":10},"type":"DATA","compositeIdSpec":{"routeKey1Field":"customer_id_s"}}' \
  "$FUSION_API/apps/$MYAPP/collections?defaultFeatures=false"

In the example request above, we create a collection named customer with the route key field set to customer_id_s. When documents are indexed through Fusion, the Solr Index pipeline stage uses the compositeIdSpec to create a composite document ID, so documents get routed to the correct shard.
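With the collection in place, documents of different types can be indexed with the common route key field. The sketch below sends two hypothetical documents through a Fusion index pipeline; the pipeline name "customer", the document IDs, and the doc_type_s values are assumptions for illustration:

```
curl -u $FUSION_USER:$FUSION_PASS -X POST -H "Content-type:application/json" \
  -d '[{"id":"contact-1001", "doc_type_s":"contacts", "customer_id_s":"acme"},
       {"id":"ticket-42",    "doc_type_s":"support",  "customer_id_s":"acme"}]' \
  "$FUSION_API/apps/$MYAPP/index-pipelines/customer/collections/customer/index"
```

Because both documents share the same customer_id_s value, the Solr Index pipeline stage gives them composite IDs with the same prefix, so they are routed to the same shard.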

Exposing document types as virtual tables

If you configure your Fusion collection to use a route key field to route different document types to the same shard, then the Fusion SQL service can expose each document type as a virtual table and perform optimized joins between these virtual tables using the route key. To create virtual tables, use the Fusion Catalog API to set, on the data asset for the main collection, the name of the field that determines the document type. For example, if you have a collection named customer that contains different document types (contacts, support tickets, sales contracts, and so forth), you would set up virtual tables with the following Catalog API update request:

curl -XPUT -H "Content-type:application/json" http://localhost:8765/api/v1/catalog/fusion/assets/customer -d '{
  "projectId" : "fusion",
  "name" : "customer",
  "assetType" : "table",
  "description" : "Fusion collection customer",
  "format" : "solr",
  "options" : [ "collection -> customer", "exclude_fields -> _lw_*,*_\\d_coordinate,_raw_content_", "solr.params -> sort=id asc" ],
  "cacheOnLoad" : false,
  "id" : "fusion.customer",
  "additionalSettings": {
    "virtualTableField":"doc_type_s"
  }
}'

In the example above, we set the virtualTableField to doc_type_s. Fusion sends a facet request to the customer collection to get the unique values of the doc_type_s field and creates a data asset for each unique value. Each of these data assets is then registered in the Fusion SQL service as a virtual table.
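For example, if doc_type_s contains the values contacts, support, and apps (hypothetical values for illustration), you should see a virtual table for each value alongside the main table when you list the tables registered in the Fusion SQL service:

```
show tables;
-- customer, contacts, support, apps  (one virtual table per doc_type_s value)
```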

Performing optimized joins in SQL

After you have virtual tables configured and documents routed to the same shard using a compositeIdSpec, you can perform optimized joins in SQL that take advantage of Solr’s domain-join facet feature. For example, the following SQL statement results in a JSON facet request to Solr to perform the aggregation:

select count(1) num_support_requests,
       c.industry as industry,
       a.app_id as app_id,
       a.feature_id as feature_id
from customer c
join support s on c.customer_id = s.customer_id
join apps a on s.customer_id = a.customer_id
where c.region='US-East' AND s.support_type='Enhancement' AND a.app_type='Search'
group by industry, app_id, feature_id

In the example above, we compute the number of feature enhancement requests for Search applications from customers in the US-East region by performing a three-way join between the customer, support, and apps virtual tables on the customer_id join key. Behind the scenes, Fusion SQL issues a JSON facet query that exploits the fact that all documents with the same customer_id value are in the same shard. This lets Solr compute the counts for the industry, app_id, and feature_id grouping keys much more efficiently than is possible using table scans in Spark.
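Conceptually, the generated request resembles the following JSON facet sketch, in which a nested join domain pivots from customer documents to support documents on the route key field (a simplified illustration of one branch of the aggregation, not the exact request Fusion generates):

```
{
  "query": "doc_type_s:customer AND region:US-East",
  "limit": 0,
  "facet": {
    "industry": {
      "type": "terms",
      "field": "industry",
      "facet": {
        "support_requests": {
          "type": "query",
          "q": "doc_type_s:support AND support_type:Enhancement",
          "domain": {
            "join": { "from": "customer_id_s", "to": "customer_id_s" }
          }
        }
      }
    }
  }
}
```

Because composite ID routing placed all documents for a given customer in the same shard, the join domain can be evaluated locally on each shard rather than requiring a distributed join.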