Available data sources
The Parallel Bulk Loader can load documents from any system that implements the Data Sources API for Spark SQL 2.2.1 or later. For data sources that use Spark SQL connectors, the source of the connector is indicated.
- Solr databases. Connector (Lucidworks): spark-solr
- Files in these common formats: JSON, CSV, XML, Apache Avro, and Apache Parquet
- JDBC-compliant databases
- Apache HBase databases. Connector (Hortonworks): Apache Spark - Apache HBase Connector (Spark-on-HBase)
- Datasets accessible through Apache Hive
- Apache Cassandra NoSQL databases. Connector (DataStax): Spark-Cassandra connector
- Elasticsearch databases. Connector: Elasticsearch-Hadoop connector. Use the package org.elasticsearch:elasticsearch-spark-20_2.11:6.2.2
- MongoDB databases
- Riak databases
- Couchbase NoSQL databases. Connector: Couchbase-Spark connector. Use the package com.couchbase.client:spark-connector_2.11:2.2.0
- Redis in-memory data structure stores
- Google data sources, including Google Analytics, Sheets, and BigQuery. Connectors: Analytics, Sheets, and BigQuery
- Microsoft Azure Data Lake, Cosmos DB, and SQL Database
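As a sketch of what the Data Sources API looks like in practice, the following Scala snippet reads two of the sources above: a Solr collection through the Lucidworks spark-solr connector, and a directory of Parquet files, which Spark supports out of the box. The collection name, ZooKeeper address, and HDFS path are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pbl-read-sketch").getOrCreate()

// Solr as a Spark SQL data source (spark-solr connector).
// "zkhost" and "collection" are the connector's standard options.
val solrDf = spark.read.format("solr")
  .option("zkhost", "localhost:9983")   // hypothetical ZooKeeper address
  .option("collection", "products")     // hypothetical collection name
  .load()

// Parquet is built into Spark SQL; no extra connector is required.
val parquetDf = spark.read.parquet("hdfs:///data/products") // hypothetical path

// Both reads yield DataFrames with a schema, ready for SQL or DataFrame operations.
solrDf.printSchema()
parquetDf.printSchema()

Either DataFrame can then be transformed and written out to Solr or to a Fusion index pipeline by the loader.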
Key features
Key features of the Parallel Bulk Loader are:
- Load distribution. To distribute load and maximize performance, the Parallel Bulk Loader parallelizes operations and distributes them across the Fusion Spark cluster.
- No parsing. The Spark Data Sources API returns a DataFrame (an RDD plus a schema) in an easy-to-use tabular format, so no parsing is needed.
- Dynamic resolution of dependencies. There is nothing to download or install. Users just provide the Maven coordinates of dependencies during configuration, and Spark distributes the necessary JAR files to worker nodes (see the first example after this list).
- Leverage integration libraries. As with JDBC, the Parallel Bulk Loader leverages integration libraries built by the experts behind the underlying systems, for example Databricks, DataStax, Hortonworks, Elastic, Lucidworks, and Microsoft.
- Direct write operations. The Parallel Bulk Loader writes directly to Solr (for maximum performance) or to Fusion index pipelines (for maximum flexibility).
- Solr atomic updates. The Parallel Bulk Loader uses atomic updates to update existing documents in Solr (illustrated after this list).
- Incremental queries. To obtain the latest changes to data sources, macros built into the Parallel Bulk Loader use timestamps to filter queries.
- Seamless integration with Spark-NLP. Perform natural language processing tasks, including part-of-speech tagging, stemming or lemmatization, sentiment analysis, named-entity recognition, and more.
- SQL joins. Use SQL to join data from multiple Solr collections (see the sketch after this list).
- Load Fusion ML models. To classify incoming documents, load Fusion Machine Learning models stored in the Fusion blob store.
- SQL transformations. Leverage the full power of Spark's Scala DataFrame API and SQL to filter and transform data.
- UDFs and UDAFs. Select from hundreds of user-defined functions (UDFs) and user-defined aggregate functions (UDAFs).
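To make the dependency-resolution feature concrete: the underlying Spark mechanism resolves Maven coordinates at launch time and ships the resulting JARs to the worker nodes automatically. For example, the Couchbase connector listed earlier can be pulled in by coordinates alone; in plain Spark the equivalent is:

spark-shell --packages com.couchbase.client:spark-connector_2.11:2.2.0

In the Parallel Bulk Loader, the same coordinates go into the job configuration instead of a command line.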
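Several of the features above (schema-bearing DataFrames, SQL joins across Solr collections, SQL transformations, and timestamp-based incremental filters) can be sketched in a few lines of Scala. The collection and field names below are hypothetical, and the timestamp filter is written by hand where the loader's built-in macros would normally generate it.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pbl-sql-sketch").getOrCreate()

// Helper: read a Solr collection as a DataFrame (names are placeholders).
def solr(collection: String) = spark.read.format("solr")
  .option("zkhost", "localhost:9983")
  .option("collection", collection)
  .load()

solr("orders").createOrReplaceTempView("orders")
solr("customers").createOrReplaceTempView("customers")

// SQL join across two Solr collections, with a filter and a renamed column.
val joined = spark.sql(
  """SELECT c.name, o.total_d AS total
    |FROM orders o JOIN customers c ON o.customer_id = c.id
    |WHERE o.total_d > 100.0""".stripMargin)

// Incremental-query idea: read only documents newer than a checkpoint.
// The loader's timestamp macros generate filters like this automatically.
val recent = solr("orders").where("timestamp_tdt >= '2018-01-01T00:00:00Z'")

joined.show()
println(recent.count())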
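The atomic updates the loader performs are standard Solr atomic updates: a document carries modifier maps ("set", "add", "inc") instead of plain values, so only the named fields change on the existing document. A minimal SolrJ illustration, with hypothetical id, field, and collection names:

import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument
import scala.collection.JavaConverters._

val client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()

// The "set" modifier replaces a single field on the existing document
// "doc-1" rather than reindexing the whole document.
val doc = new SolrInputDocument()
doc.addField("id", "doc-1")
doc.addField("price_d", Map("set" -> 9.99).asJava)

client.add("products", doc)
client.commit("products")
client.close()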
Differences between the Parallel Bulk Loader and Fusion classic connectors
The primary difference between the Parallel Bulk Loader and Fusion classic connectors is that the Parallel Bulk Loader uses Spark SQL and Spark/Solr integration to perform distributed reads from data sources. Here are some examples of how the Parallel Bulk Loader performs distributed reads:
- HBase table. To support high-volume data ingestion into Solr, the Parallel Bulk Loader can distribute queries sent to HBase tables across multiple region servers.
- Parquet files. The Parallel Bulk Loader processes a directory of Parquet files in HDFS in parallel, using the built-in support for computing splits in HDFS files.
- Spark/Solr integration. With Spark/Solr integration, the Parallel Bulk Loader uses the spark-solr data source to send queries to all replicas of a collection, so it can read from Solr in parallel across the Spark cluster (see the sketch below).
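As a sketch of the Solr side of this, the spark-solr data source exposes options that split reads within each shard so a collection is read by many Spark tasks at once. Option names follow the spark-solr documentation; the host, collection, and values are illustrative.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-read-sketch").getOrCreate()

// One Spark task per shard by default; the split options divide each
// shard's rows across additional tasks for more read parallelism.
val df = spark.read.format("solr")
  .option("zkhost", "localhost:9983")   // hypothetical ZooKeeper address
  .option("collection", "products")     // hypothetical collection
  .option("splits", "true")             // enable intra-shard splits
  .option("splits_per_shard", "4")      // target splits per shard
  .load()

println(df.rdd.getNumPartitions) // reflects the resulting read parallelism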
