ETL

The OrientDB-ETL module moves data to and from OrientDB by executing an ETL (Extract, Transform, Load) process. OrientDB ETL is based on the following principles:

  • one configuration file in JSON format
  • exactly one Extractor, which extracts data from a source
  • exactly one Loader, which loads data to a destination
  • multiple Transformers that transform data in a pipeline. Each Transformer receives an input, processes it, and returns an output that becomes the input of the next component

How ETL works

EXTRACTOR => TRANSFORMERS[] => LOADER

The following example shows a process that extracts data from a CSV file, applies some changes, performs a lookup to check whether the record has already been created, and then stores the record as a document in an OrientDB database:

+-----------+-----------------------+-----------+
|           |              PIPELINE             |
+ EXTRACTOR +-----------------------+-----------+
|           |     TRANSFORMERS      |  LOADER   |
+-----------+-----------------------+-----------+
|   FILE   ==>  CSV->FIELD->MERGE  ==> OrientDB |
+-----------+-----------------------+-----------+

The pipeline, composed of transformation and loading phases, can run in parallel by setting the configuration {"parallel":true}.
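
As a rough sketch, the CSV example above might be expressed in a configuration file along the following lines. The component order follows the diagram (file source, CSV, FIELD and MERGE transformers, OrientDB loader). The file path, database URL, class name, field names, the trim expression, and the exact parameter names are illustrative assumptions; check them against the component reference for your version before use.

```json
{
  "config": {
    "log": "info",
    "parallel": true
  },
  "source": {
    "file": { "path": "/tmp/people.csv" }
  },
  "extractor": {
    "row": {}
  },
  "transformers": [
    { "csv": { "separator": "," } },
    { "field": { "fieldName": "name", "expression": "name.trim()" } },
    { "merge": { "joinFieldName": "id", "lookup": "Person.id" } }
  ],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/tmp/databases/people",
      "dbType": "document",
      "class": "Person",
      "indexes": [
        { "class": "Person", "fields": ["id:string"], "type": "UNIQUE" }
      ]
    }
  }
}
```

In this sketch the merge step looks up existing records through an index on the hypothetical Person.id field, which is why the loader section declares that index.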

Installation

Starting from OrientDB v2.0, the ETL module is bundled with the official release. To build and install the module from source, follow these steps:

  • Clone the repository to your computer by executing:
    • git clone https://github.com/orientechnologies/orientdb-etl.git
  • Compile the module by executing:
    • mvn clean install
  • Copy script/oetl.sh (or .bat under Windows) to $ORIENTDB_HOME/bin
  • Copy target/orientdb-etl-2.0-SNAPSHOT.jar to $ORIENTDB_HOME/lib

Usage

$ cd $ORIENTDB_HOME/bin
$ ./oetl.sh config-dbpedia.json
NOTE: If you are importing data for use in a distributed database, you must set ridBag.embeddedToSbtreeBonsaiThreshold=Integer.MAX_VALUE for the ETL process to avoid replication errors when the database is updated online.

Run-time configuration

In an ETL JSON file you can define variables, which will be resolved at run-time by passing them at startup. You could, for example, assign the database URL as ${databaseURL} and then pass the database URL at execution time with:

$ ./oetl.sh config-dbpedia.json -databaseURL=plocal:/temp/mydb
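
For the command above to have any effect, the configuration file must reference the variable somewhere. A minimal sketch of how the loader section of config-dbpedia.json might do so is shown below; the surrounding keys and the dbType value are assumptions for illustration, and the rest of the file is omitted.

```json
{
  "loader": {
    "orientdb": {
      "dbURL": "${databaseURL}",
      "dbType": "document"
    }
  }
}
```

With this fragment, the placeholder would resolve to plocal:/temp/mydb when the process is started with -databaseURL=plocal:/temp/mydb.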

Available Components

Examples: