ETL

The Extractor, Transformer and Loader (ETL) module for OrientDB supports moving data to and from OrientDB databases using ETL processes.

  • Configuration: The ETL module uses a configuration file, written in JSON.
  • Extractor: Pulls data from the source database.
  • Transformers: Convert the data in the pipeline from its source format to one accessible to the target database.
  • Loader: Loads the data into the target database.

How ETL Works

The ETL module receives a backup file from another database, converts the fields into an accessible format, and loads them into OrientDB.

EXTRACTOR => TRANSFORMERS[] => LOADER

For example, consider the process for a CSV file. Using the ETL module, OrientDB loads the file, applies whatever changes it needs, then stores each record as a document in the current OrientDB database.

+-----------+-----------------------+-----------+
|           |              PIPELINE             |
+ EXTRACTOR +-----------------------+-----------+
|           |     TRANSFORMERS      |  LOADER   |
+-----------+-----------------------+-----------+
|   FILE   ==>  CSV->FIELD->MERGE  ==> OrientDB |
+-----------+-----------------------+-----------+

You can modify this pipeline so that the transformation and loading phases run in parallel by setting the configuration variable "parallel" to true.

{"parallel": true}
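The EXTRACTOR => TRANSFORMERS[] => LOADER flow described above can be illustrated with a small, self-contained Python sketch. It is a conceptual model only, not the ETL module's implementation or API: the function names (`extract`, `rename_field`, `uppercase_field`, `run_pipeline`) and the sample data are hypothetical, and a plain list stands in for the target database.

```python
import csv
import io


def extract(source):
    """Extractor: yield one record (a dict) per CSV row."""
    yield from csv.DictReader(source)


def rename_field(old, new):
    """Build a transformer that renames one field in each record."""
    def transform(record):
        record = dict(record)          # copy so the input is not mutated
        record[new] = record.pop(old)
        return record
    return transform


def uppercase_field(name):
    """Build a transformer that upper-cases the value of one field."""
    def transform(record):
        return {**record, name: record[name].upper()}
    return transform


def run_pipeline(source, transformers, load):
    """Pass every extracted record through each transformer, then load it."""
    count = 0
    for record in extract(source):
        for transform in transformers:
            record = transform(record)
        load(record)
        count += 1
    return count


target = []  # hypothetical stand-in for the target database
data = io.StringIO("name,city\nAda,London\nAlan,Manchester\n")
loaded = run_pipeline(
    data,
    [rename_field("city", "location"), uppercase_field("name")],
    target.append,
)
```

Each transformer takes a record and returns a new one, so the transformers can be chained in any order, which mirrors the TRANSFORMERS[] array in the diagram.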

Installation

Beginning with version 2.0, OrientDB bundles the ETL module with the official release.

Usage

To use the ETL module, run the oetl.sh script with the configuration file given as an argument.

$ $ORIENTDB_HOME/bin/oetl.sh config-dbpedia.json
NOTE: If you are importing data for use in a distributed database, you must set ridBag.embeddedToSbtreeBonsaiThreshold=Integer.MAX_VALUE for the ETL process. This avoids replication errors when the database is updated online.

Run-time Configuration

The JSON configuration file can contain variables, which the ETL module resolves at run time from values you pass in as it starts up.

You define the values for these variables through command-line options. For example, you could assign the database URL as ${databaseURL} in the configuration file, then pass the relevant argument on the command line:

$ $ORIENTDB_HOME/bin/oetl.sh config-dbpedia.json \
      -databaseURL=plocal:/tmp/mydb

When the ETL module initializes, it takes the value plocal:/tmp/mydb from the command line and substitutes it for ${databaseURL} in the configuration file.
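To make the substitution step concrete, here is a minimal Python sketch of how `${name}` placeholders might be filled from `-name=value` arguments. It illustrates the idea only and is not how the ETL module is implemented: `parse_variables` and `resolve` are hypothetical helpers, and the `target` key in the sample configuration is made up for the example.

```python
import json
import re


def parse_variables(args):
    """Collect -name=value arguments into a dict, ignoring everything else."""
    variables = {}
    for arg in args:
        if arg.startswith("-") and "=" in arg:
            name, value = arg[1:].split("=", 1)
            variables[name] = value
    return variables


def resolve(text, variables):
    """Replace each ${name} in text; unknown names are left untouched."""
    return re.sub(
        r"\$\{(\w+)\}",
        lambda match: variables.get(match.group(1), match.group(0)),
        text,
    )


# Hypothetical configuration text; "target" is not a real ETL setting.
config_text = '{"target": "${databaseURL}"}'
variables = parse_variables(["config-dbpedia.json", "-databaseURL=plocal:/tmp/mydb"])
config = json.loads(resolve(config_text, variables))
```

Leaving unknown placeholders untouched is a design choice of this sketch. A real tool might instead report an error for an undefined variable.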

Available Components

Examples: