ETL - Extractors

Extractor components are the first part of the ETL process and are responsible for the extracting of data.

Available Extractors

row	jdbc	json	csv	xml

row

Extracts content row by row.

Component name: row
Output class: [String]

Syntax

Parameter	Description	Type	Mandatory	Default value
multiLine	Supports multi line. This is useful with CSV supporting linefeed inside strings. As of 2.0.9	boolean	false	true
lineFeed	Linefeed to use in case of multiline (see above). Since 2.0.9	string	false	`\r\n`

Example with default configuration

{ "row": {} }

csv (as of v2.1.4)

Extract content from csv files. Apache Commons-csv is used to parse csv files. This component is avaliable as of version 2.1.4

Component name. csv
Output class: [ODocument]

Syntax

Parameter	Description	Type	Mandatory	Default value
separator	Column separator	char	false	,
columnsOnFirstLine	Columns are described in the first line	boolean	false	true
columns	Columns array containing names, and optionally types by postfixing names with: . Specifying type guarantee better performances'	string[]	false	-
nullValue	value to consider as NULL	string	false	NULL
dateFormat	date format to use for parsing dates	string	false	yyy-mm-dd
quote	String character delimiter	char	false	"
skipFrom	Line number where start to skip	integer	false	-
skipTo	Line number where skip ends	integer	false	-
ignoreEmptyLines	Ignore empry lines	boolean	false	false
predefinedFormat	Name of standard csv format (from Apache commons-csv): DEFAULT, EXCEL, MYSQL, RFC4180, TDF	string	false	-

Documentation about commons-csv predefined format is available here: (https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVFormat.html)

Examples

Extract lines from CSV (as ODocument), using comma as separator, considering "NULL" as null value and skipping the rows 2-4:

{ "csv": 
    {  "separator": ",", 
        "nullValue": "NULL",
        "skipFrom": 1, 
        "skipTo": 3 
    }
}

Extract lines from a CSV exported from MYSQL:

{ "csv": 
    {  "predefinedFormat": "MYSQL"}
}

Extract lines from a CSV with default format using 'N/A' as null value placeholder and custom date format:

{ "csv": 
    {  "predefinedFormat": "DEFAULT",
        "nullValue" : "N/A",
        "dateFormat" : "dd-mm-yyyy HH:MM"
    }
}

jdbc

Extracts data from any DBMS that support JDBC driver. In order to get the ETL component to connect to the source database, put the DBMS's JDBC driver in the classpath or $ORIENTDB_HOME/lib directory.

Component name: jdbc
Output class: [ODocument]

Syntax

Parameter	Description	Type	Mandatory	Default value
driver	JDBC Driver class	string	true	-
url	JDBC URL to connect	string	true	-
userName	DBMS User name	string	true	-
userPassword	DBMS User password	string	true	-
query	Query that extract the record to import	string	true	-
queryCount	Query that return the count of the fetched records. This is used to provide a correct progress indicator	string	false	-

Example

Extracts all the "Client" table from a MySQL database "test" hosted on localhost:

{ "jdbc": {
    "driver": "com.mysql.jdbc.Driver",
    "url": "jdbc:mysql://localhost/test",
    "userName": "root",
    "userPassword": "",
    "query": "select * from Client"
  }
}

json

Extracts data by parsing json objects. If the data has more json items, they must be enclosed between [].

Component name: json
Output class: [ODocument]

Example

{ "json": {} }

xml (as of v2.2)

Extracts the data by parsing XML.

Component name: xml
Output class: [ODocument]

Syntax

Parameter	Description	Type	Mandatory	Default value
rootNode	Root node to consider. By default it build a document starting from the root tag	string	false	empty
tagsAsAttribute	array of tags where children tags are considered as attributes of document and the attribute value is the text inside the tag	string[]	false	empty

Examples

Example 1: extract data from an XML file.

simple.xml XML file content:

<?xml version="1.0" encoding="UTF-8"?>
<a>
    <b>
        <c name='Ferrari' color='red'>ignore</c>
        <c name='Maserati' color='black'/>
    </b>
</a>

OrientDB ETL configuration file:

{"source": { "file": { "path": "src/test/resources/simple.xml" } }, "extractor" : { "xml": {} }, "loader": { "test": {} } }

Result:

{
  "a": {
    "b": {
      "c": [
        {
          "color": "red",
          "name": "Ferrari"
        },
        {
          "color": "black",
          "name": "Maserati"
        }
      ]
    }
  }
}

Example 2: extract a collection from XML.

simple.xml XML file content:

<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
    </CD>
    <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
    </CD>
</CATALOG>

OrientDB ETL configuration file:

{"source": { "file": { "path": "src/test/resources/music.xml" } }, "extractor" : { "xml": { "rootNode": "CATALOG.CD", "tagsAsAttribute": ["CATALOG.CD"] } }, "loader": { "test": {} } }

Result:

{
  "TITLE": "Empire Burlesque",
  "ARTIST": "Bob Dylan",
  "COUNTRY": "USA",
  "COMPANY": "Columbia",
  "PRICE": "10.90",
  "YEAR": "1985"
}
{
  "TITLE": "Hide your heart",
  "ARTIST": "Bonnie Tyler",
  "COUNTRY": "UK",
  "COMPANY": "CBS Records",
  "PRICE": "9.90",
  "YEAR": "1988"
}
{
  "TITLE": "Greatest Hits",
  "ARTIST": "Dolly Parton",
  "COUNTRY": "USA",
  "COMPANY": "RCA",
  "PRICE": "9.90",
  "YEAR": "1982"
}