Posts Tagged ‘ETL’

Understanding ETL

November 29, 2009 Leave a comment

The Extract-Transform-Load (ETL) system is the foundation of the data warehouse. A properly designed ETL system extracts the data from source systems, performs transformations and cleansing and delivers the data in a presentation ready format after which the data will be loaded to the ware house. The following figure is the schematic description of the ETL process.


The 4 steps of ETL process are explained below:

Extracting: In this phase, data from different types of source systems are fetched into the staging area. The source systems can be mainframe, production sources or any other OLTP sources. Source files can also be different. Data can be stored in relational tables as well as flat files (e.g. Notepad files). The first job of ETL is to fetch data from these different sources.

Cleansing: In most cases level of data quality in an OLTP source system is different from what is required in a data warehouse. To achieve those data quality, cleansing has to be applied. Data cleansing consists of many discrete steps, including checking for valid values, ensuring consistency across values, removing duplicates and checking whether complex business rules and procedures have been applied.

Conforming: Data conformation is required whenever 2 or more data sources are merged in the data warehouse. Separate data sources cannot be queried together unless some or all the textual labels in these sources have been made identical and unless similar numerical measures have been rationalized.

Delivering: The main motto of this step is to make the data ready for querying. It includes physical structuring the data into a set of simple, symmetric schemas known as dimensional models, or star schemas (we will discuss star schema and dimensional model later). These schemas are a necessary basis for building an OLAP system.

Technorati Tags: ,