ETL stands for extract transform load. It may refer to the process of migrating data from one source to another, or it may refer to a piece of software that does the job.
Datawarehousing specialists use ETL tools to move a subset of data from a larger data silo to a smaller one (say, a data mart). Extract means that only certain records might be selected. Transform means that the data might be altered and tidied up along the way (as complicated as normalising a flat file into a relational format, or as simple as changing dates from Julian to Gregorian). Load means that the data is created in a new source, along with any necessary dependencies, primary keys, indexes or whatever a target platform requires of its tables.
ETL is a good example of an acronym for a process so simple that you didn't even think it warrented an acronym, so without even knowing what ETL means you think you are incapable of taking up work that demands ETL experience. Theoretically you can claim burning CDs as managing an ETL operation.
ETL
An acronym standing for Extract, Transform and Load. Sometimes extended with a C for Cleanse (ECTL).
Usually used in the context of Data Warehousing to refer to a software tool that is employed to move large amounts of data from one database to another, such as from a DB2 database on a mainframe to an Oracle database on a midrange server. The tool's use is not limited to data warehousing only.
ETL popped up about 1988 with Prism III, a COBOL code generator with interfaces to several popular database engines of the time. Back then, they were known as Data Integration Tools. Later, Informatica, Ab Initio, Datastage, and other tools became available.
The basic functions required of an ETL tool are: a database interface, a mapping facility, and a transformation facility.
The earliest ETL tools (now called First-Generation) were code generators, that is, the ETL developer would enter information about the source of the data, the transformations required, the mapping of the source columns to the target columns, and then the tool would take that info and create programs that would then be executed to do the work.
The newest tools divide between doing some generation of code, and simply entering configurations for a standard engine. Informatica takes the standard engine approach and saves all of its configurations in a database, such as Oracle. AbInitio takes more of a generation approach, and generates Korn shell script that strings modules of code together, saving the conigurations in a proprietary format.
The transformation part of the equation is usually accomplished through the input of some kind of code by the ETL programmer. For Informatica, the code consists of SQL extended by procedural commands and calls to functions. For Ab Initio, the code is much more like the programming language "C".
Mapping consists of making logical connections between physical data, usually between columns in files, or columns in database tables. Most tools allow you to click and drag to draw lines between data entities. ETI Extract worked along these lines, like a HIPO or Hierarchical Input Process Output diagram, with a list of input columns along the left side of the screen, and output columns along the right. The lines across the middle were clicked on and transformation logic could be entered at that point. Informatica takes a more ERD or Entity Relationship Diagram approach, where the connections are between boxes on the screen, and the columns are represented as ports on the boxes. The transformations occur inside some of the boxes.
Finally, Extraction is most often accomplished through SQL or Structured Query Language, the interface usually provided to most modern Databases.
Saturday, April 19, 2008
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment