Monday, February 25, 2008

ETL and Data Warehouse Basics with added advantages

Are the dimensions independent? Or is there a statistical correlation between them?
Weak statistical correlation and big dimensions
Even if there is a weak correlation, build the DW as if the dimensions are independent, especially for large dimensions
E.g., 100 stores and 1 million products: combining the two yields 100 million rows, too large
If the statistical correlation is significant, build a fact table for the correlation, and there may be many such correlations, e.g., merchandizing correlation, pricing-strategy correlation, changing-seasonality correlation
Leave the dimensions simple and independent if the dimensions are large and the statistical correlation is weak
Strong correlations and small resulting dimensions
If the correlation is strong, e.g., a product always has exactly one brand, leave the brand OUT of the fact table and COMBINE the product and the brand into a single dimension, because the product always rolls up to a single brand
Arbitrary bound: 100K records is not a small dimension


When the same dimension is attached to a fact table multiple times
A sale has an order date, a payment date, a shipping date, and a return date
A claim transaction involves multiple internal people: claim intake, a medical nurse, a claim adjuster, a payer, etc.
End users might be confused when they drill down into the dimension as to which role it represents, so it helps to give each role a different name
Solution: create a view over the dimension for each dimensional role to help with name recognition (a sketch follows below)
For very large dimensions (e.g., locations) it might be appropriate to create physical copies of the dimension table to avoid the name confusion
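A minimal sketch of the view approach, using SQLite for illustration; the date_dim table and the role names (order_date_dim, ship_date_dim) are hypothetical:

```python
# One physical date dimension exposed under role-specific names via views.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT);

    -- One view per dimensional role; the fact table joins to each view
    -- through its own foreign key (order_date_key, ship_date_key, ...).
    CREATE VIEW order_date_dim AS
        SELECT date_key   AS order_date_key,
               full_date  AS order_full_date,
               month_name AS order_month_name
        FROM date_dim;

    CREATE VIEW ship_date_dim AS
        SELECT date_key   AS ship_date_key,
               full_date  AS ship_full_date,
               month_name AS ship_month_name
        FROM date_dim;
""")
```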


Degenerate dimensions
When a parent-child relationship exists and the grain of the fact table is the child, the parent is left without a dimension of its own in the design
Example:
the grain of the fact table is the line item in an order
the order number is a significant part of the key
but we don't create a dimension table for the order number, because it would have no attributes of its own and would be useless
we store the order number in the fact table as if it were a dimension key, but we don't create a dimension table for it
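A minimal sketch of a line-item-grain fact table carrying the order number as a degenerate dimension; all table and column names are illustrative:

```python
# The order number sits in the fact table like a dimension key, but there is
# no order dimension table behind it.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE order_line_fact (
        order_number   TEXT,      -- degenerate dimension: no dimension table
        line_number    INTEGER,
        product_key    INTEGER,   -- FK to the product dimension
        date_key       INTEGER,   -- FK to the date dimension
        quantity       INTEGER,
        extended_price REAL,
        PRIMARY KEY (order_number, line_number)
    )
""")
```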


When the DW receives notification that some record in a dimension has changed, there are three basic responses:
Type 1 slowly changing dimension (Overwrite)
Type 2 slowly changing dimension (Partitioning History)
Type 3 slowly changing dimension (Alternate Realities)


Overwrite one or more values of the dimension with the new value
Use when
the data are corrected
there is no interest in keeping history
there is no need to run previous reports or the changed value is immaterial to the report
Type 1 Overwrite results in an UPDATE SQL statement when the value changes (see the sketch after this list)
If a column is Type-1, the ETL subsystem must
Add the dimension record, if it’s a new value or
Update the dimension attribute in place
Must also update any Staging tables, so that any subsequent DW load from the staging tables will preserve the overwrite
This update never affects the surrogate key
But it affects materialized aggregates that were built on the value that changed (will be discussed more next week when we talk about delivering fact tables)
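A minimal Type-1 sketch in Python/SQLite; the customer_dim table and its columns are hypothetical, and in practice the same logic would usually run as set-based SQL in the warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE customer_dim (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key, never touched here
    customer_id  TEXT UNIQUE,           -- natural key from the source
    first_name   TEXT)""")

def type1_overwrite(natural_key, new_first_name):
    """Overwrite the attribute in place; insert the record if the natural key is new."""
    cur = conn.execute(
        "UPDATE customer_dim SET first_name = ? WHERE customer_id = ?",
        (new_first_name, natural_key))
    if cur.rowcount == 0:
        # New natural key: add the dimension record instead.
        conn.execute(
            "INSERT INTO customer_dim (customer_id, first_name) VALUES (?, ?)",
            (natural_key, new_first_name))
    conn.commit()
```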

Beware of ETL tools' UPDATE-else-INSERT statements, which are convenient but inefficient
Some developers use UPDATE-else-INSERT for fast-changing dimensions and INSERT-else-UPDATE for very slowly changing dimensions
Better approach: segregate the INSERTs from the UPDATEs, and feed the DW independently for the updates and for the inserts
For small tables there is no need to invoke a bulk loader; simply execute the SQL updates, since the performance impact is immaterial even with the DW logging the SQL statements
For larger tables a loader is preferable, because SQL updates would result in unacceptable database logging activity
Either turn logging off before applying the separate SQL UPDATEs and INSERTs
Or use a bulk loader
Prepare the new dimension in a staging file
Drop the old dimension table
Load the new dimension table using the bulk loader


The standard technique
When a record changes, instead of overwriting:
create a new dimension record
with a new surrogate key
add the new record into the dimension table
use this record going forward in all fact tables
no fact tables need to change
no aggregates need to be re-computed
Perfectly partitions history, because each detailed version of the dimension is correctly connected to the span of fact table records for which that version is correct (sketched below)
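A minimal Type-2 sketch in Python/SQLite, assuming a hypothetical employee_dim with effective/expiry dates and a current flag (the extra attributes discussed below):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee_dim (
    employee_key   INTEGER PRIMARY KEY,  -- surrogate key, a new one per version
    employee_id    TEXT,                 -- natural key, never changes
    job            TEXT,
    effective_date TEXT,
    expiry_date    TEXT,
    current_flag   TEXT DEFAULT 'Y')""")

def type2_change(natural_key, new_job, change_date=None):
    """Expire the current version and add a new one with a fresh surrogate key."""
    change_date = change_date or date.today().isoformat()
    conn.execute(
        """UPDATE employee_dim
           SET current_flag = 'N', expiry_date = ?
           WHERE employee_id = ? AND current_flag = 'Y'""",
        (change_date, natural_key))
    conn.execute(
        """INSERT INTO employee_dim (employee_id, job, effective_date, current_flag)
           VALUES (?, ?, ?, 'Y')""",
        (natural_key, new_job, change_date))
    conn.commit()   # existing fact rows keep pointing at the old employee_key
```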


The natural key does not change
The job attribute changes
We can constrain our query on
the Manager job
Joe's employee id
Type-2 changes do not alter the natural key (the natural key should never change)


With a Type-2 change, you might want to include the following additional attributes in the dimension
Date of change
Exact timestamp of change
Reason for change
Current Flag (current/expired)


When the source system does not announce changes and does not time/date-stamp its own updates:
For small tables, use a brute-force approach of comparing every incoming field with its counterpart in the DW to see if anything changed
For larger tables, use the CRC, or cyclic redundancy checksum, a number of about 20 digits, like this:
Treat the entire incoming record as a string
Compute the CRC value of the string (some numeric value)
Compare the CRC value of today’s record with the CRC value of yesterday’s record
If the CRC values match, today’s record is identical to yesterday’s record
If the CRC values do not match, do a field by field comparison to see what has changed
Depending on whether the changed field is a Type-1, Type-2 or Type-3 change, do the necessary updates
Most ETL packages include CRC computation and comparison code, and sample code is easy to find on the Internet; a sketch follows
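A minimal sketch of the CRC comparison using Python's zlib.crc32 (a 32-bit checksum rather than the roughly 20-digit number mentioned above); the record layout is hypothetical:

```python
import zlib

def record_crc(record):
    """Treat the whole incoming record as one string and compute its CRC."""
    as_string = "|".join(str(record[col]) for col in sorted(record))
    return zlib.crc32(as_string.encode("utf-8"))

def detect_change(todays_record, yesterdays_crc, yesterdays_record):
    """Return the list of changed fields, or [] if the record is unchanged."""
    if record_crc(todays_record) == yesterdays_crc:
        return []                      # identical to yesterday, nothing to do
    # CRC mismatch: fall back to a field-by-field comparison.
    return [col for col in todays_record
            if todays_record[col] != yesterdays_record.get(col)]
```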


To identify deletions
Read the source log file, if one exists
Determine that the CRC of a record in the DW matches no incoming record
Use the MINUS set operator to compare the DW dimension with the incoming dimension (usually works if the ETL staging table with today's dimension and the DW dimension are in the same database)
Delete the record only if this is valid for the business; in most cases you don't want to delete anything from the DW (a sketch follows)
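A minimal sketch of the set-difference approach; Oracle's operator is MINUS, while SQLite (used here) and most other engines call it EXCEPT. Table names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_dim    (customer_id TEXT);
    CREATE TABLE stage_customers (customer_id TEXT);
    INSERT INTO customer_dim    VALUES ('C1'), ('C2'), ('C3');
    INSERT INTO stage_customers VALUES ('C1'), ('C3');
""")

# Natural keys present in the DW but missing from today's extract:
deleted = conn.execute("""
    SELECT customer_id FROM customer_dim
    EXCEPT
    SELECT customer_id FROM stage_customers
""").fetchall()
print(deleted)   # [('C2',)] -- flag (or, rarely, delete) these records
```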


Applicable when a change happens to a dimension record but the old record remains valid as a second choice
Product category designations
Sales-territory assignments
Instead of creating a new row, a new column is added (if it does not already exist)
The old value is copied into the secondary column before the new value overwrites the primary column (see the sketch after this list)
Example: old category, new category
Usually defined by the business after the main ETL process is implemented
"Please move Brand X from Men's Sportswear to Leather Goods, but allow me to optionally track Brand X in the old category"
The old category is described as an Alternate reality
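A minimal Type-3 sketch in Python/SQLite: add the secondary column once, copy the old value into it, then overwrite the primary column. The product_dim table and the Brand X values are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, brand TEXT, category TEXT)")
conn.execute("INSERT INTO product_dim (brand, category) VALUES ('Brand X', 'Mens Sportswear')")

# Add the secondary "alternate reality" column once, if it does not exist yet.
existing_columns = {row[1] for row in conn.execute("PRAGMA table_info(product_dim)")}
if "old_category" not in existing_columns:
    conn.execute("ALTER TABLE product_dim ADD COLUMN old_category TEXT")

# Copy the old value aside, then overwrite the primary column.
conn.execute("""UPDATE product_dim
                SET old_category = category,
                    category     = 'Leather Goods'
                WHERE brand = 'Brand X'""")
conn.commit()
```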

Automated data movement across data stores and the analytical area in support of a data warehouse, data mart or ODS effort
Extensible, robust, scalable infrastructure
Standardization of ETL processes across the enterprise
Reusability of custom and pre-built functions
Better utilization of existing hardware resources
Faster change-control & management
Integrated meta-data management
Complete development environment with a "work as you think" design metaphor

Execute stored procedures at the source to extract data, perhaps filtering some
Store the extracted data into staging tables at the source
Generate a file with the source staging data
Transfer the file into the destination
Load the data into staging tables at the destination
Execute stored procedures at the destination to read the incoming staged data and compute what changed since the last time by comparing with the current value of the data warehouse
Populate the data warehouse with new changes
Supports incremental updates to sources
Supports simple data cleansing
Logs warning/errors

DataStage
big player, high ability to execute, gets good results from Gartner, visionary
Informatica
another big player, gets good results from Gartner, visionary with high ability to execute
most expensive
SAS ETL Server
fast becoming a major player, very positive results from Gartner
low exposure as an ETL tool (SAS a significant statistical analysis vendor)
Information Builders' Data Migrator/ETL Manager tool suite
part of Enterprise Focus/WebFocus
not a major player, but offers an industrial-strength language, data connectors, etc.
Sunopsis
cheap
relies on native RDBMS functionality
CIGNA people exposed to it at conferences liked it

Vendor tools promote standardization of the ETL process, reusability of custom and pre-built functions, lowering the time (and cost) of additional ETL efforts
Vendor ETL tools are somewhat self-documenting
Many tools can connect to a variety of sources (RDBMSs, non-relational, different OS, ERP, PeopleSoft, etc) without exposing the ETL developer to the differences
Vendor tools deal with changes in the source systems or the necessary transformations better, reducing long term ETL maintenance
Meta-data management is a huge advantage, especially when sharing data from many applications
Vendor ETL tools impose discipline on the extraction and transformation process
ETL prices have not dropped much over the years, but there is increased functionality, performance, and usability from vendor tools

There is a conflict between
getting the data from the operational systems as fast as possible, and
having the ability to restart without repeating the process from the beginning
Reasons for staging
Recoverability: stage the data as soon as it has been extracted from the source systems and immediately after major processing (cleaning, transformation, etc).
Backup: can reload the data warehouse from the staging tables without going to the sources
Auditing: lineage between the source data and the underlying transformations before the load to the data warehouse


The staging area is owned by the ETL team
no indexes, no aggregations, no presentation access, no querying, no service level agreements
Users are not allowed in the staging area for any reason
staging is a construction site
Reports cannot access data in the staging area
tables can be added or dropped without affecting the user community
Only ETL processes can read/write the staging area (ETL developers must capture table names, update strategies, load frequency, ETL jobs, expected growth and other details about the staging area)
The staging area consists of both RDBMS tables and data files

Flat files
fast to write, append to, sort and filter (grep) but slow to update, access or join
enables restart without going to the sources
XML Data Sets (not really used in Staging)
Relational Tables
Metadata, SQL interface, DBA support
Dimensional Model Constructs: Facts, Dimensions, Atomic Facts tables, Aggregate Fact Tables (OLAP Cubes)
Surrogate Key Mapping Tables
map natural keys from the OLTP systems to the surrogate keys of the DW
can be stored in files or in the RDBMS (with the RDBMS approach you can use the IDENTITY function to generate the keys); a sketch follows this list
Best Practices about these data structures:
perform impact analysis, capture metadata, use naming conventions
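A minimal sketch of a file-based surrogate key mapping maintained by the ETL process; the JSON file name and the customer example are hypothetical, and a mapping table with an IDENTITY column is the RDBMS alternative:

```python
import json, os

MAP_FILE = "customer_key_map.json"   # natural key -> surrogate key

def load_key_map():
    """Read the persisted mapping, or start empty on the first run."""
    if os.path.exists(MAP_FILE):
        with open(MAP_FILE) as f:
            return {k: int(v) for k, v in json.load(f).items()}
    return {}

def lookup_or_assign(natural_key, key_map):
    """Return the surrogate key for a natural key, assigning a new one if needed."""
    if natural_key not in key_map:
        key_map[natural_key] = max(key_map.values(), default=0) + 1
    return key_map[natural_key]

def save_key_map(key_map):
    with open(MAP_FILE, "w") as f:
        json.dump(key_map, f)
```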


Effectively integrate data from
different DBMS, OS, H/W, communication protocols
need a logical map, data movement view documents, data lineage report
have a plan
identify source candidates
analyze source systems with a data profiling tool
receive walk-through of data lineage and business rules (from the DW architect and business analyst to the ETL developer)
data alterations during data cleansing, calculations and formulas
measure twice, cut once
standard conformance to dimensions and numerical facts
receive walk-through of the dimensional model

Target table and column, table type (Dimension, Fact)
Slow-changing dimension type per target column
Type 1, overwrite (Customer first name)
Type 2, retain history (Customer last name)
Type 3, retain multiple valid alternative values
Source database, table, column
Transformations (the guts of the document)
Where do you capture this information? Which tool? How do you maintain this metadata?

Data modelers and ETL developers should maintain a documentation of the source systems including
subject area, application name, business name, department name
priority
business owner and technical owner
DBMS, production server
DB size, #users, complexity, #transactions per day
comments
Determine the system of record


Dealing with derived data (derive from base facts or accepted calculated columns from the source systems?)
if calculations are done by the DW infrastructure, the ETL developer is responsible for them
and what if the numbers don’t match?
recommendations: stay true to the definition of system-of-record

The further downstream you go from the originating data source, the more you increase the risk of extracting corrupt data. Barring rare exceptions, maintain the practice of sourcing data only from the system-of-record.
Analyze your source system
get an ER model for the system or reverse-engineer one (develop one by looking at the metadata of the system)
reverse engineering is not the same as forward engineering, i.e., deriving the dimensional schema of the data warehouse from the ER models of the source systems


Reverse engineering builds an understanding of a source system:
unique identifiers and natural keys
data types
relationships between tables (1-to-1, many-to-1, many-to-many), problematic when the source database does not have foreign keys defined
discrete relationships (static data, reference tables)
Data content analysis
NULL values, especially in foreign keys; NULLs result in lossy joins
In spite of the most detailed analysis, we recommend using outer-join logic when extracting from relational source systems (see the sketch after this list), simply because referential integrity often cannot be trusted on remote systems
Dates in non-date fields
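A minimal sketch of defensive extraction with an outer join, using SQLite for illustration; an order with a NULL or orphaned customer_id still comes through instead of silently disappearing from an inner join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT);
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders    VALUES (100, 1), (101, NULL), (102, 9);  -- 9 is an orphan FK
""")

rows = conn.execute("""
    SELECT o.order_id,
           COALESCE(c.customer_name, 'Unknown') AS customer_name
    FROM   orders o
    LEFT OUTER JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)   # all three orders survive; an inner join would silently drop two
```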


What is the standard for the enterprise?
ODBC, OLE DB, JDBC, .NET
access databases from Windows applications, so that applications are portable
performance is the major drawback
every DBMS has an ODBC driver, even flat files
Adds two layers of interaction between the ETL and the database


Mainframe
COBOL copybooks give you the datatypes
EBCDIC and not ASCII character set (FTP does the translation between the mainframe and Unix/Windows)
Working with redefined fields (To save space the same field is used for different types of data)
Extracting from IMS, IDMS, Adabas
you need special adapters, or you get someone on those systems to give you a file
XML sources, Web log files: doable, if you understand the structure of those sources
Enterprise-Resource-Planning ERP Systems (SAP, PeopleSoft, Oracle)
Don’t treat it as a relational system -- it’s a mess
Use adapters


Using Audit Columns
Use the last update timestamp, populated by triggers or the front-end application
Must ensure that the timestamp is dependable, that is, if the front-end application modifies it, a batch job does not override it
Index the timestamp if it's dependable
Database log scraping or sniffing
Take the source database log and try to determine the transactions that affect you
Sniffing does the same in real time
Timed extracts
Retrieve all records from the source that were modified today
POTENTIALLY dangerous -- what if the process fails today? When it runs tomorrow, you’d have lost today’s changes
Process of elimination
Preserve yesterday's data in the staging area
Bring today's entire data into the staging area
Perform a comparison
Inefficient, but the most reliable
Initial and Incremental Loads
Create two tables, previous-load and current-load
Load into the current-load table and compare with the previous-load table; when you are done, drop the previous-load table, rename the current-load table to previous-load, and create a new current-load (sketched below)
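A minimal sketch of the previous-load / current-load comparison in SQLite; the set difference returns the new and changed rows, and the tables are then swapped for the next run:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE previous_load (customer_id TEXT, name TEXT);
    CREATE TABLE current_load  (customer_id TEXT, name TEXT);
    INSERT INTO previous_load VALUES ('C1', 'Ann'), ('C2', 'Bob');
    INSERT INTO current_load  VALUES ('C1', 'Ann'), ('C2', 'Robert'), ('C3', 'Cy');
""")

# Rows that are new or changed since yesterday (set difference).
changes = conn.execute("""
    SELECT customer_id, name FROM current_load
    EXCEPT
    SELECT customer_id, name FROM previous_load
""").fetchall()
print(changes)   # the changed C2 row and the new C3 row

# Swap: current-load becomes the previous-load for tomorrow's run.
conn.executescript("""
    DROP TABLE previous_load;
    ALTER TABLE current_load RENAME TO previous_load;
    CREATE TABLE current_load (customer_id TEXT, name TEXT);
""")
```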


Constrain on indexed columns
Retrieve only the data you need
Use DISTINCT sparingly
Use the SET operations sparingly
Use hints (a hint tells the DBMS to make sure it uses a certain index)
Avoid NOT
Avoid functions in the WHERE clause
Avoid subqueries


When a dimension is populated by several distinct systems, it is important to include the unique identifier from each of those systems in the target dimension in the data warehouse. Those identifiers should be viewable by end users, to reassure them that the dimension reflects data they can tie back to their transaction systems.


Primary key (PK)
Meaningless, unique integer
Also known as the surrogate key
Joins to Fact Tables
Is a Foreign Key to Fact Tables
Natural key (NK)
Meaningful key extracted from source systems
1-to-1 relationship to the PK for static dimensions
1-to-many relationship to the PK for slowly changing dimensions, tracks history of changes to the dimension
Descriptive Attributes
Primarily textual; numeric attributes are legitimate, but not numbers that are measured quantities
100 such attributes is normal
Static or slowly changing only
Product price -- either fact or dimension attribute


Via triggers in the DBMS
Read the latest surrogate key, generate the next value, create the record
Disadvantages: severe performance bottlenecks
Via the ETL process: an ETL tool or a third-party application generates the unique numbers
A surrogate key counter per dimension
Maintain consistency of surrogate keys between dev, test and production
Using Smart Keys
Concatenate the natural key of the dimension in the source(s) with the timestamp of the record in the source or the Data Warehouse.
Tempting but wrong

By definition
Surrogate keys are supposed to be meaningless
Do you update the concatenated smart key if the natural key changes?
Performance
Natural keys may be chars and varchars, not integers
Adding a timestamp to it makes the key very big
The dimension is bigger
The fact tables containing the foreign key are bigger
Joining facts with dimensions on chars/varchars becomes inefficient
Heterogeneous sources
Smart keys work for homogeneous environments, but more likely than not the sources are heterogeneous, each having its own definition of the dimension
How does the definition of the smart key change when another source is added? It doesn't scale well.
One advantage: simplicity in the ETL process


The definition of the key of the dimension in business terms: what does the dimension represent?
Challenge: analyze the source systems so that a particular set of fields in that source corresponds to the grain of the dimension
Verify that a given source (file) implements the intended grain by grouping on the candidate grain fields and looking for duplicates (see the query sketched below)
Nothing should be returned by this check against the source system/file
If something is returned, the fields A, B and C do not represent the grain of the dimension
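A hedged sketch of the grain check, assuming candidate grain fields A, B and C in a hypothetical source_extract table; the query should return no rows if those fields really define the grain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_extract (A TEXT, B TEXT, C TEXT, other TEXT);
    INSERT INTO source_extract VALUES ('a1','b1','c1','x'), ('a1','b1','c1','y');
""")

violations = conn.execute("""
    SELECT A, B, C, COUNT(*) AS n
    FROM   source_extract
    GROUP  BY A, B, C
    HAVING COUNT(*) > 1
""").fetchall()
print(violations)   # [('a1', 'b1', 'c1', 2)] -- so A, B, C is NOT the grain here
```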

Simple Case: the dimension is loaded as a lookup table
Typical Case
Data cleaning
Validate the data, apply business rules to make the data consistent, column validity enforcement, cross-column value checking, row de-duplication
Data conforming
Align the content of some or all of the fields in the dimension with fields in similar or identical dimensions in other parts of the data warehouse
Fact tables: billing transactions, customer support calls
IF they use the same dimensions, then the dimensions are conformed
Data Delivery
All the steps required to deal with slow-changing dimensions
Write the dimension to the physical table
Creating and assigning the surrogate key, making sure the natural key is correct, etc.

Dimensions are de-normalized flat tables
It is normal to have multiple embedded hierarchical structures in the same dimension
Eg. Store dimension
Geographic hierarchy: location, city, state, country
Merchandizing hierarchy: location, district, region
Can co-exist in the same dimension IF every attribute is single valued in the presence of the dimension’s primary key
If some attribute takes multiple values, it is not part of the dimension
E.g., a store can have multiple cash registers; if the grain of the dimension is the store, the cash register cannot be part of the store dimension. To include the cash register in the dimension, the grain must be changed to the cash register, not the store
Normalized structures (3NF)
ok to exist in the staging area
should be flattened (2NF) before the delivery phase, using a simple query over the normalized structures
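A minimal sketch of flattening normalized staging tables (the product, brand and category tables here are hypothetical) into one denormalized dimension with a simple join query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE brand    (brand_id INTEGER PRIMARY KEY, brand_name TEXT,
                           category_id INTEGER);
    CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_name TEXT,
                           brand_id INTEGER);

    -- After the staging tables are populated, flatten them into one
    -- denormalized dimension for delivery.
    CREATE TABLE product_dim AS
        SELECT p.product_id  AS product_natural_key,
               p.product_name,
               b.brand_name,
               c.category_name
        FROM   product  p
        JOIN   brand    b ON b.brand_id    = p.brand_id
        JOIN   category c ON c.category_id = b.category_id;
""")
```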


Dimensions that remain normalized
They look like a snowflake if the levels of the hierarchy obey perfect many-to-1 relationships

There is no semantic difference; the information content of the two approaches is the same
Problems with snowflakes
If the hierarchical model changes, the normalized table schema and the declared joins must change, as well as the resulting end-user queries
Snowflakes are more confusing to the end user; flat dimensions mask the complexity of the data
However, appropriate snowflakes do exist (as sub-dimensions of dimensions)


Virtually everywhere: measurements are defined at specific times, repeated over time, etc.
Most common: calendar-day dimension with the grain of a single day, many attributes
Doesn’t have a conventional source:
Built by hand, in a spreadsheet
Holidays, workdays, fiscal periods, week numbers, last-day-of-month flags must be entered manually
10 years is about 4K rows
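A minimal sketch of generating the calendar-day rows in Python; holidays, fiscal periods and similar attributes would still be merged in by hand:

```python
from datetime import date, timedelta

def build_date_dim(start=date(2000, 1, 1), years=10):
    """Generate one row per calendar day for the given span."""
    rows, day = [], start
    end = date(start.year + years, start.month, start.day)
    while day < end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),   # YYYYMMDD style key, or a meaningless integer
            "full_date": day.isoformat(),
            "day_of_week": day.strftime("%A"),
            "month_name": day.strftime("%B"),
            "year": day.year,
            "is_weekend": day.weekday() >= 5,
        })
        day += timedelta(days=1)
    return rows

dim = build_date_dim()
print(len(dim))   # roughly 3,650 rows for 10 years
```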

Note the Natural key: a day type and a full date
Day type: date and non-date types such as inapplicable date, corrupted date, hasn’t happened yet date
fact tables must point to a valid date from the dimension, so we need special date types, at least one, the N/A date
How to generate the primary key?
Meaningless integer?
Or 10102005 meaning Oct 10, 2005 ? (reserving 9999999 to mean N/A?)
This is a close call, but even if meaningless integers are used, the numbers should appear in numerical order (why? Because of data partitioning requirements in a DW, data in a fact table can be partitioned by time)


Also typically needed are time dimensions whose grain is a month, a week, a quarter or a year, if there are fact tables in each of these grains
These are physically different tables
They are generated by eliminating selected columns and rows from the date dimension, keeping either the first or the last day of the month
Do NOT use database views
A view would drag a much larger table (the date dimension) into a month-based fact table


How about a time dimension based on seconds?
There are over 31 million seconds in a year!
Avoid them as dimensions
But keep the SQL date-timestamp data as basic attributes in facts (not as dimensions), if needed to compute precise queries based on specific times
Older approach: keep a dimension of minutes or seconds and make it based on an offset from midnight of each day, but it’s messy when timestamps cross days
Might need something fancy though if the enterprise has well defined time slices within a day such as shift names, advertising slots -- then build a dimension


BIG
Examples: Customer, Product, Location
Millions of records with hundreds of fields (insurance customers)
Or hundreds of millions of records with few fields (supermarket customers)
Always derived from multiple sources
These dimensions should be conformed


SMALL
Examples: Transaction Type, Claim Status
Tiny lookup tables with only a few records and one or more columns
Built by typing into a spreadsheet and loading the data into the DW
These dimensions should NOT be conformed
JUNK dimension: a tactical maneuver to reduce the number of FKs from a fact table by combining the low-cardinality values of several small dimensions into a single junk dimension; generate the rows as you go, don't generate the cartesian product
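A minimal sketch of building a junk dimension as you go: each distinct combination of low-cardinality flags observed in the data gets one surrogate key, and the cartesian product is never materialized. The flag names are hypothetical:

```python
# (payment_type, order_status, gift_flag) -> surrogate key, built on demand.
junk_rows = {}

def junk_key(payment_type, order_status, gift_flag):
    """Return the junk-dimension key for this flag combination, creating it if new."""
    combo = (payment_type, order_status, gift_flag)
    if combo not in junk_rows:
        junk_rows[combo] = len(junk_rows) + 1   # next surrogate key
    return junk_rows[combo]

# Fact rows reference a single junk FK instead of three tiny dimension FKs.
fk1 = junk_key("credit", "shipped", "N")
fk2 = junk_key("cash",   "returned", "Y")
fk3 = junk_key("credit", "shipped", "N")   # same combination -> same key as fk1
```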
