Tuesday, March 04, 2008

ETL Datawarehouseing important questions and answers

1 What is the difference between a data warehouse and a data mart?
Dataware house:

It is a collection of data marts. Represents historical data. a data warehouse is a relational database which is specially designed for analysis purpose rather then for transactional purpose.



Data mart:

It is a sub set of data ware housing.It can provide the data to analyze query reporting

& analysis. a datmart is subject oriented database which gives the data about each and every individual department in an organisation.


2 how can one connect two fact tables ? is it possible ? how?
This confirm dimenstion methodology.If a dimension table is connected to more then one Fact table is called confirm dimension.

Fact Tables are connected by confirmed dimensions, Fact tables cannot be connected directly, so means of dimension we can connect

3 suppose data are coming from different locations and those data will not change . is there any need to use surrogate key ?
Yes, We should use the surrogate key, here we are getting data from different locations means every one have one primary key, while transforming the data into target that time more than two key not in use so if you use surrogate key it will identified the duplicate fields in dimensional table.

4 what is the difference between aggregate table and fact table ? how do you load these two tables ?
A fact table typically has two types of columns: those that contain numeric facts (often called measurements), and those that are foreign keys to dimension tables. A fact table contains either detail-level facts or facts that have been aggregated. Fact tables that contain aggregated facts are often called summary tables or aggregated fact. A fact table usually contains facts with the same level of aggregation. Though most facts are additive, they can also be semi-additive or non-additive. Additive facts can be aggregated by simple arithmetical addition. A common example of this is sales. Non-additive facts cannot be added at all. An example of this is averages. Semi-additive facts can be aggregated along some of the dimensions and not along others. An example of this is inventory levels, where you cannot tell what a level means simply by looking at it.


5 What Oracle features can be used to optimize my Warehouse system?
Partition table, bitmap index, sequence ,table function ,sql loader ,function like cube ,roll_up etc.


6 When should you use a STAR and when a SNOW-FLAKE schema?
The snowflake and star schema are methods of storing data which are multidimensional in nature (i.e. which can be analysed by any or all of a number of independent factors) in a relational database .The snowflake schema (sometimes called snowflake join schema) is a more complex schema than the star schema because the tables which describe the dimensions are normalized.Snowflake schema is nothing but one dimension table will be connected to another dimension table and so on.

------------

Snowflake

------------

? If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snow flaking may

be appropriate.

? A multidimensional view is sometimes added to an existing transactional database to aid reporting. In this case, the tables which describe the dimensions will already exist and will typically be normalized. A snowflake schema will hence be easier to implement.

? A snowflake schema can sometimes reflect the way in which users think about data. Users may prefer to generate queries using a star schema in some cases, although this may or may not be reflected in the underlying organization of the database.

? Some users may wish to submit queries to the database which, using conventional multidimensional reporting tools, cannot be expressed within a simple star schema. This is particularly common in data mining of customer databases, where a common requirement is to locate common factors between customers who bought products meeting complex criteria. Some snow flaking would typically be required to permit simple query tools such as Cognos

Power play to form such a query, especially if provision for these forms of query weren't anticipated when the data warehouse was first designed.


---------

Star

----------

The star schema (sometimes referenced as star join schema) is the simplest data warehouse schema, consisting of a single "fact table" with a compound primary key, with one segment for each "dimension" and with additional columns of additive, numeric facts.The star schema makes multi-dimensional database (MDDB) functionality possible using a traditional relational database. Because relational databases are the most common data management system in organizations today, implementing multi-dimensional views of data using a relational database is very appealing. Even if you are using a specific MDDB solution, its sources likely are relational databases. Another reason for using star schema is its ease of understanding. Fact tables in star schema are mostly in third normal form (3NF), but dimensional tables are in de-normalized second normal form (2NF). If you want to

normalize dimensional tables, they look like snowflakes (see snowflake schema) and the same problems of relational databases arise - you need complex queries and business users cannot easily understand the meaning of data. Although query performance may be improved by advanced DBMS technology and hardware, highly normalized tables make reporting difficult and applications complex.

7 When should one use an MD-database (multi-dimensionaldatabase) and not a relational one?
1 Because More than one dimensions can be shareble for Other Department


2 The Physical Load will be less.


3 Less Complexity of Fact


8 What is the difference between an ODS and a W/H?
An ODS is an environment that pulls together, validates, cleanses and integrates data from disparate source application systems. This becomes the foundation for providing the end-user community with an integrated view of enterprise data to enable users anywhere in the organization to access information for strategic and/or tactical decision support, day-to-day operations and management reporting.


The defination of Data Warehouse is as follows.

? Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;

? Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;

? Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting;

? Integrated, meaning that the database contains data from most or all of an organization's operational applications, and that this data is made consistent.


Difference

------------

Ods

--------

Transactions similar to those of an Online Transaction Processing System

Data Warehouse

--------------

Queries process larger volumes of data


Ods

--------

Contains current and near current data

Data Warehouse

------------

Contains historical data.


Ods

-----------

Typically detailed data only, often resulting in very large data volumes


Data Warehouse

--------------

Contains summarised and detailed data,generally smaller in size than on ODS


Ods

--------

Real-time and near real-time data loads

Data Warehouse

------------

Typically batch data loads


Ods

--------

Generally modeled to support rapid data update


Data Warehouse

-----------------

Generally dimensionally modeled and tunes to optimise query performance


Ods

------

Updated at the data field leve

Data Warehouse

---------------

Data is appended, not updated


Ods

-------

Used for detailed decision making and

operational reporting

Data Warehouse

----------------

Used for ling-term decision making and

management reporting


Ods

-----

Knowledge workers (customer service

representatives, line managers)

Data Warehouse

-------------

Strategic audience (executives, business

unit management)

9 What is the difference between a W/H and an OLTP application?
Warehouse is used for high level data analysis purpose .It is used for predictions, time series analysis, financial analysis , what -if simulations etc. Basically it is used for better decision making.


OLTP is NOT used for analysis purpose.It is used for transaction and data processing.

Its basically used for storing day-to-day transactions that take place in an organisation.

The main focus of OLTP is easy and fast inputing of data, while the main focus in data warehouse is easy retrieval of data.

OLTP doesnt store historical data.(this is the reason why it cant be used for analysis)

DW stores historical data.


10 What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
ROLAP

ROLAP stands for Relational Online Analytical Process that provides multidimensional analysis of data, stored in a Relational database(RDBMS).

MOLAP

------

MOLAP(Multidimensional OLAP), provides the analysis

of data stored in a multi-dimensional data cube.

HOLAP

------

HOLAP(Hybrid OLAP) a combination of both ROLAP and MOLAP can

provide multidimensional analysis simultaneously of data

stored in a multidimensional database and in a relational

database(RDBMS).

DOLAP

-----

DOLAP(Desktop OLAP or Database OLAP)provide multidimensional

analysis locally in the client machine on the data collected

from relational or multidimensional database servers.

11 what are the types of dimension tables
1.Confirmed Dimension.

2.Junk Dimension.

3.Degenerated Dimension.

4.Slowly changing Dimensions.


12 What is a Data Cube?
Data cube is the logical representation of multidimensional data .The edge of the cube contains dimentions and the body of the cube contains datas.


13 How does data mining and data warehousing work together?
data warehousing is used to store the historical data.by using dwh bsiness users can analize thier business.data mining is used to predict the future .dwh will act as the source for the data mining


14 What is Bulk Insert?
In informatica ..bulk insert or bulk load does 2 things :-


1) Ignores the commit interval specified at the session level.

2) Do not create a database session log file.

So, advantage is it's very fast as no entry goes into log.

Disadvantage is session cannot be rolled back as no entry exists in the log file .


15 What is surrogate key?
A surrogate key in a database is a unique identifier for either an entity in the modeled world or an object in the database. The surrogate key is not derived from application data.There appear to be two definitions of a surrogate in the literature. We shall call these surrogate (1) and surrogate

(2):Surrogate (1) This definition is based on that given by Hall, Owlett and Todd (1976). Here a surrogate represents an entity in the outside world. The surrogate is internally generated by the system but is nevertheless visible by the user or application.


Surrogate (2) This definition is based on that given by Wieringa and de Jung (1991). Here a surrogate represents an object in the database itself. The surrogate is internally generated by the system and is invisible to the user or

application.


16 What is a diff between joiner and lookup transformation?
Joiner will join the two different data sources based on a

join condition ,and pass only the rows which satisfy that

condition.discards the remaining rows.


Joiner transformation supports 4 types of joins at

Informatica level

Normal

Master Outer

Detail Outer

Full Outer


LookUp Transformation


Lookup transformation basically for Reference ,based on the

lookup condition .when u want some data based on target

data ,will take lookup on that particular table and

retrieve the corresponding fields from that table.

we can override the lookup transformation using the SQL

query.

17 What is the difference between View and Materialized View ?
A view has a logical existence but a materialized view has

a physical existence .Moreover a materialized view can be

indexed , analisied and so on....that is all the things that

we can do with a table can also be done with a materialized

view.


18 How could we generate the sequence of key values without using sequence generator transformation in the target ?

Do a lookup on the Target table with an Lookup SQl Override

Select MAX(FIELD_NAME), field 1 , field3 from target group

by field1, field2...


In the Expression increment the Max values of the field

which you just got from the lookup by 1.


Here MAX_FIELDNAME) is the Max value of the field you want

to generate the sequence of..

19 why we need to use unconnected transformation?
Unconencted is used when ever u want to call the same

transformation several times and u have one return port.


We use unconnected transformation to use multiple number

of tables or views without physically taking the entity

into mapping.This kind of transformation is also helpful

when single return port is required.


Use dynamic cache if u want to update the case while

updating the target table itself and static is untouched

with the cache.


20 Whats the difference between $, $$, $$$
$ - These are the system variables like $Bad file,$input

file, $output file, $DB connection



$$ - Can any one tell me the scenario with example for

user defined variables


$$$ - $$$SessStartTime


$$$SessStartTime returns the initial system date value on

the machine hosting the PowerCenter Server when the server

initializes a session. $$$SessStartTime returns the session

start time as a string value. The format of the string

depends on the database you are using.


21 What is Factless fact table ???
A Fact table without measures(numeric data) for a column is

called Factless Fact table.

FACT LESS FACT TABLES ARE USED TO CAPTURE DATE TRANSACTION

EVENTS


22 explain the scenario for bulk loading and the normal loading option in Informatica Work flow manager ???

Normal: In this case server manager allocates the

resources(Buffers) as per the parameter settings. It creates

the log files in database.


Bulk: In this case server manager allocates maximum

resources(Buffers) available irrespective of the parameter

settings. It will not create any log files in database.


In first case data loading process will be time taking

process but other applications are not affected. While in

bulk data loading will be much faster but other application

are affected.


Normal Load: It loads the record one by one and writes log each file.It will take more time to complete.

Bulk Load: Load the number of records at a time ,It wont fallow ant log files or trace levels,It takes less time .


23 Why is meant by direct and indirect loading options in sessions?
we use file type direct when we are loading single file into target. we use Indirect when we want to load multiple files through single session in the mapping


24 What is the method of loading 5 flat files of having same structure to a single target and which transformations I can use?
This can be handled by using the file list in informatica.

If we have 5 files in different locations on the server and

we need to load in to single target table.In session

properties we need to change the file type as Indirect.


am taking a notepad and giving following paths and file

names in this notepad and saving this notepad as

emp_source.txt in the directory /ftp_data/webrep/


/ftp_data/webrep/SrcFiles/abc.txt

/ftp_data/webrep/bcd.txt

/ftp_data/webrep/srcfilesforsessions/xyz.txt

/ftp_data/webrep/SrcFiles/uvw.txt

/ftp_data/webrep/pqr.txt


In session properties i give /ftp_data/webrep/ in the

directory path and file name as emp_source.txt and file type

as Indirect.

25 On a day, I load 10 rows in my target and on next day if Iget 10 more rows to be added to my target out of which 5 are updated rows how can I send them to target? How can I insert and update the record?
the best way to do this use the slowy changing dimension in

the mappings-->wizzard-->slowly changing dimension-->type1


here u need to select the source and target tables.


take a look t/f and a update strategy t/f. basing on look

up , if the record exits in target then reject it ,if not

exit insert it,and if the record exist but it is changed

then update it


26 In real time scenario where can we use mapping parameters and variables?
Before using mapping parameters and mapping variables we

should declare these things in mapping tab of mapping

designer.


A mapping parameter cannot change untill the session has

completed unless a mapping variable can be changed in

between the session.


Example:::

if we declare mapping parameter we can use that parameter

untill completing the session,but if we declare mapping

variable we can change in between sessions.Use mapping

variable in Transcation Control Transformation......


27 By using Filter Transformation,How to pass rows that doesnot satisfy the condition(discarded rows) to another target?
Connect the ports of the filter transformation to the

second target table and enable the 'FORWARD REJECTED ROWS'

in the properties of the filter transformation. the

rejected rows will be forwarded to this table.



Well You can Use Router iif you need rejected rows along

with satisfied rows otherwis you just give the condition

for filter tx as you want it in your target table.


28 A table contains some null values .how to get ( Not Applicable) in place of that null value in target ?
with the help of ISNULL() function of the Informatica


In the column properties sheet, write N/A in the Default

value text box for the particular column


29 WHAT IS FACT TABLES?

Fact table is the primary table in the dimensional

modeling . the numerical performance of measures of the

business stored in fact table

mostly used facts ara numeric and additive

not every numeric is a fact ,but a numeric wich ara of type

key performance indicator is called fact



30 WHAT IS UPDATE OVERRIDE . DIFFERENCE BETWEEN SQL OVERRIDE AND UPDATE OVERRIDE ?
Update Override it is an option available in TARGET instance .By defalut Target table is updated based on Primary key values .To update the Target table on non primary key values u can generate the default Querey and override the Querey according to the requiremnet.Suppose for example u want to update the record in target table When a column value='AAA' then u can include this condition in where clause of default Querey.


Coming to SQL override it is an option available in Source

Qualifier and Lookup transafornmation where u can inlude

joins ,filters,Group by and order by.


31 In a mapping i have three dimensions. If i want to pass a same surrogate key value to all the three dimensions by using one sequence generator is possible?If the mapping is containing single flow? And in the same case if the mapping is contaning 3 flows for the three dimensions then by using one sequence generator can we populate surrogate key (same value) to all the three dimensions?
Use the Sequence and Expresion transfermations.first

genarate the surrogate with Seq trans,then send values to

exp trans,connect the exp trans o/p ports to 3 dimentions.


First Seq generate the surrogate key like 1,2,3,4,5

we wil pass this column to next tran(Exp) from there we

will connect o/p port to dimentions.so '1' wil go all

dimentions,then '2' wil go then '3' .....



32 what r the transformations that r not involved in mapplet?

1.Normalizer transformations

2.COBOL sources

3.XML Source Qualifier transformations

4.XML sources

5.Target definitions

6.Other mapplets

7.Pre- and post- session stored procedures

8.sequence generator

33 what is data driven?

data driven is the instruction fed to informatica server

wheather to insert/update/delete row whenever using update

strategy transformation



34 what are the transformations that are used in data cleansing ? and how data cleansing takes place ?
Expression tranformation is used in data cleansing.


If ur target table consist of not null column and source

table consisting null columns so assign some value in

expression tr. and then pass data to target.



35 TWO FLAT FILES ARE THERE , EACH HAVING NO MATCHING COLUMNS . HOW CAN U JOIN THESE TWO USING JOINER TRANSFORMATION ?
This can be done by passing all ports to an expression

transformation and then creating a output port say ID=1 in

both the expression transformation of each file and then

join it using a joiner on ID,hope this helps..



36 HOW TO GET THE LATEST DATA IN SCD ?
IN THREE WAYS WE CAN GET THE LATEST RECORD IN SCD ...


1)IN "SCD TYPE 2 TIME STAMP" FOR LATEST RECORD THE "END

DATE" FIELD WILL BE BLANK. THAT MEANS IT IS THE NEW ROW .


2)IN "SCD TYPE 2 FLAG " THE FLAG NUMBER OF THE NEW ROW WILL

BE ONE .


3) IN " SCD TYPE 2 VERSION " THE LATEST RECORD IS HAVING

THE MAXIMUM VERSION NUMBER


37 what is the function of 'F10' informatica ?
F10 and F5 are used in debugging process


By pressing F10, the process will move to the next

transformation from the current transformation and the

current data can be seen in the bottom panel of the window..


whereas F5 will process the full data at a stretch..in case

of F5, u can see the data in the targets at the end of the

process but cannot see intermediate transformation values.



38 two types of data are there . one is mainframe and the other is ascii format . in informatica how can you get both the data in a single format in ascii .


39 WHAT IS UPDATE OVERRIDE . DIFFERENCE BETWEEN SQL OVERRIDE AND UPDATE OVERRIDE ?
Coming to SQL override it is an option available in Source

Qualifier and Lookup transafornmation where u can inlude

joins ,filters,Group by and order by.


Update Override it is an option available in TARGET

instance. By defalut Target table is updated based on

Primary key values.To update the Target table on non

primary key values u can generate the default Querey and

override the Querey according to the requiremnet.Suppose

for example u want to update the record in target table

When a column value='AAA' then u can include this condition

in where clause of default Querey.


40 WHAT IS THE NAME OF THAT PORT IN DYNAMIC CACHE WHICH IS USED FOR INSERT , UPDATE OPRATION ?
New lookup now


41 what are the reusable tasks in informatica ?

command task

session task

email task



42 how the server recognises , if the session fails after loading the 100 records in to the target
Based on commit intervel session commits those many records

into target. suppose if commit intervel is 1000 if session

fails after 100 records it won't insert not a single record

into target.


43 How do you take care of security using a repository manager
REPOSITORY PRIVILAGES

FOLDER PERMISSION (OWNERS,GROUPS,USERS)

LOCKS (READ,WRITE,EXECUTE.FETCH ,SAVE)


44 if the session fails after 100 records agian we have to starts the session or we go for recovery session
informatcia server has 3 methods to recovering the

sessions.

1)Run the session again if the Informatica Server has not

issued a commit.

2)Truncate the target tables and run the session again if

the session is not recoverable.

3)Consider performing recovery if the Informatica Server

has issued at least one commit.

Use performing recovery to load the records from where the

session fails.


45 what r the values tht r passed between informatics server and stored procedure?

There are 3 types of data passing between informatica

server and stored proceduer these are

Input/Output parametors: Stored procedure it receive the

inputs and porvied the outputs.

Return Value:Ever data base to provied return value after

processing of stored procedure.

Status code: It is used for error handling.


46 why do u use shortcuts in informatica.?
Shortcut is a concept of reusability.If there is a mapping

that can be reused across several folders, create it in one

folder and use shortcuts of it in other folders. Thus, if

you have to make change, you can do it in main mapping

which reflects in shortcut mappings automatically.




47 scd methodology?

Type1: No historical datas will be available, when changes

are made the old data will be deleted and the new data will

be insrted.


Type2: Flag:The old data will be denoted as false and the

new data will be denoted as true.


Version:The changes made will be numbered as 0,1,2...so on.


Date:The changes along with the date in which they are made

are clearly mentioned.


Type3:The latest change which is made is alone available.



48 What is the filename which you need to configure in UNIX while installing infromatica?
in informatica 7, under $PMRootDir there is one utility

(script) called pmconfig exist, through it we can configure the inforamtica

49 what is confirmed dimension?
Conformed dimension is a dimension which is connected to or

shared by more than one fact table.


Eg:A business which takes care of both sales and orders of

products then product dimension becomes a conformed

dimension for both sales fact and order fact



50 surrogate keys usage in Oracle and Informatica?
surrogate key is one type of key which is used to maintain

history .

it is used in slowly changing dimension (scd)


If u have mulitiple records then for mentain this records we

need to generate surrogate key in informatica




51 What happens if you increase commit intervals and also decrease commit Explain grouped cross tab?
if you have increased your commit interval to ~25000 rows

the session will run faster but when your session fails at

24000th record you will not have any data in your target.


When you decrease your commit interval to ~10000 rows your

session will be slow when compared to previous but if the

session fails at 24000th record you will lose only 4000

records.


if commit interval is set to high value, performance will

be high. if commit is given for every 1000 rows say for eg,

it will affect the performance badly

52 What is an MDDB? What is the difference between MDDBs and RDBMSs?

MDDB stands for Multi Dimensional Database

MDDB: In MDDB, it views data in mutidimensional

(Perspective)i.e.,through various dimensions at a time with

the help of cubes developed using dimensions and stores

data in multidimensional i.e., stores data in power cubes.

In power cubes, each axis is a dimension and each member of

dimension is column.

In MDDB, at a glance we can see the dimensions and data

present in the dimensions

RDDB: In RDDB, it views data in two dimensional and stores

data in two dimensional i.e., in rows and columns in a

table .

In RDDB, we can just see the rows and columns, but only

after issuing select over that u can see the data.

No comments: