Monday, October 1, 2007

Types of Dimensions

Conformed Dimensions, Junk Dimensions, and Degenerate Dimensions

Conformed Dimensions (CD): these are dimensions that are built once in your model and can be reused multiple times with different fact tables. For example, consider a model containing multiple fact tables, each representing a different data mart. Now look for a dimension that is common to these fact tables. In this example, let's say the product dimension is common; it can therefore be reused by creating shortcuts and joining it to the different fact tables. Typical examples are the time dimension, the customer dimension, and the product dimension.
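
As a rough sketch of that reuse (the table and column names below are made up for illustration), the same product dimension can serve two different fact tables:

-- Two hypothetical data marts sharing one conformed product dimension
SELECT p.product_name, SUM(s.sales_amount) AS total_sales
FROM   fact_sales s
JOIN   dim_product p ON p.product_key = s.product_key
GROUP BY p.product_name;

SELECT p.product_name, SUM(r.returned_qty) AS total_returns
FROM   fact_returns r
JOIN   dim_product p ON p.product_key = r.product_key
GROUP BY p.product_name;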

Junk Dimensions (JD): instead of having hundreds of small dimensions, each with only a few records, cluttering your database with mini 'identifier' tables, you consolidate them: all records from these small dimension tables are loaded into ONE dimension table, which we call a junk dimension (since we are storing all the junk in this one table). For example, a company might have a handful of manufacturing plants, a handful of order types, and so on, and these can be consolidated into one junk dimension table.
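
As a hedged sketch (the table and column names are hypothetical), the handful of mini 'identifier' tables can be collapsed into a single table like this:

-- One junk dimension replacing several tiny code/flag tables
CREATE TABLE dim_junk (
    junk_key       INTEGER PRIMARY KEY,
    plant_code     VARCHAR(10),   -- from the small manufacturing-plant table
    order_type     VARCHAR(10),   -- from the small order-type table
    shipping_mode  VARCHAR(10)    -- from yet another mini identifier table
);
-- The fact table then carries a single junk_key instead of several tiny foreign keys.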

Degenerate Dimension (DD): an item that sits in the fact table but is stripped of its description, because the description belongs in a dimension table, is referred to as a degenerate dimension. It looks like a dimension, but it actually lives in the fact table and has been 'degenerated' of its description, hence the name.
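
For illustration only (hypothetical names), an order number kept in the fact table with no dimension table of its own is a typical degenerate dimension:

-- order_number is a degenerate dimension: an identifier left in the fact table
CREATE TABLE fact_order_line (
    order_number  VARCHAR(20),    -- no dim_order table exists for it
    product_key   INTEGER,        -- foreign key to dim_product
    date_key      INTEGER,        -- foreign key to dim_date
    quantity      INTEGER,
    sales_amount  DECIMAL(12,2)
);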

Now coming to slowly changing dimensions (SCD) and slowly growing dimensions (SGD): I would classify these as attributes of a dimension itself. Although others might disagree with this view, slowly changing dimensions are basically those dimensions whose key value remains static but whose description might change over a period of time. For example, the product id in a company's product line might remain the same, but the description might change from time to time; hence, the product dimension is called a slowly changing dimension.

Let's consider a customer dimension, which has a unique customer id, but the customer name (company name) might change periodically due to buyouts or acquisitions. Hence it is a slowly changing dimension: the customer number is static but the customer name changes. On the other hand, the company will keep adding customers to its existing list, and it is highly unlikely that it will acquire an astronomical number of customers overnight (wouldn't the company CEO love that?). Hence, the customer dimension is both a slowly changing and a slowly growing dimension.
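
One common way to record such a name change, shown here only as an Oracle-flavored sketch with made-up names and key values (a 'Type 2' style change), is to expire the old dimension row and insert a new version:

-- Expire the current row for customer 1001...
UPDATE dim_customer
SET    current_flag = 'N',
       end_date     = SYSDATE
WHERE  customer_id  = 1001
AND    current_flag = 'Y';

-- ...then insert a new version carrying the new company name
INSERT INTO dim_customer
       (customer_key, customer_id, customer_name, start_date, end_date, current_flag)
VALUES (90017, 1001, 'NewCo Inc.', SYSDATE, NULL, 'Y');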

Tuesday, August 14, 2007

How do you identify the bottlenecks in Mappings?

Bottlenecks can occur in

A. Targets

The most common performance bottleneck occurs when the Informatica server writes to a target database. You can identify a target bottleneck by configuring the session to write to a flat file target. If the session performance increases significantly when you write to a flat file, you have a target bottleneck.

Solutions:
Drop or disable indexes or constraints before the load.
Perform a bulk load (this ignores the database log).
Increase the commit interval (recovery is compromised).
Tune the database (rollback segments, dynamic extension, etc.).
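
As an Oracle-flavored sketch of the index/constraint idea (the object names are hypothetical), the index and foreign key can be disabled before the load and restored afterwards:

-- Before the load
ALTER TABLE fact_sales DISABLE CONSTRAINT fk_fact_sales_product;
ALTER INDEX idx_fact_sales_date UNUSABLE;

-- ... run the bulk-load session ...

-- After the load
ALTER INDEX idx_fact_sales_date REBUILD;
ALTER TABLE fact_sales ENABLE CONSTRAINT fk_fact_sales_product;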

B. Sources

Place a Filter transformation set to FALSE right after each SQ so that no records pass downstream. If the session still takes about the same time, the source is the problem. You can also identify a source bottleneck with a read test session: copy the mapping, keep only the sources and Source Qualifiers, remove all other transformations, and connect to a flat file target. If the performance is the same, there is a source bottleneck. Another option is to use a database query: copy the read query directly from the session log and execute it against the source database with a query tool. If the time it takes to execute the query and the time to fetch the first row are significantly different, the query can be modified using optimizer hints.

Solutions: Optimize queries using hints. Use indexes wherever possible.
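
A hedged Oracle example of both ideas (the query, hint, and index names are made up; the real read query should be copied from the session log):

-- Read query with an optimizer hint steering it to an index
SELECT /*+ INDEX(ord idx_orders_order_date) */
       ord.order_id, ord.customer_id, ord.order_date, ord.amount
FROM   orders ord
WHERE  ord.order_date >= DATE '2007-01-01';

-- Supporting index on the column used in the WHERE clause
CREATE INDEX idx_orders_order_date ON orders (order_date);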

C. Mapping

If both the source and the target are fine, the problem could be in the mapping. Add a Filter transformation set to FALSE just before the target; if the session time stays about the same, the bottleneck is in the mapping.

(OR) Look at the performance counters in the session properties sheet: high error rows and a large number of rows in the lookup cache indicate a mapping bottleneck.

Solutions: optimize single-pass reading (when several pipelines read from the same source, combine them so the source is read only once and the common logic is shared), and tune the individual transformations as described below.

Optimize Lookup transformation:

1. Caching the lookup table:
When caching is enabled, the Informatica server caches the lookup table and queries the cache during the session. When this option is not enabled, the server queries the lookup table on a row-by-row basis.
Cache types: static, dynamic, shared, unshared, and persistent.

2. Optimizing the lookup condition: when multiple conditions are used, the condition with the equality sign should be placed first.

3. Indexing the lookup table

The cached lookup table should be indexed on the ORDER BY columns; the session log contains the ORDER BY statement that the server generates. For an uncached lookup, since the server issues a SELECT statement for each row passing into the Lookup transformation, it is better to index the lookup table on the columns used in the lookup condition (see the sketch below).
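
A minimal sketch of both cases, assuming a hypothetical dim_product lookup table whose condition column is product_code (create whichever index matches how the lookup actually runs):

-- Cached lookup: index the columns that appear in the generated ORDER BY
CREATE INDEX idx_dim_product_orderby ON dim_product (product_code, product_key);

-- Uncached lookup: index the column(s) used in the lookup condition
CREATE INDEX idx_dim_product_cond ON dim_product (product_code);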

Optimize Filter transformation:

You can improve efficiency by filtering early in the data flow, instead of using a Filter transformation halfway through the mapping to remove a sizable amount of data.
Use a source qualifier filter to remove those rows at the source. If it is not possible to move the filter into the SQ, move the Filter transformation as close to the Source Qualifier as possible, so unnecessary data is removed early in the data flow.
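
To make the idea concrete (the source and filter below are hypothetical), a source qualifier filter simply becomes a WHERE clause in the generated read query, so the unwanted rows never leave the database:

-- Without a source filter: every row is read
SELECT customer_id, customer_name, status FROM customers;

-- With a source filter such as status = 'ACTIVE': far fewer rows are read
SELECT customer_id, customer_name, status
FROM   customers
WHERE  status = 'ACTIVE';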

Optimize Aggregate transformation:

1. Group by simpler columns, preferably numeric columns.
2. Use sorted input. Sorted input decreases the use of aggregate caches; the server assumes the input data is sorted by the group by ports and performs the aggregate calculations as it reads (see the ORDER BY sketch after this list).
3. Use incremental aggregation in the session properties sheet.
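
For item 2, a hedged example of feeding sorted data to the Aggregator: if the group by ports are store_id and product_id (made-up names), the source qualifier override can pre-sort on exactly those columns:

SELECT store_id, product_id, sales_amount
FROM   daily_sales
ORDER BY store_id, product_id;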


Optimize Seq. Generator transformation:

1. Try creating a reusable Sequence Generator transformation and use it in multiple mappings.
2. The Number of Cached Values property determines the number of values the Informatica server caches at one time.

Optimize Expression transformation:
1. Factor out common logic.
2. Minimize aggregate function calls.
3. Replace common sub-expressions with local variables.
4. Use operators instead of functions.


D. Sessions

If you do not have a source, target, or mapping bottleneck, you may have a session bottleneck. You can identify a session bottleneck by using the performance details. The Informatica server creates performance details when you enable Collect Performance Data on the General tab of the session properties. Performance details display information about each Source Qualifier, target definition, and individual transformation. All transformations have some basic counters that indicate the number of input rows, output rows, and error rows. Any value other than zero in the readfromdisk and writetodisk counters for Aggregator, Joiner, or Rank transformations indicates a session bottleneck.

Low BufferInput_efficiency and BufferOutput_efficiency counters also indicate a session bottleneck. Small cache sizes, low buffer memory, and small commit intervals can cause session bottlenecks.

E. System (Networks)

Monday, August 13, 2007

FAQs on Informatica

Informatica Repository Manager

Q. What type of repositories can be created using Informatica Repository Manager?

A. Informatica PowerCenter includes the following types of repositories:
Standalone Repository: a repository that functions individually and is unrelated to any other repository.
Global Repository: a centralized repository in a domain. This repository can contain objects shared across the repositories in the domain. The objects are shared through global shortcuts.
Local Repository: a local repository is within a domain and is not the global repository. A local repository can connect to the global repository using global shortcuts and can use objects in its shared folders.
Versioned Repository: this can be either a local or a global repository, but it allows version control for the repository. A versioned repository can store multiple copies, or versions, of an object. This feature allows you to efficiently develop, test, and deploy metadata into the production environment.

Q. What is a code page?
A. A code page contains the encoding to specify characters in a set of one or more languages. The code page is selected based on the source of the data. For example, if the source contains Japanese text, then the code page should be selected to support Japanese text.
When a code page is chosen, the program or application for which the code page is set refers to a specific set of data that describes the characters the application recognizes. This influences the way the application stores, receives, and sends character data.

Q. Which databases can PowerCenter Server on Windows connect to?
A. PowerCenter Server on Windows can connect to the following databases:
IBM DB2
Informix
Microsoft Access
Microsoft Excel
Microsoft SQL Server
Oracle
Sybase
Teradata

Q. Which databases can PowerCenter Server on UNIX connect to?
A. PowerCenter Server on UNIX can connect to the following databases:
IBM DB2
Informix
Oracle
Sybase
Teradata
Informatica Mapping Designer

Q. How do you execute a PL/SQL script from an Informatica mapping?
A. A Stored Procedure (SP) transformation can be used to execute PL/SQL scripts. The PL/SQL procedure name is specified in the SP transformation. Whenever the session is executed, the session calls the PL/SQL procedure.

Q. How can you define a transformation? What are different types of transformations available in Informatica?
A. A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Below are the various transformations available in Informatica:
Aggregator
Application Source Qualifier
Custom
Expression
External Procedure
Filter
Input
Joiner
Lookup
Normalizer
Output
Rank
Router
Sequence Generator
Sorter
Source Qualifier
Stored Procedure
Transaction Control
Union
Update Strategy
XML Generator
XML Parser
XML Source Qualifier

Q. What is a source qualifier? What is meant by Query Override?

A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat file source when it runs a session. When a relational or a flat file source definition is added to a mapping, it is connected to a Source Qualifier transformation.
The PowerCenter Server generates a query for each Source Qualifier transformation when it runs the session. The default query is a SELECT statement containing all the source columns. The Source Qualifier can override this default query by changing the default settings of the transformation properties. The list of selected ports, and the order in which they appear in the default query, should not be changed in the overridden query.
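
A hedged illustration with a made-up customers source: the default query selects all connected ports, and the override may add filters or joins as long as the same ports stay in the same order:

-- Default query generated by the Source Qualifier
SELECT customers.customer_id, customers.customer_name, customers.region_id
FROM   customers;

-- Overridden query: adds a filter, keeps the same ports in the same order
SELECT customers.customer_id, customers.customer_name, customers.region_id
FROM   customers
WHERE  customers.region_id = 10;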

Q. What is aggregator transformation?
A. The Aggregator transformation allows you to perform aggregate calculations, such as averages and sums. Unlike the Expression transformation, the Aggregator transformation can be used to perform calculations on groups; the Expression transformation permits calculations on a row-by-row basis only.
The Aggregator transformation contains group by ports that indicate how to group the data. While grouping the data, the Aggregator transformation outputs the last row of each group unless otherwise specified in the transformation properties.
The aggregate functions available in Informatica are: AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM, VARIANCE.

Q. What is Incremental Aggregation?
A. When a session is created for a mapping that contains an Aggregator transformation, the session option for incremental aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the new aggregation calculations incrementally.

Q. How Union Transformation is used?
A. The Union transformation is a multiple input group transformation that can be used to merge data from various sources (or pipelines). This transformation works just like the UNION ALL statement in SQL, which combines the result sets of two SELECT statements.
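
In SQL terms (the table names are hypothetical), the Union transformation behaves like:

SELECT customer_id, customer_name FROM customers_east
UNION ALL
SELECT customer_id, customer_name FROM customers_west;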

Q. Can two flat files be joined with Joiner Transformation?
A. Yes, joiner transformation can be used to join data from two flat file sources.

Q. What is a look up transformation?
A. This transformation is used to look up data in a flat file or a relational table, view, or synonym. It compares Lookup transformation port values (input ports) to the lookup source column values based on the lookup condition. The returned values can then be passed to other transformations.

Q. Can a lookup be done on Flat Files?
A. Yes.

Q. What is the difference between a connected look up and unconnected look up?
A. A connected lookup takes input values directly from other transformations in the pipeline.
An unconnected lookup doesn't take inputs directly from any other transformation, but it can be used in any transformation (such as an Expression) and can be invoked as a function using the :LKP expression. So an unconnected lookup can be called multiple times in a mapping.

Q. What is a mapplet?
A. A mapplet is a reusable object that is created using the Mapplet Designer. A mapplet contains a set of transformations and allows us to reuse that transformation logic in multiple mappings.

Q. What does reusable transformation mean?
A. Reusable transformations can be used multiple times in a mapping. A reusable transformation is stored as metadata, separate from any mapping that uses it. Whenever changes are made to a reusable transformation, all the mappings where the transformation is used will be invalidated.

Q. What is update strategy and what are the options for update strategy?
A. Informatica processes the source data row by row. By default, every row is marked to be inserted into the target table. If a row has to be updated or inserted based on some logic, the Update Strategy transformation is used. The condition can be specified in the Update Strategy to mark the processed row for update or insert.

The following options are available for the update strategy:
DD_INSERT : If this is used the Update Strategy flags the row for insertion. Equivalent numeric value of DD_INSERT is 0.
DD_UPDATE : If this is used the Update Strategy flags the row for update. Equivalent numeric value of DD_UPDATE is 1.
DD_DELETE : If this is used the Update Strategy flags the row for deletion. Equivalent numeric value of DD_DELETE is 2.
DD_REJECT : If this is used the Update Strategy flags the row for rejection. Equivalent numeric value of DD_REJECT is 3.

Data Warehouse Basic Concepts

What is a Data Warehouse?
A Data Warehouse is the "corporate memory". Academics will say it is a subject oriented, point-in-time, inquiry only collection of operational data.
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.

What is ETL?
ETL is the data warehouse acquisition process of Extracting, Transforming (or Transporting) and Loading (ETL) data from source systems into the data warehouse.
What is the difference between a data warehouse and a data mart?
There are inherent similarities between the basic constructs used to design a data warehouse and a data mart. In general, a data warehouse is used at an enterprise level, while a data mart is used at a business division/department level. A data mart contains only the subject-specific data required for local analysis.

What is the difference between a W/H and an OLTP application?
Typical relational databases are designed for on-line transactional processing (OLTP) and do not meet the requirements for effective on-line analytical processing (OLAP). As a result, data warehouses are designed differently than traditional relational databases.
Warehouses are Time Referenced, Subject-Oriented, Non-volatile (read only) and Integrated.
OLTP databases are designed to maintain atomicity, consistency and integrity (the "ACID" tests). Since a data warehouse is not updated, these constraints are relaxed.

What is the difference between OLAP, ROLAP, MOLAP and HOLAP?
ROLAP, MOLAP and HOLAP are specialized OLAP (Online Analytical Processing) applications.
ROLAP stands for Relational OLAP. Users see their data organized in cubes with dimensions, but the data is really stored in a relational database (RDBMS) such as Oracle. Because the RDBMS stores data at a fine-grained level, response times are usually slow.
MOLAP stands for Multidimensional OLAP. Users see their data organized in cubes with dimensions, but the data is stored in a multi-dimensional database (MDBMS) such as Oracle Express Server. In a MOLAP system most queries have a finite, pre-computed answer, and performance, which is usually critical, is fast.

HOLAP stands for Hybrid OLAP; it is a combination of both worlds.

What is the difference between an ODS and a W/H?
An ODS (Operational Data Store) is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 90 days of information.

A warehouse typically contains years of data (Time Referenced). Data warehouses group data by subject rather than by activity (subject-oriented). Other properties are: Non-volatile (read only) and Integrated.

When should one use an MD-database (multi-dimensional database) and not a relational one?
Data in a multi-dimensional database is stored as business people view it, allowing them to slice and dice the data to answer business questions. When designed correctly, an OLAP database will provide much faster response times for analytical queries.

Normal relational databases store data in two-dimensional tables and analytical queries against them are normally very slow.

How can Oracle Materialized Views be used to speed up data warehouse queries?
With "Query Rewrite" (QUERY_REWRITE_ENABLED=TRUE in INIT.ORA) Oracle can direct queries to use pre-aggregated tables instead of scanning large tables to answer complex queries.
Materialized views in a W/H environment are typically referred to as summaries, because they store summarized data.
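
A hedged Oracle example (the object names are invented) of a summary that query rewrite can use:

-- Monthly sales summary, eligible for query rewrite
CREATE MATERIALIZED VIEW mv_monthly_sales
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT product_key,
       TRUNC(sale_date, 'MM') AS sale_month,
       SUM(sales_amount)      AS total_sales
FROM   fact_sales
GROUP BY product_key, TRUNC(sale_date, 'MM');

-- With query rewrite enabled, a query like this can be answered from the summary
-- instead of scanning the large fact table:
SELECT product_key, TRUNC(sale_date, 'MM'), SUM(sales_amount)
FROM   fact_sales
GROUP BY product_key, TRUNC(sale_date, 'MM');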

Dimensional Modeling Definition

Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouses. The dimensional model is the underlying data model used by many of the commercial OLAP products available in the market today. In this model, all data is contained in two types of tables, called fact tables and dimension tables.


Dimensional Modeling - Fact Table
In a dimensional model, the fact table contains the measurements, metrics, or facts of business processes. If your business process is Sales, then a measurement of this business process, such as "monthly sales number", is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys to the dimension tables.

Dimensional Modeling - Dimension Table
In a dimensional model, the context of the measurements is represented in dimension tables. You can also think of the context of a measurement as its characteristics: the who, what, where, when and how of a measurement (its subject). In the Sales business process, the characteristics of the 'monthly sales number' measurement can be a Location (where), Time (when), and Product Sold (what).

The dimension attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, and Zip Code. Generally, dimension attributes are used in report labels and in query constraints such as WHERE Country='USA'. The dimension attributes can also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what the data warehouse will contain. Say you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products; then your dimensions are Location, Time, and Product.

Each dimension table contains data for one dimension. In the above example you take all your store location information and put it into one single table called Location. Your store location data may be spread across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all of that data into one single table.
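
Putting the Location / Time / Product example above into DDL form, a minimal sketch of this star schema (names and data types are illustrative only) could look like:

CREATE TABLE dim_location (
    location_key   INTEGER PRIMARY KEY,
    location_code  VARCHAR(10),
    state          VARCHAR(50),
    country        VARCHAR(50),
    zip_code       VARCHAR(10)
);

CREATE TABLE dim_time (
    time_key  INTEGER PRIMARY KEY,
    month     INTEGER,
    year      INTEGER
);

CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_code  VARCHAR(20),
    product_name  VARCHAR(100)
);

-- The fact table holds the measurement plus foreign keys to the dimensions
CREATE TABLE fact_monthly_sales (
    location_key  INTEGER REFERENCES dim_location (location_key),
    time_key      INTEGER REFERENCES dim_time (time_key),
    product_key   INTEGER REFERENCES dim_product (product_key),
    sales_amount  DECIMAL(12,2)
);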

Monday, August 6, 2007

Want to know about schemas? Then follow this link:

http://wewillrockheredwh.blogspot.com/

Moving data from one repository to another repository

How to move mappings from the development repository to the testing repository:

Exporting a file from the development repository:

Ø Open the Informatica PowerCenter Client Designer.
Ø First log in to the development repository by entering the username, password, host name, and port number.
Ø Now log in to the testing repository by entering the username, password, host name, and port number.
Ø Now we have both repositories in the repository navigator.
Ø Click on the development repository's mappings folder, then right-click on the mapping you want to export.
Ø Click the Open option in the right-click menu (the mapping opens in the Mapping Designer).
Ø Now select the Export option from the right-click menu.
Ø It will ask you to save the file (give the path where you want to export this file; it is saved as an .xml file).
Ø Once you save it, it will display the output, errors (if any), and warnings (if any).
Ø Now click Close.

Well. Now you have successfully exported the file.

Importing the file from the local folder into the test repository:

Ø From the Informatica PowerCenter Client Designer menu bar, select Repository; in this menu you will find the Import Objects option.
Ø Click on this option.
Ø It will display a wizard to browse for the object file; browse to the file you exported from the development repository.
Ø Now click the Next button.
Ø Add the source, target, and mapping objects to be imported into the test repository.
Ø Again click the Next button.
Ø Finally, click the Finish button.
Now the file will be imported into the test repository.

Similarly, workflows are moved from the development repository to the test repository in the Workflow Designer.

The same procedure is used to move objects from the testing repository to the production repository.

This is one way to move objects from one repository to another…
We can follow other methods as well… Very soon I'll update this blog with them…
Please leave your comments….