Concepts of Data Warehouse and Data Mining

Data is a representation of facts, concepts, or instructions in a formal manner that is suitable for communication, interpretation, or processing by human beings or computers.

Basic concepts of data warehousing and data mining
Nowadays data is growing at an explosive rate, from terabytes to petabytes. Data is constantly being generated and collected; by some estimates, the volume of stored data doubles roughly every nine months. Data may also be high-dimensional and highly complex as new and sophisticated applications come into use. There is a big gap between stored data and knowledge. Manual data analysis is not new, but it has become a bottleneck because it is not possible to analyze petabytes of data manually. The fast pace of computer science and engineering keeps generating new demands.

Data Warehouse

“A data warehouse is a copy of transaction data specifically structured for querying and reporting” – Ralph Kimball

It is a structured repository of historic data. It is developed in an evolutionary process by blending data from non-integrated legacy systems. A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed through a sequence of steps: data cleaning, data integration, data transformation, data loading, and periodic data refreshing.

According to W. H. Inmon, a data warehouse is a subject-oriented, integrated, time-variant, nonvolatile set of data in support of management decisions.

A data warehouse is subject-oriented because it focuses on subject areas rather than applications. It is organized around major subjects such as customers, sales, and products. It provides a simple and concise view of particular subject issues by excluding data that is not useful for decision support.

A data warehouse is also integrated. It is constructed by integrating multiple, diverse data sources. Integration tasks reconcile differences such as naming conventions and the physical attributes of data.

A data warehouse is time-variant. Its data is accurate and valid only at some point in time or over some time interval. The time horizon of a data warehouse is considerably longer than that of operational systems: an operational database provides current-value data, whereas data warehouse data provides information from a historical point of view (e.g., the past 15-20 years).

A data warehouse is relatively static in nature. It is not updated in real time; instead, its data is loaded and periodically refreshed from operational systems, and it is not revised by end users. It is therefore considered non-volatile.

Data warehousing is the process of building a data warehouse for an organization. It can also be seen as the process of converting data into information and making it accessible to users in a usable manner, so that it can make a difference.
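The construction steps named earlier (data cleaning, data integration, data transformation, and data loading) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the two source systems, their field names, and the in-memory list standing in for the warehouse are hypothetical, and a real warehouse would rely on dedicated loading tools and a database.

```python
# Two hypothetical source systems that name the same facts differently.
crm_rows = [
    {"cust_name": " Asha ", "total": "150.50"},
    {"cust_name": "Bikash", "total": None},  # incomplete record
]
pos_rows = [
    {"customer": "Chandra", "amount_paid": 80},
]


def clean(rows, required):
    """Data cleaning: drop records whose required field is missing."""
    return [r for r in rows if r.get(required) is not None]


def to_unified(row, name_key, amount_key, source):
    """Integration and transformation: map source fields onto one schema."""
    return {
        "customer": row[name_key].strip(),
        "amount": float(row[amount_key]),
        "source": source,
    }


warehouse = []  # stand-in for a warehouse table


def load(rows):
    """Data loading: append unified rows; end users never edit them."""
    warehouse.extend(rows)


load(to_unified(r, "cust_name", "total", "crm") for r in clean(crm_rows, "total"))
load(to_unified(r, "customer", "amount_paid", "pos") for r in clean(pos_rows, "amount_paid"))

for row in warehouse:
    print(row)
```

Running the same functions again on new source rows would correspond to a periodic refresh: rows are added, not rewritten.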

Objectives of a Data Warehouse

To collect data, scrub it, integrate it, and make it accessible
To provide information for our businesses
To start managing knowledge so that our business partners gain wisdom

Benefits of a Data Warehouse:

Allows the extraction of data from various source systems on different platforms.
Transforms huge data volumes into meaningful information.
Allows integrated data to be analyzed across multiple dimensions.
Provides users with access to the analyzed information anywhere, at any time.

Online Transaction Processing vs. Data Warehouse

OLTP systems are tuned for known transactions and workloads, while in a data warehouse the workload is not known in advance.
OLTP applications normally automate the clerical data processing tasks of an organization, such as data entry and enquiry and transaction handling.
Data warehouse queries need special data organization, access methods, and implementation methods to support them.
OLTP is application-oriented, whereas a data warehouse is subject-oriented.
OLTP is used to run the business; a data warehouse is used to analyze the business.
OLTP holds detailed data; a data warehouse holds summarized and refined data.
OLTP data is current and up to date, whereas data warehouse data is snapshot data.
OLTP is performance-sensitive; a data warehouse is more relaxed about performance.
In OLTP, few records are accessed at a time (tens), but in a data warehouse large volumes are accessed at a time (millions).
OLTP has no data redundancy, whereas redundancy is present in a data warehouse.
An OLTP database is typically 100 MB to 100 GB in size; a data warehouse is typically 100 GB to a few terabytes.
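The difference in access patterns can be seen in a small sketch using Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the sales table, its columns, and its rows are hypothetical, and a single in-memory SQLite table stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, customer TEXT, product TEXT, "
    "year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (customer, product, year, amount) VALUES (?, ?, ?, ?)",
    [
        ("Asha", "soap", 2022, 50.0),
        ("Asha", "shampoo", 2023, 120.0),
        ("Bikash", "soap", 2023, 55.0),
        ("Bikash", "noodles", 2023, 30.0),
    ],
)

# OLTP-style access: fetch one detailed, current record by its key.
row = conn.execute(
    "SELECT customer, product, amount FROM sales WHERE id = ?", (2,)
).fetchone()
print(row)  # ('Asha', 'shampoo', 120.0)

# Warehouse-style access: scan many records and summarize them.
totals = conn.execute(
    "SELECT year, SUM(amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()
print(totals)  # [(2022, 50.0), (2023, 205.0)]
```

The first query touches a single row, as an OLTP transaction would; the second reads every row to produce a summary, which is the kind of workload a data warehouse is organized for.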

Data Mining
Data mining refers to extracting or “mining” knowledge from large amounts of data.

Data mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships among the data. The ultimate goal of data mining is prediction. Predictive data mining is the most common type of data mining and the one with the most direct business application.

Data mining is a popular buzzword for a class of techniques that find patterns in data. It is a user-centric, interactive process that leverages analysis technologies and computing power, and it is a group of techniques that discover relationships that have not been discovered before. Data mining does not rely on an existing database.

Data mining is neither brute-force crunching of bulk data nor a “blind” application of algorithms. It does not find relationships where none exist, and it is not merely a way of presenting data in different ways. It is not a hard-to-understand technology that requires an advanced degree in computer science, but it does not simply turn your data into gold either. The data mining process is not finished once the patterns have been found, and plain queries to the database are not data mining.

For example, an online shopping company may want to use predictive data mining to obtain a trained model or set of models (e.g., neural networks, meta-learners) that can quickly identify which products a customer is likely to buy and recommend products accordingly.

Data Mining Functionalities

Data mining functionalities specify the kinds of patterns that can be found in data mining tasks. In general, data mining tasks can be classified into two types:

Descriptive mining tasks characterize the general properties of the data in a database.

Predictive mining tasks perform inference on the current data in order to make predictions.

In some cases, users may have no idea which kinds of patterns in their data interest them, so they may want to search for several different kinds of patterns in parallel. To accommodate different user expectations or applications, it is therefore important to have a data mining system that can mine multiple kinds of patterns. Furthermore, data mining systems should be able to discover patterns at various levels of abstraction.

Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Since some patterns may not hold for all of the data in the database, a measure of certainty is usually associated with each discovered pattern.

The functionalities of data mining, and the kinds of patterns they can discover, are:

Concept/Class description
Association analysis
Classification and prediction
Cluster analysis
Evolution analysis

Concept/Class description
Data can be associated with concepts or classes. For example, in the ABC stores, classes of items for sale include noodles and biscuits, and concepts of customers include big spenders and budget spenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or concept are called class/concept descriptions. They can be derived in two ways.

Data characterization summarizes the data of the class under study (often called the target class). For example, a data mining system should be able to produce a description that summarizes the characteristics of customers who spend more than Rs. 100,000 a year at ABC stores. The result could be a profile of these customers, such as that they are 30-40 years old, employed, and have excellent credit ratings.

Data discrimination compares the target class with one or a set of comparative classes (often called the contrasting classes). For example, a data mining system should be able to compare two groups of ABC stores customers, such as those who shop for computer products regularly versus those who rarely shop for such products. The resulting description provides a general comparative profile of the customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 30 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either senior citizens or youths and have no university degree.
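Data characterization of the high-spending customers described above can be sketched in a few lines of Python. The customer records, field names, and values below are invented for illustration only; they are not taken from any real ABC stores data.

```python
from collections import Counter

# Hypothetical customer records.
customers = [
    {"age": 34, "employed": True, "credit": "excellent", "annual_spend": 150000},
    {"age": 38, "employed": True, "credit": "excellent", "annual_spend": 120000},
    {"age": 25, "employed": False, "credit": "fair", "annual_spend": 20000},
    {"age": 31, "employed": True, "credit": "good", "annual_spend": 110000},
]

# Target class: customers who spend more than Rs. 100,000 a year.
target = [c for c in customers if c["annual_spend"] > 100000]

# Summarize the target class into a concise profile.
profile = {
    "count": len(target),
    "age_range": (min(c["age"] for c in target), max(c["age"] for c in target)),
    "employed_share": sum(c["employed"] for c in target) / len(target),
    "most_common_credit": Counter(c["credit"] for c in target).most_common(1)[0][0],
}
print(profile)
```

Data discrimination would compute the same kind of profile for a contrasting class (for example, customers below the spending threshold) and compare the two.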

Association analysis
Association analysis is the discovery of association rules showing attribute-value conditions that frequently occur together in a given set of data. Association analysis is widely used for market basket or transaction data analysis.
Suppose, as a marketing manager of ABC store, you would like to determine which items are regularly purchased together within the same transactions.
An example of such a rule, mined from the ABC store transactional database, is

buys(X, “shampoo”) ⇒ buys(X, “soap”) [support = 1%, confidence = 50%]
meaning that if a customer X buys “shampoo”, there is a 50% chance that X will buy “soap” as well, and that 1% of all transactions under analysis contain both items.
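Support and confidence can be computed directly from a list of transactions. The sketch below assumes the usual definitions: support is the fraction of all transactions that contain both items, and confidence is the fraction of transactions containing the first item that also contain the second. The four transactions are made up for illustration, so the resulting figures differ from the 1% and 50% in the rule above.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    frozenset({"shampoo", "soap"}),
    frozenset({"shampoo", "noodles"}),
    frozenset({"noodles", "biscuits"}),
    frozenset({"soap", "biscuits"}),
]


def support(itemset, transactions):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    base = support(antecedent, transactions)
    return both / base if base > 0 else 0.0


rule_support = support({"shampoo", "soap"}, transactions)
rule_confidence = confidence({"shampoo"}, {"soap"}, transactions)
print(f"support={rule_support:.0%}, confidence={rule_confidence:.0%}")
# prints: support=25%, confidence=50%
```

Here one of four transactions contains both items (support 25%), and one of the two transactions containing shampoo also contains soap (confidence 50%).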

Classification and prediction
Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown. The derived model may be represented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks.

Suppose, as sales manager of ABC Stores, you would like to classify a large collection of items in the store on the basis of three kinds of responses to a sales campaign: good response, mild response, and no response. You would like to derive a model for each of these three classes based on the descriptive features of the items, such as price, brand, manufactured_place, type, and category.

The resulting classification should maximally distinguish each class from the others, presenting an organized picture of the data set.
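As a rough illustration of deriving IF-THEN rules from labeled items, the sketch below learns one rule per value of a single attribute by taking the majority class for that value. This is a deliberately simple stand-in chosen for the example, not the method a production system would use; the training items, the brand names, and the choice of the brand attribute are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical labeled items: (features, response to the sales campaign).
training = [
    ({"brand": "A", "type": "snack"}, "good"),
    ({"brand": "A", "type": "drink"}, "good"),
    ({"brand": "B", "type": "snack"}, "mild"),
    ({"brand": "B", "type": "drink"}, "mild"),
    ({"brand": "C", "type": "snack"}, "no response"),
]


def learn_rules(rows, attribute):
    """Build rules of the form IF attribute == value THEN majority class."""
    counts = defaultdict(Counter)
    for features, label in rows:
        counts[features[attribute]][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def predict(rules, features, attribute, default="no response"):
    """Apply the learned rules to an item whose class label is unknown."""
    return rules.get(features[attribute], default)


rules = learn_rules(training, "brand")
prediction = predict(rules, {"brand": "B", "type": "snack"}, "brand")
print(rules)
print(prediction)  # mild
```

Decision trees extend the same idea by testing several attributes in sequence instead of just one.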

Cluster Analysis
Objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters are formed so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters.

For example, cluster analysis can be performed on ABC stores customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups, which can be useful for many purposes.
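One common way to form such clusters is k-means, sketched below on a single attribute. The text does not name a particular algorithm, so k-means is an assumption made for this example, as are the spending figures (in thousands of rupees) and the starting centroids.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal k-means on one-dimensional data with given starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical annual spending per customer, in thousands of rupees.
spending = [12, 15, 14, 90, 95, 100]
centroids, clusters = kmeans_1d(spending, centroids=[12, 100])
print(clusters)  # [[12, 15, 14], [90, 95, 100]]
```

Values inside each resulting group lie close to one another and far from the other group, which is the intraclass/interclass principle stated above.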

Evolution Analysis
Data evolution analysis describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association and correlation analysis, classification, prediction, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Suppose that you have the major stock market (time-series) data of the last several years available from the Nepal Stock Exchange and you would like to invest in shares of different tech companies. A data mining study of stock exchange data may identify regularities in stock evolution for stocks overall and for the stocks of particular companies. Such regularities may help forecast future trends in stock market prices, contributing to your decision making regarding stock investments.
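A very simple form of time-series analysis is a moving average, which smooths short-term fluctuations so that a trend becomes visible. The sketch below is only an illustration: the closing prices are invented, not real Nepal Stock Exchange data, and a rising moving average describes the past and is not by itself a forecast.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values in the series."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical daily closing prices for one stock.
prices = [100, 102, 101, 105, 107, 110]
trend = moving_average(prices, window=3)

# The smoothed series is rising if each value exceeds the previous one.
is_rising = all(later > earlier for earlier, later in zip(trend, trend[1:]))
print(trend)
print(is_rising)  # True
```

Although the raw prices dip once (102 to 101), the three-day averages rise steadily, which is the kind of regularity evolution analysis looks for.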



