Data is a representation of facts, concepts, or instructions in a formal manner that is suitable for communication, interpretation, or processing by human beings or computers.
Basic concepts of data warehousing and data mining
Data volumes are growing at an explosive rate, from terabytes to petabytes. Data is continuously generated and collected; by some estimates, the amount of stored data doubles every nine months. As new and sophisticated applications come into use, data may also be high-dimensional and highly complex. There is a large gap between stored data and knowledge. Manual data analysis is not new, but it has become a bottleneck because it is not possible to analyze petabytes of data manually. Rapid advances in computer science and engineering also keep generating new demands.
“A data warehouse is a copy of transaction data specifically structured for querying and reporting” – Ralph Kimball
A data warehouse is a structured repository of historical data. It is developed through an evolutionary process by blending data from non-integrated legacy systems. It is a repository of information collected from multiple sources, stored under a unified schema, that usually resides at a single site. Data warehouses are constructed through a sequence of steps: data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
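The construction steps listed above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two source systems, their field names, the paisa-to-rupee conversion, and the use of an in-memory SQLite table as the "warehouse" are hypothetical and do not describe any particular product.

```python
import sqlite3

# Hypothetical operational sources with different naming conventions and units.
SALES_SYSTEM_A = [
    {"cust_id": 1, "amount_rs": 250.0, "day": "2024-01-05"},
    {"cust_id": 2, "amount_rs": None, "day": "2024-01-05"},  # dirty record: missing amount
]
SALES_SYSTEM_B = [
    {"CustomerID": "3", "amt_paisa": 120000, "date": "2024-01-06"},
]

def clean(rows):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate_a(row):
    """Data integration: map source A's field names onto the unified schema."""
    return {"customer_id": row["cust_id"], "amount_rs": row["amount_rs"], "sale_date": row["day"]}

def integrate_b(row):
    """Integration plus transformation: rename fields, fix types, convert paisa to rupees."""
    return {
        "customer_id": int(row["CustomerID"]),
        "amount_rs": row["amt_paisa"] / 100,
        "sale_date": row["date"],
    }

def load(conn, rows):
    """Data loading: append rows to the warehouse table; existing rows are not modified."""
    conn.executemany("INSERT INTO sales VALUES (:customer_id, :amount_rs, :sale_date)", rows)
    conn.commit()

def build_warehouse():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer_id INTEGER, amount_rs REAL, sale_date TEXT)")
    unified = [integrate_a(r) for r in clean(SALES_SYSTEM_A)]
    unified += [integrate_b(r) for r in clean(SALES_SYSTEM_B)]
    load(conn, unified)
    return conn

conn = build_warehouse()
# Periodic refresh: a later batch goes through the same steps and is appended.
load(conn, [integrate_a({"cust_id": 1, "amount_rs": 90.0, "day": "2024-01-07"})])
rows = conn.execute(
    "SELECT customer_id, amount_rs, sale_date FROM sales ORDER BY sale_date"
).fetchall()
print(rows)
```

The dirty record from the first source is dropped during cleaning, and both sources end up under one schema with one unit of currency, which is the point of the integration and transformation steps.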
According to W. H. Inmon, a data warehouse is a subject-oriented, integrated, time-variant, nonvolatile set of data in support of management decisions.
A data warehouse is subject-oriented because it focuses on subject areas rather than applications. It is organized around major subjects such as customers, sales, and products. It provides a simple and concise view of particular subject issues by excluding data that are not useful in the decision support process.
A data warehouse is also integrated. It is constructed by integrating multiple, diverse data sources. Integration tasks reconcile differences such as naming conventions and the physical attributes of data.
A data warehouse is time-variant. Its data is accurate and valid only at some point in time or over some time interval. The time horizon of a data warehouse is significantly longer than that of operational systems: an operational database provides current-value data, whereas a data warehouse provides information from a historical point of view (e.g., the past 15-20 years).
A data warehouse is relatively static. It is not updated in real time; instead, its data is loaded and refreshed from operational systems, and it is not revised by end users. It is therefore considered non-volatile.
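The difference between current-value operational data and time-variant, non-volatile warehouse data can be sketched as follows. This is only an illustration under stated assumptions: the product, prices, load dates, and function names are hypothetical, and real warehouses use far richer structures than a Python list.

```python
from datetime import date

# Operational (OLTP-style) store: keeps only the current value per product.
operational_price = {}

# Warehouse-style history: append-only snapshots, each stamped with a load date.
warehouse_price_history = []

def update_operational(product, price):
    """Overwrite the current value; the previous value is lost."""
    operational_price[product] = price

def load_snapshot(load_date):
    """Periodic load: copy the current operational values into the warehouse."""
    for product, price in operational_price.items():
        warehouse_price_history.append((load_date, product, price))

def price_as_of(product, as_of):
    """Return the most recently loaded price for the product on or before `as_of`."""
    matches = [(d, p) for d, prod, p in warehouse_price_history if prod == product and d <= as_of]
    return max(matches)[1] if matches else None

update_operational("soap", 40)
load_snapshot(date(2023, 1, 1))
update_operational("soap", 45)
load_snapshot(date(2024, 1, 1))

print(operational_price["soap"])               # current value only
print(price_as_of("soap", date(2023, 6, 30)))  # historical value from the warehouse
```

The operational store can answer only "what is the price now", while the append-only history can also answer "what was the price in mid-2023", because loaded snapshots are never overwritten.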
Data warehousing is the process of building a data warehouse for an organization. It can also be viewed as the process of converting data into information and making it accessible to users in a usable manner, so that it can make a difference.
Objective of a Data Warehouse: To collect data, scrub and integrate it, and make it accessible.
Benefit of a Data Warehouse: It allows data to be extracted from various source systems on different platforms.
Online Transaction Processing vs. Data Warehouse: OLTP systems are tuned for known transactions and workloads, whereas the workload of a data warehouse is not known in advance.
Data mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships among variables. The ultimate goal of data mining is prediction. Predictive data mining is the most common type of data mining and the one with the most direct business application.
Data mining is a popular buzzword for a class of techniques that find patterns in data. It is a user-centric, interactive process that benefits from analysis technologies and computing power. It is a group of techniques that discover relationships that have not previously been discovered. Data mining does not rely on an existing database. It is neither brute-force crunching of bulk data nor a “blind” application of algorithms. It does not find relationships where none exist, and it is not merely a way of presenting data in different ways. It is a relatively easy-to-understand technology that does not require an advanced degree in computer science. It does not simply turn your data into gold, and the data mining process is not finished once the patterns have been found. Queries to a database are not data mining.
For example, an online shopping company may want to use predictive data mining to obtain a trained model or set of models (e.g., neural networks or meta-learners) that can quickly identify which products a customer is likely to buy and recommend products accordingly.
Data Mining Functionalities
Data mining functionalities are used to specify the kinds of patterns that can be found in data mining tasks. In general, data mining tasks can be classified into two types: descriptive and predictive.
Descriptive mining tasks portray the general properties of the data in a database.
Predictive mining tasks perform inference on the current data in order to make predictions.
In some cases, users may have no idea which kinds of patterns in their data are interesting, and so may want to search for several different kinds of patterns in parallel. Thus, to accommodate different user expectations or applications, it is important to have a data mining system that can mine multiple kinds of patterns. Furthermore, data mining systems should be able to discover patterns at various levels of abstraction.
Data mining systems should also allow users to specify hints to guide or focus the search for interesting patterns. Since some patterns may not hold for all of the data in the database, a measure of certainty is usually associated with each discovered pattern.
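The functionalities of data mining, and the kinds of patterns they can discover, are class/concept description, association analysis, classification and prediction, cluster analysis, and evolution analysis.
Class/Concept Description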
Data can be associated with classes or concepts. For example, in the ABC stores, classes of items for sale include noodles and biscuits, and concepts of customers include BigSpenders and BudgetSpenders. It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or concept are called class/concept descriptions. They can be derived in two ways.
Data characterization, by summarizing the data of the class under study (often called the target class). For example, a data mining system should be able to produce a description that summarizes the characteristics of customers who spend more than Rs. 1,00,000 a year at ABC stores. The result could be a profile of these customers, such as that they are 30-40 years old, employed, and have excellent credit ratings.
Data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes).
For example, a data mining system should be able to compare two groups of ABC stores customers, such as those who shop for computer products regularly versus those who rarely shop for such products. The resulting description provides a general comparative profile of the customers, such as that 80% of the customers who frequently purchase computer products are between 20 and 30 years old and have a university education, whereas 60% of the customers who infrequently buy such products are either senior citizens or youths and have no university degree.
Association Analysis
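Association analysis discovers rules that describe items frequently occurring together in the data. For example, the rule
buys(X, “shampoo”) ⇒ buys(X, “soap”) [support = 1%, confidence = 50%]
means that if a customer X buys shampoo, there is a 50% chance that the customer will buy soap as well, and that 1% of all transactions contain both items.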
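How support and confidence are computed for such a rule can be shown with a small sketch. The transaction list below is hypothetical and far smaller than real market-basket data, so the resulting percentages differ from the 1% and 50% quoted in the rule; the function names are also illustrative.

```python
# Hypothetical market-basket data: each set is one customer transaction.
transactions = [
    {"shampoo", "soap"},
    {"shampoo", "soap", "noodles"},
    {"shampoo"},
    {"shampoo", "biscuits"},
    {"soap"},
    {"noodles", "biscuits"},
    {"noodles"},
    {"biscuits"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also contain `consequent`.

    Assumes the antecedent occurs at least once; otherwise the ratio is undefined.
    """
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"shampoo", "soap"}, transactions)
rule_confidence = confidence({"shampoo"}, {"soap"}, transactions)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

Here shampoo and soap appear together in 2 of 8 transactions (support 25%), and 2 of the 4 shampoo transactions also contain soap (confidence 50%).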
Classification and Prediction
Suppose that, as sales manager of ABC Stores, you would like to classify a large collection of items in the store based on three kinds of responses to a sales campaign: good response, mild response, and no response. You would like to derive a model for each of these three classes based on the descriptive features of the items, such as price, brand, manufactured_place, type, and category.
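As a rough illustration of deriving such a model, the sketch below trains a one-rule (1R) classifier, which picks the single feature that best predicts the class. This technique is chosen here only because it is short; the text does not prescribe any particular algorithm. The training items, their feature values, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training data: item features and the observed campaign response.
items = [
    ({"price": "low", "brand": "A", "type": "food"}, "good"),
    ({"price": "low", "brand": "B", "type": "food"}, "good"),
    ({"price": "mid", "brand": "A", "type": "cosmetic"}, "mild"),
    ({"price": "mid", "brand": "B", "type": "food"}, "mild"),
    ({"price": "high", "brand": "A", "type": "cosmetic"}, "no response"),
    ({"price": "high", "brand": "B", "type": "cosmetic"}, "no response"),
]

def train_one_rule(examples):
    """Pick the feature whose value-to-majority-class rule makes the fewest training errors."""
    best = None
    for feature in examples[0][0]:
        counts = defaultdict(Counter)
        for features, label in examples:
            counts[features[feature]][label] += 1
        # For each value of this feature, predict the most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rule[features[feature]] for features, label in examples)
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

def predict(model, features, default="no response"):
    """Apply the learned rule; fall back to `default` for unseen feature values."""
    feature, rule, _ = model
    return rule.get(features[feature], default)

model = train_one_rule(items)
print(model[0], model[1])
print(predict(model, {"price": "low", "brand": "C", "type": "cosmetic"}))
```

On this toy data the rule is built on price, since price alone separates the three response classes without error, whereas brand and type do not.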
The resulting classification should maximally distinguish each class from the others, presenting an organized picture of the data set.
Cluster Analysis
For example, cluster analysis can be performed on ABC stores customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for various purposes.
Evolution Analysis
Suppose that you have the major stock market (time-series) data of the last several years available from the Nepal Stock Exchange and you would like to invest in shares of different tech companies. A data mining study of stock exchange data may identify regularities in stock evolution for overall stocks and for the stocks of particular companies. Such regularities may help forecast future trends in stock market prices, contributing to your decision making regarding stock investments.
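One very simple way to look for such a regularity is to smooth the series with a moving average and compare its start and end. The sketch below assumes this approach purely for illustration; the prices are made up (they are not Nepal Stock Exchange data), and real evolution analysis uses much more careful time-series methods.

```python
# Hypothetical daily closing prices for one stock (not real market data).
closing_prices = [100, 102, 101, 105, 107, 110, 108, 112, 115, 117]

def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def trend(values, window=3):
    """Label the series by comparing the last smoothed value with the first."""
    smoothed = moving_average(values, window)
    if smoothed[-1] > smoothed[0]:
        return "upward"
    if smoothed[-1] < smoothed[0]:
        return "downward"
    return "flat"

smoothed = moving_average(closing_prices, 3)
print([round(v, 2) for v in smoothed])
print(trend(closing_prices))
```

Smoothing removes the day-to-day dips (such as 102 to 101) so that the overall direction of the series is easier to see.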
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann
Sam Anahory and Dennis Murray, Data Warehousing in the Real World, Pearson Education Asia
Arun K. Pujari, Data Mining Techniques, Universities Press