The tasks involved in building a Data Mining database are as follows:
- Collecting the data (involves identifying the sources of data)
- Selecting the Data to be processed
- Preprocessing the data (data quality assessment)
- Transforming the data
- Analyzing the data
- Implementing the knowledge gained
Collecting the Data
The first step is to identify the data that needs to be mined. This step is important, as most of the required data may never have been collected. Data may be collected from sources such as public libraries or government documents. The properties of the collected data may be stored in a data source report that may include any of the following:
- Source of the data
- Owner of the data
- Person/organization responsible for maintaining the data
- If purchased, the cost of the data
- Size in bytes
- Storage medium used (such as CD, tape, network)
- Security requirements
A data description report containing the contents of each file or database table can be maintained to document this information.
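As a rough illustration, one entry of the data source report described above could be represented as a small record type. This is only a sketch: the class name, field names, and sample values are assumptions made for the example, not part of any standard.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DataSourceReport:
    """One entry in a data source report (field names are illustrative)."""
    source: str
    owner: str
    maintainer: str              # person/organization responsible for the data
    size_bytes: int
    storage_medium: str          # e.g. "CD", "tape", "network"
    security_requirements: str
    purchase_cost: Optional[float] = None  # None when the data was not purchased


# Hypothetical entry; every value here is made up for illustration.
report = DataSourceReport(
    source="Government census extract",
    owner="National statistics office",
    maintainer="Data warehouse team",
    size_bytes=250_000_000,
    storage_medium="network",
    security_requirements="internal use only",
)

# A plain dict is easy to write out as a row of a CSV file or database table.
report_row = asdict(report)
```

Keeping the purchase cost optional reflects the "if purchased" wording of the list: data that was not bought simply has no cost recorded.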
Selecting the Data
Once the data is collected, the data that needs to be mined is selected. This is dependent on the business objectives. Along with this, the metadata also needs to be acquired. This contains the descriptions of the data types, initial values, range of values, list of values, the unit of measure and the primary key/foreign key relationships. The main purpose of this task is to eliminate irrelevant data.
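The sketch below shows one possible way to use such metadata: irrelevant columns are dropped, and the remaining values are checked against the declared type, range, and list of values. The column names, the metadata layout, and the helper functions are all hypothetical.

```python
# Hypothetical metadata: one entry per column of the collected data.
metadata = {
    "customer_id": {"type": int, "primary_key": True},
    "age": {"type": int, "range": (0, 120), "unit": "years"},
    "region": {"type": str, "values": {"north", "south", "east", "west"}},
    "fax_number": {"type": str},
}

# Columns judged relevant to the (assumed) business objective.
relevant_columns = ["customer_id", "age", "region"]


def select_columns(record, columns):
    """Keep only the columns that are relevant to the mining task."""
    return {name: record[name] for name in columns}


def violations(record, metadata):
    """Return a list of values that contradict the metadata."""
    problems = []
    for name, value in record.items():
        spec = metadata[name]
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
            continue
        if "range" in spec:
            low, high = spec["range"]
            if not low <= value <= high:
                problems.append(f"{name}: {value} outside {low}..{high}")
        if "values" in spec and value not in spec["values"]:
            problems.append(f"{name}: {value!r} not in list of values")
    return problems


raw = {"customer_id": 17, "age": 134, "region": "north", "fax_number": "n/a"}
selected = select_columns(raw, relevant_columns)
issues = violations(selected, metadata)
```

Here the fax number is eliminated as irrelevant, and the age is flagged because it falls outside the range given in the metadata.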
Preprocessing the Data
Some of the data acquired may be inaccurate, inconsistent, and poorly documented. This task involves filtering and organizing the data so that the data to be mined is relevant and accurate. This task generally requires the use of sampling and visualization techniques.
Data can be categorical or quantitative depending on its characteristics. For example, categorical data can be visualized in the form of histograms, pie charts, or pivot tables, while quantitative data can be summarized in terms of its maximum, minimum, and mean. Using these techniques, data that is incorrect or irrelevant can be identified and removed.
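A minimal sketch of these checks, using only the Python standard library, might look as follows. The records and the rule that a negative amount is incorrect are assumptions made for the example.

```python
from collections import Counter
from statistics import mean

# Made-up sample records: one categorical field and one quantitative field.
records = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 80.0},
    {"region": "north", "amount": -5.0},   # negative amount: assumed incorrect
    {"region": "east", "amount": 100.0},
]

# Categorical data: a frequency table, i.e. the numbers behind a
# histogram or pie chart.
region_counts = Counter(r["region"] for r in records)

# Remove records that fail the (assumed) validity rule.
clean = [r for r in records if r["amount"] >= 0]

# Quantitative data: minimum, maximum, and mean of the cleaned values.
amounts = [r["amount"] for r in clean]
summary = {"min": min(amounts), "max": max(amounts), "mean": mean(amounts)}
```

In practice the validity rule would come from the metadata or from domain knowledge rather than being hard-coded.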
Transforming the Data
Once the data is ready, it is transformed and an analytical model is produced. This is known as an Informational Data Model, which enables an integrated and time-dependent restructuring of the data. The content and scope of this data model determine the validity and practical use of the Data Mining process. For example, if customer trends for a product are to be analyzed, the analyst must decide whether to conduct the analysis at the regional level or at the individual level.
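The choice between the regional and the individual level can be sketched as the same aggregation applied to different keys. The purchase records and the `total_by` helper below are hypothetical.

```python
from collections import defaultdict

# Made-up purchase records for a single product.
purchases = [
    {"customer": "c1", "region": "north", "amount": 40.0},
    {"customer": "c2", "region": "north", "amount": 60.0},
    {"customer": "c1", "region": "north", "amount": 10.0},
    {"customer": "c3", "region": "south", "amount": 25.0},
]


def total_by(rows, key):
    """Sum purchase amounts, grouped by the chosen level of analysis."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)


by_customer = total_by(purchases, "customer")  # individual level
by_region = total_by(purchases, "region")      # regional level
```

The regional view is more compact but hides differences between customers, which is why the level has to be decided before the analysis.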
Analyzing the Data
With the help of a business analyst, a data analyst can analyze the data by using visualization aids and tools. The purpose of this task is to find patterns or trends that were not known earlier. This approach is quite different from statistical analysis, as it may or may not return a straight answer to a given hypothesis. Drawing conclusions from the data in the form of "if-then" rules is another approach.
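As an illustration of an "if-then" rule, the sketch below evaluates the rule "if bread, then butter" on a few made-up transactions. Support and confidence are measures commonly used for such rules; they are brought in here as an assumption and are not named in the text above.

```python
# Made-up transactions: each is the set of items bought together.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]


def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule: if antecedent then consequent."""
    n_total = len(transactions)
    # Transactions that satisfy the "if" part.
    n_if = sum(1 for t in transactions if antecedent <= t)
    # Transactions that satisfy both the "if" and the "then" part.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n_total
    confidence = n_both / n_if if n_if else 0.0
    return support, confidence


support, confidence = rule_stats(transactions, {"bread"}, {"butter"})
```

With this data, two of the four transactions contain both items, and two of the three bread purchases also include butter, so the rule holds for most but not all cases.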
Implementing the Knowledge Gained
This task involves the action steps to be taken to implement the knowledge gained by Data Mining. The assimilated knowledge is put to use by applying it to business cycles. Here, the business analyst identifies the stages at which the knowledge gained can be introduced into real-life processes, so that the benefits of Data Mining are realized. For example, a new set of associations might be discovered, which may trigger a new advertising campaign.
The final benefit is a shift of focus toward the importance of data as an asset to the organization.