|Abstract: ||Data preparation is an essential part of data mining, which consists of preparing,
surveying and modelling data. It prepares the data as well as the miner so that when the
prepared data is used, better and faster models are produced. Much of this important step
in data mining can be automated, which led to the development of a data preparation tool
(the DP Tool) for data mining.
Data preparation involves looking at the data variables individually as well as looking
at the set of data variables as a whole. Certain variable features are problems in data
mining. They include “sparse” variables, “compact” variables, monotonic variables, and
outliers. For some modelling methods, these problems may affect the speed of modelling
and/or the value of the model. Fortunately, techniques are available to solve them before the
data is mined, and some are used when performing simple data transformations on a data
set using the DP Tool.
When preparing a data set, two areas need attention: getting enough data and
exposing its information content. Getting enough data is known as capturing data set
variability. Estimated confidence measures of each variable are compared to the
computed ones to ensure a particular data collection set has enough data to build useful
models. In the process, a variable status report is prepared. The data collection set may
contain very complex relationships, which are often known beforehand by the business
expert. Giving the mining tool such knowledge at the outset speeds up its work. One such
case is the aggregation of transaction details to the customer level, which is performed
when building a data set.
The DP Tool is based on a visual mining project carried out by a cellular phone
company. The project aimed to identify the customer churn rate and to determine what actions
would reduce it. Descriptive models provide not only the trend in customer churn
but also the profiles of churned customers. The project data sets serve as test data for the
data preparation tool.
Before any data can be prepared, they must be extracted from their sources and downloaded
into an exploratory database. The DP Tool provides a module to extract online
data from different database servers, both local and remote. Another module provides
scrollable editing for different data “types” such as first-load data, which are reloaded after
corrections. Table records can be edited, added or deleted. Once the collection data have been
cleaned and verified, a data set is created. The data set then undergoes several kinds of data
transformation, which are categorised into discrete items, continuous items and computed
items. A housekeeping module known as database maintenance is also provided.
The data preparation tool is implemented as a client/server application with a two-tier
“plus many” architecture. The client and server reside on the same host, a laptop. The
main server is linked to other server instances for data access. SQL Server 2000 provides
high reliability, high security, and a powerful SQL programming language, which is used
to implement all the data preparation tasks. Another development tool used is JBuilder
(Borland), which provides a visual programming environment to build the user-friendly
interface, consisting of frames and dialogs. The Java user-interface classes reside in the
client while the data preparation stored procedures reside in the server database.|
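
The aggregation of transaction details to the customer level mentioned in the abstract can be illustrated with a minimal sketch. The abstract states that the DP Tool implements its data preparation tasks as SQL stored procedures; the Python below is only an illustration of the idea, not the tool's code. The record layout (`customer_id`, call minutes, charge), the sample values, and the function name `aggregate_to_customer_level` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction records: (customer_id, call_minutes, charge).
# The field names and values are invented for illustration only.
transactions = [
    ("C001", 12.0, 3.60),
    ("C002", 5.0, 1.50),
    ("C001", 30.0, 9.00),
    ("C003", 2.0, 0.60),
    ("C002", 8.0, 2.40),
]


def aggregate_to_customer_level(rows):
    """Roll transaction rows up to one summary record per customer."""
    totals = defaultdict(
        lambda: {"n_calls": 0, "total_minutes": 0.0, "total_charge": 0.0}
    )
    for customer_id, minutes, charge in rows:
        summary = totals[customer_id]
        summary["n_calls"] += 1
        summary["total_minutes"] += minutes
        summary["total_charge"] += charge
    # Add a derived per-customer average once all rows are accumulated.
    for summary in totals.values():
        summary["avg_minutes"] = summary["total_minutes"] / summary["n_calls"]
    return dict(totals)


customer_level = aggregate_to_customer_level(transactions)
for customer_id, summary in sorted(customer_level.items()):
    print(customer_id, summary)
```

Each customer ends up with a single row of summary variables, which is the form a mining tool would otherwise have to discover from the raw transactions on its own.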