
Please use this identifier to cite or link to this item: http://hdl.handle.net/1812/97

Title: Data Preparation Tool for Exploration in Data Mining
Authors: Lee, Hock Heng
Keywords: Data mining; Data preparation tool (DP tool)
Issue Date: Jan-2007
Abstract: Data preparation is an essential part of data mining, which consists of preparing, surveying and modelling data. It prepares the data as well as the miner, so that models built from the prepared data are better and are produced faster. Much of this step can be automated, which led to the development of a data preparation tool (the DP Tool) for data mining.

Data preparation involves looking at the data variables individually as well as at the set of variables as a whole. Certain variable features cause problems in data mining, including “sparse” variables, “compact” variables, monotonic variables, and outliers. For some modelling methods, these problems may affect the speed of modelling and/or the value of the model. Fortunately, techniques are available to solve them before the data is mined, and some are applied when performing simple data transformations on a data set using the DP Tool.

When preparing a data set, two areas need attention: getting enough data and exposing its information content. Getting enough data is known as capturing data set variability. Estimated confidence measures for each variable are compared with the computed ones to ensure that a particular data collection set has enough data to build useful models. In the process, a variable status report is prepared. The data collection set may contain very complex relationships, which are often known beforehand by the business expert. Giving the mining tool such knowledge at the outset speeds up its process. One such case is the aggregation of transaction details to the customer level, which is performed when building a data set.

The DP Tool is based on a visual mining project carried out by a cellular phone company. The project aimed to identify the customer churn rate and to determine what actions would reduce it. Descriptive models provide not only the trend of customer churn but also the profiles of churned customers.
The project data sets serve as test data for the data preparation tool. Before any data can be prepared, it has to be extracted by downloading it from its sources into an exploratory database. The DP Tool provides a module to extract online data from different database servers, both local and remote. Another module provides scrollable editing for different data “types”, such as first-load data, which are reloaded after corrections. Table records can be edited, added or deleted. Once the collection data are cleaned and verified, a data set is created. The data set then undergoes data transformations, which are categorised into discrete items, continuous items and computed items. A housekeeping module, known as database maintenance, is also provided.

The data preparation tool is developed as a client/server implementation of a two-tier “plus many” architecture. The client and server reside on the same host, a laptop. The main server is linked to other server instances for data access. SQL Server 2000 provides high reliability, high security, and a powerful SQL programming language, which is used to implement all the data preparation tasks. The other development tool used is JBuilder (Borland), which provides a visual programming environment for building the user-friendly interface, consisting of frames and dialogs. The Java user-interface classes reside in the client, while the data preparation stored procedures reside in the server database.
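
As a rough illustration of the per-variable checks the abstract describes (sparse variables, monotonic variables, outliers, and a variable status report), the following Python sketch may help. It is not the DP Tool's implementation, which the abstract says is written as SQL Server stored procedures. The function name `variable_status`, the 0.5 sparsity threshold, and the use of Tukey fences for outliers are all assumptions made for this example.

```python
from statistics import quantiles


def variable_status(values, sparse_threshold=0.5, fence=1.5):
    """Return a small status report for one numeric variable.

    `values` is a list of numbers in record order; None marks a missing entry.
    Both thresholds are illustrative assumptions, not taken from the dissertation.
    """
    present = [v for v in values if v is not None]
    missing_fraction = (len(values) - len(present)) / len(values) if values else 0.0

    # Monotonic: each value is strictly larger (or strictly smaller) than the previous one.
    pairs = list(zip(present, present[1:]))
    monotonic = bool(pairs) and (
        all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    )

    # Outliers via Tukey fences: values outside [Q1 - fence*IQR, Q3 + fence*IQR].
    outliers = []
    if len(present) >= 4:
        q1, _, q3 = quantiles(present, n=4)
        spread = q3 - q1
        low, high = q1 - fence * spread, q3 + fence * spread
        outliers = [v for v in present if v < low or v > high]

    return {
        "count": len(values),
        "missing_fraction": missing_fraction,
        "sparse": missing_fraction > sparse_threshold,
        "monotonic": monotonic,
        "outliers": outliers,
    }
```

Calling `variable_status` once per column of a data set and collecting the results gives a simple status table. A monotonic variable such as a running record number would be flagged for removal or transformation before modelling.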
Description: Master of Software Engineering
URI: http://dspace.fsktm.um.edu.my/handle/1812/97
Appears in Collections: Masters Dissertations: Computer Science

Files in This Item:

File | Description | Size | Format
Chapter 3 - Aspects of DP.pdf | Chapter 3 | 102 kB | Adobe PDF
Chapter 2 Literature review.pdf | Chapter 2 | 145.47 kB | Adobe PDF
Chapter 4 - Development Methodology.pdf | Chapter 4 | 73.24 kB | Adobe PDF
Chapter 5 System Development.pdf | Chapter 5 | 111.58 kB | Adobe PDF
Chapter 6 - The Versatile DP tool.pdf | Chapter 6 | 74.38 kB | Adobe PDF
Chapter 7 Conclusion.pdf | Chapter 7 | 51.08 kB | Adobe PDF
Table of Contents.pdf | Table of Contents | 24.25 kB | Adobe PDF
Appendix A Deploying DP tool - Procedures.pdf | Appendix A | 22.13 kB | Adobe PDF
Appendix B - use cases and diagrams.pdf | Appendix B | 45.29 kB | Adobe PDF
Appendix C Object-analysis Artefacts.pdf | Appendix C | 101 kB | Adobe PDF
Appendix D2 - Object Design Artefacts.pdf | Appendix D2 | 42.44 kB | Adobe PDF
Appendix D1 Object-design Artefacts.pdf | Appendix D1 | 35.53 kB | Adobe PDF
Appendix E - User Interface classes.pdf | Appendix E | 38.39 kB | Adobe PDF
Appendix F - stored procedures.pdf | Appendix F | 42.82 kB | Adobe PDF
Appendix G - Tool Usage.pdf | Appendix G | 60.89 kB | Adobe PDF
Appendix H - Organisation of Source Files.pdf | Appendix H | 29.48 kB | Adobe PDF
Appendix I - Comparison of Software Tools.pdf | Appendix I | 31.88 kB | Adobe PDF
Appendix J - DP Tool Evaluation.pdf | Appendix J | 23.88 kB | Adobe PDF
References.pdf | References | 21.15 kB | Adobe PDF
Cover Page.pdf | Cover | 13.33 kB | Adobe PDF
acknowledgement.pdf | Acknowledgement | 8.53 kB | Adobe PDF
An Abstract.pdf | Abstract | 12.82 kB | Adobe PDF
Chapter 1 introduction.pdf | Chapter 1 | 42.35 kB | Adobe PDF


This item is protected by original copyright




 

© Copyright 2008 DSpace Faculty of Computer Science and Information Technology, University of Malaya. All Rights Reserved.
DSpace@UM is powered by MIT - Hewlett-Packard.