Methods for Data Preprocessing

Question: - Describe the various methods for data preprocessing.
                   Briefly discuss the various data preprocessing techniques.

Answer: - Data preprocessing is an important issue for both data warehousing and data mining, as real-world data tend to be incomplete, noisy, and inconsistent. The main forms of data preprocessing are as follows:

1.) Data Cleaning : - Data cleaning routines work to 'clean' the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining applied to them. Furthermore, dirty data can confuse the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, these procedures are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled.
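
As a minimal sketch of two of the cleaning steps named above, the following Python fills missing values and removes outliers. The choice of the median as the fill value and the 1.5 x IQR rule for outliers are assumptions made for illustration, as are the function names and the sample ages; the notes do not prescribe a particular method.

```python
from statistics import median, quantiles


def fill_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def drop_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the data
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Illustrative data: one missing age and one implausible age.
ages = [23, 25, None, 27, 24, 120, 26]
filled = fill_missing(ages)
cleaned = drop_outliers(filled)
print(cleaned)
```

The median is used here rather than the mean because a single extreme value (such as 120) would pull the mean upward and distort the filled-in value.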

2.) Data Integration : - Data integration is a preprocessing step in which data from multiple sources are combined for analysis. This may involve integrating multiple databases, data cubes, or files. Attributes representing a given concept may have different names in different databases, causing inconsistencies and redundancies. Naming inconsistencies may also occur in attribute values. For example, the same first name could be registered as "Bill" in one database, "William" in another, and "B" in a third. Having a large amount of redundant data may slow down or confuse the knowledge discovery process.
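
The integration problem described above can be sketched roughly as follows. The alias tables, the attribute names (`customer_id`, `cust_id`, `fname`), and the sample records are all hypothetical; in practice such mappings would come from metadata or manual review rather than a hard-coded dictionary.

```python
# Hypothetical alias tables mapping source-specific names to one canonical form.
COLUMN_ALIASES = {"cust_id": "customer_id", "fname": "first_name"}
NAME_ALIASES = {"bill": "William", "b": "William", "william": "William"}


def normalize(record):
    """Rename attributes and attribute values to their canonical forms."""
    out = {}
    for key, value in record.items():
        key = COLUMN_ALIASES.get(key, key)
        if key == "first_name":
            lookup = value.strip().rstrip(".").lower()
            value = NAME_ALIASES.get(lookup, value)
        out[key] = value
    return out


def integrate(*sources):
    """Merge records from several sources, keyed on customer_id."""
    merged = {}
    for source in sources:
        for record in source:
            rec = normalize(record)
            merged.setdefault(rec["customer_id"], {}).update(rec)
    return list(merged.values())


# Three illustrative sources describing the same customer differently.
db_a = [{"customer_id": 1, "first_name": "Bill"}]
db_b = [{"cust_id": 1, "fname": "William", "city": "Delhi"}]
db_c = [{"cust_id": 1, "fname": "B"}]
print(integrate(db_a, db_b, db_c))
```

After normalization the three records collapse into one, which removes the redundancy that would otherwise slow down or confuse later mining.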

3.) Data Transformation : - Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.
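
One common way to normalize an attribute into the range 0.0 to 1.0 is min-max normalization, sketched below. The function name and the sample values are assumptions for illustration; other schemes (for example z-score normalization) serve the same purpose.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map them all to the lower bound.
        return [new_min for _ in values]
    return [
        (v - lo) / (hi - lo) * (new_max - new_min) + new_min
        for v in values
    ]


# Illustrative attribute values.
print(min_max_normalize([20, 35, 50]))
```

The smallest value maps to the lower bound, the largest to the upper bound, and everything else keeps its relative position in between.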

4.) Data Reduction : - Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. There are a number of strategies for data reduction. These include data aggregation (e.g., building a data cube), attribute subset selection (e.g., removing irrelevant attributes through correlation analysis), dimensionality reduction (e.g., using encoding schemes), and numerosity reduction (e.g., replacing the data by alternative, smaller representations such as clusters or parametric models). Data can also be reduced by generalization with the use of concept hierarchies, where low-level concepts are replaced with higher-level concepts.
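
Two of these strategies, generalization through a concept hierarchy and aggregation, can be sketched together as below. The city-to-country hierarchy, the attribute names, and the sales figures are hypothetical and only serve to show how the number of rows shrinks while the totals are preserved.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: low-level city -> higher-level country.
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Toronto": "Canada"}


def generalize(rows, hierarchy, attribute):
    """Replace low-level values of one attribute with higher-level concepts."""
    return [
        {**row, attribute: hierarchy.get(row[attribute], row[attribute])}
        for row in rows
    ]


def aggregate(rows, group_by, measure):
    """Sum a measure over each combination of the group_by attributes."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[attr] for attr in group_by)
        totals[key] += row[measure]
    return dict(totals)


# Illustrative sales records at city level.
sales = [
    {"location": "Delhi", "year": 2023, "sales": 100.0},
    {"location": "Mumbai", "year": 2023, "sales": 150.0},
    {"location": "Toronto", "year": 2023, "sales": 80.0},
]
by_country = aggregate(
    generalize(sales, CITY_TO_COUNTRY, "location"),
    group_by=("location", "year"),
    measure="sales",
)
print(by_country)
```

Three city-level rows reduce to two country-level totals, which is the same idea a data cube applies across many dimensions at once.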

This post first appeared on BCA MCA NOTES.
