English Abstract |
The integrity and consistency of data substantially influence the results of big data analytics. Data cleansing is therefore often performed before analysis to maintain these qualities in the input data and to ensure that the results are not distorted by data anomalies. A key goal of data cleansing is to preserve data integrity. Missing values in the collected data are the main factor undermining such integrity; they often result from human negligence or machine malfunction during data collection. Common methods for addressing this problem include discarding records that contain missing values or substituting the missing values with measures of central tendency, such as means or medians. Because these methods cannot detect relationships among the input data, they may predict missing values incorrectly, and the outcomes of subsequent analyses may in turn be incorrect. In this study, we used machine learning techniques to handle data containing missing values for a single attribute. We used a data set without missing values as the training data and clustered it using the k-means algorithm. A prediction model was then built for each cluster from the data assigned to it. The k-nearest neighbor algorithm was used to determine the cluster to which each record with a missing value belonged, and the model of that cluster was used to compute the missing value. We compared the root-mean-square error of our models with that of other commonly used models in simulations, and the results revealed that our models were more accurate. |
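
The pipeline described in the abstract (k-means on complete records, one prediction model per cluster, k-nearest-neighbor assignment of incomplete records, then cluster-specific prediction) can be illustrated with a minimal sketch. Several details here are assumptions rather than the study's method: the abstract does not name the per-cluster prediction model, so ordinary least squares on a single observed attribute is swapped in; the farthest-point initialisation, the toy data, and all function names (`kmeans`, `fit_line`, `build_imputer`, `rmse`) are hypothetical and chosen only to keep the example deterministic and dependency-free.

```python
from collections import Counter
from math import dist, sqrt


def kmeans(points, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialisation (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Recompute each centroid as the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - slope * mx, slope


def build_imputer(complete, n_clusters=2, n_neighbors=3):
    """complete: list of (x, y) records with no missing values; y is the attribute to impute."""
    labels = kmeans(complete, n_clusters)
    models = {}
    for j in set(labels):
        xs = [x for (x, _), lab in zip(complete, labels) if lab == j]
        ys = [y for (_, y), lab in zip(complete, labels) if lab == j]
        models[j] = fit_line(xs, ys)

    def impute(x):
        # k-NN on the observed attribute only, since y is missing for this record.
        nearest = sorted(range(len(complete)), key=lambda i: abs(complete[i][0] - x))
        votes = Counter(labels[i] for i in nearest[:n_neighbors])
        intercept, slope = models[votes.most_common(1)[0][0]]
        return intercept + slope * x

    return impute


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


# Toy data (hypothetical): two groups with different x-y relationships.
complete = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0),
            (10.0, 40.0), (11.0, 39.0), (12.0, 38.0), (13.0, 37.0), (14.0, 36.0)]
held_out = [(2.5, 6.0), (12.5, 37.5)]  # records whose y is treated as missing

impute = build_imputer(complete)
global_mean = sum(y for _, y in complete) / len(complete)

truth = [y for _, y in held_out]
print("cluster-based RMSE:", rmse([impute(x) for x, _ in held_out], truth))
print("mean-substitution RMSE:", rmse([global_mean] * len(held_out), truth))
```

The toy data are constructed so that a single global mean cannot capture the two distinct relationships, which is the weakness of central-tendency substitution that the abstract points to. The numbers it prints say nothing about the study's reported results.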