Chinese abstract |
Cross-project defect prediction (CPDP) is a field of study in which researchers build a general model from the data of existing projects to predict defects in other projects. However, differences between the data distributions of source and target projects affect classifier performance. To enable effective cross-project defect prediction, we propose a comprehensive model that combines data preprocessing with classifier transfer to construct a better classification space and strengthen classifier performance. In the preprocessing step, a baseline is calculated for each dataset from its non-defective samples, based on the distance to all other non-defective samples, and the data are then transformed using a rank function. In the transfer step, a genetic algorithm and ensemble learning are used to extract an effective representation from the source projects and to boost the capability of weak classifiers. We use Naive Bayes, Support Vector Machine, and Classification and Regression Trees as classifiers and apply the model to five open-source projects (one Apache and four Eclipse projects) and the NASA MDP dataset. Selecting accuracy as the fitness function in the genetic algorithm improves classification performance. The proposed model yields results similar to those of within-project models while obtaining higher precision. It also outperforms state-of-the-art methods for cross-project defect prediction. These results show that our model makes it possible to train classifiers using more samples from different projects. |