The Effect of Adaptive Synthetic and Information Gain on C4.5 and Naive Bayes in Imbalance Class Dataset

Mulia Sulistiyono, Lucky Adhikrisna Wirasakti, Yoga Pristyanto

Abstract


Class imbalance is a serious problem in classification because the class distribution is heavily skewed. When the majority class dominates the dataset, minority-class instances are prone to misclassification. A further problem in classification and clustering is high dimensionality, which can degrade the performance of classification algorithms in both computation and accuracy. In this study, class imbalance was handled with the ADASYN k-NN resampling technique, and feature selection was performed with Information Gain. The evaluation results indicate that resampling improves the classification model by raising its geometric mean. Feature selection makes the data easier to interpret with fewer features, but it can reduce accuracy. Overall, the results showed that applying ADASYN k-NN together with Information Gain could increase the accuracy and geometric mean scores of Decision Tree C4.5 and Naive Bayes. In future work, the proposed method will be tested on multiclass imbalanced datasets.
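The two quantities named in the abstract, Information Gain and the geometric mean, can be illustrated with a minimal sketch. It is not the authors' implementation and it does not perform ADASYN resampling. It assumes discrete feature values, a binary problem, and the common definition of the geometric mean as the square root of sensitivity times specificity. The toy data and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def geometric_mean(y_true, y_pred, positive=1):
    """Assumed binary definition: sqrt(sensitivity * specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical toy data: 8 majority-class (0) and 2 minority-class (1) samples.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
informative = ["a", "a", "a", "a", "a", "a", "a", "a", "b", "b"]
uninformative = ["x", "y", "x", "y", "x", "y", "x", "y", "x", "y"]

ig_informative = information_gain(informative, labels)
ig_uninformative = information_gain(uninformative, labels)

# A classifier that always predicts the majority class.
always_majority = [0] * len(labels)
accuracy = sum(t == p for t, p in zip(labels, always_majority)) / len(labels)
g_mean = geometric_mean(labels, always_majority)

print(f"IG informative:   {ig_informative:.4f}")
print(f"IG uninformative: {ig_uninformative:.4f}")
print(f"accuracy: {accuracy:.2f}, geometric mean: {g_mean:.2f}")
```

On this toy data the majority-only classifier reaches high accuracy while its geometric mean collapses to zero, which is one reason the geometric mean is often reported alongside accuracy on imbalanced data. Ranking features by Information Gain would keep the first feature and discard the second.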

Keywords


ADASYN; Information Gain; Imbalanced Class; Feature Selection; High Dimensional Dataset


DOI: https://doi.org/10.30630/ijasce.4.1.70

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
