Feature Extraction and Classification On Single Nucleotide Polymorphism

Nur Fatihah Kamarudin (1), Zuraini Ali Shah (2), Mohd Farhan Md Fudzee (3), Shahreen Kasim (4)
(1) Universiti Teknologi Malaysia, Johor
(2) Universiti Tun Hussein Onn Malaysia, Batu Pahat, 86400, Johor
(3) Universiti Tun Hussein Onn Malaysia, Batu Pahat, 86400, Johor
(4) Universiti Tun Hussein Onn Malaysia, Batu Pahat, 86400, Johor
How to cite (IJASEIT) :
Kamarudin, N. F., Ali Shah, Z., Md Fudzee, M. F., & Kasim, S. (2019). Feature Extraction and Classification On Single Nucleotide Polymorphism. International Journal of Advanced Science Computing and Engineering, 1(2), 85–90. https://doi.org/10.62527/ijasce.1.2.6
The Malay population in Peninsular Malaysia can be divided into eight sub-ethnic groups: Malay Bugis, Malay, Malay Champa, Malay Jawa, Malay Kelantan, Malay Kedah, Malay Minang and Malay Pattani. Ancestry informative markers (AIMs) can be used to represent these eight sub-ethnic groups. In this research, single nucleotide polymorphism (SNP) datasets of the eight sub-ethnic groups are analysed in order to obtain the AIMs for the Malay population in Peninsular Malaysia. However, the datasets may contain outliers, missing data and redundancy that can reduce the accuracy of the results. Data pre-processing is therefore an important step for removing these problems. Iterative pruning principal component analysis (ipPCA) is a technique commonly used in the analysis of genome datasets to extract information. It can be applied to highly structured data, can improve the resolution of the data, and is also used to resolve sub-population structure. Random Forest and Hidden Naïve Bayes are used to classify the SNPs that can serve as AIMs. Information Gain Ratio then ranks the chosen AIMs based on the value of each attribute.
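The ranking step can be illustrated with a minimal sketch of the information gain ratio, written in plain Python. It is not the authors' implementation. The genotype coding (0, 1, 2 for the number of minor alleles), the SNP names `snp_a`, `snp_b`, `snp_c`, the toy labels, and the helper names `entropy`, `gain_ratio` and `rank_snps` are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by split information."""
    n = len(labels)
    # Group the class labels by the genotype value observed for each sample.
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature)
    # A monomorphic SNP has zero split information; score it as 0.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_snps(genotypes, labels):
    """Return (snp_name, gain_ratio) pairs sorted from most to least informative."""
    scores = {name: gain_ratio(col, labels) for name, col in genotypes.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: four samples from two sub-ethnic groups.
labels = ["Bugis", "Bugis", "Jawa", "Jawa"]
genotypes = {
    "snp_a": [0, 0, 2, 2],  # separates the two groups perfectly
    "snp_b": [0, 1, 0, 1],  # varies, but independently of the group
    "snp_c": [1, 1, 1, 1],  # monomorphic, carries no information
}
ranking = rank_snps(genotypes, labels)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy setting `snp_a` ranks first because its genotype fully determines the group label, while the other two SNPs score zero. Dividing by the split information penalises attributes with many distinct values, which is the usual motivation for preferring gain ratio over raw information gain.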