Comparative Study of Machine Learning Models on Multiple Breast Cancer Datasets

Md. Arman Hussain Sujon, Hossen Mustafa


Breast carcinoma is one of the most feared and most frequently occurring cancers among women. It affects roughly 10% of women worldwide at some point in their lives. Although treatment is available, it is far less effective when the disease is not identified at an early stage. Conventional medical tests, such as roentgenography, breast ultrasound, and biopsy, are generally used to identify breast cancer. As an alternative, researchers are exploring machine learning techniques for classifying tumours into classes such as benign and malignant. Classification and data processing strategies can be effective mechanisms for predicting cancer. In this paper, we analyze six classification models (Decision Tree, K-Nearest Neighbours, Random Forest, Logistic Regression, Extra Trees, and Support Vector Machine) on three different datasets. We applied principal component analysis (PCA) to reduce the dimensionality of the datasets. Experimental results show that Random Forest obtained the best accuracy, recall, and F1 score among the six classification techniques on all three datasets. We also find that the data attributes and their values are important for accurate classification.
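The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: it assumes scikit-learn, uses the Wisconsin Diagnostic dataset bundled with that library as a stand-in (the abstract does not name the three datasets), and the function name `compare_models`, the number of PCA components, the feature scaling step, the default hyperparameters, and the 5-fold cross-validation are all illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

METRICS = ("accuracy", "recall", "f1")


def compare_models(n_components=5, seed=0):
    """Cross-validate six classifiers after scaling and PCA (illustrative only)."""
    data = load_breast_cancer()  # stand-in dataset, not necessarily one used in the paper
    X = data.data
    # The bundled dataset encodes malignant as 0; flip it so that
    # recall and F1 are reported for the malignant class.
    y = 1 - data.target

    # Hyperparameters are library defaults (assumed, not taken from the paper).
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=seed),
        "K-Nearest Neighbours": KNeighborsClassifier(),
        "Random Forest": RandomForestClassifier(random_state=seed),
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Extra Trees": ExtraTreesClassifier(random_state=seed),
        "Support Vector Machine": SVC(),
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        # Scaling and PCA sit inside the pipeline so they are fitted
        # on the training folds only, avoiding leakage into test folds.
        pipeline = make_pipeline(StandardScaler(), PCA(n_components=n_components), model)
        scores = cross_validate(pipeline, X, y, cv=cv, scoring=METRICS)
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in METRICS}
    return results


if __name__ == "__main__":
    for name, metrics in compare_models().items():
        summary = ", ".join(f"{m}={v:.3f}" for m, v in metrics.items())
        print(f"{name}: {summary}")
```

Rankings produced by this sketch need not match those reported in the paper, since the datasets, preprocessing, and model settings may differ.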


Classification; breast cancer prediction; data science


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


