In binary classification on an imbalanced data set, the model's performance cannot be judged correctly by accuracy alone. If one of the two classes has far fewer examples than the other, plain accuracy gives the smaller class almost no weight: with only 5% of the examples in the smaller class, a model that assigns every example to the larger class still reaches about 95% accuracy, even though it never detects the minority class. To deal with this, we can do the following:
* Use other metrics to evaluate model performance, such as precision, recall, and the F1 score (a sketch of these metrics follows the list).
* Resample the data, either by undersampling (reducing the number of examples from the larger class) or by oversampling (increasing the number of examples from the smaller class through repetition, SMOTE, and similar techniques); see the resampling sketch below.
* Use K-fold cross-validation (see the cross-validation sketch below).
* Use ensemble learning in which each decision tree is trained on the entire smaller class but only a subset of the larger class (see the ensemble sketch below).
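
The sketch below (using scikit-learn on synthetic data, which is my own assumption for illustration) shows why accuracy is misleading here: a classifier that always predicts the larger class scores about 95% accuracy on a 95/5 split, while precision, recall, and F1 for the smaller class are all zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

rng = np.random.default_rng(0)
y_true = rng.choice([0, 1], size=1000, p=[0.95, 0.05])  # 1 = the rare class
y_pred = np.zeros_like(y_true)                          # always predict the majority class

print("accuracy :", accuracy_score(y_true, y_pred))                    # ~0.95, looks fine
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0 on the rare class
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```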
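
A minimal resampling sketch, assuming the third-party imbalanced-learn (imblearn) package is installed: SMOTE oversamples the smaller class with synthetic examples, while RandomUnderSampler shrinks the larger class.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Synthetic data with a 95/5 class split
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("original     :", Counter(y))

# Oversample the smaller class with synthetic examples (SMOTE)
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE  :", Counter(y_over))

# Undersample the larger class instead
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled :", Counter(y_under))
```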
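
For cross-validation on imbalanced data, a stratified K-fold split keeps the class ratio in every fold; the sketch below (scikit-learn, synthetic data assumed) also scores on F1 rather than accuracy so the smaller class stays visible in the estimate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Each of the 5 folds preserves the 95/5 class proportions
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1")
print("per-fold F1:", scores.round(3))
```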
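
A sketch of the ensemble idea, again assuming imbalanced-learn: BalancedRandomForestClassifier trains each tree on a balanced bootstrap, roughly the whole smaller class plus an equally sized sample drawn from the larger class.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.ensemble import BalancedRandomForestClassifier

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Each tree in the forest sees a balanced resample of the training data
clf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)

# Report precision/recall/F1 per class rather than accuracy alone
print(classification_report(y_te, clf.predict(X_te)))
```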