Description
Hi,
I'm volunteering for the JASP team, and while looking at the Machine Learning modules, I noted two things that I think are worth discussing:
- Currently, the machine learning classification analyses do not (as far as I can see) include any functionality for balancing the dataset based on the ratio of target labels/classes. I think this would be a useful feature to add, because the model's performance on an under-represented class can be worse than on the over-represented class (e.g., when a relatively simple model is used and the features are noisy). If the validation data share the same imbalance (which they almost certainly do when sampled randomly from an imbalanced dataset), the averaged evaluation metrics will hide this bias and can be misleading unless one also inspects the per-class performance metrics. Adding an option to balance the label distribution in both the training and validation data could also be educational: one could watch the per-class prediction accuracy change while the average accuracy barely moves (see the first sketch after this list).
- In the most extreme case of this bias (say, out of labels A and B, label B gets 0 correct predictions), the table produced by Model Performance reports the same accuracy for both label A and label B. The accuracy for label B should instead be 0, since its recall = TP / (TP + FN) = 0 / (0 + FN) = 0 (see the second sketch after this list).
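To make the first point concrete, here is a minimal sketch in Python using scikit-learn (outside JASP; the dataset, model, and upsampling scheme are invented for illustration, not JASP's internal code). It fits the same classifier on the raw imbalanced training data and on a version where the minority class is upsampled, then compares average accuracy with per-class recall:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Imbalanced, noisy two-class problem: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           weights=[0.9, 0.1], flip_y=0.05, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

def report(name, model):
    pred = model.predict(X_te)
    print(f"{name}: accuracy={accuracy_score(y_te, pred):.2f}, "
          f"per-class recall={np.round(recall_score(y_te, pred, average=None), 2)}")

# Fit on the imbalanced training data as-is.
report("imbalanced", LogisticRegression(max_iter=1000).fit(X_tr, y_tr))

# Upsample the minority class so both labels are equally represented.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=(y_tr == 0).sum(), random_state=1)
X_bal = np.vstack([X_tr[y_tr == 0], X_up])
y_bal = np.concatenate([y_tr[y_tr == 0], y_up])
report("balanced", LogisticRegression(max_iter=1000).fit(X_bal, y_bal))
```

The point of the comparison is that the average accuracy can stay roughly flat while the minority-class recall changes substantially, which is exactly the contrast such a feature could make visible.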
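And for the second point, a small worked example of the extreme case (hypothetical counts: 95 observations of label A, 5 of label B, and a model that predicts A everywhere, so B gets 0 correct predictions):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

y_true = np.array(["A"] * 95 + ["B"] * 5)   # 5 B's in the data
y_pred = np.array(["A"] * 100)              # model never predicts B

print(confusion_matrix(y_true, y_pred, labels=["A", "B"]))
# [[95  0]
#  [ 5  0]]  -> for B: TP = 0, FN = 5

print(accuracy_score(y_true, y_pred))       # 0.95 -- the average looks fine
print(recall_score(y_true, y_pred, average=None, labels=["A", "B"]))
# [1. 0.]   -- recall_B = 0 / (0 + 5) = 0, which is what the table should show
```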
Please see the attached image.