A very serious guide for decoding model accuracy, precision, recall, and F1

[Image: robots running in a colosseum]
how to know if you're winning, what to change if you're not

Okay I lied, this is only a mildly serious guide. With that disclosure out of the way and our conscience clear, let's dive into the age-old question for the umpteenth time: accuracy, precision, recall, or F1? Which one should you choose to track the performance of your AI model? The answer is, frustratingly and always, that it depends. But don't worry, we're here to guide you through the jungle of metrics and help you make the right choice.

| Metric | What it is | What it measures |
| --- | --- | --- |
| Accuracy | Number of correct predictions divided by total number of predictions | How often the model is correct |
| Precision | Number of true positive predictions divided by the sum of true positive and false positive predictions | Proportion of positive predictions that are actually correct |
| Recall | Number of true positive predictions divided by the sum of true positive and false negative predictions | Proportion of actual positive cases that were correctly predicted |
| F1 | Harmonic mean of precision and recall | Composite metric balancing precision and recall |
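
If you prefer code to prose, here's a minimal sketch of those same four definitions, computed by hand from confusion-matrix counts. The counts in the example are made up purely for illustration.

```python
# Compute the four metrics from the table directly from raw counts.
# tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.

def compute_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total                         # correct / all predictions
    precision = tp / (tp + fp)                           # correct positives / predicted positives
    recall = tp / (tp + fn)                              # correct positives / actual positives
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy example: 80 true positives, 10 false positives, 20 false negatives, 890 true negatives
print(compute_metrics(tp=80, fp=10, fn=20, tn=890))
# {'accuracy': 0.97, 'precision': 0.888..., 'recall': 0.8, 'f1': 0.842...}
```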

Now, why is picking your north star metric not a simple and obvious choice? Why must the diplomatic "it depends" rear its head again here? As with most things, the pesky issue is that "best" is an abstraction: it depends on the stakes of your application and the shape of your data. Let's see:

Sometimes, accuracy is a great metric to use. For example, if you're building a model to predict whether it's going to rain tomorrow, accuracy is probably the way to go: if the model predicts rain and it doesn't rain, or predicts no rain and it does, the consequences are not severe. However, if you're building a model to predict whether a patient has a certain disease, the consequences of false positive or false negative predictions can be much more severe. In this case, you want to look at precision, recall, or F1, so you know not just how often the model is right, but what kinds of errors it's making when it's wrong.
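
To make that trade-off concrete, here's a little sketch of ours (not a canonical recipe) comparing two hypothetical disease classifiers on the same toy labels: a "cautious" one that rarely flags anyone, and a "trigger-happy" one that flags liberally. It assumes scikit-learn is installed; the labels are invented.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = has the disease, 0 = healthy (toy data: 10 patients, 4 actually sick)
y_true          = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_cautious      = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses 3 sick patients
y_trigger_happy = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]  # catches everyone, 3 false alarms

for name, y_pred in [("cautious", y_cautious), ("trigger-happy", y_trigger_happy)]:
    print(name,
          "acc =", accuracy_score(y_true, y_pred),
          "precision =", round(precision_score(y_true, y_pred), 2),
          "recall =", round(recall_score(y_true, y_pred), 2),
          "f1 =", round(f1_score(y_true, y_pred), 2))
# cautious       acc = 0.7  precision = 1.0   recall = 0.25  f1 = 0.4
# trigger-happy  acc = 0.7  precision = 0.57  recall = 1.0   f1 = 0.73
```

Both models score the same accuracy, but they fail in completely different ways, which is exactly the information precision and recall surface and accuracy hides.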

Ooh, ooh, look, another intricacy is rearing its head. It's amazing to see it in the wild. Nature, man. This one is called "imbalanced datasets". If your dataset is imbalanced, accuracy can be a misleading metric. For example, if you have a dataset of 99% negative cases and 1% positive cases, a model that always predicts negative will have an accuracy of 99%. But clearly, this is not a useful model. In these cases, precision, recall, or F1 may be a better choice. That said, if you find yourself wrangling with a dataset that's imbalanced to that degree, you're probably better served throwing it in the trash and hitting us up at Coldpress AI to find something better.
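
Here's that 99%/1% scenario played out in a few lines, again assuming scikit-learn. The "model" simply predicts negative for everyone, and the non-accuracy metrics immediately give it away.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0] * 990 + [1] * 10   # 99% negative, 1% positive
y_pred = [0] * 1000             # the laziest possible model: always predict negative

# zero_division=0 silences the warning when the model never predicts a positive
print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.99
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("f1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```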

So, what's the verdict? Well, it's like I said at the beginning: it depends. But, in general, if you're building a model for a low-stakes application, accuracy is probably fine. If you're building a model for a high-stakes application, you may want to use precision, recall, or F1. And if your dataset is imbalanced, be especially cautious about leaning on accuracy alone.

One more interesting thing is the issue of stratification in F1. Models typically serve multiple use cases with differing levels of business importance, so a model with a 0.95 F1 score overall can still vary wildly between use cases in terms of end-user satisfaction; a rough sketch of what per-use-case scoring looks like follows below. This stratification is a topic of intense research that we're both doing and keeping tabs on. Sign up here to stay in the loop!
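
As an illustration only (the use-case names and labels below are invented, and this is just one simple way to slice an evaluation set), per-use-case scoring can look like computing F1 for each segment instead of a single global number:

```python
from sklearn.metrics import f1_score

# (use_case, true_label, predicted_label) — toy data for illustration
examples = [
    ("search",  1, 1), ("search",  1, 1), ("search",  0, 0), ("search",  1, 0),
    ("support", 1, 0), ("support", 1, 0), ("support", 0, 0), ("support", 1, 1),
]

# Group true/predicted labels by use case
by_case = {}
for case, yt, yp in examples:
    by_case.setdefault(case, ([], []))
    by_case[case][0].append(yt)
    by_case[case][1].append(yp)

overall_true = [yt for _, yt, _ in examples]
overall_pred = [yp for _, _, yp in examples]
print("overall  F1:", round(f1_score(overall_true, overall_pred), 2))   # 0.67
for case, (yt, yp) in by_case.items():
    print(f"{case:8s} F1:", round(f1_score(yt, yp), 2))                  # search 0.8, support 0.5
```

The single overall number looks respectable while one use case quietly underperforms the other, which is the whole point of stratifying.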