Model Evaluation Metrics For Binary Classification
Model evaluation is a crucial part of any machine learning workflow. In this article, we take a practical, hands-on approach to several evaluation metrics for binary classification, including accuracy, precision, recall, the F1-score, and the ROC-AUC curve.
Introduction
Classification is a type of supervised machine learning that has many real-world applications, such as detecting spam emails, predicting fraud, and predicting medical conditions.
Many industries rely on machine learning to solve classification problems in real time, which raises questions about the accuracy and reliability of the models being deployed.
Model evaluation is the process of answering these questions and ensuring that the predictions being made are accurate and won’t have negative consequences for the business. In this article, we will focus on model evaluation for binary classification problems.
By the end of this tutorial, you will be able to:
- Understand at least 5 evaluation metrics for classification problems
- Apply the evaluation metrics to a dataset
- Know why certain classification metrics are preferred over others
Prerequisites
You’ll need the following:
- A fair knowledge of the Python programming language
- Basic knowledge of the Scikit-learn library for machine learning
- Jupyter Notebook installed on your machine, or access to Google Colab
What is an Evaluation Metric?
Have you ever thought about why students take exams or tests after receiving lectures or training in a particular subject? This is because the teacher wants to see how well the student can apply the knowledge to real-life problems. That’s what evaluation metrics are all about in the context of machine learning.
Evaluation metrics are ways of measuring the performance of a machine learning model by comparing its predictions against the known labels of a test dataset (data that was not used to train the model).
Model evaluation is an important part of the machine learning process. In this tutorial, we will delve into the evaluation metrics commonly used in binary classification, understand the reasoning behind them, and apply them to a Kaggle dataset.
Let’s consider various components of evaluation metrics for binary classification models.
True Positive (TP): This refers to a situation where the model correctly predicts the positive class. For example, correctly predicting that a patient has a certain medical condition.
True Negative (TN): This refers to a situation where the model correctly predicts the negative class. For example, correctly predicting that a patient does not have a certain medical condition.
False Negative (FN): This refers to a situation where the model incorrectly predicts the negative class. For example, predicting that a patient does not have a certain medical condition when they actually do.
False Positive (FP): This refers to a situation where the model incorrectly predicts the positive class. For example, predicting that a patient has a certain medical condition when they actually do not.
True Positive Rate (TPR): TPR, also known as recall or sensitivity, is the ratio of true positive (TP) predictions to the total number of actual positives (TP + FN).
True Negative Rate (TNR): TNR, also known as specificity, is the ratio of true negative (TN) predictions to the total number of actual negatives (TN + FP).
False Positive Rate (FPR): FPR, also known as the Type I error rate, is the ratio of false positive (FP) predictions to the total number of actual negatives (FP + TN).
False Negative Rate (FNR): FNR, also known as the Type II error rate, is the ratio of false negative (FN) predictions to the total number of actual positives (FN + TP).
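To make these definitions concrete, here is a minimal sketch that computes the four rates directly from raw counts. The counts below are purely illustrative; they simply match the confusion matrix we will see later in this article.

```python
# Illustrative counts; they match the confusion matrix shown later in this article.
TP, TN, FP, FN = 59, 110, 1, 4

tpr = TP / (TP + FN)   # recall / sensitivity
tnr = TN / (TN + FP)   # specificity
fpr = FP / (FP + TN)   # Type I error rate
fnr = FN / (FN + TP)   # Type II error rate

print(f"TPR={tpr:.3f}  TNR={tnr:.3f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```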
Before diving into the evaluation metrics, let's build the model we will be evaluating.
Data Pre-processing and Model Building
To summarize, the code in this tutorial predicts the breast cancer status of a patient: either benign or malignant. The Scikit-learn library is used for basic data preprocessing; the target variable, recorded as M (malignant) or B (benign), is converted to numeric values with label encoding.
The dataset is then split into train and test sets so the model can be evaluated on data it has never seen, which also helps reveal overfitting. The metrics used in this process are discussed below.
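The original code for this step is not reproduced here, so the snippet below is a minimal sketch of what the preprocessing and model-building stage might look like. It assumes the Kaggle Breast Cancer Wisconsin (Diagnostic) dataset saved as data.csv with a diagnosis column, and it uses logistic regression as an example classifier; the article itself does not prescribe a specific model.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Breast Cancer Wisconsin (Diagnostic) dataset downloaded from Kaggle.
# The file name and column names are assumptions about the CSV layout.
df = pd.read_csv("data.csv")

# Encode the target: B (benign) -> 0, M (malignant) -> 1.
le = LabelEncoder()
df["diagnosis"] = le.fit_transform(df["diagnosis"])

# Separate features and target, dropping non-feature columns if they exist.
X = df.drop(columns=["diagnosis", "id", "Unnamed: 32"], errors="ignore")
y = df["diagnosis"]

# Hold out a test set so the model is evaluated on data it has never seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Any classifier could be used here; logistic regression keeps the example simple.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```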
Evaluation Metrics
1. Accuracy Score: The accuracy score is the ratio of correct predictions to the total number of predictions made by the model. However, accuracy on its own can be misleading in real-world applications, particularly when the dataset is imbalanced.
For example, consider an imbalanced cancer dataset with 700 malignant cases and only 100 benign ones. A model that simply predicts "malignant" for every patient achieves 87.5% accuracy while never correctly identifying a single benign case.
Now, let’s demonstrate the practical usage of this metric on the dataset obtained from Kaggle.
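Assuming the model and test split from the sketch above, accuracy can be computed with Scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Fraction of test-set predictions that match the true labels.
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
```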
Our model achieves an accuracy of 0.97, or 97%, according to the accuracy score metric.
2. Confusion Matrix: The Confusion Matrix (CM) is a visual representation of the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values of the predicted and actual classes.
CM is a useful tool for evaluating the performance of a binary or multi-class classification model, as it provides a clear breakdown of the model’s predictions.
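Continuing with the same y_test and y_pred from the sketch above, Scikit-learn can both compute and plot the confusion matrix:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Rows correspond to the actual classes, columns to the predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Render the matrix as a labelled plot for easier reading.
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Benign", "Malignant"]).plot()
plt.show()
```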
For example, in the confusion matrix above, 110 instances were predicted as false and were actually false (TN), 1 instance was predicted as true but was actually false (FP), 4 instances were predicted as false but were actually true (FN), and 59 instances were predicted as true and were actually true (TP).
3. Precision: Precision measures the proportion of predicted positive cases that are actually positive. It is calculated by dividing the number of true positives (TP) by the sum of true positives and false positives (TP + FP). For example, if the model predicts that 10 people are pregnant and only 8 of them actually are, the precision is 80%.
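Using the same predictions as before, precision can be computed with precision_score:

```python
from sklearn.metrics import precision_score

# Precision = TP / (TP + FP): how many predicted positives were truly positive.
print(f"Precision: {precision_score(y_test, y_pred):.3f}")
```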
The precision of this model is approximately 98.3%.
4. Recall: Recall measures the proportion of actual positive cases that the model correctly identified. It is the ratio of true positives (TP) to all actual positives (TP + FN), i.e., the positives the model caught plus the ones it missed. For example, if a model correctly identifies 9 out of 10 positive cases, its recall is 90%.
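Recall is computed analogously with recall_score:

```python
from sklearn.metrics import recall_score

# Recall = TP / (TP + FN): how many of the actual positives the model found.
print(f"Recall: {recall_score(y_test, y_pred):.3f}")
```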
The recall of this model is approximately 93.7%.
5. F1-Score: The F1 score is the harmonic mean of precision and recall. It is a commonly used metric for evaluating binary classification models, particularly when the classes are imbalanced. The highest possible value of the F1 score is 1, which indicates perfect precision and recall.
The F1 score is more informative than the accuracy score because it takes into account both false negatives and false positives. In real-world applications, it is often important to consider both types of errors.
For example, in the case of predicting cancer diagnosis, a false negative (predicting a patient does not have cancer when they actually do) can have serious consequences, while a false positive (predicting a patient has cancer when they do not) can also cause unnecessary anxiety and further testing. The F1 score allows us to consider both types of errors in the evaluation of our model.
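With the same predictions as before, the F1 score is computed with f1_score:

```python
from sklearn.metrics import f1_score

# F1 = 2 * (precision * recall) / (precision + recall)
print(f"F1 score: {f1_score(y_test, y_pred):.3f}")
```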
To cut the story short, our model's F1 score is approximately 0.959, or 95.9%.
6. AUC-ROC Curve: The Receiver Operating Characteristic (ROC) curve evaluates the performance of a binary classification model by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds.
The closer the curve hugs the top-left corner (high TPR at a low FPR), the better the model performs.
The area under the curve (AUC) summarizes the ROC curve in a single number between 0 and 1. A value of 1 indicates perfect separation of the positive and negative classes, while a value of 0 means the predictions are completely inverted: every positive is predicted as negative and vice versa.
A value of 0.5 indicates that the model performs no better than random guessing. As a rule of thumb, an AUC between 0.8 and 1 is generally considered good.
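The sketch below assumes the classifier exposes predict_proba (logistic regression does); the AUC and ROC curve are computed from predicted probabilities rather than hard class labels:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, RocCurveDisplay

# Use predicted probabilities for the positive class, not hard 0/1 labels.
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, y_proba):.3f}")

# Plot the ROC curve: TPR against FPR at every classification threshold.
RocCurveDisplay.from_predictions(y_test, y_proba)
plt.show()
```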
The ROC AUC score of this model is approximately 0.96, or 96%.
Conclusion
In this tutorial, you learned how to use various evaluation metrics with real data. You now have a good understanding of the different evaluation metrics, including their purpose and how to use them. The full code for this article can be found on my GitHub repository.
Thank you for reading, and I hope to see you in my next article. Bye!