Accuracy, Recall, Precision, & F1-Score with Python

Max Grossman

Sep 25, 2023 · 12 min read

(All code can be found in the “Python Code” section at the bottom.)

In this document, we delve into the concepts of accuracy, precision, recall, and F1-Score, as they are frequently employed together and share a similar mathematical foundation.

But there are two critical aspects we won’t address:

  1. Acceptable Metric Values: This always leads to questions like, “Is an accuracy of 80% good?!” You tell me! Are you happy that your self-driving car stays within the lines 80% of the time? It’s subjective, and it depends on how you feel about driving on sidewalks.
  2. Metric Refinement: Fine-tuning these metrics can be challenging for several reasons:
  • First, the trade-offs between these metrics create a delicate balance where improving one often comes at the expense of another.
  • Second, class imbalance can complicate the tuning process, as optimizing for one class may negatively impact the other, particularly in imbalanced datasets.
  • Third, identifying the right features or patterns that contribute positively to these metrics requires domain expertise and extensive data analysis.
  • Finally, the impact of hyperparameter adjustments or threshold changes can be non-linear and may require iterative experimentation.
With that out of the way, here’s what we’ll cover:

  • Accuracy
  • Type 1 & 2 Errors
  • Precision
  • Recall
  • F1-Score
  • Python Code

Accuracy = (Num Correct Predictions) / (Num Total Predictions)

Accuracy is used to evaluate the performance of classification models, which involve predicting the correct label for input data. To use accuracy as a metric for a classification model, the dataset should be balanced, meaning there’s roughly an equal number of data points for each class. If the data is not balanced, we pivot towards precision, recall, and F1-scores. We’ll kick off with a straightforward classification example.
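As a quick sketch of the formula in code (the labels here are made up purely for illustration):

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0])  # true labels (illustrative)
y_pred = np.array([1, 0, 0, 1, 0])  # model predictions (illustrative)

# (Num Correct Predictions) / (Num Total Predictions)
print(np.mean(y_true == y_pred))       # 0.8
print(accuracy_score(y_true, y_pred))  # 0.8, the same ratio via scikit-learn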

Imagine you have a classification model you want to use for the data in the graph below, and there’s a new, previously unseen data point represented by the gold star. The question becomes, “Should this new data point be classified as belonging to Class A or Class B?”

[Figure: Class A and Class B clusters, with the new, unlabeled data point shown as a gold star]

In this scenario, our model predicts the new data point as Class B. After making this assignment, we can evaluate the performance of our supervised model by comparing the classification to the true label found in the test set. If the true label is B, we have an accuracy of 1.0; if not, our accuracy is 0.

Now, let’s create a classification model using arthritis data generated randomly with NumPy’s random number generator. Orange points represent patients with arthritis, and blue points represent those without arthritis.

[Figure: the randomly generated arthritis dataset; orange points are patients with arthritis, blue points are patients without]

We’ll first conduct an 80-20 train-test split with our data: 80% of the data will be used to train our classification model, and the remaining 20% will serve as our independent test set, playing the role of the gold star from the simple example. I show the test points for each class so you can see where the analogous gold stars lie (the red and yellow dots in the figure).

[Figure: the 80-20 train-test split, with the held-out test points for each class highlighted]

Our classification model is probabilistic: when applied to test observations, it estimates the probability that a given data point belongs to each of the two classes. In the figure below, I display the probability that each test point belongs to Class 1, although the model generates probabilities for both classes.

To make a definitive classification, we utilize a threshold of 0.5, which means if the calculated probability exceeds this threshold, the data point is assigned to Class 1.
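Here’s a minimal sketch of that thresholding step, assuming the fitted model and X_test from the “Python Code” section below (predict_proba with a 0.5 cutoff effectively reproduces what scikit-learn’s predict does by default for a binary model):

# Probability that each test point belongs to Class 1 (arthritis)
probs = model.predict_proba(X_test)[:, 1]

# Apply the 0.5 threshold to get hard class labels
y_pred = (probs > 0.5).astype(int)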

[Figure: the predicted probability of belonging to Class 1 for each test point]

Our model achieved an accuracy score of 95%, which means 1 in every 20 predictions was wrong.

[Figure: model output reporting an accuracy score of 0.95]

However, it’s important to know whether an incorrect prediction corresponds to a Type 1 or a Type 2 error.

We’re discussing classification here, but Type 1 & 2 errors are also central to hypothesis testing, so we’ll conclude this section with the hypothesis-testing terminology of alpha and beta.

Type 1 & 2 errors can be confusing because there are various ways to represent them, depending on the context and the field of application. Here are some common representations and analogies:

Legal Analogy:

Type I Error: Convicting an innocent person (False Positive).

Type II Error: Acquitting a guilty person (False Negative).

Decision Making:

Type I Error: Acting when you shouldn’t have (False Positive).

Type II Error: Not acting when you should have (False Negative).

Fire Alarm Analogy:

Type I Error: The fire alarm goes off, but there’s no fire (False Alarm).

Type II Error: There’s a fire, but the fire alarm doesn’t go off (Missed Detection).

In classification tasks, a Type I error is incorrectly predicting the positive class for an instance that truly belongs to the negative class. Conversely, a Type II error is incorrectly predicting the negative class for an instance that truly belongs to the positive class.

Returning to our arthritis example, let’s assume the red star is a new patient WHO DOES NOT HAVE ARTHRITIS, and our model diagnoses them as having arthritis. This is a false positive, a Type 1 error.

Let’s again assume our patient DOES HAVE ARTHRITIS and is the green star, and our model diagnoses them as not having arthritis. This would be a false negative, or Type II error.

[Figure: the arthritis data with a false positive marked by a red star and a false negative marked by a green star]

The terms α (alpha) and β (beta) originate from statistical hypothesis testing. In this context:

  • α (alpha) represents the significance level of a test, which is the probability of committing a Type I error. A Type I error occurs when we incorrectly reject a true null hypothesis. In other words, it's the risk of a false alarm.
  • β (beta) represents the probability of committing a Type II error. A Type II error takes place when we fail to reject a false null hypothesis. Essentially, it's the risk of failing to detect an effect when it is present.

This information is useful for remembering Type 1 & 2 errors because we can develop the following mnemonic:

Type I (α or alpha): False Positive — Think of an alarm (starts with ‘a’ like alpha) that goes off even when there’s no actual threat. It’s a “false alarm.”

Type II (β or beta): False Negative — Think of a blind (starts with ‘b’ like beta) security guard who misses an actual threat.
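If you want to count Type I and Type II errors directly, scikit-learn’s confusion_matrix gives both in one call. A small sketch, assuming the y_test and y_pred variables from the “Python Code” section below:

from sklearn.metrics import confusion_matrix

# For binary labels, the matrix is laid out as:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print("Type I errors (false positives):", fp)
print("Type II errors (false negatives):", fn)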

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

We’ll first use a simple scenario to explain precision and recall. My definitions focus on the questions each metric answers, because the formal definitions are very abstract (and honestly, not much help).

Let’s say I’m sitting in front of a conveyor belt that’s shuttling animals past me, and I label each one as it goes by:

  • A cow moos by: “Tiger.”
  • A pig snorts by: “Tiger.”
  • A human complains by: “Tiger.”
  • A tiger roars by: “Tiger.”

Now, precision answers the question, “Of all the instances the model predicted as positive, how many were actually positive?”

What’s our precision given that we just predicted every animal to be a tiger?

Precision = TP / (TP + FP) = 1 / (1 + 3) = 1/4 = 0.25

Recall answers the question, “Of all the instances that truly belong to a certain class, how many did the model correctly identify?”

In our tiger example, there is only one true tiger, and we did call it a tiger, so there are no false negatives.

Recall = TP / (TP + FN) = 1 / (1 + 0) = 1/1 = 1.0

Notice the trade-off: by shouting “Tiger!” at everything, we achieved perfect recall at the cost of terrible precision.
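As a quick sanity check, here’s the conveyor-belt example in code (the 0/1 label encoding is my own; 1 = tiger, 0 = not a tiger):

from sklearn.metrics import precision_score, recall_score

# Cow, pig, human, tiger -- only the last one is truly a tiger
y_true = [0, 0, 0, 1]
# We shouted "Tiger!" at everything
y_pred = [1, 1, 1, 1]

print(precision_score(y_true, y_pred))  # 0.25 -- 1 real tiger out of 4 "Tiger" calls
print(recall_score(y_true, y_pred))     # 1.0  -- the one real tiger wasn't missed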

With this understanding, let’s now remember our arthritis classification example:

[Figures: the arthritis dataset and its test points from the example above]

The questions for precision and recall would be:

Precision - How many of the data points classified as Class 1 (Arthritis) were actually Class 1?

Recall - How many of those data points that belong to Class 1 (Arthritis) were correctly identified?

Here are the results of our classification model:

[Figure: model output reporting a precision of 1.0 and a recall of 0.875]

A precision of 1.0 means all the patients predicted as having arthritis (Class 1) actually have it. There are no false positives.

A recall of 0.875 means the model correctly identifies 87.5% of the patients who truly have arthritis (Class 1). This also implies that 12.5% of the patients WITH ARTHRITIS are being labeled by the model as not having arthritis. These are Type 2 errors.

While precision and recall provide insight into model performance, they don’t reveal the reasons behind errors. To understand the nature of these errors, it’s often necessary to conduct further analysis, such as examining the false negatives or reviewing the misclassified data points:

[Figure: the test set with the misclassified data points highlighted]
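One simple way to do that review, assuming the test_data, y_test, and y_pred variables from the “Python Code” section below:

# Pull out the test points the model got wrong for closer inspection
misclassified = test_data[y_test != y_pred]
print(misclassified)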

The F1 score is the harmonic mean of precision and recall, which makes it sensitive to small values. This means if either precision or recall is significantly lower than the other, it will have a more pronounced impact on the F1 score.
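Explicitly:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Plugging in our arthritis model’s precision of 1.0 and recall of 0.875 gives F1 = 2 × (1.0 × 0.875) / (1.0 + 0.875) ≈ 0.93, a bit below the arithmetic mean of about 0.94, because the lower value (recall) carries more weight in the harmonic mean.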

This sensitivity to smaller values is what makes the F1-score useful for imbalanced datasets!

With an imbalanced dataset, precision and recall tend to diverge more than they would with a 50/50 split, and the smaller of the two drags down the F1-score because of its increased influence in the harmonic mean.

Let’s consider an example where we assume no one has arthritis.

Imagine we have a dataset where only 5% of the instances represent individuals with arthritis, and the remaining 95% are without arthritis. If you were to predict the negative class (no arthritis) for all instances, you could achieve an accuracy of 95%. However, this would not be particularly useful because you would fail to identify any individuals with arthritis.
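Here’s that all-negative baseline in code, with made-up labels matching the 5%/95% split described above:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 5 patients with arthritis, 95 without
y_true = np.array([1] * 5 + [0] * 95)
y_pred = np.zeros(100, dtype=int)  # predict "no arthritis" for everyone

print(accuracy_score(y_true, y_pred))             # 0.95
print(recall_score(y_true, y_pred))               # 0.0 -- every true case missed
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0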

[Figure: metrics for the all-negative predictor: high accuracy, zero recall]

This is where the F1-score becomes valuable. In this example, the recall for the positive class (those with arthritis) would be 0, indicating that the model failed to detect any true cases of arthritis. Consequently, the F1-score would also be 0, emphasizing the model’s inability to identify individuals with arthritis, despite the high accuracy on the majority negative class.

But that’s not what we did; we used our classification model, which returned the following results.

[Figure: the classification model’s results on the imbalanced dataset]

You should be able to copy and paste these scripts into your IDE and run them; no dataset download is required.

Code for Everything Except F1-Score Example:

# %% CREATE TWO NEAT CLUSTERS OF RANDOMLY GENERATED DATA
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(20)

# Define the number of data points
num_data_points = 300

# Set the desired percentage of data points with arthritis
percentage_with_arthritis = 0.15

# Calculate the number of data points for each class
num_data_points_with_arthritis = int(num_data_points * percentage_with_arthritis)
num_data_points_no_arthritis = num_data_points - num_data_points_with_arthritis

# Generate clustered data for 'No Arthritis' class
no_arthritis_center = [30, 4]  # Mean age and pain level for 'No Arthritis'
no_arthritis_std = [12, 3]  # Standard deviation for age and pain level
no_arthritis_data = np.random.normal(
    loc=no_arthritis_center,
    scale=no_arthritis_std,
    size=(num_data_points_no_arthritis, 2),
)
no_arthritis_labels = np.zeros(num_data_points_no_arthritis)

# Generate clustered data for 'Arthritis' class
arthritis_center = [60, 7]  # Mean age and pain level for 'Arthritis'
arthritis_std = [12, 3]  # Standard deviation for age and pain level
arthritis_data = np.random.normal(
    loc=arthritis_center, scale=arthritis_std, size=(num_data_points_with_arthritis, 2)
)
arthritis_labels = np.ones(num_data_points_with_arthritis)

# Combine data and labels
age = np.concatenate((no_arthritis_data[:, 0], arthritis_data[:, 0]))
pain_level = np.concatenate((no_arthritis_data[:, 1], arthritis_data[:, 1]))
arthritis = np.concatenate((no_arthritis_labels, arthritis_labels))

# Ensure age and pain level are non-negative
age = np.maximum(age, 0)
pain_level = np.maximum(pain_level, 0)

# Create a DataFrame to store the dataset
data = pd.DataFrame({"Age": age, "Pain_Level": pain_level, "Arthritis": arthritis})

###########################################################################################################
# VISUALIZE THE CLUSTERS BEFORE SPLITTING

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(
    data[data["Arthritis"] == 0]["Age"],
    data[data["Arthritis"] == 0]["Pain_Level"],
    label="No Arthritis",
    color="blue",
    alpha=0.6,
    s=80,
)
plt.scatter(
    data[data["Arthritis"] == 1]["Age"],
    data[data["Arthritis"] == 1]["Pain_Level"],
    label="Arthritis",
    color="orange",
    alpha=0.6,
    s=80,
)

# Set plot labels and title
plt.xlabel("Age")
plt.ylabel("Pain Level")
plt.title("Arthritis Classification Dataset with Clustered Data")

# Add a legend
plt.legend()

# Show the plot
plt.grid(True)
plt.show()

###########################################################################################################
# %% SPLIT THE DATA 80-20 TRAIN-TEST
from sklearn.model_selection import train_test_split

# Split the data into training (80%) and testing (20%) sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Display the sizes of the training and testing sets
print("Training set size:", len(train_data))
print("Testing set size:", len(test_data))

###########################################################################################################
# %% CREATE A LOGISTIC REGRESSION CLASSIFICATION MODEL
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Separate the features (Age and Pain_Level) and the target variable (Arthritis) for training and testing sets
X_train = train_data[["Age", "Pain_Level"]]
y_train = train_data["Arthritis"]

X_test = test_data[["Age", "Pain_Level"]]
y_test = test_data["Arthritis"]

# Initialize the logistic regression model
model = LogisticRegression(random_state=42)

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = model.predict(X_test)

# Evaluate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

###########################################################################################################
# %% PRINT PRECISION, SENSITIVITY, F1-SCORE

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred)
print("Precision:", precision)

# Calculate recall (sensitivity)
recall = recall_score(y_test, y_pred)
print("Recall (Sensitivity):", recall)

# Calculate F1-score
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)

Code for the F1-Score Example:

This is the exact same script as above; the only change is the class-imbalance setting near the top:

# Set the desired percentage of data points with arthritis
percentage_with_arthritis = 0.05
