Confusion matrix, ROC curve, and AUC score

Ayse Seyyide Arslan
4 min read · Apr 7, 2021

Today, we are going to show how the confusion matrix, ROC curve, and AUC score can be calculated in Excel, using a dataset listing features for a number of power station generators.

The dataset contains:

  • The measurements of the revolutions per minute (RPM) that power station generators are running at,
  • The amount of vibration in the generators (VIBRATION),
  • An indicator (STATUS) showing whether each generator proved to be good (0) or faulty (1) the day after these measurements were taken.

Using this dataset, we would like to train a model to distinguish between properly operating power station generators and faulty generators using the RPM and VIBRATION measurements.

This dataset is taken from Table 7.7 in the “Fundamentals of Machine Learning for Predictive Data Analytics” book [1].

Before we start, let's normalize the RPM and VIBRATION features to the range [-1, 1] using the min-max normalization method for feature scaling.

x′ = low + ((x − min(x)) / (max(x) − min(x))) × (high − low), with low = −1 and high = 1

Min-Max Normalization Method
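In Excel, this can be sketched with MIN() and MAX(). As an illustration (the cell references here are assumptions, not necessarily the layout in the original workbook), if the raw RPM values live in A2:A69, the following formula, filled down a new column, rescales each value to [-1, 1]; VIBRATION is handled the same way:

=2*(A2-MIN(A$2:A$69))/(MAX(A$2:A$69)-MIN(A$2:A$69))-1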

Now that the dataset is ready, we can apply the linear equation to each row. The output of the equation is then fed into the logistic function to find the probabilities/predictions.

z = b + w₁x₁ + w₂x₂

Linear Equation

  • The w values are the model’s learned weights, and b is the bias.
  • The x values are the normalized feature values (RPM and VIBRATION) for a particular row.
  • For this model: b = -0.4077, w₁ = 4.1697, w₂ = 6.046.
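As a sketch (again with illustrative cell references), if a row's normalized RPM is in F2 and its normalized VIBRATION is in G2, the linear output z can be computed with:

=-0.4077+4.1697*F2+6.046*G2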
σ(z) = 1 / (1 + e^(−z))

Logistic (Sigmoid) Function
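In Excel, the sigmoid can be written with EXP(). Assuming the z value from the previous step is in H2, the probability is:

=1/(1+EXP(-H2))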

The result of the sigmoid function, which is a probability, can be used as-is, or a threshold can be specified to convert the probability to a binary value. For example, let's first set the threshold to 0.5. In this case, if the probability is above the threshold, the binary value will be 1; otherwise, 0.
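This thresholding step is a one-line IF(). Assuming the probability sits in I2 (an illustrative cell reference), the Threshold(0.5) column can be filled with:

=IF(I2>0.5,1,0)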

The STATUS column represents the ‘actual’ value and the Threshold(0.5) column represents the ‘prediction’; comparing the two lets us specify the outcome for each row (a single formula implementing all four cases is sketched after the list):

Case 1: Status: 1, Threshold(0.5): 1 >> Outcome: TP

Case 2: Status: 1, Threshold(0.5): 0 >> Outcome: FN

Case 3: Status: 0, Threshold(0.5): 0 >> Outcome: TN

Case 4: Status: 0, Threshold(0.5): 1 >> Outcome: FP
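All four cases can be handled by one nested IF(). Assuming STATUS is in C2 and the Threshold(0.5) value is in J2 (illustrative cell references), the Outcome column can be filled with:

=IF(AND(C2=1,J2=1),"TP",IF(AND(C2=1,J2=0),"FN",IF(AND(C2=0,J2=0),"TN","FP")))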

After finding the outcome for each row, we have to count the number of TPs, FNs, FPs, and TNs in the ‘Outcome’ column to form our Confusion Matrix.

Confusion Matrix

The confusion matrix shown above is for the threshold-0.5 case. Now, let's look at the calculations of the Misclassification Rate, Classification Accuracy, TPR, FPR, TNR, and FNR:

  • Classification Accuracy = (TP + TN) / (TP + TN + FP + FN)
  • Misclassification Rate = (FP + FN) / (TP + TN + FP + FN)
  • TPR (True Positive Rate) = TP / (TP + FN)
  • FPR (False Positive Rate) = FP / (FP + TN)
  • TNR (True Negative Rate) = TN / (TN + FP)
  • FNR (False Negative Rate) = FN / (TP + FN)
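In Excel these are simple ratios of the four counts. As an illustration, if the TP, FN, FP, and TN counts are placed in K2, K3, K4, and K5 respectively (an assumed layout), then:

=(K2+K5)/SUM(K2:K5) gives the Classification Accuracy,
=(K3+K4)/SUM(K2:K5) gives the Misclassification Rate,
=K2/(K2+K3) gives the TPR, and
=K4/(K4+K5) gives the FPR.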

To be able to plot the ROC curve (receiver operating characteristic curve), we need to show the performance of the classification model at all classification thresholds. The ROC curve plots two parameters: the True Positive Rate against the False Positive Rate. To easily see how the outcome values change at each threshold, we can sort the ‘Predictions’ column.

Outcomes for the threshold values 0, 0.1, …, 0.9, 1

In the Excel file, to count a specific string in a column, we can use COUNTIF().

For example, to find the number of FPs in the Outcome_1 column,

=COUNTIF(D2:D69,"*FP*") will return 32

From the Outcome_1 to Outcome_11 columns, we can find all the measures (mainly TPR and FPR) required to plot a ROC curve.
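For example, using the same D2:D69 range as above for the Outcome_1 column, the TPR and FPR at that threshold can be sketched as:

=COUNTIF(D2:D69,"*TP*")/(COUNTIF(D2:D69,"*TP*")+COUNTIF(D2:D69,"*FN*")) for the TPR, and
=COUNTIF(D2:D69,"*FP*")/(COUNTIF(D2:D69,"*FP*")+COUNTIF(D2:D69,"*TN*")) for the FPR.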

In the end, to be able to plot the ROC curve, let's put all the TPR and FPR results together.

By using the table shown above, we can finally plot our ROC curve.

ROC CURVE

As a result, we can easily find the AUC (area under the ROC curve) score of 0.989619377 by subtracting the area that is not under the curve (a trapezoid in our plot) from the total area, which is 1.
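More generally, the AUC can be computed with the trapezoidal rule in a single formula. As a sketch with assumed cell references, if the eleven FPR values are sorted in ascending order in A2:A12 and the matching TPR values are in B2:B12, then:

=SUMPRODUCT((A3:A12-A2:A11),(B3:B12+B2:B11))/2

sums the area of each trapezoid between consecutive points of the curve.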

References

  1. Kelleher, J. D., Mac Namee, B., & D’Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. MIT Press. https://github.com/linux08/machine-learning-books/blob/master/Fundamentals%20of%20Machine%20Learning%20for%20Predictive%20Data%20Analytics%202015.pdf
