Dtjytyk

Question: How to select between models when AUC scores are similar?

I used two machine learning algorithms for binary classification and got these results:

Algo 1:

 Train AUC: 0.75      Test AUC: 0.65          (big train score / overfitting)

Algo 2:

 Train AUC: 0.72      Test AUC: 0.65          (small train score / small overfitting)

Which one is better?

I would like to point out that if you are not optimizing probability rank, you should not use AUC. A good deal of research recommends against using AUC to choose models or parameters. Even when the focus is on rank/ordering, I have found inconsistencies in special cases that create confusion. To start with, check out this paper: "AUC: a misleading measure of the performance of predictive distribution models" – Mar 23 at 22:45

score 1 · Accepted Answer · 2019-03-15 12:42:52Z


Based on the AUC score they are the same. It does not really matter if the model is overfitting or not. What matters is how well it performs on new data (test score).

Overfitting is just an indication that there might be room for improvement by making your model more general. But until the test score has increased, the model has not improved, even if it is overfitting less.
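To make "the same based on the AUC score" concrete, here is a minimal sketch of AUC computed as the fraction of (positive, negative) pairs that a model ranks correctly. The labels and scores are made-up illustration data, not the asker's results, and the helper name `auc` is hypothetical.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical held-out labels and the scores two models assign to them.
y_test = [1, 0, 1, 0, 1, 0]
scores_a = [0.9, 0.8, 0.7, 0.3, 0.4, 0.6]
scores_b = [0.6, 0.7, 0.8, 0.2, 0.3, 0.4]

auc_a = auc(y_test, scores_a)
auc_b = auc(y_test, scores_b)
print(round(auc_a, 4), round(auc_b, 4))
```

Both hypothetical models rank 6 of the 9 pairs correctly, so their test AUC is identical even though their individual scores differ. On this metric alone they cannot be told apart.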

edited Mar 15 at 12:42

answered Mar 15 at 12:36

Simon Larsson


Thanks Simon. So if I understand correctly, I should always take the model with the highest test score as the best one, without giving any importance to the training score? – Nirmine, Mar 15 at 13:17

Yes, that is correct. – Simon Larsson, Mar 15 at 13:23

Thanks for your help. – Nirmine, Mar 15 at 13:25

No problem! Don't forget to mark my answer as correct if you got what you asked for. – Simon Larsson, Mar 15 at 13:27


score 1 · Accepted Answer · 2019-03-15 15:52:43Z


Algo 2

Between models with equal test scores, choose the one with the smaller difference between training and test scores (Algo 2), since the one with the better training score (Algo 1) is more overfitted. We tolerate a more overfitted model only if it has a subjectively better test score.
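The rule above can be sketched in a few lines. This is only an illustration of the tie-break: the function name `pick_model` and the tolerance `tol` (how close two test scores must be to count as "equal") are assumptions, not something prescribed in this answer.

```python
def pick_model(results, tol=0.005):
    """Pick a model name from {name: (train_score, test_score)}.

    Models whose test score is within `tol` of the best test score are
    treated as tied; among those, the smallest train/test gap wins.
    `tol` is a hypothetical threshold chosen for illustration.
    """
    best_test = max(test for _, test in results.values())
    tied = {
        name: train - test
        for name, (train, test) in results.items()
        if best_test - test <= tol
    }
    return min(tied, key=tied.get)


# Scores from the question: (train AUC, test AUC).
results = {"Algo 1": (0.75, 0.65), "Algo 2": (0.72, 0.65)}
print(pick_model(results))  # Algo 2
```

A model with a clearly higher test score still wins regardless of its gap, which matches the second sentence of the rule.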

For a better justification, think of how we train a neural network. When the validation score stops improving, we stop the training process even though the training score would keep improving. If we let the training continue, the model starts making extra assumptions based on the training set that are not scrutinized by the critic (the validation set), which makes the model more prone to building false assumptions about the data.

By the same token, a model (Algo 1) that has the same performance according to the critic (the test set) but performs better on the training set is prone to having made untested assumptions about the data.
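The early-stopping analogy can be sketched as follows. The per-epoch scores are invented for illustration, and the `patience` parameter (how many non-improving epochs to wait) is an assumption added to make the loop concrete.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return the index of the epoch that early stopping would keep.

    Training stops once the validation score has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is kept.
    """
    best_epoch = 0
    best_score = float("-inf")
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Hypothetical per-epoch AUC: training keeps rising, validation plateaus.
train_auc = [0.60, 0.66, 0.72, 0.74, 0.75, 0.76]
val_auc = [0.58, 0.62, 0.65, 0.65, 0.64, 0.64]

kept = early_stopping_epoch(val_auc)
print(kept, train_auc[kept], val_auc[kept])
```

In this made-up run the kept epoch has a lower training score than the later epochs but the same or better validation score, which is the situation Algo 2 is in relative to Algo 1.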

edited Mar 15 at 15:52

answered Mar 15 at 13:31

Esmailian


How can you make these assumptions? The test score tells you the generalization ability of the algorithm regardless of the bias/variance. I feel like you can say nothing about which one will perform better on another test set. – Simon Larsson, Mar 15 at 13:48

Genuinely curious btw, in case you know something I have missed. :) – Simon Larsson, Mar 15 at 13:50

@SimonLarsson cool! I made some updates. – Esmailian, Mar 15 at 14:04

Thank you for replying! But what I would like to know is how you can assume that one will generalize better than the other on other data when the test score is the same. Just because you know that one model has learned some junk from the training set, it does not follow that the other model has learned something useful in its place. – Simon Larsson, Mar 15 at 14:15

@SimonLarsson I think fundamentally it's an Occam's Razor thing, with an assumption that the more-overfit model is "more complicated." In specific situations it's easier; e.g., if the data is time-dependent and the test set is out-of-time, then the train/test score discrepancy might indicate degradation over time, so that future performance may degrade faster in the more-overfit model. – Ben Reiniger, Mar 15 at 14:39


score 1 · Accepted Answer · 2019-03-15 16:24:53Z

Based on this metric alone you cannot tell which one is better, because AUC does not differentiate between these two results. You should use some other metrics, such as Kappa, or some benchmarks.

Disclaimer:

If you are using Python, I suggest the PyCM module, which takes your confusion matrix as input and calculates about 100 overall and class-based metrics.

To use this module, first prepare your confusion matrix and see its recommended parameters with the following code:

>>> from pycm import ConfusionMatrix
>>> cm = ConfusionMatrix(matrix={"0": {"0": 1, "1": 0, "2": 0}, "1": {"0": 0, "1": 1, "2": 2}, "2": {"0": 0, "1": 1, "2": 0}})
>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]

Then print the values of all the metrics, focusing on the recommended ones, with the following code:

>>> print(cm)
    Predict          0        1        2
    Actual
    0                1        0        0
    1                0        1        2
    2                0        1        0

Overall Statistics :

95% CI                                                           (-0.02941,0.82941)
Bennett_S                                                        0.1
Chi-Squared                                                      6.66667
Chi-Squared DF                                                   4
Conditional Entropy                                              0.55098
Cramer_V                                                         0.8165
Cross Entropy                                                    1.52193
Gwet_AC1                                                         0.13043
Joint Entropy                                                    1.92193
KL Divergence                                                    0.15098
Kappa                                                            0.0625
Kappa 95% CI                                                     (-0.60846,0.73346)
Kappa No Prevalence                                              -0.2
Kappa Standard Error                                             0.34233
Kappa Unbiased                                                   0.03226
Lambda A                                                         0.5
Lambda B                                                         0.66667
Mutual Information                                               0.97095
Overall_ACC                                                      0.4
Overall_RACC                                                     0.36
Overall_RACCU                                                    0.38
PPV_Macro                                                        0.5
PPV_Micro                                                        0.4
Phi-Squared                                                      1.33333
Reference Entropy                                                1.37095
Response Entropy                                                 1.52193
Scott_PI                                                         0.03226
Standard Error                                                   0.21909
Strength_Of_Agreement(Altman)                                    Poor
Strength_Of_Agreement(Cicchetti)                                 Poor
Strength_Of_Agreement(Fleiss)                                    Poor
Strength_Of_Agreement(Landis and Koch)                           Slight
TPR_Macro                                                        0.44444
TPR_Micro                                                        0.4

Class Statistics :

Classes                                                          0                       1                       2
ACC(Accuracy)                                                    1.0                     0.4                     0.4
BM(Informedness or bookmaker informedness)                       1.0                     -0.16667                -0.5
DOR(Diagnostic odds ratio)                                       None                    0.5                     0.0
ERR(Error rate)                                                  0.0                     0.6                     0.6
F0.5(F0.5 score)                                                 1.0                     0.45455                 0.0
F1(F1 score - harmonic mean of precision and sensitivity)        1.0                     0.4                     0.0
F2(F2 score)                                                     1.0                     0.35714                 0.0
FDR(False discovery rate)                                        0.0                     0.5                     1.0
FN(False negative/miss/type 2 error)                             0                       2                       1
FNR(Miss rate or false negative rate)                            0.0                     0.66667                 1.0
FOR(False omission rate)                                         0.0                     0.66667                 0.33333
FP(False positive/type 1 error/false alarm)                      0                       1                       2
FPR(Fall-out or false positive rate)                             0.0                     0.5                     0.5
G(G-measure geometric mean of precision and sensitivity)         1.0                     0.40825                 0.0
LR+(Positive likelihood ratio)                                   None                    0.66667                 0.0
LR-(Negative likelihood ratio)                                   0.0                     1.33333                 2.0
MCC(Matthews correlation coefficient)                            1.0                     -0.16667                -0.40825
MK(Markedness)                                                   1.0                     -0.16667                -0.33333
N(Condition negative)                                            4                       2                       4
NPV(Negative predictive value)                                   1.0                     0.33333                 0.66667
P(Condition positive)                                            1                       3                       1
POP(Population)                                                  5                       5                       5
PPV(Precision or positive predictive value)                      1.0                     0.5                     0.0
PRE(Prevalence)                                                  0.2                     0.6                     0.2
RACC(Random accuracy)                                            0.04                    0.24                    0.08
RACCU(Random accuracy unbiased)                                  0.04                    0.25                    0.09
TN(True negative/correct rejection)                              4                       1                       2
TNR(Specificity or true negative rate)                           1.0                     0.5                     0.5
TON(Test outcome negative)                                       4                       3                       3
TOP(Test outcome positive)                                       1                       2                       2
TP(True positive/hit)                                            1                       1                       0
TPR(Sensitivity, recall, hit rate, or true positive rate)        1.0                     0.33333                 0.0
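For readers who want to see where the Kappa value in the output above comes from, here is a minimal dependency-free sketch of Cohen's kappa computed directly from the same 3x3 confusion matrix. The function name `cohen_kappa` is hypothetical and this is not PyCM's implementation, only the textbook formula (observed agreement corrected by chance agreement).

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    matrix[i][j] is the number of items whose actual class is i and
    whose predicted class is j.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: product of actual and predicted class totals.
    actual_totals = [sum(row) for row in matrix]
    predicted_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    expected = sum(a * p for a, p in zip(actual_totals, predicted_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Same matrix as in the PyCM example: rows are actual, columns predicted.
cm_rows = [
    [1, 0, 0],
    [0, 1, 2],
    [0, 1, 0],
]
print(round(cohen_kappa(cm_rows), 5))  # 0.0625
```

Here the observed agreement is 0.4 and the chance agreement is 0.36, which correspond to the Overall_ACC and Overall_RACC rows in the output above, giving (0.4 - 0.36) / (1 - 0.36) = 0.0625.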

You should mention that you are an author of the package. (datascience.stackexchange.com/help/behavior) – Mar 15 at 16:23

@alirezazolanvari In my opinion, a change of measure does not solve the underlying problem. First, the choice of measure depends on the task too; we cannot pick and choose independently. More importantly, this problem can happen with any other measure (e.g. Kappa) too, so the solution is not simply to change the measure. – Mar 15 at 17:53

@Esmailian Obviously the evaluation metric is directly related to the task, but research on finding proper metrics for evaluating learning algorithms has focused on clarifying the difference in performance between algorithms in cases where simple metrics such as AUC cannot say which one is better. Overall, many other things should be considered to answer this question. This answer is not a golden key for the problem, but it can help to solve it. – Mar 15 at 18:53
