



Philosophical question on logistic regression: why isn't the optimal threshold value trained?


Usually in logistic regression, we fit a model and get some predictions on the training set. We then cross-validate on those training predictions (something like here) and decide the optimal threshold value based on something like the ROC curve.



Why don't we incorporate cross-validation of the threshold INTO the actual model, and train the whole thing end-to-end?
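The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration on synthetic data (all numbers here are invented for the example): the model fit and the cutoff choice are entirely separate steps, which is exactly what the question is probing.

```python
import math
import random

random.seed(1)

# Synthetic 1-D data: positives tend to have larger x (illustrative only).
def make_data(n):
    out = []
    for _ in range(n):
        y = 1 if random.random() < 0.5 else 0
        out.append((random.gauss(1.5 if y else 0.0, 1.0), y))
    return out

train, valid = make_data(400), make_data(400)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Step 1: fit the logistic regression itself (gradient descent on log-loss).
w = b = 0.0
for _ in range(2000):
    gw = gb = 0.0
    for x, y in train:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    w -= 0.1 * gw / len(train)
    b -= 0.1 * gb / len(train)

# Step 2 (separate!): sweep cutoffs on held-out data and keep the one
# maximizing Youden's J = TPR - FPR, one common ROC-based criterion.
def youden(t):
    tp = sum(1 for x, y in valid if y and sigmoid(w * x + b) >= t)
    fp = sum(1 for x, y in valid if not y and sigmoid(w * x + b) >= t)
    pos = sum(y for _, y in valid)
    return tp / pos - fp / (len(valid) - pos)

best_t = max((t / 100 for t in range(1, 100)), key=youden)
print("fitted (w, b):", (round(w, 2), round(b, 2)), "chosen cutoff:", best_t)
```

The sweep only ranks candidate cutoffs; nothing about it feeds back into w and b, which were chosen purely by maximizing the likelihood.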










  • Possible duplicate of Classification probability threshold – kjetil b halvorsen, 2 hours ago

  • Already ruled as not a duplicate this morning, but I see why the mix-up is happening. – StatsSorceress, 2 hours ago

















Tags: logistic, cross-validation, optimization, roc, threshold






asked 8 hours ago by StatsSorceress (edited 7 hours ago)











3 Answers






6 votes — answered 8 hours ago by gung

The threshold isn't trained because logistic regression isn't a classifier (cf. Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you assume that the response distribution, conditional on the covariates, is Bernoulli, and you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






  • Okay, I understand that part of the theory (thank you for that eloquent explanation!), but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss? – StatsSorceress, 8 hours ago

  • You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note, BTW, that Frank Harrell has pointed out that this process will lead to what might be considered an inferior model by many standards. – gung, 8 hours ago

  • Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types; we just care about "correct classification". In that case, could you train end-to-end as I describe? – StatsSorceress, 7 hours ago

  • As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself, & the final model is likely to be poorer by most standards. – gung, 6 hours ago
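The ad hoc "end-to-end" scheme discussed in these comments can be made concrete. A toy sketch follows, on made-up 1-D data: treat the threshold t as a third parameter and search over (w, b, t) to minimize 0-1 loss directly. Since 0-1 loss is a step function with no useful gradient, plain random search stands in for the optimizer here.

```python
import math
import random

random.seed(2)

# Made-up 1-D data: positives tend to have larger x.
data = []
for _ in range(300):
    y = 1 if random.random() < 0.5 else 0
    data.append((random.gauss(1.5 if y else 0.0, 1.0), y))

def zero_one_loss(w, b, t):
    """Fraction misclassified when we predict 1 iff sigmoid(w*x + b) >= t."""
    errs = 0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        errs += int((p >= t) != (y == 1))
    return errs / len(data)

# "Train" the model and the threshold together by random search over all three.
candidates = ((random.uniform(-3, 3), random.uniform(-3, 3), random.random())
              for _ in range(5000))
best = min(candidates, key=lambda params: zero_one_loss(*params))
print("w, b, t:", tuple(round(v, 2) for v in best),
      "0-1 loss:", zero_one_loss(*best))
```

Notice that t is not separately identified: the rule sigmoid(w*x + b) >= t is the same as w*x + b >= logit(t), so shifting b and t together changes nothing about the decisions — one symptom of why the threshold is better viewed as a decision-stage quantity than as a model parameter.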


















7 votes — answered 8 hours ago by Stephan Kolassa

It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of an untreated true positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different from the one where your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may take more values (send home with two aspirin / run more tests / admit to hospital and watch / operate immediately).



Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.
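The cost argument above can be turned into one line of algebra. If a false positive costs C_FP, a false negative costs C_FN, and correct decisions cost nothing, then acting is cheaper in expectation than not acting exactly when p * C_FN >= (1 - p) * C_FP, i.e. when p >= C_FP / (C_FP + C_FN). A sketch with invented cost figures:

```python
def optimal_threshold(cost_fp, cost_fn):
    """Probability cutoff minimizing expected cost, assuming correct
    decisions are free: act iff p * cost_fn >= (1 - p) * cost_fp."""
    return cost_fp / (cost_fp + cost_fn)

# Common cold: a needless prescription is cheap, a missed case is mild.
cold = optimal_threshold(cost_fp=1, cost_fn=10)

# Life-threatening disease: a needless major intervention is expensive,
# but a missed case is catastrophic, so we act on even smaller probabilities.
serious = optimal_threshold(cost_fp=100, cost_fn=10000)

print(cold, serious)  # the two optimal cutoffs differ by an order of magnitude
```

The cost figures are purely illustrative; the point is that the same fitted probabilities yield different optimal cutoffs under different cost structures, so no single threshold can be "trained" into the model.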






2 votes — answer by Sycorax

Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and we can back into a desired error-rate trade-off.

A ROC curve is a little deceptive because the only thing you control is the threshold; the plot, however, displays TPR and FPR, which are functions of the threshold. Moreover, TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say, by cross-validation), you could come up with a different FPR and TPR at some specific threshold value.

However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.

Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.

For more information, see ROC Curves for Continuous Data by Wojtek J. Krzanowski and David J. Hand.
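The variability described above is easy to estimate without re-running the whole ROC procedure: bootstrap the validation set and recompute TPR and FPR at one fixed threshold. A minimal sketch on synthetic scores (all numbers invented):

```python
import random

random.seed(0)

# Synthetic validation-set scores: positives score higher on average.
n = 500
data = []
for _ in range(n):
    y = random.random() < 0.4
    data.append((random.gauss(1.0 if y else 0.0, 1.0), y))

def rates(sample, threshold):
    """(TPR, FPR) of the rule 'predict positive iff score >= threshold'."""
    tp = sum(1 for s, y in sample if y and s >= threshold)
    fp = sum(1 for s, y in sample if not y and s >= threshold)
    pos = sum(1 for _, y in sample if y)
    return tp / pos, fp / (len(sample) - pos)

# Bootstrap the sampling distribution of (TPR, FPR) at one fixed threshold.
threshold = 0.5
tprs, fprs = [], []
for _ in range(1000):
    boot = [random.choice(data) for _ in range(n)]
    tpr, fpr = rates(boot, threshold)
    tprs.append(tpr)
    fprs.append(fpr)

tprs.sort()
fprs.sort()

def percentile_ci(xs):  # rough 95% percentile interval
    return xs[25], xs[974]

print("TPR 95% CI:", percentile_ci(tprs))
print("FPR 95% CI:", percentile_ci(fprs))
```

One can then ask whether the whole interval, not just the point estimate, clears a researcher-specified FPR ceiling or TPR floor.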






    • This doesn't really answer my question, but it's a very nice description of ROC curves. – StatsSorceress, 8 hours ago

    • In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification? – Sycorax, 8 hours ago

    • I was asking why we don't train the threshold instead of choosing it after training the model. – StatsSorceress, 8 hours ago

    • How would you train a threshold? – Sycorax, 8 hours ago

    • I'm not aware of any statistical procedure that works that way. Why is this square wheel a good idea? What problem does it solve? – Sycorax, 8 hours ago











    Your Answer








    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "65"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f405041%2fphilosophical-question-on-logistic-regression-why-isnt-the-optimal-threshold-v%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    3 Answers
    3






    active

    oldest

    votes








    3 Answers
    3






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    6












    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$












    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      8 hours ago






    • 2




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      8 hours ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      7 hours ago






    • 2




      $begingroup$
      As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
      $endgroup$
      – gung
      6 hours ago















    6












    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$












    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      8 hours ago






    • 2




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      8 hours ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      7 hours ago






    • 2




      $begingroup$
      As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
      $endgroup$
      – gung
      6 hours ago













    6












    6








    6





    $begingroup$

    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.






    share|cite|improve this answer









    $endgroup$



    It isn't because logistic regression isn't a classifier (cf., Why isn't Logistic Regression called Logistic Classification?). It is a model to estimate the parameter, $p$, that governs the behavior of the Bernoulli distribution. That is, you are assuming that the response distribution, conditional on the covariates, is Bernoulli, and so you want to estimate how the parameter that controls that variable changes as a function of the covariates. It is a direct probability model only. Of course, it can be used as a classifier subsequently, and sometimes is in certain contexts, but it is still a probability model.







    share|cite|improve this answer












    share|cite|improve this answer



    share|cite|improve this answer










    answered 8 hours ago









    gunggung

    110k34268539




    110k34268539











    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      8 hours ago






    • 2




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      8 hours ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      7 hours ago






    • 2




      $begingroup$
      As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
      $endgroup$
      – gung
      6 hours ago
















    • $begingroup$
      Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
      $endgroup$
      – StatsSorceress
      8 hours ago






    • 2




      $begingroup$
      You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
      $endgroup$
      – gung
      8 hours ago











    • $begingroup$
      Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
      $endgroup$
      – StatsSorceress
      7 hours ago






    • 2




      $begingroup$
      As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
      $endgroup$
      – gung
      6 hours ago















    $begingroup$
    Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
    $endgroup$
    – StatsSorceress
    8 hours ago




    $begingroup$
    Okay, I understand that part of the theory (thank you for that eloquent explanation!) but why can't we incorporate the classification aspect into the model? That is, why can't we find p, then find the threshold, and train the whole thing end-to-end to minimize some loss?
    $endgroup$
    – StatsSorceress
    8 hours ago




    2




    2




    $begingroup$
    You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
    $endgroup$
    – gung
    8 hours ago





    $begingroup$
    You certainly could (@Sycorax's answer speaks to that possibility). But because that isn't what LR itself is, but rather some ad hoc augmentation, you would need to code up the full optimization scheme yourself. Note BTW, that Frank Harrell has pointed out that process will lead to what might be considered an inferior model by many standards.
    $endgroup$
    – gung
    8 hours ago













    $begingroup$
    Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
    $endgroup$
    – StatsSorceress
    7 hours ago




    $begingroup$
    Hmm. I read the accepted answer in the related question here, and I agree with it in theory, but sometimes in machine learning classification applications we don't care about the relative error types, we just care about "correct classification". In that case, could you train end-to-end as I describe?
    $endgroup$
    – StatsSorceress
    7 hours ago




    2




    2




    $begingroup$
    As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
    $endgroup$
    – gung
    6 hours ago




    $begingroup$
    As I said, you very much can set up your own custom optimization that will train the model & select the threshold simultaneously. You just have to do it yourself & the final model is likely to be poorer by most standards.
    $endgroup$
    – gung
    6 hours ago













    7












    $begingroup$

    It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



    If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



    Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



    See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






    share|cite|improve this answer









    $endgroup$

















      7












      $begingroup$

      It's because the optimal threshold is not only a function of the true positive rate (TPR), the false positive rate (FPR), accuracy or whatever else. The other crucial ingredient is the cost and the payoff of correct and wrong decisions.



      If your target is a common cold, your response to a positive test is to prescribe two aspirin, and the cost of a true untreated positive is an unnecessary two days' worth of headaches, then your optimal decision (not classification!) threshold is quite different than if your target is some life-threatening disease, and your decision is (a) some comparatively simple procedure like an appendectomy, or (b) a major intervention like months of chemotherapy! And note that although your target variable may be binary (sick/healthy), your decisions may have more values (send home with two aspirin/run more tests/admit to hospital and watch/operate immediately).



      Bottom line: if you know your cost structure and all the different decisions, you can certainly train a decision support system (DSS) directly, which includes a probabilistic classification or prediction. I would, however, strongly argue that discretizing predictions or classifications via thresholds is not the right way to go about this.



      See also my answer to the earlier "Classification probability threshold" thread. Or this answer of mine. Or that one.






      share|cite|improve this answer









      $endgroup$















        7












        answered 8 hours ago









      Stephan Kolassa

      48.5k





















            2












            Regardless of the underlying model, we can work out the sampling distributions of TPR and FPR at a threshold. This implies that we can characterize the variability in TPR and FPR at some threshold, and work backward from it to a desired error-rate trade-off.



            A ROC curve is a little deceptive because the only thing you control is the threshold; the plot, however, displays TPR and FPR, which are functions of that threshold. Moreover, TPR and FPR are both statistics, so they are subject to the vagaries of random sampling. This implies that if you were to repeat the procedure (say, by cross-validation), you could come up with a different FPR and TPR at the same threshold value.
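To make that sampling variability concrete, one can bootstrap a toy data set and recompute TPR and FPR at one fixed threshold. Everything below (score distributions, sample sizes, the 0.5 threshold) is invented for illustration:

```python
import random

random.seed(0)

# Toy scores: positives tend to score higher than negatives.
pos = [random.gauss(1.0, 1.0) for _ in range(200)]
neg = [random.gauss(0.0, 1.0) for _ in range(200)]

def rates(pos_scores, neg_scores, thr):
    """TPR and FPR when everything scoring >= thr is called positive."""
    tpr = sum(s >= thr for s in pos_scores) / len(pos_scores)
    fpr = sum(s >= thr for s in neg_scores) / len(neg_scores)
    return tpr, fpr

def bootstrap(scores):
    """Resample with replacement, same size as the original."""
    return [random.choice(scores) for _ in scores]

# Same threshold, different resamples -> different TPR/FPR estimates.
thr = 0.5
estimates = [rates(bootstrap(pos), bootstrap(neg), thr) for _ in range(5)]
for tpr, fpr in estimates:
    print(f"TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each line of output corresponds to the same point on the "true" ROC curve, yet the estimated (FPR, TPR) pair moves around from resample to resample.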



            However, if we can estimate the variability in the TPR and FPR, then repeating the ROC procedure is not necessary. We just pick a threshold such that the endpoints of a confidence interval (with some width) are acceptable. That is, pick the model so that the FPR is plausibly below some researcher-specified maximum, and/or the TPR is plausibly above some researcher-specified minimum. If your model can't attain your targets, you'll have to build a better model.
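A minimal sketch of that confidence-interval check, using a normal-approximation binomial interval; the counts and the 0.75/0.10 targets are made-up assumptions, not part of the answer:

```python
import math

def binom_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# Suppose at some candidate threshold we observed, on held-out data:
tp, n_pos = 170, 200  # 170 of 200 positives flagged -> TPR ~ 0.85
fp, n_neg = 12, 200   # 12 of 200 negatives flagged  -> FPR ~ 0.06

tpr_lo, tpr_hi = binom_ci(tp, n_pos)
fpr_lo, fpr_hi = binom_ci(fp, n_neg)

# Accept the threshold only if TPR is plausibly above the researcher's
# minimum and FPR plausibly below the researcher's maximum:
ok = tpr_lo >= 0.75 and fpr_hi <= 0.10
print(f"TPR in [{tpr_lo:.3f}, {tpr_hi:.3f}], "
      f"FPR in [{fpr_lo:.3f}, {fpr_hi:.3f}], acceptable: {ok}")
```

If the interval endpoints miss the targets, one moves the threshold or, as the answer says, builds a better model.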



            Of course, what TPR and FPR values are tolerable in your usage will be context-dependent.



            For more information, see ROC Curves for Continuous Data by Wojtek J. Krzanowski and David J. Hand.






            • This doesn't really answer my question, but it's a very nice description of ROC curves. – StatsSorceress, 8 hours ago

            • In what way does this not answer your question? What is your question, if not asking about how to choose a threshold for classification? – Sycorax, 8 hours ago

            • I was asking why we don't train the threshold instead of choosing it after training the model. – StatsSorceress, 8 hours ago

            • How would you train a threshold? – Sycorax, 8 hours ago

            • I'm not aware of any statistical procedure that works that way. Why is this square wheel a good idea? What problem does it solve? – Sycorax, 8 hours ago















            edited 8 hours ago

            answered 8 hours ago

            Sycorax
            43.1k


























































