Advantages of AUC vs standard accuracy












50












$begingroup$


I was starting to look into area under the curve (AUC) and am a little confused about its usefulness. When first explained to me, AUC seemed to be a great measure of performance, but in my research I've found that some claim its advantage is mostly marginal, in that it is best for catching 'lucky' models with high standard accuracy measurements and low AUC.



So should I avoid relying on AUC for validating models or would a combination be best? Thanks for all your help.



















$endgroup$








  • 4




    $begingroup$
    Consider a highly unbalanced problem. That is where ROC AUC is very popular, because the curve balances the class sizes. It's easy to achieve 99% accuracy on a data set where 99% of objects are in the same class.
    $endgroup$
    – Anony-Mousse
    Jul 27 '14 at 10:26






  • 3




    $begingroup$
    "The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class." I thought that these situations were where AUC performed poorly and precision-recall graphs/area under them were used.
    $endgroup$
    – JenSCDC
    Nov 26 '14 at 20:11










  • $begingroup$
    @JenSCDC, in my experience AUC performs well in these situations, and as indico describes below, that area comes from the ROC curve. The P-R graph is also useful (note that recall is the same as TPR, one of the axes in ROC), but precision is not the same as FPR, so the PR plot is related to ROC but not the same. Sources: stats.stackexchange.com/questions/132777/… and stats.stackexchange.com/questions/7207/…
    $endgroup$
    – alexey
    Sep 1 '17 at 0:11
















machine-learning accuracy
















asked Jul 22 '14 at 3:43









aidankmcl


















3 Answers


















47












$begingroup$

Really great question, and one that I find most people don't really understand on an intuitive level. AUC is in fact often preferred over accuracy for binary classification for a number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for one of the most widely used efficacy metrics, it's surprisingly hard to figure out exactly how AUC works.



AUC stands for Area Under the Curve. Which curve, you ask? Well, that would be the ROC curve. ROC stands for Receiver Operating Characteristic, which is actually slightly non-intuitive. The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class.



A great example is spam detection. Generally, spam datasets are STRONGLY biased towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good accuracy (90%) by just saying that every single email is ham, which obviously indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):



[Figure: ROC axes]



Now in this graph, TPR is specifically the ratio of true positives to all actual positives, and FPR is the ratio of false positives to all actual negatives. (Keep in mind, this is only for binary classification.) On a graph like this, it should be pretty straightforward to figure out that a prediction of all 0's or all 1's will result in the points (0,0) and (1,1) respectively. If you draw a line through these points you get something like this:



[Figure: Kind of like a triangle]



Which looks basically like a diagonal line (it is), and by some easy geometry, you can see that the AUC of such a model would be 0.5 (the area under the diagonal is a triangle whose height and base are both 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's, you could get the point (0.9, 0.9), which again falls along that diagonal line.
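To make the diagonal concrete, here is a minimal sketch in plain Python. The data set (10 spam, 90 ham) and the helper name `fpr_tpr` are made up for illustration; the "90% 1's" predictor is written out deterministically so that it ignores the true label exactly.

```python
def fpr_tpr(y_true, y_pred):
    """Return the ROC point (FPR, TPR) for binary labels and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical data: 10 spam (label 1) followed by 90 ham (label 0).
y_true = [1] * 10 + [0] * 90

all_zeros = [0] * 100
all_ones = [1] * 100
# Says "1" for 90% of each class, i.e. independently of the true label.
ninety_percent_ones = [1] * 9 + [0] * 1 + [1] * 81 + [0] * 9

print(fpr_tpr(y_true, all_zeros))            # (0.0, 0.0)
print(fpr_tpr(y_true, all_ones))             # (1.0, 1.0)
print(fpr_tpr(y_true, ninety_percent_ones))  # (0.9, 0.9)
```

All three points sit on the diagonal, which is what you expect from any predictor that does not look at the input.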



Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if instead we produced a score and set a cutoff, above which every result was a 1, and below which every result was a 0? This would mean that at the extremes you get the original situation where you have all 1's and all 0's (at a cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the 1x1 graph that contains your ROC. In practice you get something like this:
[Figure: example ROC curve, courtesy of Wikipedia]
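The cutoff sweep can be sketched in a few lines of plain Python. This is only an illustration: the scores and labels are invented, the function names `roc_points` and `trapezoid_auc` are mine, and the convention "score >= cutoff means predict 1" is an assumption (libraries may treat ties differently).

```python
def roc_points(y_true, scores):
    """Sweep the cutoff from high to low and collect (FPR, TPR) points."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # An infinite cutoff predicts all 0's; the lowest score as cutoff predicts all 1's.
    cutoffs = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for cutoff in cutoffs:
        tp = sum(1 for t, s in zip(y_true, scores) if s >= cutoff and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= cutoff and t == 0)
        points.append((fp / negatives, tp / positives))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and classifier scores.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]

points = roc_points(y_true, scores)
print(points[0], points[-1])   # (0.0, 0.0) (1.0, 1.0)
print(trapezoid_auc(points))   # approximately 0.889 (8/9)
```

The first and last points are the all-0's and all-1's extremes from before; everything in between comes from intermediate cutoffs.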



So basically, what you're actually getting when you use AUC over accuracy is something that will strongly discourage people from going for models that are representative but not discriminative. AUC will only actually select for models that achieve true positive and false positive rates that are better than random chance, which is not guaranteed for accuracy.
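Here is a small sketch of that difference, using the 90% ham setup from above. The data and both "models" are hypothetical. AUC is computed here by counting correctly ordered (spam, ham) pairs with ties counted as half, which is one standard way to get the same number as the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ordered correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical data: 10 spam (1) and 90 ham (0).
y_true = [1] * 10 + [0] * 90

# Model A: gives every email the same score, so it cannot discriminate at all.
scores_a = [0.0] * 100
# Model B: scores every spam above every ham, but all scores are below 0.5.
scores_b = [0.4] * 10 + [0.2] * 90

for scores in (scores_a, scores_b):
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # default 0.5 cutoff
    print(accuracy(y_true, y_pred), pairwise_auc(y_true, scores))
# 0.9 0.5
# 0.9 1.0
```

At the 0.5 cutoff both models call everything ham and get the same 90% accuracy, yet only AUC shows that model B separates the classes perfectly and model A not at all.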

















$endgroup$













  • $begingroup$
    Could you add how AUC compares to an F1-score?
    $endgroup$
    – Dan
    Jul 22 '14 at 7:00






  • 5




    $begingroup$
    @Dan- The biggest difference is that you don't have to set a decision threshold with AUC (it's essentially measuring the probability spam is ranked above non-spam). F1-score requires a decision threshold. Of course, you could always set the decision threshold as an operating parameter and plot F1-scores.
    $endgroup$
    – DSea
    Jul 22 '14 at 19:14



















13












$begingroup$

AUC and accuracy are fairly different things. AUC applies to binary classifiers that have some notion of a decision threshold internally. For example logistic regression returns positive/negative depending on whether the logistic function is greater/smaller than a threshold, usually 0.5 by default. When you choose your threshold, you have a classifier. You have to choose one.



For a given choice of threshold, you can compute accuracy, which is the proportion of correct predictions (true positives plus true negatives) in the whole data set.
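As a minimal sketch of accuracy being tied to one threshold (the scores, labels, and the name `accuracy_at` are invented for illustration, and "score >= threshold means positive" is an assumed convention):

```python
def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy of the classifier obtained by cutting the scores at `threshold`."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels and scores from some probabilistic model.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.4, 0.45, 0.55, 0.8, 0.3, 0.9]

print(accuracy_at(y_true, scores, 0.5))   # 0.875
print(accuracy_at(y_true, scores, 0.9))   # 0.625
print(accuracy_at(y_true, scores, 0.05))  # 0.5
```

Same scores, three different thresholds, three different accuracies: the number describes the thresholded classifier, not the scoring model on its own.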



AUC measures how true positive rate (recall) and false positive rate trade off, so in that sense it is already measuring something else. More importantly, AUC is not a function of threshold. It is an evaluation of the classifier as threshold varies over all possible values. It is in a sense a broader metric, testing the quality of the internal value that the classifier generates and then compares to a threshold. It is not testing the quality of a particular choice of threshold.



AUC has a different interpretation, and that is that it's also the probability that a randomly chosen positive example is ranked above a randomly chosen negative example, according to the classifier's internal value for the examples.



AUC is computable even if you have an algorithm that only produces a ranking on examples. AUC is not computable if you truly only have a black-box classifier, and not one with an internal threshold. These would usually dictate which of the two is even available to a problem at hand.
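To illustrate the ranking-only case, here is a sketch that computes AUC from nothing but the order of the examples, using the rank-sum (Mann-Whitney U) relation. The function name and the example ranking are hypothetical, and the sketch assumes there are no ties in the ranking.

```python
def auc_from_ranking(ranked_labels):
    """AUC from true labels listed from lowest-ranked to highest-ranked example.

    Assumes a strict ranking (no ties). Uses the rank-sum identity:
    AUC = (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg).
    """
    n_pos = sum(ranked_labels)
    n_neg = len(ranked_labels) - n_pos
    rank_sum = sum(rank for rank, label in enumerate(ranked_labels, start=1)
                   if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Hypothetical ranking: only the order is known, no scores at all.
print(auc_from_ranking([0, 0, 1, 0, 1, 1]))  # approximately 0.889 (8/9)
print(auc_from_ranking([0, 0, 1, 1]))        # 1.0, every positive above every negative
print(auc_from_ranking([1, 1, 0, 0]))        # 0.0, ranking exactly reversed
```

No threshold and no score values appear anywhere, only positions in the ranking, which is why AUC stays available in that setting while a black box that emits only hard labels gives you nothing to rank.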



AUC is, I think, a more comprehensive measure, although applicable in fewer situations. It's not strictly better than accuracy; it's different. It depends in part on whether you care more about true positives, false negatives, etc.



F-measure is more like accuracy in the sense that it's a function of a classifier and its threshold setting. But it measures precision vs recall (true positive rate), which is not the same as either above.
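A short sketch of F1 as a function of the threshold, on made-up labels and scores (the name `f1_at` and the "score >= threshold means positive" rule are assumptions for illustration):

```python
def f1_at(y_true, scores, threshold):
    """F1 score of the classifier obtained by cutting the scores at `threshold`."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.4, 0.45, 0.55, 0.8, 0.3, 0.9]

for threshold in (0.35, 0.5, 0.85):
    print(threshold, round(f1_at(y_true, scores, threshold), 3))
# 0.35 0.889
# 0.5 0.857
# 0.85 0.4
```

Like accuracy, the value moves with the threshold, so an F1 number only means something together with the cutoff that produced it.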















$endgroup$





















    4












    $begingroup$

    I'd like to address how you should choose a performance measure.
    Before that, I'll address the specific question of accuracy and AUC.



    As answered before, on an imbalanced dataset, using the majority rule as a classifier will lead to high accuracy, which makes accuracy a misleading measure.
    AUC aggregates over confidence thresholds, for good and bad.
    The good: you get a weighted result over all confidence levels.
    The bad: you usually care only about the confidence level you will actually use, and the rest are irrelevant.



    However, I want to remark on choosing a proper performance measure for a model.
    You should evaluate a model by its goal. The goal of a model is not a question of machine learning or statistics; it is a question of the business domain and its needs.



    If you are digging for gold (a scenario in which you have huge benefit from a true positive, not too high cost of a false positive) then recall is a good measure.



    If you are trying to decide whether to perform a complex medical procedure on people (high cost of false positive, hopefully a low cost of false negative), precision is the measure you should use.



    There are plenty of measures you can use.
    You can also combine them in various ways.



    However, there is no universal "best" measure.
    There is only the best measure for your needs: the one whose maximization will maximize your benefit.
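One way to sketch "the measure that maximizes your benefit" is to put explicit values on each outcome and pick the threshold with the highest total. Everything here is hypothetical: the data, the function names, and especially the gain and cost figures, which in practice have to come from the business domain.

```python
def total_benefit(y_true, scores, threshold, tp_gain, fp_cost, fn_cost):
    """Sum of gains and costs when scores at or above `threshold` are acted on."""
    total = 0.0
    for t, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and t == 1:
            total += tp_gain
        elif predicted_positive and t == 0:
            total -= fp_cost
        elif not predicted_positive and t == 1:
            total -= fn_cost
    return total

def best_threshold(y_true, scores, **values):
    """Candidate threshold (taken from the observed scores) with the highest benefit."""
    return max(sorted(set(scores)),
               key=lambda c: total_benefit(y_true, scores, c, **values))

# Hypothetical labels and scores.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.4, 0.45, 0.55, 0.8, 0.3, 0.9]

# "Digging for gold": a hit is worth a lot, a false alarm is cheap.
print(best_threshold(y_true, scores, tp_gain=100, fp_cost=1, fn_cost=0))   # 0.4
# "Complex medical procedure": a false positive is very expensive.
print(best_threshold(y_true, scores, tp_gain=10, fp_cost=100, fn_cost=1))  # 0.55
```

The same model ends up with a low, recall-friendly threshold in the first scenario and a higher, precision-friendly one in the second, purely because the stated costs differ.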















    $endgroup$






































      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      47












      $begingroup$

      Really great question, and one that I find that most people don't really understand on an intuitive level. AUC is in fact often preferred over accuracy for binary classification for a number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for being one of the most widely used efficacy metrics, it's surprisingly obtuse to figure out exactly how AUC works.



      AUC stands for Area Under the Curve, which curve you ask? Well, that would be the ROC curve. ROC stands for Receiver Operating Characteristic, which is actually slightly non-intuitive. The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class.



      A great example is in spam detection. Generally, spam datasets are STRONGLY biased towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good accuracy by just saying that every single email is ham, which is obviously something that indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):



      ROC axes



      Now in this graph, TPR is specifically the ratio of true positive to all positives, and FPR is the ratio of false positives to all negatives. (Keep in mind, this is only for binary classification.) On a graph like this, it should be pretty straightforward to figure out that a prediction of all 0's or all 1's will result in the points of (0,0) and (1,1) respectively. If you draw a line through these lines you get something like this:



      Kind of like a triangle



      Which looks basically like a diagonal line (it is), and by some easy geometry, you can see that the AUC of such a model would be 0.5 (height and base are both 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's, you could get the point (0.9, 0.9), which again falls along that diagonal line.



      Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if instead, we wanted to say that, theoretically we were going to set a cutoff, above which every result was a 1, and below which every result were a 0. This would mean that at the extremes you get the original situation where you have all 0's and all 1's (at a cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the 1x1 graph that contains your ROC. In practice you get something like this:
      Courtesy of Wikipedia



      So basically, what you're actually getting when you do an AUC over accuracy is something that will strongly discourage people going for models that are representative, but not discriminative, as this will only actually select for models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy.






      share|improve this answer











      $endgroup$













      • $begingroup$
        Could you add how AUC compares to an F1-score?
        $endgroup$
        – Dan
        Jul 22 '14 at 7:00






      • 5




        $begingroup$
        @Dan- The biggest difference is that you don't have to set a decision threshold with AUC (it's essentially measuring the probability spam is ranked above non-spam). F1-score requires a decision threshold. Of course, you could always set the decision threshold as an operating parameter and plot F1-scores.
        $endgroup$
        – DSea
        Jul 22 '14 at 19:14
















      47












      $begingroup$

      Really great question, and one that I find that most people don't really understand on an intuitive level. AUC is in fact often preferred over accuracy for binary classification for a number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for being one of the most widely used efficacy metrics, it's surprisingly obtuse to figure out exactly how AUC works.



      AUC stands for Area Under the Curve, which curve you ask? Well, that would be the ROC curve. ROC stands for Receiver Operating Characteristic, which is actually slightly non-intuitive. The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class.



      A great example is in spam detection. Generally, spam datasets are STRONGLY biased towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good accuracy by just saying that every single email is ham, which is obviously something that indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):



      ROC axes



      Now in this graph, TPR is specifically the ratio of true positive to all positives, and FPR is the ratio of false positives to all negatives. (Keep in mind, this is only for binary classification.) On a graph like this, it should be pretty straightforward to figure out that a prediction of all 0's or all 1's will result in the points of (0,0) and (1,1) respectively. If you draw a line through these lines you get something like this:



      Kind of like a triangle



      Which looks basically like a diagonal line (it is), and by some easy geometry, you can see that the AUC of such a model would be 0.5 (height and base are both 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's, you could get the point (0.9, 0.9), which again falls along that diagonal line.



      Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if instead, we wanted to say that, theoretically we were going to set a cutoff, above which every result was a 1, and below which every result were a 0. This would mean that at the extremes you get the original situation where you have all 0's and all 1's (at a cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the 1x1 graph that contains your ROC. In practice you get something like this:
      Courtesy of Wikipedia



      So basically, what you're actually getting when you do an AUC over accuracy is something that will strongly discourage people going for models that are representative, but not discriminative, as this will only actually select for models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy.






      share|improve this answer











      $endgroup$













      • $begingroup$
        Could you add how AUC compares to an F1-score?
        $endgroup$
        – Dan
        Jul 22 '14 at 7:00






      • 5




        $begingroup$
        @Dan- The biggest difference is that you don't have to set a decision threshold with AUC (it's essentially measuring the probability spam is ranked above non-spam). F1-score requires a decision threshold. Of course, you could always set the decision threshold as an operating parameter and plot F1-scores.
        $endgroup$
        – DSea
        Jul 22 '14 at 19:14














      47












      47








      47





      $begingroup$

      Really great question, and one that I find that most people don't really understand on an intuitive level. AUC is in fact often preferred over accuracy for binary classification for a number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for being one of the most widely used efficacy metrics, it's surprisingly obtuse to figure out exactly how AUC works.



      AUC stands for Area Under the Curve, which curve you ask? Well, that would be the ROC curve. ROC stands for Receiver Operating Characteristic, which is actually slightly non-intuitive. The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class.



      A great example is in spam detection. Generally, spam datasets are STRONGLY biased towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good accuracy by just saying that every single email is ham, which is obviously something that indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):



      ROC axes



      Now in this graph, TPR is specifically the ratio of true positive to all positives, and FPR is the ratio of false positives to all negatives. (Keep in mind, this is only for binary classification.) On a graph like this, it should be pretty straightforward to figure out that a prediction of all 0's or all 1's will result in the points of (0,0) and (1,1) respectively. If you draw a line through these lines you get something like this:



      Kind of like a triangle



      Which looks basically like a diagonal line (it is), and by some easy geometry, you can see that the AUC of such a model would be 0.5 (height and base are both 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's, you could get the point (0.9, 0.9), which again falls along that diagonal line.



      Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if instead, we wanted to say that, theoretically we were going to set a cutoff, above which every result was a 1, and below which every result were a 0. This would mean that at the extremes you get the original situation where you have all 0's and all 1's (at a cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the 1x1 graph that contains your ROC. In practice you get something like this:
      Courtesy of Wikipedia



      So basically, what you're actually getting when you do an AUC over accuracy is something that will strongly discourage people going for models that are representative, but not discriminative, as this will only actually select for models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy.






      share|improve this answer











      $endgroup$



      Really great question, and one that I find that most people don't really understand on an intuitive level. AUC is in fact often preferred over accuracy for binary classification for a number of different reasons. First though, let's talk about exactly what AUC is. Honestly, for being one of the most widely used efficacy metrics, it's surprisingly obtuse to figure out exactly how AUC works.



      AUC stands for Area Under the Curve, which curve you ask? Well, that would be the ROC curve. ROC stands for Receiver Operating Characteristic, which is actually slightly non-intuitive. The implicit goal of AUC is to deal with situations where you have a very skewed sample distribution, and don't want to overfit to a single class.



      A great example is in spam detection. Generally, spam datasets are STRONGLY biased towards ham, or not-spam. If your data set is 90% ham, you can get a pretty damn good accuracy by just saying that every single email is ham, which is obviously something that indicates a non-ideal classifier. Let's start with a couple of metrics that are a little more useful for us, specifically the true positive rate (TPR) and the false positive rate (FPR):



      ROC axes



      Now in this graph, TPR is specifically the ratio of true positive to all positives, and FPR is the ratio of false positives to all negatives. (Keep in mind, this is only for binary classification.) On a graph like this, it should be pretty straightforward to figure out that a prediction of all 0's or all 1's will result in the points of (0,0) and (1,1) respectively. If you draw a line through these lines you get something like this:



      Kind of like a triangle



      Which looks basically like a diagonal line (it is), and by some easy geometry, you can see that the AUC of such a model would be 0.5 (height and base are both 1). Similarly, if you predict a random assortment of 0's and 1's, let's say 90% 1's, you could get the point (0.9, 0.9), which again falls along that diagonal line.



      Now comes the interesting part. What if we weren't only predicting 0's and 1's? What if instead, we wanted to say that, theoretically we were going to set a cutoff, above which every result was a 1, and below which every result were a 0. This would mean that at the extremes you get the original situation where you have all 0's and all 1's (at a cutoff of 0 and 1 respectively), but also a series of intermediate states that fall within the 1x1 graph that contains your ROC. In practice you get something like this:
      [Figure: example ROC curves. Courtesy of Wikipedia]
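The cutoff sweep can be sketched in a few lines of plain Python. The function names, labels, and scores below are invented for illustration, and the sketch treats a score equal to the cutoff as a 1, which is one common convention rather than the only one.

```python
def roc_points(labels, scores):
    """ROC points (FPR, TPR), one per distinct cutoff, strictest cutoff first."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: everything predicted 0
    for cutoff in sorted(set(scores), reverse=True):
        # Everything scoring at or above the cutoff is predicted 1.
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under(points):
    """Area under the piecewise-linear curve through the points (trapezoids)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical model scores

points = roc_points(labels, scores)
print(points[0], points[-1])  # (0.0, 0.0) (1.0, 1.0)
print(area_under(points))     # about 0.889 (8/9)
```

The first and last points are the all-0's and all-1's extremes; everything in between comes from intermediate cutoffs.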



      So basically, what you get by using AUC instead of accuracy is a measure that strongly discourages models that are representative but not discriminative: it only rewards models whose trade-off between true positive rate and false positive rate is significantly better than random chance, which accuracy does not guarantee.















      edited 5 hours ago









      PascalIv

      31














      answered Jul 22 '14 at 4:10









      indico












      • $begingroup$
        Could you add how AUC compares to an F1-score?
        $endgroup$
        – Dan
        Jul 22 '14 at 7:00






      • 5




        $begingroup$
        @Dan- The biggest difference is that you don't have to set a decision threshold with AUC (it's essentially measuring the probability spam is ranked above non-spam). F1-score requires a decision threshold. Of course, you could always set the decision threshold as an operating parameter and plot F1-scores.
        $endgroup$
        – DSea
        Jul 22 '14 at 19:14





























      13












      $begingroup$

      AUC and accuracy are fairly different things. AUC applies to binary classifiers that have some notion of a decision threshold internally. For example, logistic regression returns positive/negative depending on whether the logistic function is greater/smaller than a threshold, usually 0.5 by default. Once you choose a threshold, you have a classifier. You have to choose one.



      For a given choice of threshold, you can compute accuracy, which is the proportion of true positives and true negatives in the whole data set.
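A small sketch of accuracy as a function of the chosen threshold, using invented scores and a hypothetical helper `accuracy_at` (scores at or above the threshold are treated as positive, which is an assumption of this example):

```python
def accuracy_at(threshold, labels, scores):
    """Accuracy of the classifier obtained by cutting the scores at `threshold`."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical internal scores

print(accuracy_at(0.5, labels, scores))   # about 0.833 (5 of 6 correct)
print(accuracy_at(0.85, labels, scores))  # about 0.667 (4 of 6 correct)
```

The same scores give different accuracies at different thresholds, which is why accuracy describes one particular classifier rather than the scoring function behind it.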



      AUC measures how true positive rate (recall) and false positive rate trade off, so in that sense it is already measuring something else. More importantly, AUC is not a function of threshold. It is an evaluation of the classifier as threshold varies over all possible values. It is in a sense a broader metric, testing the quality of the internal value that the classifier generates and then compares to a threshold. It is not testing the quality of a particular choice of threshold.



      AUC also has another interpretation: it is the probability that a randomly chosen positive example is ranked above a randomly chosen negative example, according to the classifier's internal value for the examples.
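This ranking interpretation can be checked directly by comparing every positive with every negative. The sketch below uses a made-up helper `pairwise_auc` and toy data, and counts ties as half a win, which is a common convention and an assumption here.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical internal scores
print(pairwise_auc(labels, scores))      # about 0.889 (8 of 9 pairs ordered correctly)

# A scorer that gives every example the same value cannot rank anything.
imbalanced = [0] * 90 + [1] * 10
print(pairwise_auc(imbalanced, [0.0] * 100))  # 0.5
```

Note that only the ordering of the scores matters, which is why a pure ranking is enough to compute it.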



      AUC is computable even if you have an algorithm that only produces a ranking of examples. AUC is not computable if you truly only have a black-box classifier that outputs hard labels, rather than one with an internal threshold. These constraints usually dictate which of the two is even available for the problem at hand.



      AUC is, I think, a more comprehensive measure, although applicable in fewer situations. It's not strictly better than accuracy; it's different. It depends in part on whether you care more about true positives, false negatives, etc.



      F-measure is more like accuracy in the sense that it's a function of a classifier and its threshold setting. But it measures precision vs recall (true positive rate), which is not the same as either above.
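As an illustration of the F-measure depending on the threshold, here is a sketch of F1 (the harmonic mean of precision and recall) at a given cutoff. The helper `f1_at` and the data are hypothetical, and returning 0.0 when precision or recall is undefined is a choice made for this example.

```python
def f1_at(threshold, labels, scores):
    """F1 score of the classifier obtained by cutting the scores at `threshold`."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0     # same quantity as TPR
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

labels = [1, 1, 0, 1, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # hypothetical internal scores

print(f1_at(0.5, labels, scores))   # about 0.857 (precision 0.75, recall 1.0)
print(f1_at(0.85, labels, scores))  # about 0.5 (precision 1.0, recall 1/3)
```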















      $endgroup$




























          answered Nov 28 '14 at 12:48









          Sean Owen























              4












              $begingroup$

              I'd like to address how you should choose a performance measure.
              Before that, I'll address the specific question of accuracy and AUC.



              As answered before, on an imbalanced dataset, always predicting the majority class leads to high accuracy, which makes accuracy a misleading measure.
              AUC aggregates over confidence thresholds, for good and for bad.
              The good is that you get a single result weighted across all confidence levels.
              The bad is that you usually care only about the confidence level you will actually use, and the rest are irrelevant.



              However, I want to make a remark about choosing a proper performance measure for a model.
              You should evaluate a model by its goal. The goal of a model is not a question of machine learning or statistics; it is a question of the business domain and its needs.



              If you are digging for gold (a scenario in which you have a huge benefit from a true positive and a not-too-high cost of a false positive), then recall is a good measure.



              If you are trying to decide whether to perform a complex medical procedure on people (high cost of false positive, hopefully a low cost of false negative), precision is the measure you should use.



              There are plenty of measures you can use.
              You can also combine them in various ways.



              However, there is no universal "best" measure.
              There is only the best measure for your needs: the one whose maximization maximizes your benefit.
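One way to make "maximize your benefit" concrete is to attach payoffs to each outcome and score a model by its net benefit. Everything in the sketch below is hypothetical: the function name, the two sets of predictions, and especially the payoff numbers, which are invented only to mimic the gold-digging and medical-procedure scenarios.

```python
def net_benefit(labels, predictions, gain_tp, cost_fp, cost_fn):
    """Total payoff given a gain per true positive and costs per false positive/negative."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return gain_tp * tp - cost_fp * fp - cost_fn * fn

labels   = [1, 1, 0, 1, 0, 0]  # hypothetical ground truth
eager    = [1, 1, 1, 1, 0, 0]  # high recall, one false positive
cautious = [1, 0, 0, 0, 0, 0]  # high precision, two misses

# "Digging for gold": big gain per hit, cheap false alarms (made-up payoffs).
print(net_benefit(labels, eager, gain_tp=100, cost_fp=1, cost_fn=0))     # 299
print(net_benefit(labels, cautious, gain_tp=100, cost_fp=1, cost_fn=0))  # 100

# "Complex medical procedure": expensive false positives (made-up payoffs).
print(net_benefit(labels, eager, gain_tp=10, cost_fp=50, cost_fn=1))     # -20
print(net_benefit(labels, cautious, gain_tp=10, cost_fp=50, cost_fn=1))  # 8
```

The same two models swap places depending on the payoffs, which is the point: the preferred measure follows from the domain, not the other way around.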















              $endgroup$




























                  answered Nov 11 '15 at 7:32









                  DaL





























