The cross-entropy error function in neural networks












95












$begingroup$


In the MNIST For ML Beginners tutorial, cross-entropy is defined as



$$H_{y'} (y) := - \sum_{i} y_{i}' \log (y_i)$$



$y_i$ is the predicted probability value for class $i$ and $y_i'$ is the true probability for that class.



Question 1



Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0? This would mean that we have a really bad classifier, of course. But think of an error in our dataset, e.g. an "obvious" 1 labeled as 3. Would it simply crash? Does the model we chose (softmax activation at the end) basically never give the probability 0 for the correct class?



Question 2



I've learned that cross-entropy is defined as



$$H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log (1-y_i)})$$



Which is correct? Do you have any textbook references for either version? How do those functions differ in their properties (as error functions for neural networks)?





















$endgroup$












  • $begingroup$
    See also: stats.stackexchange.com/questions/80967/…
    $endgroup$
    – Piotr Migdal
    Jan 22 '16 at 19:04










  • $begingroup$
    See also: Kullback-Leibler Divergence Explained blog post.
    $endgroup$
    – Piotr Migdal
    May 11 '17 at 22:15
















machine-learning tensorflow














edited Apr 19 '18 at 19:17









Alex











asked Dec 10 '15 at 6:22









Martin Thoma






















5 Answers


















85












$begingroup$

One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.



Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes $\{1,2,\dots,n\}$ their hypothetical occurrence probabilities $y_1, y_2, \dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, and so on up to $k_n$ instances of class $n$. According to your model, the likelihood of this happening is:
$$
P[data|model] := y_1^{k_1}y_2^{k_2}\dots y_n^{k_n}.
$$

Taking the logarithm and changing the sign:
$$
-\log P[data|model] = -k_1\log y_1 -k_2\log y_2 - \dots -k_n\log y_n = -\sum_i k_i \log y_i
$$

If you now divide the right-hand sum by the number of observations $N = k_1+k_2+\dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
$$
-\frac{1}{N} \log P[data|model] = -\frac{1}{N}\sum_i k_i \log y_i = -\sum_i y_i'\log y_i =: H(y', y)
$$
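
The identity above can be checked numerically. The following is a minimal sketch using only the standard library; the counts and model probabilities are made-up values for illustration, and the helper names are hypothetical.

```python
import math

def neg_log_likelihood(counts, probs):
    """-log P[data|model] for class counts k_i under model probabilities y_i."""
    return -sum(k * math.log(y) for k, y in zip(counts, probs) if k > 0)

def cross_entropy(true_probs, pred_probs):
    """H(y', y) = -sum_i y'_i log y_i, skipping terms where y'_i = 0."""
    return -sum(t * math.log(p) for t, p in zip(true_probs, pred_probs) if t > 0)

counts = [6, 3, 1]        # hypothetical observed counts k_i
model = [0.5, 0.3, 0.2]   # hypothetical model probabilities y_i

n_obs = sum(counts)
empirical = [k / n_obs for k in counts]   # y'_i = k_i / N

nll_per_obs = neg_log_likelihood(counts, model) / n_obs
h = cross_entropy(empirical, model)

print(nll_per_obs, h)   # the two values agree up to rounding
```

Dividing the negative log-likelihood by $N$ only rescales it, so minimizing either quantity picks the same model.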



Furthermore, the negative log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length": the number of bits you expect to spend to encode this information if your encoding scheme were based on your hypothesis.



This follows from the observation that an independent event with probability $y_i$ requires at least $-\log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-\sum_i y_i'\log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.
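
As a rough illustration of the encoding-length reading, the sketch below computes the expected number of bits per event when code lengths come from a hypothesized distribution but events are drawn from the actual one. The distributions are invented for the example, chosen as powers of 1/2 so the code lengths are whole bits.

```python
import math

def expected_code_length_bits(actual, hypothesized):
    """-sum_i y'_i log2(y_i): expectation over `actual`, lengths from `hypothesized`."""
    return sum(t * -math.log2(q) for t, q in zip(actual, hypothesized) if t > 0)

actual = [0.5, 0.25, 0.25]      # hypothetical true distribution y'
bad_model = [0.25, 0.25, 0.5]   # hypothetical mismatched hypothesis y

entropy_bits = expected_code_length_bits(actual, actual)         # code matched to reality
mismatched_bits = expected_code_length_bits(actual, bad_model)   # code built from the wrong hypothesis

print(entropy_bits, mismatched_bits)
```

With these numbers the matched code costs 1.5 bits per event and the mismatched one 1.75 bits; the extra 0.25 bits is the price of the wrong hypothesis.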



Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.



With those intuitions in mind, the answers to your questions can be seen as follows:





  • Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.



    To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.



    If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.




  • Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-\log P[data|model]$ for binary data, an equivalent of the second equation in this answer.



    Hence, strictly speaking, although it is still a log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
    $$
    \sum_i H(y_i', y_i),
    $$

    where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
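
To make the Question 2 point concrete, here is a small sketch (labels and predictions are hypothetical) checking that the sum of per-point binary cross-entropies equals the negative log-likelihood of the binary data.

```python
import math

def binary_cross_entropy(t, p):
    """H((t, 1-t), (p, 1-p)) for one data point; zero-weight terms are skipped."""
    total = 0.0
    if t > 0:
        total -= t * math.log(p)
    if t < 1:
        total -= (1 - t) * math.log(1 - p)
    return total

labels = [1, 0, 1, 1]          # hypothetical y'_i, each 0 or 1
preds = [0.9, 0.2, 0.6, 0.7]   # hypothetical model probabilities y_i

summed = sum(binary_cross_entropy(t, p) for t, p in zip(labels, preds))

# Likelihood of the observed labels under the model, point by point.
likelihood = 1.0
for t, p in zip(labels, preds):
    likelihood *= p if t == 1 else 1 - p
neg_log_lik = -math.log(likelihood)

print(summed, neg_log_lik)   # agree up to rounding
```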



















$endgroup$









  • 1




    $begingroup$
    Can you provide a source where they define $y_i' = \frac{k_i}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
    $endgroup$
    – Lenar Hoyt
    Jun 22 '16 at 7:47






  • 1




    $begingroup$
    In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
    $endgroup$
    – Lenar Hoyt
    Jun 22 '16 at 9:32










  • $begingroup$
    @LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
    $endgroup$
    – THN
    Jul 13 '17 at 11:02










  • $begingroup$
    'independent event requires...to encode it' - could you explain this bit please?
    $endgroup$
    – Alex
    Aug 20 '17 at 13:25










  • $begingroup$
    @Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
    $endgroup$
    – KT.
    Aug 21 '17 at 9:55





















20












$begingroup$

The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.



That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
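
A quick sketch of that property, with made-up prediction vectors: two predictions that assign the same probability to the true class get the same loss, however the remaining mass is spread.

```python
import math

def multiclass_log_loss(one_hot, probs):
    """First formula: -sum_i y'_i log(y_i), evaluated only where y'_i is nonzero."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

target = [0, 1, 0]             # true class is index 1
spread = [0.2, 0.6, 0.2]       # hypothetical prediction, rest spread evenly
lopsided = [0.39, 0.6, 0.01]   # hypothetical prediction, rest concentrated on class 0

loss_spread = multiclass_log_loss(target, spread)
loss_lopsided = multiclass_log_loss(target, lopsided)

print(loss_spread, loss_lopsided)   # both equal -log(0.6)
```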



Another assumption is that $\sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.



Question 1




Isn't it a problem that the $y_i$ (in $\log(y_i)$) could be 0?




Yes, that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0 in any class. But it is possible, so it is worth allowing for. First, don't evaluate $\log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) ) for numerical stability - in many cases it is not required, but this is sensible defensive programming.
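
Both precautions can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and the floor of 1e-15 is the value suggested above rather than a universal constant.

```python
import math

EPS = 1e-15   # floor suggested in the text; other small values work too

def safe_log_loss(one_hot, probs, eps=EPS):
    """Multiclass log loss that skips zero targets and clips predictions away from 0."""
    loss = 0.0
    for t, p in zip(one_hot, probs):
        if t == 0:
            continue   # negative classes contribute nothing, so log(p) is never evaluated
        loss -= t * math.log(max(p, eps))
    return loss

# Model puts probability 0 on the true class: large but finite loss instead of an error.
loss_zero = safe_log_loss([0, 1, 0], [1.0, 0.0, 0.0])

# Probability 0 on a negative class is harmless because that term is skipped.
loss_ok = safe_log_loss([0, 1, 0], [0.0, 1.0, 0.0])

print(loss_zero, loss_ok)
```

Without the clip, `math.log(0.0)` raises a `ValueError` in plain Python, which is the "crash" the question asks about.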



Question 2




I've learned that cross-entropy is defined as $H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log (1-y_i)})$




This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.



If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
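
The equivalence can be checked with a small sketch. It assumes the single output is a sigmoid of a logit `z` and that the two-output network uses logits `[z, 0]`; both are illustrative choices, not something stated in the tutorial.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)   # subtract the max before exponentiating, for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_loss(t, p):
    """Second formula for a single output."""
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

def multiclass_loss(one_hot, probs):
    """First formula over the two opposing outputs."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

z = 0.8     # hypothetical logit
label = 1   # positive class

p_single = sigmoid(z)
p_pair = softmax([z, 0.0])   # (positive, negative) outputs

loss_single = binary_loss(label, p_single)
loss_pair = multiclass_loss([label, 1 - label], p_pair)

print(loss_single, loss_pair)   # agree up to rounding
```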



If there is more than one class to predict membership of, and the classes are not exclusive, i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class).
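
For the non-exclusive case, a minimal sketch of the second formulation looks like this. The targets and probabilities are hypothetical; note that the predicted probabilities do not need to sum to 1, since each class is treated as its own yes/no question.

```python
import math

def multilabel_loss(targets, probs):
    """Second formula: one independent binary term per class."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(targets, probs)
    )

targets = [1, 0, 1]       # example belongs to classes 0 and 2 at once
probs = [0.9, 0.2, 0.7]   # hypothetical independent per-class outputs

loss = multilabel_loss(targets, probs)
print(loss, sum(probs))   # sum(probs) is about 1.8, which is fine here
```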

















$endgroup$













  • $begingroup$
    Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples.
    $endgroup$
    – Neil Slater
    Dec 10 '15 at 16:24










  • $begingroup$
    I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $\log(y_i) = 0$, but in $y_i = 0$, because of $\log(y_i)$. Could you please adjust your answer to that?
    $endgroup$
    – Martin Thoma
    Dec 17 '15 at 8:47










  • $begingroup$
    @NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula?
    $endgroup$
    – Media
    Feb 28 '18 at 13:15






  • 1




    $begingroup$
    @Media: Not really. You want to be looking at things such as hierarchical classification though . . .
    $endgroup$
    – Neil Slater
    Feb 28 '18 at 15:38






  • 1




    $begingroup$
    @Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
    $endgroup$
    – Neil Slater
    Feb 1 at 15:46





















10












$begingroup$

Given $y_{true}$, you want to optimize your machine learning method to get the $y_{predict}$ as close as possible to $y_{true}$.



First question:



The answer above explains the background of your first formula: cross-entropy as defined in information theory.



From a viewpoint other than information theory:



You can verify for yourself that the first formula has no penalty for false positives (the truth is false but your model predicts that it is true), while the second one does penalize false positives. Therefore, the choice between the first and second formula will affect your metrics (i.e. which statistical quantity you would like to use to evaluate your model).



In layman's terms:



If you want to accept almost all good people as your friends but are willing to accept some bad people as friends too, then use the first formula as your criterion.



If you want to punish yourself for accepting some bad people as friends, at the cost that your acceptance rate for good people might be lower than in the first case, then use the second formula.



That said, I guess most of us are critical and would choose the second one (which is also what many ML packages assume cross-entropy to be).



Second question:



Cross entropy per sample per class: $$-y_{true}\log{(y_{predict})}$$



Cross entropy for the whole dataset, over all classes: $$\sum_i^n \sum_k^K -y_{true}^{(k)}\log{(y_{predict}^{(k)})}$$



Thus, when there are only two classes (K = 2), you will have the second formula.

















$endgroup$





















    5












    $begingroup$

    Those issues are handled by the tutorial's use of softmax.



    For 1) you're correct that softmax guarantees a non-zero output because it exponentiates its input. For activations that do not give this guarantee (like ReLU), it's simple to add a very small positive term to every output to avoid that problem.



    As for 2), they obviously aren't the same, but the softmax formulation they gave takes care of the issue. If you didn't use softmax, this would cause you to learn huge bias terms that guess 1 for every class for any input. But since they normalize the softmax across all classes, the only way to maximize the output of the correct class is for it to be large relative to the incorrect classes.
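
One caveat on 1): the non-zero guarantee is mathematical, not numerical. The sketch below uses deliberately extreme, hypothetical logits to show a naive softmax underflowing to exactly 0.0 in floating point. Computing log-probabilities directly with the log-sum-exp trick is a common remedy, shown here as an alternative to adding a small positive term; it is not something the tutorial itself is known to do.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """log(softmax(x)) via log-sum-exp, never forming the tiny probabilities."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_total for v in logits]

logits = [0.0, -800.0]            # hypothetical, extreme gap between classes

probs = naive_softmax(logits)     # exp(-800) underflows, so probs[1] is exactly 0.0
log_probs = log_softmax(logits)   # log-probability of class 1 stays finite

print(probs, log_probs)
```

Taking `math.log(probs[1])` here would fail, while `-log_probs[1]` gives a large but usable loss of 800.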















    $endgroup$













    • $begingroup$
      "you're correct that softmax guarantees a non-zero output" - I know that this is theoretically the case. In reality, can it happen that (due to numeric issues) this becomes 0?
      $endgroup$
      – Martin Thoma
      Dec 10 '15 at 14:30










    • $begingroup$
      Good question. I assume it's perfectly possible for the exponentiation function to output 0.0 if your input is too small for the precision of your float. However I'd guess most implementations do add the tiny positive term to guarantee non-zero input.
      $endgroup$
      – jamesmf
      Dec 10 '15 at 14:50



















    0












    $begingroup$


    Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0?




    Yes it is, since $\log(0)$ is undefined, but in practice this problem is avoided by using $\log(y_i + \epsilon)$.




    What is correct?

    (a) $H_{y'} (y) := - \sum_{i} y_{i}' \log (y_i)$ or

    (b) $H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log(1-y_i)})$?




    (a) is correct for estimating class probabilities, (b) is correct for predicting binary classes. Both are cross-entropy; (a) sums over classes and doesn't care about misclassifications, while (b) sums over training points.



    Example:



    Suppose each training point $x_i$ has label $c_i \in \{0, 1\}$, and the model predicts $c_i' \in [0, 1]$. Let $p(c)$ be the empirical probability of class $c$, and $p'(c)$ be the model's estimate.



    True label $c_i$ and model prediction $c_i'$ for 5 data points are:
    $(c_i, c_i')=\{(1, 0.8), (1, 0.2), (0, 0.1), (0, 0.4), (0, 0.8)\}$,



    Empirical and estimated class probabilities are:
    $p(1) = 2/5 = 0.4$, $p'(1) = 2/5 = 0.4$,



    (a) is calculated (with base-10 logarithms) as: $-p(1)\log p'(1) - p(0)\log p'(0) = -0.4\log(0.4) - 0.6\log(0.6) = 0.292$.



    Two data points $(1, 0.2)$ and $(0, 0.8)$ are misclassified but $p(c)$ is estimated correctly!



    (b) is calculated as: $-1/5([\log(0.8) + \log(0.2)] + [\log(1-0.1)+\log(1-0.4) + \log(1-0.8)]) = 0.352$



    Now, suppose all 5 points were classified correctly as:
    $(c_i, c_i')=\{(1, 0.8), (1, \color{blue}{0.8}), (0, 0.1), (0, 0.4), (0, \color{blue}{0.2})\}$,



    (a) still remains the same, since $p'(1)$ is still $2/5$. However, (b) decreases to:
    $-1/5([\log(0.8) + \log(\color{blue}{0.8})] + [\log(1-0.1)+\log(1-0.4) + \log(1-\color{blue}{0.2})]) = 0.112$
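
The worked numbers can be reproduced with a short script. This sketch assumes base-10 logarithms, which is what matches the figures quoted in the example, and thresholds predictions at 0.5 to obtain $p'(c)$; the function names are only for illustration.

```python
import math

def loss_a(points):
    """(a): cross-entropy between empirical and estimated class frequencies."""
    n = len(points)
    p1 = sum(c for c, _ in points) / n                            # empirical p(1)
    q1 = sum(math.floor(pred + 0.5) for _, pred in points) / n    # estimated p'(1)
    return -(p1 * math.log10(q1) + (1 - p1) * math.log10(1 - q1))

def loss_b(points):
    """(b): mean binary cross-entropy over the training points."""
    n = len(points)
    return -sum(
        math.log10(pred) if c == 1 else math.log10(1 - pred)
        for c, pred in points
    ) / n

original = [(1, 0.8), (1, 0.2), (0, 0.1), (0, 0.4), (0, 0.8)]
corrected = [(1, 0.8), (1, 0.8), (0, 0.1), (0, 0.4), (0, 0.2)]

print(round(loss_a(original), 3), round(loss_b(original), 3))
print(round(loss_a(corrected), 3), round(loss_b(corrected), 3))
```

Only (b) reacts when the two misclassified points are fixed; (a) sees the same class frequencies in both cases.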



    Derivation:



    To write down the formulas, I changed your notation for clarity.



    Let's write (a) as: $H_{p} (p') := - \sum_{c} p(c)\log p'(c)$



    This sum is over all possible classes such as $C=\{red, blue, green\}$ or $C=\{0, 1\}$. To calculate (a), the model should output $c_i' \in C$ for every $(x_i, c_i)$, then the ratios $p(c)=\sum_{i:c_i=c}1/N$ and $p'(c)=\sum_{i:c_i'=c}1/N$ should be plugged into (a).



    If there are two classes 1 and 0, another cross-entropy (b) can be used. For training point $(x_i, c_i)$, when $c_i = 1$, we want the model's output $c_i'=p'(c=1|x_i)$ to be close to 1, and when $c_i = 0$, close to 0. Therefore, the loss of $(x_i, 1)$ can be defined as $-\log(c_i')$, which gives $c_i' \rightarrow 1 \Rightarrow -\log(c_i') \rightarrow 0$. Similarly, the loss of $(x_i, 0)$ can be defined as $-\log(1 - c_i')$, which gives $c_i' \rightarrow 0 \Rightarrow -\log(1 - c_i') \rightarrow 0$. Both losses can be combined as:



    $L(c_i, c_i') = -c_i\log(c_i') - (1 - c_i)\log(1 - c_i')$,



    When $c_i = 1$, the term $(1 - c_i)\log(1 - c_i')=0$ drops out, and when $c_i = 0$, the term $c_i\log(c_i')=0$ drops out.



    Finally, (b) can be written as:



    $\begin{align*}
    H_{c}(c') &= - 1/N\sum_{(x_i,c_i)} \left[ c_i\log(c_i') + (1 - c_i)\log(1 - c_i') \right]\\
    &= - 1/N\sum_{(x_i,1)} \log(c_i') - 1/N\sum_{(x_i,0)} \log(1 - c_i')
    \end{align*}$



    To better see the difference, cross-entropy (a) for two classes $\{0, 1\}$ would be:



    $\begin{align*}
    H_{p} (p') &= - p(1)\log p'(1) - p(0)\log p'(0)\\
    &= - 1/N\sum_{(x_i,1)}\log(\sum_{k:c_k''=1}1/N) - 1/N\sum_{(x_i,0)}\log(1 - \sum_{k:c_k''=1}1/N)
    \end{align*}$



    Using $p(c) = \sum_{(x_i,c)}1/N$, and $p'(c) = \sum_{i:c_i''=c}1/N$ where $c_i'' = \left\lfloor c_i' + 0.5 \right\rfloor \in \{0, 1\}$.



    There is a summation inside $\log(\cdot)$ independent of point $i$, meaning (a) doesn't care about $i$ being misclassified.












    $endgroup$








































      85












      $begingroup$

      One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.



      Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes ${1,2,dots, n}$ their hypothetical occurrence probabilities $y_1, y_2,dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, $k_n$ instances of class $n$, etc. According to your model the likelihood of this happening is:
      $$
      P[data|model] := y_1^{k_1}y_2^{k_2}dots y_n^{k_n}.
      $$

      Taking the logarithm and changing the sign:
      $$
      -log P[data|model] = -k_1log y_1 -k_2log y_2 - dots -k_nlog y_n = -sum_i k_i log y_i
      $$

      If you now divide the right-hand sum by the number of observations $N = k_1+k_2+dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
      $$
      -frac{1}{N} log P[data|model] = -frac{1}{N}sum_i k_i log y_i = -sum_i y_i'log y_i =: H(y', y)
      $$



      Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme would be based on your hypothesis.



      This follows from the observation that an independent event with probability $y_i$ requires at least $-log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
      $$-sum_i y_i'log_2 y_i,$$
      is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.



      Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.



      With those intuitions in mind, the answers to your questions can be seen as follows:





      • Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.



        To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible when it can in fact happen. In practice, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.



        If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it does not hypothesize zeros where it should not.




      • Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-\log P[data|model]$ for binary data, an equivalent of the second equation in this answer.



        Hence, strictly speaking, although it is still a negative log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
        $$
        \sum_i H(y_i', y_i),
        $$

        where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
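The claim in Question 2 can be sketched in a few lines: summing the binary cross-entropies over data points gives the same number as $-\log P[data|model]$ for 0/1 labels. The labels and probabilities below are invented for illustration, and the helper name is hypothetical.

```python
import math

def binary_cross_entropy(label, prob):
    """H((y', 1-y'), (y, 1-y)) for one data point with a 0/1 label y'."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

labels = [1, 0, 1, 1]            # hypothetical y'_i, one per data point
probs = [0.9, 0.2, 0.6, 0.7]     # hypothetical model outputs y_i

# Sum over binary cross-entropies of the individual points.
total_bce = sum(binary_cross_entropy(t, p) for t, p in zip(labels, probs))

# Likelihood of the same binary data under the model, computed directly.
likelihood = 1.0
for t, p in zip(labels, probs):
    likelihood *= p if t == 1 else (1 - p)
neg_log_lik = -math.log(likelihood)

print(total_bce, neg_log_lik)    # agree up to rounding
```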



















      $endgroup$









      • 1




        $begingroup$
        Can you provide a source where they define $y_i'=\frac{k_i}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
        $endgroup$
        – Lenar Hoyt
        Jun 22 '16 at 7:47






      • 1




        $begingroup$
        In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
        $endgroup$
        – Lenar Hoyt
        Jun 22 '16 at 9:32










      • $begingroup$
        @LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
        $endgroup$
        – THN
        Jul 13 '17 at 11:02










      • $begingroup$
        'independent event requires...to encode it' - could you explain this bit please?
        $endgroup$
        – Alex
        Aug 20 '17 at 13:25










      • $begingroup$
        @Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
        $endgroup$
        – KT.
        Aug 21 '17 at 9:55
























      edited Nov 25 '18 at 11:08

























      answered Dec 16 '15 at 13:29









      KT.





















      20












      $begingroup$

      The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.



      That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
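To make this concrete, here is a small sketch (invented numbers, hypothetical helper name) in which two predictions spread the "wrong" probability mass very differently but give the true class the same probability, and therefore receive the same loss.

```python
import math

def multiclass_log_loss(one_hot, predicted):
    """-sum_i y'_i log(y_i); only classes with y'_i > 0 contribute."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, predicted) if t > 0)

target = [0, 1, 0]               # true class is the second one
pred_a = [0.30, 0.6, 0.10]       # hypothetical prediction A
pred_b = [0.05, 0.6, 0.35]       # hypothetical prediction B, same true-class probability

loss_a = multiclass_log_loss(target, pred_a)
loss_b = multiclass_log_loss(target, pred_b)
print(loss_a, loss_b)            # identical: both equal -log(0.6)
```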



      Another assumption is that $\sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
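For reference, a plain-Python sketch of a softmax is shown below. Subtracting the maximum score before exponentiating is a common stabilisation trick and is an addition here, not something the answer prescribes; the input scores are arbitrary.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to positive values that sum to 1."""
    m = max(scores)                           # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5])             # hypothetical raw network outputs
print(probs, sum(probs))
```

Outputs from some other final layer could be rescaled in a similar way, for example by dividing non-negative scores by their sum.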



      Question 1




      Isn't it a problem that the $y_i$ (in $\log(y_i)$) could be 0?




      Yes, that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0 in any class. But it is possible, so it is worth allowing for. First, don't evaluate $\log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) ) for numerical stability - in many cases it is not required, but this is sensible defensive programming.
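Both precautions can be sketched as follows. The function name and the example vectors are hypothetical; the 1e-15 floor is the value suggested above, and other small constants would work as well.

```python
import math

EPS = 1e-15   # floor for predicted probabilities, as suggested above

def safe_log_loss(one_hot, predicted, eps=EPS):
    """Multiclass log loss that skips classes with y'_i = 0
    and clamps y_i away from zero before taking the log."""
    return -sum(math.log(max(p, eps))
                for t, p in zip(one_hot, predicted) if t == 1)

predicted = [0.0, 1.0]   # e.g. an output that has been rounded to an exact 0
target = [1, 0]          # the true class is the one predicted as 0

loss = safe_log_loss(target, predicted)
print(loss)              # large but finite, instead of a math domain error
```

Without the clamp, `math.log(0.0)` raises a `ValueError` in Python; array libraries typically return `-inf` instead, which then propagates into the gradients.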



      Question 2




      I've learned that cross-entropy is defined as $H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log (1-y_i)})$




      This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.



      If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
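A quick numerical sketch of this equivalence is given below. It assumes the second ("negative") output has its raw score fixed at 0, which is one convenient way to set up the two opposing outputs; the score 0.8 is arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z_pos, z_neg):
    """Softmax over two raw scores, returned as (p_positive, p_negative)."""
    m = max(z_pos, z_neg)
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg), e_neg / (e_pos + e_neg)

z = 0.8        # hypothetical raw score of the single-output network
label = 1      # y' for the positive class

# Single sigmoid output with the second (binary) formula.
p = sigmoid(z)
binary_loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))

# Two opposing outputs with softmax and the first (multiclass) formula.
y_pos, y_neg = softmax2(z, 0.0)
one_hot = (label, 1 - label)
softmax_loss = -(one_hot[0] * math.log(y_pos) + one_hot[1] * math.log(y_neg))

print(binary_loss, softmax_loss)   # agree up to rounding
```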



      If there is more than one class to predict membership of, and the classes are not exclusive, i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class).

















      $endgroup$













      • $begingroup$
        Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples.
        $endgroup$
        – Neil Slater
        Dec 10 '15 at 16:24










      • $begingroup$
        I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $\log(y_i) = 0$, but in $y_i = 0$, because of $\log(y_i)$. Could you please adjust your answer to that?
        $endgroup$
        – Martin Thoma
        Dec 17 '15 at 8:47










      • $begingroup$
        @NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula?
        $endgroup$
        – Media
        Feb 28 '18 at 13:15






      • 1




        $begingroup$
        @Media: Not really. You want to be looking at things such as hierarchical classification though . . .
        $endgroup$
        – Neil Slater
        Feb 28 '18 at 15:38






      • 1




        $begingroup$
        @Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
        $endgroup$
        – Neil Slater
        Feb 1 at 15:46


























      edited Dec 17 '15 at 9:40

























      answered Dec 10 '15 at 16:10









      Neil Slater
















      $begingroup$
      @Media: Not really. You want to be looking at things such as hierarchical classification though . . .
      $endgroup$
      – Neil Slater
      Feb 28 '18 at 15:38




      $begingroup$
      @Media: Not really. You want to be looking at things such as hierarchical classification though . . .
      $endgroup$
      – Neil Slater
      Feb 28 '18 at 15:38




      1




      1




      $begingroup$
      @Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
      $endgroup$
      – Neil Slater
      Feb 1 at 15:46






      $begingroup$
      @Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
      $endgroup$
      – Neil Slater
      Feb 1 at 15:46













      10












      $begingroup$

      Given $y_{true}$, you want to optimize your machine learning method to get the $y_{predict}$ as close as possible to $y_{true}$.



      First question:

      The answer above explains the background of your first formula: cross-entropy as defined in information theory.

      From a viewpoint other than information theory:

      You can check for yourself that the first formula has no explicit penalty for false positives (the truth is negative but your model predicts positive), while the second one does, through its $(1-y_i') \log(1-y_i)$ term. Therefore, the choice between the first and the second formula will affect your metrics (that is, the statistical quantity you use to evaluate your model).

      In layman's terms:

      If you want to accept almost all good people as friends, and are willing to let some bad people become your friends too, then use the first formula as your criterion.

      If you want to penalize yourself for accepting bad people as friends, at the cost that your acceptance rate for good people might be lower than in the first case, then use the second formula.

      I guess most of us are critical and would choose the second one (which is also what many ML packages assume cross-entropy to mean).



      Second question:



      Cross-entropy per sample per class: $$-y_{true}\log{(y_{predict})}$$

      Cross-entropy over the whole dataset and all classes: $$\sum_i^n \sum_k^K -y_{true}^{(k)}\log{(y_{predict}^{(k)})}$$

      Thus, when there are only two classes ($K = 2$), the two class terms are $y_{true}$ with $y_{predict}$ and $1 - y_{true}$ with $1 - y_{predict}$, which gives the second formula.
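
The two-class case can be checked numerically. The following is a small sketch, assuming NumPy; the function names and the predicted probability 0.7 are illustrative assumptions. It evaluates the sum over $K = 2$ classes with the class vector $(t, 1 - t)$ and compares it with the binary form.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    """First formula: sum over classes of -y_true * log(y_pred)."""
    return -np.sum(y_true * np.log(y_pred))

def binary_cross_entropy(t, p):
    """Second formula for a single output p = P(class 1)."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

p = 0.7  # hypothetical predicted probability of class 1
results = []
for t in (0.0, 1.0):
    # Write the binary problem as two mutually exclusive classes.
    two_class = categorical_cross_entropy(np.array([t, 1 - t]),
                                          np.array([p, 1 - p]))
    results.append((two_class, binary_cross_entropy(t, p)))
print(results)
```

Both entries of each pair agree, so with two mutually exclusive classes the formulas describe the same loss.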

















      $endgroup$


























          edited Dec 1 '16 at 2:53

























          answered Dec 1 '16 at 2:36









          ArtificiallyIntelligence























              5












              $begingroup$

              Those issues are handled by the tutorial's use of softmax.



              For 1) you're correct that softmax guarantees a non-zero output because it exponentiates its input. For activations that do not give this guarantee (like ReLU), it's simple to add a very small positive term to every output to avoid that problem.

              As for 2), they obviously aren't the same, but the softmax formulation they gave takes care of the issue. If you didn't use softmax, minimizing the first formula would cause you to learn huge bias terms that guess 1 for every class for any input. But since they normalize the softmax across all classes, the only way to maximize the output of the correct class is for it to be large relative to the incorrect classes.
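
A rough sketch of that second point, assuming NumPy. The three-class setup, the shared bias of 10, and all function names are hypothetical choices for illustration. With independent sigmoid outputs, adding a large bias to every output lowers the first formula, whereas with softmax the same shift changes nothing.

```python
import numpy as np

def formula_a(y_true, y_pred):
    """First formula: -sum_i y'_i * log(y_i)."""
    return -np.sum(y_true * np.log(y_pred))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

y_true = np.array([0.0, 1.0, 0.0])  # one-hot target, class 1 is correct
logits = np.zeros(3)                # an uninformative model
bias = 10.0                         # hypothetical large shared bias

# Independent outputs: the bias pushes every output towards 1.
loss_sigmoid_small_bias = formula_a(y_true, sigmoid(logits))
loss_sigmoid_large_bias = formula_a(y_true, sigmoid(logits + bias))

# Softmax: a shift shared by all logits cancels in the normalization.
loss_softmax_small_bias = formula_a(y_true, softmax(logits))
loss_softmax_large_bias = formula_a(y_true, softmax(logits + bias))

print(loss_sigmoid_small_bias, loss_sigmoid_large_bias)
print(loss_softmax_small_bias, loss_softmax_large_bias)
```

The sigmoid loss drops almost to zero although the model has learned nothing about the input, while the softmax loss stays at $\log 3$.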















              $endgroup$













              • $begingroup$
                "you're correct that softmax guarantees a non-zero output" - I know that this is theoretically the case. In reality, can it happen that (due to numeric issues) this becomes 0?
                $endgroup$
                – Martin Thoma
                Dec 10 '15 at 14:30










              • $begingroup$
                Good question. I assume it's perfectly possible for the exponentiation function to output 0.0 if your input is too small for the precision of your float. However I'd guess most implementations do add the tiny positive term to guarantee non-zero input.
                $endgroup$
                – jamesmf
                Dec 10 '15 at 14:50


























              answered Dec 10 '15 at 14:08









              jamesmf























              0












              $begingroup$


              Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0?

              Yes it is, since $\log(0)$ is undefined, but this problem is avoided using $\log(y_i + \epsilon)$ in practice.
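
A minimal sketch of the $\epsilon$ trick, assuming NumPy. The value `1e-12` and the function name are illustrative assumptions, not a value prescribed by any particular library.

```python
import numpy as np

EPSILON = 1e-12  # hypothetical small constant; real libraries choose their own

def cross_entropy_with_epsilon(y_true, y_pred, epsilon=EPSILON):
    """-sum_i y'_i * log(y_i + epsilon), finite even when some y_i == 0."""
    return -np.sum(y_true * np.log(y_pred + epsilon))

# Worst case: the model assigns probability exactly 0 to the correct class.
y_true = np.array([0.0, 1.0])
y_pred = np.array([1.0, 0.0])

loss = cross_entropy_with_epsilon(y_true, y_pred)
print(loss)  # large but finite
```

The loss is capped at $-\log(\epsilon)$ instead of becoming infinite, at the price of a small bias in the reported value.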




              What is correct?

              (a) $H_{y'} (y) := - \sum_{i} y_{i}' \log (y_i)$ or

              (b) $H_{y'}(y) := - \sum_{i} ({y_i' \log(y_i) + (1-y_i') \log(1-y_i)})$?

              (a) is correct for estimating class probabilities, (b) is correct for predicting binary classes. Both are cross-entropy; (a) sums over classes and doesn't care about misclassifications, but (b) sums over training points.



              Example:

              Suppose each training point $x_i$ has label $c_i \in \{0, 1\}$, and the model predicts $c_i' \in [0, 1]$. Let $p(c)$ be the empirical probability of class $c$, and $p'(c)$ be the model's estimate of it.

              The true label $c_i$ and model prediction $c_i'$ for 5 data points are:
              $(c_i, c_i')=\{(1, 0.8), (1, 0.2), (0, 0.1), (0, 0.4), (0, 0.8)\}$,

              The empirical and estimated class probabilities (counting a prediction of at least 0.5 as class 1) are:
              $p(1) = 2/5 = 0.4$, $p'(1) = 2/5 = 0.4$,

              (a) is calculated as: $-p(1)\log p'(1) - p(0)\log p'(0) = -0.4\log(0.4) - 0.6\log(0.6) = 0.292$ (the numbers in this example correspond to base-10 logarithms).

              Two data points $(1, 0.2)$ and $(0, 0.8)$ are misclassified but $p(c)$ is estimated correctly!

              (b) is calculated as: $-1/5([\log(0.8) + \log(0.2)] + [\log(1-0.1)+\log(1-0.4) + \log(1-0.8)]) = 0.352$

              Now, suppose all 5 points were classified correctly as:
              $(c_i, c_i')=\{(1, 0.8), (1, \color{blue}{0.8}), (0, 0.1), (0, 0.4), (0, \color{blue}{0.2})\}$,

              (a) still remains the same, since $p'(1)$ is still $2/5$. However, (b) decreases to:
              $-1/5([\log(0.8) + \log(\color{blue}{0.8})] + [\log(1-0.1)+\log(1-0.4) + \log(1-\color{blue}{0.2})]) = 0.112$
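
The numbers in this example can be reproduced with a short script. This is a sketch assuming NumPy; the function names are made up, and the base-10 logarithm is an inference from the quoted results rather than something stated in the answer. Class assignment for (a) uses rounding at 0.5.

```python
import numpy as np

def formula_a(labels, predictions):
    """Cross-entropy between empirical and predicted class frequencies."""
    p_true = np.mean(labels == 1)
    # Hard class decision: round the predicted probability at 0.5.
    p_model = np.mean(np.floor(predictions + 0.5) == 1)
    return -(p_true * np.log10(p_model)
             + (1 - p_true) * np.log10(1 - p_model))

def formula_b(labels, predictions):
    """Mean per-point binary cross-entropy."""
    per_point = (labels * np.log10(predictions)
                 + (1 - labels) * np.log10(1 - predictions))
    return -np.mean(per_point)

labels = np.array([1, 1, 0, 0, 0])
first = np.array([0.8, 0.2, 0.1, 0.4, 0.8])   # two points misclassified
second = np.array([0.8, 0.8, 0.1, 0.4, 0.2])  # all points classified correctly

print(round(formula_a(labels, first), 3), round(formula_b(labels, first), 3))
print(round(formula_a(labels, second), 3), round(formula_b(labels, second), 3))
```

Formula (a) gives about 0.292 for both sets of predictions, while (b) drops from about 0.352 to about 0.112 once the two misclassified points are fixed.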



              Derivation:



              To write down the formulas, I have changed your notation for clarity.

              Let's write (a) as: $H_{p} (p') := - \sum_{c} p(c)\log p'(c)$

              This sum is over all possible classes, such as $C=\{red, blue, green\}$ or $C=\{0, 1\}$. To calculate (a), the model should output $c_i' \in C$ for every $(x_i, c_i)$; then the ratios $p(c)=\sum_{i:c_i=c}1/N$ and $p'(c)=\sum_{i:c_i'=c}1/N$ should be plugged into (a).

              If there are two classes 1 and 0, another cross-entropy (b) can be used. For training point $(x_i, c_i)$, when $c_i = 1$, we want the model's output $c_i'=p'(c=1|x_i)$ to be close to 1, and when $c_i = 0$, close to 0. Therefore, the loss of $(x_i, 1)$ can be defined as $-\log(c_i')$, which gives $c_i' \rightarrow 1 \Rightarrow -\log(c_i') \rightarrow 0$. Similarly, the loss of $(x_i, 0)$ can be defined as $-\log(1 - c_i')$, which gives $c_i' \rightarrow 0 \Rightarrow -\log(1 - c_i') \rightarrow 0$. Both losses can be combined as:

              $L(c_i, c_i') = -c_i\log(c_i') - (1 - c_i)\log(1 - c_i')$,

              When $c_i = 1$, the term $(1 - c_i)\log(1 - c_i') = 0 \cdot \log(1 - c_i') = 0$ vanishes, and when $c_i = 0$, the term $c_i\log(c_i') = 0 \cdot \log(c_i') = 0$ vanishes.

              Finally, (b) can be written as:

              $\begin{align*}
              H_{c}(c') &= - 1/N\sum_{(x_i,c_i)} \left[ c_i\log(c_i') + (1 - c_i)\log(1 - c_i') \right]\\
              &= - 1/N\sum_{(x_i,1)} \log(c_i') - 1/N\sum_{(x_i,0)} \log(1 - c_i')
              \end{align*}$

              To better see the difference, cross-entropy (a) for two classes $\{0, 1\}$ would be:

              $\begin{align*}
              H_{p} (p') &= - p(1)\log p'(1) - p(0)\log p'(0)\\
              &= - 1/N\sum_{(x_i,1)}\log(\sum_{k:c_k''=1}1/N) - 1/N\sum_{(x_i,0)}\log(1 - \sum_{k:c_k''=1}1/N)
              \end{align*}$

              Here $p(c) = \sum_{(x_i,c)}1/N$ and $p'(c) = \sum_{i:c_i''=c}1/N$, where $c_i'' = \left\lfloor c_i' + 0.5 \right\rfloor \in \{0, 1\}$.

              There is a summation inside $\log(\cdot)$ that is independent of point $i$, meaning (a) doesn't care about $i$ being misclassified.
















              P. Esmailian is a new contributor to this site.






              $endgroup$





































                  edited 17 hours ago






























                  answered yesterday









                  P. Esmailian



