The cross-entropy error function in neural networks
In the MNIST For ML Beginners tutorial they define cross-entropy as
$$H_{y'}(y) := - \sum_{i} y_{i}' \log(y_i)$$
$y_i$ is the predicted probability value for class $i$ and $y_i'$ is the true probability for that class.

Question 1

Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0? This would mean that we have a really bad classifier, of course. But think of an error in our dataset, e.g. an "obvious" 1 labeled as 3. Would it simply crash? Does the model we chose (softmax activation at the end) basically never give the probability 0 for the correct class?

Question 2

I've learned that cross-entropy is defined as
$$H_{y'}(y) := - \sum_{i} \left( y_i' \log(y_i) + (1-y_i') \log(1-y_i) \right)$$
What is correct? Do you have any textbook references for either version? How do those functions differ in their properties (as error functions for neural networks)?

machine-learning tensorflow
asked Dec 10 '15 at 6:22 by Martin Thoma (edited Apr 19 '18 at 19:17 by Alex)

See also: stats.stackexchange.com/questions/80967/… – Piotr Migdal, Jan 22 '16 at 19:04

See also: Kullback-Leibler Divergence Explained blog post. – Piotr Migdal, May 11 '17 at 22:15
5 Answers
One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.

Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes $\{1, 2, \dots, n\}$ their hypothetical occurrence probabilities $y_1, y_2, \dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, and so on up to $k_n$ instances of class $n$. According to your model, the likelihood of this happening is:
$$
P[\text{data}\mid\text{model}] := y_1^{k_1} y_2^{k_2} \dots y_n^{k_n}.
$$
Taking the logarithm and changing the sign:
$$
-\log P[\text{data}\mid\text{model}] = -k_1 \log y_1 - k_2 \log y_2 - \dots - k_n \log y_n = -\sum_i k_i \log y_i
$$
If you now divide the right-hand sum by the number of observations $N = k_1 + k_2 + \dots + k_n$, and denote the empirical probabilities as $y_i' = k_i/N$, you get the cross-entropy:
$$
-\frac{1}{N} \log P[\text{data}\mid\text{model}] = -\frac{1}{N} \sum_i k_i \log y_i = -\sum_i y_i' \log y_i =: H(y', y)
$$
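As a quick numerical check of this identity, here is a short NumPy sketch; the model probabilities and observed counts below are made up purely for illustration:

```python
import numpy as np

# Hypothetical model probabilities for 3 classes and hypothetical observed counts
y = np.array([0.7, 0.2, 0.1])   # model's class probabilities y_i
k = np.array([14, 4, 2])        # observed counts k_i
N = k.sum()
y_prime = k / N                 # empirical probabilities y_i' = k_i / N

nll_per_observation = -np.sum(k * np.log(y)) / N   # -(1/N) log P[data|model]
cross_entropy = -np.sum(y_prime * np.log(y))       # -sum_i y_i' log y_i

print(nll_per_observation, cross_entropy)          # both print the same value (~0.8018)
```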
Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme were based on your hypothesis.

This follows from the observation that an independent event with probability $y_i$ requires at least $-\log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-\sum_i y_i' \log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.

Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.
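To make the coding interpretation concrete, here is a small sketch with arbitrarily chosen distributions, comparing the optimal expected code length with the expected length when coding under a wrong hypothesis:

```python
import numpy as np

# Illustrative distributions: p_true is the actual one, q_model is the hypothesized one
p_true = np.array([0.5, 0.25, 0.25])
q_model = np.array([0.25, 0.25, 0.5])

entropy = -np.sum(p_true * np.log2(p_true))          # optimal expected code length (bits)
cross_entropy = -np.sum(p_true * np.log2(q_model))   # expected length when coding with q_model

print(entropy, cross_entropy)  # 1.5 vs 1.75: coding with the wrong distribution costs extra bits
```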
With those intuitions in mind, the answers to your questions can be seen as follows:
Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.
To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.
If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.
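A minimal illustration of such smoothing is to mix a tiny positive constant into the predicted distribution and renormalize; the helper below and its epsilon value are only an example of the idea, not something prescribed by the tutorial:

```python
import numpy as np

def smooth_probs(p, eps=1e-6):
    """Add a tiny constant to every class probability and renormalize,
    so that no class is hypothesized to be strictly impossible."""
    p = np.asarray(p, dtype=float) + eps
    return p / p.sum()

print(smooth_probs([1.0, 0.0, 0.0]))  # no entry is exactly zero, so log() stays finite
```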
Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-\log P[\text{data}\mid\text{model}]$ for binary data, an equivalent of the second equation in this answer.

Hence, strictly speaking, although it is still a log-likelihood, it is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
$$
\sum_i H(y_i', y_i),
$$
where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.

answered Dec 16 '15 at 13:29 by KT. (edited Nov 25 '18 at 11:08)
Can you provide a source where they define $y_i' = \frac{k_i}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference? – Lenar Hoyt, Jun 22 '16 at 7:47

In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well. – Lenar Hoyt, Jun 22 '16 at 9:32

@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability. – THN, Jul 13 '17 at 11:02

'independent event requires...to encode it' - could you explain this bit please? – Alex, Aug 20 '17 at 13:25

@Alex This may need a longer explanation to understand properly - read up on Shannon-Fano codes and the relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form $1/2^k$, you should give them lengths $k$ - this way your code will approach the Shannon optimal length. – KT., Aug 21 '17 at 9:55
The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.

That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than the predicted probability of the true class.

Another assumption is that $\sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
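A rough sketch of that constraint follows; the normalize helper is only an illustrative fallback for outputs that do not come from a softmax, not part of the original answer:

```python
import numpy as np

def softmax(z):
    """Softmax over logits: outputs are strictly positive and sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def normalize(scores, eps=1e-12):
    """Illustrative fallback: rescale arbitrary non-negative scores to sum to 1."""
    scores = np.asarray(scores, dtype=float) + eps
    return scores / scores.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                          # strictly positive entries, sum 1.0
print(normalize([0.5, 0.3, 0.3]))          # rescaled so the entries sum to 1
```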
Question 1

Isn't it a problem that the $y_i$ (in $\log(y_i)$) could be 0?

Yes that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0 in any class. But it is possible, so worth allowing for it. First, don't evaluate $\log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) ) for numerical stability - in many cases it is not required, but this is sensible defensive programming.
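For example, one possible NumPy version of that defensive clipping (clipping at both ends is a common variant of the same idea; the numbers are illustrative):

```python
import numpy as np

def safe_log_loss(y_true, y_pred, eps=1e-15):
    """Multiclass log loss with predictions clipped away from 0 (and 1) for stability."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 0.0, 1.0])    # one-hot ground truth
y_pred = np.array([0.3, 0.7, 0.0])    # model assigned exactly 0 to the true class
print(safe_log_loss(y_true, y_pred))  # finite (~34.5) instead of inf
```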
Question 2

I've learned that cross-entropy is defined as $H_{y'}(y) := - \sum_{i} \left( y_i' \log(y_i) + (1-y_i') \log(1-y_i) \right)$

This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.

If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement, just folding the error metric for two classes into a single output.

If there is more than one class to predict membership of, and the classes are not exclusive, i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class).

answered Dec 10 '15 at 16:10 by Neil Slater (edited Dec 17 '15 at 9:40)
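A minimal sketch contrasting the two formulations; the function names and example numbers are illustrative only:

```python
import numpy as np

def categorical_ce(y_true, y_pred, eps=1e-15):
    """First formula: one-hot target, predicted probabilities sum to 1 across classes."""
    return -np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0)))

def multilabel_ce(y_true, y_pred, eps=1e-15):
    """Second formula: independent per-class outputs, labels may overlap."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.sum(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Mutually exclusive classes (e.g. digit recognition): one-hot target, softmax-style prediction
print(categorical_ce(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1])))

# Non-exclusive labels: this example belongs to classes 0 and 2 at the same time
print(multilabel_ce(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```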
Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples. – Neil Slater, Dec 10 '15 at 16:24

I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $\log(y_i) = 0$, but in $y_i = 0$, because of $\log(y_i)$. Could you please adjust your answer to that? – Martin Thoma, Dec 17 '15 at 8:47

@NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula? – Media, Feb 28 '18 at 13:15

@Media: Not really. You want to be looking at things such as hierarchical classification though . . . – Neil Slater, Feb 28 '18 at 15:38

@Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen. – Neil Slater, Feb 1 at 15:46
Given $y_{\text{true}}$, you want to optimize your machine learning method to get $y_{\text{predict}}$ as close as possible to $y_{\text{true}}$.

First question:

The answer above has explained the background of your first formula, the cross-entropy defined in information theory.

From a perspective other than information theory:

you can check for yourself that the first formula does not penalize false positives (the truth is false, but your model predicts that it is right), while the second one does. Therefore, the choice of the first or second formula will affect your metrics (i.e. which statistical quantity you would like to use to evaluate your model).

In layman's terms:

If you want to accept almost all good people as your friends, but are willing to accept some bad people as well, then use the first formula as your criterion.

If you want to punish yourself for accepting some bad people as friends, even though your acceptance rate for good people may then be lower than in the first case, use the second formula.

I guess most of us are critical and would choose the second one (as many ML packages assume when they implement cross-entropy).

Second question:

Cross-entropy per sample per class: $$-y_{\text{true}} \log(y_{\text{predict}})$$

Cross-entropy for the whole dataset over all classes: $$\sum_i^n \sum_k^K -y_{\text{true},i}^{(k)} \log(y_{\text{predict},i}^{(k)})$$

Thus, when there are only two classes (K = 2), you will have the second formula.
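A quick numerical check of that reduction, assuming a one-hot target and predictions that sum to 1 (the numbers are illustrative):

```python
import numpy as np

# Two-class example: y_true is one-hot, predictions sum to 1
y_true = np.array([0.0, 1.0])   # the true class is class 1
y_pred = np.array([0.3, 0.7])

two_class_sum = -np.sum(y_true * np.log(y_pred))           # sum over the K = 2 classes
binary_form = -(y_true[1] * np.log(y_pred[1])
                + (1 - y_true[1]) * np.log(1 - y_pred[1]))  # second formula, single output

print(two_class_sum, binary_form)  # identical, since y_pred[0] = 1 - y_pred[1]
```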
Those issues are handled by the tutorial's use of softmax.

For 1), you're correct that softmax guarantees a non-zero output because it exponentiates its input. For activations that do not give this guarantee (like ReLU), it's simple to add a very small positive term to every output to avoid that problem.

As for 2), they obviously aren't the same, but the softmax formulation they gave takes care of the issue. If you didn't use softmax, the first loss would cause you to learn huge bias terms that guess 1 for every class for any input. But since softmax normalizes across all classes, the only way to maximize the output of the correct class is for it to be large relative to the incorrect classes.
"you're correct that softmax guarantees a non-zero output" - I know that this is theoretically the case. In reality, can it happen that (due to numeric issues) this becomes 0? – Martin Thoma, Dec 10 '15 at 14:30

Good question. I assume it's perfectly possible for the exponentiation function to output 0.0 if your input is too small for the precision of your float. However I'd guess most implementations do add the tiny positive term to guarantee non-zero input. – jamesmf, Dec 10 '15 at 14:50
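A small demonstration of the numerical effect discussed in these comments; the logit values are chosen only to force underflow, and the clipping constant is an arbitrary choice:

```python
import numpy as np

# In float64, exp() of a sufficiently negative number underflows to exactly 0.0,
# so a naive softmax can output an exact zero probability despite the theory.
logits = np.array([0.0, -800.0])
naive_softmax = np.exp(logits) / np.exp(logits).sum()
print(naive_softmax)                           # [1. 0.]  -> log(0) would be -inf

eps = 1e-15                                    # tiny positive term, as suggested above
print(np.log(np.maximum(naive_softmax, eps)))  # finite values after clipping
```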
Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0?

Yes it is, since $\log(0)$ is undefined, but this problem is avoided by using $\log(y_i + \epsilon)$ in practice.

What is correct?
(a) $H_{y'}(y) := - \sum_{i} y_{i}' \log(y_i)$ or
(b) $H_{y'}(y) := - \sum_{i} \left( y_i' \log(y_i) + (1-y_i') \log(1-y_i) \right)$?

(a) is correct for estimating class probabilities, (b) is correct for predicting binary classes. Both are cross-entropy: (a) sums over classes and doesn't care about misclassifications, while (b) sums over training points.

Example (all logarithms below are base 10):

Suppose each training point $x_i$ has label $c_i \in \{0, 1\}$, and the model predicts $c_i' \in [0, 1]$. Let $p(c)$ be the empirical probability of class $c$, and $p'(c)$ be the model's estimate.

True label $c_i$ and model prediction $c_i'$ for 5 data points are:
$(c_i, c_i') = \{(1, 0.8), (1, 0.2), (0, 0.1), (0, 0.4), (0, 0.8)\}$.

Empirical and estimated class probabilities are:
$p(1) = 2/5 = 0.4$, $p'(1) = 2/5 = 0.4$.

(a) is calculated as: $-p(1)\log p'(1) - p(0)\log p'(0) = -0.4\log(0.4) - 0.6\log(0.6) = 0.292$.

Two data points, $(1, 0.2)$ and $(0, 0.8)$, are misclassified, but $p(c)$ is estimated correctly!

(b) is calculated as: $-1/5\big([\log(0.8) + \log(0.2)] + [\log(1-0.1) + \log(1-0.4) + \log(1-0.8)]\big) = 0.352$

Now, suppose all 5 points were classified correctly as:
$(c_i, c_i') = \{(1, 0.8), (1, \color{blue}{0.8}), (0, 0.1), (0, 0.4), (0, \color{blue}{0.2})\}$.

(a) still remains the same, since $p'(1)$ is still $2/5$. However, (b) decreases to:
$-1/5\big([\log(0.8) + \log(\color{blue}{0.8})] + [\log(1-0.1) + \log(1-0.4) + \log(1-\color{blue}{0.2})]\big) = 0.112$
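The numbers above can be reproduced with a short NumPy check; note that it uses base-10 logarithms to match the example:

```python
import numpy as np

c_true = np.array([1, 1, 0, 0, 0])
c_pred = np.array([0.8, 0.2, 0.1, 0.4, 0.8])   # first scenario from the example

# (a): cross-entropy between empirical and (rounded) predicted class frequencies
p1 = c_true.mean()
p1_hat = (c_pred >= 0.5).mean()
a = -(p1 * np.log10(p1_hat) + (1 - p1) * np.log10(1 - p1_hat))

# (b): mean binary cross-entropy over the 5 points
b = -np.mean(c_true * np.log10(c_pred) + (1 - c_true) * np.log10(1 - c_pred))
print(round(a, 3), round(b, 3))   # 0.292 0.352

c_pred2 = np.array([0.8, 0.8, 0.1, 0.4, 0.2])  # second scenario: everything classified correctly
b2 = -np.mean(c_true * np.log10(c_pred2) + (1 - c_true) * np.log10(1 - c_pred2))
print(round(b2, 3))               # 0.112
```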
Derivation:

To write down their formulas, I have changed your notation for clarity.

Let's write (a) as: $H_{p}(p') := - \sum_{c} p(c) \log p'(c)$

This sum is over all possible classes, such as $C = \{\text{red}, \text{blue}, \text{green}\}$ or $C = \{0, 1\}$. To calculate (a), the model should output $c_i' \in C$ for every $(x_i, c_i)$; then the ratios $p(c) = \sum_{i:c_i=c} 1/N$ and $p'(c) = \sum_{i:c_i'=c} 1/N$ should be plugged into (a).

If there are two classes 1 and 0, another cross-entropy (b) can be used. For training point $(x_i, c_i)$, when $c_i = 1$, we want the model's output $c_i' = p'(c=1|x_i)$ to be close to 1, and when $c_i = 0$, close to 0. Therefore, the loss of $(x_i, 1)$ can be defined as $-\log(c_i')$, which gives $c_i' \rightarrow 1 \Rightarrow -\log(c_i') \rightarrow 0$. Similarly, the loss of $(x_i, 0)$ can be defined as $-\log(1 - c_i')$, which gives $c_i' \rightarrow 0 \Rightarrow -\log(1 - c_i') \rightarrow 0$. Both losses can be combined as:

$L(c_i, c_i') = -c_i \log(c_i') - (1 - c_i) \log(1 - c_i')$

When $c_i = 1$, the term $0 \log(1 - c_i') = 0$ is disabled, and when $c_i = 0$, the term $0 \log(c_i') = 0$ is disabled.

Finally, (b) can be written as:
$\begin{align*}
H_{c}(c') &= - \frac{1}{N}\sum_{(x_i,c_i)} c_i \log(c_i') + (1 - c_i) \log(1 - c_i')\\
&= - \frac{1}{N}\sum_{(x_i,1)} \log(c_i') - \frac{1}{N}\sum_{(x_i,0)} \log(1 - c_i')
\end{align*}$

To better see the difference, cross-entropy (a) for two classes $\{0, 1\}$ would be:
$\begin{align*}
H_{p}(p') &= - p(1) \log p'(1) - p(0) \log p'(0)\\
&= - \frac{1}{N}\sum_{(x_i,1)} \log\Big(\sum_{k:c_k''=1} 1/N\Big) - \frac{1}{N}\sum_{(x_i,0)} \log\Big(1 - \sum_{k:c_k''=1} 1/N\Big)
\end{align*}$

using $p(c) = \sum_{(x_i,c)} 1/N$, and $p'(c) = \sum_{i:c_i''=c} 1/N$, where $c_i'' = \left\lfloor c_i' + 0.5 \right\rfloor \in \{0, 1\}$.

There is a summation inside $\log(\cdot)$ that is independent of the point $i$, meaning (a) doesn't care about $i$ being misclassified.
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f9302%2fthe-cross-entropy-error-function-in-neural-networks%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.
Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes ${1,2,dots, n}$ their hypothetical occurrence probabilities $y_1, y_2,dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, $k_n$ instances of class $n$, etc. According to your model the likelihood of this happening is:
$$
P[data|model] := y_1^{k_1}y_2^{k_2}dots y_n^{k_n}.
$$
Taking the logarithm and changing the sign:
$$
-log P[data|model] = -k_1log y_1 -k_2log y_2 - dots -k_nlog y_n = -sum_i k_i log y_i
$$
If you now divide the right-hand sum by the number of observations $N = k_1+k_2+dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
$$
-frac{1}{N} log P[data|model] = -frac{1}{N}sum_i k_i log y_i = -sum_i y_i'log y_i =: H(y', y)
$$
Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme would be based on your hypothesis.
This follows from the observation that an independent event with probability $y_i$ requires at least $-log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-sum_i y_i'log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.
Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.
With those intuitions in mind, the answers to your questions can be seen as follows:
Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.
To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.
If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.
Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-log P[data|model]$ for binary data, an equivalent of the second equation in this answer.
Hence, strictly speaking, although it is still a log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
$$
sum_i H(y_i', y_i),
$$
where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
$endgroup$
1
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
1
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
|
show 1 more comment
$begingroup$
One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.
Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes ${1,2,dots, n}$ their hypothetical occurrence probabilities $y_1, y_2,dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, $k_n$ instances of class $n$, etc. According to your model the likelihood of this happening is:
$$
P[data|model] := y_1^{k_1}y_2^{k_2}dots y_n^{k_n}.
$$
Taking the logarithm and changing the sign:
$$
-log P[data|model] = -k_1log y_1 -k_2log y_2 - dots -k_nlog y_n = -sum_i k_i log y_i
$$
If you now divide the right-hand sum by the number of observations $N = k_1+k_2+dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
$$
-frac{1}{N} log P[data|model] = -frac{1}{N}sum_i k_i log y_i = -sum_i y_i'log y_i =: H(y', y)
$$
Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme would be based on your hypothesis.
This follows from the observation that an independent event with probability $y_i$ requires at least $-log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-sum_i y_i'log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.
Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.
With those intuitions in mind, the answers to your questions can be seen as follows:
Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.
To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.
If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.
Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-log P[data|model]$ for binary data, an equivalent of the second equation in this answer.
Hence, strictly speaking, although it is still a log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
$$
sum_i H(y_i', y_i),
$$
where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
$endgroup$
1
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
1
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
|
show 1 more comment
$begingroup$
One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.
Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes ${1,2,dots, n}$ their hypothetical occurrence probabilities $y_1, y_2,dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, $k_n$ instances of class $n$, etc. According to your model the likelihood of this happening is:
$$
P[data|model] := y_1^{k_1}y_2^{k_2}dots y_n^{k_n}.
$$
Taking the logarithm and changing the sign:
$$
-log P[data|model] = -k_1log y_1 -k_2log y_2 - dots -k_nlog y_n = -sum_i k_i log y_i
$$
If you now divide the right-hand sum by the number of observations $N = k_1+k_2+dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
$$
-frac{1}{N} log P[data|model] = -frac{1}{N}sum_i k_i log y_i = -sum_i y_i'log y_i =: H(y', y)
$$
Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme would be based on your hypothesis.
This follows from the observation that an independent event with probability $y_i$ requires at least $-log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-sum_i y_i'log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.
Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.
With those intuitions in mind, the answers to your questions can be seen as follows:
Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.
To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.
If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.
Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-log P[data|model]$ for binary data, an equivalent of the second equation in this answer.
Hence, strictly speaking, although it is still a log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
$$
sum_i H(y_i', y_i),
$$
where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
$endgroup$
One way to interpret cross-entropy is to see it as a (minus) log-likelihood for the data $y_i'$, under a model $y_i$.
Namely, suppose that you have some fixed model (a.k.a. "hypothesis"), which predicts for $n$ classes ${1,2,dots, n}$ their hypothetical occurrence probabilities $y_1, y_2,dots, y_n$. Suppose that you now observe (in reality) $k_1$ instances of class $1$, $k_2$ instances of class $2$, $k_n$ instances of class $n$, etc. According to your model the likelihood of this happening is:
$$
P[data|model] := y_1^{k_1}y_2^{k_2}dots y_n^{k_n}.
$$
Taking the logarithm and changing the sign:
$$
-log P[data|model] = -k_1log y_1 -k_2log y_2 - dots -k_nlog y_n = -sum_i k_i log y_i
$$
If you now divide the right-hand sum by the number of observations $N = k_1+k_2+dots+k_n$, and denote the empirical probabilities as $y_i'=k_i/N$, you'll get the cross-entropy:
$$
-frac{1}{N} log P[data|model] = -frac{1}{N}sum_i k_i log y_i = -sum_i y_i'log y_i =: H(y', y)
$$
Furthermore, the log-likelihood of a dataset given a model can be interpreted as a measure of "encoding length" - the number of bits you expect to spend to encode this information if your encoding scheme would be based on your hypothesis.
This follows from the observation that an independent event with probability $y_i$ requires at least $-log_2 y_i$ bits to encode it (assuming efficient coding), and consequently the expression
$$-sum_i y_i'log_2 y_i,$$
is literally the expected length of the encoding, where the encoding lengths for the events are computed using the "hypothesized" distribution, while the expectation is taken over the actual one.
Finally, instead of saying "measure of expected encoding length" I really like to use the informal term "measure of surprise". If you need a lot of bits to encode an expected event from a distribution, the distribution is "really surprising" for you.
With those intuitions in mind, the answers to your questions can be seen as follows:
Question 1. Yes. It is a problem whenever the corresponding $y_i'$ is nonzero at the same time. It corresponds to the situation where your model believes that some class has zero probability of occurrence, and yet the class pops up in reality. As a result, the "surprise" of your model is infinitely great: your model did not account for that event and now needs infinitely many bits to encode it. That is why you get infinity as your cross-entropy.
To avoid this problem you need to make sure that your model does not make rash assumptions about something being impossible while it can happen. In reality, people tend to use sigmoid or "softmax" functions as their hypothesis models, which are conservative enough to leave at least some chance for every option.
If you use some other hypothesis model, it is up to you to regularize (aka "smooth") it so that it would not hypothesize zeros where it should not.
Question 2. In this formula, one usually assumes $y_i'$ to be either $0$ or $1$, while $y_i$ is the model's probability hypothesis for the corresponding input. If you look closely, you will see that it is simply a $-log P[data|model]$ for binary data, an equivalent of the second equation in this answer.
Hence, strictly speaking, although it is still a log-likelihood, this is not syntactically equivalent to cross-entropy. What some people mean when referring to such an expression as cross-entropy is that it is, in fact, a sum over binary cross-entropies for individual points in the dataset:
$$
sum_i H(y_i', y_i),
$$
where $y_i'$ and $y_i$ have to be interpreted as the corresponding binary distributions $(y_i', 1-y_i')$ and $(y_i, 1-y_i)$.
edited Nov 25 '18 at 11:08
answered Dec 16 '15 at 13:29
KT.KT.
1,43157
1,43157
1
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
1
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
|
show 1 more comment
1
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
1
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
1
1
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
$begingroup$
Can you provide a source where they define $y′i=frac{ki}{N}$? Here they define it as a one-hot distribution for the current class label. What is the difference?
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 7:47
1
1
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
In the MNIST TensorFlow tutorial they define it in terms of one-hot vectors as well.
$endgroup$
– Lenar Hoyt
Jun 22 '16 at 9:32
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
@LenarHoyt When $N=1$, $k_i/N$ would be equivalent to one-hot. You can think of one-hot as the encoding of one item based on its empirical (real) categorical probability.
$endgroup$
– THN
Jul 13 '17 at 11:02
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
'independent event requires...to encode it' - could you explain this bit please?
$endgroup$
– Alex
Aug 20 '17 at 13:25
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
$begingroup$
@Alex This may need longer explanation to understand properly - read up on Shannon-Fano codes and relation of optimal coding to the Shannon entropy equation. To dumb things down, if an event has probability 1/2, your best bet is to code it using a single bit. If it has probability 1/4, you should spend 2 bits to encode it, etc. In general, if your set of events has probabilities of the form 1/2^k, you should give them lengths k - this way your code will approach the Shannon optimal length.
$endgroup$
– KT.
Aug 21 '17 at 9:55
|
show 1 more comment
$begingroup$
The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.
That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
Another assumption is that $sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
Question 1
Isn't it a problem that the $y_i$ (in $log(y_i)$) could be 0?
Yes that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0
in any class. But it is possible, so worth allowing for it. First, don't evaluate $log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) )
for numerical stability - in many cases it is not required, but this is sensible defensive programming.
Question 2
I've learned that cross-entropy is defined as $H_{y'}(y) := - sum_{i} ({y_i' log(y_i) + (1-y_i') log (1-y_i)})$
This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.
If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
If there is more than one class to predict membership of, and the classes are not exclusive i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class)
$endgroup$
$begingroup$
Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples.
$endgroup$
– Neil Slater
Dec 10 '15 at 16:24
$begingroup$
I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $log(y_i) = 0$, but in $y_i = 0$, because of $log(y_i)$. Could you please adjust your answer to that?
$endgroup$
– Martin Thoma
Dec 17 '15 at 8:47
$begingroup$
@NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula?
$endgroup$
– Media
Feb 28 '18 at 13:15
1
$begingroup$
@Media: Not really. You want to be looking at things such as hierarchical classification though . . .
$endgroup$
– Neil Slater
Feb 28 '18 at 15:38
1
$begingroup$
@Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
$endgroup$
– Neil Slater
Feb 1 at 15:46
|
show 3 more comments
$begingroup$
The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.
That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
Another assumption is that $sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
Question 1
Isn't it a problem that the $y_i$ (in $log(y_i)$) could be 0?
Yes that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0
in any class. But it is possible, so worth allowing for it. First, don't evaluate $log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) )
for numerical stability - in many cases it is not required, but this is sensible defensive programming.
Question 2
I've learned that cross-entropy is defined as $H_{y'}(y) := - sum_{i} ({y_i' log(y_i) + (1-y_i') log (1-y_i)})$
This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.
If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
If there is more than one class to predict membership of, and the classes are not exclusive i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class)
$endgroup$
$begingroup$
Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples.
$endgroup$
– Neil Slater
Dec 10 '15 at 16:24
$begingroup$
I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $log(y_i) = 0$, but in $y_i = 0$, because of $log(y_i)$. Could you please adjust your answer to that?
$endgroup$
– Martin Thoma
Dec 17 '15 at 8:47
$begingroup$
@NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula?
$endgroup$
– Media
Feb 28 '18 at 13:15
1
$begingroup$
@Media: Not really. You want to be looking at things such as hierarchical classification though . . .
$endgroup$
– Neil Slater
Feb 28 '18 at 15:38
1
$begingroup$
@Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
$endgroup$
– Neil Slater
Feb 1 at 15:46
|
show 3 more comments
$begingroup$
The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.
That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
Another assumption is that $sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
Question 1
Isn't it a problem that the $y_i$ (in $log(y_i)$) could be 0?
Yes that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0
in any class. But it is possible, so worth allowing for it. First, don't evaluate $log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) )
for numerical stability - in many cases it is not required, but this is sensible defensive programming.
Question 2
I've learned that cross-entropy is defined as $H_{y'}(y) := - sum_{i} ({y_i' log(y_i) + (1-y_i') log (1-y_i)})$
This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.
If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
If there is more than one class to predict membership of, and the classes are not exclusive i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class)
$endgroup$
The first logloss formula you are using is for multiclass log loss, where the $i$ subscript enumerates the different classes in an example. The formula assumes that a single $y_i'$ in each example is 1, and the rest are all 0.
That means the formula only captures error on the target class. It discards any notion of errors that you might consider "false positive" and does not care how predicted probabilities are distributed other than predicted probability of the true class.
Another assumption is that $sum_i y_i = 1$ for the predictions of each example. A softmax layer does this automatically - if you use something different you will need to scale the outputs to meet that constraint.
Question 1
Isn't it a problem that the $y_i$ (in $log(y_i)$) could be 0?
Yes that can be a problem, but it is usually not a practical one. A randomly-initialised softmax layer is extremely unlikely to output an exact 0
in any class. But it is possible, so worth allowing for it. First, don't evaluate $log(y_i)$ for any $y_i'=0$, because the negative classes always contribute 0 to the error. Second, in practical code you can limit the value to something like log( max( y_predict, 1e-15 ) )
for numerical stability - in many cases it is not required, but this is sensible defensive programming.
Question 2
I've learned that cross-entropy is defined as $H_{y'}(y) := - sum_{i} ({y_i' log(y_i) + (1-y_i') log (1-y_i)})$
This formulation is often used for a network with one output predicting two classes (usually positive class membership for 1 and negative for 0 output). In that case $i$ may only have one value - you can lose the sum over $i$.
If you modify such a network to have two opposing outputs and use softmax plus the first logloss definition, then you can see that in fact it is the same error measurement but folding the error metric for two classes into a single output.
If there is more than one class to predict membership of, and the classes are not exclusive i.e. an example could be any or all of the classes at the same time, then you will need to use this second formulation. For digit recognition that is not the case (a written digit should only have one "true" class)
edited Dec 17 '15 at 9:40
answered Dec 10 '15 at 16:10
Neil SlaterNeil Slater
17k22961
17k22961
$begingroup$
Note there is some ambiguity in the presentation of the second formula - it could in theory assume just one class and $i$ would then enumerate the examples.
$endgroup$
– Neil Slater
Dec 10 '15 at 16:24
$begingroup$
I'm sorry, I've asked something different than what I wanted to know. I don't see a problem in $log(y_i) = 0$, but in $y_i = 0$, because of $log(y_i)$. Could you please adjust your answer to that?
$endgroup$
– Martin Thoma
Dec 17 '15 at 8:47
$begingroup$
@NeilSlater if the classes were not mutually exclusive, the output vector for each input may contain more than one 1, should we use the second formula?
$endgroup$
– Media
Feb 28 '18 at 13:15
1
$begingroup$
@Media: Not really. You want to be looking at things such as hierarchical classification though . . .
$endgroup$
– Neil Slater
Feb 28 '18 at 15:38
1
$begingroup$
@Javi: In the OP's question $y'_i$ is the ground truth, thus usually 0 or 1. It is $y_i$ that is the softmax output. However $y_i$ can end up zero in practice due to floating point rounding. This does actually happen.
$endgroup$
– Neil Slater
Feb 1 at 15:46
$begingroup$
Given $y_{true}$, you want to optimize your machine learning method to get $y_{predict}$ as close as possible to $y_{true}$.
First question:
The answer above has explained the background of your first formula, the cross-entropy defined in information theory.
From a viewpoint other than information theory:
you can check for yourself that the first formula does not penalise false positives (the truth is false, but your model predicts it is true), while the second one does penalise false positives. Therefore, the choice of the first or the second formula will affect your metrics (i.e. which statistical quantity you use to evaluate your model).
In layman's terms:
If you want to accept almost all good people as friends, but are willing to accept some bad people as friends too, then use the first formula as your criterion.
If you want to punish yourself for accepting some bad people as friends, accepting that your rate of accepting good people may be lower than in the first case, then use the second formula.
That said, I guess most of us are critical and would choose the second one (which is what many ML packages assume by cross-entropy).
Second question:
Cross-entropy per sample per class: $$-y_{true}\log(y_{predict})$$
Cross-entropy for the whole dataset over all classes: $$\sum_{i}^{n} \sum_{k}^{K} -y_{true,i}^{(k)}\log(y_{predict,i}^{(k)})$$
Thus, when there are only two classes ($K = 2$), you have the second formula.
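A short sketch of that reduction, assuming NumPy (array names and values are mine): for $K = 2$ with one-hot targets, the double sum and the second formula from the question give the same number.

    import numpy as np

    y_true = np.array([1, 1, 0, 0, 0])            # binary labels for 5 samples
    y_pred = np.array([0.8, 0.7, 0.1, 0.4, 0.2])  # predicted P(class = 1)

    # double sum over samples and the two classes, with one-hot targets
    Y_true = np.stack([1 - y_true, y_true], axis=1)   # shape (5, 2)
    Y_pred = np.stack([1 - y_pred, y_pred], axis=1)
    double_sum = -np.sum(Y_true * np.log(Y_pred))

    # second formula, summed over the samples
    binary = -np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    print(np.isclose(double_sum, binary))   # True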
$endgroup$
edited Dec 1 '16 at 2:53
answered Dec 1 '16 at 2:36
ArtificiallyIntelligence
add a comment |
$begingroup$
Those issues are handled by the tutorial's use of softmax.
For 1) you're correct that softmax guarantees a non-zero output because it exponentiates its input. For activations that do not give this guarantee (like relu), it's simple to add a very small positive term to every output to avoid that problem.
As for 2), they obviously aren't the same, but the softmax formulation they use takes care of the issue. If you didn't use softmax, this would cause you to learn huge bias terms that guess 1 for every class for any input. But since they normalize the softmax across all classes, the only way to maximize the output of the correct class is for it to be large relative to the incorrect classes.
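Softmax is strictly positive in exact arithmetic, but in low-precision floating point the exponential can still underflow to exactly 0.0, which is why the small positive term (or clipping) is still worth having. A quick illustration, assuming NumPy and float32 (the logit values are made up):

    import numpy as np

    logits = np.array([0.0, -200.0], dtype=np.float32)  # extreme logits, purely illustrative
    probs = np.exp(logits) / np.sum(np.exp(logits))     # naive softmax in float32
    print(probs)                                        # [1. 0.] - the second entry underflows to exactly 0

    eps = np.float32(1e-7)
    safe = (probs + eps) / np.sum(probs + eps)          # the tiny positive term keeps log() finite
    print(-np.log(safe[1]))                             # large but finite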
$endgroup$
$begingroup$
"you're correct that softmax guarantees a non-zero output" - I know that this is theoretically the case. In reality, can it happen that (due to numeric issues) this becomes 0?
$endgroup$
– Martin Thoma
Dec 10 '15 at 14:30
$begingroup$
Good question. I assume it's perfectly possible for the exponentiation function to output 0.0 if your input is too small for the precision of your float. However I'd guess most implementations do add the tiny positive term to guarantee non-zero input.
$endgroup$
– jamesmf
Dec 10 '15 at 14:50
answered Dec 10 '15 at 14:08
jamesmf
add a comment |
$begingroup$
Isn't it a problem that $y_i$ (in $\log(y_i)$) could be 0?
Yes it is, since $\log(0)$ is undefined, but this problem is avoided by using $\log(y_i + \epsilon)$ in practice.
What is correct?
(a) $H_{y'}(y) := - \sum_{i} y_{i}' \log(y_i)$ or
(b) $H_{y'}(y) := - \sum_{i} \left( y_i' \log(y_i) + (1-y_i') \log(1-y_i) \right)$?
(a) is correct for estimating class probabilities, and (b) is correct for predicting binary classes. Both are cross-entropies: (a) sums over classes and does not care about misclassifications, whereas (b) sums over training points.
Example:
Suppose each training point $x_i$ has label $c_i \in \{0, 1\}$, and the model predicts $c_i' \in [0, 1]$. Let $p(c)$ be the empirical probability of class $c$, and $p'(c)$ be the model's estimate of it.
The true label $c_i$ and model prediction $c_i'$ for 5 data points are:
$(c_i, c_i')=\{(1, 0.8), (1, 0.2), (0, 0.1), (0, 0.4), (0, 0.8)\}$,
The empirical and estimated class probabilities are:
$p(1) = 2/5 = 0.4$, $p'(1) = 2/5 = 0.4$,
(a) is calculated as (logarithms base 10): $-p(1)\log p'(1) - p(0)\log p'(0) = -0.4\log(0.4) - 0.6\log(0.6) = 0.292$.
Two data points, $(1, 0.2)$ and $(0, 0.8)$, are misclassified, but $p(c)$ is estimated correctly!
(b) is calculated as: $-1/5([\log(0.8) + \log(0.2)] + [\log(1-0.1)+\log(1-0.4) + \log(1-0.8)]) = 0.352$
Now, suppose all 5 points were classified correctly, as:
$(c_i, c_i')=\{(1, 0.8), (1, \color{blue}{0.8}), (0, 0.1), (0, 0.4), (0, \color{blue}{0.2})\}$,
(a) remains the same, since $p'(1)$ is still $2/5$. However, (b) decreases to:
$-1/5([\log(0.8) + \log(\color{blue}{0.8})] + [\log(1-0.1)+\log(1-0.4) + \log(1-\color{blue}{0.2})]) = 0.112$
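A small sketch reproducing those numbers, assuming NumPy and using base-10 logarithms as in the figures above (variable names are mine):

    import numpy as np

    c_true = np.array([1, 1, 0, 0, 0])
    c_pred = np.array([0.8, 0.2, 0.1, 0.4, 0.8])    # first scenario: two points misclassified

    # (a): cross-entropy between empirical and estimated class frequencies
    p1, p1_hat = c_true.mean(), np.round(c_pred).mean()
    a = -(p1 * np.log10(p1_hat) + (1 - p1) * np.log10(1 - p1_hat))
    print(round(a, 3))    # 0.292

    # (b): per-point binary cross-entropy, averaged over the 5 points
    b = -np.mean(c_true * np.log10(c_pred) + (1 - c_true) * np.log10(1 - c_pred))
    print(round(b, 3))    # 0.352

    c_pred2 = np.array([0.8, 0.8, 0.1, 0.4, 0.2])   # second scenario: all points classified correctly
    b2 = -np.mean(c_true * np.log10(c_pred2) + (1 - c_true) * np.log10(1 - c_pred2))
    print(round(b2, 3))   # 0.112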
Derivation:
To write down these formulas, I have changed your notation slightly for clarity.
Let's write (a) as: $H_{p}(p') := - \sum_{c} p(c)\log p'(c)$
This sum is over all possible classes, such as $C=\{red, blue, green\}$ or $C=\{0, 1\}$. To calculate (a), the model should output $c_i' \in C$ for every $(x_i, c_i)$; then the ratios $p(c)=\sum_{i:c_i=c}1/N$ and $p'(c)=\sum_{i:c_i'=c}1/N$ are plugged into (a).
If there are two classes, 1 and 0, another cross-entropy, (b), can be used. For a training point $(x_i, c_i)$, when $c_i = 1$ we want the model's output $c_i'=p'(c=1|x_i)$ to be close to 1, and when $c_i = 0$, close to 0. Therefore, the loss for $(x_i, 1)$ can be defined as $-\log(c_i')$, which gives $c_i' \rightarrow 1 \Rightarrow -\log(c_i') \rightarrow 0$. Similarly, the loss for $(x_i, 0)$ can be defined as $-\log(1 - c_i')$, which gives $c_i' \rightarrow 0 \Rightarrow -\log(1 - c_i') \rightarrow 0$. Both losses can be combined as:
$L(c_i, c_i') = -c_i\log(c_i') - (1 - c_i)\log(1 - c_i')$,
When $c_i = 1$, the term $0\log(1 - c_i')=0$ is disabled, and when $c_i = 0$, $0\log(c_i')=0$ is disabled.
Finally, (b) can be written as:
$$\begin{align*}
H_{c}(c') &= - 1/N\sum_{(x_i,c_i)} c_i\log(c_i') + (1 - c_i)\log(1 - c_i')\\
&= - 1/N\sum_{(x_i,1)} \log(c_i') - 1/N\sum_{(x_i,0)} \log(1 - c_i')
\end{align*}$$
To better see the difference, cross-entropy (a) for the two classes $\{0, 1\}$ would be:
$$\begin{align*}
H_{p}(p') &= - p(1)\log p'(1) - p(0)\log p'(0)\\
&= - 1/N\sum_{(x_i,1)}\log\Big(\sum_{k:c_k''=1}1/N\Big) - 1/N\sum_{(x_i,0)}\log\Big(1 - \sum_{k:c_k''=1}1/N\Big)
\end{align*}$$
using $p(c) = \sum_{(x_i,c)}1/N$, and $p'(c) = \sum_{i:c_i''=c}1/N$, where $c_i'' = \left\lfloor c_i' + 0.5 \right\rfloor \in \{0, 1\}$.
There is a summation inside $\log(\cdot)$ that is independent of the point $i$, meaning (a) does not care about $i$ being misclassified.
$endgroup$
edited 17 hours ago
answered yesterday
P. Esmailian
add a comment |