Intuitive explanation of Noise Contrastive Estimation (NCE) loss?
I read about NCE (a form of candidate sampling) from these two sources:
- Tensorflow writeup
- Original Paper

Can someone help me with the following:
1. A simple explanation of how NCE works (I found the above difficult to parse, so something intuitive that leads up to the math presented there would be great).
2. Following point 1, an intuitive description of how this differs from Negative Sampling. I can see that there is a slight change in the formula, but I could not follow the math. I do have an intuitive understanding of negative sampling in the context of word2vec: we randomly choose some samples from the vocabulary $V$ and update only those, because $|V|$ is large and this offers a speedup. Please correct me if I'm wrong.
3. When to use which one, and how is that decided? It would be great if you could include examples (possibly easy-to-understand applications).
4. Is NCE better than Negative Sampling? Better in what manner?

Thank you.
deep-learning tensorflow word-embeddings sampling loss-function
Maybe my post helps: nanjiang.quora.com/Noise-contrastive-Estimation, and a later experiment with Theano can be found at github.com/jiangnanHugo/language_modeling. I hope my understanding is right.
– jiangnan hugo, Oct 6 '16 at 12:03
2 Answers
Taken from this post: https://stats.stackexchange.com/a/245452/154812
The issue
There are some issues with learning the word vectors using a "standard" neural network, where the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).
Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.
A "standard" neural network is usually trained with a cross-entropy cost function, which requires the values of the output neurons to represent probabilities. This means the output "scores" the network computes for each class have to be normalized into actual probabilities, which is done with the softmax function. Softmax is very costly when applied to a huge output layer.
The (a) solution
In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (and reformulated by [B]), then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.
The basic idea is to convert a multinomial classification problem (such as predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate the true probability distribution of the output word, binary logistic regression (binary classification) is used.
For each training sample, the classifier is fed a true pair (a center word and another word that appears in its context) and $k$ randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.
This is important: instead of predicting the next word (the "standard" training technique), the classifier simply predicts whether a pair of words is good or bad.
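To make the contrast with negative sampling concrete (this is my summary of how the two objectives are usually written, following the NCE papers cited below; $q$ is the noise distribution and $k$ the number of noise samples per true pair): NCE classifies a pair as "real" with probability
$$P(D = 1 \mid w, c) = \frac{p_\theta(w \mid c)}{p_\theta(w \mid c) + k\,q(w)} = \sigma\big(s_\theta(w, c) - \log(k\,q(w))\big),$$
while negative sampling drops the $\log(k\,q(w))$ correction and just uses
$$P(D = 1 \mid w, c) = \sigma\big(s_\theta(w, c)\big).$$
Both are trained with binary cross-entropy over the one true pair and the $k$ noise pairs; keeping the correction term is what lets NCE recover (asymptotically) the normalized language-model probabilities, while negative sampling only aims to produce good word vectors.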
Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution that favours drawing less frequent words more often than their raw frequency would suggest.
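As a minimal, illustrative sketch of how this fits together (the vocabulary, counts, and hyper-parameters below are made up; word2vec's noise distribution is usually described as unigram counts raised to the 3/4 power):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with raw unigram counts (illustrative numbers).
vocab = ["the", "cat", "sat", "on", "mat", "quokka"]
counts = np.array([5000, 300, 120, 4000, 90, 3], dtype=np.float64)

# word2vec-style noise distribution: unigram^(3/4), renormalized.
# Relative to raw frequency this boosts rare words like "quokka".
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

dim, k = 8, 5                                           # embedding size, negatives per true pair
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # "center" word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # "context" word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context):
    """Binary logistic loss for one true (center, context) pair plus k noise pairs."""
    neg = rng.choice(len(vocab), size=k, p=noise_dist)  # sampled "corrupted" words
    pos_score = W_in[center] @ W_out[context]
    neg_scores = W_in[center] @ W_out[neg].T
    # The true pair should be classified as "real" (label 1),
    # the noise pairs as "fake" (label 0).
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

print(neg_sampling_loss(vocab.index("cat"), vocab.index("sat")))
```

In a real implementation the gradient of this loss only touches the center vector and the $k+1$ output vectors involved, which is where the speedup over the full softmax comes from.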
References
[A] Smith & Eisner (2005) - Contrastive estimation: Training log-linear models on unlabeled data
[B] Gutmann & Hyvärinen (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
[C] Collobert & Weston (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning
[D] Mnih & Teh (2012) - A fast and simple algorithm for training neural probabilistic language models
[E] Mnih & Kavukcuoglu (2013) - Learning word embeddings efficiently with noise-contrastive estimation
Basically, this selects a small sample consisting of the true class and some other noisy class labels, and then takes the softmax over that sample instead of over the full output layer.
It is based on sampling words from the true distribution and from a noise distribution.
The basic idea is to train a logistic regression classifier that can separate samples obtained from the true distribution from samples obtained from the noise distribution. Remember that when we talk about the samples obtained from the true distribution, we are talking about only one sample: the true class obtained from the model distribution.
I have explained the NCE loss in more detail, and how it sidesteps the expensive softmax, here:
Noise Contrastive Estimation: Solution for expensive Softmax
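Since the question is tagged tensorflow, here is roughly what the candidate-sampling pattern looks like with the TF 1.x-era API, as far as I remember it (shapes, sizes, and variable names are illustrative; double-check the tf.nn.nce_loss documentation for the exact argument semantics in your version):

```python
import tensorflow as tf  # written against the TF 1.x-era API

vocab_size, embed_dim, num_sampled = 50000, 128, 64

# Input word embeddings and the output-layer ("nce") weights/biases.
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

train_inputs = tf.placeholder(tf.int32, shape=[None])     # center word ids
train_labels = tf.placeholder(tf.int64, shape=[None, 1])  # true context word ids

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# For each (input, label) pair, num_sampled noise words are drawn and a binary
# classifier is trained to tell the true pair from the noise pairs, so the full
# softmax over vocab_size classes is never computed during training.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))
```

The sampling trick is only needed to make training tractable; at evaluation time you would typically score candidates with the learned weights directly (or compute the full softmax if you need proper probabilities).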
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
– tuomastik, Jul 19 '17 at 6:35