Intuitive explanation of Noise Contrastive Estimation (NCE) loss?
I read about NCE (a form of candidate sampling) from these two sources:
- Tensorflow writeup
- Original Paper

Can someone help me with the following:
1. A simple explanation of how NCE works (I found the above difficult to parse, so something intuitive that leads up to the math presented there would be great).
2. Following point 1, an intuitive description of how this differs from Negative Sampling. I can see that there is a slight change in the formula, but I could not follow the math. I do have an intuitive understanding of negative sampling in the context of word2vec: we randomly choose some samples from the vocabulary $V$ and update only those, because $|V|$ is large and this offers a speedup. Please correct me if I'm wrong.
3. When to use which one, and how is that decided? It would be great if you could include examples (possibly easy-to-understand applications).
4. Is NCE better than Negative Sampling? Better in what manner?

Thank you.
deep-learning tensorflow word-embeddings sampling loss-function
Maybe my post helps: nanjiang.quora.com/Noise-contrastive-Estimation, and a later experiment with Theano can be found at github.com/jiangnanHugo/language_modeling. I hope my understanding is right.
– jiangnan hugo, Oct 6 '16 at 12:03
2 Answers
Taken from this post: https://stats.stackexchange.com/a/245452/154812
The issue
There are some issues with learning the word vectors using a "standard" neural network, where the word vectors are learned while the network learns to predict the next word given a window of words (the input of the network).
Predicting the next word is like predicting a class. That is, such a network is just a "standard" multinomial (multi-class) classifier, and it must have as many output neurons as there are classes. When the classes are actual words, the number of neurons is, well, huge.
A "standard" neural network is usually trained with a cross-entropy cost function, which requires the values of the output neurons to represent probabilities. This means the output "scores" the network computes for each class have to be normalized into actual probabilities, which is done with the softmax function. Softmax is very costly when applied to a huge output layer.
The (a) solution
In order to deal with this issue, that is, the expensive computation of the softmax, Word2Vec uses a technique called noise-contrastive estimation. This technique was introduced by [A] (and reformulated by [B]), then used in [C], [D], [E] to learn word embeddings from unlabelled natural language text.
The basic idea is to convert a multinomial classification problem (such as predicting the next word) into a binary classification problem. That is, instead of using softmax to estimate the true probability distribution of the output word, binary logistic regression (binary classification) is used.
For each training sample, the classifier is fed a true pair (a center word and another word that appears in its context) and $k$ randomly corrupted pairs (consisting of the center word and a randomly chosen word from the vocabulary). By learning to distinguish the true pairs from the corrupted ones, the classifier ultimately learns the word vectors.
This is important: instead of predicting the next word (the "standard" training technique), the classifier simply predicts whether a pair of words is good or bad.
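To make the contrast with negative sampling concrete (this is my summary of how the two objectives are usually written, following the NCE papers cited below; $q$ is the noise distribution and $k$ the number of noise samples per true pair): NCE classifies a pair as "real" with probability
$$P(D = 1 \mid w, c) = \frac{p_\theta(w \mid c)}{p_\theta(w \mid c) + k\,q(w)} = \sigma\big(s_\theta(w, c) - \log(k\,q(w))\big),$$
while negative sampling drops the $\log(k\,q(w))$ correction and just uses
$$P(D = 1 \mid w, c) = \sigma\big(s_\theta(w, c)\big).$$
Both are trained with binary cross-entropy over the one true pair and the $k$ noise pairs; keeping the correction term is what lets NCE recover (asymptotically) the normalized language-model probabilities, while negative sampling only aims to produce good word vectors.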
Word2Vec slightly customizes the process and calls it negative sampling. In Word2Vec, the words for the negative samples (used for the corrupted pairs) are drawn from a specially designed distribution that favours drawing less frequent words more often than their raw frequency would suggest.
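As a minimal, illustrative sketch of how this fits together (the vocabulary, counts, and hyper-parameters below are made up; word2vec's noise distribution is usually described as unigram counts raised to the 3/4 power):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with raw unigram counts (illustrative numbers).
vocab = ["the", "cat", "sat", "on", "mat", "quokka"]
counts = np.array([5000, 300, 120, 4000, 90, 3], dtype=np.float64)

# word2vec-style noise distribution: unigram^(3/4), renormalized.
# Relative to raw frequency this boosts rare words like "quokka".
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

dim, k = 8, 5                                           # embedding size, negatives per true pair
W_in = rng.normal(scale=0.1, size=(len(vocab), dim))    # "center" word vectors
W_out = rng.normal(scale=0.1, size=(len(vocab), dim))   # "context" word vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context):
    """Binary logistic loss for one true (center, context) pair plus k noise pairs."""
    neg = rng.choice(len(vocab), size=k, p=noise_dist)  # sampled "corrupted" words
    pos_score = W_in[center] @ W_out[context]
    neg_scores = W_in[center] @ W_out[neg].T
    # The true pair should be classified as "real" (label 1),
    # the noise pairs as "fake" (label 0).
    return -np.log(sigmoid(pos_score)) - np.log(sigmoid(-neg_scores)).sum()

print(neg_sampling_loss(vocab.index("cat"), vocab.index("sat")))
```

In a real implementation the gradient of this loss only touches the center vector and the $k+1$ output vectors involved, which is where the speedup over the full softmax comes from.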
References
[A] Smith & Eisner (2005) - Contrastive estimation: Training log-linear models on unlabeled data
[B] Gutmann & Hyvärinen (2010) - Noise-contrastive estimation: A new estimation principle for unnormalized statistical models
[C] Collobert & Weston (2008) - A unified architecture for natural language processing: Deep neural networks with multitask learning
[D] Mnih & Teh (2012) - A fast and simple algorithm for training neural probabilistic language models
[E] Mnih & Kavukcuoglu (2013) - Learning word embeddings efficiently with noise-contrastive estimation
Basically, this selects a small sample consisting of the true class and some other noisy class labels, and then takes the softmax over that sample instead of over the full output layer.
It is based on sampling words from the true distribution and from a noise distribution.
The basic idea is to train a logistic regression classifier that can separate samples obtained from the true distribution from samples obtained from the noise distribution. Remember that when we talk about the samples obtained from the true distribution, we are talking about only one sample: the true class obtained from the model distribution.
I have explained the NCE loss in more detail, and how it sidesteps the expensive softmax, here:
Noise Contrastive Estimation: Solution for expensive Softmax
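Since the question is tagged tensorflow, here is roughly what the candidate-sampling pattern looks like with the TF 1.x-era API, as far as I remember it (shapes, sizes, and variable names are illustrative; double-check the tf.nn.nce_loss documentation for the exact argument semantics in your version):

```python
import tensorflow as tf  # written against the TF 1.x-era API

vocab_size, embed_dim, num_sampled = 50000, 128, 64

# Input word embeddings and the output-layer ("nce") weights/biases.
embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1.0, 1.0))
nce_weights = tf.Variable(tf.truncated_normal([vocab_size, embed_dim], stddev=0.1))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

train_inputs = tf.placeholder(tf.int32, shape=[None])     # center word ids
train_labels = tf.placeholder(tf.int64, shape=[None, 1])  # true context word ids

embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# For each (input, label) pair, num_sampled noise words are drawn and a binary
# classifier is trained to tell the true pair from the noise pairs, so the full
# softmax over vocab_size classes is never computed during training.
loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))
```

The sampling trick is only needed to make training tractable; at evaluation time you would typically score candidates with the learned weights directly (or compute the full softmax if you need proper probabilities).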
While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes.
– tuomastik, Jul 19 '17 at 6:35