Understanding Contrastive Divergence
I’m trying to understand, and eventually build, a Restricted Boltzmann Machine. I understand that the update rule - that is, the algorithm used to change the weights - is something called “contrastive divergence”. I looked this up on Wikipedia and found these steps:
- Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.
- Compute the outer product of v and h and call this the positive gradient.
- From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)
- Compute the outer product of v' and h' and call this the negative gradient.
- ...
I don’t understand step 3, and I’m struggling to grasp the concept of Gibbs sampling. Could someone explain it to me simply? I have already covered neural networks, if that helps.
machine-learning python neural-network unsupervised-learning rbm
asked Apr 11 '18 at 18:50
Finn Williams
1 Answer
Gibbs sampling is an example of the more general class of Markov chain Monte Carlo (MCMC) methods for sampling from a distribution over a high-dimensional space.
To explain this, I first have to introduce the term state space. Recall that a Boltzmann machine is built out of binary units, i.e. every unit can be in one of two states - say 0 and 1. The overall state of the network is then specified by the state of every unit, i.e. the states of the network can be described as points in the space $\{0,1\}^N$, where $N$ is the number of units in the network. This space is called the state space.
Now, on that state space, we can define a probability distribution. The details are not so important here: essentially, you assign an energy to every state and turn the energies into a probability distribution using the Boltzmann distribution. Thus some states will be likely and other states less likely.
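Concretely, if $E(x)$ denotes the energy of state $x$, the Boltzmann distribution assigns the probability
$$p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_{x'} e^{-E(x')},$$
so low-energy states are exponentially more likely than high-energy ones (I'm omitting the temperature parameter, which is usually absorbed into the energy for RBMs).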
A Gibbs sampler is a procedure to produce a sample, i.e. a sequence $X_n$ of states such that, roughly speaking, the distribution of these states across the state space reflects the probability distribution. Thus you want most of the $X_n$ to lie in regions of the state space with high probability (and low energy), and few of them in regions with low probability (and high energy).
To do this, a naive Gibbs sampling approach proceeds as follows. You start with some state $X_0$. To find the state $X_1$, you pick some unit and calculate the conditional probability for that unit to be in state 1 ("on"), given the current values of all other units. Call this number $p$. You then set the unit to 1 with probability $p$ (and to 0 otherwise), pick the next unit, and repeat to get from $X_1$ to $X_2$, and so forth.
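In code, one sweep of this naive scheme might look like the following minimal sketch (here `cond_prob` is a hypothetical placeholder for whatever computes $P(X_i = 1 \mid \text{all other units})$ in your model):

```python
import numpy as np

def naive_gibbs_sweep(state, cond_prob, rng):
    """One sweep of single-unit Gibbs updates over a vector of binary units."""
    for i in range(len(state)):
        p = cond_prob(state, i)  # P(unit i = 1 | current values of all other units)
        state[i] = 1.0 if rng.random() < p else 0.0
    return state

# usage: state = naive_gibbs_sweep(state, cond_prob, np.random.default_rng(0))
```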
In the special case of a restricted Boltzmann machine, this can be greatly simplified. Instead of going through, say, first all hidden units and then all visible units, updating them one by one like this, you can in fact update all hidden units in one step and all visible units in one step, because any two hidden units (and any two visible units) are conditionally independent given the other layer. Thus, for a full cycle through all units of the network, you would (see the sketch after this list):
- calculate the probability for each hidden unit to be 1, given the values of the visible units,
- set each hidden unit to 1 with this probability,
- calculate the probability for each visible unit to be 1, again conditional on the values of the hidden units, and
- set each visible unit to 1 with this probability.
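Here is a minimal NumPy sketch of such a block update, assuming a weight matrix `W` (visible × hidden), visible bias `a`, hidden bias `b`, and binary units (the names are my own, not from any particular library):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, a, b, rng):
    """One full block Gibbs step for an RBM: v -> h -> v'."""
    # Probabilities of the hidden units being 1, given the visible units
    p_h = sigmoid(v @ W + b)
    h = (rng.random(p_h.shape) < p_h).astype(float)      # sample the whole hidden layer at once
    # Probabilities of the visible units being 1, given the hidden units
    p_v = sigmoid(h @ W.T + a)
    v_new = (rng.random(p_v.shape) < p_v).astype(float)  # sample the whole visible layer at once
    return h, v_new
```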
This constitutes one full Gibbs sampling step, and it corresponds to your step 1 plus the first part of your step 3 (the second part of step 3 is then needed for the further calculation and is not itself part of the sampling). The reason we do this in the CD algorithm is that we actually want to approximate an expectation value, and we use a sampler to do so.
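To connect this back to the steps you quoted, here is a hedged sketch of one CD-1 weight update, reusing `sigmoid` and the naming conventions from the block above (again my own names, not a reference implementation):

```python
def cd1_update(v0, W, a, b, rng, lr=0.1):
    """One contrastive divergence (CD-1) update for a single training sample v0."""
    # Step 1: compute hidden probabilities and sample h0 from p(h | v0)
    p_h0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Step 2: positive gradient = outer product of v0 and h0
    pos = np.outer(v0, h0)
    # Step 3 (the Gibbs step): reconstruct v1 from h0, then recompute hidden probabilities
    p_v1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b)  # probabilities are often used here instead of a sample, to reduce noise
    # Step 4: negative gradient = outer product of v1 and h1
    neg = np.outer(v1, p_h1)
    # Move the weights toward the positive gradient and away from the negative one
    return W + lr * (pos - neg)
```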
This is a complex topic and hard to summarize in a few sentences. If you want to learn more about the mathematics behind this (Markov chains) and on the application to RBMs (contrastive divergence and persistent contrastive divergence), you might find this and this document helpful - these are some notes that I put together while learning about this.
answered Apr 11 '18 at 20:50
christianb93