Understanding Contrastive Divergence












I’m trying to understand, and eventually build, a Restricted Boltzmann Machine. I understand that the update rule, that is, the algorithm used to change the weights, is something called “contrastive divergence”. I looked this up on Wikipedia and found these steps:




  1. Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.

  2. Compute the outer product of v and h and call this the positive gradient.

  3. From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)

  4. Compute the outer product of v' and h' and call this the negative gradient.

  5. ...


I don’t understand step 3, and I’m struggling to grasp the concept of Gibbs sampling. Would someone explain it to me simply? I have covered neural networks, if that helps.










      machine-learning python neural-network unsupervised-learning rbm






asked Apr 11 '18 at 18:50 by Finn Williams
1 Answer






Gibbs sampling is an example of the more general class of Markov chain Monte Carlo (MCMC) methods for sampling from a distribution over a high-dimensional space.



To explain this, I first have to introduce the term state space. Recall that a Boltzmann machine is built out of binary units, i.e. every unit can be in one of two states, say 0 and 1. The overall state of the network is then specified by the state of every unit, i.e. the states of the network can be described as points in the space $\{0,1\}^N$, where $N$ is the number of units in the network. This space is called the state space.



Now, on that state space, we can define a probability distribution. The details are not so important, but essentially you define an energy for every state and turn that into a probability distribution using a Boltzmann distribution. Thus there will be states that are likely and other states that are less likely.
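For concreteness: if $E(x)$ denotes the energy of state $x$, the Boltzmann distribution assigns it the probability $p(x) = e^{-E(x)}/Z$, where $Z = \sum_{x'} e^{-E(x')}$ is the normalizing constant, so low-energy states are exponentially more likely than high-energy ones.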



          A Gibbs sampler is now a procedure to produce a sample, i.e. a sequence $X_n$ of states such that, roughly speaking, the distribution of these states across the state space reflects the probability distribution. Thus you want most of the $X_n$ to be in regions of the state space with high probability (and low energy), and few of them to be in regions with low probability (and high energy).



To do this, a naive Gibbs sampling approach would proceed as follows. You start with some state $X_0$. To find the state $X_1$, you pick some unit and calculate the conditional probability for that unit to be in state 1 ("on"), conditional on the current values of all other units. Call this number $p$. You then set the unit to 1 with probability $p$, pick the next unit, and repeat this to get from $X_1$ to $X_2$, and so forth.
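As a minimal sketch of this unit-wise procedure (my own illustration in NumPy, not code from the original post; the function and variable names are assumptions): for a Boltzmann machine with symmetric weights $W$, zero diagonal, and biases $b$, the conditional probability of a unit being on is a sigmoid of its total input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(x, W, b, rng):
    """One sweep of unit-wise Gibbs sampling over a binary state vector x.

    W is a symmetric weight matrix with zero diagonal, b a bias vector;
    for such a Boltzmann machine, the conditional probability of unit i
    being 1, given all other units, is sigmoid(b[i] + W[i] @ x).
    """
    for i in range(len(x)):
        p = sigmoid(b[i] + W[i] @ x)            # P(x_i = 1 | all other units)
        x[i] = 1.0 if rng.random() < p else 0.0
    return x

# Toy usage: run many sweeps so the chain approaches the Boltzmann distribution.
rng = np.random.default_rng(0)
N = 6
W = rng.normal(size=(N, N)); W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
b = np.zeros(N)
x = rng.integers(0, 2, size=N).astype(float)
for _ in range(1000):
    x = gibbs_sweep(x, W, b, rng)
```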



In the special case of a restricted Boltzmann machine, this can be greatly simplified. Instead of going through, say, first all hidden units and then all visible units and updating them one by one, you can in fact update all hidden units in one step and all visible units in one step, because any two hidden units (and any two visible units) are conditionally independent given the other layer. Thus, for a full cycle through all units of the network, you would:




• calculate the probability for each hidden unit to be 1, given the values of the visible units

• set each hidden unit to 1 with this probability

• calculate the probability for each visible unit to be 1, again conditional on the values of the hidden units, and

• set each visible unit to 1 with this probability


This constitutes one full Gibbs sampling step, and it is your step 1 plus the first part of step 3 (the second part is then needed for the further calculation and is not part of the sampling itself). The reason we do this in the CD algorithm is that we actually want to approximate an expectation value, and we use a sampler for that.
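Putting this together, here is a rough sketch of one CD-1 update (my own illustration in NumPy, not code from the original post; names like `ph0` and the exact bias updates are assumptions, and the last lines fill in the standard "positive minus negative gradient" rule as one common choice for your step 5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, a, c, lr, rng):
    """One CD-1 update from a single binary training vector v0 (a sketch).

    W: weights of shape (num_visible, num_hidden); a: visible biases;
    c: hidden biases; lr: learning rate.
    """
    # step 1: sample all hidden units at once, given the data (block Gibbs)
    ph0 = sigmoid(c + v0 @ W)                      # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # step 3: reconstruct the visible units from h0, then resample the hiddens
    pv1 = sigmoid(a + h0 @ W.T)                    # P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)                      # P(h = 1 | v1)

    # steps 2 and 4: positive and negative gradients as outer products
    pos = np.outer(v0, ph0)                        # data-driven statistics
    neg = np.outer(v1, ph1)                        # reconstruction statistics

    # the standard update rule (sketched): move weights toward data statistics
    W += lr * (pos - neg)
    a += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, a, c
```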



This is a complex topic and hard to summarize in a few sentences. If you want to learn more about the mathematics behind this (Markov chains) and about its application to RBMs (contrastive divergence and persistent contrastive divergence), you might find this and this document helpful - these are some notes that I put together while learning about this.






answered Apr 11 '18 at 20:50 by christianb93