Understanding Contrastive Divergence












I’m trying to understand, and eventually build, a Restricted Boltzmann Machine. I understand that the update rule, that is, the algorithm used to change the weights, is something called “contrastive divergence”. I looked this up on Wikipedia and found these steps:




  1. Take a training sample v, compute the probabilities of the hidden units and sample a hidden activation vector h from this probability distribution.

  2. Compute the outer product of v and h and call this the positive gradient.

  3. From h, sample a reconstruction v' of the visible units, then resample the hidden activations h' from this. (Gibbs sampling step)

  4. Compute the outer product of v' and h' and call this the negative gradient.

  5. ...


I don’t understand step 3, and I’m struggling to grasp the concept of Gibbs sampling. Would someone explain it to me simply? I have covered neural networks, if that helps.










      machine-learning python neural-network unsupervised-learning rbm






asked Apr 11 '18 at 18:50 by Finn Williams
1 Answer






Gibbs sampling is an example of the more general class of Markov chain Monte Carlo (MCMC) methods for sampling from a distribution over a high-dimensional space.



To explain this, I first have to introduce the term state space. Recall that a Boltzmann machine is built out of binary units, i.e. every unit can be in one of two states, say 0 and 1. The overall state of the network is then specified by the state of every unit, i.e. the states of the network can be described as points in the space $\{0,1\}^N$, where $N$ is the number of units in the network. This space is called the state space.



Now, on that state space, we can define a probability distribution. The details are not so important, but essentially you define an energy for every state and turn that into a probability distribution using a Boltzmann distribution. Thus there will be states that are likely and other states that are less likely.
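For concreteness: if $E(x)$ denotes the energy of state $x$, the Boltzmann distribution assigns it the probability $p(x) = e^{-E(x)}/Z$, where $Z = \sum_{x'} e^{-E(x')}$ is the normalizing constant, so low-energy states are exponentially more likely than high-energy ones.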



          A Gibbs sampler is now a procedure to produce a sample, i.e. a sequence $X_n$ of states such that, roughly speaking, the distribution of these states across the state space reflects the probability distribution. Thus you want most of the $X_n$ to be in regions of the state space with high probability (and low energy), and few of them to be in regions with low probability (and high energy).



To do this, a naive Gibbs sampling approach would proceed as follows. You start with some state $X_0$. To find the state $X_1$, you pick some unit and calculate the conditional probability for that unit to be in state 1 ("on"), conditional on the current values of all other units. Call this number $p$. You then set the unit to 1 with probability $p$, pick the next unit, and repeat this to get from $X_1$ to $X_2$, and so forth.
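As a minimal sketch of this unit-wise procedure (my own illustration in NumPy, not code from the original post; the function and variable names are assumptions): for a Boltzmann machine with symmetric weights $W$, zero diagonal, and biases $b$, the conditional probability of a unit being on is a sigmoid of its total input.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gibbs_sweep(x, W, b, rng):
    """One sweep of unit-wise Gibbs sampling over a binary state vector x.

    W is a symmetric weight matrix with zero diagonal, b a bias vector;
    for such a Boltzmann machine, the conditional probability of unit i
    being 1, given all other units, is sigmoid(b[i] + W[i] @ x).
    """
    for i in range(len(x)):
        p = sigmoid(b[i] + W[i] @ x)            # P(x_i = 1 | all other units)
        x[i] = 1.0 if rng.random() < p else 0.0
    return x

# Toy usage: run many sweeps so the chain approaches the Boltzmann distribution.
rng = np.random.default_rng(0)
N = 6
W = rng.normal(size=(N, N)); W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
b = np.zeros(N)
x = rng.integers(0, 2, size=N).astype(float)
for _ in range(1000):
    x = gibbs_sweep(x, W, b, rng)
```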



In the special case of a restricted Boltzmann machine, this can be greatly simplified. Instead of going through, say, first all hidden units and then all visible units and updating them one by one, you can in fact update all hidden units in one step and all visible units in one step, because any two hidden units (and any two visible units) are conditionally independent given the other layer. Thus, for a full cycle through all units of the network, you would:




• calculate the probability for each hidden unit to be 1, given the values of the visible units

• set each hidden unit to 1 with this probability

• calculate the probability for each visible unit to be 1, again conditional on the values of the hidden units, and

• set each visible unit to 1 with this probability


This constitutes one full Gibbs sampling step, and it is your step 1 plus the first part of step 3 (the second part is then needed for the further calculation and is not part of the sampling itself). The reason we do this in the CD algorithm is that we actually want to approximate an expectation value, and we use a sampler for that.
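Putting this together, here is a rough sketch of one CD-1 update (my own illustration in NumPy, not code from the original post; names like `ph0` and the exact bias updates are assumptions, and the last lines fill in the standard "positive minus negative gradient" rule as one common choice for your step 5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(v0, W, a, c, lr, rng):
    """One CD-1 update from a single binary training vector v0 (a sketch).

    W: weights of shape (num_visible, num_hidden); a: visible biases;
    c: hidden biases; lr: learning rate.
    """
    # step 1: sample all hidden units at once, given the data (block Gibbs)
    ph0 = sigmoid(c + v0 @ W)                      # P(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # step 3: reconstruct the visible units from h0, then resample the hiddens
    pv1 = sigmoid(a + h0 @ W.T)                    # P(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(c + v1 @ W)                      # P(h = 1 | v1)

    # steps 2 and 4: positive and negative gradients as outer products
    pos = np.outer(v0, ph0)                        # data-driven statistics
    neg = np.outer(v1, ph1)                        # reconstruction statistics

    # the standard update rule (sketched): move weights toward data statistics
    W += lr * (pos - neg)
    a += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, a, c
```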



This is a complex topic and hard to summarize in a few sentences. If you want to learn more about the mathematics behind this (Markov chains) and about its application to RBMs (contrastive divergence and persistent contrastive divergence), you might find this and this document helpful - these are some notes that I put together while learning about this.






answered Apr 11 '18 at 20:50 by christianb93