Keras LSTM kernel
























I am trying to understand how the weight matrices in an LSTM cell are used. An LSTM unit has several weight matrices: Wf, Wi, Wc, Wo, as shown below:



[Three figures from the linked post showing the LSTM gate equations involving Wf, Wi, Wc, and Wo]



( from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ )



At the same time, I am playing with the Keras LSTM and studying its source code:
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L1871



In the source code, only one kernel is mentioned. Is it referring to Wc only? If so, where are the other weight matrices Wf, Wi, Wo initialized and used? Thanks!










      keras rnn lstm kernel






asked Dec 31 '17 at 23:10 by Edamame

          2 Answers




















"understand how the weight matrices in an LSTM cell are used"

In an LSTM you have a cell vector that keeps track of the information needed for the task at hand.

To backpropagate errors from far-away time steps, the LSTM by design updates the cell vector only through simple linear operations (elementwise multiplication and addition), so it is easy for the gradient to flow.

The key idea is that we manipulate the cell vector through gates: non-linearities applied to the current input and the previous hidden state that produce values between zero and one, indicating, element by element, how much to add, remove, or update.

1. How much information do we need to keep?

   W_f: concatenate the previous hidden state and the current input, multiply by the weight matrix, and apply the sigmoid non-linearity to get values in [0, 1] (one per element) that say how much of the old cell vector to forget.

   modified_old_cell = old_cell * forget_gate (the forget gate f_t).

2. How much information do we need to extract from the current input?

   We build two feature vectors from the concatenated (previous hidden state, input) vector: a tanh gives the proposed cell features and a sigmoid gives the input gate.

   proposed_cell = tanh(W_c applied to [previous_hiddens, input]); the tanh keeps the values in [-1, +1].

   modified_proposed_cell = proposed_cell * input_gate, because we do not need to add all of the information from the current time step.

   current_cell = modified_old_cell + modified_proposed_cell; the integration is simply elementwise addition.

3. How much information do we need to pass to the next time step? For example, in sequence tagging some output features may only matter at the current time step.

   The same mechanism as above: current_hidden_features = output_gate * tanh(current_cell) (the output gate o_t).
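To make the three steps concrete, here is a minimal NumPy sketch of a single LSTM time step written with explicit W_f, W_i, W_c, W_o matrices, as in the figures from the question. It is an illustrative example only, not the Keras implementation; the sizes and random values are made up:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate: how much of the old cell to keep
        i_t = sigmoid(W_i @ z + b_i)          # input gate: how much of the proposal to add
        c_hat = np.tanh(W_c @ z + b_c)        # proposed cell features, in [-1, +1]
        c_t = f_t * c_prev + i_t * c_hat      # linear (elementwise) cell update
        o_t = sigmoid(W_o @ z + b_o)          # output gate: how much of the cell to expose
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # toy sizes: 3 input features, 4 hidden units (hypothetical values)
    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 4
    W_f, W_i, W_c, W_o = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4))
    b_f, b_i, b_c, b_o = (np.zeros(n_hid) for _ in range(4))
    h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid),
                         W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)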








answered May 19 '18 at 18:07 by Fadi Bakoura






















Keras stores all of the weight matrices concatenated in that single kernel variable. In the LSTMCell code you can see how it is split back into the per-gate matrices:

    self.kernel_i = self.kernel[:, :self.units]                     # input gate weights (W_i)
    self.kernel_f = self.kernel[:, self.units: self.units * 2]      # forget gate weights (W_f)
    self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]  # candidate cell weights (W_c)
    self.kernel_o = self.kernel[:, self.units * 3:]                 # output gate weights (W_o)
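For reference, one quick way to see this concatenation is to build an LSTM layer and look at the shapes of its weights. The sketch below assumes the standard Keras 2 API (exact import paths may differ between versions); the single kernel has shape (input_dim, 4 * units), with the four matrices stored side by side in the same i, f, c, o order as the slicing above:

    from keras.models import Sequential
    from keras.layers import LSTM

    units, input_dim, timesteps = 4, 3, 5
    model = Sequential([LSTM(units, input_shape=(timesteps, input_dim))])

    kernel, recurrent_kernel, bias = model.layers[0].get_weights()
    print(kernel.shape)            # (3, 16)  -> (input_dim, 4 * units): W_i | W_f | W_c | W_o
    print(recurrent_kernel.shape)  # (4, 16)  -> (units, 4 * units), the recurrent weights
    print(bias.shape)              # (16,)    -> one bias per gate, also concatenated

    W_i = kernel[:, :units]        # the same slicing as in LSTMCell above
    W_f = kernel[:, units:2 * units]

So there is no separate Wc-only kernel: Wf, Wi, Wc, and Wo all live inside the one kernel, and their recurrent counterparts inside recurrent_kernel.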





answered Jan 19 '18 at 12:16 by sietschie, edited Jan 19 '18 at 13:30 by Stephen Rauch













