When to use GRU over LSTM?

71

The key difference between a GRU and an LSTM is that a GRU has two gates (reset and update), whereas an LSTM has three gates (input, output, and forget).



Why do we use a GRU when we clearly have more control over the network with the LSTM model (as it has three gates)? In which scenarios is a GRU preferred over an LSTM?
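For concreteness, here is a minimal sketch (assuming TensorFlow's Keras layers; the input and hidden sizes are illustrative, not from the question) showing how the extra gate shows up in parameter counts:

```python
import tensorflow as tf

# Minimal sketch (assumed setup: Keras layers, input dim 32, hidden size 64).
# The LSTM carries four weight blocks (input, forget, output, candidate);
# the GRU carries only three (update, reset, candidate), so the GRU is
# roughly 25% smaller for the same hidden size.
inp = tf.keras.Input(shape=(None, 32))
lstm = tf.keras.Model(inp, tf.keras.layers.LSTM(64)(inp))
gru = tf.keras.Model(inp, tf.keras.layers.GRU(64)(inp))
print("LSTM parameters:", lstm.count_params())
print("GRU parameters: ", gru.count_params())
```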










neural-network deep-learning

edited Nov 10 '17 at 23:20 by nbro · asked Oct 17 '16 at 11:47 by Sayali Sonawane

  • 1
    A GRU is slightly less complex but is approximately as good as an LSTM performance-wise. An implementation in TensorFlow is found here: data-blogger.com/2017/08/27/gru-implementation-tensorflow.
    – www.data-blogger.com, Aug 27 '17 at 12:28

7 Answers

50

GRU is related to LSTM, as both use gating mechanisms (in different ways) to mitigate the vanishing gradient problem. Here are some key points about GRU vs. LSTM:




  • The GRU controls the flow of information like the LSTM, but without using a separate memory cell. It simply exposes its full hidden state at each step, without an output gate to control it (see the sketch after this list).

  • GRU is relatively new and, from my perspective, its performance is on par with LSTM while being computationally more efficient (it has the less complex structure pointed out above), so we are seeing it used more and more.
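To make the "no separate memory cell" point concrete, here is a small illustrative sketch (assuming the Keras recurrent layers): an LSTM returns a hidden state plus a separate cell state, while a GRU returns only the hidden state it exposes directly.

```python
import tensorflow as tf

# Illustrative sketch: an LSTM keeps a memory cell c alongside the hidden
# state h, while a GRU has only the hidden state, which it exposes as-is.
x = tf.random.normal((1, 5, 8))            # (batch, time, features)
_, h_lstm, c_lstm = tf.keras.layers.LSTM(4, return_state=True)(x)
_, h_gru = tf.keras.layers.GRU(4, return_state=True)(x)
print(h_lstm.shape, c_lstm.shape)          # LSTM: hidden state and cell state
print(h_gru.shape)                         # GRU: hidden state only
```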


For a detailed description, you can explore this research paper on arXiv.org; it explains all of this brilliantly.



You can also explore these blog posts for a better intuition:





  • WildML

  • Colah - Github


Hope it helps!






answered Oct 17 '16 at 13:33 by Abhishek Jaiswal

  • 1
    In addition to your answer, there is a nice paper evaluating the performance of GRUs and LSTMs and their various permutations: "An empirical exploration of recurrent network architectures" by Google.
    – minerals, Jun 10 '17 at 18:11

25

To complement the already great answers above:




  • From my experience, GRUs train faster and perform better than LSTMs with less training data if you are doing language modeling (I am not sure about other tasks).


  • GRUs are simpler and thus easier to modify, for example by adding new gates when the network receives additional input. It's just less code in general (see the sketch after this list).


  • LSTMs should in theory remember longer sequences than GRUs and outperform them in tasks requiring modeling long-distance relations.
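
As an illustration of the "less code / easy to swap" point, here is a minimal hypothetical sketch in which the cell type is a one-word configuration choice (the helper name build_model is made up for this example):

```python
import tensorflow as tf

def build_model(cell_type="gru", units=64, vocab=10000):
    # Hypothetical helper: the only difference between the two model
    # variants is which recurrent layer class gets instantiated.
    rnn = {"gru": tf.keras.layers.GRU, "lstm": tf.keras.layers.LSTM}[cell_type]
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab, 128),
        rnn(units),
        tf.keras.layers.Dense(vocab, activation="softmax"),
    ])

gru_model = build_model("gru")
lstm_model = build_model("lstm")
```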



Some additional papers that analyze GRUs and LSTMs:




  • "Neural GPUs Learn Algorithms" (Łukasz Kaiser, Ilya Sutskever, 2015)
    https://arxiv.org/abs/1511.08228


  • "Comparative Study of CNN and RNN for Natural Language Processing"
    (Wenpeng Yin et al. 2017) https://arxiv.org/abs/1702.01923







edited Sep 27 '17 at 13:06 by Neil Slater · answered Jun 10 '17 at 18:22 by minerals


7

This really depends on the dataset and the use case; it's hard to tell definitively which is better.

  • GRU exposes its complete memory (the full hidden state), unlike LSTM, so applications where that acts as an advantage may benefit. Also, as a further reason to use GRU: it is computationally cheaper than LSTM, since it has only two gates, and if its performance is on par with LSTM, then why not?

  • This paper demonstrates excellently, with graphs, the superiority of gated networks over a simple RNN, but clearly mentions that it cannot conclude which of the two is better. So, if you are unsure which one to use for your model, I'd suggest training both and then keeping the better of them (a minimal sketch of this follows below).
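
A minimal sketch of that train-both-and-keep-the-winner suggestion, assuming the Keras API; the data here is random placeholder input purely so the example runs end to end:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 256 integer sequences of length 20, binary labels.
x = np.random.randint(0, 1000, size=(256, 20))
y = np.random.randint(0, 2, size=(256,))

def make(cell):
    # Same architecture for both variants; only the recurrent cell differs.
    m = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 32),
        cell(32),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    m.compile("adam", "binary_crossentropy", metrics=["accuracy"])
    return m

results = {}
for name, cell in [("gru", tf.keras.layers.GRU), ("lstm", tf.keras.layers.LSTM)]:
    hist = make(cell).fit(x, y, validation_split=0.2, epochs=3, verbose=0)
    results[name] = hist.history["val_accuracy"][-1]

print("winner:", max(results, key=results.get), results)
```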






edited Oct 17 '16 at 13:35 · answered Oct 17 '16 at 12:13 by Hima Varsha


1

GRU is better than LSTM, as it is easier to modify and doesn't need memory units; it is therefore faster to train than LSTM and gives on-par performance.






answered Dec 5 '17 at 4:12 by Vivek Khetan

  • 7
    Please support the performance claim with fair references.
    – Kari, May 21 '18 at 3:57


1

Actually, the key difference comes out to be more than that: long short-term memory (LSTM) perceptrons are built using the momentum and gradient descent algorithms. When you reconcile LSTM perceptrons with their recursive counterpart RNNs, you arrive at GRU, which is really just a generalized recurrent unit or "gradient recurrent unit" (depending on the context) that more closely integrates the momentum and gradient descent algorithms. Were I you, I'd do more research on Adam optimizers.

GRU is an outdated concept, by the way. However, I can understand you researching it if you want moderately advanced, in-depth knowledge of TF.






edited Oct 26 '18 at 21:40 · answered Oct 26 '18 at 21:04 by Andre Patterson

  • 1
    I'm curious. Could you explain why GRU is an outdated concept?
    – random_user, Nov 16 '18 at 18:01


1

Full GRU unit:

$ \tilde{c}_t = \tanh(W_c [G_r * c_{t-1}, x_t ] + b_c) $

$ G_u = \sigma(W_u [ c_{t-1}, x_t ] + b_u) $

$ G_r = \sigma(W_r [ c_{t-1}, x_t ] + b_r) $

$ c_t = G_u * \tilde{c}_t + (1 - G_u) * c_{t-1} $

$ a_t = c_t $

LSTM unit:

$ \tilde{c}_t = \tanh(W_c [ a_{t-1}, x_t ] + b_c) $

$ G_u = \sigma(W_u [ a_{t-1}, x_t ] + b_u) $

$ G_f = \sigma(W_f [ a_{t-1}, x_t ] + b_f) $

$ G_o = \sigma(W_o [ a_{t-1}, x_t ] + b_o) $

$ c_t = G_u * \tilde{c}_t + G_f * c_{t-1} $

$ a_t = G_o * c_t $



As can be seen from the equations, the LSTM has a separate update gate and forget gate, while the GRU ties them together through $G_u$ and $(1 - G_u)$. This clearly makes the LSTM more sophisticated, but at the same time more complex as well. There is no simple way to decide which to use for a particular use case; you always have to use trial and error to compare their performance. However, because the GRU is simpler than the LSTM, GRUs take much less time to train and are more efficient.

Credits: Andrew Ng
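
For readers who prefer code to equations, here is a direct NumPy transcription of the GRU step above (the shapes are illustrative assumptions, not from the original answer):

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal sketch of one GRU step, transcribing the equations above.
# Assumed shapes: x_t is (n_x,), c_prev is (n_c,),
# each W is (n_c, n_c + n_x), and each b is (n_c,).
def gru_step(x_t, c_prev, W_c, b_c, W_u, b_u, W_r, b_r):
    concat = np.concatenate([c_prev, x_t])
    G_u = sigma(W_u @ concat + b_u)            # update gate
    G_r = sigma(W_r @ concat + b_r)            # reset gate
    c_tilde = np.tanh(W_c @ np.concatenate([G_r * c_prev, x_t]) + b_c)
    c_t = G_u * c_tilde + (1.0 - G_u) * c_prev # blend candidate and old state
    return c_t                                 # a_t = c_t: state is exposed
```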






edited 4 hours ago by mickey · answered Aug 8 '18 at 3:48 by balboa


1

(Quoting the GRU and LSTM equations from the previous answer:) "GRU unit" is a redundant term; also, that's not a full GRU. Thanks for playing ^^






edited 4 hours ago by mickey · answered Oct 26 '18 at 21:38 by Andre Patterson