What is the motivation for row-wise convolution and folding in Kalchbrenner et al. (2014)?
I was reading the paper by Kalchbrenner et al. titled A Convolutional Neural Network for Modelling Sentences and am struggling to understand their definition of the convolutional layer.



First, let's take a step back and describe what I would expect a 1D convolution to look like, as defined in Yoon Kim (2014):




A sentence of length $n$ (padded where necessary) is represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n, \qquad (1)$$

where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \dots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of $h$ words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b). \qquad (2)$$

Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$ to produce a feature map

$$c = [c_1, c_2, \dots, c_{n-h+1}], \qquad (3)$$

with $c \in \mathbb{R}^{n-h+1}$.




In other words, a single feature detector maps every window of the input sequence to a single number, yielding $n - h + 1$ activations.
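For concreteness, here is a minimal NumPy sketch of that single-filter convolution (the shapes and variable names are my own illustration, not code from either paper):

    import numpy as np

    n, k, h = 7, 4, 3                 # sentence length, word-vector dim, window size
    X = np.random.randn(n, k)         # one k-dimensional word vector per row
    w = np.random.randn(h * k)        # a single filter over a window of h words
    b = 0.0

    # c_i = f(w . x_{i:i+h-1} + b): one scalar per window
    c = np.array([np.tanh(w @ X[i:i + h].reshape(-1) + b)
                  for i in range(n - h + 1)])
    print(c.shape)                    # (n - h + 1,) = (5,)

Stacking several such filters just gives several independent $(n - h + 1)$-dimensional feature maps.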



In Kalchbrenner et al.'s paper, by contrast, the convolution is described as follows:




If we temporarily ignore the pooling layer, we may state how one computes each $d$-dimensional column $a$ in the matrix $a$ resulting after the convolutional and non-linear layers. Define $M$ to be the matrix of diagonals:

$$M = [\operatorname{diag}(m_{:,1}), \dots, \operatorname{diag}(m_{:,m})] \qquad (5)$$

where $m$ are the weights of the $d$ filters of the wide convolution. Then after the first pair of a convolutional and a non-linear layer, each column $a$ in the matrix $a$ is obtained as follows, for some index $j$:

$$a = g\left(M \begin{bmatrix} w_j \\ \vdots \\ w_{j+m-1} \end{bmatrix} + b\right) \qquad (6)$$

Here $a$ is a column of first order features. Second order features are similarly obtained by applying Eq. 6 to a sequence of first order features $a_j, \dots, a_{j+m'-1}$ with another weight matrix $M'$. Barring pooling, Eq. 6 represents a core aspect of the feature extraction function and has a rather general form that we return to below. Together with pooling, the feature function induces position invariance and makes the range of higher-order features variable.




As described in this question, the matrix $M$ has dimensionality $d \times dm$ and the vector of concatenated $w$'s has dimensionality $dm$, so the multiplication produces a vector of dimensionality $d$ (for a single convolution of a single window!).
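Here is a small NumPy sketch of how I read Eq. 5–6 (shapes and variable names are my own assumptions): the matrix-of-diagonals form and a plain per-row dot product give exactly the same $d$-dimensional column.

    import numpy as np

    d, s, m = 4, 7, 3                           # embedding dim, sentence length, filter width
    S = np.random.randn(d, s)                   # sentence matrix: one word vector per column
    F = np.random.randn(d, m)                   # d one-dimensional filters, one per embedding row
    b = np.zeros(d)                             # bias (zero here, so it changes nothing)

    # Eq. 5: M = [diag(m_{:,1}), ..., diag(m_{:,m})], shape d x dm
    M = np.hstack([np.diag(F[:, p]) for p in range(m)])

    # Eq. 6 for one window starting at column j: the result is a d-dimensional column
    j = 2
    window = S[:, j:j + m].T.reshape(-1)        # stacked word vectors w_j, ..., w_{j+m-1}
    a_eq6 = np.tanh(M @ window + b)

    # Equivalent row-wise view: filter row i only ever touches row i of the sentence matrix
    a_rowwise = np.tanh(np.array([F[i] @ S[i, j:j + m] for i in range(d)]) + b)
    print(np.allclose(a_eq6, a_rowwise))        # True

Sliding $j$ over every (zero-padded) window position then yields a feature map with $d$ rows and $s + m - 1$ columns, i.e. $d$ independent one-dimensional convolutions stacked as rows, rather than a single row of $s + m - 1$ activations.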



The architecture visualization from the paper seems to confirm this understanding:

[architecture diagram of the DCNN from the paper]



The two matrices in the second layer represent two feature maps. Each feature map has dimensionality $(s + m - 1) \times d$, and not $(s + m - 1)$ as I would expect.



The authors refer to a "conventional" model whose feature maps have only one dimension as the Max-TDNN, and differentiate it from their own.



As the authors point out, feature detectors in different rows are fully independent of each other until the top layer. They therefore introduce the Folding layer, which merges each pair of rows in the penultimate layer (by summation), halving their number (from $d$ to $d/2$).
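As a toy sketch of how I understand folding (again, my own variable names, not code from the paper): every pair of adjacent rows of a feature map is summed, so the number of rows drops from $d$ to $d/2$ while the number of columns stays the same.

    import numpy as np

    d, cols = 6, 9                    # d filter rows, s + m - 1 columns (d assumed even)
    A = np.random.randn(d, cols)      # feature map entering the folding layer
    folded = A[0::2] + A[1::2]        # sum rows (0,1), (2,3), (4,5), ...
    print(folded.shape)               # (d / 2, cols) = (3, 9)

After folding, each of the remaining $d/2$ rows depends on two of the previously independent rows, which is exactly the limited cross-row interaction my second question is about.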



Sorry for the prolonged introduction; here are my two main questions:



1) What is the possible motivation for this definition of convolution (as opposed to Max-TDNN or, e.g., Yoon Kim's model)?



2) In the Folding layer, why is it sufficient to have dependence only between pairs of corresponding rows? I don't understand the gain over no dependence at all.









