What is the motivation for row-wise convolution and folding in Kalchbrenner et al. (2014)?
I was reading the paper by Kalchbrenner et al. titled A Convolutional Neural Network for Modelling Sentences and am struggling to understand their definition of the convolutional layer.



First, let's take a step back and describe what I would expect a 1D convolution to look like, as defined in Yoon Kim (2014):




A sentence of length $n$ (padded where necessary) is represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n, \qquad (1)$$

where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \dots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of $h$ words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b). \qquad (2)$$

Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$ to produce a feature map

$$c = [c_1, c_2, \dots, c_{n-h+1}], \qquad (3)$$

with $c \in \mathbb{R}^{n-h+1}$.




In other words, a single feature detector maps every window of the input sequence to a single number, yielding $n - h + 1$ activations.
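For concreteness, here is a minimal NumPy sketch of that single-filter convolution (the shapes and variable names are my own illustration, not code from either paper):

    import numpy as np

    n, k, h = 7, 4, 3                 # sentence length, word-vector dim, window size
    X = np.random.randn(n, k)         # one k-dimensional word vector per row
    w = np.random.randn(h * k)        # a single filter over a window of h words
    b = 0.0

    # c_i = f(w . x_{i:i+h-1} + b): one scalar per window
    c = np.array([np.tanh(w @ X[i:i + h].reshape(-1) + b)
                  for i in range(n - h + 1)])
    print(c.shape)                    # (n - h + 1,) = (5,)

Stacking several such filters just gives several independent $(n - h + 1)$-dimensional feature maps.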



In Kalchbrenner et al.'s paper, by contrast, the convolution is described as follows:




If we temporarily ignore the pooling layer, we may state how one computes each $d$-dimensional column $a$ in the matrix $a$ resulting after the convolutional and non-linear layers. Define $M$ to be the matrix of diagonals:

$$M = [\operatorname{diag}(m_{:,1}), \dots, \operatorname{diag}(m_{:,m})] \qquad (5)$$

where $m$ are the weights of the $d$ filters of the wide convolution. Then after the first pair of a convolutional and a non-linear layer, each column $a$ in the matrix $a$ is obtained as follows, for some index $j$:

$$a = g\left(M \begin{bmatrix} w_j \\ \vdots \\ w_{j+m-1} \end{bmatrix} + b\right) \qquad (6)$$

Here $a$ is a column of first order features. Second order features are similarly obtained by applying Eq. 6 to a sequence of first order features $a_j, \dots, a_{j+m'-1}$ with another weight matrix $M'$. Barring pooling, Eq. 6 represents a core aspect of the feature extraction function and has a rather general form that we return to below. Together with pooling, the feature function induces position invariance and makes the range of higher-order features variable.




As described in this question, the matrix $M$ has dimensionality $d \times dm$ and the vector of concatenated $w$'s has dimensionality $dm$, so the multiplication produces a vector of dimensionality $d$ (for a single convolution of a single window!).
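Here is a small NumPy sketch of how I read Eq. 5–6 (shapes and variable names are my own assumptions): the matrix-of-diagonals form and a plain per-row dot product give exactly the same $d$-dimensional column.

    import numpy as np

    d, s, m = 4, 7, 3                           # embedding dim, sentence length, filter width
    S = np.random.randn(d, s)                   # sentence matrix: one word vector per column
    F = np.random.randn(d, m)                   # d one-dimensional filters, one per embedding row
    b = np.zeros(d)                             # bias (zero here, so it changes nothing)

    # Eq. 5: M = [diag(m_{:,1}), ..., diag(m_{:,m})], shape d x dm
    M = np.hstack([np.diag(F[:, p]) for p in range(m)])

    # Eq. 6 for one window starting at column j: the result is a d-dimensional column
    j = 2
    window = S[:, j:j + m].T.reshape(-1)        # stacked word vectors w_j, ..., w_{j+m-1}
    a_eq6 = np.tanh(M @ window + b)

    # Equivalent row-wise view: filter row i only ever touches row i of the sentence matrix
    a_rowwise = np.tanh(np.array([F[i] @ S[i, j:j + m] for i in range(d)]) + b)
    print(np.allclose(a_eq6, a_rowwise))        # True

Sliding $j$ over every (zero-padded) window position then yields a feature map with $d$ rows and $s + m - 1$ columns, i.e. $d$ independent one-dimensional convolutions stacked as rows, rather than a single row of $s + m - 1$ activations.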



The architecture visualization from the paper seems to confirm this understanding:

[architecture diagram of the DCNN from the paper]



The two matrices in the second layer represent two feature maps. Each feature map has dimensionality $(s + m - 1) \times d$, and not $(s + m - 1)$ as I would expect.



The authors refer to a "conventional" model whose feature maps have only one dimension as the Max-TDNN, and differentiate it from their own.



As the authors point out, feature detectors in different rows are fully independent of each other until the top layer. They therefore introduce the Folding layer, which merges each pair of rows in the penultimate layer (by summation), halving their number (from $d$ to $d/2$).
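As a toy sketch of how I understand folding (again, my own variable names, not code from the paper): every pair of adjacent rows of a feature map is summed, so the number of rows drops from $d$ to $d/2$ while the number of columns stays the same.

    import numpy as np

    d, cols = 6, 9                    # d filter rows, s + m - 1 columns (d assumed even)
    A = np.random.randn(d, cols)      # feature map entering the folding layer
    folded = A[0::2] + A[1::2]        # sum rows (0,1), (2,3), (4,5), ...
    print(folded.shape)               # (d / 2, cols) = (3, 9)

After folding, each of the remaining $d/2$ rows depends on two of the previously independent rows, which is exactly the limited cross-row interaction my second question is about.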



Sorry for the prolonged introduction; here are my two main questions:



1) What is the possible motivation for this definition of convolution (as opposed to Max-TDNN or, e.g., Yoon Kim's model)?



2) In the Folding layer, why is it sufficient to have dependence only between pairs of corresponding rows? I don't understand the gain over no dependence at all.









