What is the motivation for row-wise convolution and folding in Kalchbrenner et al. (2014)?
I was reading the paper by Kalchbrenner et al. titled A Convolutional Neural Network for Modelling Sentences and am struggling to understand their definition of the convolutional layer.
First, let's take a step back and describe what I'd expect a 1D convolution over a sentence to look like, as defined in Yoon Kim (2014):
Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional word vector corresponding to the $i$-th word in the sentence. A sentence of length $n$ (padded where necessary) is represented as
$$x_{1:n} = x_1 \oplus x_2 \oplus \dots \oplus x_n, \tag{1}$$
where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \dots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of $h$ words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by
$$c_i = f(w \cdot x_{i:i+h-1} + b). \tag{2}$$
Here $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\{x_{1:h}, x_{2:h+1}, \dots, x_{n-h+1:n}\}$ to produce a feature map
$$c = [c_1, c_2, \dots, c_{n-h+1}], \tag{3}$$
with $c \in \mathbb{R}^{n-h+1}$.
That is, a single feature detector maps every window of the input sequence to a single number, yielding $n - h + 1$ activations.
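To make this concrete, here is a minimal numpy sketch of that operation (my own code and variable names, not taken from either paper; words are stored one per row):

```python
import numpy as np

def kim_conv1d(x, w, b, h):
    """One Yoon-Kim-style filter: each window of h word vectors (flattened
    to length h*k) is mapped to a single scalar, giving n - h + 1 activations."""
    n, k = x.shape                               # n words, k-dimensional embeddings
    assert w.shape == (h * k,)
    c = np.empty(n - h + 1)
    for i in range(n - h + 1):
        window = x[i:i + h].reshape(-1)          # concatenation x_{i:i+h-1}, length h*k
        c[i] = np.tanh(np.dot(w, window) + b)    # Eq. (2): c_i = f(w . x_{i:i+h-1} + b)
    return c                                     # Eq. (3): c has n - h + 1 entries

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 4))                  # toy sentence: n = 7 words, k = 4
w = rng.standard_normal(3 * 4)                   # one filter over windows of h = 3 words
print(kim_conv1d(x, w, b=0.1, h=3).shape)        # (5,)
```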
Whereas in Kalchbrenner's paper, the convolution is described as follows:
If we temporarily ignore the pooling layer, we may state how one computes each $d$-dimensional column $a$ in the matrix $a$ resulting after the convolutional and non-linear layers. Define $M$ to be the matrix of diagonals:
$$M = [\operatorname{diag}(m_{:,1}), \dots, \operatorname{diag}(m_{:,m})] \tag{5}$$
where $m$ are the weights of the $d$ filters of the wide convolution. Then after the first pair of a convolutional and a non-linear layer, each column $a$ in the matrix $a$ is obtained as follows, for some index $j$:
$$a = g\left(M \begin{bmatrix} w_j \\ \vdots \\ w_{j+m-1} \end{bmatrix} + b\right) \tag{6}$$
Here $a$ is a column of first order features. Second order features are similarly obtained by applying Eq. 6 to a sequence of first order features $a_j, \dots, a_{j+m'-1}$ with another weight matrix $M'$. Barring pooling, Eq. 6 represents a core aspect of the feature extraction function and has a rather general form that we return to below. Together with pooling, the feature function induces position invariance and makes the range of higher-order features variable.
As described in this question, the matrix $M$ has dimensionality $d \times dm$ and the vector of concatenated $w$'s has dimensionality $dm$. Thus the multiplication produces a vector of dimensionality $d$ (for a single convolution of a single window!).
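To check this bookkeeping, here is a small numpy sketch (my own construction, not code from the paper) showing that multiplying the $d \times dm$ matrix of diagonals by the stacked window columns is the same as convolving each row of the window with the corresponding row of the filter, i.e. one window yields a full $d$-dimensional column (bias and the non-linearity $g$ are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                                   # d embedding rows, filter width m
filt = rng.standard_normal((d, m))            # the d one-dimensional row filters
window = rng.standard_normal((d, m))          # one window of columns w_j .. w_{j+m-1}

# Row-wise view: row i of the filter only ever touches row i of the sentence
# matrix, so a single window yields a d-dimensional column, not one scalar.
a_rowwise = (filt * window).sum(axis=1)       # shape (d,)

# Matrix-of-diagonals view, Eqs. (5)-(6), linear part only:
M = np.hstack([np.diag(filt[:, l]) for l in range(m)])   # d x dm
stacked = window.T.reshape(-1)                # [w_j; ...; w_{j+m-1}], length dm
a_matrix = M @ stacked                        # shape (d,)

print(np.allclose(a_rowwise, a_matrix))       # True: the two views agree
```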
The architecture visualization from the paper seems to confirm this understanding: the two matrices in the second layer represent two feature maps, and each feature map has dimensionality $d \times (s + m - 1)$, not just $(s + m - 1)$ as I would expect.
The authors refer to a "conventional" model, in which each feature map is one-dimensional, as the Max-TDNN and differentiate it from their own.
As the authors point out, feature detectors in different rows are fully independent of each other until the top layer. They therefore introduce the Folding layer, which merges each pair of rows in the penultimate layer (by summation), halving their number (from $d$ to $d/2$).
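As I understand it, folding is just a component-wise sum of rows $2i$ and $2i+1$ of a feature map; a minimal sketch of that operation (my own code, not the authors'):

```python
import numpy as np

def fold(a):
    """Folding layer as I read it: component-wise sum of every pair of
    adjacent rows, halving the number of rows from d to d/2."""
    assert a.shape[0] % 2 == 0                # a has shape (d, width), d even
    return a[0::2] + a[1::2]                  # row 2i is added to row 2i + 1

a = np.arange(4 * 6, dtype=float).reshape(4, 6)   # d = 4 feature rows, 6 columns
print(fold(a).shape)                              # (2, 6)
```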
Sorry for the long introduction; here are my two main questions:
1) What is the possible motivation for this definition of convolution (as opposed to Max-TDNN or, e.g., Yoon Kim's model)?
2) In the Folding layer, why is it satisfactory to have dependence only between pairs of corresponding rows? I don't see the gain over having no dependence at all.
machine-learning deep-learning nlp cnn convolution