Keras LSTM kernel
I am trying to understand how the weight matrices in an LSTM cell are used. An LSTM unit has several weight matrices: Wf, Wi, Wc, Wo, as shown in the diagrams at http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
At the same time, I am playing with the Keras LSTM and studying its source code:
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L1871
In the source code, only one kernel is mentioned. Is it referring to Wc only? If so, where are the other weight matrices Wf, Wi, Wo initialized and used? Thanks!
keras rnn lstm kernel
asked Dec 31 '17 at 23:10 by Edamame
2 Answers
"understand how the weight matrix in an LSTM cell is used"

In an LSTM you have a cell vector that keeps track of the information needed for the task at hand. To backpropagate errors from far-away time steps, the LSTM by design updates the cell vector with simple linear operations (element-wise multiplication and addition), so it is very easy for the gradient to flow.

The key idea is that we manipulate the cell vector through gates: non-linear functions of the current input and the previous hidden features that produce values between zero and one, indicating how much to add, remove, or update in each element of the cell vector.

How much of the old information do we keep? W_f answers this from the interaction between the previous hidden features and the current input: concatenate the two vectors, multiply by the weight matrix, and apply the sigmoid non-linearity to get a value in [0, 1] for each element, telling us how much of the old cell vector to forget:
modified_old_cell = old_cell * f_t (the forget gate).

How much information do we extract from the current input? We build two vectors from the same concatenation of (previous hiddens, input): the proposed cell features, using tanh so that values lie in [-1, +1], and the input gate features, using sigmoid:
modified_proposed_cell = proposed_cell * input_gate, since we don't need to add every piece of information from the current time step.
current_cell = modified_old_cell + modified_proposed_cell, so the "integration" is simply element-wise addition.

How much information do we pass to the next time step? If the problem is sequence tagging, for example, some output features may only matter at the current time step. The output gate o_t uses the same mechanism as above:
current_hidden_features = tanh(current_cell) * o_t (the output gate).
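To make the gate equations concrete, here is a minimal NumPy sketch of a single LSTM step. It is illustrative only, not the Keras implementation; the names W_f, W_i, W_c, W_o and b_f, b_i, b_c, b_o simply mirror the gates described above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])   # concatenate previous hiddens and current input
    f_t   = sigmoid(W_f @ z + b_f)      # forget gate: how much of the old cell to keep
    i_t   = sigmoid(W_i @ z + b_i)      # input gate: how much of the proposal to add
    c_hat = np.tanh(W_c @ z + b_c)      # proposed cell features in [-1, +1]
    o_t   = sigmoid(W_o @ z + b_o)      # output gate: how much of the cell to expose
    c_t = f_t * c_prev + i_t * c_hat    # "integration" is element-wise addition
    h_t = o_t * np.tanh(c_t)            # hidden features passed to the next time step
    return h_t, c_t

# Toy usage: 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
make_W = lambda: 0.1 * rng.standard_normal((n_hid, n_hid + n_in))
make_b = lambda: np.zeros(n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.standard_normal(n_in), h, c,
                 make_W(), make_W(), make_W(), make_W(),
                 make_b(), make_b(), make_b(), make_b())
print(h.shape, c.shape)   # (4,) (4,)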
answered May 19 '18 at 18:07 by Fadi Bakoura
They use this single kernel variable to store all of the weight matrices, concatenated along the last axis. In the call function of the LSTMCell you can see how they are unpacked:
self.kernel_i = self.kernel[:, :self.units]
self.kernel_f = self.kernel[:, self.units: self.units * 2]
self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]
self.kernel_o = self.kernel[:, self.units * 3:]
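In other words, the kernel is Wi, Wf, Wc and Wo stacked side by side (and recurrent_kernel holds the corresponding recurrent matrices). A quick way to check this yourself is sketched below, assuming a standard Keras install (the import path may be tensorflow.keras depending on your setup): build a layer and look at the weight shapes.

from keras.layers import Input, LSTM
from keras.models import Model

units, input_dim, timesteps = 8, 5, 3
inp = Input(shape=(timesteps, input_dim))
out = LSTM(units)(inp)
model = Model(inp, out)

# get_weights() returns [kernel, recurrent_kernel, bias] for an LSTM layer
kernel, recurrent_kernel, bias = model.layers[1].get_weights()
print(kernel.shape)             # (input_dim, 4 * units) -> (5, 32)
print(recurrent_kernel.shape)   # (units, 4 * units)     -> (8, 32)
print(bias.shape)               # (4 * units,)           -> (32,)

# Slicing out W_i exactly as in the source code above:
W_i = kernel[:, :units]
print(W_i.shape)                # (5, 8)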
answered Jan 19 '18 at 12:16 by sietschie, edited Jan 19 '18 at 13:30 by Stephen Rauch♦