Keras LSTM kernel
























I am trying to understand how the weight matrices in an LSTM cell are used. An LSTM unit has several weight matrices: Wf, Wi, Wc, Wo, as shown below:



[Three figures from the linked post showing the LSTM gate equations involving Wf, Wi, Wc, and Wo]



( from http://colah.github.io/posts/2015-08-Understanding-LSTMs/ )



At the same time, I am playing with the Keras LSTM and studying its source code:
https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L1871



In the source code, only one kernel is mentioned. Is it referring to Wc only? If so, where are the other weight matrices Wf, Wi, Wo initialized and used? Thanks!










      keras rnn lstm kernel






asked Dec 31 '17 at 23:10 by Edamame

          2 Answers




















"understand how the weight matrices in an LSTM cell are used"

In an LSTM you have a cell vector that keeps track of the information needed for the task at hand.

To backpropagate errors from far-away time steps, the LSTM by design updates the cell vector only through simple linear operations (elementwise multiplication and addition), so it is easy for the gradient to flow.

The key idea is that we manipulate the cell vector through gates: non-linearities applied to the current input and the previous hidden state that produce values between zero and one, indicating, element by element, how much to add, remove, or update.

1. How much information do we need to keep?

   W_f: concatenate the previous hidden state and the current input, multiply by the weight matrix, and apply the sigmoid non-linearity to get values in [0, 1] (one per element) that say how much of the old cell vector to forget.

   modified_old_cell = old_cell * forget_gate (the forget gate f_t).

2. How much information do we need to extract from the current input?

   We build two feature vectors from the concatenated (previous hidden state, input) vector: a tanh gives the proposed cell features and a sigmoid gives the input gate.

   proposed_cell = tanh(W_c applied to [previous_hiddens, input]); the tanh keeps the values in [-1, +1].

   modified_proposed_cell = proposed_cell * input_gate, because we do not need to add all of the information from the current time step.

   current_cell = modified_old_cell + modified_proposed_cell; the integration is simply elementwise addition.

3. How much information do we need to pass to the next time step? For example, in sequence tagging some output features may only matter at the current time step.

   The same mechanism as above: current_hidden_features = output_gate * tanh(current_cell) (the output gate o_t).
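To make the three steps concrete, here is a minimal NumPy sketch of a single LSTM time step written with explicit W_f, W_i, W_c, W_o matrices, as in the figures from the question. It is an illustrative example only, not the Keras implementation; the sizes and random values are made up:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
        z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)          # forget gate: how much of the old cell to keep
        i_t = sigmoid(W_i @ z + b_i)          # input gate: how much of the proposal to add
        c_hat = np.tanh(W_c @ z + b_c)        # proposed cell features, in [-1, +1]
        c_t = f_t * c_prev + i_t * c_hat      # linear (elementwise) cell update
        o_t = sigmoid(W_o @ z + b_o)          # output gate: how much of the cell to expose
        h_t = o_t * np.tanh(c_t)              # new hidden state
        return h_t, c_t

    # toy sizes: 3 input features, 4 hidden units (hypothetical values)
    rng = np.random.default_rng(0)
    n_in, n_hid = 3, 4
    W_f, W_i, W_c, W_o = (rng.standard_normal((n_hid, n_hid + n_in)) for _ in range(4))
    b_f, b_i, b_c, b_o = (np.zeros(n_hid) for _ in range(4))
    h_t, c_t = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid),
                         W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o)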








answered May 19 '18 at 18:07 by Fadi Bakoura






















Keras stores all of the weight matrices concatenated in that single kernel variable. In the LSTMCell code you can see how it is split back into the per-gate matrices:

    self.kernel_i = self.kernel[:, :self.units]                     # input gate weights (W_i)
    self.kernel_f = self.kernel[:, self.units: self.units * 2]      # forget gate weights (W_f)
    self.kernel_c = self.kernel[:, self.units * 2: self.units * 3]  # candidate cell weights (W_c)
    self.kernel_o = self.kernel[:, self.units * 3:]                 # output gate weights (W_o)
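For reference, one quick way to see this concatenation is to build an LSTM layer and look at the shapes of its weights. The sketch below assumes the standard Keras 2 API (exact import paths may differ between versions); the single kernel has shape (input_dim, 4 * units), with the four matrices stored side by side in the same i, f, c, o order as the slicing above:

    from keras.models import Sequential
    from keras.layers import LSTM

    units, input_dim, timesteps = 4, 3, 5
    model = Sequential([LSTM(units, input_shape=(timesteps, input_dim))])

    kernel, recurrent_kernel, bias = model.layers[0].get_weights()
    print(kernel.shape)            # (3, 16)  -> (input_dim, 4 * units): W_i | W_f | W_c | W_o
    print(recurrent_kernel.shape)  # (4, 16)  -> (units, 4 * units), the recurrent weights
    print(bias.shape)              # (16,)    -> one bias per gate, also concatenated

    W_i = kernel[:, :units]        # the same slicing as in LSTMCell above
    W_f = kernel[:, units:2 * units]

So there is no separate Wc-only kernel: Wf, Wi, Wc, and Wo all live inside the one kernel, and their recurrent counterparts inside recurrent_kernel.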





answered Jan 19 '18 at 12:16 by sietschie, edited Jan 19 '18 at 13:30 by Stephen Rauch













