Zero padding for LSTM input
I am building a text-generation model. The first layer uses Word2Vec embeddings. Since the inputs are sentences of variable length, I pad them with zeros: the input is a zero-padded array of the indices of the words that appear. My problem is that 0 is also the index of an actual word, so doesn't padding the sequences with 0 feed wrong information to the model?
Please find my code below. train_x is the training input, train_y is the final word of each sentence, and corpus is the full corpus of sentences. The word2idx function returns the index of the word passed to it, and max_sentence_len is the length of the longest sentence in the corpus, used for zero-padding.
import numpy as np
import nltk

train_x = np.zeros([len(corpus), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(corpus)], dtype=np.int32)
for i, sentence in enumerate(corpus):
    tokens = nltk.word_tokenize(sentence)  # tokenize once per sentence
    if len(tokens) > 2:
        # all words but the last form the input sequence
        for t, word in enumerate(tokens[:-1]):
            if word in word_model.wv.vocab:
                train_x[i, t] = word2idx(word)
        # the last word is the prediction target
        train_y[i] = word2idx(tokens[-1])
To elaborate: say the index of "hi" is 0, the index of "there" is 1, and max_sentence_len is 5. Then for the input "hi there", my input array is [0, 1, 0, 0, 0], which decodes to [hi, there, hi, hi, hi]. That seems wrong to me. Is this how it should be, or is there a better way to go about it?
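One common way around this (a minimal sketch, not the code above, using a hypothetical toy vocabulary in place of the Word2Vec model's indices) is to reserve index 0 exclusively for padding and shift every real word index up by one; frameworks such as Keras can then ignore the pad positions via the Embedding layer's mask_zero=True option:

```python
import numpy as np

# Hypothetical toy vocabulary and index map; in real code the indices
# come from the trained Word2Vec model. The key idea: shift every word
# index up by one so that 0 is reserved exclusively for padding.
vocab = ["hi", "there"]
word2idx = {w: i + 1 for i, w in enumerate(vocab)}  # 0 = PAD, "hi" = 1, ...

max_sentence_len = 5

def encode(tokens):
    """Right-pad a token list with the reserved PAD index 0."""
    row = np.zeros(max_sentence_len, dtype=np.int32)
    for t, word in enumerate(tokens[:max_sentence_len]):
        row[t] = word2idx[word]
    return row

encoded = encode(["hi", "there"])  # -> [1, 2, 0, 0, 0]
```

With this shift, the input_dim of the embedding layer grows by one, and the trailing zeros unambiguously mean "padding" rather than a real word.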
lstm rnn word-embeddings
Welcome! It would be better to add the code that receives this array. One option is to feed an all-zero vector as the embedding of padded cells; another is to use an index such as -1 that is not reserved for any word and have it generate an all-zero embedding. Or you could add a dedicated word with index X whose embedding is all zeros, so pads with index X map to that word.
– Esmailian
yesterday
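Following this suggestion, one could prepend a dedicated all-zero row to the embedding matrix so that pad positions look up a zero vector. A minimal sketch with random stand-in vectors (the real rows would come from the trained Word2Vec model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 3      # hypothetical: 1 reserved pad slot + 2 real words
embedding_dim = 4

# Row 0 is the padding row and stays all zeros; rows 1.. stand in for
# the trained Word2Vec vectors.
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
embedding_matrix[1:] = rng.normal(size=(vocab_size - 1, embedding_dim))

# Looking up a padded sequence: pad positions contribute zero vectors
# instead of the embedding of a real word.
sequence = np.array([1, 2, 0, 0, 0])   # e.g. "hi there" + 3 pads
embedded = embedding_matrix[sequence]  # shape (5, embedding_dim)
```

This matrix can then be used as the (frozen or trainable) weights of an embedding layer, with index 0 understood as padding everywhere downstream.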
github.com/RaRe-Technologies/gensim/issues/1900 ("'0' padding as the first word") — this issue describes exactly your problem, and it does not seem to have been resolved yet.
– Esmailian
yesterday
Thanks @Esmailian
– saurbh
yesterday
asked yesterday by saurbh (new contributor, rep 63); edited yesterday