Zero padding for LSTM input
























$begingroup$


I am building a text-generation model whose first layer uses Word2Vec embeddings. Because the input sentences have variable length, I pad them with zeros, so the input is a zero-padded array of the indices of the words that appear in each sentence. My problem is that 0 is also the index of a real word, so isn't padding the sequences with 0 feeding wrong information to the model?
My code is below. train_x is the training input, train_y is the final word of each sentence, and corpus is the entire corpus of sentences. The word2idx function returns the index of the word passed to it, and max_sentence_len is the maximum sentence length in the corpus, which sets the padded width.



import numpy as np
import nltk

# Every row starts as all zeros, so unused positions keep index 0.
train_x = np.zeros([len(corpus), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(corpus)], dtype=np.int32)
for i, sentence in enumerate(corpus):
    tokens = nltk.word_tokenize(sentence)  # tokenize once per sentence
    if len(tokens) > 2:
        for t, word in enumerate(tokens[:-1]):
            if word in word_model.wv.vocab:
                train_x[i, t] = word2idx(word)
        # The last token is the prediction target.
        train_y[i] = word2idx(tokens[-1])


To elaborate: say the index of "hi" is 0, the index of "there" is 1, and max_sentence_len is 5. For the input "hi there", my input array is [0,1,0,0,0], which translates to [hi, there, hi, hi, hi]. This looks incorrect to me. Is this how it should be, or is there a way around it?
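
A minimal sketch of the collision described above, using a toy two-word vocabulary. The names `toy_idx2word` and `toy_word2idx` are assumed for illustration only; they stand in for the real Word2Vec vocabulary and the `word2idx` function.

```python
import numpy as np

# Toy vocabulary, assumed for illustration: index 0 belongs to a real word.
toy_idx2word = ["hi", "there"]
toy_word2idx = {w: i for i, w in enumerate(toy_idx2word)}

max_sentence_len = 5
row = np.zeros(max_sentence_len, dtype=np.int32)
for t, word in enumerate(["hi", "there"]):
    row[t] = toy_word2idx[word]

# Decoding the padded row shows that the padding is read back as "hi".
decoded = [toy_idx2word[i] for i in row]
print(row.tolist())  # [0, 1, 0, 0, 0]
print(decoded)       # ['hi', 'there', 'hi', 'hi', 'hi']
```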

















$endgroup$












  • $begingroup$
    Welcome! It would help to add the code that receives this array. There might be a way to feed an all-zero vector as the embedding of padded cells, or to pad with -1, which is not reserved for any word, so that it produces an all-zero embedding. Maybe you could add a word with index X that has an all-zero embedding, so that pads with index X map to that word?
    $endgroup$
    – Esmailian
    yesterday








  • $begingroup$
    github.com/RaRe-Technologies/gensim/issues/1900 This issue is your problem, which seems not to have been solved yet: '0' padding as the first word.
    $endgroup$
    – Esmailian
    yesterday












  • $begingroup$
    Thanks @Esmailian
    $endgroup$
    – saurbh
    yesterday
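
The reserved-index idea from the first comment could be sketched roughly as follows. This is only an illustration under stated assumptions: `vectors` stands in for the trained Word2Vec weight matrix, and `PAD_IDX`, `shifted_word2idx`, `encode` and the toy vocabulary are hypothetical names that do not appear in the question's code. Every word index is shifted up by one so that 0 is free for padding, and an all-zero row is prepended to the embedding matrix.

```python
import numpy as np

PAD_IDX = 0  # hypothetical index reserved for padding

# Stand-in for the trained Word2Vec weights: 2 words, 3 dimensions.
toy_idx2word = ["hi", "there"]
vectors = np.array([[0.1, 0.2, 0.3],
                    [0.4, 0.5, 0.6]], dtype=np.float32)

# Shift every real word up by one so that index 0 only ever means "padding".
shifted_word2idx = {w: i + 1 for i, w in enumerate(toy_idx2word)}

# Prepend an all-zero row so that PAD_IDX looks up a zero vector.
zero_row = np.zeros((1, vectors.shape[1]), dtype=vectors.dtype)
embedding_matrix = np.vstack([zero_row, vectors])

def encode(tokens, max_len):
    """Map known tokens to shifted indices and right-pad with PAD_IDX."""
    ids = [shifted_word2idx[t] for t in tokens if t in shifted_word2idx][:max_len]
    return np.array(ids + [PAD_IDX] * (max_len - len(ids)), dtype=np.int32)

row = encode(["hi", "there"], max_len=5)
embedded = embedding_matrix[row]  # shape (5, 3); padded positions are all zeros
print(row.tolist())  # [1, 2, 0, 0, 0]
```

With this layout, padded positions no longer decode to a real word. If the downstream model is built with Keras (an assumption, since the question does not say), its `Embedding` layer accepts `mask_zero=True`, which treats index 0 as padding to be skipped by later layers. The target indices in train_y would need the same shift so that inputs and outputs share one vocabulary.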
















lstm rnn word-embeddings






edited yesterday by saurbh
asked yesterday by saurbh





