Zero padding for LSTM input
I am building a text-generation model. The first layer uses Word2Vec embeddings. Since the inputs are sentences of variable length, I pad them with zeros: the input is a zero-padded array of the indices of the words that appear. My problem is that 0 is also the index of an actual word, so doesn't padding the sequences with 0 feed wrong information to the model?
Please find my code below. train_x is the training input, train_y is the final word of each sentence, and corpus is the full corpus of sentences. The word2idx function returns the index of the word passed to it, and max_sentence_len is the length of the longest sentence in the corpus, used for zero-padding.
import numpy as np
import nltk

train_x = np.zeros([len(corpus), max_sentence_len], dtype=np.int32)
train_y = np.zeros([len(corpus)], dtype=np.int32)
for i, sentence in enumerate(corpus):
    tokens = nltk.word_tokenize(sentence)  # tokenize once per sentence
    if len(tokens) > 2:
        # all words but the last form the input sequence
        for t, word in enumerate(tokens[:-1]):
            if word in word_model.wv.vocab:
                train_x[i, t] = word2idx(word)
        # the last word is the prediction target
        train_y[i] = word2idx(tokens[-1])
To elaborate: say the index of "hi" is 0, the index of "there" is 1, and max_sentence_len is 5. Then for the input "hi there", my input array is [0, 1, 0, 0, 0], which decodes to [hi, there, hi, hi, hi]. That seems wrong to me. Is this how it should be, or is there a better way to go about it?
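One common way around this (a minimal sketch, not the code above, using a hypothetical toy vocabulary in place of the Word2Vec model's indices) is to reserve index 0 exclusively for padding and shift every real word index up by one; frameworks such as Keras can then ignore the pad positions via the Embedding layer's mask_zero=True option:

```python
import numpy as np

# Hypothetical toy vocabulary and index map; in real code the indices
# come from the trained Word2Vec model. The key idea: shift every word
# index up by one so that 0 is reserved exclusively for padding.
vocab = ["hi", "there"]
word2idx = {w: i + 1 for i, w in enumerate(vocab)}  # 0 = PAD, "hi" = 1, ...

max_sentence_len = 5

def encode(tokens):
    """Right-pad a token list with the reserved PAD index 0."""
    row = np.zeros(max_sentence_len, dtype=np.int32)
    for t, word in enumerate(tokens[:max_sentence_len]):
        row[t] = word2idx[word]
    return row

encoded = encode(["hi", "there"])  # -> [1, 2, 0, 0, 0]
```

With this shift, the input_dim of the embedding layer grows by one, and the trailing zeros unambiguously mean "padding" rather than a real word.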
lstm rnn word-embeddings
Welcome! It would be better to add the code that receives this array. One option is to feed an all-zero vector as the embedding of padded cells; another is to use an index such as -1 that is not reserved for any word and have it generate an all-zero embedding. Or you could add a dedicated word with index X whose embedding is all zeros, so pads with index X map to that word.
– Esmailian
yesterday
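Following this suggestion, one could prepend a dedicated all-zero row to the embedding matrix so that pad positions look up a zero vector. A minimal sketch with random stand-in vectors (the real rows would come from the trained Word2Vec model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 3      # hypothetical: 1 reserved pad slot + 2 real words
embedding_dim = 4

# Row 0 is the padding row and stays all zeros; rows 1.. stand in for
# the trained Word2Vec vectors.
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
embedding_matrix[1:] = rng.normal(size=(vocab_size - 1, embedding_dim))

# Looking up a padded sequence: pad positions contribute zero vectors
# instead of the embedding of a real word.
sequence = np.array([1, 2, 0, 0, 0])   # e.g. "hi there" + 3 pads
embedded = embedding_matrix[sequence]  # shape (5, embedding_dim)
```

This matrix can then be used as the (frozen or trainable) weights of an embedding layer, with index 0 understood as padding everywhere downstream.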
github.com/RaRe-Technologies/gensim/issues/1900 ("'0' padding as the first word") — this issue describes exactly your problem, and it does not seem to have been resolved yet.
– Esmailian
yesterday
Thanks @Esmailian
– saurbh
yesterday
asked yesterday by saurbh (new contributor, rep 63); edited yesterday