Size of Output vector from AvgW2V Vectorizer is less than Size of Input data
Hi,
I have been seeing this problem for quite some time: whenever I vectorize input text data with the average-Word2Vec (AvgW2V) technique, the vectorized output is smaller than the input. Is there a statistical reason behind this? In my case the input has 100,000 samples, but the output has only 99,998.
I'm wondering what is causing this problem. Thanks in advance.
Code:

    listofsentences = []
    for sent in x_train:
        listofsentences.append(sent.split())

    training_model = Word2Vec(sentences=listofsentences, workers=-1, min_count=5)
    modelwords = list(training_model.wv.vocab)

    std_avgw2v_x_train = []
    for everysentence in tqdm(listofsentences):
        count = 0
        sentence = np.zeros(100)
        for everyword in everysentence:
            if everyword in modelwords:
                w2v = training_model.wv[everyword]
                count += 1
                sentence += w2v
        if count != 0:
            sentence /= count
            std_avgw2v_x_train.append(sentence)

    len(std_avgw2v_x_train)
    >99998
    len(x_train)
    >100000

EDIT1: I'd like to mention that I just started learning ML; it's been 55 days since I started. Also, the same code gives out 100K output samples when I vectorize with TFIDFW2V.
I have attached an image of the same. Kindly look into it.
machine-learning feature-extraction word2vec text
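The averaging loop above only appends a sentence vector when count != 0, so any sentence with zero in-vocabulary words is silently skipped. The following is a minimal, self-contained sketch of that loop; a plain dict stands in for the trained gensim model (an assumption for illustration only), and the function name is hypothetical:

```python
# Minimal sketch of the averaging loop from the question. A plain dict
# stands in for the trained gensim Word2Vec model (illustration only).

def average_sentences(sentences, word_vectors, dim=4):
    """Average the vectors of in-vocabulary words, sentence by sentence.

    Mirrors the question's loop: a vector is appended ONLY when at least
    one word of the sentence is in the vocabulary, so fully
    out-of-vocabulary (or empty) sentences are silently dropped.
    """
    averaged = []
    for sent in sentences:
        count = 0
        acc = [0.0] * dim  # running sum, like np.zeros(100) in the question
        for word in sent:
            if word in word_vectors:
                vec = word_vectors[word]
                count += 1
                acc = [a + v for a, v in zip(acc, vec)]
        if count != 0:
            averaged.append([a / count for a in acc])
    return averaged

# Toy vocabulary: only "cat" and "dog" are known.
vocab = {"cat": [1.0, 0.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0, 0.0]}
sents = [["cat", "dog"], ["zzz"], ["dog"]]  # middle sentence is all OOV

out = average_sentences(sents, vocab)
print(len(sents), len(out))  # prints "3 2": three sentences in, two vectors out
```

Moving the append outside the `if count != 0:` block (appending the zero vector for all-OOV sentences) would keep the output the same length as the input.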
asked 6 hours ago, edited 5 hours ago
karthikeyan (12)
1 Answer
I think the issue is one of these two:
A. You have missing values in x_train.
B. One of the values in x_train has no word that is present in modelwords.
In both cases, the condition in

    if everyword in modelwords:
        w2v = training_model.wv[everyword]
        count += 1
        sentence += w2v

is never satisfied for that sentence, so count stays 0 and its vector is never appended to std_avgw2v_x_train.
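One way to confirm this diagnosis is to scan for sentences with zero in-vocabulary words before averaging. A minimal sketch (the function name is hypothetical, and the plain list of words stands in for the model vocabulary):

```python
# Hypothetical helper to locate sentences the averaging loop would drop:
# those whose words are all out-of-vocabulary (including empty sentences).

def find_dropped_indices(sentences, vocab):
    """Return indices of sentences with no in-vocabulary words."""
    vocab = set(vocab)  # set membership is O(1), unlike the list in the question
    return [i for i, sent in enumerate(sentences)
            if not any(word in vocab for word in sent)]

# Toy example: sentence 1 is empty, sentence 2 has only unknown words.
sentences = [["cat", "dog"], [], ["qwerty"]]
print(find_dropped_indices(sentences, ["cat", "dog"]))  # prints [1, 2]
```

Note that an empty token list (e.g., from a blank or whitespace-only string) behaves exactly like an all-OOV sentence here, so this one check covers both causes A and B. Converting modelwords to a set also speeds up the membership test enormously over 100K sentences.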
Thanks! I'll check the same and update the question. BTW, the same input, when I vectorize via TFIDFW2V, gives out 100K samples as output.
– karthikeyan
6 hours ago
Since tf-idf creates a vector based on the corpus alone, it accounts for all the words, unlike here, where you check whether each word is present in the w2v vocabulary.
– Gyan Ranjan
6 hours ago
A: Thanks. To check for missing data, I converted this to a dataframe and sorted indices with na_position='first'; I couldn't find any missing values interpreted as NaN, nor any empty strings. I also ran df.dropna(inplace=True) and the size of the dataset still remains (100000, 1).
– karthikeyan
5 hours ago
B: I tried some words that are not in the vocabulary. For example, when I tried w2v = training_model.wv['hi'], it gave me KeyError: "word 'hi' not in vocabulary". So this means your suggestion B is also not the problem here, right? Correct me if I'm wrong. Also, if you can suggest any other robust methods for checking the problem, I would try that. Thanks.
– karthikeyan
5 hours ago
answered 6 hours ago
Gyan Ranjan (1,457)
Thanks for contributing an answer to Data Science Stack Exchange!