Size of Output vector from AvgW2V Vectorizer is less than Size of Input data












Hi,
I have been seeing this problem for quite some time: whenever I vectorize input text data with the average-Word2Vec (AvgW2V) technique, the vectorized output has fewer rows than the input. Is there a statistical reason behind this? In my case the input has 100,000 rows, but the output has only 99,998.
I'm wondering what is causing this problem. Thanks in advance.



Code:

    import numpy as np
    from gensim.models import Word2Vec
    from tqdm import tqdm

    # Split each training document into a list of tokens
    listofsentences = []
    for sent in x_train:
        listofsentences.append(sent.split())

    training_model = Word2Vec(sentences=listofsentences, workers=-1, min_count=5)
    modelwords = list(training_model.wv.vocab)

    # Average the Word2Vec vectors of each sentence's in-vocabulary words
    std_avgw2v_x_train = []
    for everysentence in tqdm(listofsentences):
        count = 0
        sentence = np.zeros(100)
        for everyword in everysentence:
            if everyword in modelwords:
                w2v = training_model.wv[everyword]
                count += 1
                sentence += w2v

        if count != 0:
            sentence /= count
            std_avgw2v_x_train.append(sentence)

    len(std_avgw2v_x_train)
    >99998

    len(x_train)
    >100000


EDIT1: I'd like to mention that I just started learning ML; it's been 55 days since I started. Also, the same code gives out 100K output samples when I vectorize with TFIDF-W2V.



I have attached an image of the same; kindly look into it.










      machine-learning feature-extraction word2vec text






asked 6 hours ago by karthikeyan; edited 5 hours ago






















          1 Answer












I think the issue is one of these two:

A. You have missing values in x_train.

B. One of the values in x_train contains no word that is present in modelwords.

In both cases, the condition

    if everyword in modelwords:
        w2v = training_model.wv[everyword]
        count += 1
        sentence += w2v

is never satisfied for that sentence, so count stays 0 and nothing gets appended to std_avgw2v_x_train.
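A minimal sketch of this point, and of one possible fix: here `wv` is a plain dict standing in for `training_model.wv` with tiny 2-dimensional toy vectors (illustrative names, not the question's data). Appending the sentence vector unconditionally, so that an all-out-of-vocabulary sentence keeps a zero vector, leaves the output the same length as the input.

```python
# Toy stand-in for training_model.wv: a dict from word to vector (an
# assumption for illustration, not the gensim API).
wv = {"good": [1.0, 2.0], "movie": [3.0, 4.0]}
dim = 2

sentences = [["good", "movie"], ["zzz"], ["good"]]  # "zzz" is out of vocabulary

averaged = []
for sent in sentences:
    vec = [0.0] * dim
    count = 0
    for word in sent:
        if word in wv:
            count += 1
            vec = [a + b for a, b in zip(vec, wv[word])]
    if count:
        vec = [v / count for v in vec]
    # Append outside the `if count != 0` block: an all-OOV sentence keeps a
    # zero vector instead of being silently dropped.
    averaged.append(vec)

print(len(averaged))  # 3, same as len(sentences)
```

Whether a zero vector is an acceptable representation for such rows is a modeling choice; the alternative is to drop the same rows from the labels as well so that X and y stay aligned.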






answered 6 hours ago by Gyan Ranjan













• Thanks! I'll check the same and update the question. BTW, the same input, when I vectorize via TFIDF-W2V, gives out 100K samples as output. – karthikeyan, 6 hours ago

• Since tf-idf creates a vector based on the corpus only, it accounts for all the words, unlike here, where you check whether each word is present in the w2v vocab. – Gyan Ranjan, 6 hours ago

• A: Thanks. To check for missing data, I tried converting this to a dataframe and sorting indices with na_position='first'; I couldn't find any missing values interpreted as NaN, nor any empty strings. I also performed df.dropna(inplace=True), and the size of the dataset still remains (100000, 1). – karthikeyan, 5 hours ago

• B: I tried some words that are not in the vocabulary. For example, when I tried w2v = training_model.wv['hi'], it gave me KeyError: "word 'hi' not in vocabulary". So this means your suggestion B is also not the problem here, right? Correct me if I'm wrong. Also, if you can suggest any other robust methods for checking the problem, I would try that. Thanks. – karthikeyan, 5 hours ago
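As a more direct check on suggestion B than probing single words, one can list the indices of sentences that contain no in-vocabulary word at all; these are exactly the rows the averaging loop drops. A hypothetical sketch (`vocab` is a plain set standing in for the model's vocabulary, `sentences` for listofsentences):

```python
# Hypothetical diagnostic: `vocab` stands in for the trained model's
# vocabulary, `sentences` for the tokenized training data.
vocab = {"good", "movie"}
sentences = [["good"], [], ["zzz", "qqq"], ["movie", "zzz"]]

# Indices whose sentences contain no in-vocab word (empty sentences included)
dropped = [i for i, s in enumerate(sentences)
           if not any(w in vocab for w in s)]
print(dropped)  # [1, 2]
```

If this list is non-empty on the real data (here it would flag the empty sentence and the all-OOV one), those rows account for the 100,000 vs. 99,998 mismatch.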












