Size of Output vector from AvgW2V Vectorizer is less than Size of Input data
























Hi,

I have been seeing this problem for quite some time. Whenever I vectorize input text data with the average Word2Vec (avgw2v) technique, the vectorized data is smaller than the input data. Is there any statistical reason behind this? In my case the input has 100K samples and the output has 99,998.
I'm wondering what is causing this problem. Thanks in advance.

Code:

    import numpy as np
    from gensim.models import Word2Vec
    from tqdm import tqdm

    # Tokenize every document on whitespace.
    listofsentences = []
    for sent in x_train:
        listofsentences.append(sent.split())

    training_model = Word2Vec(sentences=listofsentences, workers=-1, min_count=5)
    modelwords = list(training_model.wv.vocab)

    std_avgw2v_x_train = []
    for everysentence in tqdm(listofsentences):
        count = 0
        sentence = np.zeros(100)
        for everyword in everysentence:
            if everyword in modelwords:
                w2v = training_model.wv[everyword]
                count += 1
                sentence += w2v

        if count != 0:
            sentence /= count
            # Appended only when at least one word was in the vocabulary.
            std_avgw2v_x_train.append(sentence)

    len(std_avgw2v_x_train)
    >99998

    len(x_train)
    >100000

EDIT1: I'd like to mention that I just started learning ML; it's been 55 days since I started. Also, the same code gives 100K output samples when I vectorize with TFIDFW2V.

I have attached an image of the same. Kindly look into it.
























      machine-learning feature-extraction word2vec text














asked 6 hours ago by karthikeyan, edited 5 hours ago






















          1 Answer
I think the issue is one of these two:

A. You have a missing value in x_train.

B. One of the values in x_train has no word that is in modelwords.

In both cases, the condition in

    if everyword in modelwords:
        w2v = training_model.wv[everyword]
        count += 1
        sentence += w2v

is never satisfied, so nothing is added to sentence. count then stays 0, and if std_avgw2v_x_train.append(sentence) sits inside the "if count != 0:" block, that sentence never reaches the output list.

















answered 6 hours ago by Gyan Ranjan
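Cases A and B from the answer above can be reproduced on a toy input. The following is only a minimal sketch under stated assumptions: `average_vectors`, the `keep_empty` flag, the four sample rows and the `vectors` dict (a plain word-to-list mapping standing in for `training_model.wv`) are all made up for illustration, and plain Python lists replace NumPy arrays so that the sketch has no dependencies.

```python
def average_vectors(token_lists, vectors, dim, keep_empty=True):
    """Average the known-word vectors of every tokenized sentence.

    `vectors` is a hypothetical stand-in for a trained model: a plain dict
    mapping word -> list of floats of length `dim`.
    """
    averaged = []
    for tokens in token_lists:
        total = [0.0] * dim
        count = 0
        for word in tokens:
            if word in vectors:  # same test as `everyword in modelwords`
                count += 1
                total = [t + v for t, v in zip(total, vectors[word])]
        if count != 0:
            averaged.append([t / count for t in total])
        elif keep_empty:
            # No known word: keep a zero vector so row i still matches input i.
            averaged.append(total)
    return averaged


# Toy data (made up): row 1 is empty (case A), row 2 has only unknown words (case B).
x_train = ["good movie", "", "zzz qqq", "good good plot"]
token_lists = [text.split() for text in x_train]
vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "plot": [1.0, 1.0]}

skipping = average_vectors(token_lists, vectors, dim=2, keep_empty=False)
aligned = average_vectors(token_lists, vectors, dim=2, keep_empty=True)

print(len(x_train), len(skipping), len(aligned))  # 4 2 4
```

With `keep_empty=False` the two rows without any known word are dropped, which is the same kind of shortfall as 99,998 versus 100,000. Appending a zero vector instead is one possible choice, not the only one: it keeps the output aligned with the labels, at the cost of giving such rows an uninformative vector.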












          • Thanks! I'll check and update the question. By the way, when I vectorize the same input via TFIDFW2V, I get 100K samples as output.
            – karthikeyan
            6 hours ago












          • Since tf-idf creates its vectors from the corpus alone, it accounts for all the words, unlike here, where you check whether each word is present in the w2v vocab.
            – Gyan Ranjan
            6 hours ago










          • A: Thanks. To check for missing data, I converted this to a DataFrame and sorted the indices with na_position='first'. I couldn't find any missing values interpreted as NaN, or any empty spaces. I also ran df.dropna(inplace=True), and the size of the dataset is still (100000, 1).
            – karthikeyan
            5 hours ago










          • B: I tried some words that are not in the vocabulary. For example, w2v = training_model.wv['hi'] gave me KeyError: "word 'hi' not in vocabulary". So this means your suggestion B is also not the problem here, right? Correct me if I'm wrong. Also, if you can suggest any other robust methods for checking the problem, I will try them. Thanks.
            – karthikeyan
            5 hours ago
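The two checks described in the comments above may not be conclusive on their own. `dropna` only removes NaN/None entries, so an empty or whitespace-only string survives it, yet `split()` turns such a string into an empty token list. Likewise, a KeyError for 'hi' only shows that out-of-vocabulary words exist (with min_count=5, words seen fewer than five times are discarded); suggestion B is about a whole row consisting of such words. A minimal sketch of a per-row check, assuming a made-up toy corpus and a `vocab` set built with `collections.Counter` as a stand-in for the model vocabulary:

```python
from collections import Counter

# Toy rows (made up). None of them is NaN, so dropna() would keep all of them.
x_train = ["good movie good", "   ", "hi", "good plot movie", ""]
token_lists = [text.split() for text in x_train]

# Stand-in for the Word2Vec vocabulary: words seen at least `min_count` times.
# The question uses min_count=5; 2 is used here only because the toy corpus is tiny.
min_count = 2
frequencies = Counter(word for tokens in token_lists for word in tokens)
vocab = {word for word, n in frequencies.items() if n >= min_count}

# Check A: rows that are strings but split into zero tokens.
no_tokens = [i for i, tokens in enumerate(token_lists) if not tokens]

# Check B: rows that have tokens, none of which survived the min_count cut.
all_unknown = [
    i for i, tokens in enumerate(token_lists)
    if tokens and not any(word in vocab for word in tokens)
]

print(no_tokens)    # [1, 4]
print(all_unknown)  # [2]
```

On the real data, `vocab` would presumably be `set(training_model.wv.vocab)` from the question's code. The indices in the two lists together should then account for the rows missing from the output. Using a set rather than a list also makes each membership test much cheaper over 100K sentences.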



















