Use the hidden layer of word2vec instead of 'one-hot', to reduce the number of weights in other nets?














I've been reading about word2vec and its ability to encode words as vector representations. In that vector space, the coordinates of a word end up clustered together with those of its usual context-neighbor words.



For example, if we have 10k unique words in our dataset, we feed 10,000 features (a one-hot vector) into the network. The network can have, for example, 300 hidden neurons, each with a linear activation function.



The output of the net is 10,000 neurons, followed by a softmax classifier. Each such "soft-maxed" output represents the probability of selecting the corresponding word from our 10k dictionary.
[Diagram: one-hot input vector (10,000) → hidden layer of 300 linear neurons (orange) → softmax output layer of 10,000 neurons (red)]
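For concreteness, here is a minimal numpy sketch of the network described above; the weight names and the random initialisation are purely illustrative:

    import numpy as np

    V, H = 10_000, 300                           # vocabulary size, hidden (embedding) size
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.01, size=(V, H))   # input -> hidden weights
    W_out = rng.normal(scale=0.01, size=(H, V))  # hidden -> output weights

    def forward(word_index):
        x = np.zeros(V)
        x[word_index] = 1.0                      # one-hot input
        h = x @ W_in                             # linear hidden layer: 300 values
        logits = h @ W_out                       # 10,000 scores
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                     # softmax over the 10k-word dictionary
        return h, probs

    h, probs = forward(42)
    print(h.shape, probs.shape)                  # (300,) (10000,)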



Question 1: I do get the benefit of "semantically grouped" words via that final softmaxed output layer. However, it seems far too large: it still has a dimension of 10,000 probabilities.



Can I instead use the result of those 300 hidden neurons in some other networks of mine, say in my LSTM, etc.?



The weights of my LSTM would then take much less space on disk. However, I would need to decode the 300-dimensional state back into my 10k-dimensional space so I can look the word up.



Question 2: Are the "encoded" vectors actually sitting in the hidden layer or in the output layer?



References:
first, second, third













Tag: word2vec






asked Mar 7 '18 at 21:22 by Kari, edited Mar 9 '18 at 5:20
2 Answers







Are you asking whether the 300-dimension 'middle layer' has learned to encode words into an embedding, the way word2vec does?



If that is what you're asking, I'd say no. There's nothing in a one-hot vector that specifies the distance between two words: the one-hot vectors for cat and table are exactly as far apart as the ones for cat and lion. With embeddings, you clearly want the second pair to be closer together.
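To illustrate with toy numbers (the 3-dimensional "learned" embeddings below are made up for the example, not trained):

    import numpy as np

    # Toy 5-word vocabulary: cat, lion, table, ...
    cat, lion, table = np.eye(5)[0], np.eye(5)[1], np.eye(5)[2]

    # Every pair of distinct one-hot vectors is equally far apart:
    print(np.linalg.norm(cat - table))   # 1.414...
    print(np.linalg.norm(cat - lion))    # 1.414...

    # Hypothetical learned embeddings: related words end up closer together
    emb = {"cat":   np.array([0.9, 0.1, 0.0]),
           "lion":  np.array([0.8, 0.2, 0.1]),
           "table": np.array([0.0, 0.9, 0.8])}
    print(np.linalg.norm(emb["cat"] - emb["lion"]))   # small
    print(np.linalg.norm(emb["cat"] - emb["table"]))  # much larger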






answered Mar 7 '18 at 23:06 by Daniel, edited Mar 8 '18 at 3:40 by Toros91













Comment: Thanks! Is it the softmax of the last layer that performs this "spatial spreading"? But wouldn't the one-hot value be scaled by the weights leading into each hidden neuron? We indeed have only a single feature set to 'one' and all other features at 'zero', but there are weights from input to hidden that can scale this 'one' into something else. – Kari, Mar 8 '18 at 11:17






















Answering Question 1 of my own post:




"Can I instead use the result of those 300 hidden neurons in some other networks of mine, say in my LSTM, etc.?"




Yes, I can, because the output of the hidden layer (orange in the picture) already represents the actual embedded vector. It is this vector, NOT the one-hot encoded vector, that should now be used in any other custom networks (LSTM, etc.). Using this orange vector will indeed greatly reduce the number of required weights when there are multiple custom networks under development.
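As a sketch of what that looks like downstream, assuming PyTorch and a trained (V × H) input-to-hidden matrix held in a placeholder tensor w2v_weights (the names and sizes are assumptions for illustration):

    import torch
    import torch.nn as nn

    V, H = 10_000, 300
    w2v_weights = torch.randn(V, H)     # stand-in for the real trained word2vec matrix

    # Frozen embedding lookup, then an LSTM that sees 300-d inputs instead of 10,000-d one-hots
    embedding = nn.Embedding.from_pretrained(w2v_weights, freeze=True)
    lstm = nn.LSTM(input_size=H, hidden_size=128, batch_first=True)

    token_ids = torch.tensor([[12, 407, 9981]])   # a batch with one sequence of word indices
    vectors = embedding(token_ids)                # shape (1, 3, 300)
    output, (h_n, c_n) = lstm(vectors)
    print(output.shape)                           # torch.Size([1, 3, 128])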



Notice that the connections between the one-hot vector and the orange layer are represented by a special matrix.



The matrix is special because each column (or row, depending on your preferred notation) already holds the pre-activations of those 300 neurons, i.e. the response to the corresponding incoming one-hot vector. So you just take that column and treat it as the entire orange layer, which is super convenient.
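A quick numpy check of that claim, storing the matrix as (vocabulary × hidden) so the lookup is a row (sizes are illustrative):

    import numpy as np

    V, H = 10_000, 300
    W_in = np.random.randn(V, H)      # the 'special' input-to-hidden matrix

    i = 1234                          # index of some word in the dictionary
    x = np.zeros(V)
    x[i] = 1.0                        # its one-hot vector

    # Multiplying by the one-hot vector and simply reading out row i give the same result
    assert np.allclose(x @ W_in, W_in[i])
    embedding_of_word_i = W_in[i]     # the entire 'orange layer' for that word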



          More on that here: https://datascience.stackexchange.com/a/29161/43077



Answering Question 2: Indeed, the encoded values are already in the output of the hidden layer (orange in the picture). You can use them in any other algorithm.



The red output layer is only needed during training. The weights leading from the input vector to the hidden layer are what you ultimately care about.
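As a practical note, word2vec implementations typically expose exactly these input-to-hidden vectors once training is finished. A minimal sketch with gensim, assuming gensim ≥ 4.0 (where the embedding size parameter is called vector_size):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "lion", "roared"]]   # toy corpus
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

    vec = model.wv["cat"]            # the 300-d hidden-layer ('orange') vector for 'cat'
    model.wv.save("word2vec.kv")     # only the input-to-hidden vectors are saved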






answered Mar 9 '18 at 3:37 by Kari












