Use the hidden layer of word2vec instead of 'one-hot' to reduce the number of weights in other nets?
I've been reading about word2vec and its ability to encode words into vector representations. The coordinates of these words end up clustered together with those of their usual context-neighbor words.
For example, if we have 10,000 unique words in our dataset, we feed 10,000 features into the network. The network can have, say, 300 hidden neurons, each with a linear activation function.
The output of the net is 10,000 neurons with a softmax over them; each such "soft-maxed output" represents the probability of selecting the corresponding word from our 10k dictionary.
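For concreteness, a minimal sketch of that architecture (the layer sizes are the ones above; the class and variable names are just illustrative, and training is omitted):

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 10_000   # unique words in the dictionary
EMBED_DIM = 300       # width of the hidden layer

class Word2VecNet(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, embed_dim=EMBED_DIM):
        super().__init__()
        # one-hot input -> hidden: linear, no bias, no non-linearity
        self.hidden = nn.Linear(vocab_size, embed_dim, bias=False)
        # hidden -> output: one score per word in the dictionary
        self.output = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, one_hot):               # one_hot: (batch, vocab_size)
        h = self.hidden(one_hot)              # (batch, 300) - the hidden code
        scores = self.output(h)               # (batch, vocab_size)
        return torch.softmax(scores, dim=-1)  # probabilities over the 10k words
```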
Question 1: I can get the benefit of "semantically grouped" words from that final soft-maxed output layer. However, it seems way too large - it still has a dimension of 10,000 probabilities.
Can I instead use the result of those 300 hidden neurons in some other networks of mine? Say, in my LSTM, etc.?
Then the weights of my LSTM wouldn't take up so much space on disk. However, I would need to decode the 300-dimensional state back into my 10k-dimensional state so I can look the word up.
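By "decode" I mean roughly the following sketch, where W_out is a hypothetical stand-in (random here) for the trained hidden-to-output weights:

```python
import numpy as np

EMBED_DIM, VOCAB_SIZE = 300, 10_000

# Stand-in for the trained hidden->output weight matrix of the word2vec net.
W_out = np.random.randn(EMBED_DIM, VOCAB_SIZE) * 0.01

lstm_state = np.random.randn(EMBED_DIM)   # hypothetical 300-dim state from my LSTM

scores = lstm_state @ W_out               # back to 10,000 scores
probs = np.exp(scores - scores.max())
probs /= probs.sum()                      # softmax over the dictionary
predicted_word = int(np.argmax(probs))    # index of the word to look up
```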
Question 2: Are the "encoded" vectors actually already sitting in the hidden layer, or in the output layer?
References: first, second, third, word2vec
asked Mar 7 '18 at 21:22 by Kari, last edited Mar 9 '18 at 5:20
2 Answers
Are you asking whether the 300-dimensional 'middle layer' has learned to encode words into an embedding, like word2vec does?
If this is the case, I'd say no. There's nothing in a one-hot vector that specifies the distance between two words. For instance, the one-hot vectors for cat and table are exactly as far apart as those for cat and lion. With embeddings, you clearly want the second pair to be closer together.
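To make the distance point concrete, a quick check with a made-up four-word vocabulary:

```python
import numpy as np

# Made-up vocabulary; the indices are arbitrary.
vocab = {"cat": 0, "lion": 1, "table": 2, "chair": 3}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0
    return v

cat, lion, table = one_hot("cat"), one_hot("lion"), one_hot("table")

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart,
# so "cat" is no closer to "lion" than it is to "table".
print(np.linalg.norm(cat - lion))    # 1.414...
print(np.linalg.norm(cat - table))   # 1.414...
```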
answered Mar 7 '18 at 23:06 by Daniel, edited Mar 8 '18 at 3:40 by Toros91

Thanks! Is it the softmax of the last layer that performs this "spatial spreading"? But wouldn't the one-hot value be scaled by the weights leading into each hidden neuron? We indeed have only a single feature set to 'one' and all other features set to 'zero', but there are weights from input to hidden that can scale this 'one' into something else. – Kari, Mar 8 '18 at 11:17
Answering Question 1 of my own post:
Can I actually use the result of those 300 neurons in some other networks of mine? Say, in my LSTM, etc.?
Yes, I can, because the output of the hidden layer (orange in the picture) already is the actual embedded vector. It is this vector, NOT the one-hot encoded vector, that should be fed into any other custom networks (LSTM, etc.). Using this orange vector will indeed greatly reduce the number of required weights if there are multiple custom networks under development.
Notice that the connections between the one-hot vector and the orange layer are represented by a special matrix.
The matrix is special because each column (or row, depending on your preferred notation) already holds the pre-activations of those 300 neurons - the response to the corresponding incoming one-hot vector. So you just take a column and treat it as the entire orange layer, which is super convenient.
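A minimal sketch of that shortcut, with W standing in for the trained input-to-hidden matrix (random numbers here, using a row-per-word convention):

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 10_000, 300

# Stand-in for the trained input-to-hidden weight matrix.
# With this shape, row i holds the 300 pre-activations for word i
# (with the transposed convention it would be column i instead).
W = np.random.randn(VOCAB_SIZE, EMBED_DIM) * 0.01

word_index = 4242                     # some word in the 10k dictionary

# Forward pass with an explicit one-hot vector...
one_hot = np.zeros(VOCAB_SIZE)
one_hot[word_index] = 1.0
hidden = one_hot @ W                  # shape (300,): the orange layer

# ...is identical to simply reading out row `word_index`:
assert np.allclose(hidden, W[word_index])
```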
More on that here: https://datascience.stackexchange.com/a/29161/43077
Answering Question 2: Indeed, the encoded values already sit in the output of the hidden layer (orange in the picture). You can use them in any other algorithm.
The red output layer is only needed during training. The weights leading from the input vector to the hidden layer are what you ultimately care about.
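As a rough illustration of the weight savings, assuming an arbitrary 512-unit LSTM (the only figures taken from above are the 10k vocabulary and the 300-dim embedding):

```python
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN = 10_000, 300, 512   # 512 is an arbitrary choice

# Feeding raw one-hot vectors: the input-to-hidden weights alone are
# 4 * HIDDEN * VOCAB_SIZE = 20,480,000 parameters.
lstm_one_hot = nn.LSTM(input_size=VOCAB_SIZE, hidden_size=HIDDEN)

# Feeding the 300-dim word2vec vectors instead: 4 * HIDDEN * EMBED_DIM = 614,400.
lstm_embedded = nn.LSTM(input_size=EMBED_DIM, hidden_size=HIDDEN)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm_one_hot), count(lstm_embedded))   # roughly 21.5M vs 1.7M
```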
answered Mar 9 '18 at 3:37 by Kari