Use the hidden layer of word2vec instead of 'one-hot', to reduce the number of weights in other nets?














I've been reading about word2vec and its ability to encode words as vector representations. In that vector space, the coordinates of a word end up clustered together with those of its usual context-neighbor words.



For example, if we have 10k unique words in our dataset, we feed 10,000 features (a one-hot vector) into the network. The network can have, for example, 300 hidden neurons, each with a linear activation function.



The output of the net is 10,000 neurons, followed by a softmax classifier. Each such "soft-maxed" output represents the probability of selecting the corresponding word from our 10k dictionary.
[Diagram: one-hot input vector (10,000) → hidden layer of 300 linear neurons (orange) → softmax output layer of 10,000 neurons (red)]
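For concreteness, here is a minimal numpy sketch of the network described above; the weight names and the random initialisation are purely illustrative:

    import numpy as np

    V, H = 10_000, 300                           # vocabulary size, hidden (embedding) size
    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.01, size=(V, H))   # input -> hidden weights
    W_out = rng.normal(scale=0.01, size=(H, V))  # hidden -> output weights

    def forward(word_index):
        x = np.zeros(V)
        x[word_index] = 1.0                      # one-hot input
        h = x @ W_in                             # linear hidden layer: 300 values
        logits = h @ W_out                       # 10,000 scores
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                     # softmax over the 10k-word dictionary
        return h, probs

    h, probs = forward(42)
    print(h.shape, probs.shape)                  # (300,) (10000,)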



Question 1: I do get the benefit of "semantically grouped" words via that final softmaxed output layer. However, it seems far too large: it still has a dimension of 10,000 probabilities.



Can I instead use the result of those 300 hidden neurons in some other networks of mine, say in my LSTM, etc.?



The weights of my LSTM would then take much less space on disk. However, I would need to decode the 300-dimensional state back into my 10k-dimensional space so I can look the word up.



Question 2: Are the "encoded" vectors actually sitting in the hidden layer or in the output layer?



References:
first, second, third













Tag: word2vec






asked Mar 7 '18 at 21:22 by Kari, edited Mar 9 '18 at 5:20
2 Answers







Are you asking whether the 300-dimension 'middle layer' has learned to encode words into an embedding, the way word2vec does?



If that is what you're asking, I'd say no. There's nothing in a one-hot vector that specifies the distance between two words: the one-hot vectors for cat and table are exactly as far apart as the ones for cat and lion. With embeddings, you clearly want the second pair to be closer together.
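To illustrate with toy numbers (the 3-dimensional "learned" embeddings below are made up for the example, not trained):

    import numpy as np

    # Toy 5-word vocabulary: cat, lion, table, ...
    cat, lion, table = np.eye(5)[0], np.eye(5)[1], np.eye(5)[2]

    # Every pair of distinct one-hot vectors is equally far apart:
    print(np.linalg.norm(cat - table))   # 1.414...
    print(np.linalg.norm(cat - lion))    # 1.414...

    # Hypothetical learned embeddings: related words end up closer together
    emb = {"cat":   np.array([0.9, 0.1, 0.0]),
           "lion":  np.array([0.8, 0.2, 0.1]),
           "table": np.array([0.0, 0.9, 0.8])}
    print(np.linalg.norm(emb["cat"] - emb["lion"]))   # small
    print(np.linalg.norm(emb["cat"] - emb["table"]))  # much larger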






answered Mar 7 '18 at 23:06 by Daniel, edited Mar 8 '18 at 3:40 by Toros91













Comment: Thanks! Is it the softmax of the last layer that performs this "spatial spreading"? But wouldn't the one-hot value be scaled by the weights leading into each hidden neuron? We indeed have only a single feature set to 'one' and all other features at 'zero', but there are weights from input to hidden that can scale this 'one' into something else. – Kari, Mar 8 '18 at 11:17






















Answering Question 1 of my own post:




"Can I instead use the result of those 300 hidden neurons in some other networks of mine, say in my LSTM, etc.?"




Yes, I can, because the output of the hidden layer (orange in the picture) already represents the actual embedded vector. It is this vector, NOT the one-hot encoded vector, that should now be used in any other custom networks (LSTM, etc.). Using this orange vector will indeed greatly reduce the number of required weights when there are multiple custom networks under development.
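As a sketch of what that looks like downstream, assuming PyTorch and a trained (V × H) input-to-hidden matrix held in a placeholder tensor w2v_weights (the names and sizes are assumptions for illustration):

    import torch
    import torch.nn as nn

    V, H = 10_000, 300
    w2v_weights = torch.randn(V, H)     # stand-in for the real trained word2vec matrix

    # Frozen embedding lookup, then an LSTM that sees 300-d inputs instead of 10,000-d one-hots
    embedding = nn.Embedding.from_pretrained(w2v_weights, freeze=True)
    lstm = nn.LSTM(input_size=H, hidden_size=128, batch_first=True)

    token_ids = torch.tensor([[12, 407, 9981]])   # a batch with one sequence of word indices
    vectors = embedding(token_ids)                # shape (1, 3, 300)
    output, (h_n, c_n) = lstm(vectors)
    print(output.shape)                           # torch.Size([1, 3, 128])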



Notice that the connections between the one-hot vector and the orange layer are represented by a special matrix.



The matrix is special because each column (or row, depending on your preferred notation) already holds the pre-activations of those 300 neurons, i.e. the response to the corresponding incoming one-hot vector. So you just take that column and treat it as the entire orange layer, which is super convenient.
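A quick numpy check of that claim, storing the matrix as (vocabulary × hidden) so the lookup is a row (sizes are illustrative):

    import numpy as np

    V, H = 10_000, 300
    W_in = np.random.randn(V, H)      # the 'special' input-to-hidden matrix

    i = 1234                          # index of some word in the dictionary
    x = np.zeros(V)
    x[i] = 1.0                        # its one-hot vector

    # Multiplying by the one-hot vector and simply reading out row i give the same result
    assert np.allclose(x @ W_in, W_in[i])
    embedding_of_word_i = W_in[i]     # the entire 'orange layer' for that word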



          More on that here: https://datascience.stackexchange.com/a/29161/43077



Answering Question 2: Indeed, the encoded values are already in the output of the hidden layer (orange in the picture). You can use them in any other algorithm.



The red output layer is only needed during training. The weights leading from the input vector to the hidden layer are what you ultimately care about.
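As a practical note, word2vec implementations typically expose exactly these input-to-hidden vectors once training is finished. A minimal sketch with gensim, assuming gensim ≥ 4.0 (where the embedding size parameter is called vector_size):

    from gensim.models import Word2Vec

    sentences = [["the", "cat", "sat"], ["the", "lion", "roared"]]   # toy corpus
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1, sg=1)

    vec = model.wv["cat"]            # the 300-d hidden-layer ('orange') vector for 'cat'
    model.wv.save("word2vec.kv")     # only the input-to-hidden vectors are saved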






answered Mar 9 '18 at 3:37 by Kari












