How to learn which words are irrelevant in an information retrieval system?
Right now my recommender system for information retrieval uses word embeddings together with TF-IDF weights, as described here:
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
Using TF-IDF improves the results, but irrelevant keywords (highly frequent words) still have a large impact.
Can I build a system that learns which words to pay attention to, preferably in an unsupervised way?
What can you suggest for better information retrieval using word embeddings?
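For context, my current setup is roughly a TF-IDF-weighted average of word vectors, along the lines of the linked post. A minimal sketch (assuming gensim and scikit-learn; the toy corpus and names are illustrative, not my real data):

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [["cheap", "flights", "to", "london"],
            ["hotel", "deals", "in", "london"]]  # toy pre-tokenized corpus

    # In practice, train on a much larger corpus or load pretrained vectors.
    w2v = Word2Vec(docs, vector_size=50, min_count=1)

    # Fit TF-IDF on the same corpus to get per-word IDF weights.
    tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
    tfidf.fit(docs)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    def doc_vector(tokens):
        # IDF-weighted average of word vectors; out-of-vocabulary words are skipped.
        vecs = [w2v.wv[t] * idf.get(t, 1.0) for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

Even with this weighting, highly frequent words still dominate the averaged document vectors.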
nlp recommender-system word-embeddings
asked 12 hours ago by Tido
1 Answer
If you are working with TF-IDF, it is important to experiment with the min_df and max_df parameters. I assume you are using Python, since you linked a Python tutorial. Here is the relevant text from the scikit-learn TfidfVectorizer documentation for these two parameters:

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words). If float, the parameter represents a proportion of
        documents, integer absolute counts. This parameter is ignored if
        vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is
        also called cut-off in the literature. If float, the parameter
        represents a proportion of documents, integer absolute counts.
        This parameter is ignored if vocabulary is not None.

You will find several rules of thumb on the web. Some suggest a flat number for min_df, around 5-7 documents, and a percentage for max_df, around 80-85% or even lower. With this you can get rid of garbage, misspelled, or otherwise unwanted tokens. Keep in mind that you need to try different combinations to find the right balance for your model.
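For concreteness, here is a minimal sketch of how these parameters might be set (the corpus and thresholds are illustrative; tune them on your own data):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["the cat sat on the mat",
              "the dog sat on the log",
              "the cats and the dogs"]

    # max_df=0.85 drops terms appearing in more than 85% of documents
    # (here "the", a corpus-specific stop word); min_df=2 drops terms
    # seen in fewer than 2 documents (rare or misspelled tokens).
    vectorizer = TfidfVectorizer(min_df=2, max_df=0.85)
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())  # ['on' 'sat']

The surviving vocabulary, and hence the set of words that still influence your embedding weights, changes sharply with these two thresholds, so it is worth searching over them against your retrieval metric.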
answered 14 mins ago by Tasos