How to learn which words are irrelevant in an information retrieval system?
Right now my recommender system for information retrieval uses word embeddings together with TF-IDF weights, as described here:
http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/



Using TF-IDF improves results, but irrelevant, high-frequency keywords still have a large impact.
Can I train a system so that it learns which words to pay attention to, preferably in an unsupervised way?



What can you suggest for better information retrieval using word embeddings?
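
For concreteness, here is a minimal sketch of the kind of TF-IDF-weighted averaging the linked post describes. The toy corpus and the random stand-in vectors are placeholders, not my real data or a trained word2vec model; it only illustrates the weighting scheme.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat", "dogs chase cats"]  # toy corpus
    # Stand-in word vectors; substitute a trained word2vec model here.
    w2v = {w: np.random.rand(50) for d in docs for w in d.split()}

    # Fit TF-IDF on the corpus and extract the IDF weight of each vocabulary term.
    tfidf = TfidfVectorizer().fit(docs)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # get_feature_names() on older scikit-learn

    def embed(doc, dim=50):
        """Average the word vectors of `doc`, each weighted by its IDF."""
        words = [w for w in doc.lower().split() if w in w2v and w in idf]
        if not words:
            return np.zeros(dim)
        return np.mean([w2v[w] * idf[w] for w in words], axis=0)

    doc_vectors = np.vstack([embed(d) for d in docs])

The problem is that even with this IDF weighting, very frequent but uninformative words still pull the averaged document vectors together.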
Tags: nlp, recommender-system, word-embeddings
1 Answer
If you are working with TF-IDF, it's important to experiment with the min_df and max_df parameters. I guess you are using Python, since you linked a Python tutorial. Here is the relevant text for these two parameters from the scikit-learn TfidfVectorizer documentation:




max_df : float in range [0.0, 1.0] or int, default=1.0
    When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1
    When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.




You will find several rules of thumb on the web. Some suggest an absolute min_df of about 5-7 documents and a max_df of about 80-85%, or even lower. With this you can get rid of garbage, misspelled, or otherwise unwanted tokens. Keep in mind that you need to try different combinations to find the right balance for your model.
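
As a minimal runnable sketch of this filtering (the toy corpus and thresholds below are placeholders; on a real corpus you would start from the rules of thumb above):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus: "the" occurs in every document, several terms in only one.
    docs = [
        "the cat runs fast", "the cat sleeps", "the cat eats fish",
        "the cat plays", "the cat hides", "the dog barks",
        "the dog runs fast", "the dog sleeps", "the dog eats meat",
        "the zyzzyva appears once",
    ]

    vectorizer = TfidfVectorizer(
        min_df=2,     # drop terms in fewer than 2 documents (absolute count);
                      # on a real corpus try the 5-7 rule of thumb
        max_df=0.85,  # drop terms in more than 85% of documents (proportion)
    )
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())  # surviving vocabulary
    print(sorted(vectorizer.stop_words_))      # terms filtered as too rare or too frequent

Everything in stop_words_ is excluded before weighting, so ubiquitous terms like "the" no longer contribute to the document representation at all.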