Doc2vec to calculate cosine similarity - absolutely inaccurate
I'm trying to modify the Doc2vec tutorial to calculate cosine similarity and to take Pandas dataframes instead of .txt documents as input. I want to find the sentence in my data that is most similar to a new input sentence. However, after training, even when I query with a sentence almost identical to one in the dataset, the top results have low similarity scores and none of them is the sentence I started from. For example, the training data contains the sentence "This is a nice cat you have."; if I then query with "This cat you have is quite nice.", the first sentence does not come up as similar.
The data comes from an Excel sheet and looks roughly like this:

```
   Description                                              Group        Number
0  Sent: This is a sentence                                 Regular      NUM1234
1  Sent: Another sentence                                   Regular      NUM1243
2  Sent: Basically all the input                            Other group  NUM1278
3  Sent: Creating a test case to validate the routing
   between applications. No action needed at this moment    Other group  NUM1287
...
```
I have the following code (code that isn't needed for comprehension has been trimmed):
```python
import datetime
import multiprocessing
from collections import OrderedDict, namedtuple

import gensim
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

df = pd.read_excel("my_data.xls")
# removeGeneric() just strips "Sent:" from the beginning of each description
df["Description"] = df["Description"].apply(lambda x: removeGeneric(x))
for index, row in df.iterrows():
    row["Description"] = row["Description"].lower()
    # normalize_text() removes stopwords defined in the nltk package
    # and words shorter than 2 characters
    row["Description"] = normalize_text(row["Description"])
```
```python
SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]

cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"
```
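Each entry in alldocs then looks like this (illustrative, using row 3 from the table above and the normalized form printed later in the output):

```python
print(alldocs[3])
# SentimentDocument(words=['creating', 'test', 'case', 'validate', 'routing',
#                          'applications', 'action', 'needed', 'moment'],
#                   tags=['NUM1287'])
```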
```python
simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs)  # PV-DM w/ concat requires one special NULL word, so it serves as the template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)
```
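After the shared scan, all three models should report the same vocabulary. A quick check (a sketch, assuming gensim's pre-4.0 API where the word vocabulary lives on model.wv):

```python
for model in simple_models:
    # All three models were built from the same vocabulary scan
    print(model, "vocab size:", len(model.wv.vocab))
```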
```python
from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)

    for name, train_model in models_by_name.items():
        # Train at a fixed learning rate for this pass
        duration = 'na'  # leftover from the tutorial's timing code
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:  # elapsed_timer() is the timing helper from the tutorial
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)

    alpha -= alpha_delta  # decay the learning rate between passes, as in the tutorial
```
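(As an aside, with this gensim version the manual per-pass alpha schedule above can also be replaced by a single train() call per model, letting gensim decay the learning rate internally; a sketch:)

```python
# Equivalent, simpler training: one call per model; gensim decays alpha
# linearly from model.alpha down to model.min_alpha over all epochs.
# Note: the shuffle between passes is lost in this variant.
for model in simple_models:
    model.train(doc_list, total_examples=len(doc_list), epochs=passes)
```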
```python
# Note: this query is deliberately very similar to a sentence in the original dataset
new_sentence = "Test case creation to validation of routing between applications. No action needed"
new_sentence = removeGeneric(new_sentence)
new_sentence = normalize_text(new_sentence)

for model in simple_models:
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))

print(new_sentence)
```
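(To quantify how far off the match is, one can also compare the inferred vector directly against the trained vector stored for the tag of the sentence I expected to match; a sketch, where NUM1287 is the tag from row 3 of the table above:)

```python
import numpy as np

for model in simple_models:
    expected = model.docvecs["NUM1287"]          # trained vector of the known near-duplicate
    inferred = model.infer_vector(new_sentence)  # same call as in the query above
    cos = np.dot(expected, inferred) / (np.linalg.norm(expected) * np.linalg.norm(inferred))
    print(model, "cosine to NUM1287:", cos)
```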
For this I get the following output:
```
[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment
```

(one most_similar result per model; the last line is the normalized query sentence)
All of the recommendations are completely unrelated; sentences like "site id plant address good owner electricity request approved number al district province" show up, while the sentence the query is actually close to ("Creating a test case to validate the routing between applications. No action needed at this moment" from the dataset) is not in the list.
Can you see anything I'm doing wrong? What could I do to improve accuracy? Has anyone else experienced this kind of inaccuracy with Doc2vec's cosine-similarity lookups? If I hand-code the implementation (like this, for example), it does give the correct answers, which are completely different from Doc2vec's (but actually accurate).
python nlp similarity text similar-documents
asked Nov 6 '17 at 11:03 – lte__
Comment (lte__, Nov 6 '17 at 11:33): Funny thing is, when I calculate cosine similarity "by hand" (via hand-coded Python), it does show that the sentences are 80-90% similar. But Doc2vec won't find the similarity.
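(A bag-of-words version of that hand calculation might look like the sketch below; this illustrates the idea and is not the exact code from the comment:)

```python
from collections import Counter
import math

def cosine_sim(s1, s2):
    # Cosine similarity between two sentences over raw word counts
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

print(cosine_sim(
    "creating test case validate routing applications action needed moment",
    "test case creation validation routing applications action needed",
))  # high word overlap -> high similarity, unlike the Doc2vec scores above
```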