Doc2vec to calculate cosine similarity - absolutely inaccurate












$begingroup$


I'm trying to modify the Doc2vec tutorial to calculate cosine similarity and to take Pandas dataframes instead of .txt documents as input. Given a new sentence, I want to find the most similar sentence in my data. However, after training, even when I supply a sentence almost identical to one in the dataset, the top results have low similarity scores and none of them is the sentence I modified. For example, the dataset I train Doc2vec on contains the sentence "This is a nice cat you have."; when I then use the new sentence "This cat you have is quite nice." as input, the first sentence does not come up as similar.



The data comes from an Excel sheet and looks roughly like this:



  | Description | Group | Number
0 | Sent: This is a sentence | Regular | NUM1234
1 | Sent: Another sentence | Regular | NUM1243
2 | Sent: Basically all the input | Other group | NUM1278
3 | Sent: Creating a test case to validate the routing between applications. No action needed at this moment | Other group | NUM1287
...etc...


I have the following code (some code not needed for comprehension was trimmed):



import datetime
import multiprocessing
from collections import OrderedDict, namedtuple

import gensim
import pandas as pd
from gensim.models import Doc2Vec

# removeGeneric(), normalize_text() and elapsed_timer() are defined in the trimmed code

df = pd.read_excel("my_data.xls")

# removeGeneric() just strips "Sent:" from the beginning of each sentence
df["Description"] = df["Description"].apply(lambda x: removeGeneric(x))
for index, row in df.iterrows():
    row["Description"] = row["Description"].lower()
    # normalize_text() removes stopwords defined in the nltk package and words shorter than 2 characters
    row["Description"] = normalize_text(row["Description"])

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
# PV-DM w/ concat requires one special NULL word so it serves as template
simple_models[0].build_vocab(alldocs)
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)

    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)

for model in simple_models:
    # Notice how I'm testing with a sentence very similar to one in the original dataset
    new_sentence = "Test case creation to validation of routing between applications. No action needed"
    new_sentence = removeGeneric(new_sentence)
    new_sentence = normalize_text(new_sentence)
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))


For this I get the following output:



[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment


All the recommendations are completely unrelated: sentences like "site id plant address good owner electricity request approved number al district province" show up, while the sentence the input is actually close to ("Creating a test case to validate the routing between applications. No action needed at this moment" from the dataset) is not on the list.
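One detail that may be worth checking in the lookup step: the training documents are stored as lists of words (via `.split()`), while the query is handed to `infer_vector` as a single string. gensim's `infer_vector` expects a list of word tokens, so a plain string would be read as a sequence of one-character "words" that the vocabulary is unlikely to contain. Whether this fully accounts for the results above is not established here; the sketch below only illustrates the difference. The helper name `tokenize_like_training` and the normalized query string are assumptions made for illustration.

```python
def tokenize_like_training(text):
    """Split a normalized string the same way the training documents were split."""
    return text.split()


# Assumed result of normalize_text() on the query; the real helper is not shown.
normalized_query = "test case creation validation routing applications action needed"

# Iterating over a plain string yields single characters, not words.
as_characters = list(normalized_query)[:4]
as_words = tokenize_like_training(normalized_query)[:4]
print(as_characters)  # ['t', 'e', 's', 't']
print(as_words)       # ['test', 'case', 'creation', 'validation']

# With a trained gensim model, the lookup would then be (not executed here):
# vector = model.infer_vector(tokenize_like_training(normalized_query))
# print(model.docvecs.most_similar(positive=[vector], topn=2))
```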



Can you see anything that I'm doing wrong? What could I do to improve accuracy? Has anyone else experienced this inaccuracy in Doc2vec's cosine similarity prediction? If I hand-code the implementation (like this, for example), it gives the correct answers, which are completely different from those Doc2vec returns (and actually accurate).
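For reference, a hand-coded cosine similarity of the kind mentioned above might look like the following minimal sketch. It assumes plain bag-of-words count vectors rather than learned embeddings; the function name `cosine_similarity` and the tokenized query are illustrative assumptions, not the linked implementation.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Dot product over the words the two documents share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


# Normalized dataset sentence as printed in the output above.
stored = "creating test case validate routing applications action needed moment".split()
# Assumed normalization of the query sentence (stopwords removed).
query = "test case creation validation routing applications action needed".split()

score = cosine_similarity(stored, query)
print(round(score, 3))  # 0.707
```

For these two token lists the score is about 0.71, because six of the words are shared between the nine-word stored sentence and the eight-word query.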










$endgroup$












  • Funny thing is, when I calculate cosine similarity "by hand" (via hand-coded Python), it does show that the sentences are 80-90% similar. But Doc2vec won't find the similarity.
    – lte__, Nov 6 '17 at 11:33









































python nlp similarity text similar-documents














asked Nov 6 '17 at 11:03, edited 3 hours ago
– lte__






















