Doc2vec to calculate cosine similarity - absolutely inaccurate


I'm trying to modify the gensim Doc2vec tutorial to calculate cosine similarity and to take Pandas dataframes as input instead of .txt documents. Given a new sentence, I want to find the most similar sentence in my data. However, after training, even when I query with a near-duplicate of a sentence that is in the dataset, the top results have low similarity scores and none of them is the sentence I paraphrased. For example, with "This is a nice cat you have." in the data I train Doc2vec on, the query "This cat you have is quite nice." does not bring up that first sentence as similar.



The data comes from an Excel sheet and looks roughly like this:



  Description                                                  | Group       | Number
0 Sent: This is a sentence                                     | Regular     | NUM1234
1 Sent: Another sentence                                       | Regular     | NUM1243
2 Sent: Basically all the input                                | Other group | NUM1278
3 Sent: Creating a test case to validate the routing between
  applications. No action needed at this moment                | Other group | NUM1287
...etc...


I have the following code (some code not needed for comprehension was trimmed):
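For reference, simplified sketches of the two trimmed helper functions (assumed implementations; the real ones may do a little more):

import re
from nltk.corpus import stopwords  # assumes the NLTK stopword corpus has been downloaded

STOPWORDS = set(stopwords.words("english"))

def removeGeneric(text):
    # strip the leading "Sent:" marker from a description
    return re.sub(r"^\s*Sent:\s*", "", text)

def normalize_text(text):
    # drop NLTK stopwords and words shorter than 2 characters
    return " ".join(w for w in text.split() if w not in STOPWORDS and len(w) >= 2)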



import datetime
import multiprocessing
from collections import namedtuple, OrderedDict

import gensim
import pandas as pd
from gensim.models.doc2vec import Doc2Vec

df = pd.read_excel("my_data.xls")

# removeGeneric() just strips "Sent:" from the beginning of each sentence
df["Description"] = df["Description"].apply(removeGeneric)

# normalize_text() removes stopwords defined in the nltk package and words shorter
# than 2 characters; done column-wise, because assigning into `row` inside
# iterrows() does not write back to the dataframe
df["Description"] = df["Description"].str.lower().apply(normalize_text)

SentimentDocument = namedtuple('SentimentDocument', 'words tags')

alldocs = []
for index, row in df.iterrows():
    words = gensim.utils.to_unicode(row["Description"]).split()
    tags = [row["Number"]]  # tag each document with its Number so most_similar() returns it
    alldocs.append(SentimentDocument(words, tags))

doc_list = alldocs[:]
cores = multiprocessing.cpu_count()
assert gensim.models.doc2vec.FAST_VERSION > -1, "This will be painfully slow otherwise"

simple_models = [
    # PV-DM w/ concatenation - window=5 (both sides) approximates paper's 10-word total window size
    Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DBOW
    Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2, workers=cores),
    # PV-DM w/ average
    Doc2Vec(dm=1, dm_mean=1, size=100, window=10, negative=5, hs=0, min_count=2, workers=cores),
]

# Speed up setup by sharing results of the 1st model's vocabulary scan
simple_models[0].build_vocab(alldocs) # PV-DM w/ concat requires one special NULL word so it serves as template
print(simple_models[0])
for model in simple_models[1:]:
    model.reset_from(simple_models[0])
    print(model)

models_by_name = OrderedDict((str(model), model) for model in simple_models)

from random import shuffle

alpha, min_alpha, passes = (0.025, 0.001, 20)
alpha_delta = (alpha - min_alpha) / passes

print("START %s" % datetime.datetime.now())

for epoch in range(passes):
    shuffle(doc_list)

    for name, train_model in models_by_name.items():
        # Train
        duration = 'na'
        train_model.alpha, train_model.min_alpha = alpha, alpha
        with elapsed_timer() as elapsed:  # elapsed_timer is the timing helper from the gensim IMDB notebook
            train_model.train(doc_list, total_examples=len(doc_list), epochs=1)
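    # the per-pass learning-rate decay was part of the trimmed code; presumably,
    # as in the gensim IMDB tutorial (alpha_delta is otherwise unused):
    alpha -= alpha_delta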

for model in simple_models:
    # note: testing with a sentence very similar to one in the original dataset
    new_sentence = "Test case creation to validation of routing between applications. No action needed"
    new_sentence = removeGeneric(new_sentence)
    new_sentence = normalize_text(new_sentence)
    print(model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=2))


For this I get the following output (each tuple is a document tag and its cosine similarity to the inferred query vector):



[('NUM1254', 0.3154909014701843), ('NUM5247', 0.2487245500087738)]
[('NUM3875', 0.20226456224918365), ('NUM3793', 0.1970052272081375)]
[('NUM3585', 0.13086965680122375), ('NUM3857', 0.1298370361328125)]
creating test case validate routing applications action needed moment


All the recommendations are completely unrelated; sentences like "site id plant address good owner electricity request approved number al district province" show up, while the sentence the query is actually close to ("Creating a test case to validate the routing between applications. No action needed at this moment", from the dataset) is not on the list.



Can you see anything I'm doing wrong? What could I do to improve accuracy? Has anyone else experienced this kind of inaccuracy in doc2vec's cosine-similarity predictions? If I hand-code the implementation (like this, for example), it does give the correct answers, which are completely different from doc2vec's (but actually accurate).
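For reference, the hand-coded check is essentially plain cosine similarity between the query's inferred vector and a stored document vector. A minimal sketch (my own illustration, not necessarily what the linked example does; note that infer_vector expects a list of tokens, and 'NUM1287' is the tag of the dataset row the query should match):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two 1-D vectors; 1.0 means identical direction
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# compare the query's inferred vector with the trained vector of the matching row
sim = cosine_similarity(model.infer_vector(new_sentence.split()),
                        model.docvecs['NUM1287'])
print(sim)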











Tags: python, nlp, similarity, text, similar-documents






asked Nov 6 '17 at 11:03 by lte__












  • Funny thing is, when I calculate cosine similarity "by hand" (via hand-coded Python), it does show that the sentences are 80-90% similar. But Doc2vec won't find the similarity. – lte__, Nov 6 '17 at 11:33

















