What's the best classification model for this recommendation engine?
























$begingroup$


I'm not a data scientist, but I'm trying to implement a recommendation engine at my company. My application runs on PHP, but I'll use Python to process this data.



My company is an online school with 40 online courses as of now. I have a CSV file with the preferences of around 30k users, and it looks like this:



[Image: Dataframe]



0 means that the user is not subscribed (which I treat as no interest), while 1 means subscribed (interested).



My idea is to compare a single user's array, such as [0,1,0,0,0,1,1...], with all this data and return a score for each course giving the probability that this user is interested in it.



I was thinking of using a Multinomial Logistic Regression, but as far as I know (and I don't know much) it would return a binary result, right?



What classification model would you recommend? Ideally, my result should be something like:



[0.95, 0.1, 0.54, 0.3, 0.87...]



Cheers!



















$endgroup$















  • $begingroup$
    Formulate the problem as a Collaborative filtering task.
    $endgroup$
    – Fadi Bakoura
    May 21 '18 at 14:52










  • $begingroup$
    Thanks @FadiBakoura, I will research this and let you know.
    $endgroup$
    – grpaiva
    May 21 '18 at 18:03












  • $begingroup$
    Can you include more information about the user (sex, age, ...)? A single 18-year-old user may like a course that a 50-year-old does not.
    $endgroup$
    – Intruso
    Aug 20 '18 at 13:50










  • $begingroup$
    This seems like a prediction problem rather than a classification one, so perhaps a neural network? Have you tried loading this data into Orange3? You could test out your models pretty quickly there. Orange3 uses scikit-learn, so once you find your workflow, you can use Python. By the way, if it is a neural network solution, TensorFlow has PHP bindings, so you could do the whole thing in PHP. Both may save you time.
    $endgroup$
    – davmor
    Nov 18 '18 at 11:10
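
The collaborative-filtering suggestion in the comments above could be sketched roughly as follows. This is only a minimal item-based illustration, assuming the CSV has been loaded into a 0/1 user-by-course NumPy array; the function names `item_similarity` and `score_courses` and the toy data are made up for the example, and the resulting values are similarity-based scores between 0 and 1, not calibrated probabilities.

```python
import numpy as np

def item_similarity(subs: np.ndarray) -> np.ndarray:
    """Cosine similarity between the course columns of a 0/1 user-by-course matrix."""
    subs = subs.astype(float)
    norms = np.linalg.norm(subs, axis=0)
    norms[norms == 0] = 1.0            # avoid dividing by zero for courses nobody takes
    sim = (subs.T @ subs) / np.outer(norms, norms)
    np.fill_diagonal(sim, 0.0)         # a course should not recommend itself
    return sim

def score_courses(user: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """Score each course as a similarity-weighted average of the user's subscriptions."""
    weights = sim.sum(axis=1)
    weights[weights == 0] = 1.0
    scores = (sim @ user) / weights
    scores[user == 1] = 0.0            # hide courses the user already has
    return scores

# Toy data (hypothetical): 5 users x 4 courses.
subs = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
])
sim = item_similarity(subs)
scores = score_courses(np.array([1, 0, 0, 0]), sim)
print(scores.round(2))
```

With 40 courses the similarity matrix is only 40 x 40, so it could be precomputed once in Python and the per-user scoring step kept cheap.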

























python recommender-system multiclass-classification
















asked May 21 '18 at 14:34









grpaiva

















1 Answer












$begingroup$

Without more information about your dataset, it's impossible to recommend one particular classifier over another.



If you want your classifier to return a vector of probabilities and you're using scikit-learn (sklearn), you can use the predict_proba method.



Here's an example:



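from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

# Keep only the digits 0 and 1, which gives a binary problem.
digits = load_digits(n_class=2)
model = LogisticRegression(max_iter=1000).fit(digits.data, digits.target)
# predict_proba returns one column per class; column 1 is P(class == 1).
preds = model.predict_proba(digits.data)
print([row[1] for row in preds])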














$endgroup$













  • $begingroup$
    Thanks for your answer @Lupacante! What I don't get here is that when I print digits.data.shape and digits.target.shape I get (360, 64) and (360,). Shouldn't the target shape be something like (64,)? My dataset's shapes look like this: (27920, 46) and (46,). I'm getting an error: ValueError: Found input variables with inconsistent numbers of samples: [27920, 46]
    $endgroup$
    – grpaiva
    May 21 '18 at 18:02












  • $begingroup$
    The predictors and target from the training set should have the same number of rows. The first number in the tuple returned by shape gives you the number of rows, so (360, 64) and (360,) is exactly what we'd expect.
    $endgroup$
    – marco_gorelli
    May 22 '18 at 8:16
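
To make the shape requirement in the comments above concrete for the subscription data, here is one possible sketch. It rests on an assumption that is not part of the answer: one logistic regression is fitted per course, using the other courses as predictors, so that X has one row per user and y has one label per user. The random toy matrix and the helper `course_probabilities` are hypothetical stand-ins for the real CSV.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Made-up stand-in for the real CSV: 200 users x 6 courses of 0/1 flags.
subs = (rng.random((200, 6)) < 0.4).astype(int)
n_users, n_courses = subs.shape

models = []
for j in range(n_courses):
    X = np.delete(subs, j, axis=1)   # shape (n_users, n_courses - 1)
    y = subs[:, j]                   # shape (n_users,): one label per user
    models.append(LogisticRegression(max_iter=1000).fit(X, y))

def course_probabilities(user):
    """Return P(subscribed) for each course, given one user's 0/1 vector."""
    user = np.asarray(user)
    probs = np.empty(n_courses)
    for j, model in enumerate(models):
        x = np.delete(user, j).reshape(1, -1)     # same columns as used in training
        probs[j] = model.predict_proba(x)[0, 1]   # column 1 is P(label == 1)
    return probs

print(course_probabilities([0, 1, 0, 0, 1, 1]).round(2))
```

A target of shape (46,) has one value per column rather than one per user, which is what triggers the ValueError quoted above. Note that a course with no subscribers (or with everyone subscribed) would make `fit` fail in this sketch and would need separate handling.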






















answered May 21 '18 at 14:47









marco_gorelli











