Skewed two-class data set
Is there any theory on the influence of skew in the data set on the performance of binary classifiers? At work we are doing abuse detection: the negative population is regular logins and the positive population is attack logins (account takeover = ATO).
However, the frequency of ATO logins is 1/50,000 or less, so we have a very skewed natural data set. Should I "unskew" my training data set by downsampling the legit logins? How much can I do that and still keep a model that will work well on the actual data? Is there any theory behind that?
classification dataset anomaly-detection unbalanced-classes
This is a rather common problem in anomaly detection, and downsampling the legit logins is a reasonable first try. As long as the data stays representative it can still work well (as long as the legit pattern is very different from that of ATOs, etc.; sorry, not a domain expert).
– The Lyrist
Oct 24 '18 at 16:23
Sure - but I want to understand why "it is reasonable", whether there is a solid theory behind it, and/or what happens at various levels of downsampling.
– Frank
Oct 24 '18 at 21:03
2 Answers
This is typically called a class imbalance problem: one label occurs so infrequently that predictions become unreliable.
For instance, if I know it rains in Vancouver, Canada 85% of the time in winter, I could simply predict "raining" whenever it is winter in Vancouver. You don't want your algorithm to favour one label over another just because that label predominates.
One common strategy is resampling. If you have enough data, downsampling usually makes more sense than oversampling (which requires creating synthetic data, e.g. with SMOTE), so that the algorithm can properly learn the difference between the two classes and won't favour one over the other. A 50-50 split between the two classes is probably a good starting point, but it also depends on what is available. You still want your negative labels to be representative, and even an 80-20 split would already be a vast improvement.
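As a minimal sketch of what random downsampling could look like (the toy DataFrame and the `is_ato` column name are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real login data: is_ato = 1 marks an account-takeover
# login, 0 a legitimate one (column names are made up for this example).
rng = np.random.default_rng(0)
logins = pd.DataFrame({
    "feature": rng.normal(size=100_000),
    "is_ato": (rng.random(100_000) < 1 / 5_000).astype(int),  # rare positives
})

def downsample_negatives(df, ratio=1.0, label_col="is_ato", seed=0):
    """Keep every positive and randomly sample `ratio` negatives per positive."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    sampled = neg.sample(n=n_keep, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, sampled]).sample(frac=1, random_state=seed)

balanced = downsample_negatives(logins, ratio=1.0)  # roughly a 50-50 split
softer = downsample_negatives(logins, ratio=4.0)    # roughly an 80-20 split
print(balanced["is_ato"].value_counts())
```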
Another common solution is to increase the penalty on incorrect predictions for the rare class. One way to think about it: a false positive means an admin spends time investigating a false alarm; a false negative means rogue activity goes undetected. Depending on the business, one cost can be far more severe than the other, so you could, say, make a false negative 1000x more costly than a false positive.
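Most libraries expose this as class or sample weights. A hedged sketch using scikit-learn's `class_weight` parameter (the synthetic data and the 1000:1 weighting are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data standing in for login features.
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.999], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Make a missed attack (class 1) 1000x more costly than a false alarm;
# class_weight="balanced" would instead weight classes inversely to frequency.
clf = LogisticRegression(class_weight={0: 1, 1: 1000}, max_iter=1000)
clf.fit(X_train, y_train)
```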
Not knowing your data, those are probably the first two things to try (separately or together). Most ML packages handle either strategy rather easily, which is why I think they are reasonable things to try.
There are many thorough articles and tutorials with additional strategies; searching for the keywords "anomaly detection" and "class imbalance" gives pretty good results.
Note that it is not just the actual data you would need, but also the business context, as you point out in your third paragraph: depending on that context, false positives or false negatives can be more or less costly. The lesson here seems to be that if some event is rare, we can't really include that fact in the learning, as the model will just be swamped by the abundance of the other class. I was somehow hoping to include "attacks are rare" in the learning.
– Frank
Oct 24 '18 at 22:07
Also your answer is very interesting, but doesn't quite get at what I was after: why is it ok to change the data set statistics and still expect good performance on the actual problem data set? Does it depend on the type of model? For example, are GBDTs tolerant to class imbalance, whereas logistic regression would perform poorly on real data if you intentionally weighted the training data set? I'll check out "class imbalance".
– Frank
Oct 24 '18 at 22:11
@Frank what is important is whether the training data is representative of the actual data. Downsampling in this sense essentially means that instead of giving the model 10,000,000 negative samples vs 200 positive samples, you give it 200 of each, etc. If your 200 are representative enough of your actual data, and your model generalizes well, it is actually ok not to include all the available data in your training set.
– The Lyrist
Oct 24 '18 at 22:17
In your case you should use precision and recall as your error metrics to get more insight, because with skewed data sets plain accuracy is a poor measure of performance.
You can refer to this link for the details; I personally found it helpful.
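As a minimal sketch of what that looks like with scikit-learn (the synthetic data and the random-forest model are placeholders, not the asker's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic skewed data: roughly 0.2% positives, standing in for ATO logins.
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.998], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Accuracy would look excellent even for a model that never flags an attack;
# per-class precision and recall tell the real story.
print(classification_report(y_test, pred, digits=3))
```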
Happy to answer!