Skewed two-class data set
Is there any theory on the influence of skew in the data set on the performance of binary classifiers? At work we are doing abuse detection: the negative population is regular logins and the positive population is attack logins (account takeover = ATO).
However, the frequency of ATO logins is 1/50,000 or less, so we have a very skewed natural data set. Should I "unskew" my training data set by downsampling the legit logins? How much can I do that and still keep a model that will work well on the actual data? Is there any theory behind that?
classification dataset anomaly-detection unbalanced-classes
This is a rather common problem in anomaly detection, and downsampling the legit logins is a reasonable first try. As long as the data stays representative it can still work well (as long as the legit pattern is very different from that of ATOs, etc.; sorry, not a domain expert).
– The Lyrist
Oct 24 '18 at 16:23
Sure - but I want to understand why "it is reasonable", whether there is a solid theory behind it, and/or what happens at various levels of downsampling.
– Frank
Oct 24 '18 at 21:03
2 Answers
This is typically called a class imbalance problem: one label occurs so infrequently that predictions become unreliable.
For instance, if I know it rains in Vancouver, Canada 85% of the time in winter, I could simply predict "raining" whenever it is winter in Vancouver. You don't want your algorithm to favour one label over another just because that label predominates.
One common strategy is resampling. If you have enough data, downsampling usually makes more sense than oversampling (which requires creating synthetic data, e.g. with SMOTE), so that the algorithm can properly learn the difference between the two classes and won't favour one over the other. A 50-50 split between the two classes is probably a good starting point, but it also depends on what is available. You still want your negative labels to be representative, and even an 80-20 split would already be a vast improvement.
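As a minimal sketch of what random downsampling could look like (the toy DataFrame and the `is_ato` column name are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real login data: is_ato = 1 marks an account-takeover
# login, 0 a legitimate one (column names are made up for this example).
rng = np.random.default_rng(0)
logins = pd.DataFrame({
    "feature": rng.normal(size=100_000),
    "is_ato": (rng.random(100_000) < 1 / 5_000).astype(int),  # rare positives
})

def downsample_negatives(df, ratio=1.0, label_col="is_ato", seed=0):
    """Keep every positive and randomly sample `ratio` negatives per positive."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    sampled = neg.sample(n=n_keep, random_state=seed)
    # Shuffle so positives and negatives are interleaved for training.
    return pd.concat([pos, sampled]).sample(frac=1, random_state=seed)

balanced = downsample_negatives(logins, ratio=1.0)  # roughly a 50-50 split
softer = downsample_negatives(logins, ratio=4.0)    # roughly an 80-20 split
print(balanced["is_ato"].value_counts())
```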
Another common solution is to increase the penalty on incorrect predictions for the rare class. One way to think about it: a false positive means an admin spends time investigating a false alarm; a false negative means rogue activity goes undetected. Depending on the business, one cost can be far more severe than the other, so you could, say, make a false negative 1000x more costly than a false positive.
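Most libraries expose this as class or sample weights. A hedged sketch using scikit-learn's `class_weight` parameter (the synthetic data and the 1000:1 weighting are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data standing in for login features.
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.999], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Make a missed attack (class 1) 1000x more costly than a false alarm;
# class_weight="balanced" would instead weight classes inversely to frequency.
clf = LogisticRegression(class_weight={0: 1, 1: 1000}, max_iter=1000)
clf.fit(X_train, y_train)
```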
Not knowing your data, those are probably the first two things to try (separately or together). Most ML packages handle either strategy rather easily, which is why I think they are reasonable things to try.
There are many thorough articles and tutorials with additional strategies; searching for the keywords "anomaly detection" and "class imbalance" gives pretty good results.
Note that it is not just the actual data you would need, but also the business context, as you point out in your third paragraph: depending on that context, false positives or false negatives can be more or less costly. The lesson here seems to be that if some event is rare, we can't really include that fact in the learning, as the model will just be swamped by the abundance of the other class. I was somehow hoping to include "attacks are rare" in the learning.
– Frank
Oct 24 '18 at 22:07
Also your answer is very interesting, but doesn't quite get at what I was after: why is it ok to change the data set statistics and still expect good performance on the actual problem data set? Does it depend on the type of model? For example, are GBDTs tolerant to class imbalance, whereas logistic regression would perform poorly on real data if you intentionally weighted the training data set? I'll check out "class imbalance".
– Frank
Oct 24 '18 at 22:11
@Frank what is important is whether the training data is representative of the actual data. Downsampling in this sense essentially means that instead of giving the model 10,000,000 negative samples vs 200 positive samples, you give it 200 of each, etc. If your 200 are representative enough of your actual data, and your model generalizes well, it is actually ok not to include all the available data in your training set.
– The Lyrist
Oct 24 '18 at 22:17
In your case you should use precision and recall as your error metrics to get more insight, because with skewed data sets plain accuracy is a poor measure of performance.
You can refer to this link for the details; I personally found it helpful.
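As a minimal sketch of what that looks like with scikit-learn (the synthetic data and the random-forest model are placeholders, not the asker's pipeline):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic skewed data: roughly 0.2% positives, standing in for ATO logins.
X, y = make_classification(n_samples=50_000, n_features=10,
                           weights=[0.998], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Accuracy would look excellent even for a model that never flags an attack;
# per-class precision and recall tell the real story.
print(classification_report(y_test, pred, digits=3))
```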
Happy to answer!