When should you balance a time series dataset?
I'm training a machine learning algorithm to classify up/down trends in a time series, and my dataset is imbalanced: one trend class appears far more often than the other. It seems necessary to balance the data, since the algorithm could otherwise learn a bias towards the more common trend, but this comes at the cost of a non-representative dataset. Should I balance my data? And if so, is random undersampling the right method?
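(By random undersampling I mean dropping majority-class samples at random until the classes are equally represented. A minimal sketch of what I have in mind, assuming a feature matrix X and label vector y as NumPy arrays; all names here are illustrative:)

import numpy as np

def random_undersample(X, y, seed=0):
    # Keep every minority-class sample; randomly drop majority-class
    # samples until both classes have the same count.
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # keep the surviving samples in temporal order
    return X[keep], y[keep]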
machine-learning classification time-series unbalanced-classes class-imbalance
asked Feb 22 '18 at 18:10 by Jonathan Shobrook, last edited Feb 22 '18 at 18:58
Which types of models do you use? Some models are less sensitive to imbalanced datasets. – Omri374, Feb 24 '18 at 20:41
@Omri374: I'm testing an LSTM network, an SVM, and a Random Forest classifier. – Jonathan Shobrook, Feb 25 '18 at 5:28
For SVMs and Random Forests, are you using a sliding window to create samples? If so, you can then perform sampling on the created windows. – Omri374, Feb 25 '18 at 9:07
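A minimal sketch of that sliding-window step, assuming a 1-D NumPy series and a hypothetical window_size (each window becomes one sample, labeled by the direction of the step that follows it):

import numpy as np

def make_windows(series, window_size):
    # Each run of window_size consecutive values becomes one sample;
    # the label is 1 if the value right after the window goes up, else 0.
    X, y = [], []
    for i in range(len(series) - window_size):
        window = series[i:i + window_size]
        label = int(series[i + window_size] > series[i + window_size - 1])
        X.append(window)
        y.append(label)
    return np.array(X), np.array(y)

Once the series is windowed like this, the windows can be resampled like ordinary rows, for example with the undersampling sketch in the question.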
For LSTMs, you could tweak the loss function. See here: stats.stackexchange.com/questions/197273/… and stackoverflow.com/questions/35155655/… – Omri374, Feb 25 '18 at 9:13
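One concrete way to tweak the loss along the lines of those threads is to reweight the classes instead of resampling; in Keras this can be passed through class_weight. A sketch with a hypothetical 4:1 imbalance (model, X_train, and y_train are assumed to already exist):

# Penalize mistakes on the rare class (here class 1) more heavily,
# in proportion to how underrepresented it is.
class_weight = {0: 1.0, 1: 4.0}  # hypothetical 4:1 down/up imbalance
model.fit(X_train, y_train, epochs=20, class_weight=class_weight)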
You might want to read this paper. – iso_9001_, 2 days ago
1 Answer
If you can change the loss function of the algorithm, that will be very helpful, and as a result you won't need to downsample your data. There are also many useful metrics that were introduced specifically for evaluating the performance of classifiers on imbalanced datasets, among them Kappa, CEN, MCEN, MCC, and DP.
Disclaimer: if you use Python, the PyCM module can help you compute these metrics.
Here is a simple snippet that asks the module which parameters it recommends for a given confusion matrix:
>>> from pycm import *
>>> cm = ConfusionMatrix(matrix={"Class1": {"Class1": 1, "Class2": 2}, "Class2": {"Class1": 0, "Class2": 5}})
>>> print(cm.recommended_list)
["Kappa", "SOA1(Landis & Koch)", "SOA2(Fleiss)", "SOA3(Altman)", "SOA4(Cicchetti)", "CEN", "MCEN", "MCC", "J", "Overall J", "Overall MCC", "Overall CEN", "Overall MCEN", "AUC", "AUCI", "G", "DP", "DPI", "GI"]
After that, any of these parameters that you want to use as your loss can be computed as follows:
>>> y_pred = model.predict(x_test)  # predictions of the fitted model (x_test assumed)
>>> y_actu = data.target            # true labels
>>> cm = ConfusionMatrix(y_actu, y_pred)
>>> loss = cm.Kappa                 # or any other parameter, e.g. cm.SOA1
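As an aside, if you already work with scikit-learn, a couple of these metrics are also available there directly; a minimal sketch, reusing y_actu and y_pred from above:

from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

# Cohen's kappa and the Matthews correlation coefficient both account
# for chance agreement, which makes them robust under class imbalance.
kappa = cohen_kappa_score(y_actu, y_pred)
mcc = matthews_corrcoef(y_actu, y_pred)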
answered 2 days ago by Alireza Zolanvari, edited yesterday