shuffling ML training data?
I was curious to know whether shuffling ML training data leads to better results.
Sorry, not a lot of wisdom here, but I have been reading a post from pythonprogramming.net on this topic.
I copied this function from the post and modified it to just save my shuffled data to a CSV file.
import numpy as np
def Randomizing():
    # Reorder the rows of the existing DataFrame df by a random permutation of its index labels
    df2 = df.reindex(np.random.permutation(df.index))
    # Raw string, so the backslashes in the Windows path are not read as escape sequences
    df2.to_csv(r'C:\Users\Machine-Learning-Electric-Data\randomized.csv')
Randomizing()
What appears to happen is that only the index gets shuffled and all other data stays the same. My pandas DataFrame has many columns, and I need each row to stay intact (the rows should be shuffled as whole units; it is time series data). If shuffling is beneficial, can someone give me a tip on how to randomly shuffle my data rather than just the index?
machine-learning python scikit-learn pandas
This question could easily be googled... one convenient way is df2.sample(frac=1.0)
– oW_ 1 hour ago
Thanks for the tips. I am running an ML regression experiment, and shuffling the data cuts the RMSE in half.
– HenryHub 1 hour ago
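The one-liner suggested in the comment above can be sketched as follows, next to the reindex approach from the question. This is only an illustration: the small DataFrame, its column names (`timestamp`, `load_kw`), and the seeds are made up, and the index is assumed to have unique labels. As far as pandas semantics go, both calls reorder whole rows, so each row's values travel with its index label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's time series DataFrame.
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2019-01-01", periods=6, freq="D"),
        "load_kw": [10.0, 11.5, 9.8, 12.1, 13.4, 12.9],
    }
)

# Option 1: draw every row (frac=1.0) without replacement, i.e. a row shuffle.
shuffled_sample = df.sample(frac=1.0, random_state=0)

# Option 2: the reindex approach from the question, with a seeded generator.
rng = np.random.default_rng(0)
shuffled_reindex = df.reindex(rng.permutation(df.index.to_numpy()))

# Sorting by index restores the original order, which shows that each row
# kept its own values while it was moved.
for shuffled in (shuffled_sample, shuffled_reindex):
    restored = shuffled.sort_index()
    assert restored["load_kw"].tolist() == df["load_kw"].tolist()
    assert restored["timestamp"].tolist() == df["timestamp"].tolist()

print(shuffled_sample)
```

If a fresh 0..n-1 index is wanted in the saved file, `shuffled_sample.reset_index(drop=True)` discards the old labels before writing.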
asked 1 hour ago by HenryHub
1 Answer
Shuffling the training data is generally good practice during the initial preprocessing steps.
When you split a dataset 75% / 25% in its original order, the split may ignore how the classes are ordered. Note that scikit-learn's train_test_split shuffles by default; the problem arises with shuffle=False or a plain slice. For example, a dataset ordered by class label, much like the iris data set, could have target values like the following:
[0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3]
As this example shows, splitting the data without shuffling can lead to very poor performance when you evaluate on the test set. Put another way, the test data may end up containing only class 3, while classes 0, 1, and 2 appear only in the training data. Shuffling is especially useful for classification tasks, but it can help with other ML tasks as well. However, each situation is different, so the best idea is to try it both ways and see whether shuffling gives a significant improvement.
Hope this answers your question. Drop a comment if you would like any further clarification.
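The effect described in the answer above can be reproduced with a small sketch. It uses the example label list directly as the data, with a 25% test size and an arbitrary seed (both chosen here for illustration). Stratification is left out on purpose, because class 1 has a single member in this toy list.

```python
from sklearn.model_selection import train_test_split

# The ordered class labels from the example above.
labels = [0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3]

# No shuffling: the last 25% of the ordered labels becomes the test set.
train_ordered, test_ordered = train_test_split(
    labels, test_size=0.25, shuffle=False
)
print(train_ordered)  # [0, 0, 0, 1, 2, 2, 2, 3, 3]
print(test_ordered)   # [3, 3, 3]

# With shuffling (the scikit-learn default), the test set is drawn from
# random positions, so it is no longer tied to the tail of the ordering.
train_shuffled, test_shuffled = train_test_split(
    labels, test_size=0.25, shuffle=True, random_state=0
)
print(sorted(set(test_ordered)), sorted(set(test_shuffled)))
```

In the unshuffled split the test set holds only class 3, so a model evaluated on it is never scored on classes 0, 1, or 2.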
answered 51 mins ago by Ethan