shuffling ML training data?

Does shuffling ML training data lead to better results?

I don't have much experience here, but I have been reading a post on pythonprogramming.net about this topic.

I copied this function from the post and modified it to just save my shuffled data to a CSV file.

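import numpy as np  # assumes `df` is an existing pandas DataFrame

def Randomizing():
    # Reorder the rows by a random permutation of the index labels
    df2 = df.reindex(np.random.permutation(df.index))
    # Raw string so the backslashes in the Windows path are not read as escapes
    df2.to_csv(r'C:\Users\Machine-Learning-Electric-Data\randomized.csv')

Randomizing()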

What appears to happen is that only the index gets shuffled and all other data stays the same. My pandas DataFrame has many columns, and each row's values need to stay together (I want to randomly shuffle entire rows; it is time series data). If shuffling is beneficial, can someone give me a tip on how to shuffle the rows themselves rather than just the index?

asked 1 hour ago

HenryHub

1134

this question could easily be googled... one convenient way is df2.sample(frac=1.0)
– oW_
1 hour ago

Thanks for the tips, I am running a ML regression experiment and shuffling the data cuts the rmse in half
– HenryHub
1 hour ago

machine-learning python scikit-learn pandas


1 Answer

Shuffling the training data is generally good practice during initial preprocessing steps.
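For the pandas DataFrame in the question, shuffling whole rows could look roughly like the sketch below. The small `df` and its column names are made up for illustration; `sample(frac=1.0)` is the approach suggested in the comments, and the second option mirrors the question's `reindex` call. Both should move each row as a unit, so the values within a row stay together.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's DataFrame `df`
df = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", periods=6, freq="D"),
    "kw": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Option 1: draw every row once, without replacement, in random order
shuffled = df.sample(frac=1.0, random_state=0)

# Option 2: reindex by a random permutation of the index labels
rng = np.random.default_rng(0)
shuffled2 = df.reindex(rng.permutation(df.index.to_numpy()))

# Every original row is still present, and each index label
# still carries the same values as before the shuffle.
for result in (shuffled, shuffled2):
    assert sorted(result.index) == list(df.index)
    assert all(result.loc[i, "kw"] == df.loc[i, "kw"] for i in df.index)
```

If the shuffled frame is written out with to_csv, the original index labels are saved as the first column, which makes it possible to check afterwards that the rows, and not just the labels, changed order.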

If you split the data sequentially, for example with train_test_split(..., shuffle=False) or by slicing off the last 25% of the rows as a test set, the split ignores any class ordering in the original dataset (train_test_split itself shuffles by default and uses a 75% / 25% split unless told otherwise). Consider class labels that, like those in the iris data set, are stored sorted by class:

For example: [0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3]

Splitting this data without shuffling could lead to very poor performance in your test set evaluation. Said another way, a sequential 75% / 25% split puts the first nine labels in the training set, which then covers classes 0, 1, and 2 but holds only two examples of class 3, while the test set contains nothing but class 3. Shuffling is especially useful for classification tasks, but it may help other ML tasks as well. However, each situation is different, so the best idea is to try it both ways and see whether you get a significant improvement.
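As a rough illustration of the point above, the sketch below splits the example label list with and without shuffling using scikit-learn's train_test_split. The variable names and the random_state value are arbitrary choices for the example.

```python
from sklearn.model_selection import train_test_split

# The ordered label list from the example above
labels = [0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3]

# Sequential split: the first 75% goes to training, the last 25% to test
train_seq, test_seq = train_test_split(labels, test_size=0.25, shuffle=False)
train_seq, test_seq = list(train_seq), list(test_seq)
print(train_seq)  # [0, 0, 0, 1, 2, 2, 2, 3, 3]
print(test_seq)   # [3, 3, 3]

# Shuffled split (shuffle=True is the default); the class mix in each
# part now depends on the random draw rather than on the storage order
train_shuf, test_shuf = train_test_split(labels, test_size=0.25, random_state=0)
train_shuf, test_shuf = list(train_shuf), list(test_shuf)
print(train_shuf, test_shuf)
```

With only one example of class 1, even a shuffled split cannot put that class in both parts, so on small or imbalanced data it is worth checking the class counts after splitting.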

Hope this answers your question. Drop a comment if you would like any further clarification.

answered 51 mins ago

Ethan

15015
