shuffling ML training data?












I was curious to know whether shuffling ML training data leads to better results.

I don't have much background here, but I have been reading a post on pythonprogramming.net about this topic.

I copied this function from the post and modified it to just save my shuffled data to a CSV file.

    import numpy as np

    def Randomizing():
        df2 = df.reindex(np.random.permutation(df.index))
        # Raw string so the backslashes in the Windows path are not read as escapes
        df2.to_csv(r'C:\Users\Machine-Learning-Electric-Data\randomized.csv')

    Randomizing()

What appears to happen is that only the index gets shuffled and all other data stays the same. My pandas DataFrame has many columns, and I need each row to stay intact (that is, randomly shuffle whole rows; it's time series data). If shuffling is beneficial, can someone give me a tip on how to randomly shuffle my data, not just the index?












  • This question could easily be googled... one convenient way is df2.sample(frac=1.0)
    – oW_, 1 hour ago










  • Thanks for the tips. I am running an ML regression experiment, and shuffling the data cuts the RMSE in half.
    – HenryHub, 1 hour ago
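As a rough illustration of the two approaches mentioned above, the sketch below shuffles a small made-up frame with both `reindex` on a permuted index and `sample(frac=1.0)`. The column names, dates, and seed are assumptions invented for the example, not the asker's data. In both cases each row should keep its own values, with only the row order changing; because `to_csv` writes the index as the first column by default, the saved file can give the impression that only the index moved.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's time-series frame.
df = pd.DataFrame(
    {
        "load_kw": [10.0, 11.5, 9.8, 12.1, 10.7],
        "temp_c": [3.0, 4.5, 2.1, 6.3, 5.0],
    },
    index=pd.date_range("2019-01-01", periods=5, freq="D"),
)

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Approach from the question: permute the index labels, then reindex.
shuffled_a = df.reindex(rng.permutation(df.index.to_numpy()))

# Approach from the comment: draw every row once, without replacement.
shuffled_b = df.sample(frac=1.0, random_state=0)

# Check that every row still carries the values it had before shuffling.
rows_intact_a = all(shuffled_a.loc[ts].equals(df.loc[ts]) for ts in df.index)
rows_intact_b = all(shuffled_b.loc[ts].equals(df.loc[ts]) for ts in df.index)

print(rows_intact_a, rows_intact_b)
```

If the old index is not needed after shuffling, `shuffled_b.reset_index(drop=True)` discards it, and `to_csv(..., index=False)` leaves it out of the saved file.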
















machine-learning python scikit-learn pandas






asked 1 hour ago by HenryHub












1 Answer
Shuffling the training data is generally good practice during the initial preprocessing steps.

When you split the data 75% / 25% into training and test sets without shuffling (for example, by slicing, or by calling train_test_split with shuffle=False; scikit-learn's train_test_split shuffles by default), the split ignores any class ordering in the original dataset. For example, a dataset stored sorted by class, much like the iris dataset, would have target values resembling the following:

[0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3]

As this example shows, splitting your data without shuffling might lead to very poor performance in your test set evaluation. Said another way, your training data may consist mostly of classes 0, 1, and 2 while your test data contains only class 3. Shuffling is especially useful for classification tasks, but it may also help in other ML tasks. However, each situation is different, so the best idea is to try it both ways and see whether you get a significant improvement.

Hope this answers your question. Drop a comment if you would like any further clarification.
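To make the effect of class ordering concrete, here is a minimal sketch that runs scikit-learn's `train_test_split` on the label list from the answer, once with `shuffle=False` and once with shuffling. The feature array `X` and the `random_state` value are placeholders invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Class labels from the answer's example, stored sorted by class.
y = np.array([0, 0, 0, 1, 2, 2, 2, 3, 3, 3, 3, 3])
X = np.arange(len(y)).reshape(-1, 1)  # hypothetical single feature

# Ordered split: the last 25% of the rows become the test set.
_, _, y_train_ordered, y_test_ordered = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Shuffled split (shuffle=True is scikit-learn's default);
# random_state is an arbitrary seed chosen for reproducibility.
_, _, y_train_shuffled, y_test_shuffled = train_test_split(
    X, y, test_size=0.25, shuffle=True, random_state=0
)

print("ordered test labels: ", y_test_ordered)   # only class 3 appears here
print("shuffled test labels:", y_test_shuffled)  # depends on the seed
```

With so few samples a shuffled split can still miss a class in the test set; passing `stratify=y` to `train_test_split` is one option for keeping class proportions similar, provided every class has enough members to appear in both splits.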













answered 51 mins ago by Ethan





























