Different approaches of creating the test set


























I came across different approaches to creating a test set. In theory it's quite simple: pick some instances at random, typically 20% of the dataset, and set them aside. Below are the approaches.




The naive way of creating the test set is:




    import numpy as np

    def split_train_test(data, test_set_ratio):
        # shuffle the row positions, then slice off the test portion
        shuffled_indices = np.random.permutation(len(data))
        test_set_size = int(len(data) * test_set_ratio)
        test_set_indices = shuffled_indices[:test_set_size]
        train_set_indices = shuffled_indices[test_set_size:]
        return data.iloc[train_set_indices], data.iloc[test_set_indices]
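To see the non-determinism directly, here is a minimal, self-contained sketch (the function is repeated so the snippet runs on its own; the 100-row DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd

def split_train_test(data, test_set_ratio):
    # shuffle the row positions, then slice off the test portion
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_set_ratio)
    test_set_indices = shuffled_indices[:test_set_size]
    train_set_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_set_indices], data.iloc[test_set_indices]

# Hypothetical data: 100 rows, 20% test ratio
data = pd.DataFrame({"x": range(100)})
train1, test1 = split_train_test(data, 0.2)
train2, test2 = split_train_test(data, 0.2)

print(len(test1), len(test2))  # 20 20
# each call reshuffles, so test1.index and test2.index generally differ
```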


The above splitting mechanism works, but if the program is run again and again it generates a different split each time, so over time the machine learning algorithm gets to see all the examples. The solutions to this problem suggested by the author of the book were:




  1. Save the test set on the first run and then load it in subsequent runs

  2. Set the random number generator's seed (np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices
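My attempt to see the weakness of solution 2 concretely: seeding only reproduces the shuffle for a fixed dataset length. In this sketch (the row counts 1000 and 1100 are made up), appending rows changes the whole permutation, so rows that used to be in the test set leak into training:

```python
import numpy as np

def seeded_test_indices(n_rows, test_ratio=0.2, seed=42):
    # same seed, but the permutation depends on n_rows
    np.random.seed(seed)
    shuffled = np.random.permutation(n_rows)
    return set(shuffled[: int(n_rows * test_ratio)])

old_test = seeded_test_indices(1000)   # original dataset
new_test = seeded_test_indices(1100)   # dataset after an update adds 100 rows

leaked = old_test - new_test           # former test rows now in the training set
print(len(old_test), len(new_test), len(leaked))
```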


But both of the above solutions break when we fetch the next, updated dataset. I am still not clear about this statement.




Can someone give me an intuition for how the above two solutions break when we fetch the next updated dataset?




Then the author came up with another, more reliable approach to creating the test set.



    def split_train_test_by_id(data, test_ratio, id_column):
        ids = data[id_column]
        in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
        return data.loc[~in_test_set], data.loc[in_test_set]



Approach #1




    import hashlib
    import numpy as np

    def test_set_check(identifier, test_ratio, hash=hashlib.md5):
        return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio



Approach #2




    from zlib import crc32
    import numpy as np

    def test_set_check(identifier, test_ratio):
        return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32



In approaches #1 and #2, why are we making use of crc32, the 0xffffffff mask, and a bytearray?




Just out of curiosity, I passed different values of the identifier variable into hash_function(np.int64(identifier)).digest() and got different results each time.




Is there any intuition behind these results?











Tags: machine-learning, python, preprocessing, numpy






asked Jun 13 '18 at 9:28 by James K J
























1 Answer



















It gets a little complicated; I've attached links at the end of the answer that explain it further.



          import hashlib
          import numpy as np

          def test_set_check(identifier, test_ratio, hash=hashlib.md5):
              return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio


          The hash(np.int64(identifier)).digest() part returns the hash digest of the identifier (which we first cast to an 8-byte integer with np.int64()).



          The bytearray call converts the digest into an array of bytes, and [-1] picks out the last byte of that array. This byte is a number between 0 and 255 (one byte = 11111111 in binary = 255 in decimal).



          Assuming our test_ratio is 0.20, an instance goes into the test set when its last hash byte is less than 256 * 0.20 = 51.2, i.e. less than or equal to 51.
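To make this concrete, a small sketch (the identifiers 0, 1, and 42 are arbitrary): the last digest byte is deterministic per id, always lies in 0..255, and roughly 20% of ids fall below the 51.2 cutoff:

```python
import hashlib
import numpy as np

def last_md5_byte(identifier):
    # hash the 8 raw bytes of the int64 id, keep the digest's final byte
    return bytearray(hashlib.md5(np.int64(identifier)).digest())[-1]

for i in [0, 1, 42]:
    b = last_md5_byte(i)
    print(i, b, b < 256 * 0.20)   # in the test set iff b <= 51

# roughly 20% of ids land in the test set
share = sum(last_md5_byte(i) < 256 * 0.20 for i in range(2000)) / 2000
print(share)
```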



          The rest is explained well in these links:



          https://stackoverflow.com/questions/50646890/how-does-the-crc32-function-work-when-using-sampling-data



          https://github.com/ageron/handson-ml/issues/71



          https://docs.python.org/3/library/hashlib.html






          answered Dec 26 '18 at 6:18 by Abhi, edited Dec 26 '18 at 6:24





























