Equivalence of Tidy Data and Third Normal Form












0












$begingroup$


In Hadley Wickham's "Tidy Data" paper, he states that




In tidy data:




  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.


This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.




Codd's Third Normal Form can be described as:





  1. A table is in 1st normal form if




    • It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.

    • Each column contains atomic values, and there are no repeating groups of columns.




  2. A table is in 2nd normal form if




    • The table is in 1st normal form, and

    • All the non-key columns are dependent on the table’s primary key.




  3. A table is in 3rd normal form if




    • It is in 2nd normal form, and

    • It contains only columns that are non-transitively dependent on the primary key




I am trying to understand how these two sets of rules are equivalent.



I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".



I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".



I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.



I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.



My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?










share|improve this question











$endgroup$

















    0












    $begingroup$


    In Hadley Wickham's "Tidy Data" paper, he states that




    In tidy data:




    1. Each variable forms a column.

    2. Each observation forms a row.

    3. Each type of observational unit forms a table.


    This is Codd’s 3rd normal form (Codd 1990), but with the constraints
    framed in statistical language, and the focus put on a single dataset
    rather than the many connected datasets common in relational
    databases.




    Codd's Third Normal Form can be described as:





    1. A table is in 1st normal form if




      • It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.

      • Each column contains atomic values, and there are no repeating groups of columns.




    2. A table is in 2nd normal form if




      • The table is in 1st normal form, and

      • All the non-key columns are dependent on the table’s primary key.




    3. A table is in 3rd normal form if




      • It is in 2nd normal form, and

      • It contains only columns that are non-transitively dependent on the primary key




    I am trying to understand how these two sets of rules are equivalent.



    I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".



    I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".



    I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.



    I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.



    My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?










    share|improve this question











    $endgroup$















      0












      0








      0





      $begingroup$


      In Hadley Wickham's "Tidy Data" paper, he states that




      In tidy data:




      1. Each variable forms a column.

      2. Each observation forms a row.

      3. Each type of observational unit forms a table.


      This is Codd’s 3rd normal form (Codd 1990), but with the constraints
      framed in statistical language, and the focus put on a single dataset
      rather than the many connected datasets common in relational
      databases.




      Codd's Third Normal Form can be described as:





      1. A table is in 1st normal form if




        • It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.

        • Each column contains atomic values, and there are no repeating groups of columns.




      2. A table is in 2nd normal form if




        • The table is in 1st normal form, and

        • All the non-key columns are dependent on the table’s primary key.




      3. A table is in 3rd normal form if




        • It is in 2nd normal form, and

        • It contains only columns that are non-transitively dependent on the primary key




      I am trying to understand how these two sets of rules are equivalent.



      I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".



      I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".



      I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.



      I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.



      My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?










      share|improve this question











      $endgroup$




      In Hadley Wickham's "Tidy Data" paper, he states that




      In tidy data:




      1. Each variable forms a column.

      2. Each observation forms a row.

      3. Each type of observational unit forms a table.


      This is Codd’s 3rd normal form (Codd 1990), but with the constraints
      framed in statistical language, and the focus put on a single dataset
      rather than the many connected datasets common in relational
      databases.




      Codd's Third Normal Form can be described as:





      1. A table is in 1st normal form if




        • It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.

        • Each column contains atomic values, and there are no repeating groups of columns.




      2. A table is in 2nd normal form if




        • The table is in 1st normal form, and

        • All the non-key columns are dependent on the table’s primary key.




      3. A table is in 3rd normal form if




        • It is in 2nd normal form, and

        • It contains only columns that are non-transitively dependent on the primary key




      I am trying to understand how these two sets of rules are equivalent.



      I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".



      I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".



      I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.



      I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.



      My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?







      dataset data data-cleaning normalization






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Dec 27 '18 at 6:01







      Greg Thatcher

















      asked Dec 27 '18 at 4:43









      Greg ThatcherGreg Thatcher

      27116




      27116






















          1 Answer
          1






          active

          oldest

          votes


















          0












          $begingroup$

          I believe it goes a like this, based on thinking about this myself:



          Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.



          The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:



          The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.



          Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.



          I think someone could probably do a more thorough proof with more time.



          HTH






          share|improve this answer








          New contributor




          soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
          Check out our Code of Conduct.






          $endgroup$













            Your Answer





            StackExchange.ifUsing("editor", function () {
            return StackExchange.using("mathjaxEditing", function () {
            StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
            StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
            });
            });
            }, "mathjax-editing");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "557"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: false,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: null,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f43184%2fequivalence-of-tidy-data-and-third-normal-form%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            1 Answer
            1






            active

            oldest

            votes








            1 Answer
            1






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            0












            $begingroup$

            I believe it goes a like this, based on thinking about this myself:



            Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.



            The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:



            The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.



            Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.



            I think someone could probably do a more thorough proof with more time.



            HTH






            share|improve this answer








            New contributor




            soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
            Check out our Code of Conduct.






            $endgroup$


















              0












              $begingroup$

              I believe it goes a like this, based on thinking about this myself:



              Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.



              The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:



              The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.



              Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.



              I think someone could probably do a more thorough proof with more time.



              HTH






              share|improve this answer








              New contributor




              soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
              Check out our Code of Conduct.






              $endgroup$
















                0












                0








                0





                $begingroup$

                I believe it goes a like this, based on thinking about this myself:



                Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.



                The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:



                The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.



                Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.



                I think someone could probably do a more thorough proof with more time.



                HTH






                share|improve this answer








                New contributor




                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                $endgroup$



                I believe it goes a like this, based on thinking about this myself:



                Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.



                The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:



                The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.



                Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.



                I think someone could probably do a more thorough proof with more time.



                HTH







                share|improve this answer








                New contributor




                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.









                share|improve this answer



                share|improve this answer






                New contributor




                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.









                answered 2 days ago









                soundofsilencesoundofsilence

                1




                1




                New contributor




                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.





                New contributor





                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






                soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
                Check out our Code of Conduct.






























                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Data Science Stack Exchange!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    Use MathJax to format equations. MathJax reference.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f43184%2fequivalence-of-tidy-data-and-third-normal-form%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown