Equivalence of Tidy Data and Third Normal Form
$begingroup$
In Hadley Wickham's "Tidy Data" paper, he states that
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.
Codd's Third Normal Form can be described as:
A table is in 1st normal form if
- It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
- Each column contains atomic values, and there are no repeating groups of columns.
A table is in 2nd normal form if
- The table is in 1st normal form, and
- All the non-key columns are dependent on the table’s primary key.
A table is in 3rd normal form if
- It is in 2nd normal form, and
- It contains only columns that are non-transitively dependent on the primary key
I am trying to understand how these two sets of rules are equivalent.
I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".
I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".
I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.
I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.
My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?
dataset data data-cleaning normalization
$endgroup$
add a comment |
$begingroup$
In Hadley Wickham's "Tidy Data" paper, he states that
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.
Codd's Third Normal Form can be described as:
A table is in 1st normal form if
- It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
- Each column contains atomic values, and there are no repeating groups of columns.
A table is in 2nd normal form if
- The table is in 1st normal form, and
- All the non-key columns are dependent on the table’s primary key.
A table is in 3rd normal form if
- It is in 2nd normal form, and
- It contains only columns that are non-transitively dependent on the primary key
I am trying to understand how these two sets of rules are equivalent.
I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".
I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".
I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.
I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.
My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?
dataset data data-cleaning normalization
$endgroup$
add a comment |
$begingroup$
In Hadley Wickham's "Tidy Data" paper, he states that
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.
Codd's Third Normal Form can be described as:
A table is in 1st normal form if
- It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
- Each column contains atomic values, and there are no repeating groups of columns.
A table is in 2nd normal form if
- The table is in 1st normal form, and
- All the non-key columns are dependent on the table’s primary key.
A table is in 3rd normal form if
- It is in 2nd normal form, and
- It contains only columns that are non-transitively dependent on the primary key
I am trying to understand how these two sets of rules are equivalent.
I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".
I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".
I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.
I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.
My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?
dataset data data-cleaning normalization
$endgroup$
In Hadley Wickham's "Tidy Data" paper, he states that
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints
framed in statistical language, and the focus put on a single dataset
rather than the many connected datasets common in relational
databases.
Codd's Third Normal Form can be described as:
A table is in 1st normal form if
- It stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row.
- Each column contains atomic values, and there are no repeating groups of columns.
A table is in 2nd normal form if
- The table is in 1st normal form, and
- All the non-key columns are dependent on the table’s primary key.
A table is in 3rd normal form if
- It is in 2nd normal form, and
- It contains only columns that are non-transitively dependent on the primary key
I am trying to understand how these two sets of rules are equivalent.
I believe that the first Tidy Data rule maps to 1st normal form. Specifically, "Each variable forms a column" maps to "Each column contains atomic values, and there are no repeating groups of columns".
I believe the Tidy Data rule "Each observation forms a row" maps to "All the non-key columns are dependent on the table’s primary key" and also "It contains only columns that are non-transitively dependent on the primary key".
I suspect that "Each type of observational unit forms a table" maps to "The table stores information in rows and columns where one or more columns, called the primary key, uniquely identify each row", but I'm least sure about this mapping.
I believe my above analysis is not (quite) correct, so I'm hoping someone can clarify the link between these two sets of rules.
My question is: How are the three aforementioned Tidy Data rules equivalent to Third Normal Form?
dataset data data-cleaning normalization
dataset data data-cleaning normalization
edited Dec 27 '18 at 6:01
Greg Thatcher
asked Dec 27 '18 at 4:43
Greg ThatcherGreg Thatcher
27116
27116
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
I believe it goes a like this, based on thinking about this myself:
Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.
The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:
The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.
Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.
I think someone could probably do a more thorough proof with more time.
HTH
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f43184%2fequivalence-of-tidy-data-and-third-normal-form%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
I believe it goes a like this, based on thinking about this myself:
Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.
The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:
The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.
Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.
I think someone could probably do a more thorough proof with more time.
HTH
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
I believe it goes a like this, based on thinking about this myself:
Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.
The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:
The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.
Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.
I think someone could probably do a more thorough proof with more time.
HTH
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
I believe it goes a like this, based on thinking about this myself:
Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.
The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:
The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.
Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.
I think someone could probably do a more thorough proof with more time.
HTH
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
I believe it goes a like this, based on thinking about this myself:
Hadley's rule 1 & 2 (call it TR1 & TR2) is equivalent to 1NF with the exception that Hadley's rule 1 would permit repetition of columns. All we are saying in TR1 & TR2 is that we have a single value in each cell of our table, although I think that is dependent on how you interpret observation. If an observation is an instance of a specific object then this is a stronger statement, although TR3 suggests that is not the case I think.
The combination of Hadley's rule 1, 2 and 3 is equivalent to 3NF. Using this link which also explains transitive dependence:
The table with UnitCode, UnitName, CourseCode, CourseName is in 2NF but violates TR3 since course and unit are different objects. The tables in 3NF where course and unit are separated and linked by foreign key comply with TR1, TR2 and TR3.
Let's say we want to prove implication and assume we have TR1, TR2 and TR3. Since those say nothing about repeating columns I think we have to either assume no repetition of columns or accept that these are not equivalent. Let's assume no repetition of columns. From the definitions TR1 & TR2 => 1NF. If we have a table that is not in 2NF then TR3 is violated using the logic in the example above. Therefore we have 2NF. Similarly if we have a table that is not in 3NF then TR3 is violated, again using the logic above. Therefore TR1 & TR2 & TR3 => 3NF.
I think someone could probably do a more thorough proof with more time.
HTH
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
answered 2 days ago
soundofsilencesoundofsilence
1
1
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
soundofsilence is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
add a comment |
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f43184%2fequivalence-of-tidy-data-and-third-normal-form%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown