Shifted feature distribution across different datasets
$begingroup$
I am trying to validate a classifier using two different training and testing datasets.
The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$
The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).


What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.
I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?
classification dataset distribution
$endgroup$
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
add a comment |
$begingroup$
I am trying to validate a classifier using two different training and testing datasets.
The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$
The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).


What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.
I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?
classification dataset distribution
$endgroup$
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25
add a comment |
$begingroup$
I am trying to validate a classifier using two different training and testing datasets.
The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$
The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).


What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.
I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?
classification dataset distribution
$endgroup$
I am trying to validate a classifier using two different training and testing datasets.
The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$
The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).


What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.
I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?
classification dataset distribution
classification dataset distribution
asked Sep 19 '16 at 12:21
gc5gc5
4541614
4541614
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
bumped to the homepage by Community♦ yesterday
This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25
add a comment |
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?
$endgroup$
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f14090%2fshifted-feature-distribution-across-different-datasets%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?
$endgroup$
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
add a comment |
$begingroup$
It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?
$endgroup$
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
add a comment |
$begingroup$
It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?
$endgroup$
It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?
answered Sep 20 '16 at 4:07
Ben FBen F
1038
1038
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
add a comment |
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26
add a comment |
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f14090%2fshifted-feature-distribution-across-different-datasets%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25