Shifted feature distribution across different datasets












3












$begingroup$


I am trying to validate a classifier using two different training and testing datasets.



The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$



The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).



FirstSecond



What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.



I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?










share|improve this question









$endgroup$




bumped to the homepage by Community yesterday


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • $begingroup$
    Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
    $endgroup$
    – The Lyrist
    Jun 12 '18 at 14:25


















3












$begingroup$


I am trying to validate a classifier using two different training and testing datasets.



The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$



The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).



FirstSecond



What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.



I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?










share|improve this question









$endgroup$




bumped to the homepage by Community yesterday


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • $begingroup$
    Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
    $endgroup$
    – The Lyrist
    Jun 12 '18 at 14:25
















3












3








3





$begingroup$


I am trying to validate a classifier using two different training and testing datasets.



The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$



The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).



FirstSecond



What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.



I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?










share|improve this question









$endgroup$




I am trying to validate a classifier using two different training and testing datasets.



The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$



The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).



FirstSecond



What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.



I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?







classification dataset distribution






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Sep 19 '16 at 12:21









gc5gc5

4541614




4541614





bumped to the homepage by Community yesterday


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







bumped to the homepage by Community yesterday


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.














  • $begingroup$
    Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
    $endgroup$
    – The Lyrist
    Jun 12 '18 at 14:25




















  • $begingroup$
    Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
    $endgroup$
    – The Lyrist
    Jun 12 '18 at 14:25


















$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25






$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25












1 Answer
1






active

oldest

votes


















0












$begingroup$

It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?






share|improve this answer









$endgroup$













  • $begingroup$
    In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
    $endgroup$
    – gc5
    Sep 20 '16 at 7:26












Your Answer





StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f14090%2fshifted-feature-distribution-across-different-datasets%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









0












$begingroup$

It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?






share|improve this answer









$endgroup$













  • $begingroup$
    In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
    $endgroup$
    – gc5
    Sep 20 '16 at 7:26
















0












$begingroup$

It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?






share|improve this answer









$endgroup$













  • $begingroup$
    In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
    $endgroup$
    – gc5
    Sep 20 '16 at 7:26














0












0








0





$begingroup$

It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?






share|improve this answer









$endgroup$



It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?







share|improve this answer












share|improve this answer



share|improve this answer










answered Sep 20 '16 at 4:07









Ben FBen F

1038




1038












  • $begingroup$
    In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
    $endgroup$
    – gc5
    Sep 20 '16 at 7:26


















  • $begingroup$
    In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
    $endgroup$
    – gc5
    Sep 20 '16 at 7:26
















$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26




$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26


















draft saved

draft discarded




















































Thanks for contributing an answer to Data Science Stack Exchange!


  • Please be sure to answer the question. Provide details and share your research!

But avoid



  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.


Use MathJax to format equations. MathJax reference.


To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f14090%2fshifted-feature-distribution-across-different-datasets%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Callistus I

Tabula Rosettana

How to label and detect the document text images