Shifted feature distribution across different datasets

I am trying to validate a classifier using two different training and testing datasets.

The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$

The problem is that, across the two different datasets, I see the same distribution of fold-change across groups but with different values. See, in the images attached, the values for each group (each box) in the different datasets (each plot).

First Second

What may be causing the differences in fold-change? In this case the fold-change is different also in sign, which means that while in one dataset $feat_a$ is greater than $feat_b$, the opposite is true for the other dataset. However, the pairwise relation between the two groups (the two boxes in each plot) is maintained.

I was thinking about different normalization procedures on each feature across different dataset, which may explain the shift of the distribution of the fold-change. What other problems may be in these datasets?

asked Sep 19 '16 at 12:21

gc5

4541614

bumped to the homepage by Community♦ yesterday

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25

add a comment |

I am trying to validate a classifier using two different training and testing datasets.

The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$

First Second

asked Sep 19 '16 at 12:21

gc5

4541614

bumped to the homepage by Community♦ yesterday

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25

add a comment |

I am trying to validate a classifier using two different training and testing datasets.

The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$

First Second

asked Sep 19 '16 at 12:21

gc5

4541614

I am trying to validate a classifier using two different training and testing datasets.

The feature I am considering is a feature constructed doing the fold-change between two original features, i.e. $log2(feat_A / feat_b)$

First Second

classification dataset distribution

asked Sep 19 '16 at 12:21

gc5

4541614

asked Sep 19 '16 at 12:21

gc5

4541614

asked Sep 19 '16 at 12:21

gc5

4541614

asked Sep 19 '16 at 12:21

gc5

4541614

asked Sep 19 '16 at 12:21

gc5

4541614

bumped to the homepage by Community♦ yesterday

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

bumped to the homepage by Community♦ yesterday

This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.

$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25

add a comment |

$begingroup$
Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?
$endgroup$
– The Lyrist
Jun 12 '18 at 14:25

Did you stratify your data? Also, sorry but only trying to double check, do you have any sequence numeric I’d field or date feature that might potentially be mislabeled and influence the result?

– The Lyrist
Jun 12 '18 at 14:25

add a comment |

1 Answer
1

active

oldest

votes

It'd be easier to understand if the plots were labeled. Nonetheless, I think I understand the question, and perhaps it's just sampling error. Why do you have two different pairs of train and test data? Are they from different sources? In that case, you probably don't want to be using the same classifier on both. Also, what do you mean 'validate' using two different train and test sets? Classifiers are trained on a train set, and validated on a test set, what do you mean you're validating on two different train/test pairs?

answered Sep 20 '16 at 4:07

Ben F

1038

$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "557"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f14090%2fshifted-feature-distribution-across-different-datasets%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

answered Sep 20 '16 at 4:07

Ben F

1038

$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26

add a comment |

answered Sep 20 '16 at 4:07

Ben F

1038

$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26

add a comment |

answered Sep 20 '16 at 4:07

Ben F

1038

answered Sep 20 '16 at 4:07

Ben F

1038

answered Sep 20 '16 at 4:07

Ben F

1038

answered Sep 20 '16 at 4:07

Ben F

1038

answered Sep 20 '16 at 4:07

Ben F

1038

$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26

add a comment |

$begingroup$
In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.
$endgroup$
– gc5
Sep 20 '16 at 7:26

In each plot, each boxplot represents a group of samples. My task is to classify each sample correctly in the first or in the second group. If I train the classifier on the first dataset (first plot) I am unable to classify correctly the samples described by the second dataset (second plot). This is because a simple classifier would put a boundary in fold-change $= -2$ to split the distribution of the samples in the first plot (dataset). However, using the same classifier in the second plot (dataset) will fail to classify samples correctly, because all fold-changes are greater than $-2$.

– gc5
Sep 20 '16 at 7:26

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Data Science Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Htydjtk