Do Random Forests overfit?

23

I have been reading around about Random Forests, but I cannot really find a definitive answer about the problem of overfitting. According to Breiman's original paper, they should not overfit when the number of trees in the forest is increased, but there seems to be no consensus about this, which is causing me quite some confusion.

Maybe someone more expert than I am can give me a more concrete answer or point me in the right direction to better understand the problem.

machine-learning random-forest

asked Aug 23 '14 at 16:54
markusian


  • 3
    All algorithms will overfit to some degree. It's not about picking something that doesn't overfit; it's about carefully considering the amount of overfitting and the form of the problem you're solving to maximize the more relevant metrics.
    – indico
    Aug 23 '14 at 18:16

  • 1
    ISTR that Breiman had a proof based on the Law of Large Numbers. Has someone discovered a flaw in that proof?
    – JenSCDC
    Aug 28 '14 at 1:18

  • @AndyBlankertz ISTR = internetslang.com/ISTR-meaning-definition.asp ?
    – Hack-R
    Nov 3 '15 at 3:15


















4 Answers
























18












Every ML algorithm with high complexity can overfit. However, the OP is asking whether an RF will not overfit when increasing the number of trees in the forest.

In general, ensemble methods reduce the prediction variance to almost nothing, improving the accuracy of the ensemble. If we define the variance of the expected generalization error of an individual randomized model as

$$\mathrm{Var}(x) = \sigma^2(x),$$

then the variance of the expected generalization error of an ensemble of $M$ such models corresponds to

$$\mathrm{Var}(x) = \rho(x)\,\sigma^2(x) + \frac{1 - \rho(x)}{M}\,\sigma^2(x),$$

where $\rho(x)$ is the Pearson correlation coefficient between the predictions of two randomized models trained on the same data with two independent seeds. If we increase the number of decision trees in the RF (larger $M$), the variance of the ensemble decreases whenever $\rho(x) < 1$. Therefore, the variance of the ensemble is strictly smaller than the variance of an individual model.

In a nutshell, increasing the number of individual randomized models in an ensemble will never increase the generalization error.

answered Oct 20 '14 at 9:31
tashuhka
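
To make the decomposition concrete, here is a tiny numerical illustration; the values of $\sigma^2(x)$ and $\rho(x)$ below are made up purely for the example:

    # Plug assumed values into Var = rho*sigma2 + (1 - rho)/M * sigma2
    sigma2 = 1.0   # variance of a single randomized tree (assumed)
    rho = 0.3      # correlation between two randomized trees (assumed)
    for M in [1, 10, 100, 1000]:
        var_ensemble = rho * sigma2 + (1.0 - rho) / M * sigma2
        print(M, round(var_ensemble, 4))
    # Prints 1.0, 0.37, 0.307, 0.3007: the variance shrinks toward
    # rho*sigma2 as M grows, and never increases when trees are added.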









  • 1




    That's definitely what Leo Breiman and the theory say, but empirically it seems like they definitely do overfit. For example, I currently have a model with a 10-fold CV MSE of 0.02, but when measured against the ground truth the MSE is 0.4. OTOH, if I reduce the tree depth and the number of trees, the model's performance improves significantly.
    – Hack-R
    Feb 18 '16 at 14:41








  • 3




    Reducing the tree depth is a different case, because you are adding regularisation, which decreases the overfitting. Try plotting the MSE as you increase the number of trees while keeping the rest of the parameters unchanged, with MSE on the y-axis and num_trees on the x-axis (see the sketch just below). You will see that when adding more trees the error decreases quickly and then plateaus, but it never increases.
    – tashuhka
    Feb 19 '16 at 13:43
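
A minimal sketch of the plot suggested in the comment above, assuming scikit-learn; the dataset and hyper-parameter values are illustrative, not taken from the original posts:

    # Held-out MSE as a function of the number of trees in a Random Forest.
    # warm_start=True lets us grow the same forest incrementally.
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestRegressor(n_estimators=1, warm_start=True, random_state=0)
    tree_counts, test_mse = [], []
    for n in range(1, 201, 5):
        rf.set_params(n_estimators=n)   # add trees to the existing forest
        rf.fit(X_train, y_train)
        tree_counts.append(n)
        test_mse.append(mean_squared_error(y_test, rf.predict(X_test)))

    plt.plot(tree_counts, test_mse)
    plt.xlabel("number of trees")
    plt.ylabel("test MSE")
    plt.show()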



















9












You may want to check Cross Validated, a Stack Exchange site for many things, including machine learning.

In particular, this question (with exactly the same title) has already been answered multiple times. Check these links: https://stats.stackexchange.com/search?q=random+forest+overfit

But I can give you the short answer: yes, it does overfit, and sometimes you need to control the complexity of the trees in your forest, or even prune them when they grow too much; but this depends on the library you use for building the forest. E.g. in randomForest in R you can only control the complexity.

answered Aug 24 '14 at 8:22
Alexey Grigorev
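
As a side note (not part of the original answer): in Python's scikit-learn the analogous complexity controls are exposed as hyper-parameters. A minimal sketch, with arbitrary example values:

    # Limiting tree complexity in a random forest (scikit-learn).
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=500,      # number of trees
        max_depth=8,           # cap tree depth (acts like pruning/regularisation)
        min_samples_leaf=5,    # require a minimum number of samples per leaf
        max_features="sqrt",   # features considered at each split
        random_state=0,
    )
    # rf.fit(X_train, y_train) with your own data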





















1

STRUCTURED DATASET -> MISLEADING OOB ERRORS

I've found an interesting case of RF overfitting in my work practice: when the data are structured, RF overfits on the OOB observations.

Detail:

I try to predict electricity prices on the electricity spot market for each single hour (each row of the dataset contains the price and the system parameters (load, capacities, etc.) for that single hour).

Electricity prices are created in batches (24 prices created on the electricity market in one fixing, at one moment in time).

So the OOB observations for each tree are random subsets of the set of hours, but if you predict the next 24 hours you do it all at once (at the first moment you obtain all the system parameters, then you predict 24 prices, then there is a fixing which produces those prices). This makes it easier to produce OOB predictions than predictions for the whole next day: OOB observations are not contained in 24-hour blocks but dispersed uniformly, and since there is autocorrelation of the prediction errors, it is easier to predict the price for a single missing hour than for a whole block of missing hours.

Easier to predict, in case of error autocorrelation:
known, known, prediction, known, prediction - OOB case

Harder one:
known, known, known, prediction, prediction - real-world prediction case

I hope it's interesting.

answered Jul 22 '16 at 8:15
Qbik
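
One way to see the effect described above is to compare the (optimistic) OOB estimate with a forward, block-wise evaluation. A minimal sketch, assuming scikit-learn and using invented data whose error is shared within each 24-hour block:

    # OOB R^2 vs. forward time-blocked CV R^2 on block-structured data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import TimeSeriesSplit, cross_val_score

    rng = np.random.default_rng(0)
    n_days = 300
    hour = np.arange(n_days * 24) % 24
    driver = np.repeat(rng.normal(size=n_days), 24)                # per-day "system parameter"
    day_noise = np.repeat(rng.normal(scale=2.0, size=n_days), 24)  # error shared by all hours of a day
    y = 3 * driver + np.sin(2 * np.pi * hour / 24) + day_noise
    X = np.column_stack([driver, hour])

    rf = RandomForestRegressor(n_estimators=200, oob_score=True, n_jobs=-1, random_state=0)
    rf.fit(X, y)
    # OOB rows still have same-day rows in each tree's training sample,
    # so the shared within-day error gets (partly) absorbed.
    print("OOB R^2:           ", round(rf.oob_score_, 3))

    forward_cv = TimeSeriesSplit(n_splits=5)   # always predict later days from earlier days
    scores = cross_val_score(rf, X, y, cv=forward_cv, scoring="r2")
    print("Day-blocked CV R^2:", round(scores.mean(), 3))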





















1

1. The Random Forest does overfit.

2. The Random Forest does not increase its generalization error when more trees are added to the model. The generalization variance goes to zero as more trees are used.

I've made a very simple experiment. I generated synthetic data:

y = 10 * x + noise

I trained two Random Forest models:

• one with full trees

• one with pruned trees

The model with full trees has lower train error but higher test error than the model with pruned trees. The responses of both models:

[figure: responses of the two models]

This is clear evidence of overfitting. Then I took the hyper-parameters of the overfitted model and checked the error while adding one tree at each step. I got the following plot:

[figure: test error while growing the number of trees]

As you can see, the error of the overfitted model does not change when adding more trees, but the model remains overfitted. Here is the link to the experiment I've made.

answered yesterday
pplonski
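
A minimal sketch of the kind of experiment described above, assuming scikit-learn (this is not the author's linked notebook; the sample size, noise level, and depth limit are illustrative):

    # Full vs. depth-limited random forests on y = 10*x + noise.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=2000)
    y = 10 * x + rng.normal(scale=10.0, size=x.size)   # y = 10*x + noise
    X = x.reshape(-1, 1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    models = {
        "full trees":          RandomForestRegressor(n_estimators=100, random_state=0),
        "depth-limited trees": RandomForestRegressor(n_estimators=100, max_depth=3, random_state=0),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        train_mse = mean_squared_error(y_tr, model.predict(X_tr))
        test_mse = mean_squared_error(y_te, model.predict(X_te))
        # Expected pattern: full trees give a lower train MSE but a higher test MSE.
        print(f"{name}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")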













