{"id":600,"date":"2021-08-19T20:39:22","date_gmt":"2021-08-19T18:39:22","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=600"},"modified":"2021-08-27T10:36:19","modified_gmt":"2021-08-27T08:36:19","slug":"feature-subsampling-for-random-forest-regression","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2021\/08\/19\/feature-subsampling-for-random-forest-regression\/","title":{"rendered":"Feature Subsampling For  Random Forest Regression"},"content":{"rendered":"\n<p><strong>TLDR:<\/strong> The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Randomness in Random Forests<\/h2>\n\n\n\n<p>Random forests are very popular machine learning models. They are built from easily understandable and well visualizable decision trees and usually give good predictive performance without the need for excessive hyperparameter tuning. Some drawbacks are that they do not scale well to very large datasets and that their predictions are discontinuous on continuous features.<\/p>\n\n\n\n<p>A key ingredient for random forests is\u2014no surprise here\u2014randomness. The two main sources of randomness are:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>Feature subsampling in every node split when fitting decision trees.<\/li><li>Row subsampling (bagging) of the training dataset for each decision tree.<\/li><\/ul>\n\n\n\n<p>In this post, we want to investigate the first source, <em>feature subsampling<\/em>, with a special focus on regression problems on continuous targets (as opposed to classification).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Feature Subsampling<\/h2>\n\n\n\n<p>In his <a href=\"https:\/\/doi.org\/10.1023\/A:1010933404324\">seminal paper<\/a>, Leo Breiman introduced random forests and pointed out several advantages of feature subsampling per node split. 
We quote from his paper:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The forests studied here consist of using randomly selected inputs or combinations of inputs at each node to grow each tree. The resulting forests give accuracy that compare favorably with Adaboost. This class of procedures has desirable characteristics:<\/p><p>i Its accuracy is as good as Adaboost and sometimes better.<\/p><p>ii It&#8217;s relatively robust to outliers and noise.<\/p><p>iii It&#8217;s faster than bagging or boosting.<\/p><p>iv It gives useful internal estimates of error, strength, correlation and variable importance.<\/p><p>v It&#8217;s simple and easily parallelized.<\/p><cite>Breiman, L. Random Forests.&nbsp;<em>Machine Learning<\/em>&nbsp;<strong>45,&nbsp;<\/strong>5\u201332 (2001).<\/cite><\/blockquote>\n\n\n\n<p>Note the focus on comparing with Adaboost at that time and the, by today&#8217;s standards, relatively small datasets used for the empirical studies in this paper.<\/p>\n\n\n\n<p>If the input data has <em>p<\/em> features (columns), implementations of random forests usually allow specifying how many features to consider at each split:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Implementation<\/strong><\/td><td><strong>Language<\/strong><\/td><td><strong>Parameter<\/strong><\/td><td><strong>Default<\/strong><\/td><\/tr><tr><td>scikit-learn <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor\">RandomForestRegressor<\/a><\/td><td>Python<\/td><td><em>max_features<\/em><\/td><td><em>p<\/em><\/td><\/tr><tr><td>scikit-learn <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier\">RandomForestClassifier<\/a><\/td><td>Python<\/td><td><em>max_features<\/em><\/td><td><em>sqrt(p)<\/em><\/td><\/tr><tr><td><a 
href=\"https:\/\/cran.r-project.org\/package=ranger\">ranger<\/a><\/td><td>R<\/td><td><em>mtry<\/em><\/td><td><em>sqrt(p)<\/em><\/td><\/tr><tr><td><a href=\"https:\/\/cran.r-project.org\/package=randomForest\">randomForest<\/a> regression<\/td><td>R<\/td><td><em>mtry<\/em><\/td><td><em>p\/3<\/em><\/td><\/tr><tr><td><a href=\"https:\/\/cran.r-project.org\/package=randomForest\">randomForest<\/a> classification<\/td><td>R<\/td><td><em>mtry<\/em><\/td><td><em>sqrt(p)<\/em><\/td><\/tr><tr><td><a href=\"https:\/\/docs.h2o.ai\/h2o\/latest-stable\/h2o-docs\/data-science\/drf.html\">H2O<\/a> regression<\/td><td>Python &amp; R<\/td><td><em>mtries<\/em><\/td><td><em>p\/3<\/em><\/td><\/tr><tr><td><a href=\"https:\/\/docs.h2o.ai\/h2o\/latest-stable\/h2o-docs\/data-science\/drf.html\">H2O<\/a> classification<\/td><td>Python &amp; R<\/td><td><em>mtries<\/em><\/td><td><em>sqrt(p)<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Note that scikit-learn&#8217;s default for regression is surprising because it switches off the randomness from feature subsampling, rendering the model equal to bagged trees!<\/p>\n\n\n\n<p>While empirical studies on the impact of feature subsampling and on good default choices focus on classification problems (see the literature review in <a href=\"https:\/\/arxiv.org\/pdf\/1804.03515.pdf\">Probst et al. 2019<\/a>), we consider a set of regression problems with continuous targets. Note that different results might be more related to different feature spaces than to the difference between classification and regression.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The hyperparameters mtry, sample size and node size are the parameters that control the randomness of the RF. [&#8230;] Out of these parameters, mtry is most influential both according to the literature and in our own experiments. 
The best value of mtry depends on the number of variables that are related to the outcome.<\/p><cite>Probst, P. et al. \u201cHyperparameters and tuning strategies for random forest.\u201d&nbsp;<em>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery<\/em>&nbsp;9 (2019): n. pag.<\/cite><\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">Benchmarks<\/h2>\n\n\n\n<p>We selected the following 13 datasets with regression problems:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Dataset<\/strong><\/td><td><strong>number of samples<\/strong><\/td><td><strong>number of used features p<\/strong><\/td><\/tr><tr><td>Allstate<\/td><td>188,318<\/td><td>130<\/td><\/tr><tr><td>Bike_Sharing_Demand<\/td><td>17,379<\/td><td>12<\/td><\/tr><tr><td>Brazilian_houses<\/td><td>10,692<\/td><td>12<\/td><\/tr><tr><td>ames<\/td><td>1,460<\/td><td>79<\/td><\/tr><tr><td>black_friday<\/td><td>166,821<\/td><td>9<\/td><\/tr><tr><td>colleges<\/td><td>7,063<\/td><td>49<\/td><\/tr><tr><td>delays_zurich_transport<\/td><td>27,327<\/td><td>17<\/td><\/tr><tr><td>diamonds<\/td><td>53,940<\/td><td>6<\/td><\/tr><tr><td>la_crimes<\/td><td>1,468,825<\/td><td>25<\/td><\/tr><tr><td>medical_charges_nominal<\/td><td>163,065<\/td><td>11<\/td><\/tr><tr><td>nyc-taxi-green-dec-2016<\/td><td>581,835<\/td><td>14<\/td><\/tr><tr><td>particulate-matter-ukair-2017<\/td><td>394,299<\/td><td>9<\/td><\/tr><tr><td>taxi<\/td><td>581,835<\/td><td>18<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Note that among those, there is no high-dimensional dataset in the sense that <em>p<\/em> &gt; number of samples.<\/p>\n\n\n\n<p>On these, we fitted the scikit-learn (version 0.24) RandomForestRegressor (within a short pipeline handling missing values) with default parameters. 
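<\/p>

<p>The benchmark loop can be sketched in a few lines. This is a hedged, self-contained illustration only: it uses a synthetic dataset from <em>make_regression<\/em> in place of the 13 datasets above and therefore skips the missing-value pipeline; see the linked notebook at the end for the actual code.<\/p>

```python
# Hedged sketch of the benchmark: synthetic data stands in for the
# 13 real datasets, so no missing-value handling is needed here.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

X, y = make_regression(n_samples=1000, n_features=12, random_state=0)
p = X.shape[1]

# The four max_features settings compared in this post.
candidates = {"p/3": p // 3, "sqrt(p)": int(np.sqrt(p)), "0.9 p": int(0.9 * p), "p": p}

results = {}
for name, mf in candidates.items():
    model = RandomForestRegressor(
        n_estimators=50, max_features=mf, random_state=0, n_jobs=-1
    )
    cv = cross_validate(model, X, y, cv=5, scoring="neg_mean_squared_error")
    # cross_validate returns negated MSE; flip the sign (lower is better).
    results[name] = -cv["test_score"].mean()
    print(f"max_features={name:>7}: mean CV MSE = {results[name]:.1f}")
```

<p>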
We used 5-fold cross validation with 4 different values: <em>max_features=p\/3 <\/em>(blue),<em> sqrt(p) <\/em>(orange), <em>0.9 p <\/em>(green), and <em>p<\/em> (red). We show the mean squared error with uncertainty bars (\u00b1 one standard deviation across cross-validation splits); the lower, the better.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"1024\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-718x1024.png\" alt=\"\" class=\"wp-image-603\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-718x1024.png 718w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-210x300.png 210w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-768x1096.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image.png 1014w\" sizes=\"auto, (max-width: 718px) 100vw, 718px\" \/><\/figure>\n\n\n\n<p>In addition, we report the fit time of each (5-fold) fit in seconds; again, the lower the better.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"1024\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-1-718x1024.png\" alt=\"\" class=\"wp-image-604\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-1-718x1024.png 718w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-1-210x300.png 210w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-1-768x1095.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2021\/07\/image-1.png 1015w\" sizes=\"auto, (max-width: 718px) 100vw, 718px\" \/><\/figure>\n\n\n\n<p>Note that <em>sqrt(p)<\/em> is often smaller than <em>p\/3<\/em>. 
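<\/p>

<p>A quick check with the <em>p<\/em> values from the table above makes this concrete; <em>sqrt(p)<\/em> falls below <em>p\/3<\/em> exactly when <em>p<\/em> &gt; 9, which holds for most of our datasets:<\/p>

```python
import math

# sqrt(p) vs. p/3 for feature counts taken from the benchmark datasets.
for p in [6, 9, 12, 25, 79, 130]:
    print(f"p={p:3d}  sqrt(p)={math.sqrt(p):5.1f}  p/3={p / 3:5.1f}")
```

<p>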
With this in mind, these graphs show that fit time is roughly proportional to the number of subsampled features.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<ul class=\"wp-block-list\"><li>The main tuning parameter of random forests is the number of features used for feature subsampling (<em>max_features<\/em>, <em>mtry<\/em>). Depending on the dataset, it has a relevant impact on the predictive performance.<\/li><li>The default of scikit-learn&#8217;s RandomForestRegressor seems odd. It produces bagged trees. This is a bit like using ridge regression with a zero penalty\ud83d\ude09. However, it can be justified by our benchmarks above.<\/li><\/ul>\n\n\n\n<p>The full code can be found here:<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-08-19%20random_forests_max_features.ipynb\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-08-19%20random_forests_max_features.ipynb<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations. Randomness in Random Forests Random forests are very popular machine learning models. 
They are built from easily understandable and well visualizable decision trees and usually give good predictive performance [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9],"tags":[6],"class_list":["post-600","post","type-post","status-publish","format-standard","hentry","category-statistics","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Christian Lorentzen","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/christian\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/600","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=600"}],"version-history":[{"count":19,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/600\/revisions"}],"predecessor-version":[{"id":625,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/600\/revisions\/625"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=600"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=600"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=600"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}