{"id":638,"date":"2021-10-21T16:38:26","date_gmt":"2021-10-21T14:38:26","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=638"},"modified":"2021-10-21T16:38:27","modified_gmt":"2021-10-21T14:38:27","slug":"personal-highlights-of-scikit-learn-1-0","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2021\/10\/21\/personal-highlights-of-scikit-learn-1-0\/","title":{"rendered":"Personal Highlights of Scikit-Learn 1.0"},"content":{"rendered":"\n<p>Yes! After more than 10 years, <a href=\"https:\/\/scikit-learn.org\">scikit-learn<\/a> released its 1.0 version on 24 September 2021. In this post, I&#8217;d like to point out some personal highlights apart from the <a href=\"https:\/\/scikit-learn.org\/dev\/auto_examples\/release_highlights\/plot_release_highlights_1_0_0.html\">release highlights<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Feature Names<\/h2>\n\n\n\n<p>This one is listed in the release highlights, but deserves to be mentioned again.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">from sklearn.compose import ColumnTransformer\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.pipeline import make_pipeline\nfrom sklearn.preprocessing import OneHotEncoder, StandardScaler\nimport pandas as pd\n\ndf = pd.DataFrame({\n    &quot;pet&quot;: [&quot;dog&quot;, &quot;cat&quot;, &quot;fish&quot;],\n    &quot;age&quot;: [3, 7, 1],\n    
&quot;noise&quot;: [-99, pd.NA, 1e-10],\n    &quot;target&quot;: [1, 0, 1],\n})\ny = df.pop(&quot;target&quot;)\nX = df\n\npreprocessor = ColumnTransformer(\n    [\n        (&quot;numerical&quot;, StandardScaler(), [&quot;age&quot;]),\n        (&quot;categorical&quot;, OneHotEncoder(), [&quot;pet&quot;]),\n    ],\n    verbose_feature_names_out=False,\n    remainder=&quot;drop&quot;,\n)\n\npipe = make_pipeline(preprocessor, LogisticRegression())\npipe.fit(X, y)\npipe[:-1].get_feature_names_out()<\/pre><\/div>\n\n\n\n<pre class=\"wp-block-code\"><code class=\"\">array(['age', 'pet_cat', 'pet_dog', 'pet_fish'], dtype=object)<\/code><\/pre>\n\n\n\n<p>This is not yet available for all estimators and transformers, but it is a big step towards <a href=\"https:\/\/scikit-learn-enhancement-proposals.readthedocs.io\/en\/latest\/slep007\/proposal.html\">SLEP007<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. ColumnTransformer allows changed order of columns<\/h2>\n\n\n\n<p>Before this release, <a href=\"https:\/\/scikit-learn.org\/dev\/modules\/generated\/sklearn.compose.ColumnTransformer.html#sklearn-compose-columntransformer\"><em>ColumnTransformer<\/em><\/a> recorded the order of the columns of a dataframe during the <em>fit<\/em> method and required that a dataframe <em>X<\/em> passed to <em>transform<\/em> had exactly the same columns in exactly the same order.<\/p>\n\n\n\n<p>This was a big pain point in production settings because <em>fit<\/em> and <em>predict<\/em> of a model pipeline, both calling <em>transform<\/em>, often get data from different sources, and, for instance, SQL does not care about the order of columns. On top of that, <em>remainder=&#8221;drop&#8221;<\/em> forced you to also include all dropped columns in <em>transform<\/em>. This contradicted at least my modelling workflow, as I often specify all meaningful features explicitly and drop the rest via the <em>remainder<\/em> option. 
This then led to unwanted surprises when finally applying <em>predict<\/em> to new data. It might also happen that one forgets to remove the target variable from the training <em>X<\/em> and relies on the drop option. Usually, the predictive model pipeline is applied to new data without the target variable. In this case, however, the resulting error might be considered a good thing.<\/p>\n\n\n\n<p>With pull request (PR) <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/19263\">#19263<\/a>, the <em>ColumnTransformer<\/em> only cares about the presence and names of the columns, not about their order. With <em>remainder=&#8221;drop&#8221;<\/em>, it only cares about the specified columns and ignores all other columns, no matter whether the dropped ones differ between <em>fit<\/em> and <em>transform<\/em>. Note that this only works with pandas dataframes as input (or an object that quacks alike).<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">df_new = pd.DataFrame({\n    &quot;age&quot;: [1, 9, 3],\n    &quot;another_noise&quot;: [pd.NA, -99, 1e-10],\n    &quot;pet&quot;: [&quot;cat&quot;, &quot;dog&quot;, &quot;fish&quot;],\n})\npipe.predict(df_new)<\/pre><\/div>\n\n\n\n<p>You can find these little code snippets as a notebook at the usual place: <a 
href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-10-21%20scikit-learn_v1_release_highlights.ipynb\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2021-10-21%20scikit-learn_v1_release_highlights.ipynb<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Poisson criterion for random forests<\/h2>\n\n\n\n<p>Scikit-learn v0.24 shipped with the new option <em>criterion=&#8221;poisson&#8221;<\/em> for <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor\">DecisionTreeRegressor<\/a> to split nodes based on the reduction of Poisson deviance. Version 1.0 passed this option on to the <a href=\"https:\/\/scikit-learn.org\/dev\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html#sklearn-ensemble-randomforestregressor\">RandomForestRegressor<\/a> in <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/19836\">PR #19836<\/a>. Random forests are widely used models, valued for their ease of use. We even like to write blog posts about them:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><a href=\"https:\/\/lorentzen.ch\/index.php\/2021\/05\/21\/strong-random-forests-with-xgboost\/\" data-type=\"post\" data-id=\"376\">Strong Random Forests with XGBoost<\/a><\/li><li><a href=\"https:\/\/lorentzen.ch\/index.php\/2021\/08\/19\/feature-subsampling-for-random-forest-regression\/\" data-type=\"post\" data-id=\"600\">Feature Subsampling For Random Forest Regression<\/a><\/li><\/ul>\n\n\n\n<p>The Poisson splitting criterion has its place when modelling counts or frequencies. It allows non-negative target values, but forbids non-positive predictions. This corresponds to <span class=\"katex-eq\" data-katex-display=\"false\">y_{train} \\geq 0<\/span> and <span class=\"katex-eq\" data-katex-display=\"false\">y_{predict} &gt; 0<\/span>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
The best example\/tutorial of the year<\/h2>\n\n\n\n<p>It&#8217;s not visible from the release notes, but this deserves to be noted. <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/20281\">PR #20281<\/a> added a fantastic example, more like a tutorial, on time-related feature engineering. You will find a lot of interesting features, some of them shipped with the 1.0 release, e.g. time-based cross-validation, generation of cyclic B-splines, adding pairwise interactions to a linear model, and usage of native categorical features in the <em>HistGradientBoostingRegressor<\/em>&#8230; <\/p>\n\n\n\n<p>Take a look for yourself at this wonderful <a href=\"https:\/\/scikit-learn.org\/dev\/auto_examples\/applications\/plot_cyclical_feature_engineering.html#sphx-glr-auto-examples-applications-plot-cyclical-feature-engineering-py\">example<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Yes! After more than 10 years, scikit-learn released its 1.0 version on 24 September 2021. In this post, I&#8217;d like to point out some personal highlights apart from the release highlights. 1. Feature Names This one is listed in the release highlights, but deserves to be mentioned again. 
This is not yet available for all [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[16],"tags":[6],"class_list":["post-638","post","type-post","status-publish","format-standard","hentry","category-machine-learning","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Christian Lorentzen","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/christian\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/638","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=638"}],"version-history":[{"count":12,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/638\/revisions"}],"predecessor-version":[{"id":657,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/638\/revisions\/657"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=638"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=638"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=638"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}