{"id":1346,"date":"2023-11-24T11:34:46","date_gmt":"2023-11-24T10:34:46","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1346"},"modified":"2023-11-24T18:44:59","modified_gmt":"2023-11-24T17:44:59","slug":"an-open-source-journey-with-scikit-learn","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2023\/11\/24\/an-open-source-journey-with-scikit-learn\/","title":{"rendered":"An Open Source Journey with Scikit-Learn"},"content":{"rendered":"\n<p>In this post, I&#8217;d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or to keep contributing and have endurance for bigger picture changes.<\/p>\n\n\n<div class=\"wp-block-ub-table-of-contents-block ub_table-of-contents\" id=\"ub_table-of-contents-59c35afd-57c7-4ff0-8c94-c0fc32946698\" data-linktodivider=\"false\" data-showtext=\"show\" data-hidetext=\"hide\" data-scrolltype=\"auto\" data-enablesmoothscroll=\"false\" data-initiallyhideonmobile=\"false\" data-initiallyshow=\"true\"><div class=\"ub_table-of-contents-header-container\" style=\"\">\n\t\t\t<div class=\"ub_table-of-contents-header\" style=\"text-align: left; \">\n\t\t\t\t<div class=\"ub_table-of-contents-title\">Table of Content<\/div>\n\t\t\t\t\n\t\t\t<\/div>\n\t\t<\/div><div class=\"ub_table-of-contents-extra-container\" style=\"\">\n\t\t\t<div class=\"ub_table-of-contents-container ub_table-of-contents-1-column \">\n\t\t\t\t<ul style=\"\"><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/11\/24\/an-open-source-journey-with-scikit-learn\/#0-how-it-all-started\" style=\"\">How it all started<\/a><\/li><li style=\"\"><a href=\"https:\/\/lorentzen.ch\/index.php\/2023\/11\/24\/an-open-source-journey-with-scikit-learn\/#1-becoming-a-scikit-learn-core-developer\" style=\"\">Becoming a scikit-learn core developer<\/a><\/li><li style=\"\"><a 
href=\"https:\/\/lorentzen.ch\/index.php\/2023\/11\/24\/an-open-source-journey-with-scikit-learn\/#2-summary-as-core-developer\" style=\"\">Summary as core developer<\/a><\/li><\/ul>\n\t\t\t<\/div>\n\t\t<\/div><\/div>\n\n\n<h2 class=\"wp-block-heading\" id=\"0-how-it-all-started\">How it all started<\/h2>\n\n\n\n<p>Back in 2015\/2016, I was working as a non-life pricing actuary. The standard vendor desktop applications we used for generalized linear models (GLM) suffered from system discontinuities, manual and error-prone steps, and a lack of modern machine learning capabilities (not even out-of-sample model comparison).<\/p>\n\n\n\n<p>Python was then on the rise for data science. Numpy, scipy and pandas had laid the foundations; then came deep learning (a.k.a. neural net) frameworks, leading to tensorflow and pytorch. XGBoost was also a game changer, visible in Kaggle competition leaderboards. All those projects came as open source with thriving communities and possibilities to contribute.<\/p>\n\n\n\n<p>While the R base package always comes with splendid dataframes (I guess they invented them) and battle-proven GLMs out of the box, the Python side for GLMs was not that well developed. So I started with GLMs in <a href=\"https:\/\/www.statsmodels.org\/\">statsmodels<\/a> and generalized linear mixed models (a.k.a. hierarchical or multilevel models) in <a href=\"https:\/\/www.pymc.io\">pymc<\/a> (then called pymc3). 
My first open source contributions in the Python world were small issues in statsmodels and, a little later, the bug report <a href=\"https:\/\/github.com\/pymc-devs\/pymc\/issues\/2640\">pymc#2640<\/a> about memory alignment issues, which were caused by <a href=\"https:\/\/github.com\/joblib\/joblib\/issues\/563\">joblib#563<\/a>.<\/p>\n\n\n\n<p>To my great surprise, the famous machine learning library <a href=\"https:\/\/scikit-learn.org\">scikit-learn<\/a> did not have GLMs, only penalized linear models and logistic regression, but no Poisson or Gamma GLMs, which are essential in non-life insurance pricing. Fortunately, I was not the first one to notice this lack. There was already an open issue <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/5975\">scikit-learn#5975<\/a> with many people asking for this feature. Just nobody had contributed a pull request (PR) yet. <\/p>\n\n\n\n<p>That&#8217;s when I said to myself: It should not fail just because no one implements it. I really like open source and had gained some programming experience during my PhD in particle physics, mainly in C++. Eventually, I boldly (because I was still a newbie) opened the PR <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/9405\">scikit-learn#9405<\/a> in summer 2017.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"1-becoming-a-scikit-learn-core-developer\">Becoming a scikit-learn core developer<\/h2>\n\n\n\n<p>This PR turned out to be essential for the development of GLMs and for becoming a scikit-learn core developer. I dare say that I almost went crazy trying to convince the core developers that GLMs are really that useful for supervised machine learning and that GLMs should land in scikit-learn. 
In retrospect, this was the hardest part; it took me almost 2 years of patience and repeating my arguments. Some example comments are given below:<\/p>\n\n\n<div class=\"wp-block-ub-content-toggle wp-block-ub-content-toggle-block\" id=\"ub-content-toggle-block-129f6d09-928f-401f-bf02-702af3fcc131\" data-mobilecollapse=\"false\" data-desktopcollapse=\"true\" data-preventcollapse=\"false\" data-showonlyone=\"false\">\n<div class=\"wp-block-ub-content-toggle-accordion\" style=\"border-color: #f1f1f1; \" id=\"ub-content-toggle-panel-block-\">\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-title-wrap\" style=\"background-color: #f1f1f1;\" aria-controls=\"ub-content-toggle-panel-0-129f6d09-928f-401f-bf02-702af3fcc131\" tabindex=\"0\">\n\t\t\t<p class=\"wp-block-ub-content-toggle-accordion-title ub-content-toggle-title-129f6d09-928f-401f-bf02-702af3fcc131\" style=\"color: #000000; \">comment example 1<\/p>\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-toggle-wrap right\" style=\"color: #000000;\"><span class=\"wp-block-ub-content-toggle-accordion-state-indicator wp-block-ub-chevron-down\"><\/span><\/div>\n\t\t<\/div>\n\t\t\t<div role=\"region\" aria-expanded=\"false\" class=\"wp-block-ub-content-toggle-accordion-content-wrap ub-hide\" id=\"ub-content-toggle-panel-0-129f6d09-928f-401f-bf02-702af3fcc131\">\n\n<p><em>&#8220;I can only repeat myself: I&#8217;d prefer to have this functionality in scikit-learn for several reasons (your review, opinion and ideas, very official\/trustworthy library, more efficient maintainance, effort to release this pr as its own library, \u2026).<br>To be more explicit for the moment: If it takes longer than the end of 2019 (+-), I&#8217;ll consider to release it as separate library.&#8221;<\/em> <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/9405#issuecomment-450936476\">link<\/a><\/p>\n\n<\/div>\n\t\t<\/div>\n\n<div class=\"wp-block-ub-content-toggle-accordion\" style=\"border-color: #f1f1f1; \" 
id=\"ub-content-toggle-panel-block-\">\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-title-wrap\" style=\"background-color: #f1f1f1;\" aria-controls=\"ub-content-toggle-panel-1-129f6d09-928f-401f-bf02-702af3fcc131\" tabindex=\"0\">\n\t\t\t<p class=\"wp-block-ub-content-toggle-accordion-title ub-content-toggle-title-129f6d09-928f-401f-bf02-702af3fcc131\" style=\"color: #000000; \">comment example 2<\/p>\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-toggle-wrap right\" style=\"color: #000000;\"><span class=\"wp-block-ub-content-toggle-accordion-state-indicator wp-block-ub-chevron-down\"><\/span><\/div>\n\t\t<\/div>\n\t\t\t<div role=\"region\" aria-expanded=\"false\" class=\"wp-block-ub-content-toggle-accordion-content-wrap ub-hide\" id=\"ub-content-toggle-panel-1-129f6d09-928f-401f-bf02-702af3fcc131\">\n\n<p><em>&#8220;I see it a bit different. Scikit-Learn like R glm and glmnet is trusted world-wide and can be used in many companies, whereas it might be difficult to get any of the existing GLM libraries on pypi (h2o excluded) into production (no offense intended). That being said, I&#8217;d like to return the question and ask you: What exactly has to be fulfilled in order for a GLM PR to be merged into scikit-learn? 
Once that is clarified, I&#8217;ll think about starting a collaboration for this.&#8221;<\/em> <a href=\"http:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/9405#issuecomment-464898593\">link<\/a><\/p>\n\n<\/div>\n\t\t<\/div>\n\n<div class=\"wp-block-ub-content-toggle-accordion\" style=\"border-color: #f1f1f1; \" id=\"ub-content-toggle-panel-block-\">\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-title-wrap\" style=\"background-color: #f1f1f1;\" aria-controls=\"ub-content-toggle-panel-2-129f6d09-928f-401f-bf02-702af3fcc131\" tabindex=\"0\">\n\t\t\t<p class=\"wp-block-ub-content-toggle-accordion-title ub-content-toggle-title-129f6d09-928f-401f-bf02-702af3fcc131\" style=\"color: #000000; \">comment example 3<\/p>\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-toggle-wrap right\" style=\"color: #000000;\"><span class=\"wp-block-ub-content-toggle-accordion-state-indicator wp-block-ub-chevron-down\"><\/span><\/div>\n\t\t<\/div>\n\t\t\t<div role=\"region\" aria-expanded=\"false\" class=\"wp-block-ub-content-toggle-accordion-content-wrap ub-hide\" id=\"ub-content-toggle-panel-2-129f6d09-928f-401f-bf02-702af3fcc131\">\n\n<p><em>&#8230;<\/em><\/p>\n\n\n\n<p><em><strong>guidance &#8211; maintenance<\/strong><br>As a GLM user on a fairly regular basis, I&#8217;d be happy to help as good as I can. Feel free to reach out to me. As to maintenance, I think a unified framework would even lower the burden. 
I can also imagine to give some support for maintenance.<\/em><\/p>\n\n\n\n<p><strong><em>miscellaneous<\/em><\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>&#8230;<\/em><\/li>\n\n\n\n<li><em>For GBMs to rely on the same loss and link functions would make sense &#8230;<\/em><\/li>\n<\/ul>\n\n\n\n<p><em>&#8230;<\/em><\/p>\n\n\n\n<p><em><strong>further steps<\/strong><br>Besides further commits to this PR, let me know how I can help you best.<\/em><\/p>\n\n\n\n<p>[<a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/9405#issuecomment-468065740\">link<\/a>]<\/p>\n\n<\/div>\n\t\t<\/div>\n<\/div>\n\n\n<p>As I wanted to demonstrate the full utility of GLMs, this PR had become much too large for review and inclusion: +4000 lines of code with several solvers, penalty matrices, 3 examples, a lot of documentation and good test coverage (and a lot of things I would do differently today).<\/p>\n\n\n\n<p>The conclusion was to carve out a minimal GLM implementation using the L-BFGS solver of scipy. This way, I met <a href=\"https:\/\/github.com\/rth\">Roman Yurchak<\/a>, with whom it was a pleasure to work. It took a little \ud83c\udde8\ud83c\uddedSwiss chocolate\ud83c\udf6b incentive to finally get <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/14300\">scikit-learn#14300<\/a> (still +2900 loc) reviewed and merged in spring 2020. 
Almost 3 years after opening my original PR, it was released in <a href=\"https:\/\/scikit-learn.org\/dev\/whats_new\/v0.23.html#id9\">scikit-learn version 0.23<\/a>!<\/p>\n\n\n\n<p>I guess it was mainly this work and perseverance around GLMs that caught the attention of the core developers and that motivated them to vote for me: In summer 2020, I was invited to become a scikit-learn core developer and gladly accepted.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"2-summary-as-core-developer\">Summary as core developer<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"3-further-directions\">Further directions<\/h3>\n\n\n\n<p>My work on GLMs was easily extensible to other estimators in the form of loss functions. Again, to my surprise, loss functions, a core element of supervised learning, were re-implemented again and again within scikit-learn. So, based on Roman&#8217;s idea in <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/15123\">#15123<\/a>, I started a project to unify them and, in the process, to extend several tree estimator classes with Poisson and Gamma losses (and to make existing ones more stable and faster).<\/p>\n\n\n\n<p>As loss functions are such important core components, they basically have two major requirements: be numerically stable and fast. That&#8217;s why I went with Cython (the preferred way for fast code in scikit-learn) in <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/20567\">scikit-learn#20567<\/a> and guess which loop it closed? Again, I met segfault errors caused by <a href=\"https:\/\/github.com\/joblib\/joblib\/issues\/563\">joblib#563<\/a>. This time, it motivated another core developer to invest considerable effort in fixing it in <a href=\"https:\/\/github.com\/joblib\/joblib\/pull\/1254\">joblib#1254<\/a>.<\/p>\n\n\n\n<p>Another story branch is the dedicated GLM Python library <a href=\"https:\/\/glum.readthedocs.io\">glum<\/a>. 
The authors took my original, way-too-long GLM PR as a starting point and developed one of the most feature-rich and fastest GLM implementations out there. This is almost like a dream come true.<\/p>\n\n\n\n<p>A summary of my contributions over those 3 intensive years as a scikit-learn core developer is best given in several categories.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"4-pull-requests\">Pull requests<\/h3>\n\n\n\n<p>A summary of my contributions in terms of code may be:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Unified loss module, unified naming of losses, Poisson and Gamma losses for GLMs and decision-tree-based models<\/li>\n\n\n\n<li><code>LinearModelLoss<\/code> and <code>NewtonSolver<\/code> (newton-cholesky) for GLMs like <code>LogisticRegression<\/code> and <code>PoissonRegressor<\/code> as well as further solver improvements<\/li>\n\n\n\n<li><code>QuantileRegressor<\/code> (linear quantile regression) and quantile\/pinball loss for <code>HistGradientBoostingRegressor<\/code> (HGBT). 
BTW, solvers for linear quantile regression are much harder than GLM solvers!<\/li>\n\n\n\n<li><code>SplineTransformer<\/code><\/li>\n\n\n\n<li>Interaction constraints and feature subsampling for HGBT<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-ub-content-toggle wp-block-ub-content-toggle-block\" id=\"ub-content-toggle-block-c1e2bfa0-c29d-4003-9626-ec0d8e7a9645\" data-mobilecollapse=\"false\" data-desktopcollapse=\"true\" data-preventcollapse=\"false\" data-showonlyone=\"false\">\n<div class=\"wp-block-ub-content-toggle-accordion\" style=\"border-color: #f1f1f1; \" id=\"ub-content-toggle-panel-block-\">\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-title-wrap\" style=\"background-color: #f1f1f1;\" aria-controls=\"ub-content-toggle-panel-0-c1e2bfa0-c29d-4003-9626-ec0d8e7a9645\" tabindex=\"0\">\n\t\t\t<p class=\"wp-block-ub-content-toggle-accordion-title ub-content-toggle-title-c1e2bfa0-c29d-4003-9626-ec0d8e7a9645\" style=\"color: #000000; \">From the <a href=\"https:\/\/scikit-learn.org\/dev\/whats_new.html\">release notes<\/a> and the <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pulls?q=is%3Apr+author%3Alorentzenchr+is%3Amerged+\">github PRs<\/a> (where one would miss a few), a more detailed list of important PRs<\/p>\n\t\t\t<div class=\"wp-block-ub-content-toggle-accordion-toggle-wrap right\" style=\"color: #000000;\"><span class=\"wp-block-ub-content-toggle-accordion-state-indicator wp-block-ub-chevron-down\"><\/span><\/div>\n\t\t<\/div>\n\t\t\t<div role=\"region\" aria-expanded=\"false\" class=\"wp-block-ub-content-toggle-accordion-content-wrap ub-hide\" id=\"ub-content-toggle-panel-0-c1e2bfa0-c29d-4003-9626-ec0d8e7a9645\">\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/15436\">Sample weights for ElasticNet<\/a> v0.23 (Major Feature)<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/14300\">Minimal Generalized linear models implementation<\/a> v0.23 (Major 
Feature)<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/16692\">ENH Poisson loss for HistGradientBoostingRegressor<\/a> v0.23<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/17386\">ENH add Poisson splitting criterion for single trees<\/a> v0.24<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/20567\">Common Private Loss Module with tempita<\/a> v1.1\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/18248\">RFC Consistent options\/names for loss and criterion<\/a> v1.0 and v1.1<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/20811\">ENH Replace loss module HGBT<\/a> v1.1<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/21800\">FEA add quantile HGBT<\/a> v1.1 (Major Feature)<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/21808\">ENH Loss module LogisticRegression<\/a> v1.1<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/22548\">ENH migrate GLMs \/ TweedieRegressor to linear loss<\/a> v1.1<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/22409\">FEA Add Gamma deviance as loss function to HGBT<\/a> v1.3<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/26278\">ENH replace loss module Gradient boosting<\/a> future v1.4<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/9978\">Add quantile regression<\/a> together with <a href=\"https:\/\/github.com\/avidale\">David Dale<\/a> v1.0 (MajorFeature)<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/18368\">FEA Add SplineTransformer<\/a> v1.0<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/21020\">ENH FEA add interaction constraints to HGBT<\/a> v.1.2 
(Major Feature)<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/24637\">FEA add (single) Cholesky Newton solver to GLMs<\/a> v1.2<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/24767\">ENH add newton-cholesky solver to LogisticRegression<\/a> v1.2<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/23619\">TST tight tests for GLMs<\/a> v1.2<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/26721\">ENH scaling of LogisticRegression loss as 1\/n * LinearModelLoss<\/a> future v1.4<\/li>\n\n\n\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/27139\">ENH add feature subsampling per split for HGBT<\/a> future v1.4<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n<\/div>\n\t\t<\/div>\n<\/div>\n\n\n<h3 class=\"wp-block-heading\" id=\"5-reviewing-and-steering\">Reviewing and steering<\/h3>\n\n\n\n<p>Among the biggest changes in newer scikit-learn history are two scikit-learn enhancement proposals (SLEPs):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/scikit-learn-enhancement-proposals.readthedocs.io\/en\/latest\/slep018\/proposal.html\">SLEP018: Pandas Output for Transformers with set_output<\/a><br>championed by <a href=\"https:\/\/github.com\/thomasjpfan\">Thomas Fan<\/a>, implemented in <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/23734\">PR#23734<\/a> v1.2, further developments like <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/27315\">PR#27315<\/a> for polars in future v1.4<\/li>\n\n\n\n<li><a href=\"https:\/\/scikit-learn-enhancement-proposals.readthedocs.io\/en\/latest\/slep006\/proposal.html\">SLEP006: Metadata Routing<\/a><br>championed by <a href=\"https:\/\/github.com\/adrinjalali\">Adrin Jalali<\/a>, base implementation <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/22083\">PR#22083<\/a><\/li>\n<\/ul>\n\n\n\n<p>For both, I did one of the 2 
obligatory reviews. Then, perhaps the most technically challenging review I can remember was of:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/18394\">ENH Add Categorical support for HistGradientBoosting<\/a> from Thomas Fan, v0.24 (MajorFeature)<\/li>\n<\/ul>\n\n\n\n<p>Keep in mind that review(er)s are by far the scarcest resource of scikit-learn.<\/p>\n\n\n\n<p>I would also like to mention <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/25753\">PR#25753<\/a>, which changed the governance to be more inclusive, in particular with respect to voting rights.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"6-lessons-learned\">Lessons learned<\/h3>\n\n\n\n<p>Just before the end, a few critical words must be allowed.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Scikit-learn focuses a lot on stability. For some items on my wish list, landing in scikit-learn would again have taken years. This time, I decided to release my own library <a href=\"https:\/\/lorentzenchr.github.io\/model-diagnostics\/\">model-diagnostics<\/a>, and I enjoy the freedom to use cutting-edge components like polars.<\/li>\n\n\n\n<li>As a part-time statistician, I consider certain design choices, like classifiers&#8217; <code>predict<\/code> implicitly using a 50% threshold instead of returning a predicted probability (which <code>predict_proba<\/code> does), a bit poor. Hard to change!!! At least, <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/pull\/26120\">PR#26120<\/a> might improve that to some extent.<\/li>\n\n\n\n<li>I ponder the pipeline concept a lot. At first, it was an eye-opener for me to think of feature preprocessing as part of the estimator. The scikit-learn API is built around the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/compose.html#pipelines-and-composite-estimators\">pipeline design<\/a> with <code>fit<\/code>, <code>transform<\/code> and <code>predict<\/code>. 
But modern model classes like gradient boosted trees (XGBoost, LightGBM, HGBT) don&#8217;t need a preprocessing pipeline anymore; e.g., they can natively deal with categorical features and missing values. Yet it is hard to pass the information about which features to treat as categorical through a pipeline, see <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/18894\">scikit-learn#18894<\/a>.<\/li>\n\n\n\n<li>It is still a very painful experience to specify design matrices of linear models, in particular interaction terms, see <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/15263\">scikit-learn#15263<\/a>, <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/19533\">#19533<\/a> and <a href=\"https:\/\/github.com\/scikit-learn\/scikit-learn\/issues\/25412\">#25412<\/a>. Doing that in a pipeline with a <code>ColumnTransformer<\/code> is just very complicated and prohibits a lot of optimizations (mostly for categoricals)\u2014which is one of the reasons glum is faster.<\/li>\n<\/ul>\n\n\n\n<p>One of the greatest rewards of this journey was that I learned a lot: about Python, machine learning, rigorous reviews, CI\/CD, open source communities, and endurance. But even more so, I had the pleasure of meeting and working with some kind, brilliant and gifted people like Roman Yurchak, Alexandre Gramfort, Olivier Grisel, Thomas Fan, Nicolas Hug, Adrin Jalali and many more. I am really grateful to be part of something bigger than the sum of its parts. <\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this post, I&#8217;d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. 
My hope is that it encourages others to start or to keep contributing and have endurance for bigger picture changes.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[6],"class_list":["post-1346","post","type-post","status-publish","format-standard","hentry","category-stories","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Christian Lorentzen","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/christian\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1346","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1346"}],"version-history":[{"count":38,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1346\/revisions"}],"predecessor-version":[{"id":1388,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1346\/revisions\/1388"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1346"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1346"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}