Tag: Python

  • Building Strong GLMs in Python via ML + XAI

    We use Python to craft a strong GLM, guided by insights from a boosted trees model.

  • An Open Source Journey with Scikit-Learn

    In this post, I’d like to tell the story of my journey into the open source world of Python with a focus on scikit-learn. My hope is that it encourages others to start or keep contributing, and to have the endurance for bigger-picture changes.

  • Model Diagnostics in Python

    Version 1.0.0 of the new Python package model-diagnostics has just been released on PyPI.

  • Geographic SHAP

    “R Python” continued… Geographic SHAP

  • Quantiles And Their Estimation

    Applied statistics is dominated by the ubiquitous mean. For a change, this post is dedicated to quantiles. I will do my best to provide a good mix of theory and practical examples. While the mean describes only the central tendency of a distribution or random sample, quantiles are able to describe the whole distribution. They…
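    The contrast in the teaser can be sketched in a few lines: a single mean versus a grid of quantiles for the same sample. The distribution and sample size below are illustrative assumptions, not taken from the post.

    ```python
    import numpy as np

    # Illustrative right-skewed sample (gamma distribution, 1000 draws).
    rng = np.random.default_rng(42)
    x = rng.gamma(shape=2.0, scale=1.0, size=1000)

    # The mean is one number describing central tendency only...
    mean = x.mean()

    # ...while a grid of quantiles sketches the whole distribution.
    quantiles = np.quantile(x, [0.1, 0.25, 0.5, 0.75, 0.9])

    print(mean)
    print(quantiles)
    ```

    For a right-skewed sample like this one, the median (the 0.5-quantile) typically sits below the mean, which the quantile grid makes visible at a glance.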

  • Histograms, Gradient Boosted Trees, Group-By Queries and One-Hot Encoding

    This post shows how filling histograms can be done in very different ways, thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let’s jump into it! Modern gradient boosted trees (GBT) like LightGBM, XGBoost and the HistGradientBoostingRegressor of scikit-learn all use two techniques on top of standard gradient boosting:…
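    The connection the teaser hints at can be sketched as follows: the same histogram counts can be obtained by a classic histogram fill, by a group-by/count over bin indices (what a SQL engine would do), and by summing the columns of a one-hot matrix. The data and bin layout are illustrative assumptions.

    ```python
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)

    # 16 bins spanning the data; interior edges assign each point a bin index.
    edges = np.linspace(x.min(), x.max(), 17)
    bin_idx = np.digitize(x, edges[1:-1])  # indices 0..15

    # 1) Classic histogram fill.
    hist_np, _ = np.histogram(x, bins=edges)

    # 2) The same counts as a group-by/count over the bin index.
    hist_gb = (
        pd.Series(bin_idx)
        .value_counts()
        .reindex(range(16), fill_value=0)
        .sort_index()
        .to_numpy()
    )

    # 3) The same counts as column sums of a one-hot encoded bin index.
    onehot = np.eye(16)[bin_idx]          # shape (1000, 16)
    hist_oh = onehot.sum(axis=0)
    ```

    All three routes produce identical counts; they differ only in which computational primitive does the work.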

  • Kernel SHAP in R and Python

    “R Python” continued… Kernel SHAP

  • From Least Squares Benchmarks to the Marchenko–Pastur Distribution

    In this blog post, I tell the story how I learned about a theorem for random matrices of the two Ukrainian🇺🇦 mathematicians Vladimir Marchenko and Leonid Pastur. It all started with benchmarking least squares solvers in scipy. Setting the Stage for Least Squares Solvers Least squares starts with a matrix and a vector and one…

  • DuckDB: Quacking SQL

    “R Python” continued… DuckDB: Quacking SQL

  • Random Forests with Monotonic Constraints

    “R Python” continued… Random forests with monotonic constraints

  • Personal Highlights of Scikit-Learn 1.0

    Yes! After more than 10 years, scikit-learn released its 1.0 version on 24 September 2021. In this post, I’d like to point out some personal highlights apart from the release highlights. 1. Feature Names This one is listed in the release highlights, but deserves to be mentioned again. This is not yet available for all…

  • Feature Subsampling For Random Forest Regression

    TLDR: The number of subsampled features is a main source of randomness and an important parameter in random forests. Mind the different default values across implementations. Randomness in Random Forests Random forests are very popular machine learning models. They are built from easily understandable and well-visualizable decision trees and usually give good predictive performance…
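    The "mind the defaults" point can be sketched with scikit-learn, whose RandomForestRegressor defaults to using all features per split, while other implementations (e.g. R's randomForest) default to roughly a third of the features for regression. The dataset and hyperparameters below are illustrative assumptions.

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor

    X, y = make_regression(n_samples=200, n_features=10, random_state=0)

    # max_features controls how many features each split may consider;
    # setting it explicitly avoids surprises across implementations.
    rf_all = RandomForestRegressor(
        n_estimators=50, max_features=1.0, random_state=0
    ).fit(X, y)
    rf_third = RandomForestRegressor(
        n_estimators=50, max_features=1 / 3, random_state=0
    ).fit(X, y)

    print(rf_all.score(X, y), rf_third.score(X, y))
    ```

    Smaller max_features injects more randomness and decorrelates the trees, at the cost of each individual tree fitting the data less closely.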