{"id":984,"date":"2022-10-31T20:48:49","date_gmt":"2022-10-31T19:48:49","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=984"},"modified":"2024-02-13T23:19:47","modified_gmt":"2024-02-13T22:19:47","slug":"histograms-gradient-boosted-trees-group-by-queries-and-one-hot-encoding","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2022\/10\/31\/histograms-gradient-boosted-trees-group-by-queries-and-one-hot-encoding\/","title":{"rendered":"Histograms, Gradient Boosted Trees, Group-By Queries and One-Hot Encoding"},"content":{"rendered":"\n<p>This post shows how filling histograms can be done in very different ways thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let&#8217;s jump into it!<\/p>\n\n\n\n<p>Modern gradient boosted trees (GBT) like <a href=\"https:\/\/lightgbm.readthedocs.io\">LightGBM<\/a>, <a href=\"https:\/\/xgboost.readthedocs.io\">XGBoost<\/a> and the <a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor\">HistGradientBoostingRegressor<\/a> of scikit-learn all use two techniques on top of standard gradient boosting:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>2nd order Taylor expansion of the loss which amounts to using gradients and hessians.<\/li>\n\n\n\n<li>One histogram per feature: bin the feature and fill the histogram with the gradients and hessians.<\/li>\n<\/ul>\n\n\n\n<p>The filling of the histograms is often the bottleneck when fitting GBTs. While filling a single histogram is very fast, this operation is executed many times: for each boosting round, for each tree split and for each feature. This is the reason why GBT implementations have dedicated routines for it. 
We look into this operation from different angles.<\/p>\n\n\n\n<p>For the coming (I)Python code snippets to work (<code># %%<\/code> indicates a new notebook cell), we need the following imports.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import duckdb                    # v0.5.1\nimport matplotlib.pyplot as plt  # v.3.6.1\nfrom matplotlib.ticker import MultipleLocator\nimport numpy as np               # v1.23.4\nimport pandas as pd              # v1.5.0\nimport pyarrow as pa             # v9.0.0\nimport tabmat                    # v3.1.2\n\nfrom sklearn.ensemble._hist_gradient_boosting.histogram import (\n    _build_histogram_root,\n)                                # v1.1.2\nfrom sklearn.ensemble._hist_gradient_boosting.common import (\n  HISTOGRAM_DTYPE\n)<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Naive Histogram Visualisation<\/h2>\n\n\n\n<p>As a starter, we create a small table with two columns: bin index and value of the hessian.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" 
data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">def highlight(df):\n    if df[&quot;bin&quot;] == 0:\n        return [&quot;background-color: rgb(255, 128, 128)&quot;] * len(df)\n    elif df[&quot;bin&quot;] == 1:\n        return [&quot;background-color: rgb(128, 255, 128)&quot;] * len(df)\n    else:\n        return ['background-color: rgb(128, 128, 255)'] * len(df)\n\ndf = pd.DataFrame({&quot;bin&quot;: [0, 2, 1, 0, 1], &quot;hessian&quot;: [1.5, 1, 2, 2.5, 3]})\ndf.style.apply(highlight, axis=1)<\/pre><\/div>\n\n\n\n<style type=\"text\/css\">\n#T_9aec4_row0_col0, #T_9aec4_row0_col1, #T_9aec4_row3_col0, #T_9aec4_row3_col1 {\n  background-color: rgb(255, 128, 128);\n}\n#T_9aec4_row1_col0, #T_9aec4_row1_col1 {\n  background-color: rgb(128, 128, 255);\n}\n#T_9aec4_row2_col0, #T_9aec4_row2_col1, #T_9aec4_row4_col0, #T_9aec4_row4_col1 {\n  background-color: rgb(128, 255, 128);\n}\n<\/style>\n<table id=\"T_9aec4\">\n  <thead>\n    <tr>\n      <th class=\"blank level0\">&nbsp;<\/th>\n      <th id=\"T_9aec4_level0_col0\" class=\"col_heading level0 col0\">bin<\/th>\n      <th id=\"T_9aec4_level0_col1\" class=\"col_heading level0 col1\">hessian<\/th>\n    <\/tr>\n  <\/thead>\n  <tbody>\n    <tr>\n      <th id=\"T_9aec4_level0_row0\" class=\"row_heading level0 row0\">0<\/th>\n      <td id=\"T_9aec4_row0_col0\" class=\"data row0 col0\">0<\/td>\n      <td id=\"T_9aec4_row0_col1\" class=\"data row0 col1\">1.500000<\/td>\n    <\/tr>\n    <tr>\n      <th id=\"T_9aec4_level0_row1\" 
class=\"row_heading level0 row1\">1<\/th>\n      <td id=\"T_9aec4_row1_col0\" class=\"data row1 col0\">2<\/td>\n      <td id=\"T_9aec4_row1_col1\" class=\"data row1 col1\">1.000000<\/td>\n    <\/tr>\n    <tr>\n      <th id=\"T_9aec4_level0_row2\" class=\"row_heading level0 row2\">2<\/th>\n      <td id=\"T_9aec4_row2_col0\" class=\"data row2 col0\">1<\/td>\n      <td id=\"T_9aec4_row2_col1\" class=\"data row2 col1\">2.000000<\/td>\n    <\/tr>\n    <tr>\n      <th id=\"T_9aec4_level0_row3\" class=\"row_heading level0 row3\">3<\/th>\n      <td id=\"T_9aec4_row3_col0\" class=\"data row3 col0\">0<\/td>\n      <td id=\"T_9aec4_row3_col1\" class=\"data row3 col1\">2.500000<\/td>\n    <\/tr>\n    <tr>\n      <th id=\"T_9aec4_level0_row4\" class=\"row_heading level0 row4\">4<\/th>\n      <td id=\"T_9aec4_row4_col0\" class=\"data row4 col0\">1<\/td>\n      <td id=\"T_9aec4_row4_col1\" class=\"data row4 col1\">3.000000<\/td>\n    <\/tr>\n  <\/tbody>\n<\/table>\n\n\n\n<p>A histogram then sums up all the hessian values belonging to the same bin. The result looks like the following.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"368\" height=\"316\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2022\/10\/image.png\" alt=\"\" class=\"wp-image-991\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2022\/10\/image.png 368w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2022\/10\/image-300x258.png 300w\" sizes=\"auto, (max-width: 368px) 100vw, 368px\" \/><figcaption class=\"wp-element-caption\">Above table visualised as histogram<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Dedicated Method<\/h2>\n\n\n\n<p>We simulate filling the histogram of a single feature. 
To this end, we draw 1,000,000 random variables for gradients and hessians as well as the bin indices.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">import duckdb\nimport pyarrow as pa\nimport numpy as np\nimport tabmat\n\nfrom sklearn.ensemble._hist_gradient_boosting.histogram import (\n    _build_histogram_root,\n)\nfrom sklearn.ensemble._hist_gradient_boosting.common import HISTOGRAM_DTYPE\n\n\nrng = np.random.default_rng(42)\nn_obs = 1_000_000\nn_bins = 256\nbinned_feature = rng.integers(0, n_bins, size=n_obs, dtype=np.uint8)\ngradients = rng.normal(size=n_obs).astype(np.float32)\nhessians = rng.lognormal(size=n_obs).astype(np.float32)<\/pre><\/div>\n\n\n\n<p>Now we use the dedicated (and private!) 
and single-threaded method <code>_build_histogram_root<\/code> from scikit-learn to fill a histogram.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">hist_root = np.zeros((1, n_bins), dtype=HISTOGRAM_DTYPE)\n%time _build_histogram_root(0, binned_feature, gradients, hessians, hist_root)\n# Wall time: 1.38 ms<\/pre><\/div>\n\n\n\n<p>This executes in around 1.4 ms, which is quite fast. But again, imagine 100 boosting rounds with 10 tree splits on average and 100 features. 
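<\/p>\n\n\n\n<p>A back-of-the-envelope calculation (with these hypothetical but plausible numbers) of the total cost:<\/p>\n\n\n\n

```python
# 100 boosting rounds, about 10 splits per round, 100 features
n_rounds, n_splits, n_features = 100, 10, 100
n_fills = n_rounds * n_splits * n_features  # 100_000 histogram fills
total_seconds = n_fills * 1.38e-3           # at about 1.38 ms per fill
# roughly 138 s, i.e. a bit over two minutes
```

<p>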
This means the histogram filling is executed around 100,000 times and would therefore take roughly two minutes.<\/p>\n\n\n\n<p>Let&#8217;s have a look at the first 5 bins:<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">hist_root[:, 0:5]<\/pre><\/div>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text\/plain&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}\">array([[(-79.72386998, 6508.89500265, 3894),\n        ( 37.98393589, 6460.63222205, 3998),\n        ( 53.54256977, 6492.22722797, 3805),\n        ( 21.19542398, 6797.34159299, 3928),\n        ( 16.24716742, 6327.03757573, 3875)]],\n      dtype=[('sum_gradients', '&lt;f8'), ('sum_hessians', '&lt;f8'), ('count', '&lt;u4')])<\/pre><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">SQL Group-By Query<\/h2>\n\n\n\n<p>Someone familiar with SQL and database queries might immediately see how this task can be formulated as an SQL 
group-by-aggregate query. To demonstrate it on our simulated data, we use <a href=\"https:\/\/duckdb.org\/\">DuckDB<\/a> as well as <a href=\"https:\/\/arrow.apache.org\/\">Apache Arrow<\/a> (the file format as well as the Python library <a href=\"https:\/\/arrow.apache.org\/docs\/python\">pyarrow<\/a>). You can read more about DuckDB in our post <a href=\"https:\/\/lorentzen.ch\/index.php\/2022\/04\/02\/duckdb-quacking-sql\/\" data-type=\"post\" data-id=\"745\">DuckDB: Quacking SQL<\/a>.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\"># %%\ncon = duckdb.connect()\narrow_table = pa.Table.from_pydict(\n    {\n        &quot;bin&quot;: binned_feature,\n        &quot;gradients&quot;: gradients,\n        &quot;hessians&quot;: hessians,\n})\n# Read data once to make timing fairer\narrow_result = con.execute(&quot;SELECT * FROM arrow_table&quot;)\n\n# %%\n%%time\narrow_result = con.execute(&quot;&quot;&quot;\nSELECT\n    bin as bin,\n    SUM(gradients) as sum_gradients,\n    SUM(hessians) as sum_hessians,\n    COUNT() as count\nFROM arrow_table\nGROUP BY bin\n&quot;&quot;&quot;).arrow()\n# Wall time: 6.52 ms<\/pre><\/div>\n\n\n\n<p>On my laptop, this takes about 6.5 ms and, upon sorting, gives the same results:<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" 
data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">arrow_result.sort_by(&quot;bin&quot;).slice(length=5)<\/pre><\/div>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text\/plain&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}\">pyarrow.Table\nbin: uint8\nsum_gradients: double\nsum_hessians: double\ncount: int64\n----\nbin: [[0,1,2,3,4]]\nsum_gradients: [[-79.72386997545254,37.98393589106854,53.54256977112527,21.195423980039777,16.247167424764484]]\nsum_hessians: [[6508.895002648234,6460.632222048938,6492.227227974683,6797.341592986137,6327.037575732917]]\ncount: [[3894,3998,3805,3928,3875]]<\/pre><\/div>\n\n\n\n<p>As the data is already available as an Arrow table, we can also stay entirely within pyarrow:<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" 
data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\">%%time\narrow_result = arrow_table.group_by(&quot;bin&quot;).aggregate(\n    [\n        (&quot;gradients&quot;, &quot;sum&quot;),\n        (&quot;hessians&quot;, &quot;sum&quot;),\n        (&quot;bin&quot;, &quot;count&quot;),\n    ]\n)\n# Wall time: 10.8 ms<\/pre><\/div>\n\n\n\n<p>The fact that DuckDB is faster than Arrow on this task might have to do with the large effort invested in parallelised group-by operations; see their post <a href=\"https:\/\/duckdb.org\/2022\/03\/07\/aggregate-hashtable.html\">Parallel Grouped Aggregation in DuckDB<\/a> for more information.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">One-Hot Encoded Matrix Multiplication<\/h2>\n\n\n\n<p>I think it is very interesting that filling histograms can be written as a matrix multiplication! The trick is to view the feature as a categorical feature and use its one-hot encoded matrix representation. This blows up memory, of course. 
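<\/p>\n\n\n\n<p>Why does this work? Each row of the one-hot encoded matrix has exactly one non-zero entry, so the matrix product collapses to a scatter-add over the bin indices. A small sketch of this equivalence in plain numpy, with hypothetical stand-in data rather than the simulation above:<\/p>\n\n\n\n

```python
import numpy as np

# Hypothetical stand-ins for a binned feature with its gradients and hessians
rng = np.random.default_rng(0)
n_obs, n_bins = 1_000, 16
binned_feature = rng.integers(0, n_bins, size=n_obs)
gradients = rng.normal(size=n_obs)
hessians = rng.lognormal(size=n_obs)
vec = np.column_stack((gradients, hessians, np.ones_like(gradients)))

# Dense one-hot matrix product ...
m_ohe = np.eye(n_bins)[binned_feature]  # shape (n_obs, n_bins)
result_matmul = m_ohe.T @ vec           # shape (n_bins, 3)

# ... equals an unbuffered scatter-add over the bin indices
result_scatter = np.zeros((n_bins, 3))
np.add.at(result_scatter, binned_feature, vec)

assert np.allclose(result_matmul, result_scatter)
```

<p>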
Note that one-hot encoding is usually encountered with generalized linear models (GLMs), where it serves to incorporate nominal categorical features, i.e. features without an internal ordering, into the design matrix.<\/p>\n\n\n\n<p>For our demonstration, we use a numpy index trick to construct the one-hot encoded matrix, employing the fact that the binned feature already contains the right indices.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\"># %%\n%%time\nm_OHE = np.eye(n_bins)[binned_feature].T\nvec = np.column_stack((gradients, hessians, np.ones_like(gradients)))\n# Wall time: 770 ms\n\n# %%\n%time result_ohe = m_OHE @ vec\n# Wall time: 199 ms\n\n# %%\nresult_ohe[:5]<\/pre><\/div>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text\/plain&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}\">array([[ -79.72386998, 6508.89500265, 3894.        
],\n       [  37.98393589, 6460.63222205, 3998.        ],\n       [  53.54256977, 6492.22722797, 3805.        ],\n       [  21.19542398, 6797.34159299, 3928.        ],\n       [  16.24716742, 6327.03757573, 3875.        ]])<\/pre><\/div>\n\n\n\n<p>This is much slower but, perhaps surprisingly, produces exactly the same result.<\/p>\n\n\n\n<p>The one-hot encoded matrix is very sparse, with only one non-zero value per column, i.e. only one out of 256 (the number of bins) values is non-zero. This structure can be exploited to reduce both CPU time and memory consumption, with the help of the package <a href=\"https:\/\/tabmat.readthedocs.io\/\">tabmat<\/a>, which was built to accelerate GLMs. Unfortunately, tabmat only provides a matrix-vector multiplication (and the sandwich product, of course), but no matrix-matrix multiplication. So we have to do a little extra work.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;python&quot;,&quot;mime&quot;:&quot;text\/x-python&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Python&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;python&quot;}\"># %%\n%time m_categorical = tabmat.CategoricalMatrix(cat_vec=binned_feature)\n# Wall time: 21.5 ms\n\n# %%\n# tabmat needs contiguous arrays with dtype = Python float = float64\nvec = np.asfortranarray(vec, dtype=float)\n\n# %%\n%%time\ntabmat_result = np.column_stack(\n    (\n        vec[:, 0] @ m_categorical,\n        vec[:, 1] @ m_categorical,\n        vec[:, 2] @ m_categorical,\n    )\n)\n# Wall time: 4.82 ms\n\n# %%\ntabmat_result[0:5]<\/pre><\/div>\n\n\n\n<div 
class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;null&quot;,&quot;mime&quot;:&quot;text\/plain&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;Plain Text&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;text&quot;}\">array([[ -79.72386998, 6508.89500265, 3894.        ],\n       [  37.98393589, 6460.63222205, 3998.        ],\n       [  53.54256977, 6492.22722797, 3805.        ],\n       [  21.19542398, 6797.34159299, 3928.        ],\n       [  16.24716742, 6327.03757573, 3875.        ]])<\/pre><\/div>\n\n\n\n<p>While the timing of this approach is quite good, the construction of a <code>CategoricalMatrix<\/code> requires more time than the matrix-vector multiplication.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p>In the end, the special (Cython) routine of scikit-learn ist the fastest of our tested methods for filling histograms. The other GBT libraries have their own even more specialised routines which might be a reason for even faster fit times. What we learned in this post is that this seemingly simple task plays a very crucial part in modern GBTs and can be accomplished by very different approaches. 
These different approaches uncover connections between algorithms from quite different domains.<\/p>\n\n\n\n<p>The full code is available as an IPython notebook at <a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2022-10-31%20histogram-GBT-GroupBy-OHE.ipynb\">https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2022-10-31%20histogram-GBT-GroupBy-OHE.ipynb<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post shows how filling histograms can be done in very different ways thereby connecting very different areas: from gradient boosted trees to SQL queries to one-hot encoding. Let&#8217;s jump into it! Modern gradient boosted trees (GBT) like LightGBM, XGBoost and the HistGradientBoostingRegressor of scikit-learn all use two techniques on top of standard gradient boosting: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[15,16,17],"tags":[6],"class_list":["post-984","post","type-post","status-publish","format-standard","hentry","category-linear-algebra","category-machine-learning","category-programming","tag-python"],"featured_image_src":null,"author_info":{"display_name":"Christian 
Lorentzen","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/christian\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=984"}],"version-history":[{"count":30,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/984\/revisions"}],"predecessor-version":[{"id":1456,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/984\/revisions\/1456"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}