{"id":1880,"date":"2025-04-30T12:51:39","date_gmt":"2025-04-30T10:51:39","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1880"},"modified":"2025-05-02T09:12:48","modified_gmt":"2025-05-02T07:12:48","slug":"fast-grouped-counts-and-means-in-r","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2025\/04\/30\/fast-grouped-counts-and-means-in-r\/","title":{"rendered":"Fast Grouped Counts and Means in R"},"content":{"rendered":"\n<p><em>Edited on 2025-05-01: Multiple improvements by Christian, especially on making Polars neater, DuckDB faster, and the plot easier to read.<\/em><\/p>\n\n\n\n<p>From time to time, the following questions pop up: <\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>How to calculate grouped counts and (weighted) means?<\/li>\n\n\n\n<li>What are fast ways to do it in R?<\/li>\n<\/ol>\n\n\n\n<p>This blog post presents a couple of approaches and then compares their speed with a naive (non-scientific!) benchmark.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"567\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-10-1024x567.png\" alt=\"\" class=\"wp-image-1933\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-10-1024x567.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-10-300x166.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-10-768x425.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-10.png 1306w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Base R<\/h2>\n\n\n\n<p>There are many ways to calculate grouped counts and means in base R, e.g., <code>aggregate()<\/code>, <code>tapply()<\/code>, <code>by()<\/code>, <code>split()<\/code> + <code>lapply()<\/code>. In my experience, the fastest way is a combination of <code>tabulate()<\/code> and <code>rowsum()<\/code>.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-d259cf76-b8ba-4f1c-8a96-620efada4628\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-d259cf76-b8ba-4f1c-8a96-620efada4628-tab-0\" aria-controls=\"ub-tabbed-content-d259cf76-b8ba-4f1c-8a96-620efada4628-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-d259cf76-b8ba-4f1c-8a96-620efada4628-panel-0\" aria-labelledby=\"ub-tabbed-content-d259cf76-b8ba-4f1c-8a96-620efada4628-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'># Make data\nset.seed(1)\n\nn &lt;- 1e6\n\ny &lt;- rexp(n)\nw &lt;- runif(n)\ng &lt;- factor(sample(LETTERS[1:3], n, TRUE))\ndf &lt;- data.frame(y = y, g = g, w = w)\n\n# Grouped counts\ntabulate(g)\n# 333469 333569 332962\n\n# Grouped means\nrowsum(y, g) \/ tabulate(g)\n      [,1]\n# A 1.000869\n# B 1.001043\n# C 1.000445\n\n# Grouped weighted mean\nws &lt;- rowsum(data.frame(y = y * w, w), g)\nws[, 1L] \/ ws[, 2L]\n# 1.0022749 1.0017816 0.9997058<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>But: <code>tabulate()<\/code> ignores missing values. To avoid problems, create an explicit missing level via <code>factor(x, exclude = NULL<\/code>).<\/p>\n\n\n\n<p>Let&#8217;s turn to some other approaches.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">dplyr<\/h2>\n\n\n\n<p>Not optimized for speed or memory, but the de-facto standard in data processing with R. I love its syntax.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-7665d42f-4d17-43c0-90e3-1880d60dcd52\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-7665d42f-4d17-43c0-90e3-1880d60dcd52-tab-0\" aria-controls=\"ub-tabbed-content-7665d42f-4d17-43c0-90e3-1880d60dcd52-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-7665d42f-4d17-43c0-90e3-1880d60dcd52-panel-0\" aria-labelledby=\"ub-tabbed-content-7665d42f-4d17-43c0-90e3-1880d60dcd52-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(tidyverse)\n\ndf &lt;- tibble(df)\n\n# Grouped counts\ndplyr::count(df, g)\n\n# Grouped means\ndf |&gt;\n  group_by(g) |&gt;\n  summarize(mean(y))\n\n# Grouped weighted means\ndf |&gt;\n  group_by(g) |&gt;\n  summarize(sum(w * y) \/ sum(w))<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\"><a href=\"https:\/\/github.com\/Rdatatable\/data.table\">data.table<\/a><\/h2>\n\n\n\n<p>Does not need an introduction. Since 2006 <em>the<\/em> package for fast data manipulation written in C.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-6a965841-29e0-4e47-bbc5-0cfd6520b8a2\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-6a965841-29e0-4e47-bbc5-0cfd6520b8a2-tab-0\" aria-controls=\"ub-tabbed-content-6a965841-29e0-4e47-bbc5-0cfd6520b8a2-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-6a965841-29e0-4e47-bbc5-0cfd6520b8a2-panel-0\" aria-labelledby=\"ub-tabbed-content-6a965841-29e0-4e47-bbc5-0cfd6520b8a2-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(data.table)\n\ndt &lt;- data.table(df)\n\n# Grouped counts (use keyby for sorted output)\ndt[, .N, by = g]\n#         g      N\n#    &lt;fctr&gt;  &lt;int&gt;\n# 1:      C 332962\n# 2:      B 333569\n# 3:      A 333469\n\n# Grouped means\ndt[, mean(y), by = g]\n\n# Grouped weighted means\ndt[, sum(w * y) \/ sum(w), by = g]\ndt[, weighted.mean(y, w), by = g]<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\">DuckDB<\/h2>\n\n\n\n<p>Extremely powerful query engine \/ database system written in C++, with initial release in 2019, and R bindings since 2020. Allows larger-than-RAM calculations. <\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-5088590b-6f0b-4ec8-9a07-379bec35617f\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-5088590b-6f0b-4ec8-9a07-379bec35617f-tab-0\" aria-controls=\"ub-tabbed-content-5088590b-6f0b-4ec8-9a07-379bec35617f-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-5088590b-6f0b-4ec8-9a07-379bec35617f-panel-0\" aria-labelledby=\"ub-tabbed-content-5088590b-6f0b-4ec8-9a07-379bec35617f-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(duckdb)\n\ncon &lt;- dbConnect(duckdb())\n\n# only registers: duckdb_register(con, name = \"df\", df = df)\ndbWriteTable(con, name = \"df\", value = df)\n\ndbGetQuery(con, \"SELECT g, COUNT(*) N FROM df GROUP BY g\")\ndbGetQuery(con, \"SELECT g, AVG(y) AS mean FROM df GROUP BY g\")\ncon |&gt; \n  dbGetQuery(\n  \"\n  SELECT g, SUM(y * w) \/ sum(w) as wmean\n  FROM df\n  GROUP BY g\n  \"\n  )\n#   g     wmean\n# 1 A 1.0022749\n# 2 B 1.0017816\n# 3 C 0.9997058<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\">collapse<\/h2>\n\n\n\n<p>C\/C++-based package for data transformation and statistical computing. {collapse} was initially released on CRAN in 2020. It can do much more than grouped calculations, check it out!<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-7678479d-f30b-416a-971a-39557ba36a98\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-7678479d-f30b-416a-971a-39557ba36a98-tab-0\" aria-controls=\"ub-tabbed-content-7678479d-f30b-416a-971a-39557ba36a98-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-7678479d-f30b-416a-971a-39557ba36a98-panel-0\" aria-labelledby=\"ub-tabbed-content-7678479d-f30b-416a-971a-39557ba36a98-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(collapse)\n\nfcount(g)\nfnobs(g, g) # Faster and does not need memory, but ignores missing values\nfmean(y, g = g)\nfmean(y, g = g, w = w)\n#         A         B         C\n# 1.0022749 1.0017816 0.9997058<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\">Polars<\/h2>\n\n\n\n<p>R bindings of the fantastic <a href=\"https:\/\/github.com\/pola-rs\/r-polars\">Polars project<\/a> that started in 2020. First R release in 2022. Currently under heavy revision.<\/p>\n\n\n\n<p>The current package is not up-to-date with the main project, thus we expect the revised version (available in this <a href=\"https:\/\/github.com\/pola-rs\/r-polars\/tree\/next\">branch<\/a>)  to be faster.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-ae9b416a-1d82-4e99-ab3d-17219ed78cca\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-ae9b416a-1d82-4e99-ab3d-17219ed78cca-tab-0\" aria-controls=\"ub-tabbed-content-ae9b416a-1d82-4e99-ab3d-17219ed78cca-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-ae9b416a-1d82-4e99-ab3d-17219ed78cca-panel-0\" aria-labelledby=\"ub-tabbed-content-ae9b416a-1d82-4e99-ab3d-17219ed78cca-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'># Sys.setenv(NOT_CRAN = \"true\")\n# install.packages(\"polars\", repos = \"https:\/\/community.r-multiverse.org\")\nlibrary(polars)\n\ndfp &lt;- as_polars_df(df)\n\n# Grouped counts\ndfp$get_column(\"g\")$value_counts()\n# Faster, but eats more memory\ndfp$select(\"g\")$with_columns(pl$lit(1L))$group_by(\"g\")$sum()\n\n# Grouped means\n(\n  dfp\n  $group_by(\"g\")\n  $agg(pl$col(\"y\")$mean())\n)\n\n# Grouped weighted means\n(\n  dfp\n  $group_by(\"g\")\n  $agg((pl$col(\"y\") * pl$col(\"w\"))$sum() \/ pl$col(\"w\")$sum())\n)\n# shape: (3, 2)\n# \u250c\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n# \u2502 g   \u2506 y        \u2502\n# \u2502 --- \u2506 ---      \u2502\n# \u2502 cat \u2506 f64      \u2502\n# \u255e\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2561\n# \u2502 C   \u2506 0.999706 \u2502\n# \u2502 B   \u2506 1.001782 \u2502\n# \u2502 A   \u2506 1.002275 \u2502\n# \u2514\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\">Naive Benchmark<\/h2>\n\n\n\n<p>Let&#8217;s compare the speed of these approaches for sample sizes up to 10^8 using a Windows system with an Intel i7-13700H CPU.<\/p>\n\n\n\n<p>This is not at all meant as a scientific benchmark!<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-359c1458-90d1-4cd1-8be2-b4a4286fdd9a\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-359c1458-90d1-4cd1-8be2-b4a4286fdd9a-tab-0\" aria-controls=\"ub-tabbed-content-359c1458-90d1-4cd1-8be2-b4a4286fdd9a-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-359c1458-90d1-4cd1-8be2-b4a4286fdd9a-panel-0\" aria-labelledby=\"ub-tabbed-content-359c1458-90d1-4cd1-8be2-b4a4286fdd9a-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'># We run the code in a fresh session\nlibrary(tidyverse) # 2.0.0\nlibrary(duckdb) # 1.2.1\nlibrary(data.table) # 1.16.0\nlibrary(collapse) # 2.0.19\nlibrary(polars) # 0.22.3\n\npolars_info() # 8 threads\nsetDTthreads(8)\ncon &lt;- dbConnect(duckdb(config = list(threads = \"8\")))\n\nset.seed(1)\n\nN &lt;- 10^(5:8)\nm_queries &lt;- 3\nresults &lt;- vector(\"list\", length(N) * m_queries)\n\nfor (i in seq_along(N)) {\n  n &lt;- N[i]\n\n  # Create data\n  y &lt;- rexp(n)\n  w &lt;- runif(n)\n  g &lt;- factor(sample(LETTERS, n, TRUE))\n\n  df &lt;- tibble(y = y, g = g, w = w)\n  dt &lt;- data.table(df)\n  dfp &lt;- as_polars_df(df)\n  dbWriteTable(con, name = \"df\", value = df, overwrite = TRUE)\n\n  # Grouped counts\n  results[[1 + (i - 1) * m_queries]] &lt;- bench::mark(\n    base = tabulate(g),\n    dplyr = dplyr::count(df, g),\n    data.table = dt[, .N, by = g],\n    polars = dfp$get_column(\"g\")$value_counts(),\n    collapse = fcount(g),\n    duckdb = dbGetQuery(con, \"SELECT g, COUNT(*) N FROM df GROUP BY g\"),\n    check = FALSE,\n    min_iterations = 3,\n  ) |&gt;\n    bind_cols(n = n, query = \"counts\")\n\n  results[[2 + (i - 1) * m_queries]] &lt;- bench::mark(\n    base = rowsum(y, g) \/ tabulate(g),\n    dplyr = df |&gt; group_by(g) |&gt; summarize(mean(y)),\n    data.table = dt[, mean(y), by = g],\n    polars = dfp$group_by(\"g\")$agg(pl$col(\"y\")$mean()),\n    collapse = fmean(y, g = g),\n    duckdb = dbGetQuery(con, \"SELECT g, AVG(y) AS mean FROM df GROUP BY g\"),\n    check = FALSE,\n    min_iterations = 3\n  ) |&gt;\n    bind_cols(n = n, query = \"means\")\n\n  results[[3 + (i - 1) * m_queries]] &lt;- bench::mark(\n    base = {\n      ws &lt;- rowsum(data.frame(y = y * w, w), g)\n      ws[, 1L] \/ ws[, 2L]\n    },\n    dplyr = df |&gt; group_by(g) |&gt; summarize(sum(w * y) \/ sum(w)),\n    data.table = dt[, sum(w * y) \/ sum(w), by = g],\n    polars = (\n      dfp\n      $group_by(\"g\")\n      $agg((pl$col(\"y\") * pl$col(\"w\"))$sum() \/ pl$col(\"w\")$sum())\n    ),\n    collapse = fmean(y, g = g, w = w),\n    duckdb = dbGetQuery(\n      con,\n      \"SELECT g, SUM(y * w) \/ sum(w) as wmean FROM df GROUP BY g\"\n    ),\n    check = FALSE,\n    min_iterations = 3\n  ) |&gt;\n    bind_cols(n = n, query = \"weighted means\")\n}\n\nresults_df &lt;- bind_rows(results) |&gt;\n  group_by(n, query) |&gt;\n  mutate(\n    time = as.numeric(median) * 1000, # ms\n    n = as.factor(n),\n    approach = as.character(expression),\n    relative = as.numeric(time \/ min(time))\n  ) |&gt;\n  ungroup()\n\nggplot(\n  results_df, aes(y = time, x = n, group = approach, color = approach)\n) +\n  geom_point() +\n  geom_line() +\n  scale_y_log10(labels = scales::label_number()) +\n  facet_wrap(\"query\") +\n  labs(x = \"n\", y = \"Time [ms]\", color = element_blank()) +\n  theme_gray(base_size = 14)\n\n<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"567\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-9-1024x567.png\" alt=\"\" class=\"wp-image-1932\" srcset=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-9-1024x567.png 1024w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-9-300x166.png 300w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-9-768x425.png 768w, https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/04\/image-9.png 1306w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Memory<\/h2>\n\n\n\n<p>What about memory? {dplyr}, {data.table}, and <code>rowsum()<\/code> require a lot of it, as does <code>collapse::fcount()<\/code>. For the other approaches, almost no memory is required, or profmem can&#8217; t measure it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Final words<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>{duckdb} is increadibly fast for large data.<\/li>\n\n\n\n<li>{collapse} is increadibly fast for all sample sizes. In <a href=\"https:\/\/h2oai.github.io\/db-benchmark\/\">other benchmarks<\/a>, it is slower because there, the grouping has to be a string rather than a factor.<\/li>\n\n\n\n<li>{polars} looks really cool.<\/li>\n\n\n\n<li><code>rowsum()<\/code> and <code>tabulate()<\/code> provide fast solutions with base R.<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2025-04-30%20grouped_sums_r.R\">R scrip<\/a>t<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Grouped counts and grouped (weighted)<br \/>\nmeans? How to calculate them? And how to do it as fast as possible?<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[27,16,17],"tags":[5],"class_list":["post-1880","post","type-post","status-publish","format-standard","hentry","category-data","category-machine-learning","category-programming","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1880","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1880"}],"version-history":[{"count":9,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1880\/revisions"}],"predecessor-version":[{"id":1936,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1880\/revisions\/1936"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1880"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1880"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1880"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}