{"id":1874,"date":"2025-03-30T13:43:16","date_gmt":"2025-03-30T11:43:16","guid":{"rendered":"https:\/\/lorentzen.ch\/?p=1874"},"modified":"2025-03-30T13:43:17","modified_gmt":"2025-03-30T11:43:17","slug":"converting-arbitrarily-large-csvs-to-parquet-with-r","status":"publish","type":"post","link":"https:\/\/lorentzen.ch\/index.php\/2025\/03\/30\/converting-arbitrarily-large-csvs-to-parquet-with-r\/","title":{"rendered":"Converting arbitrarily large CSVs to Parquet with R"},"content":{"rendered":"\n<p>In this <a href=\"https:\/\/lorentzen.ch\/index.php\/2025\/03\/22\/converting-csv-to-parquet\/\">recent post<\/a>, we have used <a href=\"https:\/\/pola.rs\/\"><strong>Polars<\/strong><\/a> and <a href=\"https:\/\/duckdb.org\/\"><strong>DuckDB<\/strong><\/a> to convert a large CSV file to Parquet in steaming mode &#8211; and Python. <\/p>\n\n\n\n<p>Different people have contacted me and asked: &#8220;and in R?&#8221;<\/p>\n\n\n\n<p>Simple answer: We have DuckDB, and we have different Polars bindings. Here, we are using {<a href=\"https:\/\/github.com\/pola-rs\/r-polars\">polars}<\/a> which is currently being overhauled into {neopandas}.<\/p>\n\n\n\n<p>So let&#8217;s not wait any longer!<\/p>\n\n\n\n<p><img decoding=\"async\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/03\/image-3.png\" alt=\"\"><br><img decoding=\"async\" src=\"https:\/\/lorentzen.ch\/wp-content\/uploads\/2025\/03\/image-2.png\" alt=\"\"><\/p>\n\n\n\n<p>Run times are on a Windows system with an Intel i7-13700H CPU. <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Generate 2.2 GB csv file<\/h2>\n\n\n\n<p>We use {data.table} to dump a randomly generated dataset with 100 Mio rows into a csv file.<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-03f6b536-ed1b-4b0e-b53a-cfa4850e5003\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-03f6b536-ed1b-4b0e-b53a-cfa4850e5003-tab-0\" aria-controls=\"ub-tabbed-content-03f6b536-ed1b-4b0e-b53a-cfa4850e5003-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: left; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-03f6b536-ed1b-4b0e-b53a-cfa4850e5003-panel-0\" aria-labelledby=\"ub-tabbed-content-03f6b536-ed1b-4b0e-b53a-cfa4850e5003-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"R\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(data.table)\n\nset.seed(1)\n\nn &lt;- 1e8\n\ndf &lt;- data.frame(\n  X = sample(letters[1:3], n, TRUE),\n  Y = runif(n),\n  Z = sample(1:5, n, TRUE)\n)\n\nfwrite(df, \"data.csv\")<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<h2 class=\"wp-block-heading\">DuckDB<\/h2>\n\n\n\n<p>Then, we use <a href=\"https:\/\/duckdb.org\/\">DuckDB<\/a> to fire a query to the file and stream the result into Parquet. <\/p>\n\n\n\n<p>Threads and RAM can be set on the fly, which is very convenient. Setting a low memory limit (e.g., 500 MB) will work &#8211; try it out!<\/p>\n\n\n<div class=\"wp-block-ub-tabbed-content wp-block-ub-tabbed-content-holder wp-block-ub-tabbed-content-horizontal-holder-mobile wp-block-ub-tabbed-content-horizontal-holder-tablet\" id=\"ub-tabbed-content-28dd201b-d12f-401b-9254-3e42c9ca1a73\" style=\"\">\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-holder horizontal-tab-width-mobile horizontal-tab-width-tablet\">\n\t\t\t\t<div role=\"tablist\" class=\"wp-block-ub-tabbed-content-tabs-title wp-block-ub-tabbed-content-tabs-title-mobile-horizontal-tab wp-block-ub-tabbed-content-tabs-title-tablet-horizontal-tab\" style=\"justify-content: flex-start; \"><div role=\"tab\" id=\"ub-tabbed-content-28dd201b-d12f-401b-9254-3e42c9ca1a73-tab-0\" aria-controls=\"ub-tabbed-content-28dd201b-d12f-401b-9254-3e42c9ca1a73-panel-0\" aria-selected=\"true\" class=\"wp-block-ub-tabbed-content-tab-title-wrap active\" style=\"--ub-tabbed-title-background-color: #6d6d6d; --ub-tabbed-active-title-color: inherit; --ub-tabbed-active-title-background-color: #6d6d6d; text-align: center; \" tabindex=\"-1\">\n\t\t\t\t<div class=\"wp-block-ub-tabbed-content-tab-title\">R<\/div>\n\t\t\t<\/div><\/div>\n\t\t\t<\/div>\n\t\t\t<div class=\"wp-block-ub-tabbed-content-tabs-content\" style=\"\"><div role=\"tabpanel\" class=\"wp-block-ub-tabbed-content-tab-content-wrap active\" id=\"ub-tabbed-content-28dd201b-d12f-401b-9254-3e42c9ca1a73-panel-0\" aria-labelledby=\"ub-tabbed-content-28dd201b-d12f-401b-9254-3e42c9ca1a73-tab-0\" tabindex=\"0\">\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting='{\"showPanel\":true,\"languageLabel\":\"language\",\"fullScreenButton\":true,\"copyButton\":true,\"mode\":\"r\",\"mime\":\"text\/x-rsrc\",\"theme\":\"material\",\"lineNumbers\":false,\"styleActiveLine\":false,\"lineWrapping\":false,\"readOnly\":true,\"fileName\":\"\",\"language\":\"R\",\"maxHeight\":\"400px\",\"modeName\":\"r\"}'>library(duckdb)\n\ncon &lt;- dbConnect(duckdb(config = list(threads = \"8\", memory_limit = \"4GB\")))\n\nsystem.time( # 3.5s\n  dbSendQuery(\n    con,\n    \"\n    COPY (\n      SELECT Y, Z\n      FROM 'data.csv'\n      WHERE X == 'a'\n      ORDER BY Y\n    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd)\n    \"\n  )\n)\n\n# Check\ndbGetQuery(con, \"SELECT COUNT(*) N FROM 'data.parquet'\") # 33329488\ndbGetQuery(con, \"SELECT * FROM 'data.parquet' LIMIT 5\")\n#              Y Z\n# 1 5.355105e-09 4\n# 2 9.080395e-09 5\n# 3 2.258457e-08 2\n# 4 3.445894e-08 2\n# 5 6.891787e-08 1<\/pre><\/div>\n\n<\/div><\/div>\n\t\t<\/div>\n\n\n<p>3.5 seconds &#8211; wow! The resulting file looks good. It is 125 MB large.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Polars<\/h2>\n\n\n\n<p>Let&#8217;s do the same with Polars.<\/p>\n\n\n\n<div class=\"wp-block-codemirror-blocks-code-block code-block\"><pre class=\"CodeMirror\" data-setting=\"{&quot;showPanel&quot;:true,&quot;languageLabel&quot;:&quot;language&quot;,&quot;fullScreenButton&quot;:true,&quot;copyButton&quot;:true,&quot;mode&quot;:&quot;r&quot;,&quot;mime&quot;:&quot;text\/x-rsrc&quot;,&quot;theme&quot;:&quot;material&quot;,&quot;lineNumbers&quot;:false,&quot;styleActiveLine&quot;:false,&quot;lineWrapping&quot;:false,&quot;readOnly&quot;:true,&quot;fileName&quot;:&quot;&quot;,&quot;language&quot;:&quot;R&quot;,&quot;maxHeight&quot;:&quot;400px&quot;,&quot;modeName&quot;:&quot;r&quot;}\"># Sys.setenv(NOT_CRAN = &quot;true&quot;)\n# install.packages(&quot;polars&quot;, repos = &quot;https:\/\/community.r-multiverse.org&quot;)\nlibrary(polars)\n\npolars_info()\n\nsystem.time( # 9s\n  (\n    pl$scan_csv(&quot;data.csv&quot;)\n    $filter(pl$col(&quot;X&quot;) == &quot;a&quot;)\n    $drop(&quot;X&quot;)\n    $sort(&quot;Y&quot;)\n    $sink_parquet(&quot;data.parquet&quot;, row_group_size = 1e5)\n  )\n)\n\n# Check\npl$scan_parquet(&quot;data.parquet&quot;)$head()$collect()\n# shape: (5, 2)\n# \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2510\n# \u2502 Y         \u2506 Z   \u2502\n# \u2502 ---       \u2506 --- \u2502\n# \u2502 f64       \u2506 i64 \u2502\n# \u255e\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u2550\u256a\u2550\u2550\u2550\u2550\u2550\u2561\n# \u2502 5.3551e-9 \u2506 4   \u2502\n# \u2502 9.0804e-9 \u2506 5   \u2502\n# \u2502 2.2585e-8 \u2506 2   \u2502\n# \u2502 3.4459e-8 \u2506 2   \u2502\n# \u2502 6.8918e-8 \u2506 1   \u2502\n# \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2518<\/pre><\/div>\n\n\n\n<p>With nine seconds, it is slower than DuckDB. But the output looks as expected and has the same size as with DuckDB.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Final words<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>With DuckDB or Polars, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.<\/li>\n\n\n\n<li>We can apply filters, selects, sorts etc. on the fly.<\/li>\n\n\n\n<li>Let&#8217;s keep an eye on Polars in R. It looks really interesting.<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/github.com\/lorentzenchr\/notebooks\/blob\/master\/blogposts\/2025-03-30%20csv%20to%20parquet.R\">R script<\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>How to convert that huge csv to parquet? Now in R!<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[27,16,17],"tags":[5],"class_list":["post-1874","post","type-post","status-publish","format-standard","hentry","category-data","category-machine-learning","category-programming","tag-r"],"featured_image_src":null,"author_info":{"display_name":"Michael Mayer","author_link":"https:\/\/lorentzen.ch\/index.php\/author\/michael\/"},"_links":{"self":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1874","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/comments?post=1874"}],"version-history":[{"count":3,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1874\/revisions"}],"predecessor-version":[{"id":1879,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/posts\/1874\/revisions\/1879"}],"wp:attachment":[{"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/media?parent=1874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/categories?post=1874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lorentzen.ch\/index.php\/wp-json\/wp\/v2\/tags?post=1874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}