Converting CSV to Parquet

Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, such as column selections, filters, or sorts.

pip install polars

pip install duckdb

pip install numpy  # used to generate the example data


Run times were measured on a normal laptop, dedicating 8 threads to the crunching.
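
For reproducibility, one way to pin Polars to exactly 8 threads is the POLARS_MAX_THREADS environment variable, which must be set before polars is imported. A minimal sketch (the final print is just a sanity check):

import os

# Must be set before the first `import polars`, otherwise the thread
# pool is already initialized with one thread per logical core.
os.environ["POLARS_MAX_THREADS"] = "8"

import polars as pl

print(pl.thread_pool_size())  # 8 (available in recent Polars releases)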

Let’s first generate a 2 GB CSV file:

import duckdb
import numpy as np
import polars as pl

n = 100_000_000

rng = np.random.default_rng(42)

df = pl.DataFrame(
    {
        "X": rng.choice(["a", "b", "c"], n),
        "Y": rng.uniform(0, 1, n),
        "Z": rng.choice([1, 2, 3, 4, 5], n),
    }
)

df.write_csv("data.csv")

Polars

Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.

# Native API with POLARS_MAX_THREADS = 8
(
    pl.scan_csv("data.csv")
    .filter(pl.col("X") == "a")
    .drop("X")
    .sort(["Y", "Z"])
    .sink_parquet("data.parquet", row_group_size=100_000)  # compression defaults to "zstd"
)
# 3.5 s

If you prefer to write SQL, you can alternatively use Polars’ SQL API. Curiously, the run time is substantially longer:

# Via SQL API (slower!?)
(
    pl.scan_csv("data.csv")
    .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
    .sink_parquet("data.parquet", row_group_size=100_000)
)

# 6.8 s

In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.

pl.scan_parquet("data.parquet").head(5).collect()

# Output
        Y    Z
      f64  i64
3.7796e-8    4
5.0273e-8    5
5.7652e-8    4
8.0578e-8    3
8.1598e-8    4
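
A quick sanity check of the size claim, as a small sketch using the path from above:

import os

print(round(os.path.getsize("data.parquet") / 1e6))  # file size in MB, ~170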

DuckDB

As an alternative, we use DuckDB. The thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) leads to longer run times, but it works (see the snippet after the query below).

con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})

con.sql(
    """
    COPY (
        SELECT Y, Z
        FROM 'data.csv'
        WHERE X == 'a'
        ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)

# 3.9 s
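
To see the low-memory behavior mentioned above, the budget can be tightened on the existing connection before re-running the COPY; 500 MB is just an illustrative value:

# With a deliberately tight memory limit, DuckDB spills intermediate
# results to disk, so the same COPY still succeeds, only more slowly.
con.sql("SET memory_limit = '500MB'")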

Again, the output looks as expected, and the Parquet file is also about 170 MB in size, thanks to using the same compression (“zstd”) as with Polars.

con.sql("SELECT * FROM 'data.parquet' LIMIT 5")

# Output
┌────────────────────────┬───────┐
│           Y            │   Z   │
│         double         │ int64 │
├────────────────────────┼───────┤
│  3.779571322581887e-08 │     4 │
│ 5.0273087692787044e-08 │     5 │
│   5.76523543349694e-08 │     4 │
│  8.057776434977626e-08 │     3 │
│  8.159834352650108e-08 │     4 │
└────────────────────────┴───────┘
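
If you want to verify the codec directly, DuckDB's parquet_metadata() function exposes per-column-chunk metadata, including the compression; a small sketch (the selected column names are taken from that metadata table):

con.sql(
    """
    SELECT row_group_id, path_in_schema, compression
    FROM parquet_metadata('data.parquet')
    LIMIT 5
    """
)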

Final words

  • With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
  • We can apply filters, column selections, sorts, etc. on the fly.

Python notebook
