Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, like column selection, data filters, or sorts.
Edit: Streaming writing (or “lazy sinking”) of data with Polars was introduced with release 1.25.2 in March 2025. Thanks, Christian, for pointing this out.
pip install polars
pip install duckdb
Run times were measured on a normal laptop, dedicating 8 threads to the crunching.
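To make that setup explicit, here is a minimal sketch of pinning both engines to 8 threads. It is an illustration rather than the original benchmark script: Polars reads the POLARS_MAX_THREADS environment variable before its first import, while DuckDB takes the thread count in its connection config.

# Sketch: pin both engines to 8 threads (illustrative setup, not the original script)
import os

os.environ["POLARS_MAX_THREADS"] = "8"  # must be set before the first polars import

import duckdb
import polars as pl

print(pl.thread_pool_size())  # expect 8
con = duckdb.connect(config={"threads": 8})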
Let’s generate a 2 GB CSV file first:
import duckdb  # 1.2.1
import numpy as np  # 1.26.4
import polars as pl  # 1.25.2

n = 100_000_000

rng = np.random.default_rng(42)
df = pl.DataFrame(
    {
        "X": rng.choice(["a", "b", "c"], n),
        "Y": rng.uniform(0, 1, n),
        "Z": rng.choice([1, 2, 3, 4, 5], n),
    }
)
df.write_csv("data.csv")

Polars
Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.
# Native API with POLARS_MAX_THREADS = 8
(
    pl.scan_csv("data.csv")
    .filter(pl.col("X") == "a")
    .drop("X")
    .sort(["Y", "Z"])
    .sink_parquet("data.parquet", row_group_size=100_000)  # "zstd" compression
)

# 3.5 s
If you prefer to write SQL code, you can alternatively use the SQL API of Polars. Curiously, the run time is substantially longer:
# Via SQL API (slower!?)
(
    pl.scan_csv("data.csv")
    .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
    .sink_parquet("data.parquet", row_group_size=100_000)
)

# 6.8 s
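If you want to dig into the difference, one option (a sketch, not from the original post) is to compare the optimized plans of the two lazy pipelines via LazyFrame.explain():

# Sketch: compare optimized query plans of the native API vs. the SQL API
import polars as pl  # as above

lazy_native = (
    pl.scan_csv("data.csv")
    .filter(pl.col("X") == "a")
    .drop("X")
    .sort(["Y", "Z"])
)
lazy_sql = pl.scan_csv("data.csv").sql(
    "SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z"
)

print(lazy_native.explain())
print(lazy_sql.explain())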
In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.
pl.scan_parquet("data.parquet").head(5).collect()

# Output
Y          Z
f64        i64
3.7796e-8  4
5.0273e-8  5
5.7652e-8  4
8.0578e-8  3
8.1598e-8  4
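To double-check the quoted numbers, here is a small sketch (not part of the original snippet) that reads the file size from disk and counts the rows lazily:

# Sketch: verify file size and row count of the written Parquet file
import os

import polars as pl  # as above

size_mb = os.path.getsize("data.parquet") / 1e6
n_rows = pl.scan_parquet("data.parquet").select(pl.len()).collect().item()
print(f"{size_mb:.0f} MB, {n_rows} rows")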
DuckDB
As an alternative, we use DuckDB. Thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works.
con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})

con.sql(
    """
    COPY (
        SELECT Y, Z
        FROM 'data.csv'
        WHERE X == 'a'
        ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)

# 3.9 s
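To illustrate the low memory limit mentioned above, here is a hedged variant of the same query with memory_limit set to 500 MB (expect a longer run time):

# Sketch: same conversion with a deliberately tight memory limit
import duckdb  # as above

con_small = duckdb.connect(config={"threads": 8, "memory_limit": "500MB"})
con_small.sql(
    """
    COPY (
        SELECT Y, Z
        FROM 'data.csv'
        WHERE X == 'a'
        ORDER BY Y, Z
    ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
    """
)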
Again, the output looks as expected. The Parquet file is also about 170 MB in size, thanks to using the same compression (“zstd”) as with Polars.
con.sql("SELECT * FROM 'data.parquet' LIMIT 5") # Output ┌────────────────────────┬───────┐ │ Y │ Z │ │ double │ int64 │ ├────────────────────────┼───────┤ │ 3.779571322581887e-08 │ 4 │ │ 5.0273087692787044e-08 │ 5 │ │ 5.76523543349694e-08 │ 4 │ │ 8.057776434977626e-08 │ 3 │ │ 8.159834352650108e-08 │ 4 │ └────────────────────────┴───────┘
Final words
- With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
- We can apply filters, selects, sorts, etc. on the fly.