Category: Data

  • Converting CSV to Parquet

    Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some preprocessing steps along the way, such as column selection, filters, or sorts.

    Edit: Streaming writes (or “lazy sinking”) of data with Polars were introduced with release 1.25.2 in March 2025. Thanks, Christian, for pointing this out.

    pip install polars

    pip install duckdb


    Run times were measured on a normal laptop, with 8 threads dedicated to the crunching.
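
    If you want to pin the Polars thread count for reproducibility, here is a minimal sketch (the environment variable must be set before Polars is first imported; the DuckDB thread count is configured later via its connection config):

    import os
    
    # Must be set before the first import of polars
    os.environ["POLARS_MAX_THREADS"] = "8"
    
    import polars as pl
    
    print(pl.thread_pool_size())  # 8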

    Let’s first generate a 2 GB CSV file:

    import duckdb  # 1.2.1
    import numpy as np  # 1.26.4
    import polars as pl  # 1.25.2
    
    # 100 million rows -> roughly 2 GB as CSV
    n = 100_000_000
    
    rng = np.random.default_rng(42)
    
    df = pl.DataFrame(
        {
            "X": rng.choice(["a", "b", "c"], n),
            "Y": rng.uniform(0, 1, n),
            "Z": rng.choice([1, 2, 3, 4, 5], n),
        }
    )
    
    df.write_csv("data.csv")
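
    A quick sanity check of the file size on disk (a small sketch):

    import os
    
    # The CSV should be roughly 2 GB
    print(f"{os.path.getsize('data.csv') / 1e9:.2f} GB")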

    Polars

    Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.

    # Native API with POLARS_MAX_THREADS = 8
    (
        pl.scan_csv("data.csv")
        .filter(pl.col("X") == "a")
        .drop("X")
        .sort(["Y", "Z"])
        .sink_parquet("data.parquet", row_group_size=100_000)  # default compression is "zstd"
    )
    # 3.5 s

    If you prefer to write SQL, you can alternatively use the SQL API of Polars. Curiously, the run time is substantially longer:

    # Via SQL API (slower!?)
    (
        pl.scan_csv("data.csv")
        .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
        .sink_parquet("data.parquet", row_group_size=100_000)
    )
    
    # 6.8 s

    In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.

    pl.scan_parquet("data.parquet").head(5).collect()
    
    # Output
            Y   Z
          f64 i64
    3.7796e-8   4
    5.0273e-8   5
    5.7652e-8   4
    8.0578e-8   3
    8.1598e-8   4
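
    To double-check the row groups and the compression actually used, we can peek at the Parquet metadata, e.g., with pyarrow (a small sketch, assuming pyarrow is installed):

    import pyarrow.parquet as pq  # assumption: pyarrow is available
    
    md = pq.ParquetFile("data.parquet").metadata
    
    # Number of rows kept by the filter and number of row groups (~100_000 rows each)
    print(md.num_rows, md.num_row_groups)
    
    # Compression codec of the first column chunk, e.g. "ZSTD"
    print(md.row_group(0).column(0).compression)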

    DuckDB

    As an alternative, we use DuckDB. Thread pool size and RAM limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works.

    con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})
    
    con.sql(
        """
        COPY (
            SELECT Y, Z
            FROM 'data.csv'
            WHERE X == 'a'
            ORDER BY Y, Z
        ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
        """
    )
    
    # 3.9 s

    Again, the output looks as expected. The Parquet file is again about 170 MB in size, thanks to using the same compression (“zstd”) as with Polars.

    con.sql("SELECT * FROM 'data.parquet' LIMIT 5")
    
    # Output
    ┌────────────────────────┬───────┐
    │           Y            │   Z   │
    │         double         │ int64 │
    ├────────────────────────┼───────┤
    │  3.779571322581887e-08 │     4 │
    │ 5.0273087692787044e-08 │     5 │
    │   5.76523543349694e-08 │     4 │
    │  8.057776434977626e-08 │     3 │
    │  8.159834352650108e-08 │     4 │
    └────────────────────────┴───────┘
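
    To illustrate the earlier remark about memory limits, here is the same query with a deliberately tight cap of 500 MB (a small sketch; the output file name is arbitrary). DuckDB can then spill intermediate data to disk, so the query still finishes, just more slowly.

    # Same query, but with only 500 MB of RAM allowed
    con_small = duckdb.connect(config={"threads": 8, "memory_limit": "500MB"})
    
    con_small.sql(
        """
        COPY (
            SELECT Y, Z
            FROM 'data.csv'
            WHERE X == 'a'
            ORDER BY Y, Z
        ) TO 'data_small_mem.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
        """
    )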

    Final words

    • With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
    • We can apply filters, column selections, sorts, etc. on the fly.

    Python notebook