Category: Data

  • Converting arbitrarily large CSVs to Parquet with R

    In a recent post, we used Polars and DuckDB in Python to convert a large CSV file to Parquet in streaming mode.

    Several people have contacted me and asked: “and in R?”

    Simple answer: We have DuckDB, and we have different Polars bindings. Here, we are using {polars}, which is currently being overhauled into {neopolars}.

    So let’s not wait any longer!


    Run times are on a Windows system with an Intel i7-13700H CPU.

    Generate a 2.2 GB CSV file

    We use {data.table} to dump a randomly generated dataset with 100 million rows into a CSV file.

    library(data.table)
    
    set.seed(1)
    
    n <- 1e8
    
    df <- data.frame(
      X = sample(letters[1:3], n, TRUE),
      Y = runif(n),
      Z = sample(1:5, n, TRUE)
    )
    
    fwrite(df, "data.csv")

    DuckDB

    Then, we use DuckDB to run a query against the file and stream the result into a Parquet file.

    Threads and memory limit can be set on the fly, which is very convenient. Even a low memory limit (e.g., 500 MB) will work; try it out with the snippet shown after the check below.

    library(duckdb)
    
    con <- dbConnect(duckdb(config = list(threads = "8", memory_limit = "4GB")))
    
    system.time( # 3.5s
      dbSendQuery(
        con,
        "
        COPY (
          SELECT Y, Z
          FROM 'data.csv'
          WHERE X == 'a'
          ORDER BY Y
        ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd)
        "
      )
    )
    
    # Check
    dbGetQuery(con, "SELECT COUNT(*) N FROM 'data.parquet'") # 33329488
    dbGetQuery(con, "SELECT * FROM 'data.parquet' LIMIT 5")
    #              Y Z
    # 1 5.355105e-09 4
    # 2 9.080395e-09 5
    # 3 2.258457e-08 2
    # 4 3.445894e-08 2
    # 5 6.891787e-08 1

    3.5 seconds – wow! The resulting file looks good. It is 125 MB in size.
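
    Both settings can also be changed later on the open connection via DuckDB's SET statement. Here is a minimal sketch to try out the 500 MB limit mentioned above (re-run the COPY query afterwards to compare run times):

    # Change limits on the existing connection
    dbExecute(con, "SET memory_limit = '500MB'")
    dbExecute(con, "SET threads = 4")
    
    # Inspect the current value
    dbGetQuery(con, "SELECT current_setting('memory_limit') AS memory_limit")
    
    # Check the size of the resulting Parquet file in MB
    file.size("data.parquet") / 1e6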

    Polars

    Let’s do the same with Polars.

    # Sys.setenv(NOT_CRAN = "true")
    # install.packages("polars", repos = "https://community.r-multiverse.org")
    library(polars)
    
    polars_info()
    
    system.time( # 9s
      (
        pl$scan_csv("data.csv")
        $filter(pl$col("X") == "a")
        $drop("X")
        $sort("Y")
        $sink_parquet("data.parquet", row_group_size = 1e5)
      )
    )
    
    # Check
    pl$scan_parquet("data.parquet")$head()$collect()
    # shape: (5, 2)
    # ┌───────────┬─────┐
    # │ Y         ┆ Z   │
    # │ ---       ┆ --- │
    # │ f64       ┆ i64 │
    # ╞═══════════╪═════╡
    # │ 5.3551e-9 ┆ 4   │
    # │ 9.0804e-9 ┆ 5   │
    # │ 2.2585e-8 ┆ 2   │
    # │ 3.4459e-8 ┆ 2   │
    # │ 6.8918e-8 ┆ 1   │
    # └───────────┴─────┘

    At nine seconds, it is slower than DuckDB. But the output looks as expected, and the file has the same size as with DuckDB.

    Final words

    • With DuckDB or Polars, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
    • We can apply filters, selects, sorts etc. on the fly.
    • Let’s keep an eye on Polars in R. It looks really interesting.

    R script

  • Converting arbitrarily large CSVs to Parquet with Python

    Conversion from CSV to Parquet in streaming mode? No problem for the two powerhouses Polars and DuckDB. We can even throw in some data preprocessing steps in between, like column selection, data filters, or sorts.

    Edit: Streaming writing (or “lazy sinking”) of data with Polars was introduced with release 1.25.2 in March 2025. Thanks, Christian, for pointing this out!

    pip install polars

    pip install duckdb


    Run times are on a normal laptop, dedicating 8 threads to the crunching.

    Let’s generate a 2 GB CSV file first

    import duckdb  # 1.2.1
    import numpy as np  # 1.26.4
    import polars as pl  # 1.25.2
    
    n = 100_000_000
    
    rng = np.random.default_rng(42)
    
    df = pl.DataFrame(
        {
            "X": rng.choice(["a", "b", "c"], n),
            "Y": rng.uniform(0, 1, n),
            "Z": rng.choice([1, 2, 3, 4, 5], n),
        }
    )
    
    df.write_csv("data.csv")

    Polars

    Let’s use Polars in Lazy mode to connect to the CSV, apply some data operations, and stream the result into a Parquet file.

    # Native API with POLARS_MAX_THREADS = 8
    (
        pl.scan_csv("data.csv")
        .filter(pl.col("X") == "a")
        .drop("X")
        .sort(["Y", "Z"])
        .sink_parquet("data.parquet", row_group_size=100_000)  # "zstd" compression
    )
    # 3.5 s
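
    The compression comment refers to the default: per the Polars docs, sink_parquet() compresses with zstd unless told otherwise. Spelled out explicitly, the same sink looks like this:

    (
        pl.scan_csv("data.csv")
        .filter(pl.col("X") == "a")
        .drop("X")
        .sort(["Y", "Z"])
        # same as above, but with the default compression made explicit
        .sink_parquet("data.parquet", compression="zstd", row_group_size=100_000)
    )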

    In case you prefer to write SQL code, you can alternatively use the SQL API of Polars. Curiously, run time is substantially longer:

    # Via SQL API (slower!?)
    (
        pl.scan_csv("data.csv")
        .sql("SELECT Y, Z FROM self WHERE X == 'a' ORDER BY Y, Z")
        .sink_parquet("data.parquet", row_group_size=100_000)
    )
    
    # 6.8 s

    In both cases, the result looks as expected, and the resulting Parquet file is about 170 MB in size.

    pl.scan_parquet("data.parquet").head(5).collect()
    
    # Output
    shape: (5, 2)
    ┌───────────┬─────┐
    │ Y         ┆ Z   │
    │ ---       ┆ --- │
    │ f64       ┆ i64 │
    ╞═══════════╪═════╡
    │ 3.7796e-8 ┆ 4   │
    │ 5.0273e-8 ┆ 5   │
    │ 5.7652e-8 ┆ 4   │
    │ 8.0578e-8 ┆ 3   │
    │ 8.1598e-8 ┆ 4   │
    └───────────┴─────┘
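
    The file size can be checked programmatically as well; a quick sketch, assuming the file sits in the working directory:

    import os
    
    os.path.getsize("data.parquet") / 1e6  # MB, ~170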

    DuckDB

    As an alternative, we use DuckDB. Thread pool size and memory limit can be set on the fly. Setting a low memory limit (e.g., 500 MB) will lead to longer run times, but it works; see the snippet after the output check below.

    con = duckdb.connect(config={"threads": 8, "memory_limit": "4GB"})
    
    con.sql(
        """
        COPY (
            SELECT Y, Z
            FROM 'data.csv'
            WHERE X == 'a'
            ORDER BY Y, Z
        ) TO 'data.parquet' (FORMAT parquet, COMPRESSION zstd, ROW_GROUP_SIZE 100_000)
        """
    )
    
    # 3.9 s

    Again, the output looks as expected. The Parquet file is again 170 MB in size, thanks to using the same compression (“zstd”) as with Polars.

    con.sql("SELECT * FROM 'data.parquet' LIMIT 5")
    
    # Output
    ┌────────────────────────┬───────┐
    │           Y            │   Z   │
    │         double         │ int64 │
    ├────────────────────────┼───────┤
    │  3.779571322581887e-08 │     4 │
    │ 5.0273087692787044e-08 │     5 │
    │   5.76523543349694e-08 │     4 │
    │  8.057776434977626e-08 │     3 │
    │  8.159834352650108e-08 │     4 │
    └────────────────────────┴───────┘
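
    As in the R post, both settings can also be changed later on the open connection via DuckDB's SET statement. A minimal sketch to try out the 500 MB limit mentioned above (re-run the COPY query afterwards to compare run times):

    # Change limits on the existing connection
    con.sql("SET memory_limit = '500MB'")
    con.sql("SET threads TO 4")
    
    # Inspect the current value
    con.sql("SELECT current_setting('memory_limit')")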

    Final words

    • With Polars or DuckDB, conversion of CSVs to Parquet is easy and fast, even in larger-than-RAM situations.
    • We can apply filters, selects, sorts etc. on the fly.

    Python notebook