Parquet vs CSV
Both are tabular data formats. Parquet is binary and columnar; CSV is text and row-based. Each shines in completely different situations — here's how to pick.
At a glance
Summary comparison
| Property | Parquet | CSV |
|---|---|---|
| File layout | Binary, columnar | Text, row-based |
| Typical size | 5–10× smaller | Baseline |
| Schema | Built-in (typed) | None — every value is a string |
| Compression | Snappy / Zstd / Gzip / Brotli / LZ4 | None natively (must wrap in .gz/.zip) |
| Read speed (analytics) | Fast, can skip columns | Slow, must scan everything |
| Append speed | Slow, immutable design | Fast, just append text |
| Human-readable | No (binary) | Yes |
| Excel / Sheets | Not supported directly | Native support |
| Nested data (lists, structs) | First-class | Awkward (JSON inside cells) |
| Best for | Analytics, data lakes, ML datasets | Manual editing, portability, small files |
01. Size: why Parquet is dramatically smaller
On real-world tabular data, Parquet files are typically 5–10× smaller than equivalent CSV. The savings come from three layers stacked on top of each other: storing each column separately (so similar values sit next to each other), encoding tricks like dictionary and run-length encoding, and finally a compression codec like Snappy or Zstd. CSV has none of this — every number is text, every separator is a byte, every newline is a byte.
For a typical e-commerce orders table (10M rows, 20 columns), the breakdown looks like ~2.5 GB CSV → ~250 MB Parquet (Snappy). Even after gzipping the CSV, you're still 2–3× larger than Parquet, and you lose random access.
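To see the gap on your own machine, here's a minimal sketch (synthetic orders table with made-up column names; pandas plus pyarrow or fastparquet assumed) that writes the same frame both ways and compares file sizes. The exact ratio depends heavily on your data:

```python
import os

import numpy as np
import pandas as pd

# Synthetic orders table: low-cardinality strings and repetitive values are
# exactly what Parquet's dictionary and run-length encodings compress well.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "user_id": rng.integers(1, 200_000, size=n),
    "country": rng.choice(["US", "DE", "FR", "JP", "BR"], size=n),
    "amount": rng.gamma(2.0, 30.0, size=n).round(2),
    "created_at": pd.Timestamp("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 365 * 86_400, size=n), unit="s"),
})

df.to_csv("orders.csv", index=False)
df.to_parquet("orders.parquet", compression="snappy")  # needs pyarrow or fastparquet

csv_mb = os.path.getsize("orders.csv") / 1e6
pq_mb = os.path.getsize("orders.parquet") / 1e6
print(f"CSV: {csv_mb:.0f} MB, Parquet: {pq_mb:.0f} MB, ratio: {csv_mb / pq_mb:.1f}x")
```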
02. Speed: why analytics queries are 10–100× faster
Suppose you have a 50-column table and you want `SELECT user_id, amount FROM orders WHERE country = 'US'`.
- CSV: the engine has to read every byte of the file, parse every row, then filter. There's no way to read just two columns.
- Parquet: the engine reads the footer (a few KB), checks the column statistics to skip row groups where `country` obviously can't be `'US'`, then reads only the three relevant column chunks (`user_id`, `amount`, `country`). Often that's 5–10% of the file.
On large datasets this is the difference between a query running in seconds vs. minutes — and the difference between a $1 query and a $20 query on cloud warehouses.
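As a concrete sketch, here is that query with DuckDB's Python API; `orders.parquet` and `orders.csv` are placeholder file names (for instance, the files written in the size example above):

```python
import duckdb

# Over Parquet, DuckDB reads the footer, prunes to the user_id, amount, and
# country column chunks, and skips row groups whose min/max statistics rule
# out country = 'US'.
us_orders = duckdb.sql("""
    SELECT user_id, amount
    FROM 'orders.parquet'
    WHERE country = 'US'
""").df()

# The same query over CSV has to read and parse the entire file before filtering.
us_orders_csv = duckdb.sql("""
    SELECT user_id, amount
    FROM 'orders.csv'
    WHERE country = 'US'
""").df()
```

Prefixing either query with `EXPLAIN ANALYZE` is an easy way to see the difference in data scanned and wall-clock time.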
03. Schema and types
CSV has no schema. Every value is a string, and every reader has to guess: is this column an integer or a string? What about that date — is it ISO 8601 or US format? Does "NA" mean null or the country Namibia?
Parquet stores the schema in the footer. INT64 stays INT64. Dates are real DATE/TIMESTAMP types. Decimals are exact. Lists, maps, and structs are first-class. You don't lose information passing Parquet between Spark, Pandas, and DuckDB.
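A quick round trip makes this concrete. The frame below is toy data; the dtypes in the comments are what pandas typically reports when reading back with the default pyarrow engine:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": pd.array([1, 2, None], dtype="Int64"),  # nullable integer column
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-19"]),
    "country": ["US", "NA", "DE"],  # "NA" is meant as Namibia here
})

df.to_parquet("users.parquet")
df.to_csv("users.csv", index=False)

print(pd.read_parquet("users.parquet").dtypes)
# user_id            Int64    <- integers and nulls survive
# signup    datetime64[ns]    <- real timestamps
# country           object

print(pd.read_csv("users.csv").dtypes)
# user_id    float64          <- the null forced the column to float
# signup      object          <- dates come back as strings
# country     object          <- and "NA" was parsed as missing, not Namibia
```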
04. When to use CSV anyway
CSV isn't obsolete — it's the right choice when:
- A human needs to open the file in Excel or Google Sheets.
- You're emailing data to a non-technical colleague.
- The dataset is tiny (a few hundred rows), so size doesn't matter.
- You need to `cat`, `grep`, or eyeball the file in a terminal.
- The receiving system explicitly requires CSV (some BI tools, government uploads, legacy ETL).
For everything else — analytics, machine learning, data lakes, long-term storage — Parquet is the better default.
05. Going between the two
Converting is easy in either direction:
- Parquet to CSV: use a free online converter (drop a .parquet file, download a .csv), or in Python: `pd.read_parquet("f.parquet").to_csv("f.csv")`.
- CSV to Parquet: in Python, `pd.read_csv("f.csv").to_parquet("f.parquet")`; in DuckDB, `COPY (SELECT * FROM 'f.csv') TO 'f.parquet' (FORMAT 'parquet')`. For CSVs too large to fit in memory, see the streaming sketch below.