Parquet vs CSV
Both are tabular data formats. Parquet is binary and columnar; CSV is text and row-based. Each shines in completely different situations — here's how to pick.
At a glance
Summary comparison
| Property | Parquet | CSV |
|---|---|---|
| File layout | Binary, columnar | Text, row-based |
| Typical size | 5–10× smaller | Baseline |
| Schema | Built-in (typed) | None — every value is a string |
| Compression | Snappy / Zstd / Gzip / Brotli / LZ4 | None natively (must wrap in .gz/.zip) |
| Read speed (analytics) | Fast, can skip columns | Slow, must scan everything |
| Append speed | Slow, immutable design | Fast, just append text |
| Human-readable | No (binary) | Yes |
| Excel / Sheets | Not supported directly | Native support |
| Nested data (lists, structs) | First-class | Awkward (JSON inside cells) |
| Best for | Analytics, data lakes, ML datasets | Manual editing, portability, small files |
01. Size: why Parquet is dramatically smaller
On real-world tabular data, Parquet files are typically 5–10× smaller than equivalent CSV. The savings come from three layers stacked on top of each other: storing each column separately (so similar values sit next to each other), encoding tricks like dictionary and run-length encoding, and finally a compression codec like Snappy or Zstd. CSV has none of this — every number is text, every separator is a byte, every newline is a byte.
For a typical e-commerce orders table (10M rows, 20 columns), the breakdown looks like ~2.5 GB CSV → ~250 MB Parquet (Snappy). Even after gzipping the CSV, you're still 2–3× larger than Parquet, and you lose random access.
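To see the gap on your own machine, here's a minimal sketch (synthetic orders table with made-up column names; pandas plus pyarrow or fastparquet assumed) that writes the same frame both ways and compares file sizes. The exact ratio depends heavily on your data:

```python
import os

import numpy as np
import pandas as pd

# Synthetic orders table: low-cardinality strings and repetitive values are
# exactly what Parquet's dictionary and run-length encodings compress well.
rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    "user_id": rng.integers(1, 200_000, size=n),
    "country": rng.choice(["US", "DE", "FR", "JP", "BR"], size=n),
    "amount": rng.gamma(2.0, 30.0, size=n).round(2),
    "created_at": pd.Timestamp("2024-01-01")
    + pd.to_timedelta(rng.integers(0, 365 * 86_400, size=n), unit="s"),
})

df.to_csv("orders.csv", index=False)
df.to_parquet("orders.parquet", compression="snappy")  # needs pyarrow or fastparquet

csv_mb = os.path.getsize("orders.csv") / 1e6
pq_mb = os.path.getsize("orders.parquet") / 1e6
print(f"CSV: {csv_mb:.0f} MB, Parquet: {pq_mb:.0f} MB, ratio: {csv_mb / pq_mb:.1f}x")
```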
02. Speed: why analytics queries are 10–100× faster
Suppose you have a 50-column table and you want `SELECT user_id, amount FROM orders WHERE country = 'US'`.
- CSV: the engine has to read every byte of the file, parse every row, then filter. There's no way to read just two columns.
- Parquet: the engine reads the footer (a few KB), checks the column statistics to skip row groups where `country` obviously can't be `'US'`, then reads only the three relevant column chunks (`user_id`, `amount`, `country`). Often that's 5–10% of the file.
On large datasets this is the difference between a query running in seconds vs. minutes — and the difference between a $1 query and a $20 query on cloud warehouses.
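As a concrete sketch, here is that query with DuckDB's Python API; `orders.parquet` and `orders.csv` are placeholder file names (for instance, the files written in the size example above):

```python
import duckdb

# Over Parquet, DuckDB reads the footer, prunes to the user_id, amount, and
# country column chunks, and skips row groups whose min/max statistics rule
# out country = 'US'.
us_orders = duckdb.sql("""
    SELECT user_id, amount
    FROM 'orders.parquet'
    WHERE country = 'US'
""").df()

# The same query over CSV has to read and parse the entire file before filtering.
us_orders_csv = duckdb.sql("""
    SELECT user_id, amount
    FROM 'orders.csv'
    WHERE country = 'US'
""").df()
```

Prefixing either query with `EXPLAIN ANALYZE` is an easy way to see the difference in data scanned and wall-clock time.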
03. Schema and types
CSV has no schema. Every value is a string, and every reader has to guess: is this column an integer or a string? What about that date — is it ISO 8601 or US format? Does "NA" mean null or the country Namibia?
Parquet stores the schema in the footer. INT64 stays INT64. Dates are real DATE/TIMESTAMP types. Decimals are exact. Lists, maps, and structs are first-class. You don't lose information passing Parquet between Spark, Pandas, and DuckDB.
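A quick round trip makes this concrete. The frame below is toy data; the dtypes in the comments are what pandas typically reports when reading back with the default pyarrow engine:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": pd.array([1, 2, None], dtype="Int64"),  # nullable integer column
    "signup": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-19"]),
    "country": ["US", "NA", "DE"],  # "NA" is meant as Namibia here
})

df.to_parquet("users.parquet")
df.to_csv("users.csv", index=False)

print(pd.read_parquet("users.parquet").dtypes)
# user_id            Int64    <- integers and nulls survive
# signup    datetime64[ns]    <- real timestamps
# country           object

print(pd.read_csv("users.csv").dtypes)
# user_id    float64          <- the null forced the column to float
# signup      object          <- dates come back as strings
# country     object          <- and "NA" was parsed as missing, not Namibia
```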
04. When to use CSV anyway
CSV isn't obsolete — it's the right choice when:
- A human needs to open the file in Excel or Google Sheets.
- You're emailing data to a non-technical colleague.
- The dataset is tiny (a few hundred rows), so size doesn't matter.
- You need to `cat`, `grep`, or eyeball the file in a terminal.
- The receiving system explicitly requires CSV (some BI tools, government uploads, legacy ETL).
For everything else — analytics, machine learning, data lakes, long-term storage — Parquet is the better default.
05. Going between the two
Converting is easy in either direction:
- Parquet to CSV: use a free online converter (drop a .parquet file, download a .csv), or in Python: `pd.read_parquet("f.parquet").to_csv("f.csv")`.
- CSV to Parquet: in Python, `pd.read_csv("f.csv").to_parquet("f.parquet")`; in DuckDB, `COPY (SELECT * FROM 'f.csv') TO 'f.parquet' (FORMAT 'parquet')`. For CSVs too large to fit in memory, see the streaming sketch below.