What is a Parquet file?

Apache Parquet is an open-source columnar binary file format optimised for analytical workloads. This guide explains what makes it different from CSV or JSON, how it stores data, and when you should reach for it.

TL;DR

A .parquet file is a binary, column-oriented data file that stores tabular data 5–10× smaller than the equivalent CSV and lets analytics engines read only the columns they need. It's the de facto standard for data lakes (S3, GCS, Azure Blob), Hugging Face datasets, dbt models, Spark, DuckDB, and Pandas.

01 — Format

Columnar instead of row-based

CSV, JSON, and traditional databases store data row by row: every row contains every column, one after another. Parquet flips this — values from the same column are stored together. A 100-column, 10-million-row table is stored as 100 contiguous column chunks rather than 10 million rows.

Why does this matter? Two big reasons:

  • Compression. Values in the same column tend to be similar (same type, similar magnitudes, repeated strings), so they compress much better than mixed row data. Parquet typically achieves 5–10× compression vs. CSV with codecs like Snappy, Zstd, or Gzip.
  • Selective reads. If your query only touches 3 of 100 columns, the engine reads only those 3 chunks instead of the whole file. On large datasets this is the difference between a query taking 200 ms and 30 seconds (see the sketch after this list).
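
As a minimal sketch of a selective read: Pandas (via PyArrow under the hood) can be told exactly which columns to pull, and only those column chunks are fetched. The file and column names here are hypothetical.

    import pandas as pd

    # Only the three named column chunks are read from disk, no matter how
    # many columns the file actually contains.
    df = pd.read_parquet(
        "events.parquet",                          # hypothetical file
        columns=["user_id", "event_ts", "amount"],
    )
    print(df.head())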

02 — Internals

What's actually inside a .parquet file

A Parquet file is laid out in three logical parts:

  • Row groups: horizontal slices of the table (e.g. 1M rows each), each containing column chunks for that slice.
  • Column chunks: compressed, encoded values for one column inside one row group. The unit of I/O.
  • Footer (metadata): schema, row counts, min/max statistics per column chunk, and offsets to every chunk. Read first by every reader.

Because the footer holds min/max stats per chunk, query engines can skip entire chunks without reading them — this is called predicate pushdown, and it's why a filtered query on a 50 GB Parquet dataset can complete in seconds.
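
To see this metadata yourself, PyArrow exposes the footer directly. A minimal sketch, assuming a local file named events.parquet:

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("events.parquet")   # hypothetical file
    print(pf.metadata)                      # row count, row groups, format version
    print(pf.schema_arrow)                  # the schema, stored once in the footer

    # Min/max statistics for the first column of the first row group:
    # these are the numbers an engine compares against your filter to decide
    # whether a chunk can be skipped entirely.
    print(pf.metadata.row_group(0).column(0).statistics)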

03 — Types

Schema and data types

Unlike CSV, Parquet has a real schema. Every column has a name, a physical type (INT32, INT64, FLOAT, DOUBLE, BYTE_ARRAY, etc.) and a logical type (DATE, TIMESTAMP, DECIMAL, STRING, ENUM, JSON, UUID, and more). Nested structures — lists, maps, structs — are first-class. Schema is stored once in the footer, not on every row.
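
As an illustration of declaring that schema up front, here is a PyArrow sketch with made-up column names and values, covering physical, logical, and nested types:

    from decimal import Decimal

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Each column gets a name and a typed definition, written once to the footer.
    schema = pa.schema([
        ("order_id", pa.int64()),            # physical INT64
        ("placed_at", pa.timestamp("ms")),   # logical TIMESTAMP (stored as INT64)
        ("total", pa.decimal128(10, 2)),     # logical DECIMAL
        ("tags", pa.list_(pa.string())),     # nested list, first-class
    ])

    table = pa.table(
        {
            "order_id": [1, 2],
            "placed_at": [1_700_000_000_000, 1_700_000_060_000],  # epoch ms
            "total": [Decimal("19.99"), Decimal("5.00")],
            "tags": [["new", "gift"], []],
        },
        schema=schema,
    )
    pq.write_table(table, "orders.parquet")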

04 — Compression

Compression and encoding

Parquet stacks two techniques. First, encoding: dictionary encoding for repeated strings, run-length encoding for repeated values, bit-packing for small integers, delta encoding for sorted columns. Then, on top of that, a compression codec — usually Snappy (fast), Zstd (best ratio), or Gzip. The combination is what gets you 5–10× smaller files than equivalent CSV.
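
A quick way to see the two layers stack is to write one highly repetitive column with different codecs and compare file sizes; the output paths below are hypothetical:

    import os

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Dictionary encoding collapses the repeats first; the codec then
    # compresses what's left.
    table = pa.table({"status": ["shipped", "pending", "shipped"] * 100_000})

    for codec in ["snappy", "zstd", "gzip"]:
        path = f"status_{codec}.parquet"
        pq.write_table(table, path, compression=codec)
        print(f"{codec}: {os.path.getsize(path):,} bytes")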

05 — Ecosystem

Where you'll see Parquet

  • Data lakes: S3, Google Cloud Storage, Azure Blob — Parquet is the default format for Iceberg, Delta Lake, and Hudi.
  • Analytics engines: DuckDB, Apache Spark, Trino, Presto, Athena, BigQuery (external tables), ClickHouse (see the DuckDB sketch after this list).
  • Python: Pandas (pd.read_parquet), PyArrow, Polars, Dask.
  • Machine learning: Hugging Face datasets ship as Parquet shards; PyTorch and TensorFlow data loaders read it directly.
  • dbt and modern data stack: intermediate models often persist as Parquet for downstream BI tools.
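
For instance, DuckDB runs SQL over a Parquet file in place, with no load step. A minimal sketch, assuming duckdb is installed and the status_zstd.parquet file from the earlier example exists:

    import duckdb

    # DuckDB reads the footer, then only the column chunks the query touches.
    rows = duckdb.sql(
        "SELECT status, count(*) AS n FROM 'status_zstd.parquet' GROUP BY status"
    ).fetchall()
    print(rows)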

06 — When to use it

Should you use Parquet for your data?

Yes, if:

  • You have more than ~10k rows and the data is mostly numeric or has repeating strings.
  • Your queries are analytical (aggregations, filters on a few columns) rather than transactional.
  • You need to share or store data efficiently — Parquet files are typically a fraction of the CSV size.
  • You want a real schema with proper types, including timestamps, decimals, and nested structures.

Probably no, if:

  • You need humans to open the file in Excel or paste it into a Slack message — use CSV for that, or convert with the Parquet to CSV converter (a pandas one-liner is shown after this list).
  • The dataset is tiny (under a few hundred rows) — the format's overhead outweighs its benefits.
  • Your workload is row-by-row inserts/updates — Parquet is immutable and write-once.
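
If you'd rather convert in code, a minimal pandas sketch with hypothetical file names:

    import pandas as pd

    # Read the columnar file, write a human-readable CSV copy alongside it.
    pd.read_parquet("data.parquet").to_csv("data.csv", index=False)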

07 — Try it

Open a Parquet file right now

The fastest way to see what's inside a .parquet file is the Parqui online viewer — drop the file into the browser and it parses entirely on your machine, no upload. For more options, see How to open a Parquet file.