You probably use pandas by default. It’s battle‑tested and ubiquitous—but many workloads now push past what eager, row‑oriented, single‑threaded pipelines handle gracefully. Modern columnar stacks based on Apache Arrow bring faster I/O, better memory locality, and zero‑copy interop across tools. In Python, two practical paths emerge:
- pandas 2.x, which adds Arrow‑backed dtypes and cleaner memory semantics (Copy‑on‑Write) while keeping the familiar API.
- Polars, a Rust‑powered DataFrame engine with an expression/lazy optimizer, parallel execution, and first‑class Arrow memory.
This post sets the mental model you’ll use throughout: what Arrow actually buys you, when Polars wins, when pandas 2 suffices, and how to structure code so you can move data between engines with near‑zero overhead.
The model: Arrow in the middle
Apache Arrow is a standardized, columnar in‑memory layout. It’s not “a file format” (though there are Arrow IPC/Feather and Parquet formats built on the same ideas). The win is simple: many tools can operate over the same buffers in place, so you skip serialization and redundant copies.
Implications you can rely on:
- Zero‑copy handoff is possible in many common cases (fixed‑width types, strings, timestamps) when you convert pandas↔Arrow or Polars↔Arrow with the right flags.
- Columnar layout improves CPU cache behavior for vectorized ops and enables predicate/column pruning in readers.
- Interop becomes an architectural choice rather than a data engineering project; the short example below makes this concrete.
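A minimal sketch (the file name events.arrow is hypothetical): one Arrow table written as IPC/Feather, read back memory‑mapped, and handed to both Polars and pandas without re‑serializing.
# Hypothetical sketch: one IPC/Feather file, shared by both engines
import pyarrow as pa
import pyarrow.feather as feather
import polars as pl
tbl = pa.table({"user_id": [1, 2, 3], "amount": [9.5, 3.0, 7.25]})
# Arrow IPC/Feather mirrors the in-memory layout; keep it uncompressed so a
# memory-mapped read can reuse the buffers directly
feather.write_feather(tbl, "events.arrow", compression="uncompressed")
shared = feather.read_table("events.arrow", memory_map=True)
pl_df = pl.from_arrow(shared)   # wraps the Arrow buffers; no copy for most types
pd_df = shared.to_pandas()      # may copy, depending on target dtypes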
Polars vs pandas 2.x at a glance (practitioner’s view)
- Polars uses an expression engine with lazy evaluation by default for scan_* sources. It builds a plan, applies optimization (projection/predicate pushdown, join/agg reordering, caching), then executes in parallel.
- pandas 2.x remains eager and largely single‑threaded at the Python level, but adds Arrow‑backed dtypes (dtype_backend="pyarrow") and Copy‑on‑Write to cut hidden copies. I/O via PyArrow and Parquet/IPC gets faster and more memory‑efficient.
- Both speak Arrow fluently, so round‑tripping between them can avoid copies with the right APIs.
A tiny taste of the Polars planner
Under the hood, Polars will:
- Push projections down to the reader so only referenced columns are read.
- Push filters into the scan so row‑group statistics can skip chunks.
- Parallelize scan, filter, and groupby across cores.
You can verify these rewrites on any lazy query with .explain(), as the sketch below shows.
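A minimal sketch, reusing the events.parquet example from later in this post: build a lazy query and print its optimized plan to confirm that the filter and the column selection were pushed into the Parquet scan.
# Inspect the optimized plan to confirm pushdown actually happened
import polars as pl
q = (
    pl.scan_parquet("events.parquet")
    .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
    .select(["user_id", "kind"])
)
# explain() prints the optimized plan by default; the filter and projection
# appear attached to the scan node rather than as steps after a full read
print(q.explain())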
pandas 2.x with Arrow dtypes (getting the foundation right)
If you’re staying with pandas, flip your defaults toward Arrow dtypes for better memory use and faster I/O. Start with Parquet/IPC, which are Arrow‑native.
# pandas 2.x
import pandas as pd
# Prefer Arrow-backed dtypes on ingest
df = pd.read_parquet("events.parquet")
df = df.convert_dtypes(dtype_backend="pyarrow")
# Confirm Arrow extension dtypes where available (strings, timestamps, ints with NA, etc.)
print(df.dtypes)
# Round-trip to Arrow, reusing buffers where the memory layout allows
import pyarrow as pa
tbl = pa.Table.from_pandas(df)
# ... do work in an Arrow-enabled system ...
# Back to pandas, keeping Arrow-backed dtypes and still trying to avoid copies
df2 = tbl.to_pandas(types_mapper=pd.ArrowDtype, use_threads=True, self_destruct=True)
Notes you can count on:
- convert_dtypes(dtype_backend="pyarrow") migrates eligible columns to Arrow‑backed extension dtypes. Coverage is wide for fixed‑width types and strings, improving steadily in recent pandas releases.
- pa.Table.from_pandas(df) and Table.to_pandas(zero_copy_only=True, self_destruct=True) can avoid copies when memory layout and null semantics line up. When they don’t, the APIs will raise (or copy)—treat that as a signal to adjust dtypes.
The Polars way: expressions, lazy, and streaming
Polars lets you describe transformations with expressions; the engine fuses and parallelizes them. Use scan_* for lazy pipelines, and turn on streaming for large, one‑pass flows.
import polars as pl
# Lazy: build a plan without loading all bytes up front
q = (
pl.scan_parquet("events.parquet")
.filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
.select(["user_id", "kind", "ts"])
.group_by("user_id")
.agg([
pl.len().alias("events"),
pl.col("kind").n_unique().alias("kinds"),
])
.sort("events", descending=True)
.limit(10)
)
# Execute; for very large sources, enable streaming execution
out = q.collect(streaming=True)
print(out)
# Interop: zero-copy to Arrow, and then to pandas when needed
tbl = out.to_arrow()
df = tbl.to_pandas(self_destruct=True)
Practical guidance:
- Prefer scan_parquet/scan_csv over eager read_* when the result is consumed once—this gives the optimizer room to push filters/projections into the reader.
- Add .collect(streaming=True) for large, forward‑only workloads; memory stays bounded.
- Use expressions (pl.col, when().then().otherwise()) instead of Python UDFs; they’re vectorized and fusable. A short comparison follows below.
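A minimal sketch on a toy column: the same bucketing logic written as a fused when/then/otherwise expression and as a per‑row Python UDF via map_elements.
# Vectorized expression vs Python UDF for the same derived column
import polars as pl
df = pl.DataFrame({"amount": [5.0, 25.0, 125.0]})
# Expression: stays inside the engine, can be fused and parallelized
fast = df.with_columns(
    pl.when(pl.col("amount") > 100).then(pl.lit("large"))
    .when(pl.col("amount") > 10).then(pl.lit("medium"))
    .otherwise(pl.lit("small"))
    .alias("bucket")
)
# UDF: correct, but falls back to per-row Python calls and breaks fusion/parallelism
slow = df.with_columns(
    pl.col("amount")
    .map_elements(lambda a: "large" if a > 100 else "medium" if a > 10 else "small", return_dtype=pl.String)
    .alias("bucket")
)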
What Arrow actually changes in practice
- Faster reads/writes: Parquet/IPC + Arrow kernels avoid Python loops and convert fewer times.
- Lower memory: Arrow strings and dictionary‑encoded categoricals can materially reduce heap use compared to object dtype; the sketch below shows the difference.
- Clean interop: jump between engines by exchanging Arrow Tables instead of CSVs or pickles.
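A minimal sketch with synthetic strings (exact numbers vary with string length and cardinality):
# Heap use: object-dtype strings vs Arrow-backed strings in pandas
import pandas as pd
s_obj = pd.Series([f"event-{i % 1000}" for i in range(1_000_000)], dtype="object")
s_arrow = s_obj.astype("string[pyarrow]")
print("object dtype :", round(s_obj.memory_usage(deep=True) / 1e6, 1), "MB")
print("arrow string :", round(s_arrow.memory_usage(deep=True) / 1e6, 1), "MB")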
A tiny interop round‑trip you can keep in your toolbox
import polars as pl
import pandas as pd
import pyarrow as pa
pl_df = pl.DataFrame({
"user_id": [1, 1, 2, 3],
"kind": ["view", "click", "view", "buy"],
})
# Polars -> Arrow (zero-copy)
tbl = pl_df.to_arrow()
# Arrow -> pandas (keep Arrow-backed dtypes so the return trip stays cheap)
pd_df = tbl.to_pandas(types_mapper=pd.ArrowDtype, self_destruct=True)
# pandas -> Arrow again
tbl2 = pa.Table.from_pandas(pd_df)
# Arrow -> Polars
pl_df2 = pl.from_arrow(tbl2)
If any step needs a copy, you’ll see it in profiles—use it as a nudge to normalize dtypes (e.g., switch pandas columns to Arrow‑backed dtypes).
Takeaways you can apply today
- Treat Arrow as your shared, columnar contract; push I/O through Parquet/IPC and keep memory on the Arrow path.
- In pandas 2.x projects, adopt Arrow dtypes (dtype_backend="pyarrow") and Parquet/IPC readers first—it’s the highest‑leverage change with minimal refactor.
- In pipelines with heavy joins/groupbys or large scans, prefer Polars lazy queries; you’ll get pushdown, fusion, and parallel execution without changing the data model.
- Keep interop explicit: convert via Arrow Tables; measure copies with tools like memory_profiler and py-spy (a sketch follows below).
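A sketch using memory_profiler's memory_usage helper, assuming memory_profiler is installed; the roundtrip function and the synthetic frame are illustrative, not part of any library API.
# Peak-memory check for an interop step, with and without Arrow dtypes
from memory_profiler import memory_usage
import pandas as pd
import pyarrow as pa
def roundtrip(df: pd.DataFrame) -> pd.DataFrame:
    tbl = pa.Table.from_pandas(df)
    return tbl.to_pandas()
df = pd.DataFrame({"x": range(1_000_000), "s": ["event"] * 1_000_000})
peak_obj = max(memory_usage((roundtrip, (df,), {}), interval=0.05))
peak_arrow = max(memory_usage((roundtrip, (df.convert_dtypes(dtype_backend="pyarrow"),), {}), interval=0.05))
print({"peak_object_mb": round(peak_obj, 1), "peak_arrow_mb": round(peak_arrow, 1)})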
In the rest of the post, we’ll deepen this with benchmarks that reflect real workloads, idiomatic Polars expression patterns (joins, window functions), and a precise interop checklist—so you can mix engines intentionally rather than by accident.
Joins, aggregations, and windows that scale
Real pipelines are dominated by joins and group aggregations. Here are patterns that stay on the vectorized path and keep memory bounded.
Joins
# Polars: lazy join with projection/predicate pushdown and parallel execution
import polars as pl
orders = pl.scan_parquet("orders.parquet").select(["order_id", "user_id", "amount", "ts"])
users = pl.scan_parquet("users.parquet").select(["user_id", "segment"])
top = (
orders
.filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
.join(users, on="user_id", how="left")
.group_by(["segment"])
.agg([
pl.len().alias("orders"),
pl.col("amount").sum().alias("gmv"),
])
.sort("gmv", descending=True)
.limit(5)
.collect(streaming=True)
)
# pandas 2.x: index-aware merge + grouped aggregations
import pandas as pd
orders = pd.read_parquet("orders.parquet", columns=["order_id", "user_id", "amount", "ts"])\
.convert_dtypes(dtype_backend="pyarrow")
users = pd.read_parquet("users.parquet", columns=["user_id", "segment"])\
.convert_dtypes(dtype_backend="pyarrow")
orders = orders[orders["ts"] >= pd.Timestamp("2024-01-01")]
df = orders.merge(users, on="user_id", how="left")
out = (
df.groupby("segment", observed=True, sort=False)
.agg(orders=("order_id", "count"), gmv=("amount", "sum"))
.sort_values("gmv", ascending=False)
.head(5)
.reset_index()
)
Guidance:
- In pandas, prefer groupby(..., observed=True) when using categoricals; it avoids emitting empty groups.
- Prune columns at read time (columns=[...]) and avoid apply in hot joins/aggregations.
Window functions
# Polars: per-user cumulative and rolling windows
import polars as pl
q = (
pl.scan_parquet("events.parquet")
.select([
"user_id", "ts", "amount",
pl.col("amount").cum_sum().over("user_id").alias("cum_amount"),
pl.col("amount").rolling_mean(window_size=7).over("user_id").alias("roll7"),
])
)
df = q.collect(streaming=True)
# pandas: per-user cumulative and time-based rolling window
import pandas as pd
df = pd.read_parquet("events.parquet", columns=["user_id", "ts", "amount"])\
.convert_dtypes(dtype_backend="pyarrow")
df = df.sort_values(["user_id", "ts"]) # required for rolling-by-time stability
df["cum_amount"] = df.groupby("user_id", sort=False)["amount"].cumsum()
# Time-based rolling (7 days) per user; use transform to preserve shape
roll = (
df.set_index("ts")
.groupby("user_id", sort=False)["amount"]
.rolling("7D").mean()
.reset_index(level=0, drop=True)
)
df["roll7"] = roll.values
I/O pushdown you should actually use
Reading less is the best optimization.
- Polars lazy scans push select columns and filter predicates into Parquet readers automatically.
- In pandas, get similar effects by using Arrow Dataset APIs for filters, then materialize to pandas only the needed columns/rows.
# PyArrow Dataset -> filtered/table -> pandas
import pyarrow.dataset as ds
dataset = ds.dataset("/data/orders", format="parquet")
table = dataset.to_table(
columns=["order_id", "user_id", "amount", "ts"],
filter=ds.field("amount") > 0,
)
df = table.to_pandas(self_destruct=True)
Compare to Polars, where the same result comes from a single lazy plan:
import polars as pl
df = (
pl.scan_parquet("/data/orders")
.select(["order_id", "user_id", "amount", "ts"])
.filter(pl.col("amount") > 0)
.collect(streaming=True)
)
A tiny, fair benchmark harness
Keep micro-benches honest and small. Measure both throughput and peak memory; report environment (Python, library versions, CPU).
# bench/polars_pandas/join_groupby.py
import os, time
import numpy as np
import pandas as pd
import polars as pl
N = int(os.getenv("N", 2_000_000))
K = 100_000
def build_dataframes():
    rng = np.random.default_rng(0)
    user = rng.integers(0, K, size=N)
    amount = rng.random(N).astype("float64")
    # Timestamps spanning January 2024, so the 2024-01-01 filters below keep rows
    ts = pd.to_datetime(rng.integers(1_704_067_200, 1_706_745_600, size=N), unit="s")
    pdf_orders = pd.DataFrame({"user_id": user, "amount": amount, "ts": ts})
    pdf_users = pd.DataFrame({"user_id": np.arange(K), "segment": rng.integers(0, 10, size=K)})
    pl_orders = pl.from_pandas(pdf_orders)
    pl_users = pl.from_pandas(pdf_users)
    return (pdf_orders, pdf_users, pl_orders, pl_users)
def bench(fn, warmups=1, runs=5):
    for _ in range(warmups): fn()
    t = []
    for _ in range(runs):
        s = time.perf_counter(); fn(); t.append(time.perf_counter() - s)
    return min(t)
def pandas_join_groupby(pdf_orders, pdf_users):
    df = pdf_orders[pdf_orders["ts"] >= pd.Timestamp("2024-01-01")]
    df = df.merge(pdf_users, on="user_id", how="left")
    out = df.groupby("segment", observed=True, sort=False).agg(orders=("user_id", "count"), gmv=("amount", "sum"))
    return out
def polars_join_groupby(pl_orders, pl_users):
    out = (
        pl_orders.lazy()
        .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
        .join(pl_users.lazy(), on="user_id", how="left")
        .group_by("segment")
        .agg([pl.len().alias("orders"), pl.col("amount").sum().alias("gmv")])
        .collect(streaming=True)
    )
    return out
if __name__ == "__main__":
    pdf_orders, pdf_users, pl_orders, pl_users = build_dataframes()
    t_pd = bench(lambda: pandas_join_groupby(pdf_orders, pdf_users))
    t_pl = bench(lambda: polars_join_groupby(pl_orders, pl_users))
    print({"pandas_s": round(t_pd, 3), "polars_s": round(t_pl, 3)})
What to watch:
- For pandas, avoid apply/Python loops; keep operations in grouped reductions and vectorized transforms.
- For Polars, stick to expressions; Python UDFs break fusion and parallelism.
- Confirm Arrow dtypes in pandas before timing (convert_dtypes(dtype_backend="pyarrow")).
Interop checklist (working set)
- Strings: prefer Arrow strings over Python object in pandas; in Polars, strings are already Arrow‑backed.
- Categoricals: Arrow dictionary encoding <-> pandas Categorical; validate codes/order if round‑tripping.
- Timestamps/time zones: ensure consistent units (ns) and TZ; Polars and pandas both support tz‑aware types but conversions may materialize.
- Decimals: verify precision/scale support in both engines if you rely on exactness.
A small pre‑flight check along these lines is sketched below.
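A minimal sketch of a hypothetical pre‑flight helper for pandas frames; it only covers object strings and timezone‑naive numpy timestamps, so extend it for your own columns.
# Pre-flight check before handing a frame across an Arrow boundary
import pandas as pd
def check_interop_ready(df: pd.DataFrame) -> list[str]:
    problems = []
    for col, dtype in df.dtypes.items():
        if str(dtype) == "object":
            problems.append(f"{col}: object dtype (convert to an Arrow-backed string)")
        if pd.api.types.is_datetime64_any_dtype(dtype) and getattr(dtype, "tz", None) is None:
            problems.append(f"{col}: timezone-naive timestamp (localize to UTC)")
    return problems
sample = pd.DataFrame({"s": ["a"], "ts": pd.to_datetime(["2024-01-01"])})
print(check_interop_ready(sample))  # flags both the object string and the naive timestamp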
Time‑series patterns that pay off
Time windows and sessionization dominate analytics. Prefer built‑in time windows over manual loops.
Rolling/dynamic windows
# Polars: time-based tumbling window per key
import polars as pl
q = (
pl.scan_parquet("/data/events")
.group_by_dynamic(
index_column="ts",
every="1h", # window stride
period="1h", # window length
by="user_id",
closed="left",
)
.agg([
pl.len().alias("events"),
pl.col("amount").sum().alias("gmv"),
])
)
hourly = q.collect(streaming=True)
# pandas: resample after grouping per key
import pandas as pd
df = pd.read_parquet("/data/events", columns=["user_id", "ts", "amount"])\
.convert_dtypes(dtype_backend="pyarrow")
df = df.set_index("ts")
agg = (
df.groupby("user_id", sort=False)
.resample("1H").agg(events=("amount", "size"), gmv=("amount", "sum"))
.reset_index()
)
As‑of joins (event alignment)
# Polars: join the latest price before each trade timestamp
import polars as pl
# join_asof needs both sides sorted by the "on" key
trades = pl.scan_parquet("/data/trades").select(["ts", "symbol", "qty"]).sort("ts")
prices = pl.scan_parquet("/data/prices").select(["ts", "symbol", "px"]).sort("ts")
enriched = (
trades.join_asof(
prices,
on="ts",
by="symbol",
strategy="backward",
allow_parallel=True,
)
.collect(streaming=True)
)
# pandas: merge_asof with by-key alignment
import pandas as pd
trades = pd.read_parquet("/data/trades", columns=["ts", "symbol", "qty"])\
.sort_values(["symbol", "ts"]).convert_dtypes(dtype_backend="pyarrow")
prices = pd.read_parquet("/data/prices", columns=["ts", "symbol", "px"])\
.sort_values(["symbol", "ts"]).convert_dtypes(dtype_backend="pyarrow")
enriched = pd.merge_asof(
trades, prices, on="ts", by="symbol", direction="backward"
)
Partitioned datasets and pruning
Design layouts that let engines skip work.
- Organize Parquet by frequently‑filtered keys (date, tenant, region) using Hive‑style key=value/ paths.
- Use dataset APIs to filter partitions before scanning.
# PyArrow Dataset with partition pruning
import pyarrow.dataset as ds
dataset = ds.dataset("/lake/events", format="parquet", partitioning="hive")
tbl = dataset.to_table(
columns=["user_id", "kind", "ts"],
filter=(ds.field("dt") >= ds.scalar("2024-01-01")) & (ds.field("region") == ds.scalar("us")),
)
df = tbl.to_pandas(self_destruct=True)
# Polars hive partition pruning
import polars as pl
df = (
pl.scan_parquet("/lake/events/dt=*/region=us/*.parquet", hive_partitioning=True)
.filter(pl.col("dt") >= pl.lit("2024-01-01"))
.select(["user_id", "kind", "ts"]).collect(streaming=True)
)
Dtype and memory alignment (cheat sheet)
- Strings: prefer Arrow strings in pandas (convert_dtypes(dtype_backend="pyarrow")); avoid object.
- Keys with low cardinality: use categoricals—pandas astype("category") or Polars pl.col(...).cast(pl.Categorical); this reduces memory and speeds up joins/groupbys.
- Timestamps: normalize to UTC with ns precision; be consistent across engines.
- Numeric precision: validate decimal128 needs; otherwise stick to float64/int64 for vectorized kernels.
# pandas: normalize dtypes and enable CoW
import pandas as pd
pd.options.mode.copy_on_write = True
df = pd.read_parquet("/data/orders")
df = df.convert_dtypes(dtype_backend="pyarrow")
df["segment"] = df["segment"].astype("category")
df["ts"] = pd.to_datetime(df["ts"], utc=True)
# Polars: global string cache for stable categoricals across scans
import polars as pl
pl.enable_string_cache()
df = pl.read_parquet("/data/orders")
df = df.with_columns(pl.col("segment").cast(pl.Categorical))
Resource controls and streaming outputs
Bound resources to avoid noisy‑neighbor issues and to keep pipelines predictable.
- Threads: set POLARS_MAX_THREADS (env) to cap Polars’ concurrency in shared hosts.
- I/O: prefer Zstandard compression for Parquet when CPU budget allows; keep row groups reasonably sized (e.g., 128–512 MB) for scan efficiency.
- Streaming: for large, one‑pass transforms, keep memory bounded via lazy + streaming collection.
export POLARS_MAX_THREADS=8
# Polars: transform then write compact Parquet
import polars as pl
out = (
pl.scan_parquet("/lake/events")
.filter(pl.col("amount") > 0)
.select(["user_id", "amount", "ts"])
.collect(streaming=True)
)
out.write_parquet("/lake/out/events_2024.parquet", compression="zstd", statistics=True)
# pandas: Arrow-accelerated Parquet write
import pandas as pd
df = pd.read_parquet("/lake/events", columns=["user_id", "amount", "ts"])\
.convert_dtypes(dtype_backend="pyarrow")
df.to_parquet("/lake/out/events_2024.parquet", compression="zstd", engine="pyarrow")
Quality gates and observability for dataframes
- Row‑count invariants at stages (pre/post join, post filter).
- Hash‑based reconciliation for key columns (cheap checksums).
- Log dtypes and null ratios; alert when distributions shift.
# Simple invariant helpers
import pandas as pd
import numpy as np
def assert_rows(df: pd.DataFrame, min_rows: int) -> None:
    n = len(df)
    if n < min_rows:
        raise AssertionError(f"row guard: {n} < {min_rows}")
def checksum(series: pd.Series) -> int:
    return int(pd.util.hash_pandas_object(series, index=False).sum())
def profile(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "col": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "null_ratio": [float(df[c].isna().mean()) for c in df.columns],
    })
Choosing wisely (a practical decision guide)
Rules of thumb:
- If you need planner optimizations and parallelism without changing your data model, use Polars lazy pipelines.
- If your stack leans on pandas’ ecosystem, upgrade to pandas 2.x, adopt Arrow dtypes, and keep I/O Arrow‑native.
- When in doubt, structure boundaries around Arrow Tables so you can switch engines later.
A small migration playbook
Create thin helpers so call sites don’t care which engine you run today.
# dfio.py — unified I/O helpers
from __future__ import annotations
from typing import Iterable, Optional
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pandas as pd
import polars as pl
def scan_dataset(path: str, columns: Optional[Iterable[str]] = None, filter: Optional[ds.Expression] = None) -> pa.Table:
    dataset = ds.dataset(path, format="parquet")
    return dataset.to_table(columns=columns, filter=filter)
def to_pandas(tbl: pa.Table) -> pd.DataFrame:
    return tbl.to_pandas(self_destruct=True)
def to_polars(tbl: pa.Table) -> pl.DataFrame:
    return pl.from_arrow(tbl)
def write_parquet(tbl: pa.Table, dest: str, compression: str = "zstd") -> None:
    pq.write_table(tbl, dest, compression=compression)
Normalize dtypes at the edges:
# pandas dtype normalization
import pandas as pd
def normalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.convert_dtypes(dtype_backend="pyarrow")
    if "segment" in df.columns:
        df["segment"] = df["segment"].astype("category")
    if "ts" in df.columns:
        df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return df
Polars execution hygiene:
import os, polars as pl
os.environ.setdefault("POLARS_MAX_THREADS", "8")
pl.enable_string_cache()
Reproducible benchmarks and budgets
Keep methodology boring and fair:
- Pin versions; print them in output (pandas.__version__, polars.__version__, pyarrow.__version__).
- Report CPU model, core count, and Python version; a small helper for this is sketched below.
- Warm once, report min/median over several runs; measure peak RSS.
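One way to capture that environment fingerprint next to your numbers, as a minimal sketch:
# Print library versions and hardware context alongside benchmark results
import os, platform, sys
import pandas as pd
import polars as pl
import pyarrow as pa
def report_env() -> dict:
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "cpu_count": os.cpu_count(),
        "pandas": pd.__version__,
        "polars": pl.__version__,
        "pyarrow": pa.__version__,
    }
print(report_env())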
# Minimal RSS sampler (Linux/macOS)
import os, resource
def max_rss_mb() -> float:
    r = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux; normalize to MB
    return r / (1024 * 1024) if os.uname().sysname == "Darwin" else r / 1024
Troubleshooting guide (what to try first)
- Copies on interop: ensure pandas uses Arrow dtypes; call Table.to_pandas(zero_copy_only=True) and handle the exception to discover problematic columns (see the sketch after this list).
- Slow scans: verify predicate/column pushdown. In Polars, print the optimized plan with .explain(). With Arrow Datasets, push filter and columns into to_table.
- Memory spikes: enable Polars streaming; in pandas, chunk I/O or switch heavy transforms to Polars and return via Arrow.
- Joins sluggish: categorize low‑cardinality keys; sort‑merge joins often benefit from pre‑sorting and aligned partitioning.
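For that first bullet, a minimal sketch of a hypothetical helper that attempts a zero‑copy conversion column by column and reports the offenders (string columns, multi‑chunk arrays, and nullable data typically require a copy):
# Find columns that block zero-copy Arrow -> pandas conversion
import pyarrow as pa
def blocking_columns(tbl: pa.Table) -> list[str]:
    offenders = []
    for name in tbl.column_names:
        try:
            tbl.column(name).to_pandas(zero_copy_only=True)
        except pa.ArrowException:
            offenders.append(name)
    return offenders
tbl = pa.table({"x": [1, 2, 3], "s": ["a", "b", "c"]})
print(blocking_columns(tbl))  # the string column typically needs a copy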
Closing thoughts
Columnar memory and Arrow interop let you separate “how data looks” from “which engine executes.” Whether you stay in pandas 2.x with Arrow dtypes or move hot paths to Polars, the same habits win: push projections/filters into I/O, stick to vectorized expressions, bound resources, and measure both throughput and memory. Keep Arrow at the seams so swapping engines becomes a configuration change—not a rewrite.
References
- Polars User Guide — Lazy, optimizer, streaming: pola.rs/book
- Polars API Reference: docs.pola.rs
- Polars environment variables (threads, parallelism): docs.pola.rs/user-guide/optimizations/parallel
- Polars categorical and string cache: docs.pola.rs/user-guide/expressions/categorical
- Polars group_by_dynamic (time windows): docs.pola.rs/user-guide/expressions/group_by
- Polars join_asof: docs.pola.rs/user-guide/expressions/joining/#asof-joins
- pandas docs — Arrow‑backed dtypes and dtype_backend: pandas.pydata.org
- pandas Copy‑on‑Write: pandas.pydata.org/user_guide/copy_on_write.html
- pandas I/O (Parquet/Feather via PyArrow): pandas.pydata.org/user_guide/io.html
- pandas merge_asof: pandas.pydata.org/reference/api/pandas.merge_asof.html
- pandas groupby and window guides: pandas.pydata.org/user_guide/groupby.html, pandas.pydata.org/user_guide/window.html
- Apache Arrow — spec and Python docs: arrow.apache.org
- Arrow Python — Table.to_pandas and zero‑copy options: arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas
- Arrow Python — Dataset API (filters, partition pruning): arrow.apache.org/docs/python/dataset.html
- Arrow IPC/Feather interop: arrow.apache.org/docs/python/feather.html
- Parquet format overview: parquet.apache.org/docs/overview
- DuckDB — Arrow interoperability: duckdb.org/docs/guides/python/arrow
- pandas nullable dtypes overview: pandas.pydata.org/user_guide/integer_na.html
- Datapythonista — pandas 2.0 and Arrow series: datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i