High-performance DataFrames: Polars, pandas 2, and Arrow Interop

Published: July 20, 2020 · 13 min read

Updated: December 10, 2024

You probably use pandas by default. It’s battle‑tested and ubiquitous, but many workloads now push past what eager, single‑threaded, copy‑heavy pipelines handle gracefully. Modern columnar stacks based on Apache Arrow bring faster I/O, better memory locality, and zero‑copy interop across tools. In Python, two practical paths emerge:

  • pandas 2.x, which adds Arrow‑backed dtypes and cleaner memory semantics (Copy‑on‑Write) while keeping the familiar API.
  • Polars, a Rust‑powered DataFrame engine with an expression/lazy optimizer, parallel execution, and first‑class Arrow memory.

This post sets the mental model you’ll use throughout: what Arrow actually buys you, when Polars wins, when pandas 2 suffices, and how to structure code so you can move data between engines with near‑zero overhead.

The model: Arrow in the middle

Apache Arrow is a standardized, columnar in‑memory layout. It’s not “a file format” (though there are Arrow IPC/Feather and Parquet formats built on the same ideas). The win is simple: many tools can operate over the same buffers in place, so you skip serialization and redundant copies.

flowchart LR
  subgraph Python
    P2[pandas 2.x<br/>Arrow dtypes]
    PL[Polars]
  end
  subgraph Arrow
    A1[Arrow Arrays<br/>Tables (columnar)]
  end
  subgraph Other
    DK[DuckDB]
    SP[Spark (pandas UDFs)]
    PQ[Parquet/IPC]
  end
  P2 <--> A1
  PL <--> A1
  DK <--> A1
  SP <--> A1
  A1 <--> PQ
  classDef focus fill:#e8f5e8,stroke:#b6e3b6,color:#000
  class A1 focus

Implications you can rely on:

  • Zero‑copy handoff is possible in many common cases (fixed‑width types, strings, timestamps) when you convert pandas↔Arrow or Polars↔Arrow with the right flags.
  • Columnar layout improves CPU cache behavior for vectorized ops and enables predicate/column pruning in readers.
  • Interop becomes an architectural choice rather than a data engineering project.

Polars vs pandas 2.x at a glance (practitioner’s view)

  • Polars uses an expression engine with lazy evaluation by default for scan_* sources. It builds a plan, applies optimization (projection/predicate pushdown, join/agg reordering, caching), then executes in parallel.
  • pandas 2.x remains eager and largely single‑threaded at the Python level, but adds Arrow‑backed dtypes (dtype_backend="pyarrow") and Copy‑on‑Write to cut hidden copies. I/O via PyArrow and Parquet/IPC gets faster and more memory‑efficient.
  • Both speak Arrow fluently, so round‑tripping between them can avoid copies with the right APIs.

A tiny taste of the Polars planner

flowchart TB
  S[scan_parquet("events.parquet")] --> F[filter col("ts") >= 2024-01-01]
  F --> P[project cols: user_id, kind, ts]
  P --> G[groupby user_id | agg count, nunique(kind)]
  G --> L[limit 10]
  classDef step fill:#e1f5fe,stroke:#90caf9,color:#000
  class S,F,P,G,L step

Under the hood, Polars will:

  • Push the projection down to the reader so only the referenced columns are read.
  • Push the filter into the scan so row‑group statistics can be used to skip whole chunks.
  • Parallelize scan, filter, and groupby across cores.
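
You can see these rewrites for yourself by printing the optimized plan. A minimal sketch (the file path is illustrative):

import polars as pl

q = (
  pl.scan_parquet("events.parquet")
    .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
    .select(["user_id", "kind", "ts"])
)

# The optimized plan shows the projection and predicate pushed into the scan
print(q.explain())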

pandas 2.x with Arrow dtypes (getting the foundation right)

If you’re staying with pandas, flip your defaults toward Arrow dtypes for better memory use and faster I/O. Start with Parquet/IPC, which are Arrow‑native.

# pandas 2.x
import pandas as pd
import pyarrow as pa

# Prefer Arrow-backed dtypes on ingest
df = pd.read_parquet("events.parquet")
df = df.convert_dtypes(dtype_backend="pyarrow")

# Confirm Arrow extension dtypes where available (strings, timestamps, ints with NA, etc.)
print(df.dtypes)

# Round-trip to Arrow; zero-copy is possible for many Arrow-backed columns
tbl = pa.Table.from_pandas(df)
# ... do work in an Arrow-enabled system ...

# Back to pandas, still trying to avoid copies
df2 = tbl.to_pandas(use_threads=True, self_destruct=True)

Notes you can count on:

  • convert_dtypes(dtype_backend="pyarrow") migrates eligible columns to Arrow‑backed extension dtypes. Coverage is wide for fixed‑width types and strings, improving steadily in recent pandas releases.
  • pa.Table.from_pandas(df) and Table.to_pandas(self_destruct=True) can avoid copies when memory layout and null semantics line up. When they don’t, the conversion quietly copies; pass zero_copy_only=True to to_pandas to raise instead, and treat that as a signal to adjust dtypes.
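
A quick way to find the columns that block zero‑copy is to attempt the strict conversion and catch the error. A small sketch (the DataFrame is illustrative):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]}).convert_dtypes(dtype_backend="pyarrow")
tbl = pa.Table.from_pandas(df)

try:
    # zero_copy_only=True raises if any column would require a copy
    strict = tbl.to_pandas(zero_copy_only=True)
except pa.ArrowInvalid as exc:
    # Fall back to a copying conversion, but note which buffers forced it
    print(f"zero-copy not possible: {exc}")
    strict = tbl.to_pandas()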

The Polars way: expressions, lazy, and streaming

Polars lets you describe transformations with expressions; the engine fuses and parallelizes them. Use scan_* for lazy pipelines, and turn on streaming for large, one‑pass flows.

import polars as pl
 
# Lazy: build a plan without loading all bytes up front
q = (
  pl.scan_parquet("events.parquet")
    .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
    .select(["user_id", "kind", "ts"])
    .group_by("user_id")
    .agg([
      pl.len().alias("events"),
      pl.col("kind").n_unique().alias("kinds"),
    ])
    .sort("events", descending=True)
    .limit(10)
)
 
# Execute; for very large sources, enable streaming execution
out = q.collect(streaming=True)
print(out)
 
# Interop: zero-copy to Arrow, and then to pandas when needed
tbl = out.to_arrow()
df = tbl.to_pandas(self_destruct=True)

Practical guidance:

  • Prefer scan_parquet/scan_csv over eager read_* when the result is consumed once—this gives the optimizer room to push filters/projections into the reader.
  • Add .collect(streaming=True) for large, forward‑only workloads; memory stays bounded.
  • Use expressions (pl.col, when().then().otherwise()) instead of Python UDFs; they’re vectorized and fusable.
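
For example, a conditional bucket computed with expressions stays on the vectorized path, while the equivalent Python UDF runs row by row and blocks fusion. A small sketch (column names are illustrative):

import polars as pl

df = pl.DataFrame({"amount": [5.0, 50.0, 500.0]})

# Vectorized expression: fused and parallelized by the engine
fast = df.with_columns(
  pl.when(pl.col("amount") > 100).then(pl.lit("big")).otherwise(pl.lit("small")).alias("bucket")
)

# Python UDF: per-row callback; avoid in hot paths
slow = df.with_columns(
  pl.col("amount").map_elements(lambda x: "big" if x > 100 else "small", return_dtype=pl.Utf8).alias("bucket")
)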

What Arrow actually changes in practice

  • Faster reads/writes: Parquet/IPC + Arrow kernels avoid Python loops and convert fewer times.
  • Lower memory: Arrow strings and dictionary‑encoded categoricals can materially reduce heap use compared to object dtype.
  • Clean interop: jump between engines by exchanging Arrow Tables instead of CSVs or pickles.
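
To see the string‑memory effect concretely, compare an object‑backed column with an Arrow‑backed one. A minimal sketch (sizes depend on your data):

import pandas as pd

s_obj = pd.Series(["click", "view", "buy"] * 1_000_000, dtype="object")
s_arrow = s_obj.astype("string[pyarrow]")

print(s_obj.memory_usage(deep=True))    # counts per-object Python string overhead
print(s_arrow.memory_usage(deep=True))  # contiguous Arrow buffers, typically much smaller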

A tiny interop round‑trip you can keep in your toolbox

import polars as pl
import pandas as pd
import pyarrow as pa
 
pl_df = pl.DataFrame({
  "user_id": [1, 1, 2, 3],
  "kind": ["view", "click", "view", "buy"],
})
 
# Polars -> Arrow (zero-copy)
tbl = pl_df.to_arrow()
 
# Arrow -> pandas (try to avoid copies)
pd_df = tbl.to_pandas(self_destruct=True)
 
# pandas -> Arrow again (may copy if columns are not Arrow-backed)
tbl2 = pa.Table.from_pandas(pd_df)
 
# Arrow -> Polars
pl_df2 = pl.from_arrow(tbl2)

If any step needs a copy, you’ll see it in profiles—use it as a nudge to normalize dtypes (e.g., switch pandas columns to Arrow‑backed dtypes).


Takeaways you can apply today

  • Treat Arrow as your shared, columnar contract; push I/O through Parquet/IPC and keep memory on the Arrow path.
  • In pandas 2.x projects, adopt Arrow dtypes (dtype_backend="pyarrow") and Parquet/IPC readers first—it’s the highest‑leverage change with minimal refactor.
  • In pipelines with heavy joins/groupbys or large scans, prefer Polars lazy queries; you’ll get pushdown, fusion, and parallel execution without changing the data model.
  • Keep interop explicit: convert via Arrow Tables; measure copies with tools like memory_profiler and py-spy.

In the rest of the post, we’ll deepen this with benchmarks that reflect real workloads, idiomatic Polars expression patterns (joins, window functions), and a precise interop checklist—so you can mix engines intentionally rather than by accident.

Joins, aggregations, and windows that scale

Real pipelines are dominated by joins and group aggregations. Here are patterns that stay on the vectorized path and keep memory bounded.

Joins

# Polars: lazy join with projection/predicate pushdown and parallel execution
import polars as pl
 
orders = pl.scan_parquet("orders.parquet").select(["order_id", "user_id", "amount", "ts"]) 
users  = pl.scan_parquet("users.parquet").select(["user_id", "segment"])
 
top = (
  orders
    .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
    .join(users, on="user_id", how="left")
    .group_by(["segment"]) 
    .agg([
      pl.len().alias("orders"),
      pl.col("amount").sum().alias("gmv"),
    ])
    .sort("gmv", descending=True)
    .limit(5)
    .collect(streaming=True)
)
# pandas 2.x: column-pruned merge + grouped aggregations
import pandas as pd
 
orders = pd.read_parquet("orders.parquet", columns=["order_id", "user_id", "amount", "ts"])\
           .convert_dtypes(dtype_backend="pyarrow")
users  = pd.read_parquet("users.parquet", columns=["user_id", "segment"])\
           .convert_dtypes(dtype_backend="pyarrow")
 
orders = orders[orders["ts"] >= pd.Timestamp("2024-01-01")]
df = orders.merge(users, on="user_id", how="left")
 
out = (
  df.groupby("segment", observed=True, sort=False)
    .agg(orders=("order_id", "count"), gmv=("amount", "sum"))
    .sort_values("gmv", ascending=False)
    .head(5)
    .reset_index()
)

Guidance:

  • In pandas, prefer groupby(..., observed=True) when using categoricals; it avoids emitting empty groups.
  • Prune columns at read time (columns=[...]) and avoid apply in hot joins/aggregations.
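
For example, casting a low‑cardinality key to category before grouping keeps memory down, and observed=True skips categories that have no rows. A small sketch (column names are illustrative):

import pandas as pd

df = pd.DataFrame({
  "segment": ["a", "b", "a", "c"],
  "amount": [10.0, 5.0, 7.5, 1.0],
})
df["segment"] = df["segment"].astype("category")

out = (
  df.groupby("segment", observed=True, sort=False)["amount"]
    .sum()
    .reset_index()
)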

Window functions

# Polars: per-user cumulative and rolling windows
import polars as pl
 
q = (
  pl.scan_parquet("events.parquet")
    .select([
      "user_id", "ts", "amount",
      pl.col("amount").cum_sum().over("user_id").alias("cum_amount"),
      pl.col("amount").rolling_mean(window_size=7).over("user_id").alias("roll7"),
    ])
)
df = q.collect(streaming=True)
# pandas: per-user cumulative and time-based rolling window
import pandas as pd
 
df = pd.read_parquet("events.parquet", columns=["user_id", "ts", "amount"])\
       .convert_dtypes(dtype_backend="pyarrow")
df = df.sort_values(["user_id", "ts"])  # required for rolling-by-time stability
 
df["cum_amount"] = df.groupby("user_id", sort=False)["amount"].cumsum()
 
# Time-based rolling (7 days) per user; use transform to preserve shape
roll = (
  df.set_index("ts")
    .groupby("user_id", sort=False)["amount"]
    .rolling("7D").mean()
    .reset_index(level=0, drop=True)
)
df["roll7"] = roll.values
flowchart LR
  A[Scan] --> B[Filter]
  B --> C[Join]
  C --> D[GroupBy/Agg]
  D --> E[Window]
  E --> F[Output]
  classDef step fill:#fff3e0,stroke:#ffcc80,color:#000
  class A,B,C,D,E,F step

I/O pushdown you should actually use

Reading less is the best optimization.

  • Polars lazy scans push select columns and filter predicates into Parquet readers automatically.
  • In pandas, get similar effects by using Arrow Dataset APIs for filters, then materialize to pandas only the needed columns/rows.
# PyArrow Dataset -> filtered/table -> pandas
import pyarrow.dataset as ds
 
dataset = ds.dataset("/data/orders", format="parquet")
table = dataset.to_table(
  columns=["order_id", "user_id", "amount", "ts"],
  filter=ds.field("amount") > 0,
)
 
df = table.to_pandas(self_destruct=True)

Compare to Polars, where the same result comes from a single lazy plan:

import polars as pl
df = (
  pl.scan_parquet("/data/orders")
    .select(["order_id", "user_id", "amount", "ts"])
    .filter(pl.col("amount") > 0)
    .collect(streaming=True)
)

A tiny, fair benchmark harness

Keep micro-benches honest and small. Measure both throughput and peak memory; report environment (Python, library versions, CPU).

# bench/polars_pandas/join_groupby.py
import os, time
import numpy as np
import pandas as pd
import polars as pl
 
N = int(os.getenv("N", 2_000_000))
K = 100_000
 
def build_dataframes():
    rng = np.random.default_rng(0)
    user = rng.integers(0, K, size=N)
    amount = rng.random(N).astype("float64")
    # Span Dec 2023 - Feb 2024 so the 2024-01-01 filter below keeps roughly half the rows
    ts = pd.to_datetime(rng.integers(1_701_388_800, 1_706_745_600, size=N), unit="s")
 
    pdf_orders = pd.DataFrame({"user_id": user, "amount": amount, "ts": ts})
    pdf_users = pd.DataFrame({"user_id": np.arange(K), "segment": rng.integers(0, 10, size=K)})
 
    pl_orders = pl.from_pandas(pdf_orders)
    pl_users = pl.from_pandas(pdf_users)
    return (pdf_orders, pdf_users, pl_orders, pl_users)
 
def bench(fn, warmups=1, runs=5):
    for _ in range(warmups): fn()
    t = []
    for _ in range(runs):
        s = time.perf_counter(); fn(); t.append(time.perf_counter()-s)
    return min(t)
 
def pandas_join_groupby(pdf_orders, pdf_users):
    df = pdf_orders[pdf_orders["ts"] >= pd.Timestamp("2024-01-01")]
    df = df.merge(pdf_users, on="user_id", how="left")
    out = df.groupby("segment", observed=True, sort=False).agg(orders=("user_id", "count"), gmv=("amount", "sum"))
    return out
 
def polars_join_groupby(pl_orders, pl_users):
    out = (
        pl_orders.lazy()
        .filter(pl.col("ts") >= pl.datetime(2024, 1, 1))
        .join(pl_users.lazy(), on="user_id", how="left")
        .group_by("segment")
        .agg([pl.len().alias("orders"), pl.col("amount").sum().alias("gmv")])
        .collect(streaming=True)
    )
    return out
 
if __name__ == "__main__":
    pdf_orders, pdf_users, pl_orders, pl_users = build_dataframes()
    t_pd = bench(lambda: pandas_join_groupby(pdf_orders, pdf_users))
    t_pl = bench(lambda: polars_join_groupby(pl_orders, pl_users))
    print({"pandas_s": round(t_pd, 3), "polars_s": round(t_pl, 3)})
flowchart TB
  D[Generate synthetic data] --> H[Warm up]
  H --> M[Run N times]
  M --> R[Report min/median]
  R --> C[Compare CPU & memory]
  classDef step fill:#e8f5e8,stroke:#81c784,color:#000
  class D,H,M,R,C step

What to watch:

  • For pandas, avoid apply/Python loops; keep operations in grouped reductions and vectorized transforms.
  • For Polars, stick to expressions; Python UDFs break fusion and parallelism.
  • Confirm Arrow dtypes in pandas before timing (convert_dtypes(dtype_backend="pyarrow")).
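
A quick guard before timing, assuming pandas 2.x with pyarrow installed (a sketch):

import pandas as pd

def assert_arrow_backed(df: pd.DataFrame) -> None:
    # Fail fast if any column is still numpy/object-backed before benchmarking
    not_arrow = [c for c, t in df.dtypes.items() if not isinstance(t, pd.ArrowDtype)]
    if not_arrow:
        raise TypeError(f"columns not Arrow-backed: {not_arrow}")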

Interop checklist (working set)

  • Strings: prefer Arrow strings over Python object in pandas; in Polars, strings are already Arrow‑backed.
  • Categoricals: Arrow dictionary encoding <-> pandas Categorical; validate codes/order if round‑tripping.
  • Timestamps/time zones: ensure consistent units (ns) and TZ; Polars and pandas both support tz‑aware types but conversions may materialize.
  • Decimals: verify precision/scale support in both engines if you rely on exactness.
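
A quick round‑trip check for the categorical and timestamp items (a minimal sketch; extend it with your own columns):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
  "segment": pd.Categorical(["a", "b", "a"]),
  "ts": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"], utc=True),
})

tbl = pa.Table.from_pandas(df)
back = tbl.to_pandas()

# Categorical -> Arrow dictionary -> Categorical should preserve the categories
assert list(back["segment"].cat.categories) == list(df["segment"].cat.categories)
# Timestamps should keep time zone and nanosecond precision
assert str(back["ts"].dtype) == str(df["ts"].dtype)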

Time‑series patterns that pay off

Time windows and sessionization dominate analytics. Prefer built‑in time windows over manual loops.

Rolling/dynamic windows

# Polars: time-based tumbling window per key
import polars as pl
 
q = (
  pl.scan_parquet("/data/events")
    .group_by_dynamic(
      index_column="ts",
      every="1h",  # window stride
      period="1h",  # window length
      by="user_id",
      closed="left",
    )
    .agg([
      pl.len().alias("events"),
      pl.col("amount").sum().alias("gmv"),
    ])
)
hourly = q.collect(streaming=True)
# pandas: resample after grouping per key
import pandas as pd
 
df = pd.read_parquet("/data/events", columns=["user_id", "ts", "amount"])\
       .convert_dtypes(dtype_backend="pyarrow")
 
df = df.set_index("ts")
agg = (
  df.groupby("user_id", sort=False)
    .resample("1H").agg(events=("amount", "size"), gmv=("amount", "sum"))
    .reset_index()
)

As‑of joins (event alignment)

# Polars: join the latest price before each trade timestamp
import polars as pl

# join_asof expects both inputs sorted by the "on" key within each "by" group
trades = pl.scan_parquet("/data/trades").select(["ts", "symbol", "qty"]).sort(["symbol", "ts"])
prices = pl.scan_parquet("/data/prices").select(["ts", "symbol", "px"]).sort(["symbol", "ts"])
 
enriched = (
  trades.join_asof(
    prices,
    on="ts",
    by="symbol",
    strategy="backward",
    allow_parallel=True,
  )
  .collect(streaming=True)
)
# pandas: merge_asof with by-key alignment
import pandas as pd
 
trades = pd.read_parquet("/data/trades", columns=["ts", "symbol", "qty"])\
           .sort_values(["symbol", "ts"]).convert_dtypes(dtype_backend="pyarrow")
prices = pd.read_parquet("/data/prices", columns=["ts", "symbol", "px"])\
           .sort_values(["symbol", "ts"]).convert_dtypes(dtype_backend="pyarrow")
 
enriched = pd.merge_asof(
  trades, prices, on="ts", by="symbol", direction="backward"
)

Partitioned datasets and pruning

Design layouts that let engines skip work.

  • Organize Parquet by frequently‑filtered keys (date, tenant, region) using Hive‑style key=value/ paths.
  • Use dataset APIs to filter partitions before scanning.
# PyArrow Dataset with partition pruning
import pyarrow.dataset as ds
 
dataset = ds.dataset("/lake/events", format="parquet", partitioning="hive")
tbl = dataset.to_table(
  columns=["user_id", "kind", "ts"],
  filter=(ds.field("dt") >= ds.scalar("2024-01-01")) & (ds.field("region") == ds.scalar("us")),
)
df = tbl.to_pandas(self_destruct=True)
# Polars hive partition pruning
import polars as pl
df = (
  pl.scan_parquet("/lake/events/dt=*/region=us/*.parquet", hive_partitioning=True)
    .filter(pl.col("dt") >= pl.lit("2024-01-01"))
    .select(["user_id", "kind", "ts"]).collect(streaming=True)
)
flowchart LR
  P[Partitioned Parquet\n(dt=YYYY-MM-DD/region=...)] --> F[Partition filter]
  F --> C[Column projection]
  C --> E[Engine executes]
  E --> O[Result]
  classDef step fill:#e1f5fe,stroke:#90caf9,color:#000
  class P,F,C,E,O step

Dtype and memory alignment (cheat sheet)

  • Strings: prefer Arrow strings in pandas (convert_dtypes(dtype_backend="pyarrow")), avoid object.
  • Keys with low cardinality: use categoricals—pandas astype("category") or Polars pl.col(...).cast(pl.Categorical); reduces memory and speeds up joins/groupbys.
  • Timestamps: normalize to UTC with ns precision; be consistent across engines.
  • Numeric precision: validate decimal128 needs; otherwise stick to float64/int64 for vectorized kernels.
# pandas: normalize dtypes and enable CoW
import pandas as pd
pd.options.mode.copy_on_write = True
 
df = pd.read_parquet("/data/orders")
df = df.convert_dtypes(dtype_backend="pyarrow")
df["segment"] = df["segment"].astype("category")
df["ts"] = pd.to_datetime(df["ts"], utc=True)
# Polars: global string cache for stable categoricals across scans
import polars as pl
pl.enable_string_cache()
 
df = pl.read_parquet("/data/orders")
df = df.with_columns(pl.col("segment").cast(pl.Categorical))

Resource controls and streaming outputs

Bound resources to avoid noisy‑neighbor issues and to keep pipelines predictable.

  • Threads: set POLARS_MAX_THREADS (env) to cap Polars’ concurrency in shared hosts.
  • I/O: prefer Zstandard compression for Parquet when CPU budget allows; keep row groups reasonably sized (e.g., 128–512MB) for scan efficiency.
  • Streaming: for large, one‑pass transforms, keep memory bounded via lazy + streaming collection.
export POLARS_MAX_THREADS=8
# Polars: transform then write compact Parquet
out = (
  pl.scan_parquet("/lake/events")
    .filter(pl.col("amount") > 0)
    .select(["user_id", "amount", "ts"])
    .collect(streaming=True)
)
out.write_parquet("/lake/out/events_2024.parquet", compression="zstd", statistics=True)
# pandas: Arrow-accelerated Parquet write
df = pd.read_parquet("/lake/events", columns=["user_id", "amount", "ts"])\
       .convert_dtypes(dtype_backend="pyarrow")
df.to_parquet("/lake/out/events_2024.parquet", compression="zstd", engine="pyarrow")

Quality gates and observability for dataframes

  • Row‑count invariants at stages (pre/post join, post filter).
  • Hash‑based reconciliation for key columns (cheap checksums).
  • Log dtypes and null ratios; alert when distributions shift.
# Simple invariant helpers
import pandas as pd
import numpy as np
 
def assert_rows(df: pd.DataFrame, min_rows: int) -> None:
    n = len(df)
    if n < min_rows:
        raise AssertionError(f"row guard: {n} < {min_rows}")
 
def checksum(series: pd.Series) -> int:
    return int(pd.util.hash_pandas_object(series, index=False).sum())
 
def profile(df: pd.DataFrame) -> pd.DataFrame:
    return pd.DataFrame({
        "col": df.columns,
        "dtype": [str(t) for t in df.dtypes],
        "null_ratio": [float(df[c].isna().mean()) for c in df.columns],
    })

Choosing wisely (a practical decision guide)

flowchart TD
  Q[Workload] -->|Large scans/joins, one-pass| PLS[Polars lazy + streaming]
  Q -->|Mostly pandas ecosystem deps| PD[pandas 2.x + Arrow dtypes]
  Q -->|Mixed engines (DuckDB/Spark)| AR[Arrow Tables as contract]
  PD --> IO1[Read Parquet/IPC via Arrow]
  PLS --> IO2[scan_parquet + pushdown]
  AR --> INT[Zero-copy interop]
  classDef node fill:#e8f5e8,stroke:#81c784,color:#000
  class PLS,PD,AR,IO1,IO2,INT node

Rules of thumb:

  • If you need planner optimizations and parallelism without changing your data model, use Polars lazy pipelines.
  • If your stack leans on pandas’ ecosystem, upgrade to pandas 2.x, adopt Arrow dtypes, and keep I/O Arrow‑native.
  • When in doubt, structure boundaries around Arrow Tables so you can switch engines later.

A small migration playbook

Create thin helpers so call sites don’t care which engine you run today.

# dfio.py — unified I/O helpers
from __future__ import annotations
from typing import Iterable, Optional
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq
import pandas as pd
import polars as pl
 
def scan_dataset(path: str, columns: Optional[Iterable[str]] = None, filter: Optional[ds.Expression] = None) -> pa.Table:
    dataset = ds.dataset(path, format="parquet")
    return dataset.to_table(columns=columns, filter=filter)
 
def to_pandas(tbl: pa.Table) -> pd.DataFrame:
    return tbl.to_pandas(self_destruct=True)
 
def to_polars(tbl: pa.Table) -> pl.DataFrame:
    return pl.from_arrow(tbl)
 
def write_parquet(tbl: pa.Table, dest: str, compression: str = "zstd") -> None:
    pq.write_table(tbl, dest, compression=compression)

Normalize dtypes at the edges:

# pandas dtype normalization
import pandas as pd
 
def normalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.convert_dtypes(dtype_backend="pyarrow")
    if "segment" in df.columns:
        df["segment"] = df["segment"].astype("category")
    if "ts" in df.columns:
        df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return df

Polars execution hygiene:

import os
os.environ.setdefault("POLARS_MAX_THREADS", "8")  # must be set before polars is imported to take effect

import polars as pl
pl.enable_string_cache()

Reproducible benchmarks and budgets

Keep methodology boring and fair:

  • Pin versions; print them in output (pandas.__version__, polars.__version__, pyarrow.__version__).
  • Report CPU model, core count, and Python version.
  • Warm once, report min/median over several runs; measure peak RSS.
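
A small helper for recording the environment next to the numbers (a sketch):

import platform
import pandas as pd
import polars as pl
import pyarrow as pa

def environment() -> dict:
    # Capture the details that make benchmark results comparable
    return {
        "python": platform.python_version(),
        "machine": platform.machine(),
        "pandas": pd.__version__,
        "polars": pl.__version__,
        "pyarrow": pa.__version__,
    }

print(environment())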
# Minimal RSS sampler (Linux/macOS)
import os, resource
 
def max_rss_mb() -> float:
    r = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in bytes on macOS and kilobytes on Linux; normalize to MB
    return r / (1024 * 1024) if os.uname().sysname == "Darwin" else r / 1024

Troubleshooting guide (what to try first)

  • Copies on interop: ensure pandas uses Arrow dtypes; convert with pa.Table.from_pandas, then call Table.to_pandas(zero_copy_only=True) and handle the exception to discover problematic columns.
  • Slow scans: verify predicate/column pushdown. In Polars, print the optimized plan (.explain()). With Arrow Datasets, push filter and columns into to_table.
  • Memory spikes: enable Polars streaming; in pandas, chunk I/O with Arrow record batches (see the sketch after this list) or switch heavy transforms to Polars and return via Arrow.
  • Joins sluggish: categorize low‑cardinality keys; sort‑merge joins often benefit from pre‑sorting and aligned partitioning.
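
For the chunked‑I/O route in pandas, Arrow Datasets can stream record batches so only one batch is resident at a time. A sketch (path and columns are illustrative):

import pyarrow.dataset as ds

dataset = ds.dataset("/lake/events", format="parquet")

total = 0.0
# Stream record batches instead of materializing the full table
for batch in dataset.to_batches(columns=["amount"], filter=ds.field("amount") > 0):
    total += batch.to_pandas()["amount"].sum()

print(total)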

Closing thoughts

Columnar memory and Arrow interop let you separate “how data looks” from “which engine executes.” Whether you stay in pandas 2.x with Arrow dtypes or move hot paths to Polars, the same habits win: push projections/filters into I/O, stick to vectorized expressions, bound resources, and measure both throughput and memory. Keep Arrow at the seams so swapping engines becomes a configuration change—not a rewrite.

