
Core API Reference

Complete API reference for ParquetFrame core functionality.

Core Classes

ParquetFrame

The main class for working with parquet data.

parquetframe.ParquetFrame = DataFrameProxy (module attribute)

ParquetFrame is a module-level alias for DataFrameProxy, the unified DataFrame interface returned by read().
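A minimal sketch of how the alias is used in practice: because ParquetFrame simply points at DataFrameProxy, the object returned by read() can be type-checked against either name.

import parquetframe as pf

df = pf.read("data.parquet")

# The proxy returned by read() is a ParquetFrame (i.e. a DataFrameProxy)
assert isinstance(df, pf.ParquetFrame)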

Core Functions

Reading Data

Functions for loading data from various sources.

parquetframe.read(file, engine=None, **kwargs)

Read a data file with automatic format detection and intelligent engine selection.

This function provides the Phase 2 multi-engine API with automatic selection between pandas, Polars, and Dask based on dataset characteristics.

Parameters:

  • file (str | Path, required): Path to the data file. Format auto-detected from extension.
  • engine (str | None, default None): Force a specific engine ("pandas", "polars", or "dask"). If None, automatically selects the optimal engine.
  • **kwargs (Any, default {}): Additional keyword arguments passed to format-specific readers.

Returns:

  • DataFrameProxy: Unified DataFrame interface with intelligent backend.

Supported Formats
  • CSV (.csv, .tsv)
  • JSON (.json, .jsonl, .ndjson)
  • Parquet (.parquet, .pqt)
  • ORC (.orc)
  • Avro (.avro)

Engine Selection (when engine=None):
  • pandas: < 100MB (eager, rich ecosystem)
  • Polars: 100MB - 10GB (lazy, high performance)
  • Dask: > 10GB (distributed, scalable)

Examples:

>>> import parquetframe as pf
>>> # Automatic engine selection
>>> df = pf.read("sales.csv")
>>> print(f"Using {df.engine_name} engine")
>>>
>>> # Force specific engine
>>> df = pf.read("data.parquet", engine="polars")
>>>
>>> # Configure thresholds globally
>>> pf.set_config(pandas_threshold_mb=50.0)
>>> df = pf.read("medium.csv")  # Uses configured threshold
Migration from Phase 1

Phase 1 code using the islazy parameter should migrate to the engine parameter:

Before (Phase 1):

>>> df = pf.read("data.csv", islazy=True)  # Force Dask
>>> if df.islazy:
>>>     result = df.df.compute()

After (Phase 2):

>>> df = pf.read("data.csv", engine="dask")  # Force Dask
>>> if df.engine_name == "dask":
>>>     result = df.native.compute()

See Also
  • read_csv(): Read CSV files specifically
  • read_parquet(): Read Parquet files specifically
  • read_avro(): Read Avro files specifically
  • parquetframe.legacy: Phase 1 API (deprecated)
Source code in src/parquetframe/__init__.py
def read(
    file: str | Path,
    engine: str | None = None,
    **kwargs: Any,
) -> DataFrameProxy:
    """
    Read a data file with automatic format detection and intelligent engine selection.

    This function provides the Phase 2 multi-engine API with automatic selection
    between pandas, Polars, and Dask based on dataset characteristics.

    Args:
        file: Path to the data file. Format auto-detected from extension.
        engine: Force specific engine ("pandas", "polars", or "dask").
                If None, automatically selects optimal engine.
        **kwargs: Additional keyword arguments passed to format-specific readers.

    Returns:
        DataFrameProxy: Unified DataFrame interface with intelligent backend.

    Supported Formats:
        - CSV (.csv, .tsv)
        - JSON (.json, .jsonl, .ndjson)
        - Parquet (.parquet, .pqt)
        - ORC (.orc)
        - Avro (.avro)

    Engine Selection (when engine=None):
        - pandas: < 100MB (eager, rich ecosystem)
        - Polars: 100MB - 10GB (lazy, high performance)
        - Dask: > 10GB (distributed, scalable)

    Examples:
        >>> import parquetframe as pf
        >>> # Automatic engine selection
        >>> df = pf.read("sales.csv")
        >>> print(f"Using {df.engine_name} engine")
        >>>
        >>> # Force specific engine
        >>> df = pf.read("data.parquet", engine="polars")
        >>>
        >>> # Configure thresholds globally
        >>> pf.set_config(pandas_threshold_mb=50.0)
        >>> df = pf.read("medium.csv")  # Uses configured threshold

    Migration from Phase 1:
        Phase 1 code using `islazy` parameter should migrate to `engine` parameter:

        Before (Phase 1):
            >>> df = pf.read("data.csv", islazy=True)  # Force Dask
            >>> if df.islazy:
            >>>     result = df.df.compute()

        After (Phase 2):
            >>> df = pf.read("data.csv", engine="dask")  # Force Dask
            >>> if df.engine_name == "dask":
            >>>     result = df.native.compute()

    See Also:
        - read_csv(): Read CSV files specifically
        - read_parquet(): Read Parquet files specifically
        - read_avro(): Read Avro files specifically
        - parquetframe.legacy: Phase 1 API (deprecated)
    """
    return _read_v2(file, engine=engine, **kwargs)
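The format-specific readers listed under See Also are not documented on this page. The sketch below assumes they return the same DataFrameProxy as read() and accept the same engine keyword; that keyword is an assumption, not confirmed by this reference.

import parquetframe as pf

# Assumption: format-specific readers mirror read(), including the engine keyword
sales = pf.read_csv("sales.csv")
events = pf.read_parquet("events.parquet", engine="polars")

print(sales.engine_name, events.engine_name)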

Writing Data

Functions for saving data to various formats.
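No writer functions are documented here yet. As a hedged sketch, the save() method used in the Examples section at the end of this page is assumed to write the proxy's contents, with the output format inferred from the file extension (an assumption):

import parquetframe as pf

df = pf.read("data.csv")

# Assumption: save() infers the output format from the extension,
# mirroring how read() detects input formats
df.save("output.parquet")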

Data Processing

Filtering and Selection

Methods for filtering and selecting data.
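As a hedged sketch based on the Examples section below, filtering takes a query-string expression; the exact expression syntax is assumed to follow pandas-style comparisons:

import parquetframe as pf

df = pf.read("sales.csv")

# Keep rows where the amount column exceeds 100
# (query-string filtering as in the Examples section; syntax assumed pandas-like)
high_value = df.filter("amount > 100")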

Aggregation

Methods for data aggregation and grouping.
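A minimal sketch following the groupby().sum() pattern shown in the Examples section; other aggregations are not documented here:

import parquetframe as pf

df = pf.read("sales.csv")

# Total per category, as shown in the Examples section
totals = df.groupby("category").sum()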

Transformation

Methods for data transformation and feature engineering.
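No transformation methods are documented on this page. The sketch below assumes the proxy forwards pandas-style assign() to the underlying backend; assign() is an assumption here, not a documented ParquetFrame API:

import parquetframe as pf

df = pf.read("sales.csv")

# Assumption: pandas-style methods such as assign() are forwarded to the backend
df = df.assign(amount_with_tax=lambda d: d["amount"] * 1.08)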

Summary

The core API covers data loading, processing, and saving, with the backend engine selected to suit the dataset size.

Examples

import parquetframe as pf

# Load data (returns a ParquetFrame / DataFrameProxy)
df = pf.read("data.parquet")

# Process data
filtered = df.filter("column > 100")
grouped = df.groupby("category").sum()

# Save results
df.save("output.parquet")

Further Reading