
Core API Reference

Complete API reference for ParquetFrame core functionality.

Core Classes

ParquetFrame

The main class for working with parquet data.

parquetframe.ParquetFrame = DataFrameProxy (module attribute)

ParquetFrame is a module-level alias for DataFrameProxy, the unified DataFrame interface returned by read().
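A minimal sketch of how the alias is used in practice: because ParquetFrame simply points at DataFrameProxy, the object returned by read() can be type-checked against either name.

import parquetframe as pf

df = pf.read("data.parquet")

# The proxy returned by read() is a ParquetFrame (i.e. a DataFrameProxy)
assert isinstance(df, pf.ParquetFrame)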

Core Functions

Reading Data

Functions for loading data from various sources.

parquetframe.read(file, engine=None, **kwargs)

Read a data file with automatic format detection and intelligent engine selection.

This function provides the Phase 2 multi-engine API with automatic selection between pandas, Polars, and Dask based on dataset characteristics.

Parameters:

  • file (str | Path, required): Path to the data file. Format auto-detected from extension.
  • engine (str | None, default None): Force a specific engine ("pandas", "polars", or "dask"). If None, automatically selects the optimal engine.
  • **kwargs (Any, default {}): Additional keyword arguments passed to format-specific readers.

Returns:

  • DataFrameProxy: Unified DataFrame interface with intelligent backend.

Supported Formats
  • CSV (.csv, .tsv)
  • JSON (.json, .jsonl, .ndjson)
  • Parquet (.parquet, .pqt)
  • ORC (.orc)
  • Avro (.avro)

Engine Selection (when engine=None):
  • pandas: < 100MB (eager, rich ecosystem)
  • Polars: 100MB - 10GB (lazy, high performance)
  • Dask: > 10GB (distributed, scalable)

Examples:

>>> import parquetframe as pf
>>> # Automatic engine selection
>>> df = pf.read("sales.csv")
>>> print(f"Using {df.engine_name} engine")
>>>
>>> # Force specific engine
>>> df = pf.read("data.parquet", engine="polars")
>>>
>>> # Configure thresholds globally
>>> pf.set_config(pandas_threshold_mb=50.0)
>>> df = pf.read("medium.csv")  # Uses configured threshold
Migration from Phase 1

Phase 1 code using the islazy parameter should migrate to the engine parameter:

Before (Phase 1):

>>> df = pf.read("data.csv", islazy=True)  # Force Dask
>>> if df.islazy:
>>>     result = df.df.compute()

After (Phase 2):

>>> df = pf.read("data.csv", engine="dask")  # Force Dask
>>> if df.engine_name == "dask":
>>>     result = df.native.compute()

See Also
  • read_csv(): Read CSV files specifically
  • read_parquet(): Read Parquet files specifically
  • read_avro(): Read Avro files specifically
  • parquetframe.legacy: Phase 1 API (deprecated)
Source code in src/parquetframe/__init__.py
def read(
    file: str | Path,
    engine: str | None = None,
    **kwargs: Any,
) -> DataFrameProxy:
    """
    Read a data file with automatic format detection and intelligent engine selection.

    This function provides the Phase 2 multi-engine API with automatic selection
    between pandas, Polars, and Dask based on dataset characteristics.

    Args:
        file: Path to the data file. Format auto-detected from extension.
        engine: Force specific engine ("pandas", "polars", or "dask").
                If None, automatically selects optimal engine.
        **kwargs: Additional keyword arguments passed to format-specific readers.

    Returns:
        DataFrameProxy: Unified DataFrame interface with intelligent backend.

    Supported Formats:
        - CSV (.csv, .tsv)
        - JSON (.json, .jsonl, .ndjson)
        - Parquet (.parquet, .pqt)
        - ORC (.orc)
        - Avro (.avro)

    Engine Selection (when engine=None):
        - pandas: < 100MB (eager, rich ecosystem)
        - Polars: 100MB - 10GB (lazy, high performance)
        - Dask: > 10GB (distributed, scalable)

    Examples:
        >>> import parquetframe as pf
        >>> # Automatic engine selection
        >>> df = pf.read("sales.csv")
        >>> print(f"Using {df.engine_name} engine")
        >>>
        >>> # Force specific engine
        >>> df = pf.read("data.parquet", engine="polars")
        >>>
        >>> # Configure thresholds globally
        >>> pf.set_config(pandas_threshold_mb=50.0)
        >>> df = pf.read("medium.csv")  # Uses configured threshold

    Migration from Phase 1:
        Phase 1 code using `islazy` parameter should migrate to `engine` parameter:

        Before (Phase 1):
            >>> df = pf.read("data.csv", islazy=True)  # Force Dask
            >>> if df.islazy:
            >>>     result = df.df.compute()

        After (Phase 2):
            >>> df = pf.read("data.csv", engine="dask")  # Force Dask
            >>> if df.engine_name == "dask":
            >>>     result = df.native.compute()

    See Also:
        - read_csv(): Read CSV files specifically
        - read_parquet(): Read Parquet files specifically
        - read_avro(): Read Avro files specifically
        - parquetframe.legacy: Phase 1 API (deprecated)
    """
    return _read_v2(file, engine=engine, **kwargs)
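The format-specific readers listed under See Also are not documented on this page. The sketch below assumes they return the same DataFrameProxy as read() and accept the same engine keyword; that keyword is an assumption, not confirmed by this reference.

import parquetframe as pf

# Assumption: format-specific readers mirror read(), including the engine keyword
sales = pf.read_csv("sales.csv")
events = pf.read_parquet("events.parquet", engine="polars")

print(sales.engine_name, events.engine_name)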

Writing Data

Functions for saving data to various formats.
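No writer functions are documented here yet. As a hedged sketch, the save() method used in the Examples section at the end of this page is assumed to write the proxy's contents, with the output format inferred from the file extension (an assumption):

import parquetframe as pf

df = pf.read("data.csv")

# Assumption: save() infers the output format from the extension,
# mirroring how read() detects input formats
df.save("output.parquet")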

Data Processing

Filtering and Selection

Methods for filtering and selecting data.
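As a hedged sketch based on the Examples section below, filtering takes a query-string expression; the exact expression syntax is assumed to follow pandas-style comparisons:

import parquetframe as pf

df = pf.read("sales.csv")

# Keep rows where the amount column exceeds 100
# (query-string filtering as in the Examples section; syntax assumed pandas-like)
high_value = df.filter("amount > 100")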

Aggregation

Methods for data aggregation and grouping.
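A minimal sketch following the groupby().sum() pattern shown in the Examples section; other aggregations are not documented here:

import parquetframe as pf

df = pf.read("sales.csv")

# Total per category, as shown in the Examples section
totals = df.groupby("category").sum()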

Transformation

Methods for data transformation and feature engineering.
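No transformation methods are documented on this page. The sketch below assumes the proxy forwards pandas-style assign() to the underlying backend; assign() is an assumption here, not a documented ParquetFrame API:

import parquetframe as pf

df = pf.read("sales.csv")

# Assumption: pandas-style methods such as assign() are forwarded to the backend
df = df.assign(amount_with_tax=lambda d: d["amount"] * 1.08)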

Summary

The core API covers data loading, processing, and saving, with the backend engine selected to suit the dataset size.

Examples

import parquetframe as pf

# Load data (returns a ParquetFrame / DataFrameProxy)
df = pf.read("data.parquet")

# Process data
filtered = df.filter("column > 100")
grouped = df.groupby("category").sum()

# Save results
df.save("output.parquet")

Further Reading