Core API Reference¶
Complete API reference for ParquetFrame core functionality.
Core Classes¶
ParquetFrame¶
The main class for working with parquet data.
parquetframe.ParquetFrame = DataFrameProxy (module attribute)¶
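Because `ParquetFrame` is an alias for `DataFrameProxy`, frames returned by `read()` (documented below) can be type-checked against either name. A minimal sketch:

```python
import parquetframe as pf

# pf.ParquetFrame is the same class object as DataFrameProxy,
# so frames returned by pf.read() are instances of it.
df = pf.read("data.parquet")
assert isinstance(df, pf.ParquetFrame)
```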
Core Functions¶
Reading Data¶
Functions for loading data from various sources.
parquetframe.read(file, engine=None, **kwargs)¶
Read a data file with automatic format detection and intelligent engine selection.
This function provides the Phase 2 multi-engine API with automatic selection among pandas, Polars, and Dask based on dataset characteristics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `file` | `str \| Path` | Path to the data file. Format auto-detected from extension. | *required* |
| `engine` | `str \| None` | Force a specific engine (`"pandas"`, `"polars"`, or `"dask"`). If `None`, automatically selects the optimal engine. | `None` |
| `**kwargs` | `Any` | Additional keyword arguments passed to format-specific readers. | `{}` |
Returns:

| Name | Type | Description |
|---|---|---|
| `DataFrameProxy` | `DataFrameProxy` | Unified DataFrame interface with intelligent backend. |
Supported Formats
- CSV (.csv, .tsv)
- JSON (.json, .jsonl, .ndjson)
- Parquet (.parquet, .pqt)
- ORC (.orc)
- Avro (.avro)
Engine Selection (when `engine=None`):
- pandas: < 100MB (eager, rich ecosystem)
- Polars: 100MB - 10GB (lazy, high performance)
- Dask: > 10GB (distributed, scalable)
Examples:
>>> import parquetframe as pf
>>> # Automatic engine selection
>>> df = pf.read("sales.csv")
>>> print(f"Using {df.engine_name} engine")
>>>
>>> # Force specific engine
>>> df = pf.read("data.parquet", engine="polars")
>>>
>>> # Configure thresholds globally
>>> pf.set_config(pandas_threshold_mb=50.0)
>>> df = pf.read("medium.csv") # Uses configured threshold
Migration from Phase 1
Phase 1 code using the `islazy` parameter should migrate to the `engine` parameter:
Before (Phase 1):
>>> df = pf.read("data.csv", islazy=True)  # Force Dask
>>> if df.islazy:
...     result = df.df.compute()

After (Phase 2):
>>> df = pf.read("data.csv", engine="dask")  # Force Dask
>>> if df.engine_name == "dask":
...     result = df.native.compute()
See Also
- read_csv(): Read CSV files specifically
- read_parquet(): Read Parquet files specifically
- read_avro(): Read Avro files specifically
- parquetframe.legacy: Phase 1 API (deprecated)
Source code in src/parquetframe/__init__.py
Writing Data¶
Functions for saving data to various formats.
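No writer functions are documented here yet. As a hedged sketch, writing goes through the frame's `save()` method shown in the Examples section at the end of this page:

```python
import parquetframe as pf

# Read from one format and write to another; save() is the call used in the
# Examples section below (output-format inference from the extension is an assumption).
df = pf.read("sales.csv")
df.save("sales.parquet")
```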
Data Processing¶
Filtering and Selection¶
Methods for filtering and selecting data.
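A minimal sketch of row filtering, mirroring the query-string style `filter()` call in the Examples section below (the column name `amount` is a placeholder):

```python
import parquetframe as pf

df = pf.read("sales.csv")

# Keep only rows where the placeholder column "amount" exceeds 100,
# as in the filter() call shown in the Examples section below.
high_value = df.filter("amount > 100")
```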
Aggregation¶
Methods for data aggregation and grouping.
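A minimal sketch of grouping and aggregating, mirroring the `groupby()` call in the Examples section below (`category` is a placeholder column name):

```python
import parquetframe as pf

df = pf.read("sales.csv")

# Group by the placeholder column "category" and sum the remaining numeric
# columns, as shown in the Examples section below.
totals = df.groupby("category").sum()
```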
Transformation¶
Methods for data transformation and feature engineering.
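No transformation helpers are documented on this page. One hedged approach is to reach the backend object through the `.native` accessor shown in the migration example above and transform it with the backend's own API (column names here are placeholders):

```python
import parquetframe as pf

# Force the pandas engine so .native returns a pandas DataFrame (an assumption
# based on the migration example above); "amount" and "quantity" are placeholders.
df = pf.read("sales.csv", engine="pandas")
pdf = df.native
pdf["unit_price"] = pdf["amount"] / pdf["quantity"]
```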
Summary¶
The core API covers data loading, processing, and saving, selecting an execution engine suited to the dataset's size.
Examples¶
```python
import parquetframe as pf

# Create ParquetFrame instance
df = pf.ParquetFrame()

# Load data
df = pf.read("data.parquet")

# Process data
filtered = df.filter("column > 100")
grouped = df.groupby("category").sum()

# Save results
df.save("output.parquet")
```