ParquetFrame


A universal wrapper for working with dataframes in Python that switches seamlessly between pandas and Dask based on file size, with manual control when you need it.

✨ Features

  • 🚀 Intelligent Backend Selection: Memory-aware automatic switching between pandas and Dask based on file size, system resources, and file characteristics
  • 📁 Smart File Handling: Reads parquet files without requiring file extensions (.parquet, .pqt)
  • 🔄 Seamless Switching: Convert between pandas and Dask with simple methods
  • Full API Compatibility: All pandas/Dask operations work transparently
  • 🗃️ SQL Support: Execute SQL queries on DataFrames using DuckDB, including automatic JOINs (see the sketch after this list)
  • 🧬 BioFrame Integration: Genomic interval operations with parallel Dask implementations
  • 🖥️ Powerful CLI: Command-line interface for data exploration, SQL queries, and batch processing
  • 📝 Script Generation: Automatic Python script generation from CLI sessions
  • Performance Optimization: Built-in benchmarking tools and intelligent threshold detection
  • 📋 YAML Workflows: Define complex data processing pipelines in declarative YAML
  • 🎯 Zero Configuration: Works out of the box with sensible defaults
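
The SQL feature is built on DuckDB, which can query pandas DataFrames in place and join them by name. The snippet below is not ParquetFrame's own API; it is a minimal sketch of the underlying DuckDB mechanism, using only DuckDB's public interface:

```python
import duckdb
import pandas as pd

# Two plain pandas DataFrames; DuckDB resolves them by variable name.
users = pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "total": [9.5, 3.0, 12.0]})

# A JOIN expressed in SQL, executed by DuckDB, returned as a pandas DataFrame.
result = duckdb.sql(
    "SELECT u.name, SUM(o.total) AS spent "
    "FROM users u JOIN orders o ON u.id = o.user_id "
    "GROUP BY u.name"
).df()
print(result)
```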

🚀 Quick Start

Install from PyPI:

```bash
# Basic installation
pip install parquetframe

# With CLI support (recommended)
pip install parquetframe[cli]
```

Then read, transform, and save without thinking about backends:

```python
import parquetframe as pqf

# Read a file - automatically chooses pandas or Dask based on size
df = pqf.read("my_data")  # Handles .parquet/.pqt extensions automatically

# All standard DataFrame operations work
result = df.groupby("column").sum()

# Save without worrying about extensions
df.save("output")  # Saves as output.parquet
```

Take manual control when you need it:

```python
import parquetframe as pqf

# Custom threshold
df = pqf.read("data", threshold_mb=50)  # Use Dask for files >50MB

# Force backend
df = pqf.read("data", islazy=True)   # Force Dask
df = pqf.read("data", islazy=False)  # Force pandas

# Check current backend
print(df.islazy)  # True for Dask, False for pandas

# Chain operations
result = (pqf.read("input")
          .groupby("category")
          .sum()
          .save("result"))
```

Or work from the command line:

```bash
# Quick file info
pframe info data.parquet

# Data processing
pframe run data.parquet --query "age > 30" --head 10

# Interactive mode
pframe interactive data.parquet

# Performance benchmarking
pframe benchmark --operations "groupby,filter"
```

🎯 Why ParquetFrame?

The Problem

Working with dataframes in Python often means choosing between:

  • pandas: Fast for small datasets, but runs out of memory on large files
  • Dask: Memory-efficient for large datasets, but slower for small operations
  • Manual switching: Writing boilerplate code to handle different backends

The Solution

ParquetFrame automatically chooses the right backend based on your data size, while providing a consistent API that works with both pandas and Dask. No more manual backend management or code duplication.
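
For concreteness, here is the kind of hand-rolled switching logic ParquetFrame is designed to replace. This sketch uses only the public pandas and Dask APIs; the 100 MB cutoff is an arbitrary illustration:

```python
import os

import dask.dataframe as dd
import pandas as pd

THRESHOLD_BYTES = 100 * 1024 * 1024  # arbitrary 100 MB cutoff

def read_parquet_somehow(path: str):
    """Pick a backend by file size: the boilerplate ParquetFrame removes."""
    if os.path.getsize(path) < THRESHOLD_BYTES:
        return pd.read_parquet(path)  # small file: fast in-memory pandas
    return dd.read_parquet(path)      # large file: lazy, partitioned Dask

# ...and every downstream function must then handle both return types.
```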

📊 Performance Benefits

  • Intelligent optimization: Memory-aware backend selection considering file size, system resources, and file characteristics
  • Built-in benchmarking: Comprehensive performance analysis tools to optimize your workflows
  • Memory efficiency: Never load more data than your system can handle
  • Speed optimization: Fast pandas operations for small datasets, scalable Dask for large ones
  • CLI performance tools: Built-in benchmarking and analysis from the command line
  • Zero overhead: Direct delegation to underlying libraries without performance penalty
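
The "zero overhead" point reflects the standard delegation pattern for wrappers of this kind. The class below is not ParquetFrame's actual source, just a minimal sketch of how transparent attribute forwarding to an underlying pandas (or Dask) object typically works:

```python
import pandas as pd

class FrameWrapper:
    """Illustrative only: forward unknown attributes to the wrapped frame."""

    def __init__(self, df):
        self._df = df  # the real pandas (or Dask) DataFrame

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, so wrapper methods win;
        # everything else is delegated directly to the underlying library.
        return getattr(self._df, name)

w = FrameWrapper(pd.DataFrame({"a": [1, 2, 3]}))
print(w.sum())  # resolved on the wrapped pandas DataFrame
```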

🛠️ Key Concepts

Automatic Backend Selection

```python
import parquetframe as pqf

# Small file (< 10MB) → pandas (fast operations)
small_df = pqf.read("small_dataset.parquet")
print(type(small_df._df))  # <class 'pandas.core.frame.DataFrame'>

# Large file (> 10MB) → Dask (memory efficient)
large_df = pqf.read("large_dataset.parquet")
print(type(large_df._df))  # <class 'dask.dataframe.core.DataFrame'>
```

Manual Control

```python
# Override automatic detection
pandas_df = pqf.read("any_file.parquet", islazy=False)  # Force pandas
dask_df = pqf.read("any_file.parquet", islazy=True)     # Force Dask

# Convert between backends
pandas_df.to_dask()    # Convert to Dask
dask_df.to_pandas()    # Convert to pandas

# Property-based control
df.islazy = True   # Convert to Dask
df.islazy = False  # Convert to pandas
```
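
Under the hood, converting between pandas and Dask maps onto two well-known Dask calls. Assuming (not confirmed from the source) that to_dask() and to_pandas() wrap them, the equivalent raw Dask code is:

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(10)})

# pandas -> Dask: split the in-memory frame into lazy partitions
ddf = dd.from_pandas(pdf, npartitions=2)

# Dask -> pandas: materialize all partitions into one in-memory frame
pdf_again = ddf.compute()
```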

File Extension Handling

```python
# All of these work the same way:
df1 = pqf.read("data.parquet")  # Explicit extension
df2 = pqf.read("data.pqt")      # Alternative extension
df3 = pqf.read("data")          # Auto-detect extension

# Save with automatic extension
df.save("output")         # Saves as "output.parquet"
df.save("output.pqt")     # Saves as "output.pqt"
```

📋 Requirements

  • Python 3.9+
  • pandas >= 2.0.0
  • dask[dataframe] >= 2023.1.0
  • pyarrow >= 10.0.0

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Ready to simplify your dataframe workflows? Check out the Quick Start Guide to get started!