ParquetFrame¶
High-performance DataFrame library with Rust acceleration, intelligent multi-engine support, and AI-powered data exploration.
🚀 v2.0.0 Now Available: Rust backend delivers 10-50x speedup for workflows, graphs, and I/O operations
🏆 Production-Ready: 400+ passing tests, comprehensive CI/CD, and battle-tested in production
🦀 Rust-Accelerated: Optional high-performance backend with automatic fallback to Python
New to ParquetFrame?
Start with the Quick Start Guide or explore Rust Acceleration for maximum performance.
✨ What's New in v2.0.0¶
🦀 Rust Acceleration¶
Workflow Engine (10-15x faster)
- Parallel DAG execution with resource-aware scheduling
- Automatic dependency resolution
- Progress tracking and cancellation support
- Learn more →
Graph Algorithms (15-25x faster)
- CSR/CSC construction for efficient graph storage
- Parallel BFS, DFS traversal
- PageRank, Dijkstra shortest paths, connected components
- Learn more →
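For intuition about what gets parallelized: a pure-Python breadth-first search like the sketch below is the kind of hot loop the Rust backend replaces with parallel traversal over CSR storage. The toy graph is for illustration only.

```python
from collections import deque

# Toy adjacency list; a pure-Python loop like this is what the
# Rust backend replaces with parallel CSR traversal.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}

def bfs(start: int) -> list[int]:
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return order

print(bfs(0))  # [0, 1, 2, 3]
```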
I/O Operations (5-10x faster)
- Lightning-fast Parquet metadata reading
- Instant column statistics extraction
- Zero-copy data transfer via Apache Arrow
- Learn more →
Performance Benchmarks¶
| Operation | Python | Rust | Speedup |
|---|---|---|---|
| Workflow (10 steps, parallel) | 850ms | 65ms | 13.1x |
| PageRank (100K nodes) | 2.3s | 95ms | 24.2x |
| BFS (1M nodes) | 1.8s | 105ms | 17.1x |
| Parquet metadata (1GB file) | 180ms | 22ms | 8.2x |
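The figures above come from the project's benchmarks; the harness and hardware are not specified here. As a rough sanity check of what the "Python" column measures for the metadata row, you can time footer-only metadata reading with pyarrow directly (assuming pyarrow is the pure-Python baseline):

```python
# Time the pure-Python metadata path with pyarrow; the file path
# is a placeholder, and pyarrow-as-baseline is an assumption.
import time

import pyarrow.parquet as pq

start = time.perf_counter()
meta = pq.ParquetFile("data.parquet").metadata  # reads only the footer
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{meta.num_rows} rows, {meta.num_row_groups} row groups, {elapsed_ms:.1f} ms")
```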
Key Features¶
- Multi-Engine Backend: Seamlessly switch between Pandas, Polars, and Dask.
- Rust Acceleration: Critical paths optimized with Rust (via PyO3) for maximum performance.
- Cloud Integration: Unified API for AWS S3, Google Cloud Storage, and Azure Blob Storage.
- Entity Framework: ORM-like data modeling with GraphAr persistence.
- Distributed Computing: Scale out with Ray or Dask clusters.
- Monitoring: Built-in metrics collection and visualization.
- 🔐 Zanzibar Permissions: Production-grade ReBAC authorization.
- 📊 Graph Processing: Apache GraphAr with Rust-accelerated algorithms.
- 📋 YAML Workflows: Declarative pipeline orchestration.
- 🤖 AI Integration: Local LLM support for natural language queries.
- ⚡ Automatic Fallback: Works without Rust, just slower.
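That fallback typically follows the standard try/except-ImportError pattern sketched below. The module path `parquetframe._rust` and the `read_metadata` helper are hypothetical names for illustration; the pure-Python branch uses pyarrow, which the project already requires.

```python
# Sketch of the import-time fallback behind "works without Rust".
# `parquetframe._rust` and `read_metadata` are HYPOTHETICAL names.
try:
    from parquetframe._rust import read_metadata  # Rust-accelerated path
    RUST_AVAILABLE = True
except ImportError:
    import pyarrow.parquet as pq

    RUST_AVAILABLE = False

    def read_metadata(path):
        """Slower pure-Python fallback with the same return shape."""
        return pq.ParquetFile(path).metadata

print(f"Rust backend active: {RUST_AVAILABLE}")
```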
🚀 Quick Start¶
```python
import parquetframe.core as pf2

# Automatic engine selection (pandas/Polars/Dask)
df = pf2.read("data.parquet")  # Auto-selects best engine
print(f"Using {df.engine_name} engine")

# All operations work transparently
result = df.groupby("category")["value"].sum()

# Force a specific engine
df = pf2.read("data.csv", engine="polars")  # Use Polars
```
```python
from dataclasses import dataclass
from parquetframe.entity import entity, rel

@entity(storage_path="./data/users", primary_key="user_id")
@dataclass
class User:
    user_id: str
    username: str
    email: str

# Automatic CRUD operations
user = User("user_001", "alice", "alice@example.com")
user.save()  # Persist to Parquet

# Query
user = User.find("user_001")
all_users = User.find_all()
```
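Updating uses the same calls shown above. A minimal follow-up sketch, assuming `save()` upserts the record keyed by `primary_key` (a reasonable reading of "automatic CRUD", not confirmed on this page):

```python
# Load, modify, re-save. Assumes save() overwrites the existing
# record keyed by primary_key -- an assumption, not confirmed here.
user = User.find("user_001")
user.email = "alice@newdomain.example"
user.save()
```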
🎉 What's New in Phase 2?¶
Phase 2 represents a major architectural evolution, transforming ParquetFrame from a pandas/Dask wrapper into a comprehensive data framework.
New Capabilities¶
| Feature | Phase 1 | Phase 2 |
|---|---|---|
| Engines | pandas, Dask | pandas, Polars, Dask |
| Entity Framework | ❌ No | ✅ @entity and @rel decorators |
| Permissions | ❌ No | ✅ Zanzibar ReBAC (4 APIs) |
| Avro Support | ❌ No | ✅ Native fastavro integration |
| Configuration | Basic | ✅ Global config + env vars |
| Performance | Good | ✅ 2-5x faster with Polars |
| Backward Compatible | — | ✅ 100% compatible |
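The Avro row builds on fastavro, so it helps to see what that library does on its own. Below is a minimal round-trip using fastavro directly; ParquetFrame's own Avro entry points are not shown on this page, so none are assumed.

```python
# Write and read an Avro file with fastavro itself -- the library
# behind the "Avro Support" row. No ParquetFrame API is assumed.
import fastavro

schema = {
    "name": "User",
    "type": "record",
    "fields": [{"name": "user_id", "type": "string"}],
}
with open("users.avro", "wb") as fo:
    fastavro.writer(fo, schema, [{"user_id": "user_001"}])
with open("users.avro", "rb") as fo:
    print(list(fastavro.reader(fo)))  # [{'user_id': 'user_001'}]
```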
Featured Example: Todo/Kanban Application¶
See the Complete Walkthrough of a production-ready Kanban board system demonstrating:
- ✅ Multi-user collaboration with role-based access
- ✅ Entity Framework with `@entity` and `@rel` decorators
- ✅ Zanzibar permissions with inheritance (Board → List → Task)
- ✅ YAML workflows for ETL pipelines
- ✅ Complete source code with 38+ tests
```python
# Entity Framework example from Todo/Kanban
from dataclasses import dataclass
from parquetframe.entity import entity, rel

@entity(storage_path="./data/users", primary_key="user_id")
@dataclass
class User:
    user_id: str
    username: str
    email: str

    @rel("Board", foreign_key="owner_id", reverse=True)
    def boards(self):
        """Get all boards owned by this user."""
        pass

# Automatic CRUD operations
user = User("user_001", "alice", "alice@example.com")
user.save()  # Persist to Parquet
boards = user.boards()  # Navigate relationships
```
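The Board → List → Task inheritance follows the classic Zanzibar idea of relationship tuples with parent links. The toy model below illustrates that idea in plain Python; it is not ParquetFrame's permissions API (see the Permissions System docs for the real one).

```python
# Toy Zanzibar-style tuples: permissions inherit through "parent" links.
TUPLES = {
    ("board:b1", "owner", "user:alice"),
    ("list:l1", "parent", "board:b1"),
    ("task:t1", "parent", "list:l1"),
}

def check(obj: str, relation: str, user: str) -> bool:
    """True if user has relation on obj, directly or via an ancestor."""
    if (obj, relation, user) in TUPLES:
        return True
    for o, rel_name, target in TUPLES:
        if o == obj and rel_name == "parent":
            return check(target, relation, user)
    return False

assert check("task:t1", "owner", "user:alice")  # inherited from the board
```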
Migration Path¶
- Phase 1 users: See the Migration Guide for step-by-step instructions
- New users: Start directly with Phase 2
- 100% backward compatible: Phase 1 code continues to work
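In practice, migration mostly means swapping the boolean `islazy` flag for an explicit engine name. Below is a sketch pairing the documented Phase 1 call with its assumed Phase 2 equivalent; `engine="dask"` is inferred by analogy with the documented `engine="polars"`.

```python
import parquetframe as pqf       # Phase 1 API (still supported)
import parquetframe.core as pf2  # Phase 2 API

df_old = pqf.read("data.parquet", islazy=True)    # Phase 1: force Dask
df_new = pf2.read("data.parquet", engine="dask")  # Phase 2 equivalent (assumed)
```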
🎯 Why ParquetFrame?¶
The Problem¶
Working with dataframes in Python often means:
- Choosing a single engine: pandas (fast but memory-limited), Dask (scalable but slower), or Polars (fast but new)
- Manual backend management: Writing conditional code for different data sizes
- No data modeling: Treating everything as raw DataFrames without structure
- Complex permissions: Building authorization systems from scratch
The Solution¶
ParquetFrame provides a unified framework that:
- Automatically selects the best engine (pandas/Polars/Dask) based on data characteristics
- Provides an entity framework for declarative data modeling with `@entity` and `@rel` decorators
- Includes Zanzibar permissions for production-grade authorization
- Maintains 100% compatibility with Phase 1 while adding powerful new features
📊 Performance Benefits¶
- Intelligent optimization: Memory-aware backend selection considering file size, system resources, and file characteristics (see the sketch after this list)
- Built-in benchmarking: Comprehensive performance analysis tools to optimize your workflows
- Memory efficiency: Never load more data than your system can handle
- Speed optimization: Fast pandas operations for small datasets, scalable Dask for large ones
- CLI performance tools: Built-in benchmarking and analysis from the command line
- Zero overhead: Direct delegation to underlying libraries without performance penalty
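For intuition, the heart of that selection can be pictured as a file-size check like the sketch below. This is illustrative only, using the 10 MB threshold from the Phase 1 examples that follow; the library's real logic also weighs system memory and file characteristics.

```python
import os

SIZE_THRESHOLD_MB = 10  # threshold used in the Phase 1 examples below

def pick_backend(path: str) -> str:
    """Illustrative size-based heuristic, not ParquetFrame's actual code."""
    size_mb = os.path.getsize(path) / 1024**2
    return "pandas" if size_mb < SIZE_THRESHOLD_MB else "dask"
```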
🛠️ Key Concepts (Phase 1 - Legacy)¶
Phase 1 API Examples
The examples below use the Phase 1 API which is still supported. For Phase 2 features (multi-engine with Polars, Entity Framework, Zanzibar permissions), see the Phase 2 Guide.
Automatic Backend Selection¶
```python
import parquetframe as pqf

# Small file (< 10MB) → pandas (fast operations)
small_df = pqf.read("small_dataset.parquet")
print(type(small_df._df))  # <class 'pandas.core.frame.DataFrame'>

# Large file (> 10MB) → Dask (memory efficient)
large_df = pqf.read("large_dataset.parquet")
print(type(large_df._df))  # <class 'dask.dataframe.core.DataFrame'>
```
Manual Control¶
```python
# Override automatic detection
pandas_df = pqf.read("any_file.parquet", islazy=False)  # Force pandas
dask_df = pqf.read("any_file.parquet", islazy=True)     # Force Dask

# Convert between backends
pandas_df.to_dask()   # Convert to Dask
dask_df.to_pandas()   # Convert to pandas

# Property-based control
df.islazy = True   # Convert to Dask
df.islazy = False  # Convert to pandas
```
File Extension Handling¶
```python
# All of these work the same way:
df1 = pqf.read("data.parquet")  # Explicit extension
df2 = pqf.read("data.pqt")      # Alternative extension
df3 = pqf.read("data")          # Auto-detect extension

# Save with automatic extension
df.save("output")      # Saves as "output.parquet"
df.save("output.pqt")  # Saves as "output.pqt"
```
📋 Requirements¶
Phase 2 (Recommended)¶
- Python 3.11+
- pandas >= 2.0.0
- dask[dataframe] >= 2023.1.0 (optional)
- polars >= 0.19.0 (optional)
- fastavro >= 1.8.0 (optional, for Avro support)
- pyarrow >= 10.0.0
Phase 1 (Legacy)¶
- Python 3.9+
- pandas >= 2.0.0
- dask[dataframe] >= 2023.1.0
- pyarrow >= 10.0.0
📚 Documentation¶
Phase 2 (Start Here!)¶
- Phase 2 Overview - Complete Phase 2 feature guide
- Todo/Kanban Walkthrough - Full application example
- Migration Guide - Migrate from Phase 1
- Quick Start - Get up and running in minutes
- Installation Guide - Detailed installation instructions
Features & Guides¶
- CLI Guide - Complete command-line interface documentation
- Performance Tips - Optimize your workflows
- Workflow System - YAML workflow orchestration
- Graph Processing - Apache GraphAr support
- Permissions System - Zanzibar ReBAC
- API Reference - Complete API documentation
Legacy Documentation¶
- Phase 1 Usage Guide - Phase 1 API reference
- Phase 1 Backend Selection - pandas/Dask switching
🤝 Contributing¶
We welcome contributions! Please see our Contributing Guide for details.
📄 License¶
This project is licensed under the MIT License - see the LICENSE file for details.
Ready to simplify your dataframe workflows? Check out the Quick Start Guide to get started!