DataFrame Engine Selection Matrix¶
ParquetFrame automatically selects the optimal DataFrame backend (Pandas, Polars, or Dask) based on your data size, operation type, and system resources. This document explains the selection logic and how to customize it.
Quick Reference¶
| Data Size | Default Engine | Rationale |
|---|---|---|
| < 100 MB | Pandas | Fast startup, minimal overhead, excellent for small datasets |
| 100 MB - 10 GB | Polars | Memory-efficient, highly optimized, good for medium datasets |
| > 10 GB | Dask | Distributed processing, handles datasets larger than memory |
> [!NOTE]
> These are default thresholds. ParquetFrame uses a sophisticated scoring system that considers multiple factors, not just size.
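In practice the selection is transparent: you call `read()` (the same entry point used in the configuration examples later in this document) and work with the result as usual. A minimal, illustrative example, assuming a local `data.parquet` file:

```python
from parquetframe import read

# The backend (Pandas, Polars, or Dask) is picked automatically
# based on data size, operation type, and available memory.
df = read("data.parquet")
```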
How Engine Selection Works¶
ParquetFrame uses a weighted scoring system with 5 factors:
Scoring Breakdown¶
| Factor | Weight | Description |
|---|---|---|
| Size Fit | 40% | How well data size matches engine's optimal range |
| Memory Efficiency | 25% | Estimated memory usage vs. available system memory |
| Performance Score | 20% | Engine's general performance characteristics |
| Lazy Evaluation | 10% | Match between user preference and engine capability |
| Operation Type | 5% | Reserved for operation-specific optimizations |
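Conceptually, each candidate engine receives a per-factor score between 0 and 1, and the weighted total decides the winner. The sketch below is a hypothetical illustration of that combination step; the `WEIGHTS` constant and `total_score` helper are stand-ins rather than the library's actual API (the real logic lives in `src/parquetframe/core/heuristics.py`):

```python
# Illustrative sketch of the weighted combination step (not the real implementation).
WEIGHTS = {
    "size_fit": 0.40,
    "memory_efficiency": 0.25,
    "performance": 0.20,
    "lazy_evaluation": 0.10,
    "operation_type": 0.05,
}

def total_score(factor_scores: dict[str, float]) -> float:
    """Combine per-factor scores (each between 0 and 1) into a weighted total."""
    return sum(weight * factor_scores.get(name, 0.0) for name, weight in WEIGHTS.items())

# Hypothetical Polars evaluation for a mid-sized file:
polars_factors = {
    "size_fit": 1.0,
    "memory_efficiency": 0.9,
    "performance": 0.85,
    "lazy_evaluation": 1.0,
    "operation_type": 0.5,
}
print(total_score(polars_factors))  # ~0.92
```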
Size-Based Scoring¶
Each engine has an optimal size range:
```python
# Pandas
optimal_size_range = (0, 100 * 1024 * 1024)  # 0-100 MB

# Polars
optimal_size_range = (10 * 1024 * 1024, 10 * 1024 * 1024 * 1024)  # 10 MB - 10 GB

# Dask
optimal_size_range = (100 * 1024 * 1024, float('inf'))  # 100 MB+
```
Scoring logic (sketched in code below):
- Perfect fit (data within the optimal range): `score = 1.0`
- Data below the engine's minimum (engine oversized for the data): `score = max(0.3, data_size / min_size)`
- Data above the engine's maximum (engine undersized for the data): `score = max(0.1, max_size / data_size)`
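A hypothetical helper expressing the same rules (illustrative only, not the library's exact implementation):

```python
def size_fit_score(data_size: float, min_size: float, max_size: float) -> float:
    """Score how well data_size fits an engine's optimal range (sketch)."""
    if min_size <= data_size <= max_size:
        return 1.0                              # perfect fit
    if data_size < min_size:
        return max(0.3, data_size / min_size)   # data too small for this engine
    return max(0.1, max_size / data_size)       # data too large for this engine

# A 50 MB file scored against Dask's 100 MB+ range:
print(size_fit_score(50 * 1024**2, 100 * 1024**2, float("inf")))  # 0.5
```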
Memory Efficiency Scoring¶
Considers available system memory and each engine's memory footprint:
```text
memory_efficiency:
- Pandas: 1.0 (baseline)
- Polars: 1.5 (50% more efficient)
- Dask: 2.0 (2x more efficient with chunking)
```
Formula:
```python
estimated_usage = data_size / memory_efficiency
memory_ratio = estimated_usage / (available_memory * 0.8)  # 80% safety margin

if memory_ratio <= 0.5:
    score = 1.0  # Excellent fit
elif memory_ratio <= 1.0:
    score = 1.0 - (memory_ratio - 0.5) * 2  # Linear decrease
else:
    score = max(0.1, 1.0 / memory_ratio)  # Heavy penalty for overflow
```
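For example, evaluating Polars (memory efficiency 1.5) for a 2 GB dataset on a machine with 8 GB of available memory (the numbers are illustrative):

```python
data_size = 2 * 1024**3            # 2 GB dataset
available_memory = 8 * 1024**3     # 8 GB free

estimated_usage = data_size / 1.5                           # ~1.33 GB
memory_ratio = estimated_usage / (available_memory * 0.8)   # ~0.21
# memory_ratio <= 0.5, so the memory-efficiency score is 1.0 (excellent fit)
```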
Configuration¶
Environment Variables¶
Override default thresholds:
```bash
export PARQUETFRAME_PANDAS_THRESHOLD_MB=200
export PARQUETFRAME_POLARS_THRESHOLD_MB=20000
export PARQUETFRAME_ENGINE=polars  # Force specific engine
```
Python API¶
```python
from parquetframe import set_config

# Adjust thresholds
set_config(
    pandas_threshold_mb=200,
    polars_threshold_mb=20_000,
)

# Force an engine globally
set_config(default_engine="polars")
```
Per-Operation Override¶
```python
from parquetframe import read
from parquetframe.config import config_context

# Temporarily use Dask for a large file
with config_context(default_engine="dask"):
    df = read("huge_file.parquet")
```
Engine Characteristics¶
Pandas¶
Best for: Small to medium datasets (< 100 MB), exploratory analysis, prototyping
Pros:
- Fast startup time
- Mature ecosystem
- Rich functionality
- Universal compatibility

Cons:
- Single-threaded by default
- Entire dataset in memory
- Performance degrades with large data
Optimal Size Range: 0 - 100 MB
Polars¶
Best for: Medium to large datasets (10 MB - 10 GB), production workloads
Pros:
- Highly optimized (Rust core)
- Memory-efficient
- Parallel by default
- Lazy evaluation support

Cons:
- Newer ecosystem
- Fewer integrations than Pandas
- Learning curve for the lazy API

Optimal Size Range: 10 MB - 10 GB
Memory Efficiency: 1.5x better than Pandas
Dask¶
Best for: Very large datasets (> 10 GB), distributed computing, datasets larger than memory
Pros:
- Handles out-of-core data
- Distributed processing
- Familiar Pandas-like API
- Scales to clusters

Cons:
- Higher overhead for small data
- More complex setup
- Lazy evaluation learning curve

Optimal Size Range: 100 MB+
Memory Efficiency: 2x better than Pandas (chunking)
Advanced: Direct Engine Selection¶
For advanced users who want full control:
```python
from parquetframe.core import EngineRegistry

registry = EngineRegistry()

# Get available engines
engines = registry.list_engines()
print(engines)  # ['pandas', 'polars', 'dask']

# Select engine manually
engine = registry.get_engine("polars")
df = engine.read_parquet("data.parquet")
```
Troubleshooting¶
"No available engines found"¶
Cause: None of the DataFrame libraries are installed.
Solution:
```bash
pip install pandas  # Install at least one engine
# or
pip install polars
# or
pip install dask[dataframe]
```
Engine selection seems wrong¶
Enable debug logging:
```python
import logging

logging.basicConfig(level=logging.DEBUG)
logging.getLogger('parquetframe.core.heuristics').setLevel(logging.DEBUG)
```
You'll see scoring output like:
```text
DEBUG:parquetframe.core.heuristics:Engine pandas score: 0.723
DEBUG:parquetframe.core.heuristics:Engine polars score: 0.891
DEBUG:parquetframe.core.heuristics:Selected engine: polars (score: 0.891)
```
Force disable automatic selection¶
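To bypass the heuristics entirely, pin an engine using the configuration options described above: set `PARQUETFRAME_ENGINE` in the environment, or force it from Python:

```python
from parquetframe import set_config

# All subsequent reads use Polars, regardless of data size
set_config(default_engine="polars")
```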
Performance Tips¶
- Right-size your thresholds for your typical workload
- Install `psutil` (`pip install psutil`) for accurate memory-based selection
- Use Polars for production datasets in the 1-10 GB range
- Reserve Dask for truly large datasets or distributed computing
- Pandas is fine for small data and quick scripts
Rust Acceleration Layer¶
ParquetFrame automatically uses Rust fast-paths when available, providing roughly 3-25x speedups depending on the operation (see the table below):
Rust Fast-Path Operations¶
| Operation | Speedup | Auto-enabled |
|---|---|---|
| Parquet metadata reading | 5-10x | ✓ |
| Graph algorithms (BFS, PageRank) | 15-25x | ✓ |
| Workflow execution (parallel DAG) | 10-15x | ✓ |
| CSV reading (with schema inference) | 3-5x | ✓ |
Decision Flow¶
```text
Operation Request
│
├─ Is Rust extension available?
│  │
│  ├─ YES → Check if operation supported by Rust
│  │  │
│  │  ├─ Supported → Use Rust fast-path
│  │  └─ Not supported → Fall back to Python/engine
│  │
│  └─ NO → Use Python/selected engine
│
└─ Result returned
```
Check Rust Availability¶
```python
from parquetframe import _rustic

# Check if Rust backend is compiled and available
if hasattr(_rustic, "rust_available"):
    rust_available = _rustic.rust_available()
    print(f"Rust acceleration: {rust_available}")

# Check specific feature availability
if hasattr(_rustic, "io_fastpaths_available"):
    print(f"I/O fast-paths: {_rustic.io_fastpaths_available()}")
```
Installation with Rust Support¶
Rust acceleration is automatically included in wheel distributions:
```bash
# Install from PyPI (includes pre-compiled Rust extension)
pip install parquetframe

# For development, build locally with Rust
pip install maturin
maturin develop --release
```
References¶
- Implementation: `src/parquetframe/core/heuristics.py`
- Configuration: `src/parquetframe/config.py`
- Engine base: `src/parquetframe/core/base.py`
- Rust Acceleration: Rust Overview