# CLI Commands Reference
This page documents all `pframe` CLI commands and their options.
## pframe info
Display detailed information about parquet files without loading them into memory.
### Usage
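In general form (see `pframe info --help` for the authoritative synopsis):

```bash
pframe info <file.parquet>
```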
### Description

The `info` command provides:

- **File metadata**: Size, path, recommended backend
- **Schema information**: Column names, types, nullability
- **Parquet metadata**: Row groups, total rows/columns
- **Storage details**: Compression, encoding information
### Examples
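For example, inspecting a local file (the path is illustrative):

```bash
# Inspect a parquet file without loading it into memory
pframe info data.parquet
```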
### Sample Output

```text
File Information: data.parquet
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Property            ┃ Value                     ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ File Size           │ 2,345,678 bytes (2.24 MB) │
│ Recommended Backend │ pandas (eager)            │
└─────────────────────┴───────────────────────────┘

Parquet Schema:
┏━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┓
┃ Column   ┃ Type   ┃ Nullable ┃
┡━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━┩
│ user_id  │ int64  │ No       │
│ name     │ string │ Yes      │
│ email    │ string │ Yes      │
└──────────┴────────┴──────────┘
```
## pframe run
Process parquet files with filtering, transformations, and analysis operations.
### Usage
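In general form (see `pframe run --help` for the full synopsis):

```bash
pframe run [OPTIONS] <file.parquet>
```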
### Core Options

| Option | Short | Type | Description |
|---|---|---|---|
| `--query` | `-q` | TEXT | Filter data with pandas/Dask query expressions |
| `--columns` | `-c` | TEXT | Select specific columns (comma-separated) |
| `--output` | `-o` | PATH | Save results to output file |
| `--save-script` | `-S` | PATH | Generate Python script of operations |
### Display Options

| Option | Short | Type | Description |
|---|---|---|---|
| `--head` | `-h` | INT | Show first N rows |
| `--tail` | `-t` | INT | Show last N rows |
| `--sample` | `-s` | INT | Show N random rows |
| `--describe` | | FLAG | Statistical description |
| `--info` | | FLAG | Data types and info |
### Backend Options

| Option | Type | Description |
|---|---|---|
| `--threshold` | FLOAT | File size threshold in MB (default: 10) |
| `--force-pandas` | FLAG | Force pandas backend |
| `--force-dask` | FLAG | Force Dask backend |
### Examples
#### Basic Data Exploration

```bash
# Quick preview
pframe run data.parquet

# Show first 10 rows
pframe run data.parquet --head 10

# Statistical summary
pframe run data.parquet --describe
```
#### Filtering and Selection

```bash
# Filter rows
pframe run data.parquet --query "age > 25 and status == 'active'"

# Select columns
pframe run data.parquet --columns "name,email,age"

# Combine filtering and selection
pframe run data.parquet \
  --query "department == 'Engineering'" \
  --columns "name,salary,hire_date" \
  --head 20
```
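Since `--query` accepts standard pandas/Dask query expressions, the CLI calls above behave roughly like this pandas sketch (the sample data is made up for illustration):

```python
import pandas as pd

# Illustrative sample data standing in for data.parquet
df = pd.DataFrame({
    "name": ["Ana", "Bo", "Cy"],
    "age": [22, 31, 45],
    "status": ["active", "active", "inactive"],
})

# --query maps onto DataFrame.query()
filtered = df.query("age > 25 and status == 'active'")

# --columns maps onto plain column selection
selected = filtered[["name", "age"]]
print(selected)
```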
#### Data Processing Pipeline

```bash
# Complete processing with output
pframe run sales_data.parquet \
  --query "region in ['North', 'South'] and revenue > 10000" \
  --columns "customer_id,product,revenue,date" \
  --output "high_value_sales.parquet" \
  --save-script "sales_analysis.py"
```
#### Backend Control

```bash
# Force pandas for small operations
pframe run data.parquet --force-pandas --describe

# Force Dask for memory efficiency
pframe run large_data.parquet --force-dask --sample 1000

# Custom threshold
pframe run data.parquet --threshold 50 --info
```
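Conceptually, the backend choice comes down to comparing file size against the threshold. The following is a simplified sketch of that idea (the `choose_backend` helper is hypothetical, not pframe's actual code):

```python
import os

def choose_backend(path: str, threshold_mb: float = 10.0) -> str:
    """Pick a backend from file size (illustrative sketch, not pframe internals)."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    # At or under the threshold, the file fits comfortably in memory: use pandas.
    # Above it, prefer Dask's chunked, out-of-core processing.
    return "pandas" if size_mb <= threshold_mb else "dask"
```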
## pframe interactive
Start an interactive Python REPL with ParquetFrame integration.
### Usage
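In general form (the file argument is optional; see `pframe interactive --help`):

```bash
pframe interactive [OPTIONS] [file.parquet]
```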
### Options

| Option | Type | Description |
|---|---|---|
| `--threshold` | FLOAT | File size threshold in MB (default: 10) |
### Description

Interactive mode provides:

- Full Python REPL with ParquetFrame pre-loaded
- Session history with persistent readline support
- Rich output formatting for data exploration
- Script generation from session commands
- Pre-loaded variables: `pf`, `pd`, `dd`, `console`
### Examples
#### Start Interactive Session

```bash
# Empty session
pframe interactive

# Load file automatically
pframe interactive data.parquet

# Custom threshold
pframe interactive large_data.parquet --threshold 50
```
#### Interactive Session Example

```python
# In the interactive session
>>> pf.info()
>>> pf.head(10)
>>> filtered = pf.query("age > 30")
>>> result = filtered.groupby("department").size()
>>> result.save("department_counts.parquet", save_script="analysis.py")
>>> exit()
```
### Available Variables

| Variable | Description |
|---|---|
| `pf` | Your ParquetFrame instance |
| `pd` | pandas module |
| `dd` | dask.dataframe module |
| `console` | rich Console for pretty printing |
### Session Features
- Tab completion for all ParquetFrame methods
- Command history saved between sessions
- Rich formatting for DataFrames and tables
- Error handling with helpful messages
- Script generation from session commands
## pframe workflow
Execute, validate, or visualize YAML workflow definitions.
### Usage
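In general form (the workflow file is omitted with flags like `--create-example`; see `pframe workflow --help`):

```bash
pframe workflow [OPTIONS] [workflow.yml]
```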
### Key Options

| Option | Short | Type | Description |
|---|---|---|---|
| `--variables` | `-V` | TEXT | Set workflow variables as key=value pairs |
| `--validate` | `-v` | FLAG | Validate workflow without executing |
| `--visualize` | | CHOICE | Generate visualization: graphviz, networkx, mermaid |
| `--viz-output` | | PATH | Output path for visualization file |
| `--quiet` | `-q` | FLAG | Run in quiet mode |
| `--list-steps` | | FLAG | List all available step types |
| `--create-example` | | PATH | Create example workflow file |
### Examples

```bash
# Execute workflow
pframe workflow pipeline.yml

# Execute with variables
pframe workflow pipeline.yml --variables "region=US,min_age=21"

# Validate before running
pframe workflow pipeline.yml --validate

# Generate DAG visualization
pframe workflow pipeline.yml --visualize graphviz --viz-output dag.svg

# Create example workflow
pframe workflow --create-example example.yml
```
### Features

- **Multi-step Pipelines**: Chain read, filter, transform, save operations
- **DAG Visualization**: Generate workflow dependency graphs
- **Execution History**: Automatic tracking of all workflow runs
- **Performance Monitoring**: Memory usage and timing metrics
- **Variable Substitution**: Dynamic workflow configuration
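A workflow file chains these step operations in YAML. The sketch below is hypothetical; take the actual schema and step names from `pframe workflow --list-steps` and the file generated by `--create-example`:

```yaml
# Hypothetical workflow sketch; verify the real schema with --list-steps
name: example_pipeline
steps:
  - type: read            # step names here are illustrative
    path: sales_data.parquet
  - type: filter
    query: "region == '{{ region }}'"   # substituted via --variables "region=US"
  - type: save
    path: filtered_sales.parquet
```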
## pframe workflow-history
View and manage workflow execution history with detailed analytics.
### Usage
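In general form (see `pframe workflow-history --help`):

```bash
pframe workflow-history [OPTIONS]
```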
### Key Options

| Option | Short | Type | Description |
|---|---|---|---|
| `--workflow-name` | `-w` | TEXT | Filter by specific workflow name |
| `--status` | `-s` | CHOICE | Filter by status: completed, failed, running |
| `--limit` | `-l` | INT | Limit number of records (default: 10) |
| `--details` | `-d` | FLAG | Show detailed execution information |
| `--stats` | | FLAG | Show aggregate statistics |
| `--cleanup` | | INT | Clean up files older than N days |
### Examples

```bash
# View recent executions
pframe workflow-history

# Filter by workflow name
pframe workflow-history --workflow-name "Data Pipeline"

# Show detailed information
pframe workflow-history --details --limit 5

# View aggregate statistics
pframe workflow-history --stats

# Clean up old history
pframe workflow-history --cleanup 30
```
### Features

- **Execution Tracking**: Complete history of workflow runs
- **Performance Analytics**: Success rates, duration trends
- **Rich Filtering**: By name, status, time period
- **Detailed Views**: Step-by-step execution breakdown
- **History Management**: Cleanup and maintenance tools
## Global Options

These options are available for all commands:

| Option | Description |
|---|---|
| `--version` | Show version and exit |
| `--help` | Show help message |
## Error Handling

The CLI provides helpful error messages for common issues:

- **File not found**: Clear message with suggested fixes
- **Invalid queries**: Pandas/Dask error context
- **Backend issues**: Automatic fallback suggestions
- **Permission errors**: System-specific guidance
## Performance Notes
- File size detection happens before backend selection
- Memory usage is optimized based on chosen backend
- Progress indicators for long-running operations
- Interrupt handling with Ctrl+C support