Workflow System
ParquetFrame's workflow system orchestrates multi-step data processing pipelines. Workflows are defined in YAML and can be validated, executed, tracked through an execution history, and visualized as dependency graphs.
✨ Key Features
- 📄 YAML-based Workflow Definitions - Define complex pipelines in readable YAML format
- 🔄 Step Dependencies - Automatic dependency resolution and execution ordering
- 📊 Execution History - Track all workflow runs with detailed metrics
- 📈 Performance Analytics - Monitor execution times, memory usage, and success rates
- 🎯 DAG Visualization - Generate visual representations of workflow dependencies
- 🛠️ Rich CLI Tools - Comprehensive command-line interface for management
- ⚡ Backend Optimization - Automatic pandas/Dask selection based on data size
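As a rough illustration of size-based backend selection, the sketch below picks a backend name from an input size. The threshold value and both function names are hypothetical; this section does not describe ParquetFrame's actual cutoff or selection logic.

```python
import os

# Hypothetical cutoff for illustration only; ParquetFrame's real threshold
# and selection logic are not described in this section.
SIZE_THRESHOLD_BYTES = 100 * 1024 * 1024


def choose_backend(size_bytes: int, threshold_bytes: int = SIZE_THRESHOLD_BYTES) -> str:
    """Return "pandas" for inputs below the threshold, otherwise "dask"."""
    return "pandas" if size_bytes < threshold_bytes else "dask"


def choose_backend_for_file(path: str) -> str:
    """Pick a backend from the on-disk size of a file."""
    return choose_backend(os.path.getsize(path))
```

The idea is that small inputs stay on the eager pandas path while larger ones move to Dask's out-of-core execution.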
🚀 Quick Start
1. Create a Workflow
name: "Customer Data Pipeline"
description: "Process customer data and generate insights"
steps:
  - name: "load_customers"
    type: "read"
    input: "raw_customers.parquet"
    output: "customers"
  - name: "filter_active"
    type: "filter"
    input: "customers"
    query: "status == 'active' and last_login > '2024-01-01'"
    output: "active_customers"
  - name: "add_segments"
    type: "transform"
    input: "active_customers"
    function: "add_customer_segments"
    output: "segmented_customers"
  - name: "save_results"
    type: "save"
    input: "segmented_customers"
    output: "customer_segments.parquet"
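Each step's `input` names the `output` of an earlier step, which is what makes automatic ordering possible. The sketch below shows one way such ordering could be derived, using Python's standard `graphlib`. It is an illustration under that assumption, not ParquetFrame's resolver; `execution_order` is a hypothetical helper, and the steps are deliberately listed out of order.

```python
from graphlib import TopologicalSorter

# Steps mirroring the workflow above, reduced to the fields that matter
# for ordering and shuffled on purpose.
steps = [
    {"name": "save_results", "input": "segmented_customers", "output": "customer_segments.parquet"},
    {"name": "add_segments", "input": "active_customers", "output": "segmented_customers"},
    {"name": "filter_active", "input": "customers", "output": "active_customers"},
    {"name": "load_customers", "input": "raw_customers.parquet", "output": "customers"},
]


def execution_order(steps):
    """Order steps so each runs after the step that produces its input."""
    producer = {step["output"]: step["name"] for step in steps}
    graph = {}
    for step in steps:
        upstream = producer.get(step["input"])  # None when the input is an external file
        graph[step["name"]] = {upstream} if upstream else set()
    return list(TopologicalSorter(graph).static_order())


order = execution_order(steps)
print(order)
# ['load_customers', 'filter_active', 'add_segments', 'save_results']
```

`TopologicalSorter` also raises `CycleError` when steps depend on each other in a loop, which is the kind of problem a validation pass would be expected to catch.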
2. Execute the Workflow
# Run the workflow
pframe workflow data_pipeline.yml
# Run with variables
pframe workflow data_pipeline.yml --variables "min_revenue=1000,region=US"
# Validate before running
pframe workflow data_pipeline.yml --validate
# Generate visualization
pframe workflow data_pipeline.yml --visualize graphviz --viz-output pipeline.svg
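The `--variables` value above is a comma-separated list of `key=value` pairs. A minimal sketch of how such a string could be turned into a dictionary follows; `parse_variables` is a hypothetical helper, and the CLI's real parsing rules (quoting, type conversion) are not specified in this section, so every value is kept as a string here.

```python
def parse_variables(raw: str) -> dict[str, str]:
    """Split "key=value,key=value" into a dict of strings."""
    variables = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing comma
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {pair!r}")
        variables[key.strip()] = value.strip()
    return variables


variables = parse_variables("min_revenue=1000,region=US")
print(variables)
# {'min_revenue': '1000', 'region': 'US'}
```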
3. Monitor Execution History
# View recent executions
pframe workflow-history
# Filter by workflow
pframe workflow-history --workflow-name "Customer Data Pipeline"
# Show detailed information
pframe workflow-history --details
# View statistics
pframe workflow-history --stats
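To make the statistics view concrete, the sketch below aggregates a few run records into a run count, success rate, and mean duration. `RunRecord` and `summarize` are hypothetical names, and the records are made up for illustration; the actual `.hist` file format and the fields reported by `--stats` are not described in this section.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    """Hypothetical summary of one workflow execution."""
    workflow_name: str
    succeeded: bool
    duration_seconds: float


def summarize(records, workflow_name=None):
    """Aggregate run count, success rate, and mean duration."""
    if workflow_name is not None:
        records = [r for r in records if r.workflow_name == workflow_name]
    if not records:
        return {"runs": 0, "success_rate": None, "mean_duration_seconds": None}
    return {
        "runs": len(records),
        "success_rate": sum(r.succeeded for r in records) / len(records),
        "mean_duration_seconds": mean(r.duration_seconds for r in records),
    }


# Made-up records, for illustration only.
records = [
    RunRecord("Customer Data Pipeline", True, 2.0),
    RunRecord("Customer Data Pipeline", False, 4.0),
    RunRecord("Other Pipeline", True, 1.0),
]
stats = summarize(records, workflow_name="Customer Data Pipeline")
print(stats)
# {'runs': 2, 'success_rate': 0.5, 'mean_duration_seconds': 3.0}
```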
🏗️ Architecture Overview
graph TD
A[YAML Workflow] --> B[WorkflowEngine]
B --> C[Step Execution]
C --> D[WorkflowHistoryManager]
C --> E[Performance Tracking]
D --> F[.hist Files]
E --> G[Memory & Timing Metrics]
H[WorkflowVisualizer] --> I[Graphviz]
H --> J[NetworkX]
H --> K[Mermaid]
L[CLI Commands] --> M[workflow]
L --> N[workflow-history]
style B fill:#e1f5fe
style D fill:#f3e5f5
style H fill:#e8f5e8
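Mermaid is one of the visualization targets shown in the diagram. As a sketch of what generating that format involves, the snippet below renders step dependencies as Mermaid flowchart text. `to_mermaid` is a hypothetical helper and does not reflect how `WorkflowVisualizer` is implemented; the edges reuse the step names from the Quick Start workflow.

```python
def to_mermaid(edges):
    """Render (source, target) pairs as a top-down Mermaid flowchart."""
    lines = ["graph TD"]
    for source, target in edges:
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)


# Step names from the Quick Start workflow, chained in execution order.
edges = [
    ("load_customers", "filter_active"),
    ("filter_active", "add_segments"),
    ("add_segments", "save_results"),
]
diagram = to_mermaid(edges)
print(diagram)
```

The resulting text can be pasted into any Mermaid renderer to produce a diagram in the same style as the one above.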
📖 Documentation Sections
Section | Description |
---|---|
Step Types | Complete reference for all available workflow step types |
YAML Syntax | Detailed workflow definition format and options |
CLI Commands | Command-line interface reference for workflow management |
History & Analytics | Execution tracking, metrics, and performance analysis |
DAG Visualization | Generate and customize workflow visualizations |
Advanced Features | Variables, conditionals, and complex dependency patterns |
Examples | Real-world workflow examples and use cases |
🎯 Common Use Cases
Data Processing Pipeline
Suited to ETL/ELT workflows that chain multiple transformation, filtering, and aggregation steps.
Machine Learning Workflows
Orchestrate data preprocessing, feature engineering, model training, and evaluation steps.
Business Intelligence
Automate report generation, metric calculations, and dashboard data preparation.
Data Quality Monitoring
Create repeatable data validation and quality assessment workflows.
🚀 Getting Started
Ready to build your first workflow? Start with our Step Types Reference to understand available operations, then explore YAML Syntax for workflow definitions.
For hands-on examples, check out our Examples Gallery featuring real-world use cases and patterns.