ML Pipeline Integration¶

Integrate ParquetFrame into machine learning workflows and pipelines.

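ParquetFrame can be used alongside machine learning frameworks and pipeline tools: it handles data loading and preparation, and the prepared data is then handed to the framework that trains the model.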

Data Preparation¶

Optimize data preparation for ML workflows:

- Feature Engineering: Create and transform features
- Data Splitting: Train/validation/test splits
- Data Validation: Ensure data quality and consistency
- Schema Evolution: Handle changing data schemas
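The data splitting step can be sketched without any framework. The following is a minimal illustration, not ParquetFrame's own API: the helper name `three_way_split` and its parameters are hypothetical, and it only produces row indices, on the assumption that the rows are selected from the frame afterwards.

```python
import random


def three_way_split(n_rows, val_frac=0.1, test_frac=0.1, seed=42):
    """Partition range(n_rows) into (train, val, test) lists of row indices.

    Hypothetical helper for illustration; it is not part of ParquetFrame.
    """
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    indices = list(range(n_rows))
    # A local Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(indices)

    n_test = round(n_rows * test_frac)
    n_val = round(n_rows * val_frac)

    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = three_way_split(1000, val_frac=0.15, test_frac=0.15)
print(len(train_idx), len(val_idx), len(test_idx))
```

The index lists can then be used to select rows from whatever object holds the data, for example with `.iloc` on a pandas DataFrame. Fixing the seed keeps the three sets stable across runs, which matters when comparing models.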

Framework Integration¶

Work with popular ML frameworks:

- Scikit-learn: Traditional ML algorithms
- XGBoost/LightGBM: Gradient boosting frameworks
- TensorFlow/PyTorch: Deep learning frameworks
- MLflow: ML lifecycle management

Pipeline Orchestration¶

Integrate with pipeline tools:

- Apache Airflow: Workflow orchestration
- Kubeflow: ML workflows on Kubernetes
- MLOps platforms: End-to-end ML platforms

Performance Considerations¶

Optimize for ML workloads:

- Efficient data loading: Minimize data loading overhead
- Memory management: Handle large datasets efficiently
- Parallel processing: Scale feature engineering and model training
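One common way to keep memory bounded is to process rows in fixed-size batches rather than materializing the whole dataset. The sketch below shows the idea with plain Python iterables; `iter_batches` and `running_mean` are hypothetical names for illustration and are not ParquetFrame functions. For Parquet files specifically, PyArrow's `ParquetFile.iter_batches` offers a comparable streaming interface.

```python
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_mean(rows, batch_size=1024):
    """Compute a mean while holding only one batch in memory at a time."""
    total = 0.0
    count = 0
    for batch in iter_batches(rows, batch_size):
        total += sum(batch)
        count += len(batch)
    return total / count if count else float("nan")


# A generator stands in for a data source too large to load at once.
values = (x * 0.5 for x in range(10))
print(running_mean(values, batch_size=4))
```

The same pattern applies to feature statistics such as means or minimum and maximum values, which can be accumulated batch by batch before model training.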

Summary¶

ParquetFrame provides efficient data handling capabilities that integrate naturally with ML workflows.

Examples¶

import numpy as np
import parquetframe as pf
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load and prepare ML data
df = pf.read("ml_dataset.parquet")

# Feature engineering
df['feature_ratio'] = df['feature_a'] / df['feature_b']
# NumPy's log is used on the assumption that parquetframe has no top-level log();
# feature_c must be strictly positive
df['feature_log'] = np.log(df['feature_c'])

# Prepare features and target
X = df[['feature_a', 'feature_b', 'feature_ratio', 'feature_log']]
y = df['target']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.4f}")

# Save results
results = pf.DataFrame({
    'actual': y_test,
    'predicted': predictions
})
results.save("ml_results.parquet")
