SQL Support¶
Execute powerful SQL queries directly on your DataFrames using ParquetFrame's integrated DuckDB engine. Combine the flexibility of SQL with the performance of pandas and the scalability of Dask.
Overview¶
ParquetFrame provides comprehensive SQL support through DuckDB integration, enabling you to:
- Execute SQL queries on pandas and Dask DataFrames
- Join multiple DataFrames with automatic registration
- Use advanced SQL features like window functions and CTEs
- Optimize query performance with explain plans
- Seamlessly switch between DataFrame and SQL workflows, as the sketch below shows
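For example, a query result can be fed straight back into DataFrame-style calls. A minimal sketch (file and column names are illustrative):
import parquetframe as pf
# Filter with SQL, then keep working with DataFrame-style methods on the result
customers = pf.read("customers.parquet")
adults = customers.sql("SELECT customer_id, name, age FROM df WHERE age >= 18")
print(len(adults))    # the result supports len() ...
print(adults.head())  # ... and pandas-style methods like head()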
Key Features¶
- DuckDB Engine: Fast, analytical SQL database engine
- Automatic DataFrame Registration: DataFrames become SQL tables instantly (see the minimal pattern after this list)
- Multi-DataFrame JOINs: Query across multiple datasets effortlessly
- Backend Agnostic: Works with both pandas and Dask DataFrames
- Query Validation: Built-in SQL syntax validation and safety checks
- Performance Analysis: Query explain plans for optimization
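Under the hood the registration model is simple: the DataFrame you call .sql() on is always exposed as the table df, and each extra DataFrame passed as a keyword argument is registered under its keyword name. A minimal sketch (column names are illustrative):
import parquetframe as pf
customers = pf.read("customers.parquet")  # queried as the table "df"
orders = pf.read("orders.parquet")        # registered as "orders" below
totals = customers.sql("""
    SELECT c.name, SUM(o.amount) as total
    FROM df c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.name
""", orders=orders)  # keyword name becomes the SQL table name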
Quick Start¶
Installation¶
# Install ParquetFrame with SQL support
pip install parquetframe[sql]
# Or install everything
pip install parquetframe[all]
Basic SQL Query¶
import parquetframe as pf
# Load data
customers = pf.read("customers.parquet")
# Execute SQL query
result = customers.sql("""
    SELECT customer_id, name, age
    FROM df
    WHERE age > 30
    ORDER BY age DESC
    LIMIT 10
""")
print(result)
Core SQL Operations¶
Basic Queries¶
import parquetframe as pf
# Sample data
sales = pf.read("sales_data.parquet")
# SELECT with filtering
result = sales.sql("""
    SELECT product_name, sales_amount, order_date
    FROM df
    WHERE sales_amount > 1000
      AND order_date >= '2024-01-01'
""")

# Aggregation queries
summary = sales.sql("""
    SELECT
        product_category,
        COUNT(*) as order_count,
        SUM(sales_amount) as total_revenue,
        AVG(sales_amount) as avg_order_value
    FROM df
    GROUP BY product_category
    ORDER BY total_revenue DESC
""")

# Window functions
trending = sales.sql("""
    SELECT
        product_name,
        order_date,
        sales_amount,
        SUM(sales_amount) OVER (
            PARTITION BY product_name
            ORDER BY order_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as rolling_7day_sales
    FROM df
    ORDER BY product_name, order_date
""")
Multi-DataFrame JOINs¶
# Load multiple datasets
customers = pf.read("customers.parquet")
orders = pf.read("orders.parquet")
products = pf.read("products.parquet")
# Complex JOIN query
analysis = customers.sql("""
    SELECT
        c.customer_name,
        c.customer_segment,
        COUNT(o.order_id) as total_orders,
        SUM(o.order_amount) as total_spent,
        AVG(o.order_amount) as avg_order_value,
        COUNT(DISTINCT p.product_category) as categories_purchased
    FROM df c
    LEFT JOIN orders o
        ON c.customer_id = o.customer_id
        AND o.order_date >= '2024-01-01'  -- filter in ON to preserve LEFT JOIN semantics
    LEFT JOIN products p ON o.product_id = p.product_id
    GROUP BY c.customer_name, c.customer_segment
    HAVING total_orders > 5
    ORDER BY total_spent DESC
""", orders=orders, products=products)
print(f"Customer analysis: {len(analysis)} customers")
print(analysis.head())
Advanced SQL Features¶
Common Table Expressions (CTEs)¶
result = sales.sql("""
    WITH monthly_sales AS (
        SELECT
            DATE_TRUNC('month', order_date) as month,
            product_category,
            SUM(sales_amount) as monthly_revenue
        FROM df
        GROUP BY month, product_category
    ),
    category_trends AS (
        SELECT
            *,
            LAG(monthly_revenue) OVER (
                PARTITION BY product_category
                ORDER BY month
            ) as prev_month_revenue
        FROM monthly_sales
    )
    SELECT
        month,
        product_category,
        monthly_revenue,
        monthly_revenue - prev_month_revenue as revenue_change,
        CASE
            WHEN prev_month_revenue > 0 THEN
                (monthly_revenue - prev_month_revenue) / prev_month_revenue * 100
            ELSE NULL
        END as growth_rate
    FROM category_trends
    WHERE prev_month_revenue IS NOT NULL
    ORDER BY month, growth_rate DESC
""")
Subqueries and Analytics¶
# Find top customers in each region
top_customers = customers.sql("""
    SELECT
        region,
        customer_name,
        total_spent,
        customer_rank
    FROM (
        SELECT
            region,
            customer_name,
            total_spent,
            ROW_NUMBER() OVER (
                PARTITION BY region
                ORDER BY total_spent DESC
            ) as customer_rank
        FROM df
        WHERE total_spent > 0
    ) ranked_customers
    WHERE customer_rank <= 3
    ORDER BY region, customer_rank
""")
CLI Integration¶
ParquetFrame provides powerful CLI tools for SQL operations:
Basic SQL Command¶
# Execute SQL query from command line
pframe sql "SELECT COUNT(*) FROM df" --file data.parquet
# Save results
pframe sql "SELECT * FROM df WHERE age > 30" \
--file customers.parquet \
--output filtered_customers.parquet
# JOIN multiple files
pframe sql "SELECT c.name, SUM(o.amount) as total FROM df c JOIN orders o ON c.id = o.customer_id GROUP BY c.name" \
--file customers.parquet \
--join "orders=orders.parquet" \
--output customer_totals.parquet
Interactive SQL Mode¶
# Start interactive SQL session
pframe sql --interactive --file sales_data.parquet
# In the interactive session:
SQL> SELECT product_category, AVG(sales_amount) as avg_sales
     FROM df
     GROUP BY product_category
     ORDER BY avg_sales DESC;

# Results are displayed automatically with rich formatting:
# tables render with colors and proper alignment
Query Validation and Explanation¶
# Validate SQL syntax before execution
pframe sql "SELECT * FROM df WHERE invalid_syntax" \
--file data.parquet \
--validate
# Show execution plan
pframe sql "SELECT * FROM df WHERE complex_condition" \
--file data.parquet \
--explain
Performance and Optimization¶
Query Performance Analysis¶
from parquetframe.sql import explain_query
# Get execution plan
plan = explain_query(
    main_df=sales._df,
    query="""
        SELECT product_category, SUM(sales_amount) as total
        FROM df
        WHERE order_date >= '2024-01-01'
        GROUP BY product_category
    """,
)
print("Execution Plan:")
print(plan)
Performance Best Practices¶
# 1. Filter early to reduce data size
efficient_query = sales.sql("""
    SELECT customer_id, SUM(amount) as total
    FROM df
    WHERE order_date >= '2024-01-01'  -- Filter first
    GROUP BY customer_id
    HAVING total > 1000               -- Filter after aggregation
""")

# 2. Use column selection to minimize I/O
selective_query = sales.sql("""
    SELECT customer_id, order_date, amount  -- Only needed columns
    FROM df
    WHERE order_date BETWEEN '2024-01-01' AND '2024-03-31'
""")
# 3. DuckDB optimizes JOIN order automatically; keep join keys simple and filter the inputs
join_query = customers.sql("""
    SELECT c.name, o.total_amount
    FROM df c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_date >= '2024-01-01'
""", orders=orders)
Integration with Dask¶
SQL operations work seamlessly with Dask DataFrames:
import warnings

# Large dataset with Dask
large_sales = pf.read("huge_sales_data.parquet", islazy=True)

# DuckDB computes the Dask DataFrame automatically; suppress its
# computation warning for this call
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = large_sales.sql("""
        SELECT
            product_category,
            DATE_TRUNC('month', order_date) as month,
            SUM(sales_amount) as monthly_sales
        FROM df
        WHERE order_date >= '2023-01-01'
        GROUP BY product_category, month
        ORDER BY month, monthly_sales DESC
    """)
print(f"Result type: {type(result._df)}")
print(f"Rows: {len(result)}")
Error Handling and Safety¶
ParquetFrame includes built-in safety features:
from parquetframe.sql import validate_sql_query
# Query validation
query = "SELECT * FROM df WHERE age > 30"
if validate_sql_query(query):
    print("Query is valid")
    result = df.sql(query)
else:
    print("Invalid query syntax")
# Automatic warnings for potentially dangerous operations
try:
    # This will warn but not fail (operates on in-memory data)
    result = df.sql("DROP TABLE df")
except Exception as e:
    print(f"Query failed: {e}")
# Error handling with detailed messages
try:
    result = customers.sql("""
        SELECT invalid_column
        FROM df
        WHERE nonexistent_table.id = 1
    """)
except ValueError as e:
    print(f"SQL Error: {e}")
    # Provides helpful context for debugging
Advanced Use Cases¶
Time Series Analysis¶
time_series = sales.sql("""
    SELECT
        DATE_TRUNC('day', order_date) as date,
        SUM(sales_amount) as daily_sales,
        AVG(SUM(sales_amount)) OVER (
            ORDER BY DATE_TRUNC('day', order_date)
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) as moving_avg_7d,
        SUM(sales_amount) - LAG(SUM(sales_amount)) OVER (
            ORDER BY DATE_TRUNC('day', order_date)
        ) as daily_change
    FROM df
    WHERE order_date >= '2024-01-01'
    GROUP BY DATE_TRUNC('day', order_date)
    ORDER BY date
""")
Cohort Analysis¶
# Retention by cohort; df here is order-level data (one row per order)
cohort_analysis = orders.sql("""
    WITH customer_cohorts AS (
        SELECT
            customer_id,
            DATE_TRUNC('month', MIN(order_date) OVER (PARTITION BY customer_id)) as cohort_month,
            DATE_TRUNC('month', order_date) as order_month
        FROM df
    ),
    cohort_sizes AS (
        SELECT
            cohort_month,
            COUNT(DISTINCT customer_id) as cohort_size
        FROM customer_cohorts
        GROUP BY cohort_month
    )
    SELECT
        c.cohort_month,
        c.order_month,
        COUNT(DISTINCT c.customer_id) as active_customers,
        s.cohort_size,
        COUNT(DISTINCT c.customer_id)::FLOAT / s.cohort_size as retention_rate,
        DATEDIFF('month', c.cohort_month, c.order_month) as period_number
    FROM customer_cohorts c
    JOIN cohort_sizes s ON c.cohort_month = s.cohort_month
    GROUP BY c.cohort_month, c.order_month, s.cohort_size
    ORDER BY c.cohort_month, c.order_month
""")
Statistical Analysis¶
stats_analysis = sales.sql("""
    SELECT
        product_category,
        COUNT(*) as sample_size,
        AVG(sales_amount) as mean_sales,
        STDDEV(sales_amount) as std_dev,
        MIN(sales_amount) as min_sales,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales_amount) as q1,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales_amount) as median,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sales_amount) as q3,
        MAX(sales_amount) as max_sales,
        (PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY sales_amount) -
         PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY sales_amount)) as iqr
    FROM df
    GROUP BY product_category
    ORDER BY mean_sales DESC
""")
Migration from Other Tools¶
From pandas¶
# Before (pandas)
import pandas as pd
df = pd.read_parquet("data.parquet")
result = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count']
}).round(2)
# After (ParquetFrame SQL)
import parquetframe as pf
df = pf.read("data.parquet")
result = df.sql("""
    SELECT
        category,
        ROUND(SUM(amount), 2) as total_amount,
        ROUND(AVG(amount), 2) as avg_amount,
        COUNT(*) as count_records
    FROM df
    GROUP BY category
    ORDER BY total_amount DESC
""")
From Spark SQL¶
# Spark SQL patterns work directly in ParquetFrame
result = df.sql("""
    SELECT
        customer_id,
        product_category,
        SUM(amount) as total_spent,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY SUM(amount) DESC
        ) as category_rank
    FROM df
    GROUP BY customer_id, product_category
""")
Troubleshooting¶
Common Issues¶
- "No such table: df" - the DataFrame you call .sql() on is always registered as df, so reference it as FROM df; additional DataFrames must be passed as keyword arguments to be registered under those names.
- "Column not found" - check the spelling against the DataFrame's actual columns; quoted identifiers are case-sensitive.
- "Memory error with large datasets" - read the file with islazy=True so Dask handles out-of-core processing, and filter early to keep intermediate results small (see the sketch after this list).
- "Slow JOIN performance" - filter and project each input before joining; DuckDB optimizes join order automatically, but smaller inputs always help.
Best Practices¶
- Use Descriptive Table Aliases - alias every table in multi-table queries (df c, orders o) so join conditions stay readable.
- Filter Early and Often - apply WHERE clauses before JOINs and aggregations to shrink the data DuckDB has to process.
- Use Parameterized Queries for Safety - never interpolate untrusted input directly into SQL strings, and validate queries before running them.
- Test Queries on Small Samples - develop against a cheap preview first, then run on the full dataset (see the sketch below).
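A sketch combining the last two practices, using the validate_sql_query helper shown earlier; appending a LIMIT is one cheap way to preview a query before the full run:
import parquetframe as pf
from parquetframe.sql import validate_sql_query

sales = pf.read("sales_data.parquet")
query = """
SELECT product_category, SUM(sales_amount) as total
FROM df
GROUP BY product_category
ORDER BY total DESC
"""

# Validate the syntax before touching any data
if not validate_sql_query(query):
    raise ValueError("Invalid SQL syntax")

# Trial-run cheaply with a LIMIT, then inspect the output
preview = sales.sql(query + " LIMIT 10")
print(preview)

# Once the preview looks right, run the full query
result = sales.sql(query)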
Related Documentation¶
- DuckDB Integration - Advanced DuckDB features and optimization
- Database Connections - Connect to external databases
- Backend Selection - Choose between pandas and Dask
- CLI Guide - Command-line SQL operations
- Performance Optimization - General performance tips