AI-Powered Data Exploration¶
🤖 NEW: ParquetFrame now includes AI-powered natural language querying and interactive data exploration!
Overview¶
ParquetFrame has evolved into a comprehensive data exploration platform that supports:

- Parquet Data Lakes: Recursive file discovery and unified querying
- Database Integration: SQLAlchemy-based multi-database support
- AI-Powered Queries: Natural language to SQL conversion with a local LLM
- Interactive CLI: Rich REPL interface with session management
Prerequisites¶
Install Ollama¶
ParquetFrame uses Ollama for local LLM inference:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows
# Download from https://ollama.ai/download
Start Ollama and Pull Models¶
# Start Ollama service
ollama serve
# In another terminal, pull a model
ollama pull llama3.2
# or for better SQL generation
ollama pull codellama
Install ParquetFrame with AI Support¶
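The exact pip extras name is not shown on this page; assuming the AI features ship as an optional extra (a common packaging convention), installation would look something like:

```shell
# The "ai" extras name is an assumption -- check the project README
# for the exact spelling before running this.
pip install "parquetframe[ai]"
```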
Interactive Mode¶
Parquet Data Lakes¶
Explore directories of parquet files with AI assistance:
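The exact invocation is not shown here, but based on the `--db-uri` examples later on this page, pointing `pframe interactive` at a directory presumably looks like:

```shell
# Launch the interactive REPL on a directory of parquet files
pframe interactive /Users/me/sales_data/
```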
Example session:
🚀 Welcome to ParquetFrame Interactive Mode
Data source: /Users/me/sales_data/
Type: parquet
AI enabled: Yes
✓ Connected! Found 1 table(s)
pframe:parquet🤖> \help
📖 Help
Data Exploration:
\list, \l, \tables List all available tables
\describe <table> Show detailed table schema
Querying:
<SQL query> Execute SQL query directly
\ai <question> Ask question in natural language 🤖
Session Management:
\history Show query history
\save-session <file> Save current session
\load-session <file> Load saved session
pframe:parquet🤖> \list
📋 Available Tables
┌────────────┬─────────┐
│ Table Name │ Type    │
├────────────┼─────────┤
│ sales_data │ Virtual │
└────────────┴─────────┘
pframe:parquet🤖> \describe sales_data
📋 Table Schema: sales_data
┌─────────────┬─────────┬──────────┐
│ Column      │ Type    │ Nullable │
├─────────────┼─────────┼──────────┤
│ date        │ DATE    │ ✓        │
│ product_id  │ INTEGER │ ✓        │
│ customer_id │ INTEGER │ ✓        │
│ amount      │ DOUBLE  │ ✓        │
│ quantity    │ INTEGER │ ✓        │
└─────────────┴─────────┴──────────┘
📁 Files: 12
📂 Source: /Users/me/sales_data
pframe:parquet🤖> \ai what were the total sales last month?
🤖 Processing: what were the total sales last month?
📝 Generated Query:
┌──────────────────────────────────────────────────┐
│ SELECT SUM(amount) as total_sales                │
│ FROM sales_data                                  │
│ WHERE date >= date_trunc('month',                │
│               current_date - interval '1 month') │
│   AND date < date_trunc('month', current_date)   │
└──────────────────────────────────────────────────┘
❓ Execute this query? [Y/n]: y
┌─────────────┐
│ total_sales │
├─────────────┤
│   125432.78 │
└─────────────┘
📊 1 rows in 45.2ms
pframe:parquet🤖> SELECT product_id, SUM(amount) as revenue FROM sales_data GROUP BY product_id ORDER BY revenue DESC LIMIT 5;
┌────────────┬──────────┐
│ product_id │ revenue  │
├────────────┼──────────┤
│        101 │ 25431.20 │
│        205 │ 18965.44 │
│        156 │ 15678.91 │
│        332 │ 12447.33 │
│        189 │ 11982.17 │
└────────────┴──────────┘
📊 5 rows in 23.1ms
pframe:parquet🤖> \history
📜 Query History
┌───┬─────────────────────────────────────────────────────────────────┬────────┬───────────┐
│ # │ Query                                                           │ Status │ Time (ms) │
├───┼─────────────────────────────────────────────────────────────────┼────────┼───────────┤
│ 1 │ SELECT SUM(amount) as total_sales FROM sales_data WHERE date... │ ✓      │ 45.2      │
│ 2 │ SELECT product_id, SUM(amount) as revenue FROM sales_data GR... │ ✓      │ 23.1      │
└───┴─────────────────────────────────────────────────────────────────┴────────┴───────────┘
pframe:parquet🤖> \save-session monthly_analysis
💾 Session saved to: /Users/me/.parquetframe/sessions/monthly_analysis.pkl
pframe:parquet🤖> \quit
👋 Goodbye!
Database Connections¶
Connect to any SQL database:
# SQLite
pframe interactive --db-uri "sqlite:///northwind.db"
# PostgreSQL
pframe interactive --db-uri "postgresql://user:password@localhost:5432/ecommerce"
# MySQL
pframe interactive --db-uri "mysql+pymysql://user:password@localhost/orders"
Example database session:
pframe:database🤖> \list
📋 Available Tables
┌────────────┬──────────┐
│ Table Name │ Type     │
├────────────┼──────────┤
│ customers  │ Database │
│ orders     │ Database │
│ products   │ Database │
│ suppliers  │ Database │
└────────────┴──────────┘
pframe:database🤖> \ai which customers have made the most orders?
🤖 Processing: which customers have made the most orders?
📝 Generated Query:
┌──────────────────────────────────────────────────────────┐
│ SELECT c.customer_name, COUNT(o.order_id) as order_count │
│ FROM customers c                                         │
│ JOIN orders o ON c.customer_id = o.customer_id           │
│ GROUP BY c.customer_id, c.customer_name                  │
│ ORDER BY order_count DESC                                │
│ LIMIT 10                                                 │
└──────────────────────────────────────────────────────────┘
❓ Execute this query? [Y/n]: y
┌───────────────────┬─────────────┐
│ customer_name     │ order_count │
├───────────────────┼─────────────┤
│ ACME Corporation  │          45 │
│ Global Industries │          38 │
│ Tech Solutions    │          31 │
│ ...               │         ... │
└───────────────────┴─────────────┘
📊 10 rows in 67.3ms
Programming Interface¶
DataContext API¶
Use the DataContext system programmatically:
import asyncio
from parquetframe.datacontext import DataContextFactory

async def explore_data():
    # Connect to parquet data lake
    context = DataContextFactory.create_from_path("./sales_data/")
    await context.initialize()

    # Get schema information
    schema = context.get_schema_as_text()
    print("Schema for LLM:", schema)

    # Execute queries
    result = await context.execute("SELECT COUNT(*) as total FROM sales_data")
    print("Total records:", result.iloc[0]['total'])

    # List available tables
    tables = context.get_table_names()
    print("Available tables:", tables)

    # Get detailed table info
    table_info = context.get_table_schema("sales_data")
    print("Columns:", [col['name'] for col in table_info['columns']])

    context.close()

asyncio.run(explore_data())
LLM Agent API¶
Use AI capabilities in your code:
import asyncio
from parquetframe.datacontext import DataContextFactory
from parquetframe.ai import LLMAgent

async def ai_analysis():
    # Setup context and agent
    context = DataContextFactory.create_from_path("./ecommerce_data/")
    agent = LLMAgent(
        model_name="llama3.2",
        max_retries=2,
        use_multi_step=True,  # For complex schemas
        temperature=0.1,      # More deterministic
    )
    await context.initialize()

    # Natural language queries
    queries = [
        "What are the top 5 products by revenue?",
        "Show me monthly sales trends",
        "Which customers haven't ordered in the last 6 months?",
        "What's the average order value by region?",
    ]

    for question in queries:
        print(f"\nQuestion: {question}")
        result = await agent.generate_query(question, context)

        if result.success:
            print(f"Generated SQL: {result.query}")
            print(f"Rows returned: {len(result.result)}")
            print(f"Execution time: {result.execution_time_ms:.2f}ms")
            print(f"Attempts: {result.attempts}")
        else:
            print(f"Failed: {result.error}")

    context.close()

asyncio.run(ai_analysis())
Custom Examples and Training¶
Improve AI performance with domain-specific examples:
from parquetframe.ai import LLMAgent

agent = LLMAgent()

# Add custom examples for better performance
agent.add_custom_example(
    question="show me high value customers",
    sql="SELECT customer_id, total_spent FROM customers WHERE total_spent > 10000 ORDER BY total_spent DESC"
)

agent.add_custom_example(
    question="monthly recurring revenue",
    sql="SELECT DATE_TRUNC('month', subscription_date) as month, SUM(monthly_fee) as mrr FROM subscriptions GROUP BY month ORDER BY month"
)

# Use different models for different purposes
code_agent = LLMAgent(model_name="codellama")    # Better for SQL
general_agent = LLMAgent(model_name="llama3.2")  # General purpose
Advanced Features¶
Multi-Step Reasoning¶
For databases with many tables, enable multi-step reasoning:
agent = LLMAgent(use_multi_step=True)
# The agent will:
# 1. First select relevant tables for the question
# 2. Then generate SQL using only those tables
# This reduces context size and improves accuracy
Self-Correction¶
The LLM agent automatically attempts to fix failed queries:
agent = LLMAgent(max_retries=3) # Try up to 3 corrections
# If a query fails:
# 1. Error message is sent back to the LLM
# 2. LLM generates a corrected query
# 3. Process repeats up to max_retries times
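The correction cycle is easy to picture. Below is a minimal, self-contained sketch of the idea, not ParquetFrame's actual implementation: a stand-in `generate_sql` function plays the LLM, and each failed execution's error message is fed back into the next attempt.

```python
# Illustrative sketch of LLM self-correction -- NOT ParquetFrame internals.
def self_correcting_query(question, execute, generate_sql, max_retries=3):
    """Ask the 'LLM' for SQL, retrying with error feedback on failure."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        sql = generate_sql(question, feedback)
        try:
            return execute(sql), attempt
        except Exception as exc:
            feedback = str(exc)  # the error goes back into the next prompt
    raise RuntimeError(f"No valid query after {max_retries} attempts: {feedback}")

# Toy stand-ins: the "LLM" fixes a bad column name once it sees the error.
def fake_llm(question, feedback):
    return "SELECT amt FROM sales" if feedback is None else "SELECT amount FROM sales"

def fake_execute(sql):
    if "amt" in sql:
        raise ValueError("column 'amt' does not exist")
    return [{"amount": 125432.78}]

rows, attempts = self_correcting_query("total sales?", fake_execute, fake_llm)
print(rows, attempts)  # second attempt succeeds
```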
Session Management¶
Save and restore interactive sessions:
# Save current session
pframe:parquet🤖> \save-session customer_analysis

# Later, in a new session
pframe:parquet🤖> \load-session customer_analysis
📂 Loaded session: 15 queries in history

pframe:parquet🤖> \history
# Shows all previous queries
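The `.pkl` extension in the transcript suggests sessions are pickled to disk. As a rough illustration of that idea (the real on-disk format is not documented here), saving and restoring a query history could look like:

```python
import os
import pickle
import tempfile

# Hypothetical session state: just the query history from the transcript
session = {"history": ["SELECT SUM(amount) FROM sales_data",
                       "SELECT product_id FROM sales_data LIMIT 5"]}

path = os.path.join(tempfile.mkdtemp(), "monthly_analysis.pkl")
with open(path, "wb") as f:
    pickle.dump(session, f)       # roughly what \save-session does

with open(path, "rb") as f:
    restored = pickle.load(f)     # roughly what \load-session does

print(f"Loaded session: {len(restored['history'])} queries in history")
```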
Configuration¶
Model Selection¶
List and choose from available models:
agent = LLMAgent()
available = agent.get_available_models()
print("Available models:", available)
# Switch models
agent.set_model("codellama")
Performance Tuning¶
# Faster, less creative
agent = LLMAgent(temperature=0.0)
# More creative, potentially less accurate
agent = LLMAgent(temperature=0.3)
# Disable multi-step for simple schemas
agent = LLMAgent(use_multi_step=False)
# More aggressive error correction
agent = LLMAgent(max_retries=5)
Best Practices¶
1. Model Selection¶
- codellama: Best for SQL generation and complex queries
- llama3.2: Good general-purpose model
- llama2: Fallback for older systems
2. Query Optimization¶
- Use specific column names in questions: "show customer names and emails" vs "show customer data"
- Include time ranges: "sales last month" vs "recent sales"
- Be specific about sorting: "top 10 by revenue" vs "best products"
3. Schema Design¶
- Use descriptive table and column names
- Include comments in CREATE TABLE statements when possible
- Keep related data in the same table when feasible
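To see why naming matters, compare a schema the LLM can reason about with one it cannot. The sqlite3 snippet below is purely illustrative (ParquetFrame itself is not involved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Descriptive names give the LLM the context it needs...
conn.execute("""
    CREATE TABLE customer_orders (
        order_id       INTEGER PRIMARY KEY,
        customer_email TEXT,
        order_total    REAL,   -- total in USD
        placed_at      TEXT    -- ISO-8601 timestamp
    )
""")

# ...whereas cryptic names like these force it to guess:
conn.execute("CREATE TABLE t1 (c1 INTEGER, c2 TEXT, c3 REAL, c4 TEXT)")

cols = [row[1] for row in conn.execute("PRAGMA table_info(customer_orders)")]
print(cols)  # ['order_id', 'customer_email', 'order_total', 'placed_at']
```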
4. Error Handling¶
result = await agent.generate_query(question, context)

if not result.success:
    if "column" in result.error.lower():
        print("Hint: Check column names with \\describe table_name")
    elif "table" in result.error.lower():
        print("Hint: Check table names with \\list")
Troubleshooting¶
Common Issues¶
- "AI functionality not available"
  - Make sure Ollama is installed and running (ollama serve)
  - Pull at least one model (ollama pull llama3.2)
- "No tables found"
  - Check file permissions
  - Verify parquet files are valid
  - Use \list to see discovered tables
- "Database connection failed"
  - Verify connection string format
  - Check credentials and network access
  - Test connection with a simple client first
- "Query execution failed"
  - Use \describe table_name to verify schema
  - Check generated SQL for syntax errors
  - Enable debug logging for more details
Debug Mode¶
import logging
logging.basicConfig(level=logging.DEBUG)
# Now you'll see detailed logs of:
# - Schema discovery process
# - LLM prompts and responses
# - Query execution details
# - Error messages and retries
Examples¶
E-commerce Analysis¶
pframe:parquet🤖> \ai what's our monthly recurring revenue trend?
pframe:parquet🤖> \ai which products have the highest return rate?
pframe:parquet🤖> \ai show me customer cohort analysis for Q4
Log Analysis¶
pframe:parquet🤖> \ai what are the top 10 most visited pages?
pframe:parquet🤖> \ai show me error rate by hour of day
pframe:parquet🤖> \ai which user agents are most common?
Financial Data¶
pframe:database🤖> \ai what's the portfolio performance this quarter?
pframe:database🤖> \ai show me the most volatile stocks
pframe:database🤖> \ai calculate risk-adjusted returns by sector
Ready to explore your data with AI? Start with pframe interactive --help! 🚀