Graph Engine¶
ParquetFrame's graph engine provides high-performance graph processing with support for the Apache GraphAr columnar format. It enables efficient storage, loading, and analysis of large-scale graph data through familiar pandas/Dask APIs.
Overview¶
The graph engine implements a three-layer architecture (sketched in the example after this list):
- GraphFrame: High-level graph interface with vertex/edge access and algorithms
- VertexSet/EdgeSet: Typed collections with schema validation and property access
- Adjacency Structures: Optimized CSR/CSC representations for graph traversal
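For orientation, here is how the three layers compose in practice, using the same calls documented throughout this page:

import parquetframe as pf
from parquetframe.graph.adjacency import CSRAdjacency

# Layer 1: GraphFrame is the single entry point
graph = pf.read_graph("social_network/")

# Layer 2: typed vertex/edge collections
users = graph.vertices         # VertexSet
friendships = graph.edges      # EdgeSet

# Layer 3: adjacency structure built from the edges for fast traversal
csr = CSRAdjacency.from_edge_set(friendships)
print(csr.neighbors(123))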
Key Features¶
- Apache GraphAr Support: Full compliance with GraphAr columnar format specification
- Intelligent Backend Selection: Automatic pandas/Dask switching based on graph size
- Schema Validation: Type checking and property validation for graph data
- Adjacency Optimization: Memory-efficient CSR/CSC structures for fast neighbor lookups
- CLI Integration: Command-line tools for graph inspection and analysis
Quick Start¶
Basic Graph Loading¶
import parquetframe as pf
# Load GraphAr format graph
graph = pf.read_graph("social_network/")
print(f"Loaded graph: {graph.num_vertices} vertices, {graph.num_edges} edges")
# Access vertex and edge data
users = graph.vertices
friendships = graph.edges
# Query graph data
active_users = users.data.query("status == 'active'")
recent_connections = friendships.data.query("timestamp > '2024-01-01'")
Graph Inspection with CLI¶
# Inspect graph properties
pf graph info social_network/
# With detailed statistics
pf graph info social_network/ --detailed --format json
# Select backend
pf graph info large_graph/ --backend dask
Working with Adjacency Structures¶
from parquetframe.graph.adjacency import CSRAdjacency
# Create adjacency structure from edges
csr = CSRAdjacency.from_edge_set(graph.edges)
# Fast neighbor lookups
neighbors = csr.neighbors(123)
degree = csr.degree(123)
# Check edge existence
has_connection = csr.has_edge(source=123, target=456)
Graph Data Structure¶
GraphFrame¶
GraphFrame is the central graph interface, providing:
- Unified Access: Single entry point for vertices, edges, and metadata
- Property Inspection: Schema-aware property access and validation
- Backend Flexibility: Seamless pandas/Dask switching
- Algorithm Integration: Foundation for graph algorithms and analytics
# Graph properties
print(f"Directed: {graph.is_directed}")
print(f"Vertex properties: {graph.vertex_properties}")
print(f"Edge properties: {graph.edge_properties}")
# Backend management
if graph.vertices.islazy:
    print("Using Dask backend for large graph")
else:
    print("Using pandas backend for efficient processing")
VertexSet and EdgeSet¶
VertexSet and EdgeSet are strongly typed collections with:
- Schema Validation: Automatic type checking and property validation
- Flexible Loading: Support for multiple parquet files and directories
- Property Access: Named property columns with metadata
- Query Interface: Standard DataFrame operations
from parquetframe.graph import VertexSet, EdgeSet  # import path may differ in your version

# Vertex operations
person_vertices = VertexSet.from_parquet("vertices/person/")
print(f"Loaded {len(person_vertices)} people")
adults = person_vertices.data.query("age >= 18")
# Edge operations
friendship_edges = EdgeSet.from_parquet("edges/friendship/")
strong_ties = friendship_edges.data.query("weight > 0.8")
GraphAr Format Support¶
Full compliance with the Apache GraphAr specification:
Directory Structure¶
social_network/
├── _metadata.yaml              # Graph metadata (name, directed, etc.)
├── _schema.yaml                # Property schemas and types
├── vertices/                   # Vertex data by type
│   ├── person/
│   │   └── part0.parquet
│   └── organization/
│       └── part0.parquet
└── edges/                      # Edge data by type
    ├── friendship/
    │   └── part0.parquet
    └── employment/
        └── part0.parquet
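pf.read_graph handles this layout end to end. Purely as an illustration of the pieces involved, the sketch below reads the same files by hand with PyYAML and pandas; it is not the library's actual loader:

import yaml
import pandas as pd

# Read graph-level metadata by hand (read_graph does this internally)
with open("social_network/_metadata.yaml") as f:
    metadata = yaml.safe_load(f)
print(metadata["name"], "directed:", metadata["directed"])

# Each vertex/edge type directory holds one or more parquet parts;
# pandas (with the pyarrow engine) reads a directory of parts directly
people = pd.read_parquet("social_network/vertices/person/")
friendships = pd.read_parquet("social_network/edges/friendship/")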
Metadata Format¶
# _metadata.yaml
name: "social_network"
version: "1.0"
directed: true
description: "Social network graph with people and organizations"
Schema Format¶
# _schema.yaml
version: "1.0"
vertices:
  person:
    properties:
      id: {type: "int64", primary: true}
      name: {type: "string"}
      age: {type: "int32"}
edges:
  friendship:
    properties:
      src: {type: "int64", source: true}
      dst: {type: "int64", target: true}
      weight: {type: "float64"}
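Each property entry maps a column name to an Arrow-style type plus optional role flags (primary, source, target). As a rough sketch of what enforcement against the friendship schema above could look like, check_schema is a hypothetical helper, not the library's validator:

import pandas as pd

# Expected columns and dtypes, mirroring the friendship schema above
EXPECTED = {"src": "int64", "dst": "int64", "weight": "float64"}

def check_schema(df: pd.DataFrame, expected: dict) -> None:
    # Verify every declared property exists with the declared dtype
    for column, dtype in expected.items():
        if column not in df.columns:
            raise ValueError(f"missing property column: {column}")
        if str(df[column].dtype) != dtype:
            raise ValueError(f"{column}: expected {dtype}, got {df[column].dtype}")

check_schema(pd.read_parquet("social_network/edges/friendship/"), EXPECTED)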
Performance and Optimization¶
Backend Selection¶
The graph engine automatically selects the optimal backend:
# Automatic selection based on file size
small_graph = pf.read_graph("small_network/") # Uses pandas
large_graph = pf.read_graph("web_graph/") # Uses Dask
# Manual backend control
fast_graph = pf.read_graph("data/", islazy=False) # Force pandas
scalable_graph = pf.read_graph("data/", islazy=True) # Force Dask
# Custom thresholds
graph = pf.read_graph("data/", threshold_mb=50) # Dask if >50MB
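Conceptually, the size heuristic amounts to comparing the on-disk footprint of the parquet parts against the threshold. A minimal sketch of that decision (pick_backend is illustrative, not the engine's actual function):

from pathlib import Path

def pick_backend(path: str, threshold_mb: float) -> str:
    # Sum the on-disk size of every parquet part under the graph directory
    total_bytes = sum(p.stat().st_size for p in Path(path).rglob("*.parquet"))
    return "dask" if total_bytes > threshold_mb * 1024 * 1024 else "pandas"

print(pick_backend("social_network/", threshold_mb=50))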
Memory-Efficient Adjacency¶
CSR/CSC structures trade a one-time build cost for compact storage and fast lookups:
from parquetframe.graph.adjacency import CSRAdjacency, CSCAdjacency
# Out-degree optimized (CSR)
csr = CSRAdjacency.from_edge_set(edges)
neighbors = csr.neighbors(vertex_id) # O(degree) lookup
out_degree = csr.degree(vertex_id) # O(1) lookup
# In-degree optimized (CSC)
csc = CSCAdjacency.from_edge_set(edges)
predecessors = csc.predecessors(vertex_id) # O(in_degree) lookup
in_degree = csc.degree(vertex_id) # O(1) lookup
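The complexity claims follow from the layout: CSR keeps one offsets array of length num_vertices + 1 and one packed neighbor array, so degree is a subtraction of two offsets and neighbor lookup is a contiguous slice. An illustrative NumPy construction (not the library's internal representation):

import numpy as np

# Edges of a small directed graph as (src, dst) pairs
src = np.array([0, 0, 1, 2, 2, 2])
dst = np.array([1, 2, 2, 0, 1, 3])

num_vertices = 4
order = np.argsort(src, kind="stable")          # group edges by source
indices = dst[order]                            # packed neighbor array
offsets = np.zeros(num_vertices + 1, dtype=np.int64)
np.add.at(offsets, src + 1, 1)                  # count edges per source
offsets = np.cumsum(offsets)                    # offsets[v]..offsets[v+1] = v's edges

def neighbors(v):
    return indices[offsets[v]:offsets[v + 1]]   # O(degree) slice

def degree(v):
    return offsets[v + 1] - offsets[v]          # O(1) subtraction

print(neighbors(2), degree(2))                  # [0 1 3] 3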
Advanced Features¶
Schema Validation¶
Comprehensive type checking and validation:
# Enable validation (default)
graph = pf.read_graph("data/", validate_schema=True)
# Disable for performance
graph = pf.read_graph("large_data/", validate_schema=False)
# Custom property validation (custom_schema is defined by the caller)
vertices = VertexSet.from_parquet(
    "vertices/user/",
    properties={"age": "int32", "name": "string"},
    schema=custom_schema,
)
Error Handling¶
Robust error handling for invalid GraphAr data:
from parquetframe.graph import GraphArError, GraphArValidationError  # import path may differ

try:
    graph = pf.read_graph("invalid_data/")
except GraphArValidationError as e:
    # Catch the more specific error first (in case it subclasses GraphArError)
    print(f"Schema validation failed: {e}")
except GraphArError as e:
    print(f"GraphAr format error: {e}")
except FileNotFoundError as e:
    print(f"Directory not found: {e}")
Examples and Use Cases¶
Social Network Analysis¶
import pandas as pd

# Load social network
social_graph = pf.read_graph("social_network/")

# Find influential users (high degree)
csr = CSRAdjacency.from_edge_set(social_graph.edges)
degrees = [csr.degree(i) for i in range(social_graph.num_vertices)]
influential_users = social_graph.vertices.data[
    social_graph.vertices.data.index.isin(
        pd.Series(degrees).nlargest(10).index  # positions of the 10 highest degrees
    )
]
# Analyze connection patterns
friendship_weights = social_graph.edges.data['weight']
print(f"Average friendship strength: {friendship_weights.mean():.2f}")
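When you only need degrees in bulk, counting directly on the edge DataFrame avoids per-vertex calls. A sketch assuming the friendship edges carry the src column declared in the schema format above:

# Bulk out-degree via the edge frame (assumes a 'src' column as in _schema.yaml)
out_degrees = social_graph.edges.data["src"].value_counts()
top_10_ids = out_degrees.nlargest(10).index
print(top_10_ids.tolist())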
Knowledge Graph Processing¶
# Load knowledge graph
knowledge_graph = pf.read_graph("knowledge_base/")
# Query entity relationships
entities = knowledge_graph.vertices.data
relations = knowledge_graph.edges.data
# Find entities with specific properties
organizations = entities.query("type == 'organization'")
strong_relations = relations.query("confidence > 0.9")
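Endpoint columns can be resolved back to entity rows with an ordinary merge. A sketch assuming src/dst edge columns and id/name vertex columns, as in the schema example above:

# Attach source-entity names to each high-confidence relation
# (assumes 'src'/'dst' edge columns and 'id'/'name' vertex columns per the schema format)
annotated = strong_relations.merge(
    entities[["id", "name"]], left_on="src", right_on="id", how="left"
)
print(annotated.head())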
Web Graph Analysis¶
# Load web graph (uses Dask automatically for large data)
web_graph = pf.read_graph("web_crawl/", threshold_mb=10)
# Analyze link structure
if web_graph.vertices.islazy:
    print("Processing large web graph with Dask")
    page_counts = web_graph.vertices.data.groupby('domain').size().compute()
else:
    page_counts = web_graph.vertices.data.groupby('domain').size()
print(f"Top domains: {page_counts.nlargest(5)}")
API Reference¶
- GraphFrame API - Main graph interface
- VertexSet/EdgeSet API - Typed graph collections
- Adjacency API - CSR/CSC adjacency structures
- GraphAr Reader API - Apache GraphAr format support
- CLI Reference - Command-line tools