Graph Processing¶
ParquetFrame provides comprehensive graph processing with native support for the GraphAr format, high-performance algorithms, and integration with popular graph libraries. Process million-node graphs efficiently with Rust-accelerated algorithms delivering 15-27x speedups (see the benchmark table below).
Overview¶
The graph processing system offers:
- 📊 GraphAr Format Support - Industry-standard columnar graph storage
- 🚀 Rust-Accelerated Algorithms - BFS (17x), PageRank (24x), shortest paths
- 🔄 Bidirectional Conversion - NetworkX, igraph, pandas DataFrames
- 💾 Efficient Storage - Compressed Parquet-based graph representation
- 📈 Scalable Operations - Handle million-node graphs with ease
- 🎯 Rich Algorithm Library - Traversal, centrality, community detection
Quick Start¶
Creating a Graph¶
import parquetframe as pf
import pandas as pd
# From edge list DataFrame
edges = pd.DataFrame({
'src': [0, 1, 2, 0],
'dst': [1, 2, 3, 3],
'weight': [1.0, 2.0, 1.5, 3.0]
})
graph = pf.GraphFrame.from_edges(edges, src='src', dst='dst')
# From GraphAr directory
graph = pf.GraphFrame.from_graphar("social_network/")
# From NetworkX
import networkx as nx
nx_graph = nx.karate_club_graph()
graph = pf.GraphFrame.from_networkx(nx_graph)
Running Algorithms¶
# PageRank (24x faster with Rust)
scores = graph.pagerank(alpha=0.85, max_iter=100)
print(f"Top node: {scores.idxmax()} with score {scores.max():.4f}")
# BFS traversal (17x faster with Rust)
distances = graph.bfs(source=0)
print(f"Reachable nodes: {(distances >= 0).sum()}")
# Shortest paths
paths = graph.shortest_paths(source=0, method='dijkstra')
# Connected components
components = graph.connected_components()
print(f"Found {components.nunique()} components")
Key Features¶
GraphAr Format¶
Industry-standard columnar graph storage format:
- Efficient Storage: Parquet-based compression
- Fast Loading: Selective column reading
- Metadata Support: Rich graph schema information
- Interoperability: Compatible with Apache Arrow ecosystem
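Because each chunk is plain Parquet, you can also inspect it directly with PyArrow to take advantage of selective column reading. This is a minimal sketch; the chunk path below is an assumption about the on-disk layout, not a documented file name.
import pyarrow.parquet as pq
# Selective column reading: pull only src/dst from an edge chunk,
# skipping 'weight' and any other edge properties on disk
edges_table = pq.read_table(
    "social_network/edges/part-0.parquet",  # illustrative path
    columns=["src", "dst"],
)
print(edges_table.num_rows, edges_table.schema)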
Rust Acceleration¶
High-performance graph algorithms:
| Algorithm | Speedup | Scale |
|---|---|---|
| PageRank | 24x | 100K+ nodes |
| BFS | 17x | 1M+ nodes |
| Shortest Paths | 18x | 500K+ edges |
| Connected Components | 27x | 1M+ edges |
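To measure the speedup on your own hardware, time an algorithm directly; this minimal sketch reuses the pagerank call from the Quick Start.
import time
start = time.perf_counter()
scores = graph.pagerank(alpha=0.85, max_iter=100)
elapsed = time.perf_counter() - start
print(f"PageRank over {graph.num_vertices} vertices: {elapsed:.2f}s")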
Algorithm Library¶
**Traversal Algorithms**
- Breadth-First Search (BFS)
- Depth-First Search (DFS)
- Single-source shortest paths (Dijkstra, Bellman-Ford)

**Centrality Measures**
- PageRank
- Betweenness centrality
- Closeness centrality
- Degree centrality

**Community Detection**
- Connected components
- Strongly connected components
- Label propagation

**Graph Properties**
- Degree distribution (pandas sketch below)
- Clustering coefficient
- Graph diameter
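As a concrete example, the degree distribution can be derived straight from the edge table; this sketch assumes graph.edges is a pandas DataFrame with src/dst columns, as in the integration examples later on this page.
# Per-vertex out- and in-degree from the edge list
out_deg = graph.edges['src'].value_counts()
in_deg = graph.edges['dst'].value_counts()
degree = out_deg.add(in_deg, fill_value=0).astype(int)
# Distribution: number of vertices at each total degree
print(degree.value_counts().sort_index())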
Pages in This Section¶
| Page | Description |
|---|---|
| Tutorial | Step-by-step guide to graph processing |
| GraphAr Format | GraphAr specification and usage |
| GraphAr Migration | Migrating to GraphAr from other formats |
| CLI Usage | Command-line tools for graph operations |
Common Use Cases¶
Social Network Analysis¶
Analyze relationships and influence in social networks:
# Load social graph
graph = pf.GraphFrame.from_graphar("social_network/")
# Find influential users
influence = graph.pagerank()
top_influencers = influence.nlargest(10)
# Community detection
communities = graph.connected_components()
Knowledge Graph Processing¶
Query and analyze knowledge graphs:
# Load knowledge graph with properties
graph = pf.GraphFrame.from_graphar("knowledge_base/")
# Find related concepts
related = graph.bfs(source="Python", max_depth=2)
# Rank entities by importance
rankings = graph.pagerank(personalization={"AI": 1.0})
Dependency Analysis¶
Analyze software dependencies and build graphs:
# Create dependency graph
edges = pd.DataFrame({
'package': ['A', 'A', 'B', 'C'],
'depends_on': ['B', 'C', 'D', 'D']
})
graph = pf.GraphFrame.from_edges(edges, src='package', dst='depends_on')
# Find transitive dependencies
all_deps = graph.bfs(source='A')
# Detect circular dependencies
cycles = graph.find_cycles()
Performance Guidelines¶
When to Use Rust Acceleration¶
✅ Recommended for:
- Graphs with 10K+ nodes
- Iterative algorithms (PageRank, label propagation)
- Repeated traversals
- Production workloads

⚠️ May not benefit:
- Very small graphs (<1,000 nodes)
- One-time operations
- Development/prototyping
Memory Optimization¶
# Use efficient data types
edges = pd.DataFrame({
'src': pd.array([...], dtype='int32'), # Not int64
'dst': pd.array([...], dtype='int32'),
'weight': pd.array([...], dtype='float32') # Not float64
})
# Load only required columns
graph = pf.GraphFrame.from_graphar(
"large_graph/",
vertex_columns=['id', 'label'],
edge_columns=['src', 'dst', 'weight']
)
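To confirm the dtype savings from the first snippet, inspect the frame's footprint once the [...] placeholders are filled with real data:
# 32-bit columns use half the memory of the 64-bit pandas defaults
print(edges.memory_usage(deep=True))
print(f"Total: {edges.memory_usage(deep=True).sum() / 1e6:.1f} MB")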
Integration with Other Libraries¶
NetworkX¶
# ParquetFrame → NetworkX
graph = pf.GraphFrame.from_graphar("data/")
nx_graph = graph.to_networkx()
# NetworkX → ParquetFrame
graph = pf.GraphFrame.from_networkx(nx_graph)
igraph¶
# ParquetFrame → igraph
import igraph as ig
edges = graph.edges
ig_graph = ig.Graph.TupleList(
edges=zip(edges['src'], edges['dst']),
directed=True
)
PyTorch Geometric¶
# For GNN workflows
import torch
from torch_geometric.data import Data
# Convert to PyG format
edge_index = torch.tensor(
[graph.edges['src'].values, graph.edges['dst'].values],
dtype=torch.long
)
data = Data(edge_index=edge_index, num_nodes=graph.num_vertices)
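If your edges carry a weight column (as in the Quick Start example), a standard PyG pattern is to attach it as edge_attr:
# Attach edge weights as a [num_edges, 1] feature tensor
edge_attr = torch.tensor(
    graph.edges['weight'].values, dtype=torch.float
).unsqueeze(1)
data = Data(edge_index=edge_index, edge_attr=edge_attr,
            num_nodes=graph.num_vertices)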
Related Categories¶
- Rust Acceleration - High-performance graph algorithms
- Core Features - DataFrame operations for graph data
- Analytics & Statistics - Graph metrics and analysis
- YAML Workflows - Graph processing pipelines
Advanced Topics¶
Custom Algorithms¶
Implement custom graph algorithms:
from collections import deque

def custom_traversal(graph, start_node, condition):
    """Custom BFS that only expands neighbors satisfying a predicate."""
    visited = set()
    queue = deque([start_node])
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        if node in visited:
            continue
        visited.add(node)
        # Enqueue neighbors that pass the custom condition
        for neighbor in graph.neighbors(node):
            if condition(graph, neighbor):
                queue.append(neighbor)
    return visited
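Usage with a toy predicate (any callable taking (graph, node) works):
# Traverse from node 0, expanding only even-numbered neighbors
reachable = custom_traversal(graph, 0, lambda g, n: n % 2 == 0)
print(f"Visited {len(reachable)} nodes")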
Distributed Processing¶
For very large graphs (billions of edges):
# Partition graph by vertex ranges
global_scores = {}
for partition_id in range(num_partitions):  # num_partitions set by your partitioning scheme
    partition = graph.get_partition(partition_id)
    # Process each partition independently
    local_scores = partition.pagerank()
    # Aggregate partial results into the global view
    global_scores.update(local_scores)
Next Steps¶
- New to graphs? Start with the Tutorial
- Using GraphAr? See GraphAr Format Guide
- Need performance? Check Rust Acceleration
- Real examples? Browse Examples Gallery
API Reference¶
For detailed API documentation, see:
- `parquetframe.GraphFrame` - Main graph class
- `parquetframe.graph.algorithms` - Algorithm implementations
- `parquetframe.graph.graphar` - GraphAr format support
- `parquetframe.graph_rust` - Rust-accelerated functions
Error Handling¶
Common issues when working with graphs:
- Schema Mismatch: Ensure your Parquet files match `_schema.yaml`.
- Missing Metadata: `_metadata.yaml` is required in the graph directory.
- Memory Errors: Use `islazy=True` (Dask) for graphs larger than memory.
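A quick pre-flight check for the first two issues; a minimal sketch assuming the file names above:
from pathlib import Path
graph_dir = Path("social_network/")
for required in ("_metadata.yaml", "_schema.yaml"):
    if not (graph_dir / required).exists():
        raise FileNotFoundError(f"{required} missing from {graph_dir}")
graph = pf.GraphFrame.from_graphar(str(graph_dir))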