I/O Fast-Paths¶
Overview¶
ParquetFrame's Rust I/O fast-paths provide 5-10x performance improvements over pure Python implementations for reading Parquet, CSV, and Avro files. These fast-paths are automatically utilized when the Rust backend is available, with transparent fallback to PyArrow/pandas for robustness. This ensures that data loading and metadata operations are as efficient as possible, especially for large datasets.
Key Features¶
- Footer-Only Parquet Reads: Extract metadata (schema, row count, column statistics) instantly without loading any data, leading to 10-20x faster operations.
- Parallel CSV Parsing: Multi-threaded chunking and parsing of large CSV files, achieving 4-7x speedups.
- Zero-Copy Arrow Integration: Minimizes data movement between Rust and Python by leveraging Apache Arrow's in-memory columnar format.
- Memory-Mapped I/O: Efficiently handles very large files by mapping them directly into memory, reducing memory footprint.
- Automatic Format Detection: Intelligently detects file formats and routes each read to the appropriate fast-path for optimal performance (see the sketch after this list).
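To make the detection step concrete, here is a minimal sketch, not the actual backend code: it sniffs the leading magic bytes of a file (PAR1 for Parquet, Obj\x01 for Avro object container files) and falls back to CSV, which has no magic bytes. The FileFormat enum and detect_format function are illustrative names, not part of the ParquetFrame API.
use std::fs::File;
use std::io::Read;
#[derive(Debug, PartialEq)]
enum FileFormat {
    Parquet,
    Avro,
    Csv, // no magic bytes; assumed when nothing else matches
}
// Sniff the first four bytes: Parquet files start with b"PAR1",
// Avro object container files start with b"Obj\x01".
fn detect_format(path: &str) -> std::io::Result<FileFormat> {
    let mut magic = [0u8; 4];
    let mut file = File::open(path)?;
    let n = file.read(&mut magic)?;
    Ok(match &magic[..n] {
        m if m.starts_with(b"PAR1") => FileFormat::Parquet,
        m if m.starts_with(b"Obj\x01") => FileFormat::Avro,
        _ => FileFormat::Csv,
    })
}
fn main() -> std::io::Result<()> {
    println!("{:?}", detect_format("data.parquet")?);
    Ok(())
}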
Parquet Fast-Path¶
Metadata-Only Operations¶
The Rust backend can read Parquet metadata almost instantly using footer-only reads. This is crucial for quickly inspecting large files without incurring the cost of full data loading.
import parquetframe as pf
# Metadata extraction (Rust fast-path is automatically used)
metadata = pf.read_parquet_metadata("large_file.parquet")
print(f"Rows: {metadata['num_rows']:,}")
print(f"Columns: {metadata['columns']}")
print(f"File size: {metadata['file_size_bytes']:,} bytes")
print(f"Row groups: {metadata['num_row_groups']}")
# Column statistics are also available
for col in metadata['column_stats']:
print(f"{col['name']}: {col['null_count']} nulls, "
f"min={col['min']}, max={col['max']}")
Performance Comparison (Illustrative for 1GB Parquet file):
| Operation | Python (PyArrow) | Rust (Footer Read) | Speedup |
|---|---|---|---|
| Read metadata | 1,200ms | 45ms | 26.7x |
Row Count Fast-Path¶
Quickly get the total number of rows in a Parquet file or a directory of Parquet files without loading any data. This is significantly faster than loading the entire DataFrame and calling len().
import parquetframe as pf
# Instant row count (Rust fast-path)
row_count = pf.get_row_count("data.parquet")
print(f"Total rows: {row_count:,}")
# Also works with multi-file datasets, e.g. a glob over a directory of Parquet files
row_count_dir = pf.get_row_count("data_dir/*.parquet")
print(f"Total rows across all matched files: {row_count_dir:,}")
Performance Comparison (Illustrative for 10GB Parquet file):
| Operation | Python (PyArrow) | Rust (Metadata Read) | Speedup |
|---|---|---|---|
| Row count | 15,000ms | 180ms | 83.3x |
Column Name and Schema Extraction¶
Retrieve column names and their types instantly, which is useful for schema validation or dynamic query generation.
import parquetframe as pf
# Get column names instantly
columns = pf.get_column_names("data.parquet")
print(f"Columns: {columns}")
# Get detailed column information with types
columns_with_types = pf.get_column_info("data.parquet")
for col in columns_with_types:
print(f"- {col['name']}: {col['type']} (nullable={col['nullable']})")
Full Parquet Read¶
When reading a full Parquet file, the Rust fast-path returns Arrow IPC bytes which are reconstructed to a pyarrow.Table on the Python side. You can convert to pandas/Polars as needed.
import parquetframe as pf
from parquetframe.io_rust import RustIOEngine
eng = RustIOEngine()
# Returns pyarrow.Table for maximum flexibility
table = eng.read_parquet("data.parquet")
# Convert to pandas if desired
pdf = table.to_pandas()
Rust Implementation Details (Simplified):
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;
pub fn read_parquet_metadata(path: &str) -> Result<ParquetMetadata> {
    let file = File::open(path)?;
    let reader = SerializedFileReader::new(file)?;
    // Access footer metadata only (no data pages are read)
    let metadata = reader.metadata();
    Ok(ParquetMetadata {
        num_rows: metadata.file_metadata().num_rows(),
        num_row_groups: metadata.num_row_groups(),
        columns: extract_column_info(metadata),
        file_size_bytes: std::fs::metadata(path)?.len(),
    })
}
CSV Fast-Path¶
Parallel CSV Reading¶
The Rust CSV parser employs parallel chunking and processing for large CSV files, significantly reducing parsing times compared to single-threaded Python implementations.
import parquetframe as pf
# Automatic parallel CSV parsing (Rust fast-path)
df = pf.read("large_data.csv")
# Configure parallel behavior for fine-grained control
df = pf.read(
"large_data.csv",
engine="rust", # Explicitly use Rust engine for CSV
csv_chunk_size=1024*1024, # Process in 1MB chunks
csv_parallel_threads=8, # Use 8 parsing threads
)
Performance Comparison (Illustrative for 500MB CSV file):
| Operation | Python (pandas) | Rust (8 threads) | Speedup |
|---|---|---|---|
| CSV Parsing | 8,500ms | 1,200ms | 7.1x |
CSV with Type Inference¶
The Rust CSV parser can also perform automatic type inference, efficiently determining the data types of columns.
import parquetframe as pf
# Rust fast-path with automatic type detection
df = pf.read("data.csv", infer_schema=True)
# Providing a manual schema can further accelerate parsing by skipping inference
df = pf.read(
"data.csv",
schema={
"id": "int64",
"name": "string",
"value": "float64",
"timestamp": "datetime64[ns]",
}
)
Memory-Mapped CSV¶
For extremely large CSV files, memory-mapped reading allows processing without loading the entire file into RAM, significantly reducing memory footprint.
import parquetframe as pf
# Memory-mapped reading (low memory footprint)
df = pf.read(
"huge_file.csv",
engine="rust",
memory_map=True,
batch_size=100000, # Process in batches of 100,000 rows
)
# Iterate through batches for processing
for batch in df.iter_batches():
    # Process each batch (e.g., write to database, perform aggregations)
    process(batch)
Rust Implementation (Simplified CSV Chunking):
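The snippet below is an illustrative sketch rather than the code shipped in the backend: it splits the input into roughly equal byte ranges, moves each boundary forward to the next newline so no record is cut in half, and processes the chunks on scoped threads. For brevity each thread only counts rows; the actual implementation parses each chunk into Arrow arrays and also handles quoting, headers, and error reporting.
use std::fs;
use std::thread;
// Split `data` into `n` chunks whose boundaries land on newlines,
// so every chunk contains only whole CSV records.
fn newline_aligned_chunks(data: &[u8], n: usize) -> Vec<(usize, usize)> {
    let approx = (data.len() / n.max(1)).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let mut end = (start + approx).min(data.len());
        // Advance the boundary to the next newline (or end of file).
        while end < data.len() && data[end] != b'\n' {
            end += 1;
        }
        if end < data.len() {
            end += 1; // include the newline itself
        }
        chunks.push((start, end));
        start = end;
    }
    chunks
}
fn main() -> std::io::Result<()> {
    let data = fs::read("large_data.csv")?;
    let chunks = newline_aligned_chunks(&data, 8); // e.g. one chunk per thread
    // Each chunk is handled on its own scoped thread. The real backend parses
    // records into Arrow arrays here; this sketch only counts rows.
    let total_rows: usize = thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&(start, end)| {
                let slice = &data[start..end];
                s.spawn(move || slice.iter().filter(|&&b| b == b'\n').count())
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    println!("approximate row count: {total_rows}");
    Ok(())
}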
Avro Fast-Path¶
ParquetFrame also provides Rust-accelerated reading for Avro files, a row-oriented data serialization format.
Avro Reading¶
import parquetframe as pf
# Rust fast-path for Avro reading
df = pf.read("data.avro")
# With schema validation and codec support
df = pf.read(
"data.avro",
validate_schema=True,
avro_codec="snappy", # Supports compressed Avro files
)
Avro Metadata¶
Quickly extract Avro schema and record counts without full data loading.
import parquetframe as pf
# Extract Avro schema
schema = pf.read_avro_schema("data.avro")
print(f"Schema: {schema}")
# Get record count
count = pf.get_avro_record_count("data.avro")
print(f"Records: {count:,}")
Benchmarks¶
Parquet Metadata Operations¶
| Operation | File Size | Python (PyArrow) | Rust (Fast-Path) | Speedup |
|---|---|---|---|---|
| Read metadata | 1GB | 1,200ms | 45ms | 26.7x |
| Row count | 10GB | 15,000ms | 180ms | 83.3x |
| Column names | 5GB | 800ms | 25ms | 32.0x |
| Statistics | 2GB | 2,500ms | 95ms | 26.3x |
CSV Parsing¶
| File Size | Columns | Python (pandas) | Rust (1 thread) | Rust (8 threads) | Speedup (vs Python) |
|---|---|---|---|---|---|
| 100MB | 10 | 1,800ms | 900ms | 250ms | 7.2x |
| 500MB | 20 | 8,500ms | 4,200ms | 1,200ms | 7.1x |
| 1GB | 50 | 18,000ms | 9,500ms | 2,800ms | 6.4x |
| 5GB | 100 | 95,000ms | 52,000ms | 15,000ms | 6.3x |
Avro Reading¶
| File Size | Codec | Python (fastavro) | Rust (Fast-Path) | Speedup |
|---|---|---|---|---|
| 200MB | None | 3,500ms | 850ms | 4.1x |
| 500MB | Snappy | 8,200ms | 1,900ms | 4.3x |
| 1GB | Deflate | 16,500ms | 3,800ms | 4.3x |
Configuration¶
Environment Variables¶
You can control the behavior of the I/O fast-paths using environment variables:
# Disable I/O fast-paths (e.g., for debugging or specific compatibility needs)
export PARQUETFRAME_DISABLE_RUST_IO=1
# Configure CSV parsing parameters
export PARQUETFRAME_CSV_CHUNK_SIZE=1048576 # Set chunk size for parallel CSV parsing (bytes)
export PARQUETFRAME_CSV_THREADS=8 # Set number of threads for parallel CSV parsing
# Enable Rust logging specifically for I/O operations
export RUST_LOG=parquetframe::io=debug
Programmatic Configuration¶
ParquetFrame also allows programmatic configuration of I/O behavior:
import parquetframe as pf
# Configure I/O behavior
pf.set_config(
rust_io_enabled=True, # Enable/disable Rust I/O fast-paths
csv_chunk_size=1024*1024, # 1MB CSV chunks
csv_parallel_threads=8, # 8 threads for CSV parsing
avro_validate_schema=True, # Enable Avro schema validation
)
API Reference¶
ParquetFrame provides a unified API for I/O operations, automatically leveraging the Rust fast-paths when available. For direct access to Rust I/O functionalities, you can use the io_backend module.
from parquetframe.io import io_backend
# Check if Rust I/O is available
is_available = io_backend.is_rust_available()
# Parquet operations
io_backend.read_parquet_metadata(path: str) -> dict
io_backend.get_row_count(path: str) -> int
io_backend.get_column_names(path: str) -> list[str]
io_backend.get_column_statistics(path: str) -> dict
# CSV operations
io_backend.read_csv_rust(
    path: str,
    chunk_size: int = 1024*1024,
    threads: int = None,  # Auto-detect CPU cores
    schema: dict = None,
    delimiter: str = ",",
) -> DataFrame
# Avro operations
io_backend.read_avro_schema(path: str) -> dict
io_backend.get_avro_record_count(path: str) -> int
io_backend.read_avro_rust(path: str, codec: str = None) -> DataFrame
Graceful Fallback¶
ParquetFrame's pf.read() function automatically handles fallback if the Rust backend is not available or an error occurs, ensuring your code remains robust.
import parquetframe as pf
import pandas as pd
# Using the unified API, which handles fallback automatically
df = pf.read("data.parquet") # Will use Rust fast-path if available, else PyArrow/pandas
# Manual fallback example (less common with pf.read())
try:
    # Attempt to use Rust I/O directly
    df = pf.io_backend.read_parquet_rust("data.parquet")
except RuntimeError:
    print("Rust I/O not available, falling back to pandas.")
    df = pd.read_parquet("data.parquet")
Best Practices¶
1. Use Metadata Operations for Quick Insights¶
Always use pf.read_parquet_metadata(), pf.get_row_count(), or pf.get_column_names() when you only need file information, not the data itself. This avoids loading large files into memory unnecessarily.
# ✅ Good: Check row count before deciding how to load
import parquetframe as pf
row_count = pf.get_row_count("data.parquet")
if row_count > 1_000_000:
    # For very large files, use Dask for out-of-core processing
    df = pf.read("data.parquet", islazy=True)  # islazy=True implies Dask
else:
    # For smaller files, pandas or Polars might be faster
    df = pf.read("data.parquet", islazy=False)  # islazy=False implies Pandas/Polars
# ❌ Bad: Load entire file just to check its size
df = pf.read("data.parquet")
if len(df) > 1_000_000:
    # Too late, the entire file is already loaded into memory
    pass
2. Leverage Parallel CSV Parsing¶
For large CSV files, let ParquetFrame's Rust backend handle parallel parsing automatically or configure it explicitly for optimal performance.
# ✅ Good: Let Rust parallelize automatically
df = pf.read("large.csv") # Rust auto-parallelizes if available
# ✅ Also good: Explicitly control parallel threads
df = pf.read("large.csv", csv_parallel_threads=8)
# ❌ Bad: Force single-threaded pandas for large CSVs
df = pd.read_csv("large.csv") # This will be single-threaded and slower
3. Batch Processing for Huge Files¶
When dealing with files that are too large to fit into memory even with memory-mapping, process them in smaller batches.
# ✅ Good: Process in manageable batches
for batch in pf.read_batches("huge.csv", batch_size=100000):
    # Perform operations on each batch
    process(batch)
# ❌ Bad: Attempt to load the entire file into memory
df = pf.read("huge.csv") # This may lead to out-of-memory errors
Troubleshooting¶
Rust Backend Not Available¶
If you suspect the Rust I/O fast-paths are not being used:
import parquetframe as pf
if not pf.io_backend.is_rust_available():
print("Rust I/O not available. Possible reasons:")
print("1. Rust extensions were not compiled during installation.")
print("2. The environment variable PARQUETFRAME_DISABLE_RUST_IO=1 is set.")
print("3. Binary incompatibility with your system.")
# Check your ParquetFrame installation details
print(f"ParquetFrame Version: {pf.__version__}")
print("Consider reinstalling with: pip install --force-reinstall parquetframe")
Performance Not Improving as Expected¶
To diagnose performance issues, enable detailed logging:
# Enable profiling and debug logs for ParquetFrame's I/O and performance modules
export RUST_LOG="parquetframe::io=debug,parquetframe::perf=trace"
python your_script.py
# Look for messages like: "Using Rust fast-path for Parquet read"
# This confirms the Rust backend is active.
Memory Issues with Large CSV¶
If you encounter out-of-memory errors with large CSV files:
import parquetframe as pf
# Option 1: Use memory-mapped reading with smaller batches
df = pf.read(
"large.csv",
memory_map=True,
batch_size=50000, # Reduce batch size further if needed
)
# Option 2: Use Dask for automatic chunking and out-of-core processing
df = pf.read("large.csv", islazy=True) # islazy=True implies Dask
Implementation Details¶
Parquet Footer Format¶
Parquet files store all critical metadata in a footer at the end of the file. The Rust implementation efficiently reads only this footer, avoiding the need to scan the entire file for metadata.
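As a concrete illustration (a minimal sketch, independent of the actual backend code), the footer can be located from just the last 8 bytes of the file: a 4-byte little-endian length of the Thrift-encoded FileMetaData followed by the PAR1 magic. Only the metadata bytes plus those 8 bytes are read, no matter how large the file is; decoding the Thrift payload is left to a Parquet library.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};
// Read only the Parquet footer: [FileMetaData (Thrift)] [u32 LE length] [b"PAR1"].
fn read_footer_bytes(path: &str) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let file_len = file.seek(SeekFrom::End(0))?;
    // The last 8 bytes are the 4-byte metadata length plus the 4-byte magic.
    let mut tail = [0u8; 8];
    file.seek(SeekFrom::End(-8))?;
    file.read_exact(&mut tail)?;
    assert_eq!(&tail[4..], b"PAR1", "not a Parquet file");
    let meta_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    assert!(meta_len + 8 <= file_len, "corrupt footer length");
    // Seek back over the magic, the length field, and the metadata itself.
    file.seek(SeekFrom::End(-(meta_len as i64) - 8))?;
    let mut footer = vec![0u8; meta_len as usize];
    file.read_exact(&mut footer)?;
    Ok(footer) // Thrift-encoded FileMetaData; decode with a Parquet library.
}
fn main() -> std::io::Result<()> {
    let footer = read_footer_bytes("data.parquet")?;
    println!("footer metadata size: {} bytes", footer.len());
    Ok(())
}
Because the footer is typically only kilobytes to a few megabytes, this cost is essentially independent of the data size, which is what the metadata benchmarks above reflect.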
CSV Chunking Strategy¶
For parallel CSV parsing, the Rust backend divides the file into chunks aligned to record boundaries and processes them concurrently on a thread pool; the simplified sketch in the CSV Fast-Path section above illustrates the idea.
Related Pages¶
- Architecture - Overview of the Rust backend architecture.
- Performance Guide - General optimization tips for ParquetFrame.
- Distribution Guide - Building and distributing the Rust I/O codebase.