Parquet Format¶
Apache Parquet is the recommended format for production data in sigc.
Why Parquet?¶
| Feature | CSV | Parquet |
|---|---|---|
| Read Speed | Slow | 5-10x faster |
| File Size | Large | 50-90% smaller |
| Type Safety | No | Yes |
| Column Pruning | No | Yes |
| Predicate Pushdown | No | Yes |
Basic Usage¶
Types are inferred automatically from the Parquet schema, so explicit column annotations are optional.
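For example, a minimal configuration needs only a source and a format (the file name here is illustrative):

```text
data:
  source = "prices.parquet"
  format = parquet
```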
Column Selection¶
Load only the columns you need:

```text
data:
  source = "market_data.parquet"
  format = parquet
  columns:
    date: Date
    ticker: Symbol
    close: Numeric as prices
    volume: Numeric
    # Other columns in the file are ignored
```
Parquet Options¶
Row Groups¶
Row groups are Parquet's horizontal storage chunks; each one carries its own per-column statistics.
```text
data:
  source = "large_file.parquet"
  format = parquet
  options:
    row_group_size = 100000  # Rows per group
```
Compression¶
sigc reads all of the standard Parquet compression codecs:
- Snappy (default, fast)
- Gzip (smaller files)
- LZ4 (fastest)
- Zstd (best compression ratio)
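The codec is chosen when a file is written. For example, with pandas and the pyarrow engine (a sketch; the file names are illustrative):

```python
import pandas as pd

df = pd.read_csv('prices.csv', parse_dates=['date'])

# Any of 'snappy', 'gzip', 'lz4', or 'zstd' works with the pyarrow engine
df.to_parquet('prices.parquet', index=False, compression='zstd')
```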
Memory Mapping¶
```text
data:
  source = "huge_file.parquet"
  format = parquet
  options:
    memory_map = true  # Memory-map instead of loading
```
Creating Parquet Files¶
From CSV with sigc¶
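The exact interface depends on your sigc build; the subcommand below is hypothetical, so check `sigc --help` for the real one:

```bash
# Hypothetical subcommand, shown for illustration only
sigc convert prices.csv prices.parquet
```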
From Python¶
```python
import pandas as pd

# Read CSV
df = pd.read_csv('prices.csv', parse_dates=['date'])

# Write Parquet
df.to_parquet('prices.parquet', index=False)
```
From Polars¶
```python
import polars as pl

# Read and write
df = pl.read_csv('prices.csv')
df.write_parquet('prices.parquet')
```
Schema Requirements¶
Required Columns¶
Your Parquet file should have:

- A date column (e.g. `date`)
- A symbol/ticker column (e.g. `ticker`)
- The numeric value columns your configuration references (e.g. `close`, `volume`)
Example Schema¶
```text
message schema {
  required int32 date (DATE);
  required binary ticker (STRING);
  required double close;
  required double volume;
  optional double high;
  optional double low;
  optional double open;
}
```
Partitioned Data¶
Hive-Style Partitioning¶
```text
data/
├── year=2023/
│   ├── month=01/
│   │   └── data.parquet
│   └── month=02/
│       └── data.parquet
└── year=2024/
    └── month=01/
        └── data.parquet
```
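One way to produce this layout is pyarrow's `write_to_dataset` (a sketch that assumes the CSV already contains `year` and `month` columns):

```python
import pyarrow.csv as pv
import pyarrow.parquet as pq

# Assumes 'year' and 'month' columns already exist in the data
table = pv.read_csv('prices.csv')
pq.write_to_dataset(table, root_path='data/', partition_cols=['year', 'month'])
```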
Date Filtering with Partitions¶
```text
data:
  source = "data/"
  format = parquet
  options:
    partitioned = true
    start_date = "2023-01-01"  # Only load needed partitions
    end_date = "2024-12-31"
```
Multiple Parquet Files¶
Directory¶
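Pointing `source` at a directory loads every Parquet file inside it (a sketch; verify directory handling in your sigc version):

```text
data:
  source = "prices/"  # All .parquet files in the directory
  format = parquet
```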
Explicit List¶
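To control exactly which files are read, list them. The list syntax here is an assumption; check the sigc configuration reference:

```text
data:
  source = ["prices_2023.parquet", "prices_2024.parquet"]
  format = parquet
```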
Column Aliasing¶
Vendor column names can be mapped onto sigc names with `as`:
```text
data:
  source = "data.parquet"
  format = parquet
  columns:
    px_last: Numeric as prices
    px_volume: Numeric as volume
```
Type Mapping¶
| Parquet Type | sigc Type |
|---|---|
| INT32 (DATE) | Date |
| INT64 (TIMESTAMP) | Date |
| STRING | Symbol, String |
| FLOAT, DOUBLE | Numeric |
| BOOLEAN | Boolean |
Performance Tips¶
1. Column Pruning¶
```text
# Only load required columns
data:
  source = "large_file.parquet"
  format = parquet
  columns:
    date: Date
    ticker: Symbol
    close: Numeric as prices  # Only the listed columns are read, not all of them
```
2. Predicate Pushdown¶
```text
data:
  source = "data.parquet"
  format = parquet
  options:
    start_date = "2020-01-01"  # Filter pushed to the reader
    symbols = ["AAPL", "MSFT", "GOOGL"]
```
3. Row Group Statistics¶
Parquet stores min/max per row group, enabling fast filtering.
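You can inspect these statistics directly with pyarrow (the file name and column index are illustrative):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile('data.parquet')

# Min/max for the first column of the first row group
stats = pf.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```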
4. Memory Mapping for Large Files¶
```text
data:
  source = "100gb_file.parquet"
  format = parquet
  options:
    memory_map = true  # Don't load entire file
```
Compression Comparison¶
| Compression | Read Speed | Write Speed | Ratio |
|---|---|---|---|
| None | Fastest | Fastest | 1.0x |
| Snappy | Fast | Fast | 2-3x |
| LZ4 | Fast | Fast | 2-3x |
| Gzip | Medium | Slow | 4-6x |
| Zstd | Medium | Medium | 5-8x |
Recommendation: Use Snappy for most cases; it gives a good balance of read speed and file size.
Example: Converting a Large Dataset¶
With polars, scan the CSV lazily and write a hive-partitioned dataset. Note that `sink_parquet` does not take a `partition_by` argument; recent polars releases accept it on `write_parquet` instead:

```python
import polars as pl

# Lazily scan the CSV instead of loading it all at once
lf = pl.scan_csv('huge_prices.csv', try_parse_dates=True)

# Derive the partition columns from the date
lf = lf.with_columns(
    pl.col('date').dt.year().alias('year'),
    pl.col('date').dt.month().alias('month'),
)

# Write as hive-partitioned Parquet (year=.../month=.../)
lf.collect().write_parquet(
    'prices/',
    compression='snappy',
    row_group_size=100_000,
    partition_by=['year', 'month'],
)
```
Troubleshooting¶
Schema Mismatch¶
Ensure all Parquet files have identical schemas when loading multiple files.
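A quick way to find the offending file (a pyarrow sketch; the directory name is illustrative):

```python
import pyarrow.parquet as pq
from pathlib import Path

files = sorted(Path('data/').glob('*.parquet'))
base = pq.read_schema(files[0])
for path in files[1:]:
    if not pq.read_schema(path).equals(base):
        print(f'Schema mismatch: {path}')
```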
Memory Issues¶
Solutions:

- Use memory mapping: `memory_map = true`
- Select only needed columns
- Use date filtering
- Partition large files
Missing Columns¶
Check the actual schema:
```bash
# Using Python
python -c "import pyarrow.parquet as pq; print(pq.read_schema('file.parquet'))"

# Using sigc
sigc inspect file.parquet
```
Best Practices¶
1. Use Snappy Compression¶
It is fast to read and write while still roughly halving file size (see the comparison table above), which suits most workloads.
2. Choose Appropriate Row Group Size¶
- Small files (< 1M rows): 50K-100K rows
- Large files: 100K-500K rows
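When writing with pyarrow, the group size is set at write time (a sketch with toy data):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'close': [101.5, 102.0], 'volume': [1_000.0, 2_000.0]})

# Target 100K rows per row group
pq.write_table(table, 'data.parquet', row_group_size=100_000)
```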
3. Partition by Date for Time-Series¶
Hive-style `year=`/`month=` directories let the reader skip everything outside the requested date range (see Partitioned Data above).
4. Include Metadata¶
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('prices.csv', parse_dates=['date'])
table = pa.Table.from_pandas(df)

# Add custom metadata, merged with what the schema already carries
metadata = {
    b'source': b'Bloomberg',
    b'created': b'2024-01-15',
    b'frequency': b'daily',
}
table = table.replace_schema_metadata({**(table.schema.metadata or {}), **metadata})
pq.write_table(table, 'data.parquet')
```
Next Steps¶
- S3 Storage - Cloud storage integration
- Data Quality - Validating your data
- CSV Format - When CSV is appropriate