Data Loading¶
sigc loads market data from local files (CSV or Parquet), S3 object storage, and PostgreSQL databases.
Supported Formats¶
| Format | Extension / Scheme | Best For |
|---|---|---|
| CSV | .csv | Small datasets, debugging |
| Parquet | .parquet | Large datasets, production |
| S3 | s3:// | Cloud storage |
| PostgreSQL | postgresql:// | Database integration |
Quick Examples¶
CSV¶
    data:
      source = "prices.csv"
      format = csv
      columns:
        date: Date
        ticker: Symbol
        close: Numeric as prices
        volume: Numeric
Parquet¶
    data:
      source = "prices.parquet"
      format = parquet
S3¶
PostgreSQL¶
    data:
      source = "postgresql://localhost/marketdb"
      query = "SELECT date, ticker, close, volume FROM daily_prices"
Data Section Structure¶
    data:
      source = "..."            # Required: file path, URL, or connection string
      format = csv | parquet    # Required for files
      columns:                  # Column definitions (required for CSV)
        column_name: Type [as alias]
      options:                  # Optional: format-specific settings
        ...
Column Types¶
| Type | Description | Example |
|---|---|---|
| Date | Date column (index) | 2024-01-15 |
| Symbol | Asset identifier | AAPL, MSFT |
| Numeric | Price, volume, etc. | 150.25 |
| String | Text data | "Technology" |
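To make the four column types concrete, here is a minimal Python sketch of what converting one raw CSV row under such a schema could look like. It is an illustration only, not sigc's parser: `CONVERTERS`, `convert_row`, and the `sector` column are hypothetical names, and the ISO date format is assumed from the example in the table.

```python
from datetime import date

# Hypothetical converters, one per column type in the table above.
# They illustrate the idea only; sigc's own parsing rules may differ.
CONVERTERS = {
    "Date": date.fromisoformat,   # assumes ISO dates such as "2024-01-15"
    "Symbol": str.strip,
    "Numeric": float,
    "String": str,
}

def convert_row(row, schema):
    """Convert each raw string in `row` using the type named in `schema`."""
    out = {}
    for column, type_name in schema.items():
        try:
            out[column] = CONVERTERS[type_name](row[column])
        except (KeyError, ValueError) as exc:
            # Covers a missing column, an unknown type name, or a bad value.
            raise ValueError(f"column {column!r}: cannot read as {type_name}") from exc
    return out

schema = {"date": "Date", "ticker": "Symbol", "close": "Numeric", "sector": "String"}
raw = {"date": "2024-01-15", "ticker": "AAPL", "close": "150.25", "sector": "Technology"}
parsed = convert_row(raw, schema)
```

A value that does not match its declared type, such as a date written as `15/01/2024`, raises a `ValueError` naming the offending column.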
Column Aliasing¶
Rename columns for cleaner code:
    data:
      source = "raw_data.csv"
      format = csv
      columns:
        trade_date: Date
        symbol: Symbol
        adj_close: Numeric as prices        # Use 'prices' in signals
        shares_traded: Numeric as volume    # Use 'volume' in signals
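The `Type [as alias]` form amounts to a rename map from source column names to the names used in signals. The Python sketch below shows that mapping for the columns above. `parse_column_spec` is a hypothetical helper written for this illustration and does not describe how sigc parses the section.

```python
from typing import Optional, Tuple

def parse_column_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'Type [as alias]' into (type_name, alias or None)."""
    parts = spec.split()
    if len(parts) == 1:
        return parts[0], None
    if len(parts) == 3 and parts[1] == "as":
        return parts[0], parts[2]
    raise ValueError(f"unrecognised column spec: {spec!r}")

# Column definitions copied from the example above.
columns = {
    "trade_date": "Date",
    "symbol": "Symbol",
    "adj_close": "Numeric as prices",
    "shares_traded": "Numeric as volume",
}

# Map each source column to the name a signal would refer to.
rename = {}
for source_name, spec in columns.items():
    _type_name, alias = parse_column_spec(spec)
    rename[source_name] = alias or source_name
```

Columns without an alias keep their source name, so only `adj_close` and `shares_traded` change.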
Multiple Data Sources¶
Combine data from multiple files:
    data prices:
      source = "prices.parquet"
      format = parquet

    data fundamentals:
      source = "fundamentals.parquet"
      format = parquet

    signal combined:
      momentum = zscore(ret(prices.close, 60))
      value = zscore(fundamentals.book_to_market)
      emit 0.5 * momentum + 0.5 * value
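The arithmetic behind the blended signal can be sketched in Python for a single date. This assumes `zscore` standardises across symbols using the population standard deviation, which the page does not confirm. The tickers and input values are made up for illustration.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardise a cross-section: subtract the mean, divide by the std dev.

    Assumption: population std dev across symbols; sigc's zscore may differ.
    A constant cross-section (std dev of 0) would raise ZeroDivisionError.
    """
    mu = mean(values.values())
    sigma = pstdev(values.values())
    return {k: (v - mu) / sigma for k, v in values.items()}

# Made-up inputs for one date, keyed by hypothetical tickers.
ret_60d = {"AAA": 0.10, "BBB": 0.00, "CCC": -0.10}
book_to_market = {"AAA": 0.2, "BBB": 0.5, "CCC": 0.8}

momentum = zscore(ret_60d)
value = zscore(book_to_market)

# Equal-weight blend, mirroring `emit 0.5 * momentum + 0.5 * value`.
combined = {k: 0.5 * momentum[k] + 0.5 * value[k] for k in momentum}
```

Because both inputs are standardised first, the 0.5 weights act on comparable scales. In this made-up case the momentum and value rankings are exact opposites, so the blended scores cancel to roughly zero.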
Data Flow¶
    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
    │ CSV/Parquet │───▶│   Parser    │───▶│    Panel    │
    │    File     │    │  (Polars)   │    │ (dates×sym) │
    └─────────────┘    └─────────────┘    └─────────────┘
                              │
                              ▼
                   ┌─────────────────────┐
                   │   Type Validation   │
                   │   - Column types    │
                   │   - Date parsing    │
                   │   - Symbol mapping  │
                   └─────────────────────┘
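The final stage of the diagram, turning long-format rows into a dates × symbols panel, can be sketched in plain Python. This is a simplified stand-in for illustration: sigc uses Polars for parsing, while this sketch uses only the standard library. The `to_panel` helper and the sample rows are hypothetical.

```python
import csv
import io
from datetime import date

# Made-up long-format input: one row per (date, ticker).
CSV_TEXT = """date,ticker,close
2024-01-15,AAPL,150.25
2024-01-15,MSFT,390.10
2024-01-16,AAPL,151.00
"""

def to_panel(text, value_column):
    """Pivot long rows into a dates x symbols grid; missing cells become None."""
    rows = list(csv.DictReader(io.StringIO(text)))
    cells = {
        (date.fromisoformat(r["date"]), r["ticker"]): float(r[value_column])
        for r in rows
    }
    dates = sorted({d for d, _ in cells})
    symbols = sorted({s for _, s in cells})
    grid = [[cells.get((d, s)) for s in symbols] for d in dates]
    return dates, symbols, grid

dates, symbols, grid = to_panel(CSV_TEXT, "close")
```

Each row of `grid` is one date and each column one symbol. The missing MSFT row for 2024-01-16 appears as `None`, which is the kind of gap a signal has to handle explicitly.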
Best Practices¶
1. Use Parquet for Production¶
    // Parquet is 5-10x faster than CSV for large datasets
    data:
      source = "prices.parquet"
      format = parquet
2. Define Types Explicitly¶
    // Good: explicit types
    columns:
      date: Date
      ticker: Symbol
      close: Numeric as prices
    // Bad: relying on inference
3. Handle Missing Data¶
    signal robust:
      // Fill missing prices before computation
      clean = fill_nan(prices, 0)
      emit zscore(ret(clean, 20))
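The fill value matters when the filled series feeds a return calculation. The Python sketch below compares a zero fill with carrying the last price forward, assuming simple one-period returns. All helper names and prices are hypothetical, and whether sigc offers a forward-fill operator is not stated on this page.

```python
def simple_returns(prices):
    """One-period simple returns; None where a return is undefined."""
    out = []
    for prev, cur in zip(prices, prices[1:]):
        if prev is None or cur is None or prev == 0:
            out.append(None)   # missing input or division by zero
        else:
            out.append(cur / prev - 1.0)
    return out

def fill_value(prices, value):
    """Replace missing prices with a constant."""
    return [value if p is None else p for p in prices]

def forward_fill(prices):
    """Replace missing prices with the last observed price."""
    out, last = [], None
    for p in prices:
        last = p if p is not None else last
        out.append(last)
    return out

prices = [100.0, None, 102.0]   # made-up series with one missing bar

zero_filled = simple_returns(fill_value(prices, 0.0))
carried = simple_returns(forward_fill(prices))
```

With a zero fill, the return into the gap is -100% and the return out of it is undefined. Carrying the last price forward gives a flat bar followed by the true two-bar move. Which fill is appropriate depends on the data and on what the signal measures.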
4. Use Consistent Naming¶
    // Alias to standard names
    columns:
      px_last: Numeric as prices
      px_volume: Numeric as volume
      px_high: Numeric as high
      px_low: Numeric as low
Data Quality¶
See Data Quality for:
- Missing data detection
- Outlier handling
- Corporate action adjustments
- Survivorship bias
Next Steps¶
- CSV Format - CSV file loading
- Parquet Format - Parquet file loading
- Data Quality - Data validation
- Corporate Actions - Handling splits and dividends