CSV Format¶
Loading data from CSV files.
Basic Usage¶
Text Only
data:
source = "prices.csv"
format = csv
columns:
date: Date
ticker: Symbol
close: Numeric as prices
volume: Numeric
Expected File Format¶
Long Format (Recommended)¶
One row per date/symbol combination:
Text Only
date,ticker,close,volume
2024-01-02,AAPL,185.64,45678900
2024-01-02,MSFT,374.51,23456789
2024-01-02,GOOGL,140.21,12345678
2024-01-03,AAPL,184.25,43210987
2024-01-03,MSFT,372.89,21098765
2024-01-03,GOOGL,139.75,11234567
Wide Format¶
One row per date, symbols as columns:
Load with pivot option:
Column Definitions¶
Required Columns¶
Data Columns¶
Text Only
columns:
close: Numeric as prices # Closing price
volume: Numeric # Trading volume
high: Numeric # Daily high
low: Numeric # Daily low
open: Numeric # Opening price
Optional Metadata¶
Text Only
columns:
sector: String # Sector classification
industry: String # Industry classification
market_cap: Numeric # Market capitalization
Column Aliasing¶
Rename columns for consistency:
Text Only
data:
source = "bloomberg_data.csv"
format = csv
columns:
PX_LAST: Numeric as prices
PX_VOLUME: Numeric as volume
PX_HIGH: Numeric as high
PX_LOW: Numeric as low
Use in signals:
Text Only
signal example:
// Use aliased name 'prices' instead of 'PX_LAST'
emit zscore(ret(prices, 20))
CSV Options¶
Date Parsing¶
Text Only
data:
source = "data.csv"
format = csv
columns:
date: Date
...
options:
date_format = "%Y-%m-%d" # ISO format (default)
# date_format = "%m/%d/%Y" # US format
# date_format = "%d/%m/%Y" # European format
Delimiter¶
Text Only
data:
source = "data.tsv"
format = csv
options:
delimiter = "\t" # Tab-separated
# delimiter = ";" # Semicolon
Header¶
Text Only
data:
source = "no_header.csv"
format = csv
options:
has_header = false
columns:
col1: Date
col2: Symbol
col3: Numeric as prices
Encoding¶
Text Only
data:
source = "data.csv"
format = csv
options:
encoding = "utf-8" # Default
# encoding = "latin1"
Skip Rows¶
Text Only
data:
source = "data_with_notes.csv"
format = csv
options:
skip_rows = 3 # Skip first 3 rows
Type Inference¶
sigc can infer types, but explicit is better:
Text Only
// Explicit (recommended)
columns:
date: Date
ticker: Symbol
close: Numeric as prices
volume: Numeric
// Inferred (less reliable)
data:
source = "data.csv"
format = csv
options:
infer_types = true
Multiple CSV Files¶
Single Directory¶
Text Only
data:
source = "data/*.csv"
format = csv
columns:
date: Date
ticker: Symbol
close: Numeric as prices
Explicit List¶
Text Only
data:
source = ["prices_2023.csv", "prices_2024.csv"]
format = csv
columns:
date: Date
ticker: Symbol
close: Numeric as prices
Handling Missing Data¶
In CSV¶
Text Only
date,ticker,close,volume
2024-01-02,AAPL,185.64,45678900
2024-01-02,MSFT,,23456789 # Missing price
2024-01-03,AAPL,184.25, # Missing volume
In sigc¶
Text Only
signal handle_missing:
// Option 1: Fill with value
clean_prices = fill_nan(prices, 0)
// Option 2: Forward fill
filled = where(is_nan(prices), lag(prices, 1), prices)
// Option 3: Exclude
valid = not(is_nan(prices))
filtered = where(valid, prices, 0)
emit zscore(ret(clean_prices, 20))
Performance Considerations¶
File Size Guidelines¶
| Size | Rows | Recommendation |
|---|---|---|
| Small | < 100K | CSV is fine |
| Medium | 100K - 1M | Consider Parquet |
| Large | > 1M | Use Parquet |
Optimization Tips¶
Text Only
// 1. Use specific columns (don't load unused data)
columns:
date: Date
ticker: Symbol
close: Numeric as prices # Only load what you need
// 2. Convert to Parquet for repeated use
// Run once: sigc convert data.csv data.parquet
// 3. Use date filtering
options:
start_date = "2020-01-01"
end_date = "2024-12-31"
Common Issues¶
Date Parsing Errors¶
Fix: Specify date format:
Missing Column¶
Fix: Check column names or add alias:
Encoding Issues¶
Fix: Specify encoding:
Example Files¶
OHLCV Data¶
Text Only
date,ticker,open,high,low,close,volume
2024-01-02,AAPL,185.00,186.50,184.00,185.64,45678900
2024-01-02,MSFT,373.00,375.00,372.00,374.51,23456789
Text Only
data:
source = "ohlcv.csv"
format = csv
columns:
date: Date
ticker: Symbol
open: Numeric
high: Numeric
low: Numeric
close: Numeric as prices
volume: Numeric
With Fundamentals¶
Text Only
date,ticker,close,pe_ratio,book_value,market_cap
2024-01-02,AAPL,185.64,28.5,4.25,2850000000000
2024-01-02,MSFT,374.51,35.2,24.50,2780000000000
Text Only
data:
source = "with_fundamentals.csv"
format = csv
columns:
date: Date
ticker: Symbol
close: Numeric as prices
pe_ratio: Numeric
book_value: Numeric
market_cap: Numeric
Next Steps¶
- Parquet Format - Faster loading for large datasets
- Data Quality - Validating your data
- Signal Section - Using loaded data