Data Quality¶
Ensuring your data is clean and reliable for backtesting.
Common Data Issues¶
Missing Data¶
- Gaps: Missing trading days
- NaN values: Missing prices or volumes
- Incomplete symbols: Assets that appear/disappear
Data Errors¶
- Spikes: Erroneous price jumps
- Zeros: Invalid zero prices
- Stale data: Unchanged values when they should change
Structural Issues¶
- Survivorship bias: Only successful companies in dataset
- Look-ahead bias: Using future information
- Corporate actions: Unadjusted splits and dividends
Detecting Missing Data¶
Check for NaN¶
Text Only
signal check_missing:
// Count missing values
has_price = where(is_nan(prices), 0, 1)
has_volume = where(is_nan(volume), 0, 1)
// Quality score per asset
quality = (has_price + has_volume) / 2
emit quality
Missing Data Report¶
Output:
Text Only
Missing Data Report:
===================
Column | Missing | Pct Missing
-----------+---------+-----------
close | 145 | 0.02%
volume | 892 | 0.12%
high | 145 | 0.02%
low | 145 | 0.02%
Assets with >5% missing: DELISTED1, DELISTED2
Handling Missing Data¶
Fill with Zero¶
Forward Fill¶
Text Only
signal forward_fill:
// Use previous value
filled = where(is_nan(prices), lag(prices, 1), prices)
emit filled
Exclude Missing¶
Text Only
signal exclude_missing:
valid = not(is_nan(prices))
signal = where(valid, zscore(ret(prices, 20)), 0)
emit signal
Interpolation¶
Text Only
signal interpolate:
// Linear interpolation
prev = lag(prices, 1)
next = lead(prices, 1)
interpolated = where(is_nan(prices), (prev + next) / 2, prices)
emit interpolated
Detecting Outliers¶
Statistical Detection¶
Text Only
signal outlier_detection:
returns = ret(prices, 1)
z = zscore(returns)
// Flag returns > 5 standard deviations
outlier = abs(z) > 5
emit where(outlier, 1, 0)
Price Spike Detection¶
Text Only
signal spike_detection:
returns = ret(prices, 1)
// Large single-day moves
spike = abs(returns) > 0.5 // > 50% move
emit where(spike, 1, 0)
Volume Anomalies¶
Text Only
signal volume_anomaly:
avg_volume = rolling_mean(volume, 20)
ratio = volume / avg_volume
// Volume > 10x average
anomaly = ratio > 10
emit where(anomaly, 1, 0)
Handling Outliers¶
Winsorization¶
Text Only
signal winsorized:
returns = ret(prices, 20)
z = zscore(returns)
// Clip at 1st and 99th percentile
cleaned = winsor(z, p=0.01)
emit cleaned
Clipping¶
Text Only
signal clipped:
returns = ret(prices, 20)
z = zscore(returns)
// Hard clip at ±3 standard deviations
bounded = clip(z, -3, 3)
emit bounded
Exclusion¶
Text Only
signal exclude_outliers:
returns = ret(prices, 20)
z = zscore(returns)
// Exclude extreme values
valid = abs(z) < 5
signal = where(valid, z, 0)
emit signal
Survivorship Bias¶
The Problem¶
Backtests that only include currently active securities overstate returns because failed companies are excluded.
Solution 1: Point-in-Time Data¶
Use a data provider that includes delisted securities:
Text Only
data:
source = "point_in_time_prices.parquet"
format = parquet
options:
include_delisted = true
Solution 2: Dead Stock Handling¶
Text Only
signal handle_delisting:
// Detect when a stock stops trading
was_trading = lag(prices, 1) > 0
not_trading = prices == 0 or is_nan(prices)
delisted = was_trading and not_trading
// Zero out signal for delisted stocks
raw_signal = zscore(ret(prices, 60))
signal = where(delisted, 0, raw_signal)
emit signal
Look-Ahead Bias¶
The Problem¶
Using information that wouldn't have been available at the time of the trade.
Common Sources¶
- Same-day data: Using close price to make decisions before market close
- Reported vs actual dates: Using data before it was actually reported
- Revisions: Using revised data instead of original releases
Solution: Lag Your Data¶
Text Only
signal no_lookahead:
// Use previous day's close for today's signal
yesterday_price = lag(prices, 1)
signal = zscore(ret(yesterday_price, 20))
emit signal
Fundamental Data¶
Text Only
signal fundamental_signal:
// Earnings reported with delay
// Use 1-quarter lag to be safe
lagged_earnings = lag(earnings, 63) // ~3 months
signal = zscore(lagged_earnings / prices)
emit signal
Data Validation Checklist¶
Before Loading¶
- Check file format (CSV headers, Parquet schema)
- Verify date range is complete
- Confirm all expected symbols present
- Check for duplicate rows
After Loading¶
- Count missing values per column
- Verify no future dates
- Check price ranges are reasonable
- Validate volume is non-negative
Before Backtesting¶
- Confirm no look-ahead bias
- Verify corporate actions adjusted
- Check for survivorship bias
- Test on subset of data first
Quality Metrics¶
Completeness Score¶
Text Only
signal completeness:
// What percentage of data is present?
has_data = where(is_nan(prices), 0, 1)
score = rolling_mean(has_data, 252) // Over 1 year
emit score
Price Quality¶
Text Only
signal price_quality:
// No zero or negative prices
valid_price = prices > 0
// No extreme daily returns
ret_1d = ret(prices, 1)
reasonable_return = abs(ret_1d) < 0.5
quality = where(valid_price and reasonable_return, 1, 0)
emit quality
Volume Quality¶
Text Only
signal volume_quality:
// No zero volume on trading days
has_volume = volume > 0
// No extreme volume spikes
avg_vol = rolling_mean(volume, 20)
reasonable_vol = volume < 50 * avg_vol
quality = where(has_volume and reasonable_vol, 1, 0)
emit quality
Automated Quality Checks¶
sigc CLI¶
Bash
# Run all quality checks
sigc validate data.parquet
# Check specific issues
sigc validate data.parquet --check missing
sigc validate data.parquet --check outliers
sigc validate data.parquet --check duplicates
In Strategy¶
Text Only
data:
source = "prices.parquet"
format = parquet
options:
validate = true # Enable validation
max_missing_pct = 0.05 # Fail if >5% missing
max_outlier_pct = 0.01 # Fail if >1% outliers
Best Practices¶
1. Always Validate New Data¶
2. Use Quality Thresholds¶
Text Only
signal with_quality_filter:
// Only trade assets with good data
quality = calculate_quality(prices, volume)
raw_signal = zscore(ret(prices, 60))
signal = where(quality > 0.95, raw_signal, 0)
emit signal
3. Document Data Sources¶
Text Only
// Data: Bloomberg adjusted prices
// Period: 2010-01-01 to 2024-12-31
// Universe: S&P 500 constituents (point-in-time)
// Corporate actions: Adjusted for splits and dividends
data:
source = "sp500_prices.parquet"
4. Version Your Data¶
Text Only
data/
├── v1/
│ └── prices.parquet
├── v2/
│ └── prices.parquet # Fixed outliers
└── v3/
└── prices.parquet # Added delisted stocks
5. Keep Raw and Cleaned Data¶
Text Only
// Load raw data
data raw:
source = "prices_raw.parquet"
// Apply cleaning
signal clean:
cleaned = winsor(zscore(raw.prices), p=0.01)
emit cleaned
Next Steps¶
- Corporate Actions - Handling splits and dividends
- CSV Format - Loading CSV files
- Backtesting - Running backtests