Monitoring¶
Prometheus metrics and observability for sigc.
Overview¶
sigc exposes comprehensive metrics for:
- Strategy performance
- System health
- Trading activity
- Data pipeline status
Prometheus Setup¶
Enable Metrics¶
Scrape Configuration¶
Add to your Prometheus config:
YAML
# prometheus.yml
scrape_configs:
  - job_name: 'sigc'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
Available Metrics¶
Performance Metrics¶
Text Only
# Portfolio value
sigc_portfolio_value{strategy="momentum"} 1050000
# Daily P&L
sigc_daily_pnl_pct{strategy="momentum"} 0.012
# Drawdown
sigc_drawdown_pct{strategy="momentum"} 0.032
# Sharpe ratio (rolling 30d)
sigc_sharpe_ratio_30d{strategy="momentum"} 1.25
# Returns
sigc_return_total{strategy="momentum"} 0.085
sigc_return_daily{strategy="momentum"} 0.0012
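How sigc computes these values internally is not documented on this page. As a minimal sketch, assuming drawdown means the fractional decline of the latest portfolio value from its running peak, the calculation could look like the following. The helper name `current_drawdown` and the sample history are hypothetical.

```python
def current_drawdown(values):
    """Fractional decline of the latest value from the highest value seen so far."""
    if not values:
        raise ValueError("values must be non-empty")
    peak = max(values)
    return (peak - values[-1]) / peak


# Illustrative numbers only, not sigc output.
history = [1_000_000, 1_084_711, 1_050_000]
print(round(current_drawdown(history), 3))  # prints 0.032
```

A value of 0.032 here corresponds to a 3.2% decline from the peak, which matches the fractional convention used by the sample metric values above.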
Position Metrics¶
Text Only
# Position count
sigc_positions_long_count{strategy="momentum"} 42
sigc_positions_short_count{strategy="momentum"} 38
# Exposure
sigc_exposure_gross_pct{strategy="momentum"} 1.85
sigc_exposure_net_pct{strategy="momentum"} 0.05
# Concentration
sigc_position_max_pct{strategy="momentum",ticker="NVDA"} 0.042
sigc_sector_max_pct{strategy="momentum",sector="Technology"} 0.22
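The exposure figures above can be read as sums over position weights. The sketch below assumes the usual definitions (gross is the sum of absolute weights, net is the signed sum); whether sigc uses exactly these definitions is an assumption, and the tickers and weights are made up.

```python
def exposures(weights):
    """Return (gross, net) exposure as fractions of capital.

    weights maps ticker -> signed portfolio weight (negative = short).
    """
    gross = sum(abs(w) for w in weights.values())
    net = sum(weights.values())
    return gross, net


# Hypothetical long/short book, not real positions.
weights = {"AAA": 0.50, "BBB": 0.45, "CCC": -0.40, "DDD": -0.50}
gross, net = exposures(weights)
print(f"gross={gross:.2f} net={net:.2f}")  # prints gross=1.85 net=0.05
```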
Trading Metrics¶
Text Only
# Order counts
sigc_orders_total{strategy="momentum",status="filled"} 1523
sigc_orders_total{strategy="momentum",status="rejected"} 12
sigc_orders_total{strategy="momentum",status="cancelled"} 45
# Turnover
sigc_turnover_daily_pct{strategy="momentum"} 0.18
sigc_turnover_monthly_pct{strategy="momentum"} 0.85
# Transaction costs
sigc_transaction_costs_bps{strategy="momentum"} 8.5
System Metrics¶
Text Only
# Daemon status
sigc_daemon_uptime_seconds 302400
sigc_daemon_restarts_total 2
# Memory
sigc_memory_used_bytes 1073741824
sigc_memory_limit_bytes 4294967296
# CPU
sigc_cpu_usage_pct 15.2
# Computation
sigc_computation_duration_seconds{strategy="momentum"} 2.34
sigc_computation_last_success_timestamp{strategy="momentum"} 1705320000
Data Metrics¶
Text Only
# Data freshness
sigc_data_age_seconds{source="prices"} 900
sigc_data_rows_total{source="prices"} 2500000
# Cache
sigc_cache_hits_total 15234
sigc_cache_misses_total 523
sigc_cache_size_bytes 2147483648
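The two cache counters are most useful as a ratio. A small sketch of the arithmetic, using the sample values above (`hit_ratio` is a hypothetical helper, not part of sigc):

```python
def hit_ratio(hits, misses):
    """Fraction of cache lookups that were hits; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


print(f"{hit_ratio(15234, 523):.4f}")  # prints 0.9668
```

In Prometheus the same ratio is typically computed over `rate()` of each counter rather than the raw totals, so that it reflects recent behavior instead of the lifetime average.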
Alert Metrics¶
Text Only
# Alert counts
sigc_alerts_total{severity="critical"} 0
sigc_alerts_total{severity="high"} 2
sigc_alerts_total{severity="warning"} 15
# Circuit breakers
sigc_circuit_breaker_triggered_total{breaker="daily_loss"} 1
# Status: 0=ok, 1=tripped
sigc_circuit_breaker_status{breaker="daily_loss"} 0
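For quick checks outside Prometheus, sample lines in the format shown above can be parsed with a few lines of code. This is a simplified sketch: it ignores timestamps, escaped characters in label values, and `# HELP` / `# TYPE` metadata, and `parse_samples` is a hypothetical helper. For production use, a real exposition-format parser is the safer choice.

```python
import re

# metric_name, optional {label="value",...}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Return a list of (name, labels, value) tuples from exposition-style text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines this simplified pattern does not understand
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


text = '''
# Order counts
sigc_orders_total{strategy="momentum",status="filled"} 1523
sigc_cpu_usage_pct 15.2
'''
for name, labels, value in parse_samples(text):
    print(name, labels, value)
```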
Grafana Dashboards¶
Import Dashboard¶
sigc provides pre-built Grafana dashboards:
Bash
# Export dashboard JSON
sigc monitoring export-dashboard > sigc-dashboard.json
# Import to Grafana
curl -X POST \
-H "Content-Type: application/json" \
-d @sigc-dashboard.json \
http://grafana:3000/api/dashboards/db
Dashboard Panels¶
Overview Panel:
- Portfolio value over time
- Daily P&L
- Current drawdown
- Key metrics

Performance Panel:
- Cumulative returns
- Rolling Sharpe
- Drawdown chart
- Monthly returns heatmap

Positions Panel:
- Long/short counts
- Gross/net exposure
- Sector breakdown
- Top positions

Trading Panel:
- Orders by status
- Turnover trend
- Transaction costs
- Fill rates

System Panel:
- Memory/CPU usage
- Computation times
- Data freshness
- Error rates
Health Checks¶
Endpoints¶
Bash
# Liveness (is process running?)
curl http://localhost:8080/live
# Returns 200 OK or 503
# Readiness (is service ready?)
curl http://localhost:8080/ready
# Returns 200 OK or 503
# Health (detailed status)
curl http://localhost:8080/health
Health Response¶
JSON
{
  "status": "healthy",
  "timestamp": "2024-01-15T14:30:00Z",
  "checks": {
    "data_freshness": {
      "status": "healthy",
      "message": "Data is 15 minutes old",
      "last_check": "2024-01-15T14:29:55Z"
    },
    "broker_connection": {
      "status": "healthy",
      "message": "Connected to Alpaca",
      "latency_ms": 45
    },
    "memory": {
      "status": "healthy",
      "used_pct": 62,
      "limit_mb": 4096
    },
    "disk": {
      "status": "healthy",
      "used_pct": 45,
      "free_gb": 120
    }
  },
  "uptime_seconds": 302400
}
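A script consuming this response usually only needs to know which checks are failing. The sketch below assumes the payload has the shape shown above, with a top-level `checks` object whose entries each carry a `status` field; the helper name `unhealthy_checks` and the `"degraded"` status value are illustrative assumptions.

```python
import json


def unhealthy_checks(payload):
    """Return the sorted names of checks whose status is not "healthy"."""
    report = json.loads(payload)
    checks = report.get("checks", {})
    return sorted(
        name for name, check in checks.items()
        if check.get("status") != "healthy"
    )


# Hypothetical response body with one failing check.
body = json.dumps({
    "status": "degraded",
    "checks": {
        "data_freshness": {"status": "degraded", "message": "Data is 3 hours old"},
        "memory": {"status": "healthy", "used_pct": 62},
    },
})
print(unhealthy_checks(body))  # prints ['data_freshness']
```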
Custom Health Checks¶
YAML
monitoring:
  health:
    checks:
      - name: market_hours
        type: time_window
        start: "09:30"
        end: "16:00"
        timezone: America/New_York
      - name: data_complete
        type: custom
        script: /opt/sigc/scripts/check_data.sh
        timeout_seconds: 30
Logging¶
Structured Logging¶
Log Format¶
JSON
{
  "timestamp": "2024-01-15T14:30:00.123Z",
  "level": "info",
  "component": "scheduler",
  "strategy": "momentum",
  "message": "Signal computation completed",
  "duration_ms": 2340,
  "positions": 80,
  "turnover_pct": 0.18
}
Log Levels¶
| Level | Description |
|---|---|
| debug | Detailed debugging info |
| info | Normal operations |
| warn | Warning conditions |
| error | Error conditions |
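Because each log entry is a single JSON object with a `level` field, filtering by severity is straightforward. The sketch below assumes one JSON object per line and the ordering debug < info < warn < error; the numeric ranks and the helper name `filter_logs` are illustrative choices, not sigc settings.

```python
import json

# Assumed severity ordering for the four documented levels.
LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def filter_logs(lines, min_level="warn"):
    """Return parsed log records at or above min_level, skipping malformed lines."""
    threshold = LEVELS[min_level]
    kept = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON, e.g. a stray stack trace line
        if not isinstance(record, dict):
            continue
        # Unknown or missing levels rank below debug and are dropped.
        if LEVELS.get(str(record.get("level")), 0) >= threshold:
            kept.append(record)
    return kept


lines = [
    '{"level": "info", "message": "Signal computation completed"}',
    '{"level": "error", "message": "Broker connection lost"}',
    'not json',
]
for record in filter_logs(lines):
    print(record["level"], record["message"])
```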
Log Aggregation¶
Send logs to external systems:
YAML
monitoring:
  logging:
    sinks:
      - type: elasticsearch
        url: http://elasticsearch:9200
        index: sigc-logs
      - type: cloudwatch
        region: us-east-1
        log_group: /sigc/production
Alertmanager Integration¶
Configure Alertmanager¶
YAML
# alertmanager.yml
route:
  group_by: ['alertname', 'strategy']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#trading-alerts'
  - name: 'pagerduty'
    pagerduty_configs:
      # Placeholder: Alertmanager does not expand environment variables
      # in its config file, so substitute the key at deploy time.
      - service_key: ${PAGERDUTY_KEY}
Alert Rules¶
YAML
# sigc.rules.yml
groups:
  - name: sigc
    rules:
      - alert: SigcHighDrawdown
        expr: sigc_drawdown_pct > 0.10
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High drawdown: {{ $value | humanizePercentage }}"
      - alert: SigcDaemonDown
        expr: up{job="sigc"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "sigc daemon is down"
      - alert: SigcDataStale
        expr: sigc_data_age_seconds > 7200
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Data is {{ $value | humanizeDuration }} old"
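The `for:` clause means an alert fires only after its expression has stayed true for the whole duration; a single dip below the threshold resets the clock. The following sketch models that behavior on a list of `(timestamp, value)` samples. It is a simplification that evaluates once per sample rather than on Prometheus's own evaluation interval, and `first_firing_time` is a hypothetical helper.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would first fire, or None.

    samples: iterable of (timestamp_seconds, value), in time order.
    The condition is value > threshold, held continuously for for_seconds.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= for_seconds:
                return timestamp
        else:
            pending_since = None  # condition broke, so the timer resets
    return None


# Made-up drawdown readings, one per minute.
values = [0.11, 0.12, 0.09, 0.11, 0.11, 0.12, 0.13, 0.12, 0.11]
samples = [(i * 60, v) for i, v in enumerate(values)]
print(first_firing_time(samples, threshold=0.10, for_seconds=300))  # prints 480
```

The dip to 0.09 at 120 seconds resets the pending state, so the five-minute window starts again at 180 seconds and the alert fires at 480 seconds.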
Tracing¶
OpenTelemetry Integration¶
YAML
monitoring:
  tracing:
    enabled: true
    exporter: jaeger
    endpoint: http://jaeger:14268/api/traces
    sample_rate: 0.1
Trace Spans¶
Text Only
[signal_computation] 2.34s
├─[load_data] 0.85s
│ ├─[fetch_prices] 0.72s
│ └─[fetch_fundamentals] 0.12s
├─[compute_signals] 1.12s
│ ├─[momentum] 0.45s
│ └─[value] 0.67s
└─[generate_weights] 0.37s
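When reading a span tree like this, the time spent in a span itself (its duration minus its children's durations) shows where the overhead sits. A minimal sketch using the timings above, with spans represented as hypothetical `(name, duration, children)` tuples and child spans assumed to run sequentially:

```python
def self_times(span, out=None):
    """Map each span name to its duration minus the sum of its direct children."""
    out = {} if out is None else out
    name, duration, children = span
    child_total = sum(child[1] for child in children)
    # Round to centiseconds to hide floating-point noise; + 0.0 avoids "-0.0".
    out[name] = round(duration - child_total, 2) + 0.0
    for child in children:
        self_times(child, out)
    return out


trace = ("signal_computation", 2.34, [
    ("load_data", 0.85, [
        ("fetch_prices", 0.72, []),
        ("fetch_fundamentals", 0.12, []),
    ]),
    ("compute_signals", 1.12, [
        ("momentum", 0.45, []),
        ("value", 0.67, []),
    ]),
    ("generate_weights", 0.37, []),
])

for name, seconds in self_times(trace).items():
    print(f"{name}: {seconds:.2f}s")
```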
Best Practices¶
1. Set Up Dashboards Early¶
Create dashboards before going live.
2. Alert on Symptoms, Not Causes¶
YAML
# Good: Alert on outcome
- alert: SigcHighDrawdown
  expr: sigc_drawdown_pct > 0.10
# Less useful: Alert on potential cause
- alert: SigcHighMemory
  expr: sigc_memory_used_bytes / sigc_memory_limit_bytes > 0.80
3. Use Recording Rules¶
Pre-compute expensive queries:
YAML
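groups:
  - name: sigc_recording
    rules:
      - record: sigc:sharpe_ratio_30d
        # sqrt() takes an instant vector, so the constant 252 is wrapped
        # with vector() and converted back with scalar().
        expr: |
          (avg_over_time(sigc_return_daily[30d]) * 252) /
          (stddev_over_time(sigc_return_daily[30d]) * scalar(sqrt(vector(252))))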
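The recording rule annualizes the mean daily return by 252 trading days and the daily standard deviation by the square root of 252. The same arithmetic in Python, as a sketch for sanity-checking values offline: it uses the population standard deviation, which is what `stddev_over_time` computes, and, like the rule, it does not subtract a risk-free rate. The function name and sample returns are hypothetical.

```python
import math
import statistics


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio from daily returns, with no risk-free adjustment."""
    mean = statistics.fmean(daily_returns)
    std = statistics.pstdev(daily_returns)  # population standard deviation
    if std == 0:
        raise ValueError("standard deviation is zero; Sharpe ratio is undefined")
    return (mean * periods_per_year) / (std * math.sqrt(periods_per_year))


# Made-up daily returns for illustration.
returns = [0.010, -0.005, 0.002, 0.007]
print(round(annualized_sharpe(returns), 2))
```

Algebraically this reduces to `mean / std * sqrt(252)`, so the ratio scales with the square root of the number of periods per year.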
4. Retain Data Appropriately¶
5. Test Monitoring¶
Verify that every metric is being collected and that each alert fires when its condition is met.
Next Steps¶
- Alerting - Notification setup
- Safety Systems - Circuit breakers
- Daemon Mode - Running as service