Monitoring¶

Prometheus metrics and observability for sigc.

Overview¶

sigc exposes comprehensive metrics for:

Strategy performance
System health
Trading activity
Data pipeline status

Prometheus Setup¶

Enable Metrics¶

YAML

monitoring:
  prometheus:
    enabled: true
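    port: 9090  # must match the scrape target; Prometheus itself also defaults to 9090, so pick another port if both share a host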
    path: /metrics

Scrape Configuration¶

Add to your Prometheus config:

YAML

# prometheus.yml
scrape_configs:
  - job_name: 'sigc'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s

Available Metrics¶

Performance Metrics¶

Text Only

# Portfolio value
sigc_portfolio_value{strategy="momentum"} 1050000

# Daily P&L
sigc_daily_pnl_pct{strategy="momentum"} 0.012

# Drawdown
sigc_drawdown_pct{strategy="momentum"} 0.032

# Sharpe ratio (rolling 30d)
sigc_sharpe_ratio_30d{strategy="momentum"} 1.25

# Returns
sigc_return_total{strategy="momentum"} 0.085
sigc_return_daily{strategy="momentum"} 0.0012
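
To check the endpoint's output outside Prometheus, the samples can be read with a few lines of Python. This is a minimal sketch, not part of sigc: `parse_samples` and both regular expressions are illustrative names, and the parser deliberately ignores timestamps, escaped quotes in label values, and `# HELP` / `# TYPE` metadata.

```python
import re

# Matches lines such as: sigc_portfolio_value{strategy="momentum"} 1050000
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_samples(text):
    """Parse simple exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples
```

Feeding it the text returned by the `/metrics` path yields tuples such as `("sigc_drawdown_pct", {"strategy": "momentum"}, 0.032)`, which is convenient for ad hoc checks in tests or notebooks.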

Position Metrics¶

Text Only

# Position count
sigc_positions_long_count{strategy="momentum"} 42
sigc_positions_short_count{strategy="momentum"} 38

# Exposure
sigc_exposure_gross_pct{strategy="momentum"} 1.85
sigc_exposure_net_pct{strategy="momentum"} 0.05

# Concentration
sigc_position_max_pct{strategy="momentum",ticker="NVDA"} 0.042
sigc_sector_max_pct{strategy="momentum",sector="Technology"} 0.22
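
The exposure values above read as fractions of portfolio value. Under the conventional definitions (gross = longs + |shorts|, net = longs - |shorts|), a book that is 95% long and 90% short gives exactly the 1.85 and 0.05 shown. The sketch below assumes sigc follows those conventions; `exposures` and the ticker names are hypothetical.

```python
def exposures(weights):
    """Return (gross, net) exposure for signed portfolio weights.

    Positive weights are longs and negative weights are shorts, each
    expressed as a fraction of portfolio value.
    """
    long_side = sum(w for w in weights.values() if w > 0)
    short_side = -sum(w for w in weights.values() if w < 0)
    return long_side + short_side, long_side - short_side


# Hypothetical two-position book: 95% long, 90% short
gross, net = exposures({"AAA": 0.95, "BBB": -0.90})
```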

Trading Metrics¶

Text Only

# Order counts
sigc_orders_total{strategy="momentum",status="filled"} 1523
sigc_orders_total{strategy="momentum",status="rejected"} 12
sigc_orders_total{strategy="momentum",status="cancelled"} 45

# Turnover
sigc_turnover_daily_pct{strategy="momentum"} 0.18
sigc_turnover_monthly_pct{strategy="momentum"} 0.85

# Transaction costs
sigc_transaction_costs_bps{strategy="momentum"} 8.5
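
Because `sigc_orders_total` is split by `status`, fill and rejection rates can be derived from the counts. The helper below is an illustrative sketch (`order_rates` is not a sigc function) that turns the sample values above into shares of all orders.

```python
def order_rates(counts):
    """Return each order status as a share of all orders."""
    total = sum(counts.values())
    if total == 0:
        return {status: 0.0 for status in counts}
    return {status: n / total for status, n in counts.items()}


rates = order_rates({"filled": 1523, "rejected": 12, "cancelled": 45})
```

These are lifetime shares. For a dashboard, the same ratio is usually computed over a time window instead, so that old orders do not mask a recent spike in rejections.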

System Metrics¶

Text Only

# Daemon status
sigc_daemon_uptime_seconds 302400
sigc_daemon_restarts_total 2

# Memory
sigc_memory_used_bytes 1073741824
sigc_memory_limit_bytes 4294967296

# CPU
sigc_cpu_usage_pct 15.2

# Computation
sigc_computation_duration_seconds{strategy="momentum"} 2.34
sigc_computation_last_success_timestamp{strategy="momentum"} 1705320000

Data Metrics¶

Text Only

# Data freshness
sigc_data_age_seconds{source="prices"} 900
sigc_data_rows_total{source="prices"} 2500000

# Cache
sigc_cache_hits_total 15234
sigc_cache_misses_total 523
sigc_cache_size_bytes 2147483648
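
The `_total` suffix suggests the cache metrics are cumulative counters, so a meaningful hit ratio comes from the difference between two scrapes rather than from the raw totals. A minimal sketch under that assumption (`hit_ratio` is an illustrative name, and the earlier reading in the usage line is made up):

```python
def hit_ratio(prev, curr):
    """Cache hit ratio over the interval between two scrapes.

    prev and curr are (hits_total, misses_total) counter readings.
    Returns None when there were no lookups or a counter was reset.
    """
    d_hits = curr[0] - prev[0]
    d_misses = curr[1] - prev[1]
    if d_hits < 0 or d_misses < 0:
        return None  # counter reset (for example after a daemon restart)
    lookups = d_hits + d_misses
    return d_hits / lookups if lookups else None


# Hypothetical earlier scrape compared with the sample values above
ratio = hit_ratio((15000, 500), (15234, 523))
```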

Alert Metrics¶

Text Only

# Alert counts
sigc_alerts_total{severity="critical"} 0
sigc_alerts_total{severity="high"} 2
sigc_alerts_total{severity="warning"} 15

# Circuit breakers (status: 0 = ok, 1 = tripped)
sigc_circuit_breaker_triggered_total{breaker="daily_loss"} 1
sigc_circuit_breaker_status{breaker="daily_loss"} 0

Grafana Dashboards¶

Import Dashboard¶

sigc provides pre-built Grafana dashboards:

Bash

# Export dashboard JSON
sigc monitoring export-dashboard > sigc-dashboard.json

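# Import to Grafana. The API normally requires authentication, and
# /api/dashboards/db expects the dashboard wrapped as
# {"dashboard": {...}, "overwrite": true}; wrap the export first if it is a
# bare dashboard object. GRAFANA_API_TOKEN is a placeholder for your own token.
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -d @sigc-dashboard.json \
  http://grafana:3000/api/dashboards/db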

Dashboard Panels¶

Overview Panel:

- Portfolio value over time
- Daily P&L
- Current drawdown
- Key metrics

Performance Panel:

- Cumulative returns
- Rolling Sharpe
- Drawdown chart
- Monthly returns heatmap

Positions Panel:

- Long/short counts
- Gross/net exposure
- Sector breakdown
- Top positions

Trading Panel:

- Orders by status
- Turnover trend
- Transaction costs
- Fill rates

System Panel:

- Memory/CPU usage
- Computation times
- Data freshness
- Error rates

Health Checks¶

Endpoints¶

YAML

monitoring:
  health:
    enabled: true
    port: 8080

Bash

# Liveness (is process running?)
curl http://localhost:8080/live
# Returns 200 OK or 503

# Readiness (is service ready?)
curl http://localhost:8080/ready
# Returns 200 OK or 503

# Health (detailed status)
curl http://localhost:8080/health
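
Deployment scripts often need to wait for the readiness endpoint before sending traffic or enabling trading. The sketch below uses only the Python standard library. `http_status` and `wait_until_ready` are illustrative helpers, and the retry count and delay are arbitrary defaults rather than sigc settings.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # non-2xx answers such as 503 arrive as HTTPError
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def wait_until_ready(get_status, attempts=10, delay=1.0):
    """Poll get_status() until it returns 200 or attempts run out."""
    for attempt in range(attempts):
        if get_status() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A typical call would be `wait_until_ready(lambda: http_status("http://localhost:8080/ready"))`. Passing the probe in as a callable keeps the retry loop testable without a running daemon.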

Health Response¶

JSON

{
  "status": "healthy",
  "timestamp": "2024-01-15T14:30:00Z",
  "checks": {
    "data_freshness": {
      "status": "healthy",
      "message": "Data is 15 minutes old",
      "last_check": "2024-01-15T14:29:55Z"
    },
    "broker_connection": {
      "status": "healthy",
      "message": "Connected to Alpaca",
      "latency_ms": 45
    },
    "memory": {
      "status": "healthy",
      "used_pct": 62,
      "limit_mb": 4096
    },
    "disk": {
      "status": "healthy",
      "used_pct": 45,
      "free_gb": 120
    }
  },
  "uptime_seconds": 302400
}
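
A script consuming this response usually only cares about the checks that are not healthy. The following sketch extracts them. It assumes any `status` other than `"healthy"` indicates a problem, which the example response does not confirm, and `failing_checks` is an illustrative name.

```python
import json


def failing_checks(payload):
    """Return {check_name: message} for every check that is not healthy."""
    report = json.loads(payload)
    failing = {}
    for name, check in report.get("checks", {}).items():
        if check.get("status") != "healthy":
            failing[name] = check.get("message", "")
    return failing
```

An empty dictionary means every listed check reported `healthy`.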

Custom Health Checks¶

YAML

monitoring:
  health:
    checks:
      - name: market_hours
        type: time_window
        start: "09:30"
        end: "16:00"
        timezone: America/New_York

      - name: data_complete
        type: custom
        script: /opt/sigc/scripts/check_data.sh
        timeout_seconds: 30
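
The `time_window` check presumably passes only while the local time in `timezone` lies between `start` and `end`. The sketch below shows that logic in Python. It is an assumption about the behavior, not sigc's implementation: whether `end` is inclusive, and how weekends and holidays are handled, is not stated here.

```python
from datetime import datetime, time, timedelta, timezone


def in_window(now, start, end, tz):
    """True if `now` (timezone-aware) falls in [start, end) local to tz."""
    local = now.astimezone(tz).time()
    return start <= local < end


# Fixed UTC-5 offset for illustration; zoneinfo.ZoneInfo("America/New_York")
# would also account for daylight saving time.
eastern_standard = timezone(timedelta(hours=-5))
now = datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc)  # 09:30 in UTC-5
print(in_window(now, time(9, 30), time(16, 0), eastern_standard))
```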

Logging¶

Structured Logging¶

YAML

monitoring:
  logging:
    format: json
    level: info
    file: /var/log/sigc/sigc.log

Log Format¶

JSON

{
  "timestamp": "2024-01-15T14:30:00.123Z",
  "level": "info",
  "component": "scheduler",
  "strategy": "momentum",
  "message": "Signal computation completed",
  "duration_ms": 2340,
  "positions": 80,
  "turnover_pct": 0.18
}
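
Structured logs are easy to query with a short script. The sketch below finds slow computations. It assumes the log file holds one JSON object per line (the record above is pretty-printed for readability), and `slow_runs` is an illustrative helper.

```python
import json


def slow_runs(lines, threshold_ms):
    """Yield (strategy, duration_ms) for log records slower than threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        duration = record.get("duration_ms")
        if isinstance(duration, (int, float)) and duration > threshold_ms:
            yield record.get("strategy"), duration
```

For example, `list(slow_runs(open("/var/log/sigc/sigc.log"), 2000))` would list every run in the configured log file that took longer than two seconds.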

Log Levels¶

Level	Description
`debug`	Detailed debugging info
`info`	Normal operations
`warn`	Warning conditions
`error`	Error conditions

Log Aggregation¶

Send logs to external systems:

YAML

monitoring:
  logging:
    sinks:
      - type: elasticsearch
        url: http://elasticsearch:9200
        index: sigc-logs

      - type: cloudwatch
        region: us-east-1
        log_group: /sigc/production

Alertmanager Integration¶

Configure Alertmanager¶

YAML

# alertmanager.yml
route:
  group_by: ['alertname', 'strategy']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
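  - name: 'default'
    slack_configs:
      # Requires slack_api_url in the global section, or api_url here
      - channel: '#trading-alerts'

  - name: 'pagerduty'
    pagerduty_configs:
      # Alertmanager does not expand environment variables itself; render
      # ${PAGERDUTY_KEY} (for example with envsubst) before loading this file.
      - service_key: ${PAGERDUTY_KEY}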

Alert Rules¶

YAML

# sigc.rules.yml
groups:
  - name: sigc
    rules:
      - alert: SigcHighDrawdown
        expr: sigc_drawdown_pct > 0.10
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "High drawdown: {{ $value | humanizePercentage }}"

      - alert: SigcDaemonDown
        expr: up{job="sigc"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "sigc daemon is down"

      - alert: SigcDataStale
        expr: sigc_data_age_seconds > 7200
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "Data is {{ $value | humanizeDuration }} old"
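
The `for:` clause means a rule fires only after its expression has stayed true across every evaluation for that long; a single dip resets the timer. The following is a simplified model of that behavior for building intuition, not Prometheus's actual implementation, and `fires` is an illustrative name.

```python
def fires(samples, threshold, hold_seconds):
    """Return True if value > threshold held continuously for hold_seconds.

    samples is a time-ordered list of (unix_timestamp, value) evaluations.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= hold_seconds:
                return True
        else:
            pending_since = None  # condition cleared, so the timer resets
    return False
```

With `threshold=0.10` and `hold_seconds=300`, a drawdown that stays at 12% for five minutes fires, while one that briefly recovers to 8% in the middle does not.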

Tracing¶

OpenTelemetry Integration¶

YAML

monitoring:
  tracing:
    enabled: true
    exporter: jaeger
    endpoint: http://jaeger:14268/api/traces
    sample_rate: 0.1

Trace Spans¶

Text Only

[signal_computation] 2.34s
├─[load_data] 0.85s
│  ├─[fetch_prices] 0.72s
│  └─[fetch_fundamentals] 0.12s
├─[compute_signals] 1.12s
│  ├─[momentum] 0.45s
│  └─[value] 0.67s
└─[generate_weights] 0.37s
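
When reading a trace like this, the useful number is often a span's self time: its duration minus the time spent in its direct children. The sketch below computes it for the tree above. The tuple layout and `self_times` are illustrative and assume span names are unique.

```python
def self_times(span):
    """Map each span name to its duration minus its direct children's."""
    name, duration, children = span
    result = {name: round(duration - sum(c[1] for c in children), 6)}
    for child in children:
        result.update(self_times(child))
    return result


# The example trace as (name, seconds, children) tuples
trace = ("signal_computation", 2.34, [
    ("load_data", 0.85, [
        ("fetch_prices", 0.72, []),
        ("fetch_fundamentals", 0.12, []),
    ]),
    ("compute_signals", 1.12, [
        ("momentum", 0.45, []),
        ("value", 0.67, []),
    ]),
    ("generate_weights", 0.37, []),
])
```

Here nearly all time sits in the leaf spans, so `fetch_prices` and `value` are the first places to look when computation slows down.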

Best Practices¶

1. Set Up Dashboards Early¶

Create dashboards before going live so you have a baseline of normal behavior to compare against when something goes wrong.

2. Alert on Symptoms, Not Causes¶

YAML

# Good: Alert on outcome
- alert: SigcHighDrawdown
  expr: sigc_drawdown_pct > 0.10

# Less useful: Alert on potential cause
- alert: SigcHighMemory
  expr: sigc_memory_used_bytes / sigc_memory_limit_bytes > 0.80

3. Use Recording Rules¶

Pre-compute expensive queries:

YAML

groups:
  - name: sigc_recording
    rules:
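      # sqrt() takes an instant vector, so the constant is wrapped in
      # vector() and converted back with scalar()
      - record: sigc:sharpe_ratio_30d
        expr: |
          (avg_over_time(sigc_return_daily[30d]) * 252) /
          (stddev_over_time(sigc_return_daily[30d]) * scalar(sqrt(vector(252))))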
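
The rule annualizes the ratio of mean to standard deviation of daily returns, which reduces to mean / stddev × √252. The Python sketch below reproduces that arithmetic for cross-checking. It assumes a zero risk-free rate and one return per trading day, whereas the PromQL version averages every scraped sample in the window. `annualized_sharpe` is an illustrative name.

```python
import math
import statistics


def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate."""
    if len(daily_returns) < 2:
        raise ValueError("need at least two returns")
    mean = statistics.fmean(daily_returns)
    stdev = statistics.pstdev(daily_returns)  # population, like stddev_over_time
    if stdev == 0:
        raise ValueError("returns have zero variance")
    return (mean * periods_per_year) / (stdev * math.sqrt(periods_per_year))
```

Comparing this value against the recorded series for the same period is a quick way to confirm the rule behaves as intended.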

4. Retain Data Appropriately¶

Bash

# Retention is normally configured with a command-line flag at startup
prometheus --config.file=prometheus.yml \
  --storage.tsdb.retention.time=90d  # Keep 90 days

5. Test Monitoring¶

Bash

sigc monitoring test

Verify all metrics are being collected.

Next Steps¶

Alerting - Notification setup
Safety Systems - Circuit breakers
Daemon Mode - Running as service