Metrics & Telemetry Reference¶
Complete reference for CheckStream observability.
Prometheus Metrics¶
All metrics use the checkstream_ prefix by default.
Request Metrics¶
# Total requests processed
checkstream_requests_total{tenant="default",status="success|blocked|error"} counter
# Request latency histogram (le buckets; quantiles computed at query time)
checkstream_request_latency_ms{tenant="default"} histogram
# Active connections
checkstream_connections_active{tenant="default"} gauge
# Bytes transferred
checkstream_bytes_total{tenant="default",direction="in|out"} counter
Phase Metrics¶
# Phase latency
checkstream_phase_latency_ms{phase="ingress|midstream|egress"} histogram
# Phase decisions
checkstream_phase_decisions_total{phase="ingress|midstream|egress",decision="allow|block|redact"} counter
# Tokens processed (midstream)
checkstream_tokens_processed_total{tenant="default"} counter
checkstream_tokens_redacted_total{tenant="default"} counter
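The two token counters combine naturally into a midstream redaction rate; a hedged PromQL example, using only the metrics listed above:

```promql
# Fraction of midstream tokens redacted over the last 5 minutes
sum(rate(checkstream_tokens_redacted_total[5m]))
  / sum(rate(checkstream_tokens_processed_total[5m]))
```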
Classifier Metrics¶
# Classifier calls
checkstream_classifier_calls_total{classifier="toxicity",result="positive|negative"} counter
# Classifier latency
checkstream_classifier_latency_ms{classifier="toxicity",tier="A|B|C"} histogram
# Classifier errors
checkstream_classifier_errors_total{classifier="toxicity",error="timeout|model_error"} counter
# Classifier score distribution (le buckets from the score_buckets config)
checkstream_classifier_scores{classifier="toxicity"} histogram
Policy Metrics¶
# Policy triggers
checkstream_policy_triggers_total{policy="default",rule="block_toxicity",action="stop|redact|log"} counter
# Policy evaluation time
checkstream_policy_eval_latency_ms{policy="default"} histogram
# Shadow mode triggers (testing)
checkstream_shadow_triggers_total{policy="default",rule="test_rule"} counter
Model Metrics¶
# Model memory usage
checkstream_model_memory_bytes{model="toxicity",device="cpu|cuda"} gauge
# Model inference time
checkstream_model_inference_ms{model="toxicity"} histogram
# Model cache hits/misses
checkstream_model_cache_hits_total{model="toxicity"} counter
checkstream_model_cache_misses_total{model="toxicity"} counter
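The hit and miss counters combine into a cache hit ratio; for example:

```promql
# Model cache hit ratio over the last 5 minutes
sum(rate(checkstream_model_cache_hits_total[5m]))
  / (sum(rate(checkstream_model_cache_hits_total[5m]))
     + sum(rate(checkstream_model_cache_misses_total[5m])))
```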
Backend Metrics¶
# Backend request latency
checkstream_backend_latency_ms{backend="openai"} histogram
# Backend errors
checkstream_backend_errors_total{backend="openai",error="timeout|5xx|connection"} counter
# Backend retry attempts
checkstream_backend_retries_total{backend="openai"} counter
System Metrics¶
# Process metrics
checkstream_process_cpu_seconds_total counter
checkstream_process_resident_memory_bytes gauge
checkstream_process_open_fds gauge
# Runtime metrics
checkstream_runtime_threads gauge
checkstream_runtime_tasks_active gauge
Metric Labels¶
Common Labels¶
| Label | Description | Values |
|---|---|---|
| `tenant` | Tenant identifier | String |
| `phase` | Pipeline phase | `ingress`, `midstream`, `egress` |
| `classifier` | Classifier name | String |
| `tier` | Classifier tier | `A`, `B`, `C` |
| `status` | Request status | `success`, `blocked`, `error` |
| `action` | Policy action | `stop`, `redact`, `log`, `inject` |
Quantile Labels¶
Latency and score metrics are exported as Prometheus histograms: the server exposes `_bucket` series with `le` labels, and quantiles are computed at query time with `histogram_quantile()`. Commonly used quantiles:

| Quantile | Meaning |
|---|---|
| 0.5 | Median (50th percentile) |
| 0.9 | 90th percentile |
| 0.95 | 95th percentile |
| 0.99 | 99th percentile |
Configuration¶
Enable Metrics¶
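A minimal sketch of turning the metrics endpoint on, following the `telemetry` block shape used throughout this page; the `port` and `path` keys are illustrative assumptions, not confirmed by this reference:

```yaml
telemetry:
  metrics:
    enabled: true
    # Illustrative keys: where the Prometheus scrape endpoint is exposed
    port: 9090
    path: /metrics
```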
Custom Labels¶
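A hedged sketch of stamping constant labels onto every exported series; the `labels` key is an assumption, not confirmed elsewhere in this reference:

```yaml
telemetry:
  metrics:
    # Assumed key: constant labels added to every metric
    labels:
      environment: production
      region: us-east-1
```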
Histogram Buckets¶
telemetry:
  metrics:
    latency_buckets: [1, 2, 5, 10, 25, 50, 100, 250, 500, 1000]
    score_buckets: [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
Grafana Dashboards¶
Overview Dashboard¶
Key panels (example queries below):

- Requests per second (by status)
- P95 latency by phase
- Policy triggers over time
- Error rate
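Example PromQL behind the first two panels, following standard Prometheus conventions (`_bucket` is the histogram series suffix):

```promql
# Requests per second, by status
sum by (status) (rate(checkstream_requests_total[5m]))

# P95 latency by phase
histogram_quantile(0.95,
  sum by (phase, le) (rate(checkstream_phase_latency_ms_bucket[5m])))
```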
Classifier Dashboard¶
Key panels (example queries below):

- Classifier latency by tier
- Score distribution
- Positive rate over time
- Cache hit ratio
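Example queries for the latency and positive-rate panels:

```promql
# P95 classifier latency, by tier
histogram_quantile(0.95,
  sum by (tier, le) (rate(checkstream_classifier_latency_ms_bucket[5m])))

# Positive rate over time
sum(rate(checkstream_classifier_calls_total{result="positive"}[5m]))
  / sum(rate(checkstream_classifier_calls_total[5m]))
```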
Compliance Dashboard¶
Key panels:

- Blocks by regulation
- Audit records created
- Shadow mode triggers (example query below)
- Policy coverage
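The shadow-trigger panel maps directly onto the counter documented under Policy Metrics; the regulation and audit-record panels draw on audit data, since no metric above carries a regulation label:

```promql
# Shadow mode triggers, by policy and rule
sum by (policy, rule) (rate(checkstream_shadow_triggers_total[5m]))
```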
Alerting Rules¶
High Latency¶
# Prometheus alerting rule
groups:
  - name: checkstream
    rules:
      - alert: HighIngressLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(checkstream_phase_latency_ms_bucket{phase="ingress"}[5m]))) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High ingress latency detected"
High Error Rate¶
- alert: HighErrorRate
  expr: |
    sum(rate(checkstream_requests_total{status="error"}[5m])) /
    sum(rate(checkstream_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
Classifier Failure¶
- alert: ClassifierFailure
  expr: rate(checkstream_classifier_errors_total[5m]) > 0
  for: 1m
  labels:
    severity: critical
Backend Unavailable¶
- alert: BackendUnavailable
  expr: rate(checkstream_backend_errors_total{error="connection"}[1m]) > 5
  for: 1m
  labels:
    severity: critical
Structured Logging¶
Log Format¶
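A hedged sketch following the same `telemetry` shape used elsewhere on this page; the `logging` keys are illustrative:

```yaml
telemetry:
  logging:
    format: json   # illustrative: structured JSON matching the fields below
    level: info
```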
Log Fields¶
{
"timestamp": "2024-01-15T10:30:00.123Z",
"level": "INFO",
"message": "Request processed",
"request_id": "req-abc123",
"tenant": "default",
"phase": "ingress",
"latency_ms": 3.5,
"decision": "allow",
"classifiers": {
"toxicity": 0.12,
"prompt_injection": 0.05
}
}
Log Levels¶
| Level | Description | Use Case |
|---|---|---|
| `error` | Errors requiring attention | Failures, exceptions |
| `warn` | Warning conditions | Policy triggers, slow queries |
| `info` | Normal operations | Requests, decisions |
| `debug` | Detailed debugging | Classifier scores, timing |
| `trace` | Very detailed tracing | Token-level processing |
Audit Trail¶
Configuration¶
telemetry:
  audit:
    enabled: true
    path: "./audit"
    rotation: daily
    retention_days: 90
    hash_chain: true
    include:
      - request_id
      - timestamp
      - tenant
      - input_hash
      - output_hash
      - classifiers
      - actions
      - regulations
Audit Record Format¶
{
"id": "audit-2024011510300001",
"timestamp": "2024-01-15T10:30:00.123Z",
"previous_hash": "sha256:abc123...",
"hash": "sha256:def456...",
"request_id": "req-abc123",
"tenant": "acme-corp",
"input_hash": "sha256:input...",
"output_hash": "sha256:output...",
"classifiers": [
{"name": "toxicity", "score": 0.82, "tier": "B"}
],
"actions": [
{"type": "redact", "rule": "redact_toxic", "target": "midstream"}
],
"regulations": ["internal-policy"],
"latency_ms": 5.2
}
Hash Chain Verification¶
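Verification walks the chain record by record, checking that each record's previous_hash matches the hash of the record before it, and reports a summary: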
{
"status": "valid",
"records_verified": 50000,
"chain_intact": true,
"gaps": [],
"first_record": "2024-01-01T00:00:01Z",
"last_record": "2024-01-31T23:59:59Z"
}
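The linkage check itself is simple; a minimal Python sketch, assuming records are stored one JSON object per line under the configured audit path (the file name here is hypothetical):

```python
import json

def verify_chain(records: list[dict]) -> dict:
    """Check hash-chain linkage across an ordered list of audit records.

    Only the linkage is checked: each record's `previous_hash` must equal
    the `hash` of the record before it. How `hash` itself is computed is
    an implementation detail this sketch assumes away.
    """
    gaps = [
        curr["id"]
        for prev, curr in zip(records, records[1:])
        if curr["previous_hash"] != prev["hash"]
    ]
    return {
        "status": "valid" if not gaps else "broken",
        "records_verified": len(records),
        "chain_intact": not gaps,
        "gaps": gaps,
    }

# Example: verify one day's audit file (JSON-lines layout assumed)
with open("./audit/2024-01-15.jsonl") as f:
    report = verify_chain([json.loads(line) for line in f])
print(report)
```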
OpenTelemetry Integration¶
Configuration¶
telemetry:
  opentelemetry:
    enabled: true
    endpoint: "http://otel-collector:4317"
    service_name: "checkstream"
    traces:
      enabled: true
      sampling_rate: 0.1
    metrics:
      enabled: true
      interval_seconds: 60
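With sampling_rate: 0.1, roughly one request in ten is traced end to end; lower it as traffic grows to keep collector load bounded.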
Trace Spans¶
checkstream.request
├── checkstream.ingress
│   ├── checkstream.classifier.toxicity
│   └── checkstream.classifier.prompt_injection
├── checkstream.backend.request
├── checkstream.midstream
│   └── checkstream.classifier.toxicity (per chunk)
└── checkstream.egress
    └── checkstream.audit
Health Metrics¶
Endpoint¶
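This reference does not pin down the exact route; the path below is an assumption consistent with the `"status": "ready"` field in the response:

```
GET /health/ready
```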
Response¶
{
"status": "ready",
"uptime_seconds": 86400,
"checks": {
"classifiers": {
"status": "healthy",
"loaded": 5,
"failed": 0
},
"policies": {
"status": "healthy",
"loaded": 3,
"rules": 25
},
"backend": {
"status": "healthy",
"latency_ms": 150
},
"audit": {
"status": "healthy",
"records_today": 12500
}
}
}
Best Practices¶
- Set appropriate retention - balance storage cost against compliance needs
- Use sampling - reduce trace volume in high-traffic systems
- Alert on P95, not P50 - catch the tail-latency issues that medians hide
- Monitor classifier accuracy - track false-positive rates over time
- Verify the hash chain regularly - schedule periodic integrity checks
- Build a dashboard for each audience - ops, compliance, and developers need different views
Next Steps¶
- Deployment Guide - Production monitoring setup
- API Reference - Query metrics and audit data
- Compliance Guide - Regulatory audit requirements