Troubleshooting¶
This guide covers common issues encountered when operating MPL, organized by symptom. Each problem includes diagnosis steps, solutions, and prevention measures.
Diagnostic Commands¶
Before diving into specific issues, use these commands to assess the current state of your MPL deployment:
# Check proxy health
curl http://localhost:9443/health
# Check metrics endpoint
curl http://localhost:9100/metrics | grep mpl_
# List registered schemas
mpl schemas list
# Check proxy logs (increase verbosity)
RUST_LOG=debug mpl proxy http://mcp-server:8080
# Test upstream connectivity
curl http://mcp-server:8080/health
Proxy Won't Start¶
Port Conflict¶
Diagnosis:
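A port conflict usually surfaces as a bind error ("address already in use") at startup. Find the process that already holds the listen port, as the Quick Reference table below also suggests:

```bash
# Which process is already listening on the proxy's default port?
lsof -i :9443
```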
Solution:
# Option 1: Stop the conflicting process
kill <PID>
# Option 2: Use a different port
mpl proxy http://mcp-server:8080 --listen 0.0.0.0:9444
Prevention: Use a dedicated port range for MPL services in your infrastructure. Document port assignments:
| Port | Service |
|---|---|
| 9443 | MPL proxy |
| 9100 | Prometheus metrics |
| 9080 | Built-in dashboard |
Missing or Invalid Configuration¶
Symptoms
Diagnosis:
# Verify config file exists
ls -la mpl-config.yaml
# Validate YAML syntax
python3 -c "import yaml; yaml.safe_load(open('mpl-config.yaml'))"
# Check required fields
grep -E "^(upstream|listen):" mpl-config.yaml
Solution:
Ensure your configuration has all required fields:
# Minimum viable mpl-config.yaml
upstream: "http://mcp-server:8080" # Required
listen: "0.0.0.0:9443" # Required
mode: learning # Optional, defaults to transparent
Prevention: Use the --config flag explicitly and keep a validated template in version control.
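For example (a sketch; assumes `--config` takes the path to the YAML file):

```bash
# Start the proxy with an explicit, version-controlled configuration file
mpl proxy http://mcp-server:8080 --config mpl-config.yaml
```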
Upstream Unreachable¶
Symptoms
Diagnosis:
# Test upstream connectivity
curl -v http://mcp-server:8080/health
# Check DNS resolution
nslookup mcp-server
# or
dig mcp-server
# Check network path
traceroute mcp-server
Solution:
# Verify the upstream URL is correct
mpl proxy http://correct-host:correct-port
# If using Docker, ensure services are on the same network
docker network ls
docker network inspect <network-name>
Prevention: Add the upstream health check to your deployment pipeline. The proxy will start in degraded mode if the upstream is temporarily unavailable, and reconnect when it becomes healthy.
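A minimal gate you could add before starting the proxy, assuming the upstream exposes the same `/health` endpoint used throughout this guide:

```bash
# Block (up to 60 seconds) until the upstream reports healthy
timeout 60 bash -c 'until curl -sf http://mcp-server:8080/health; do sleep 2; done'
```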
Schema Validation Errors¶
Unknown SType¶
Symptoms
Diagnosis:
# List registered SType mappings
mpl schemas list
# Check if the tool is mapped in config
grep "custom.tool.name" mpl-config.yaml
# Check metrics for unknown SType rate
curl -s http://localhost:9100/metrics | grep mpl_unknown_stype_total
Solution:
Add the missing mapping to your configuration:
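A sketch of such a mapping; the `stype_mappings` key name and the target SType are illustrative:

```yaml
# mpl-config.yaml -- illustrative tool-to-SType mapping
stype_mappings:
  "custom.tool.name": "com.acme.custom.Result.v1"   # SType must exist in the registry
```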
Or enable learning mode to auto-detect:
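For example, using the `--learn` flag referenced in the Quick Reference table:

```bash
# Observe live traffic and auto-detect SType mappings for unmapped tools
mpl proxy http://mcp-server:8080 --learn
```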
Prevention: Maintain a comprehensive tool-to-SType mapping. Run in learning mode periodically to detect new tools.
Schema Mismatch¶
Symptoms
Diagnosis:
# View the current schema
mpl schemas show org.calendar.Event.v1 --format json
# Compare with actual payload (from logs)
RUST_LOG=debug mpl proxy http://mcp-server:8080
# Look for "payload" field in log entries
# Validate a specific payload against a schema
echo '{"title": "Test", "amount": "100"}' | mpl validate --stype com.acme.finance.Transaction.v3
Solution:
- If the schema is wrong (too strict or outdated), update the schema definition in the registry.
- If the payload is wrong (client sending bad data):
    - Fix the client to send conforming payloads
    - Switch the SType to `warn` mode while fixing: `enforcement.overrides[].mode: warn` (see the sketch below)
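A sketch of that override, assuming the `enforcement.overrides` list lives in `mpl-config.yaml` and each entry is keyed by SType (the `stype` field name is an assumption):

```yaml
# mpl-config.yaml -- temporarily downgrade one SType to warn-only enforcement
enforcement:
  overrides:
    - stype: "org.calendar.Event.v1"   # field name illustrative
      mode: warn                       # log violations instead of blocking
```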
Prevention: Run schema validation in CI/CD pipelines. Keep schemas in version control and review changes.
additionalProperties Violations¶
Diagnosis:
This occurs when a payload contains fields not declared in the schema. All MPL schemas require `additionalProperties: false`.
# See what fields the schema expects
mpl schemas show org.calendar.Event.v1 --format json | jq '.properties | keys'
Solution:
-   If the extra field is legitimate, add it to the schema as an optional property:

        {
          "properties": {
            "existingField": { "type": "string" },
            "newField": { "type": "string", "description": "Newly added optional field" }
          }
        }

    Version Impact: Adding a new optional field is a minor version change. Update `metadata.json` accordingly.

-   If the extra field is unwanted, fix the client to stop sending it.
QoM Breaches¶
Threshold Too Strict¶
Symptoms
Diagnosis:
# Check current profile thresholds
mpl profiles show qom-strict-argcheck
# Check QoM score distribution
curl -s http://localhost:9100/metrics | grep 'mpl_qom_score{' | grep instruction_compliance
# Review recent QoM reports in logs
RUST_LOG=debug mpl proxy http://mcp-server:8080
Solution:
- Relax the threshold for the specific metric
- Switch to a less strict profile
- Use per-SType profiles (all three options are sketched below)
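An illustrative combination of all three options in `mpl-config.yaml`; the `thresholds` and `overrides` keys are assumptions and may be named differently in your version:

```yaml
# mpl-config.yaml -- illustrative QoM tuning (threshold/override key names assumed)
profile: "qom-basic"                   # option 2: switch to a less strict profile globally
qom:
  thresholds:
    instruction_compliance: 0.7        # option 1: relax a single metric
  overrides:
    - stype: "org.calendar.Event.v1"
      profile: "qom-strict-argcheck"   # option 3: keep a stricter profile per SType
```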
Prevention: Start with qom-basic and progressively tighten. Monitor score distributions before raising thresholds.
Assertion Failures¶
Symptoms
Diagnosis:
# View assertions for the SType
cat registry/stypes/org/calendar/Event/v1/assertions.cel
# Test an assertion manually
mpl validate --stype org.calendar.Event.v1 --payload '{"title":"Test","start":"2025-01-15T10:00:00Z","end":"2025-01-15T09:00:00Z"}'
Solution:
- If the assertion is correct, fix the payload (the data violates a business rule)
- If the assertion is too strict, update the CEL expression (see the illustrative example below)
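For reference, an assertion of the kind the example payload above would violate might look like this (illustrative CEL; how MPL binds payload fields into the expression environment is an assumption):

```cel
// assertions.cel -- illustrative: an event must end after it starts
timestamp(end) > timestamp(start)
```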
Missing Context¶
Symptoms
Diagnosis:
These metrics require additional context that may not be available in all deployments:
- `context_grounding`: Requires reference context to compare against
- `provenance_completeness`: Requires a provenance chain (A2A mode)
Solution:
If you are not using features that provide this context, switch to a profile that does not require them:
# Use a profile that only measures available metrics
profile: "qom-basic" # Only schema_fidelity and instruction_compliance
Or configure the QoM engine to skip unavailable metrics:
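A sketch, assuming the QoM section accepts an exclusion list (the `skip_metrics` key name is illustrative):

```yaml
# mpl-config.yaml -- illustrative: exclude metrics whose context is unavailable
qom:
  skip_metrics:
    - context_grounding
    - provenance_completeness
```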
Connection Issues¶
Timeout Configuration¶
Symptoms
Diagnosis:
# Check current timeout settings
grep timeout mpl-config.yaml
# Test upstream response time directly
time curl http://mcp-server:8080/slow-endpoint
Solution:
Adjust timeout settings in configuration:
# mpl-config.yaml
timeouts:
connect: 5s # Time to establish connection to upstream
request: 60s # Total time for request/response cycle
idle: 300s # Keep-alive idle timeout
handshake: 10s # AI-ALPN negotiation timeout
For specific slow tools, consider per-route timeouts:
timeouts:
request: 30s # Default
overrides:
- tool: "data.analysis.run"
request: 300s # 5 minutes for long-running analysis
WebSocket vs HTTP¶
Symptoms
Diagnosis:
Solution:
Configure the proxy transport mode to match your MCP server:
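A sketch, assuming a top-level `transport` key (the key name and accepted values are illustrative):

```yaml
# mpl-config.yaml -- illustrative: pin the transport instead of auto-detecting
transport: websocket   # or: http
```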
Transport Detection: The proxy attempts to auto-detect the transport mode. If auto-detection fails, set it explicitly in the configuration.
DNS Resolution¶
Diagnosis:
# Test DNS resolution
nslookup mcp-server
dig mcp-server
# Check /etc/resolv.conf
cat /etc/resolv.conf
# Try with IP address directly
curl http://10.0.1.5:8080/health
Solution:
# Use IP address instead of hostname
mpl proxy http://10.0.1.5:8080
# Or add to /etc/hosts
echo "10.0.1.5 mcp-server" >> /etc/hosts
# For Docker: ensure services are on the same network
docker network connect mpl-network mcp-server
Performance Issues¶
High Latency¶
Symptoms
Diagnosis:
# Check latency breakdown
curl -s http://localhost:9100/metrics | grep mpl_proxy_latency_seconds
# Check schema cache hit rate
curl -s http://localhost:9100/metrics | grep mpl_cache
# Check if registry is remote
grep registry mpl-config.yaml
# Profile with debug logging
RUST_LOG=mpl_proxy::timing=debug mpl proxy http://mcp-server:8080
Solution:
| Cause | Fix |
|---|---|
| Remote registry | Switch to local file registry: registry: "file://./registry" |
| Schema cache miss | Pre-warm cache: mpl schemas preload |
| Large schemas | Simplify deeply nested schemas; split into sub-schemas |
| QoM evaluation slow | Use qom-basic profile (fewer metrics to compute) |
| CEL assertions complex | Simplify assertion expressions; reduce assertion count |
# mpl-config.yaml - performance tuning
registry: "file://./registry" # Local, not remote
cache:
max_schemas: 1000 # Increase cache size
ttl: 3600s # Cache for 1 hour
qom:
timeout: 50ms # Cap QoM evaluation time
High Memory Usage¶
Symptoms
Diagnosis:
# Check memory usage
ps aux | grep mpl
# or for containers:
docker stats mpl-proxy
# Check registry size
du -sh registry/
find registry/ -name "*.json" | wc -l
# Check number of cached schemas
curl -s http://localhost:9100/metrics | grep mpl_cache_size
Solution:
# mpl-config.yaml - memory tuning
cache:
max_schemas: 100 # Limit cached schemas (default: unlimited)
eviction: lru # Evict least-recently-used
learning:
max_samples_per_tool: 1000 # Limit learning buffer
flush_interval: 60s # Write to disk more frequently
For container deployments, set appropriate resource limits:
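For example, with Docker Compose (a sketch; tune limits to your workload and registry size):

```yaml
# docker-compose.yml -- illustrative resource limits for the proxy container
services:
  mpl-proxy:
    # image/command omitted; keep whatever your existing deployment uses
    mem_limit: 512m
    cpus: 1.0
```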
SDK Errors¶
Connection Refused¶
Symptoms
Diagnosis:
# Check if proxy is running
curl http://localhost:9443/health
# Check if port is open
nc -zv localhost 9443
# Check proxy process
ps aux | grep mpl
Solution:
- Ensure the proxy is running
- Verify the SDK is pointing at the correct address
- For containerized deployments, use the service name instead of `localhost` (see the sketch below)
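A sketch of the last two checks using the Python SDK constructor shown in the SDK Timeout section below; the import path and the `mpl-proxy` service name are assumptions:

```python
from mpl import Client  # import path is illustrative; use your SDK's actual module

# Local development: point at the address the proxy listens on
client = Client("http://localhost:9443")

# Containerized: use the Compose/Kubernetes service name instead of localhost
client = Client("http://mpl-proxy:9443")
```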
Negotiation Failures¶
Symptoms
Diagnosis:
# Check what the server supports
curl http://localhost:9443/health | jq '.capabilities'
# Check proxy logs for handshake details
RUST_LOG=mpl_proxy::handshake=debug mpl proxy http://mcp-server:8080
Solution:
Ensure the STypes and profiles you request are registered:
# Request only STypes that are registered
session = await client.negotiate(
stypes=["org.calendar.Event.v1"], # Must be in registry
profile="qom-basic" # Must be a known profile
)
Check the registry:
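Using the same CLI commands shown earlier in this guide:

```bash
# Confirm the SType is registered
mpl schemas list
# Confirm the requested profile exists
mpl profiles show qom-basic
```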
SDK Timeout¶
Diagnosis:
# Check if proxy is responding
time curl http://localhost:9443/health
# Check upstream latency
time curl http://mcp-server:8080/health
Solution:
Configure SDK timeouts:
# Python SDK
client = Client(
"http://localhost:9443",
timeout=60.0, # Increase from default 30s
connect_timeout=5.0
)
// TypeScript SDK
const client = new MplClient('http://localhost:9443', {
timeout: 60000, // 60 seconds
connectTimeout: 5000, // 5 seconds
});
Log Analysis¶
Understanding Structured Logs¶
MPL outputs structured JSON logs. Control verbosity per module with `RUST_LOG`, then look for the key fields listed below when diagnosing issues:
# Filter for errors only
RUST_LOG=error mpl proxy http://mcp-server:8080
# Filter for a specific module
RUST_LOG=mpl_proxy::validation=debug mpl proxy http://mcp-server:8080
# Combine levels
RUST_LOG=info,mpl_proxy::validation=debug,mpl_proxy::qom=debug mpl proxy http://mcp-server:8080
Key Log Fields¶
| Field | When to Check | What It Tells You |
|---|---|---|
| `request_id` | Tracing a specific request | Correlate across log lines |
| `stype` | Validation or QoM issues | Which SType is affected |
| `errors` | Validation failures | Exact schema violations |
| `qom_scores` | QoM breaches | Which metrics failed |
| `latency_ms` | Performance issues | Where time is spent |
| `upstream_status` | Upstream errors | Server response code |
| `sem_hash` | Audit trail | Content fingerprint |
| `provenance` | Multi-hop issues | Agent chain |
Log Aggregation¶
For production, pipe structured logs to your log aggregation system:
# JSON output to stdout (default)
mpl proxy http://mcp-server:8080 2>&1 | jq .
# Pipe to a file for later analysis
mpl proxy http://mcp-server:8080 2>> /var/log/mpl/proxy.jsonl
# Filter specific issues from logs
cat /var/log/mpl/proxy.jsonl | jq 'select(.level == "ERROR")'
cat /var/log/mpl/proxy.jsonl | jq 'select(.stype == "org.calendar.Event.v1")'
cat /var/log/mpl/proxy.jsonl | jq 'select(.latency_ms > 50)'
Quick Reference¶
| Problem | First Check | Quick Fix |
|---|---|---|
| Proxy won't start | `lsof -i :9443` | Use `--listen 0.0.0.0:9444` |
| Validation errors | `mpl schemas show <stype>` | Switch to `mode: warn` |
| QoM breaches | Check profile thresholds | Switch to `qom-basic` |
| High latency | Registry location | Use `file://./registry` |
| SDK connection | `curl localhost:9443/health` | Verify proxy is running |
| Unknown SType | `mpl schemas list` | Add mapping or enable `--learn` |
| Memory growth | `du -sh registry/` | Set `cache.max_schemas` |
| Timeout | `time curl upstream:8080` | Increase `timeouts.request` |
Getting Help¶
If you cannot resolve an issue with this guide:
- Check metrics: `curl http://localhost:9100/metrics | grep mpl_` for quantitative state
- Enable debug logs: `RUST_LOG=debug` for detailed request tracing
- Check the dashboard: `http://localhost:9080` for a visual overview
- Reproduce with minimal config: strip down to the simplest configuration that exhibits the issue
- Report: include the proxy version (`mpl --version`), configuration (redact secrets), relevant logs, and metrics
Next Steps¶
- Monitoring & Metrics -- Set up proactive alerting
- Existing Infrastructure -- Migration-specific troubleshooting
- Concepts: Integration Modes -- Understanding deployment models