Model Loading Guide¶
Load and manage ML models in CheckStream using Candle and HuggingFace Hub.
Overview¶
CheckStream uses the Candle ML framework for inference, supporting:
- HuggingFace Hub integration
- SafeTensors and PyTorch formats
- CPU, CUDA, and Metal acceleration
- INT8/INT4 quantization
- Lazy loading with caching
Loading from HuggingFace¶
Basic Configuration¶
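A minimal sketch of pulling a model from the Hub, using the same `classifiers.*.model` keys shown throughout this guide:

```yaml
classifiers:
  toxicity:
    tier: B
    type: ml
    model:
      source: huggingface
      repo: "unitary/toxic-bert"
```

With no `revision` specified, the latest commit on the default branch is used.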
With Specific Revision¶
```yaml
classifiers:
  toxicity:
    model:
      source: huggingface
      repo: "unitary/toxic-bert"
      revision: "v1.0.0"       # Tag
      # revision: "abc123def"  # Commit hash
```
Private Repositories¶
```yaml
classifiers:
  custom_model:
    model:
      source: huggingface
      repo: "company/private-model"
      token_env: "HF_TOKEN"  # Environment variable holding the token
```
Loading Local Models¶
SafeTensors Format (Recommended)¶
```yaml
classifiers:
  local_classifier:
    tier: B
    type: ml
    model:
      source: local
      path: "./models/my-classifier"
      weights: "model.safetensors"
      config: "config.json"
```
PyTorch Format¶
```yaml
classifiers:
  pytorch_model:
    model:
      source: local
      path: "./models/pytorch-model"
      weights: "pytorch_model.bin"
      config: "config.json"
      format: pytorch
```
Required Files¶
```
models/my-classifier/
├── config.json              # Model architecture config
├── model.safetensors        # Weights (or pytorch_model.bin)
├── tokenizer_config.json    # Tokenizer config
├── vocab.txt                # Vocabulary (BERT-style)
└── special_tokens_map.json  # Special token mapping
```
Device Selection¶
Automatic Selection¶
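When no device is specified, CheckStream picks the best available backend. A hedged sketch, assuming `auto` is the accepted value:

```yaml
classifiers:
  toxicity:
    device: auto  # Prefers CUDA, then Metal, falling back to CPU
```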
Explicit Device¶
```yaml
classifiers:
  # CPU only
  pattern_classifier:
    device: cpu

  # NVIDIA GPU
  heavy_model:
    device: cuda
    device_id: 0  # Specific GPU

  # Apple Silicon
  mac_model:
    device: metal
```
Memory Management¶
Quantization¶
Quantization reduces model size and improves inference speed, at a small cost in accuracy.
INT8 Quantization¶
```yaml
classifiers:
  toxicity:
    model:
      repo: "unitary/toxic-bert"
      quantization: int8
      # ~4x smaller, ~2x faster
```
INT4 Quantization¶
```yaml
classifiers:
  toxicity:
    model:
      repo: "unitary/toxic-bert"
      quantization: int4
      # ~8x smaller, ~3x faster, some accuracy loss
```
Quantization Comparison¶
| Quantization | Size | Speed | Accuracy |
|---|---|---|---|
| None (FP32) | 100% | 1x | 100% |
| FP16 | 50% | ~1.5x | ~99.9% |
| INT8 | 25% | ~2x | ~99% |
| INT4 | 12.5% | ~3x | ~95-98% |
Caching¶
Model Cache¶
Models are cached in `~/.cache/huggingface/hub/`:

```
~/.cache/huggingface/hub/
├── models--unitary--toxic-bert/
│   ├── snapshots/
│   │   └── abc123def/
│   │       ├── model.safetensors
│   │       └── config.json
│   └── refs/
│       └── main
```
Custom Cache Location¶
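A sketch of overriding the cache directory in config; the top-level `model_cache.dir` key here is an assumption, not a confirmed option name:

```yaml
model_cache:
  dir: "/data/checkstream/models"
```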
Or via environment:
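The standard HuggingFace cache variable can be used instead, assuming CheckStream resolves the cache through the usual Hub conventions:

```shell
# Point the HuggingFace Hub cache at a custom directory.
# HF_HOME is the standard HuggingFace cache root variable.
export HF_HOME=/data/hf-cache
```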
Inference Cache¶
Cache classification results:
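A hedged sketch; the `inference_cache` keys (`enabled`, `max_entries`, `ttl`) are illustrative assumptions, not confirmed option names:

```yaml
classifiers:
  toxicity:
    inference_cache:
      enabled: true
      max_entries: 10000  # LRU capacity
      ttl: 300s           # Drop stale results after 5 minutes
```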
Tokenizer Configuration¶
Auto-Load Tokenizer¶
```yaml
classifiers:
  toxicity:
    model:
      repo: "unitary/toxic-bert"
      # Tokenizer is automatically loaded from the same repo
```
Custom Tokenizer¶
```yaml
classifiers:
  custom:
    model:
      source: local
      path: "./models/custom"
    tokenizer:
      source: huggingface
      repo: "bert-base-uncased"
```
Tokenizer Options¶
```yaml
classifiers:
  toxicity:
    tokenizer:
      max_length: 512
      truncation: true
      padding: max_length
      add_special_tokens: true
      return_attention_mask: true
```
Model Warmup¶
Pre-load models at startup:
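A minimal sketch, assuming a per-classifier `warmup` flag; the key name is an assumption:

```yaml
classifiers:
  toxicity:
    warmup: true  # Load weights and run a dummy inference at startup
```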
Or trigger manually:
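An assumed `/admin/warmup` endpoint, following the pattern of the `/admin/reload-models` endpoint shown in Troubleshooting; adjust host and port to your deployment:

```shell
# Trigger model warmup on a running instance (hypothetical endpoint)
curl -X POST http://localhost:8080/admin/warmup
```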
Multiple Model Instances¶
Model Registry¶
```yaml
model_registry:
  toxicity_v1:
    repo: "unitary/toxic-bert"
    revision: "v1.0"
  toxicity_v2:
    repo: "unitary/toxic-bert"
    revision: "v2.0"

classifiers:
  toxicity_stable:
    model_ref: toxicity_v1
  toxicity_canary:
    model_ref: toxicity_v2
    mode: shadow
```
A/B Testing¶
```yaml
classifiers:
  toxicity:
    ab_test:
      enabled: true
      variants:
        - model_ref: toxicity_v1
          weight: 90
        - model_ref: toxicity_v2
          weight: 10
```
Supported Model Architectures¶
BERT-based¶
```yaml
# DistilBERT, BERT, RoBERTa, etc.
classifiers:
  sentiment:
    model:
      repo: "distilbert-base-uncased-finetuned-sst-2-english"
      architecture: bert_sequence_classification
```
DeBERTa¶
```yaml
classifiers:
  prompt_injection:
    model:
      repo: "protectai/deberta-v3-base-prompt-injection"
      architecture: deberta_sequence_classification
```
Supported Architectures¶
| Architecture | Description |
|---|---|
| `bert_sequence_classification` | BERT for text classification |
| `bert_token_classification` | BERT for NER/token tasks |
| `deberta_sequence_classification` | DeBERTa classifier |
| `distilbert_sequence_classification` | DistilBERT classifier |
Troubleshooting¶
Model Not Loading¶
```bash
# Check model files
ls -la ~/.cache/huggingface/hub/models--unitary--toxic-bert/

# Verify config
cat ~/.cache/huggingface/hub/models--unitary--toxic-bert/snapshots/*/config.json
```
Out of Memory¶
```yaml
classifiers:
  large_model:
    model:
      quantization: int8  # Reduce memory
      max_length: 256     # Shorter sequences
      batch_size: 1       # Smaller batches
```
Slow Inference¶
Use a GPU:

```yaml
classifiers:
  slow_model:
    device: cuda
```

Or quantize:

```yaml
classifiers:
  slow_model:
    model:
      quantization: int8
```
Cache Issues¶
```bash
# Clear model cache
rm -rf ~/.cache/huggingface/hub/models--unitary--toxic-bert/

# Force re-download
curl -X POST http://localhost:8080/admin/reload-models
```
Best Practices¶
- Use SafeTensors - Faster to load and safer than the PyTorch pickle format
- Enable quantization - INT8 for production, FP32 for accuracy testing
- Warm up at startup - Avoid cold-start latency
- Use caching - Enable both model and inference caching
- Monitor memory - Track the `checkstream_model_memory_bytes` metric
- Test locally first - Verify models before deployment
Next Steps¶
- Classifier Configuration - Full classifier options
- Pipeline Configuration - Use models in pipelines
- Classifier System - Understanding tiers