Frequently Asked Questions¶
General¶
What is EmbedCache?¶
EmbedCache is a high-performance Rust library and REST API service for generating text embeddings with intelligent caching.
What embedding models are supported?¶
EmbedCache supports 22+ models through FastEmbed, including:
- AllMiniLM series
- BGE series (Small, Base, Large)
- Nomic Embed series
- Multilingual E5 series
- Paraphrase series
- MxbaiEmbed series
See Embedding Models for the complete list.
Is EmbedCache free to use?¶
Yes, EmbedCache is open source under the GPL-3.0 license. The embedding models are also free and run locally.
Do I need an API key?¶
Not for basic usage. Embedding generation happens locally using FastEmbed.
You only need API keys if you want to use LLM-based chunking with OpenAI or Anthropic.
Usage¶
How do I choose the right model?¶
| Need | Recommended Model |
|---|---|
| Fast, general purpose | AllMiniLML6V2 |
| Better quality | BGESmallENV15 |
| Highest quality | BGELargeENV15 |
| Multiple languages | MultilingualE5Base |
| Low memory | AllMiniLML6V2Q |
What's the difference between /v1/embed and /v1/process?¶
- /v1/embed - Generates embeddings for text you provide
- /v1/process - Fetches URL content, chunks it, and generates embeddings
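As an illustration, the sketch below calls both endpoints from Rust with reqwest. The host/port, request field names (model, texts, url), and response shapes are assumptions, not the documented schema; check the API reference for the real contract.

```rust
// Hypothetical client sketch; field names and response shapes are assumptions.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // /v1/embed: you supply the text, EmbedCache returns embeddings.
    let embed = client
        .post("http://localhost:8080/v1/embed")
        .json(&json!({
            "model": "AllMiniLML6V2",          // assumed field name
            "texts": ["What is EmbedCache?"]   // assumed field name
        }))
        .send()?
        .json::<serde_json::Value>()?;
    println!("embed response: {embed}");

    // /v1/process: EmbedCache fetches the URL, chunks it, and embeds the chunks.
    let process = client
        .post("http://localhost:8080/v1/process")
        .json(&json!({
            "model": "AllMiniLML6V2",          // assumed field name
            "url": "https://example.com/article"
        }))
        .send()?
        .json::<serde_json::Value>()?;
    println!("process response: {process}");

    Ok(())
}
```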
Are results cached?¶
Only /v1/process results are cached, since fetching and chunking remote content is the expensive step. /v1/embed is not cached: the same text always produces the same embeddings, so clients can cache those results themselves if needed.
What chunk size should I use?¶
Start with 256-512 words. Smaller chunks give more precise retrieval; larger chunks preserve more context.
LLM Chunking¶
Do I need an LLM for EmbedCache?¶
No. EmbedCache works without an LLM; word-based chunking is always available.
LLM chunking is optional and provides higher-quality semantic chunking.
Which LLM provider should I use?¶
| Context | Recommended |
|---|---|
| Development | Ollama (free, local) |
| Production (budget) | Ollama |
| Production (quality) | OpenAI or Anthropic |
Why does LLM chunking fall back to word chunking?¶
This happens when:
- The LLM server is unavailable
- The request times out
- The response can't be parsed
Check logs for specific errors.
Performance¶
How fast is EmbedCache?¶
Typical response times:
- Embed (cached model): 10-100ms per text
- Process (cache hit): < 10ms
- Process (cache miss): 1-5 seconds (depends on URL and content size)
How much memory does it use?¶
Memory usage depends on loaded models:
- Small models (~384 dim): ~200MB each
- Base models (~768 dim): ~400MB each
- Large models (~1024 dim): ~800MB each
Can I run multiple instances?¶
Yes, for read scaling. However, each instance has its own cache unless you share the SQLite database (with proper locking).
Deployment¶
How do I deploy to production?¶
- Use a reverse proxy (Nginx, Caddy) for TLS
- Configure authentication in the proxy
- Set up monitoring
- Use persistent storage for the cache database
See Deployment for detailed guides.
Does EmbedCache support horizontal scaling?¶
Limited. The SQLite cache doesn't support distributed access. For horizontal scaling:
- Use a shared cache layer (Redis)
- Implement your own caching
- Use sticky sessions
What are the system requirements?¶
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 2GB | 4GB+ |
| CPU | 2 cores | 4+ cores |
| Disk | 500MB | SSD recommended |
Troubleshooting¶
Why is the first request slow?¶
The embedding model is loaded on first use. Subsequent requests are faster.
Why am I getting "Unsupported embedding model"?¶
The model isn't in your ENABLED_MODELS configuration. Add it to your .env file.
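For example (assuming ENABLED_MODELS takes a comma-separated list of model names; the exact value format may differ, so check the configuration reference):

```env
# .env - models that EmbedCache is allowed to load
# (comma-separated value format is an assumption)
ENABLED_MODELS=AllMiniLML6V2,BGESmallENV15
```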
How do I reset the cache?¶
Cached results live in the SQLite database used by /v1/process. Stopping the service and deleting that database file (or clearing its tables) removes all cached results; the exact location depends on your storage configuration.
Development¶
Can I add custom chunking strategies?¶
Yes! Implement the ContentChunker trait. See Custom Chunkers.
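For example, a minimal sketch of a custom chunker. The real ContentChunker trait's method names and signatures may differ, so the shape below is an assumption; see Custom Chunkers for the actual interface.

```rust
// Sketch only: the real ContentChunker trait may have a different signature.
// Assumed shape: one method that turns raw content into chunk strings.
pub trait ContentChunker {
    fn chunk(&self, content: &str) -> Vec<String>;
}

/// Splits content on blank lines (paragraphs) instead of fixed word counts.
pub struct ParagraphChunker;

impl ContentChunker for ParagraphChunker {
    fn chunk(&self, content: &str) -> Vec<String> {
        content
            .split("\n\n")
            .map(|p| p.trim())
            .filter(|p| !p.is_empty())
            .map(|p| p.to_string())
            .collect()
    }
}
```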
Can I use different embedding providers?¶
Yes! Implement the Embedder trait. See Custom Embedders.
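Likewise, a hedged sketch of a custom embedder. The real Embedder trait may be async and differ in shape; see Custom Embedders for the actual interface.

```rust
// Sketch only: the real Embedder trait's signature may differ.
// Assumed shape: embed a batch of texts into fixed-size vectors.
pub trait Embedder {
    fn embed(&self, texts: &[String]) -> Vec<Vec<f32>>;
    fn dimensions(&self) -> usize;
}

/// Toy embedder that hashes words into a small fixed-size vector.
/// Useful only as a stand-in while wiring up a real provider.
pub struct HashingEmbedder {
    dims: usize,
}

impl Embedder for HashingEmbedder {
    fn embed(&self, texts: &[String]) -> Vec<Vec<f32>> {
        texts
            .iter()
            .map(|text| {
                let mut v = vec![0.0f32; self.dims];
                for word in text.split_whitespace() {
                    let mut h: usize = 0;
                    for b in word.bytes() {
                        h = h.wrapping_mul(31).wrapping_add(b as usize);
                    }
                    v[h % self.dims] += 1.0;
                }
                v
            })
            .collect()
    }

    fn dimensions(&self) -> usize {
        self.dims
    }
}
```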
How do I contribute?¶
- Fork the repository
- Create a feature branch
- Submit a pull request
See the Contributing Guidelines.