Frequently Asked Questions¶
General¶
What is EmbedCache?¶
EmbedCache is a high-performance Rust library and REST API service for generating text embeddings with intelligent caching.
What embedding models are supported?¶
EmbedCache supports 22+ models through FastEmbed, including:
- AllMiniLM series
- BGE series (Small, Base, Large)
- Nomic Embed series
- Multilingual E5 series
- Paraphrase series
- MxbaiEmbed series
See Embedding Models for the complete list.
Is EmbedCache free to use?¶
Yes, EmbedCache is open source under the GPL-3.0 license. The embedding models are also free and run locally.
Do I need an API key?¶
Not for basic usage. Embedding generation happens locally using FastEmbed.
You only need API keys if you want to use LLM-based chunking with OpenAI or Anthropic.
Usage¶
How do I choose the right model?¶
| Need | Recommended Model |
|---|---|
| Fast, general purpose | AllMiniLML6V2 |
| Better quality | BGESmallENV15 |
| Highest quality | BGELargeENV15 |
| Multiple languages | MultilingualE5Base |
| Low memory | AllMiniLML6V2Q |
What's the difference between /v1/embed and /v1/process?¶
- /v1/embed - Generates embeddings for text you provide
- /v1/process - Fetches URL content, chunks it, and generates embeddings
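As an illustration, the sketch below calls both endpoints from Rust with reqwest. The host/port, request field names (model, texts, url), and response shapes are assumptions, not the documented schema; check the API reference for the real contract.

```rust
// Hypothetical client sketch; field names and response shapes are assumptions.
use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // /v1/embed: you supply the text, EmbedCache returns embeddings.
    let embed = client
        .post("http://localhost:8080/v1/embed")
        .json(&json!({
            "model": "AllMiniLML6V2",          // assumed field name
            "texts": ["What is EmbedCache?"]   // assumed field name
        }))
        .send()?
        .json::<serde_json::Value>()?;
    println!("embed response: {embed}");

    // /v1/process: EmbedCache fetches the URL, chunks it, and embeds the chunks.
    let process = client
        .post("http://localhost:8080/v1/process")
        .json(&json!({
            "model": "AllMiniLML6V2",          // assumed field name
            "url": "https://example.com/article"
        }))
        .send()?
        .json::<serde_json::Value>()?;
    println!("process response: {process}");

    Ok(())
}
```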
Are results cached?¶
Only /v1/process results are cached, since fetching and chunking remote content is the expensive step. /v1/embed is not cached: the same text always produces the same embeddings, so clients can cache those results themselves if needed.
What chunk size should I use?¶
Start with 256-512 words. Smaller chunks give more precise retrieval; larger chunks preserve more context.
LLM Chunking¶
Do I need an LLM for EmbedCache?¶
No. EmbedCache works without an LLM; word-based chunking is always available.
LLM chunking is optional and provides higher-quality semantic chunking.
Which LLM provider should I use?¶
| Context | Recommended |
|---|---|
| Development | Ollama (free, local) |
| Production (budget) | Ollama |
| Production (quality) | OpenAI or Anthropic |
Why does LLM chunking fall back to word chunking?¶
This happens when:
- The LLM server is unavailable
- The request times out
- The response can't be parsed
Check logs for specific errors.
Performance¶
How fast is EmbedCache?¶
Typical response times:
- Embed (cached model): 10-100ms per text
- Process (cache hit): < 10ms
- Process (cache miss): 1-5 seconds (depends on URL and content size)
How much memory does it use?¶
Memory usage depends on loaded models:
- Small models (~384 dim): ~200MB each
- Base models (~768 dim): ~400MB each
- Large models (~1024 dim): ~800MB each
Can I run multiple instances?¶
Yes, for read scaling. However, each instance has its own cache unless you share the SQLite database (with proper locking).
Deployment¶
How do I deploy to production?¶
- Use a reverse proxy (Nginx, Caddy) for TLS
- Configure authentication in the proxy
- Set up monitoring
- Use persistent storage for the cache database
See Deployment for detailed guides.
Does EmbedCache support horizontal scaling?¶
Limited. The SQLite cache doesn't support distributed access. For horizontal scaling:
- Use a shared cache layer (Redis)
- Implement your own caching
- Use sticky sessions
What are the system requirements?¶
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 2GB | 4GB+ |
| CPU | 2 cores | 4+ cores |
| Disk | 500MB | SSD recommended |
Troubleshooting¶
Why is the first request slow?¶
The embedding model is loaded on first use. Subsequent requests are faster.
Why am I getting "Unsupported embedding model"?¶
The model isn't in your ENABLED_MODELS configuration. Add it to your .env file.
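For example (assuming ENABLED_MODELS takes a comma-separated list of model names; the exact value format may differ, so check the configuration reference):

```env
# .env - models that EmbedCache is allowed to load
# (comma-separated value format is an assumption)
ENABLED_MODELS=AllMiniLML6V2,BGESmallENV15
```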
How do I reset the cache?¶
Cached results live in the SQLite database used by /v1/process. Stopping the service and deleting that database file (or clearing its tables) removes all cached results; the exact location depends on your storage configuration.
Development¶
Can I add custom chunking strategies?¶
Yes! Implement the ContentChunker trait. See Custom Chunkers.
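For example, a minimal sketch of a custom chunker. The real ContentChunker trait's method names and signatures may differ, so the shape below is an assumption; see Custom Chunkers for the actual interface.

```rust
// Sketch only: the real ContentChunker trait may have a different signature.
// Assumed shape: one method that turns raw content into chunk strings.
pub trait ContentChunker {
    fn chunk(&self, content: &str) -> Vec<String>;
}

/// Splits content on blank lines (paragraphs) instead of fixed word counts.
pub struct ParagraphChunker;

impl ContentChunker for ParagraphChunker {
    fn chunk(&self, content: &str) -> Vec<String> {
        content
            .split("\n\n")
            .map(|p| p.trim())
            .filter(|p| !p.is_empty())
            .map(|p| p.to_string())
            .collect()
    }
}
```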
Can I use different embedding providers?¶
Yes! Implement the Embedder trait. See Custom Embedders.
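Likewise, a hedged sketch of a custom embedder. The real Embedder trait may be async and differ in shape; see Custom Embedders for the actual interface.

```rust
// Sketch only: the real Embedder trait's signature may differ.
// Assumed shape: embed a batch of texts into fixed-size vectors.
pub trait Embedder {
    fn embed(&self, texts: &[String]) -> Vec<Vec<f32>>;
    fn dimensions(&self) -> usize;
}

/// Toy embedder that hashes words into a small fixed-size vector.
/// Useful only as a stand-in while wiring up a real provider.
pub struct HashingEmbedder {
    dims: usize,
}

impl Embedder for HashingEmbedder {
    fn embed(&self, texts: &[String]) -> Vec<Vec<f32>> {
        texts
            .iter()
            .map(|text| {
                let mut v = vec![0.0f32; self.dims];
                for word in text.split_whitespace() {
                    let mut h: usize = 0;
                    for b in word.bytes() {
                        h = h.wrapping_mul(31).wrapping_add(b as usize);
                    }
                    v[h % self.dims] += 1.0;
                }
                v
            })
            .collect()
    }

    fn dimensions(&self) -> usize {
        self.dims
    }
}
```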
How do I contribute?¶
- Fork the repository
- Create a feature branch
- Submit a pull request
See the Contributing Guidelines.