# Model Selection
Choosing the right models is crucial for effective prompt engineering with Blogus.
## Target vs Judge Models

Blogus distinguishes between two types of models:

### Target Models

Target models execute prompts - they generate the actual responses in production.
Use for:
- Testing prompt performance
- Generating actual responses
- Cost-sensitive production environments
- Comparing model outputs
### Judge Models
Judge models analyze and evaluate prompts. These are typically more capable models that provide insightful feedback.
Use for:
- Prompt analysis and evaluation
- Fragment analysis
- Test case generation
- Goal inference
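
The two roles come together in a single workflow: a judge model critiques the prompt, and a target model runs it. Below is a minimal sketch using the `analyze_prompt` and `execute_prompt` helpers shown throughout this page; the example prompt is just a placeholder.

```python
from blogus.core import analyze_prompt, execute_prompt, JudgeLLMModel, TargetLLMModel

prompt = "Summarize the following article in three bullet points: {article}"

# Judge model: evaluates the prompt itself and suggests improvements
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_OPUS)

# Target model: produces the response end users would actually see
response = execute_prompt(prompt, TargetLLMModel.GPT_3_5_TURBO)
```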
## Available Models

### OpenAI Models

| Model | Recommended For | Description |
|---|---|---|
| gpt-4o | Both | Latest multimodal model, excellent for analysis |
| gpt-4-turbo | Both | High-performance, good balance of capability and cost |
| gpt-3.5-turbo | Target | Cost-effective for simple tasks |
### Anthropic Models

| Model | Recommended For | Description |
|---|---|---|
| claude-3-opus-20240229 | Judge | Most capable, ideal for detailed analysis |
| claude-3-sonnet-20240229 | Both | Good balance of speed and capability |
| claude-3-haiku-20240307 | Target | Fast and cost-effective |
### Groq Models

| Model | Recommended For | Description |
|---|---|---|
| groq/llama3-70b-8192 | Both | High-quality responses with speed |
| groq/mixtral-8x7b-32768 | Target | Cost-effective with good performance |
| groq/gemma-7b-it | Target | Lightweight tasks |
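
The model names above map onto the `TargetLLMModel` and `JudgeLLMModel` members used in the code samples below. Assuming these are ordinary Python enums (the `.value` access further down this page suggests they are), you can list what your installed version supports:

```python
from blogus.core import JudgeLLMModel, TargetLLMModel

# Print the provider identifier behind each member,
# e.g. "gpt-4o" or "groq/llama3-70b-8192".
for model in TargetLLMModel:
    print("target:", model.value)

for model in JudgeLLMModel:
    print("judge:", model.value)
```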
## Selection Strategy

### For Analysis (Judge Models)
- Prioritize capability over cost - analysis quality matters more than the per-call savings of a cheaper judge
- Claude 3 Opus - Best for detailed, nuanced analysis
- GPT-4o - Excellent alternative with broad capabilities
- Llama 3 70B - Good open-source option
```python
from blogus.core import analyze_prompt, JudgeLLMModel

# Use a capable judge for analysis
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_OPUS)
```
### For Execution (Target Models)

- Balance cost and performance - weigh output quality against your budget
- Match task complexity:
    - Simple tasks: GPT-3.5 Turbo, Mixtral 8x7B, Claude 3 Haiku
    - Complex tasks: GPT-4o, Claude 3 Sonnet, Llama 3 70B
- Consider speed requirements - Claude 3 Haiku and the Groq models are the fastest options
```python
from blogus.core import execute_prompt, TargetLLMModel

# Use cost-effective model for execution
response = execute_prompt(prompt, TargetLLMModel.GROQ_MIXTRAL_8X7B)
```
## Recommended Combinations

### Cost-Effective Analysis

```python
# Powerful judge, economical execution
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_OPUS)
response = execute_prompt(prompt, TargetLLMModel.GROQ_MIXTRAL_8X7B)
```

### High-Performance

```python
# Best quality for both
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_OPUS)
response = execute_prompt(prompt, TargetLLMModel.GPT_4)
```

### Speed-Focused

```python
# Fast analysis and execution
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_SONNET)
response = execute_prompt(prompt, TargetLLMModel.GROQ_LLAMA3_70B)
```

### Balanced

```python
# Good all-around performance
analysis = analyze_prompt(prompt, JudgeLLMModel.GPT_4)
response = execute_prompt(prompt, TargetLLMModel.CLAUDE_3_SONNET)
```
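
If you reuse a combination in several places, it can help to bundle the judge/target pair into a small helper so swapping models happens in one spot. A sketch using the same functions as above; the `ModelCombo` class is purely illustrative and not part of Blogus.

```python
from dataclasses import dataclass

from blogus.core import analyze_prompt, execute_prompt, JudgeLLMModel, TargetLLMModel


@dataclass(frozen=True)
class ModelCombo:
    """An illustrative judge/target pairing, defined in one place."""

    judge: JudgeLLMModel
    target: TargetLLMModel

    def run(self, prompt: str):
        # Analyze with the judge, then execute with the target.
        return (
            analyze_prompt(prompt, self.judge),
            execute_prompt(prompt, self.target),
        )


# The cost-effective combination from above
cost_effective = ModelCombo(JudgeLLMModel.CLAUDE_3_OPUS, TargetLLMModel.GROQ_MIXTRAL_8X7B)
analysis, response = cost_effective.run("Summarize this release note in one sentence.")
```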
## Performance Comparison

### Latency (Fastest to Slowest)

1. Groq models - dedicated hardware, very fast
2. Claude 3 Haiku - optimized for speed
3. GPT-3.5 Turbo - generally fast
4. Claude 3 Sonnet - moderate
5. GPT-4o - variable based on load
6. Claude 3 Opus - slower but thorough
### Cost (Lowest to Highest)

1. Groq Gemma 7B - least expensive
2. Groq Mixtral 8x7B - low cost
3. GPT-3.5 Turbo - moderate
4. Claude 3 Haiku - moderate
5. Claude 3 Sonnet - higher
6. GPT-4o / GPT-4 Turbo - high
7. Claude 3 Opus - highest
### Context Length
| Model | Context Window |
|---|---|
| Claude 3 (all) | 200K tokens |
| GPT-4o | 128K tokens |
| GPT-4 Turbo | 128K tokens |
| Mixtral 8x7B | 32K tokens |
| GPT-3.5 Turbo | 16K tokens |
| Llama 3 70B | 8K tokens |
| Gemma 7B | 8K tokens |
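
If your prompts vary a lot in length, the table can also drive selection programmatically. Below is a rough sketch that assumes about four characters per token; the context sizes mirror the table, and `choose_target_model` is just an illustrative helper, not part of the Blogus API.

```python
from blogus.core import TargetLLMModel

# Approximate context windows in tokens, mirroring the table above.
CONTEXT_WINDOWS = {
    TargetLLMModel.GROQ_LLAMA3_70B: 8_000,
    TargetLLMModel.GPT_3_5_TURBO: 16_000,
    TargetLLMModel.GROQ_MIXTRAL_8X7B: 32_000,
    TargetLLMModel.CLAUDE_3_SONNET: 200_000,
}


def choose_target_model(prompt: str, reply_budget: int = 1_000) -> TargetLLMModel:
    """Pick the smallest context window that still fits the prompt plus a reply."""
    estimated_tokens = len(prompt) // 4 + reply_budget  # ~4 characters per token
    for model, window in sorted(CONTEXT_WINDOWS.items(), key=lambda item: item[1]):
        if estimated_tokens <= window:
            return model
    # Nothing fits comfortably; fall back to the largest window listed.
    return TargetLLMModel.CLAUDE_3_SONNET
```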
## Best Practices

### 1. Match Model to Task

```python
# Detailed analysis: use capable judge
analysis = analyze_prompt(prompt, JudgeLLMModel.CLAUDE_3_OPUS)

# Simple execution: use cost-effective target
response = execute_prompt(prompt, TargetLLMModel.GROQ_MIXTRAL_8X7B)
```
### 2. Test Across Models

```python
models_to_test = [
    TargetLLMModel.GPT_3_5_TURBO,
    TargetLLMModel.GROQ_MIXTRAL_8X7B,
    TargetLLMModel.CLAUDE_3_HAIKU
]

for model in models_to_test:
    response = execute_prompt(prompt, model)
    print(f"{model.value}: {response[:100]}...")
```
### 3. Use Powerful Models for Analysis

```python
# Even with a simple target model, use a powerful judge
response = execute_prompt(prompt, TargetLLMModel.GROQ_GEMMA_7B)
analysis = analyze_prompt(prompt, JudgeLLMModel.GPT_4)  # More capable
```
### 4. Consider Context Requirements

```python
# For long prompts, choose a model with a large enough context window.
# Note: len(prompt) counts characters, which is only a rough proxy for tokens.
if len(prompt) > 10000:
    target_model = TargetLLMModel.CLAUDE_3_SONNET  # 200K context
else:
    target_model = TargetLLMModel.GPT_3_5_TURBO  # 16K context
```
## Environment Variables

Set API keys for the models you want to use:

```bash
# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

# Groq
export GROQ_API_KEY="gsk_..."
```
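
A quick way to catch a missing key before the first API call is a small standard-library check; which keys you actually need depends on the providers you use:

```python
import os

# Keep only the providers you actually call.
required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GROQ_API_KEY"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing API keys: {', '.join(missing)}")
```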