Pipeline Stages

Direktor uses a 6-stage pipeline to transform text into video. Each stage produces intermediate outputs that can be reviewed and customized.

Overview

┌─────────────────────────────────────────────────────────────────┐
│                         INPUT TEXT                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 1: Content Optimization & Script Generation              │
│  Input: Raw text                                                 │
│  Output: podcast_script.txt                                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 2: Audio Generation                                       │
│  Input: Podcast script                                           │
│  Output: audio.mp3                                               │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 3: Transcript Generation                                  │
│  Input: Audio file                                               │
│  Output: transcript.json                                         │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 4: Image Prompt Generation                                │
│  Input: Transcript                                               │
│  Output: image_prompts.json                                      │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 5: Image Generation                                       │
│  Input: Image prompts                                            │
│  Output: images/*.webp                                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────┐
│  Stage 6: Video Creation                                         │
│  Input: Audio + Images + Prompts                                 │
│  Output: output.mp4                                              │
└─────────────────────────────────────────────────────────────────┘

Stage 1: Content Optimization & Script Generation

Purpose: Transform raw text into an engaging podcast script.

Process:

1. Content is first optimized using NLP techniques for clarity and engagement
2. Text is split into chunks based on token limits
3. GPT generates a conversational podcast script for each chunk
4. Scripts are combined into a single coherent narrative

Output: podcast_script.txt

Technologies: OpenAI GPT models
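
The chunking step can be sketched roughly as follows. This is a minimal illustration, not Direktor's actual implementation: the function name `split_into_chunks` is hypothetical, and it approximates one token as one whitespace-separated word, whereas a real pipeline would count tokens with the model's tokenizer.

```python
def split_into_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens approximate tokens.

    Assumption: one whitespace-separated word counts as one token.
    """
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    words = text.split()
    # Take consecutive slices of max_tokens words and rejoin each slice.
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


if __name__ == "__main__":
    for chunk in split_into_chunks("one two three four five", max_tokens=2):
        print(chunk)
```

Each resulting chunk would then be sent to the model separately, and the generated scripts joined in order.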

Stage 2: Audio Generation

Purpose: Convert the podcast script into natural-sounding speech.

Process:

1. Script is split into sentences
2. Sentences are grouped into chunks (max 150 characters)
3. BARK model generates audio for each chunk
4. FFmpeg concatenates chunks into a single audio file

Output: audio.mp3

Technologies: Replicate BARK model, FFmpeg
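
The sentence splitting and grouping steps might look something like the sketch below. The function names are hypothetical and the regex-based sentence splitter is a deliberately naive assumption; only the 150-character limit comes from the process description above.

```python
import re

MAX_CHARS = 150  # chunk limit stated in the process description


def split_sentences(script: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", script)
    return [part.strip() for part in parts if part.strip()]


def group_sentences(sentences: list[str], max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own chunk,
    so that chunk will exceed the limit.
    """
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Grouping whole sentences, rather than cutting at a fixed character offset, keeps each audio chunk ending at a natural pause.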

Stage 3: Transcript Generation

Purpose: Create a timestamped transcript from the audio.

Process:

1. Audio is converted to WAV format (16 kHz)
2. WAV file is uploaded to cloud storage
3. Distil-Whisper transcribes with timestamps
4. Transcript is saved with chunk-level timestamps

Output: transcript.json

Technologies: Replicate Distil-Whisper, FFmpeg, Cloudflare R2
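
As an illustration of the conversion step, the sketch below builds an FFmpeg command that resamples the audio to 16 kHz WAV. The helper names are hypothetical, and downmixing to mono (`-ac 1`) is an assumption that is common for speech recognition input, not something the description above states.

```python
import subprocess
from pathlib import Path


def build_wav_command(src: Path, dst: Path, sample_rate: int = 16000) -> list[str]:
    """Return an ffmpeg argument list that converts src to a WAV file."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input file, e.g. audio.mp3
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # assumption: mono output
        str(dst),
    ]


def convert_to_wav(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_wav_command(src, dst), check=True)
```

Keeping command construction separate from execution makes the command easy to inspect or log before anything runs.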

Stage 4: Image Prompt Generation

Purpose: Generate visual descriptions for each segment.

Process:

1. Transcript chunks are aggregated into ~30-second segments
2. GPT analyzes each segment's content
3. Vivid image prompts are generated for each segment
4. Prompts are saved with timestamps

Output: image_prompts.json

Technologies: OpenAI GPT models
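
The aggregation step could be sketched as below. The chunk layout (a `text` string plus a `timestamp` pair of start and end seconds) is an assumption about what transcript.json contains, and `aggregate_segments` is a hypothetical name, not a Direktor function.

```python
def aggregate_segments(chunks: list[dict], target_seconds: float = 30.0) -> list[dict]:
    """Merge consecutive transcript chunks into segments of roughly target_seconds.

    Assumed chunk shape: {"text": str, "timestamp": [start, end]} in seconds.
    A segment is closed once it spans at least target_seconds, so segments
    can run slightly longer than the target; the last one may be shorter.
    """
    segments: list[dict] = []
    texts: list[str] = []
    seg_start = None
    seg_end = None
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if seg_start is None:
            seg_start = start
        texts.append(chunk["text"].strip())
        seg_end = end
        if seg_end - seg_start >= target_seconds:
            segments.append({"start": seg_start, "end": seg_end, "text": " ".join(texts)})
            texts, seg_start = [], None
    if texts:
        # Flush whatever is left as a final, shorter segment.
        segments.append({"start": seg_start, "end": seg_end, "text": " ".join(texts)})
    return segments
```

Each segment's text would then be passed to the model to produce one image prompt, with the segment's start and end times saved alongside it.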

Stage 5: Image Generation

Purpose: Create visual images from the prompts.

Process:

1. Each prompt is sent to the FLUX model
2. Images are generated in 16:9 aspect ratio
3. Images are saved in WebP format

Output: images/image_0.webp, images/image_1.webp, etc.

Technologies: Replicate FLUX model
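
To show how prompts map to output files, here is a small sketch that plans one job per prompt without calling any API. The function name and the request field names (`aspect_ratio`, `output_format`) are assumptions for illustration; check the schema of the model you actually use. Only the 16:9 ratio, the WebP format, and the images/image_N.webp naming come from the description above.

```python
from pathlib import Path


def plan_image_jobs(prompts: list[str], out_dir: Path = Path("images")) -> list[dict]:
    """Pair each prompt with hypothetical request inputs and an output path."""
    jobs = []
    for index, prompt in enumerate(prompts):
        jobs.append({
            # Hypothetical request fields; verify against the model's schema.
            "input": {
                "prompt": prompt,
                "aspect_ratio": "16:9",
                "output_format": "webp",
            },
            # Output naming follows images/image_0.webp, images/image_1.webp, ...
            "path": out_dir / f"image_{index}.webp",
        })
    return jobs
```

Because the index determines the file name, image order matches prompt order, which the video stage relies on for timing.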

Stage 6: Video Creation

Purpose: Combine all elements into the final video.

Process:

1. WebP images are converted to PNG
2. Images are scaled to 1920x1080
3. Video is created, with each image's display time taken from its prompt's timestamps
4. Audio is combined with video
5. Optional keyword overlays are added

Output: output.mp4

Technologies: FFmpeg, Pillow
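
The image timing step can be illustrated with the sketch below. It assumes each prompt carries a start time in seconds and that the audio duration is known; both function names are hypothetical. The second helper writes the `file` / `duration` entries used by FFmpeg's concat demuxer, which is one possible way to drive a slideshow and not necessarily what Direktor does.

```python
def image_durations(starts: list[float], audio_duration: float) -> list[float]:
    """Return how long each image stays on screen.

    Each image is shown from its own start time until the next image's
    start time; the last image is shown until the audio ends.
    """
    durations = []
    for index, start in enumerate(starts):
        end = starts[index + 1] if index + 1 < len(starts) else audio_duration
        durations.append(end - start)
    return durations


def concat_list(durations: list[float]) -> str:
    """Build concat-demuxer text for images/image_0.png, image_1.png, ..."""
    lines = []
    for index, duration in enumerate(durations):
        lines.append(f"file 'images/image_{index}.png'")
        lines.append(f"duration {duration:.3f}")
    return "\n".join(lines) + "\n"
```

Deriving every duration from the timestamps keeps the images in step with the narration even when segments differ in length.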

Resuming from a Stage

Each stage checks for existing outputs before processing. To start from a specific stage, pass the --stage option:

# Stage 1 already complete? Skip to Stage 2
direktor input.txt --stage 2

This allows you to:

- Review intermediate outputs
- Manually edit scripts or prompts
- Retry failed stages without reprocessing
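
To decide which stage to resume from, you could check which outputs already exist. The sketch below is a hypothetical helper, not part of Direktor: it uses the output names from the overview and assumes they all live in one output directory.

```python
from pathlib import Path
from typing import Optional

# Output of each stage, as listed in the pipeline overview.
STAGE_OUTPUTS = {
    1: "podcast_script.txt",
    2: "audio.mp3",
    3: "transcript.json",
    4: "image_prompts.json",
    5: "images",  # directory of image_N.webp files
    6: "output.mp4",
}


def first_incomplete_stage(output_dir: Path) -> Optional[int]:
    """Return the first stage whose output is missing, or None if all exist."""
    for stage in sorted(STAGE_OUTPUTS):
        target = output_dir / STAGE_OUTPUTS[stage]
        if target.is_dir():
            # A directory counts as done once it holds at least one WebP image.
            done = any(target.glob("*.webp"))
        else:
            # A file counts as done if it exists and is not empty.
            done = target.is_file() and target.stat().st_size > 0
        if not done:
            return stage
    return None
```

The returned number could then be passed to the --stage option shown above.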

Cost Considerations

Stage	API Calls	Relative Cost
1	GPT (text)	Low
2	Replicate (audio)	Medium
3	Replicate (transcription)	Low
4	GPT (text)	Low
5	Replicate (images)	High
6	None (local)	Free

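Cost Optimization

Run stages 1-4 first and review the prompts before generating images. Stage 5 has the highest relative cost, so correcting prompts beforehand avoids paying for images you would discard.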