The Problem
Video editing is one of the most time-intensive parts of content production. A 30-minute documentary-style YouTube video can take 20–40 hours to edit: transcription, B-roll selection, subtitle sync, colour grade, audio normalisation, chapter markers.
For high-volume channels—or creators who want to focus on ideas and speaking, not post-production—this is a bottleneck. The pipeline eliminates it.
What Was Built
The pipeline takes a single voiceover audio file (and optionally a text script) and produces a fully edited YouTube video in documentary style: B-roll, synced subtitles, transitions, chapter markers, and loudness-normalised audio.
The output is production-ready. Target benchmarks: Johnny Harris, Veritasium, ColdFusion.
Dual-Stack Architecture
The system is split into two runtimes that communicate via an EDL JSON (Edit Decision List) pivot format:
Python stack — intelligence layer:
- WhisperX for word-level transcription (faster than Whisper, better timestamps)
- LLM analysis of transcript → scene segmentation, B-roll keywords, subtitle groupings
- Asset sourcing from Pexels and Pixabay with caching and deduplication
- EDL JSON v3 generation: one record per segment with timing, B-roll asset, subtitle text, transition type
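The caching and deduplication step can be sketched as a content-addressed store keyed by provider and asset ID. This is an illustrative sketch, not the pipeline's actual code; the class and key scheme are assumptions:

```python
import hashlib
from pathlib import Path

def cache_key(provider: str, asset_id: str) -> str:
    """Stable key for a sourced asset, e.g. ('pexels', '857195')."""
    return hashlib.sha256(f"{provider}:{asset_id}".encode()).hexdigest()[:16]

class AssetCache:
    """Skips re-downloading assets already fetched in this or an earlier run."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def has(self, provider: str, asset_id: str) -> bool:
        return (self.root / cache_key(provider, asset_id)).exists()

    def put(self, provider: str, asset_id: str, data: bytes) -> Path:
        path = self.root / cache_key(provider, asset_id)
        path.write_bytes(data)
        return path
```

Keying on provider plus asset ID (rather than search keyword) is what makes deduplication work across segments: two scenes that resolve to the same stock clip hit the same cache entry.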
Remotion stack (React/TypeScript) — compositing layer:
- Reads the EDL JSON and renders the video frame-by-frame
- Word-synced subtitle animation (Hormozi-style: large, high-contrast, bold)
- B-roll composited over the voiceover with Ken Burns motion on stills
- @remotion/transitions for cut types (hard cut, cross-dissolve, wipe)
- Light leak overlays and motion graphics for visual rhythm
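The Ken Burns motion on stills reduces to interpolating a scale and a pan offset per frame. A minimal linear version (the real presets may use easing; the function name and parameters here are assumptions for illustration):

```python
def ken_burns(frame: int, total_frames: int,
              start_scale: float = 1.0, end_scale: float = 1.12,
              pan: tuple[float, float] = (0.0, -0.04)) -> tuple[float, float, float]:
    """Return (scale, x_offset, y_offset) for a still at the given frame.

    Offsets are fractions of the frame size; a negative y pans upward.
    """
    t = frame / max(total_frames - 1, 1)  # normalised progress in [0, 1]
    scale = start_scale + (end_scale - start_scale) * t
    return scale, pan[0] * t, pan[1] * t
```

A gentle end scale around 1.1 keeps the motion subtle; larger values read as a deliberate zoom rather than ambient movement.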
FFmpeg handles the final pass: audio ducking (B-roll music under VO), loudness normalisation to -14 LUFS, and mux.
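The normalisation-and-mux step amounts to a single ffmpeg invocation. A sketch of the command builder (file names are placeholders; the ducking step would be a separate pass using ffmpeg's sidechaincompress filter):

```python
def final_pass_cmd(video_in: str, audio_in: str, out: str,
                   target_lufs: float = -14.0) -> list[str]:
    """Build the ffmpeg command for loudness normalisation and final mux."""
    return [
        "ffmpeg", "-y",
        "-i", video_in,               # Remotion-rendered video
        "-i", audio_in,               # VO mixed with ducked B-roll music
        "-map", "0:v:0", "-map", "1:a:0",
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        "-c:v", "copy",               # video untouched; only audio is re-encoded
        "-c:a", "aac",
        out,
    ]

# subprocess.run(final_pass_cmd("render.mp4", "mix.wav", "final.mp4"), check=True)
```

Copying the video stream keeps the final pass fast: only the audio is filtered and re-encoded.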
The EDL JSON Format
The EDL JSON is the architectural decision that makes the dual-stack approach work. It is the contract between intelligence and compositing.
Each entry contains: segment_id, start_ms, end_ms, vo_text, subtitle_chunks, broll_asset, broll_type (video/image), transition_in, transition_out, motion_preset.
This format is human-readable and editable. A creator can open the EDL JSON, change a B-roll selection or subtitle grouping, and re-render without re-running the AI analysis. Render-only changes are fast; intelligence re-runs are only needed when the analysis itself needs to change.
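A single entry might look like the following. The field names are those listed above; every value is invented for illustration, and the validator is a sketch of how the contract could be enforced, not the pipeline's actual schema code:

```python
import json

EDL_FIELDS = {
    "segment_id", "start_ms", "end_ms", "vo_text", "subtitle_chunks",
    "broll_asset", "broll_type", "transition_in", "transition_out", "motion_preset",
}

def validate_segment(seg: dict) -> list[str]:
    """Return a list of contract violations (empty means the entry is valid)."""
    errors = [f"missing field: {f}" for f in sorted(EDL_FIELDS - seg.keys())]
    if seg.get("broll_type") not in ("video", "image"):
        errors.append("broll_type must be 'video' or 'image'")
    if seg.get("end_ms", 0) <= seg.get("start_ms", 0):
        errors.append("end_ms must be after start_ms")
    return errors

segment = {
    "segment_id": 12,
    "start_ms": 48_200,
    "end_ms": 53_400,
    "vo_text": "The first prototypes failed within minutes.",
    "subtitle_chunks": ["The first prototypes", "failed within minutes."],
    "broll_asset": "assets/pexels_857195.mp4",
    "broll_type": "video",
    "transition_in": "hard_cut",
    "transition_out": "cross_dissolve",
    "motion_preset": None,
}
```

Because every value is plain JSON, a hand edit to `broll_asset` or `subtitle_chunks` survives a round-trip straight into the Remotion render.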
Professional Video Rules Baked In
The LLM is prompted with specific editorial rules:
- Shot length: 3–7 seconds per B-roll clip (faster cuts feel more energetic)
- Cut ratio: 90% hard cuts, 10% transitions (over-using transitions is a beginner mistake)
- Subtitle grouping: 3–5 words per card (readability at speed)
- Ken Burns: applied to stills only, with scale and pan direction derived from image content keywords
- Audio ducking: B-roll music (if provided) ducks to -18 LUFS under the VO, then returns
These rules produce a consistent output style without requiring per-video prompt tuning.
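The 3–5 word grouping rule, for instance, can be implemented greedily with one look-ahead so the final card is never stranded with fewer than three words. A minimal sketch under that rule alone; a production chunker would likely also respect punctuation and pause boundaries:

```python
def group_subtitles(words: list[str], min_words: int = 3, max_words: int = 5) -> list[list[str]]:
    """Split a word list into subtitle cards of min_words..max_words words.

    Greedy with look-ahead: never strand a final card shorter than min_words.
    """
    cards: list[list[str]] = []
    i = 0
    while i < len(words):
        remaining = len(words) - i
        take = min(max_words, remaining)
        if 0 < remaining - take < min_words:
            take = remaining - min_words  # leave enough words for the last card
        cards.append(words[i:i + take])
        i += take
    return cards
```

For a 12-word sentence this yields cards of 5, 4, and 3 words rather than 5, 5, and an orphaned 2.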
Outputs
Each run produces:
- final.mp4 — the complete rendered video
- edl.json — the full edit decision list (inspect, edit, re-render)
- chapters.txt — YouTube chapter timestamps derived from scene segmentation
- assets/ — sourced and cached B-roll assets with metadata
Scope
- ✓ WhisperX transcription + LLM analysis → EDL JSON v3; optional text script for accuracy.
- ✓ Asset sourcing (Pexels/Pixabay) with caching, dedup, and per-segment manifests.
- ✓ Remotion: compositing, word-synced subtitles, motion graphics, @remotion/transitions, light leaks.
- ✓ FFmpeg: ducking, loudness normalisation, final mux. Outputs: final.mp4, edl.json, chapters.txt.
- ✓ Professional rules: 3–7s per shot, 90% hard cuts, Hormozi-style subtitles, Ken Burns on still B-roll.
Waqas Raza
AI-Native Full-Stack Engineer. Top Rated on Upwork · $180K+ earned · 93% job success. I build production AI agents, LLM systems, Web3 platforms, and full-stack applications.
Hire me on Upwork