A Real-Time Multimodal Research & Documentary Engine
Three services, one seamless experience
SequentialAgent orchestrating document to documentary
Document Analysis & Parallel Research
process_document → run_async(ctx). ADK Constraint: the google_search tool cannot be combined with other tools in the same agent, so each researcher is a search-only Agent with its own output_key. The Aggregator reads all {research_0} … {research_N} state keys via ADK template substitution.
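The constraint above forces a fan-out/fan-in shape. A minimal sketch of that pattern in plain asyncio (run_researcher, run_pipeline, and the shared state dict are illustrative stand-ins, not the real ADK API):

```python
import asyncio

async def run_researcher(topic: str, index: int, state: dict) -> None:
    # Stand-in for a search-only Agent; the real agent would call the
    # google_search tool and write its findings under its output_key.
    state[f"research_{index}"] = f"findings for {topic!r}"

async def run_pipeline(topics: list[str]) -> str:
    state: dict[str, str] = {}
    # Fan out: one search-only researcher per topic, run in parallel.
    await asyncio.gather(
        *(run_researcher(t, i, state) for i, t in enumerate(topics))
    )
    # Fan in: template substitution over {research_0} … {research_N},
    # mirroring how ADK injects state keys into the Aggregator's prompt.
    template = " | ".join(f"{{research_{i}}}" for i in range(len(topics)))
    return template.format(**state)

if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["pyramids", "obelisks"])))
```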
3-Round Search Protocol: core verification → visual references → secondary corroboration. Sources are graded by trust tier: .edu / .gov / JSTOR > Wikipedia > blogs / news.
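A minimal sketch of the trust-tier grading described above; the numeric tiers and the grade_source helper are assumptions, not the project's actual code:

```python
from urllib.parse import urlparse

def grade_source(url: str) -> int:
    """1 = highest trust (.edu / .gov / JSTOR), 2 = Wikipedia,
    3 = blogs, news, and everything else."""
    host = urlparse(url).netloc.lower()
    if host.endswith((".edu", ".gov", "jstor.org")):
        return 1
    if host.endswith("wikipedia.org"):
        return 2
    return 3
```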
Script Generation & Hallucination Firewall
Each segment becomes a SegmentScript.
An LLM judge classifies every sentence.
Safety check: the script is overwritten only if the validated segment count matches the original (prevents data loss).
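That safety guard can be sketched in a few lines (safe_overwrite is a hypothetical name):

```python
def safe_overwrite(original: list[str], validated: list[str]) -> list[str]:
    """Return the validated segments only if none were lost;
    otherwise keep the original script untouched."""
    if len(validated) == len(original):
        return validated
    return original
```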
Per-Segment Streaming: Scene 0 runs through Phases III–XI first (fast path). Only after Scene 0 emits segment_update(complete) do Scenes 1–N start in parallel with asyncio.Semaphore(2) to respect API rate limits.
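The Scene 0 fast path plus semaphore-limited fan-out can be sketched as follows; process_scene and run_scenes are hypothetical stand-ins for the real pipeline phases:

```python
import asyncio

async def process_scene(scene_id: int, completed: list[int]) -> None:
    await asyncio.sleep(0)        # placeholder for the real phase work
    completed.append(scene_id)    # would emit segment_update(complete)

async def run_scenes(n_scenes: int) -> list[int]:
    completed: list[int] = []
    # Fast path: Scene 0 finishes before any other scene starts.
    await process_scene(0, completed)

    sem = asyncio.Semaphore(2)    # respect API rate limits

    async def limited(scene_id: int) -> None:
        async with sem:
            await process_scene(scene_id, completed)

    # Scenes 1..N run in parallel, at most two in flight at a time.
    await asyncio.gather(*(limited(i) for i in range(1, n_scenes)))
    return completed
```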
Interleaved Multimodal Generation — Gemini TEXT+IMAGE
These three phases use Gemini's native interleaved TEXT+IMAGE output (response_modalities=["TEXT","IMAGE"]) — a single model call produces both narration text and illustrations simultaneously. This is the core Creative Storytellers innovation.
One Gemini call per scene:
Decomposes narration into 3–4 dramatic beats.
Assigns a generation path per beat.
Constraint: if a scene has ≥3 beats, it must include at least one beat of each type.
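The beat-path constraint can be checked with a few lines; the two path names ("image", "video") are assumptions, since the section does not enumerate the generation paths:

```python
PATHS = ("image", "video")  # assumed generation paths

def satisfies_constraint(beat_paths: list[str]) -> bool:
    """True if the scene is valid: fewer than 3 beats, or at least
    one beat of every generation path."""
    if len(beat_paths) < 3:
        return True
    return all(p in beat_paths for p in PATHS)
```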
Visual Research Micro-Pipeline & Image/Video Generation
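One detail of this pipeline, the sync-only client.operations.get poll, is pushed onto the default thread-pool executor so the long-running video operation never stalls the event loop. A sketch of that run_in_executor pattern, with fake_operations_get standing in for the real SDK call:

```python
import asyncio
import time

def fake_operations_get(deadline: float) -> dict:
    # Blocking stand-in for client.operations.get(operation).
    return {"done": time.monotonic() >= deadline}

async def wait_for_video(deadline: float, interval: float = 0.01) -> dict:
    loop = asyncio.get_running_loop()
    while True:
        # Run the sync poll off-loop so streaming keeps flowing.
        op = await loop.run_in_executor(None, fake_operations_get, deadline)
        if op["done"]:
            return op
        await asyncio.sleep(interval)
```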
VisualDetailManifest with enriched prompts; parallel client.aio.models.generate_content calls gathered via asyncio.gather; google_search grounding for URL discovery; negative_prompt: electricity, plastic, modern objects.
Video: veo2_scene description → generate_videos(output_gcs_uri=...); client.operations.get is sync-only → run_in_executor; result emitted as videoUrl + SSE event.
WebSocket Relay, Interruption Handling & Live Illustration
Interruption Flow: User speaks → VAD detects speech → Gemini sets interrupted=true → relay forwards to browser → audio queue flushed → Ken Burns paused → waveform + caption mode. On turn_complete, documentary resumes at next beat.
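The interruption flow above reduces to a two-state machine. A sketch with assumed state and event names:

```python
class Player:
    def __init__(self) -> None:
        self.state = "documentary"        # documentary | listening
        self.audio_queue: list[bytes] = []
        self.beat = 0

    def on_event(self, event: str) -> None:
        if event == "interrupted":        # relay forwarded interrupted=true
            self.audio_queue.clear()      # flush queued narration audio
            self.state = "listening"      # Ken Burns paused, waveform shown
        elif event == "turn_complete" and self.state == "listening":
            self.beat += 1                # resume at the next beat
            self.state = "documentary"
```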
Mic → Gemini: 16 kHz PCM, 16-bit mono, 1024-byte chunks.
Gemini → browser: 24 kHz PCM, 16-bit mono, buffer-queue playback.
Tool calls: Gemini calls a function mid-narration (fire & forget).
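A quick check of the upstream framing: at 16 kHz, 16-bit (2-byte) mono, each 1024-byte chunk carries 512 samples, i.e. 32 ms of microphone audio per message:

```python
def chunk_duration_ms(chunk_bytes: int, rate_hz: int, sample_bytes: int = 2) -> float:
    # bytes → samples → milliseconds
    return chunk_bytes * 1000.0 / (sample_bytes * rate_hz)
```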
SSE Streaming, Firestore Schema & Frontend State
Drip buffer (150ms): Prevents UI overload from parallel agent bursts. Events queued in pendingRef, released at intervals.
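The drip buffer lives in the React frontend (pendingRef); here is the same pattern sketched in Python for illustration, with events queued on push and released one per tick:

```python
import asyncio

class DripBuffer:
    def __init__(self, interval: float = 0.15) -> None:
        self.interval = interval
        self.pending: list[dict] = []   # analogue of pendingRef

    def push(self, event: dict) -> None:
        # Bursts from parallel agents just accumulate here.
        self.pending.append(event)

    async def drain(self, handle) -> None:
        # Release queued events one at a time, one per interval,
        # so the consumer never sees a flood.
        while self.pending:
            handle(self.pending.pop(0))
            await asyncio.sleep(self.interval)
```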
Pipeline phases narrated as expedition journal entries
5-state visual machine: queued → searching → done
Audio-reactive pan/zoom driven by AnalyserNode
Canvas waveform, interruption, resumption
Zustand 5's signals-based subscriptions ensure components re-render only when their exact slice of state changes. Combined with the SSE drip buffer and Vite code-splitting (<200 KB gzipped), the frontend maintains 60 fps even during peak parallel agent activity.