Architecture Deep Dive

AI Historian

A Real-Time Multimodal Research & Documentary Engine

11 Pipeline Phases · 6 Gemini Models · 3 Cloud Run Services · 4 Output Modalities
Input
Any historical document, any language
Output
Self-generating cinematic documentary + live voice historian
Slide 1 of 8

System Overview

Three services, one seamless experience

Frontend
Browser
React 19 · TypeScript · Vite 6 · Tailwind v4
PDF Viewer · Research Panel · Documentary Player · Voice Button
REST + SSE
WebSocket
Service 1
historian-api
FastAPI · Python 3.12 · 2Gi / 2 CPU
HTTP gateway, session lifecycle,
signed URL generation, semantic retrieval
Service 2
agent-orchestrator
ADK + FastAPI · Python 3.12 · 4Gi / 4 CPU
11-phase agent pipeline, SSE streaming,
parallel research, visual generation
Service 3
live-relay
Node.js 20 · 1Gi / 1 CPU
WebSocket proxy to Gemini Live API,
bidirectional audio, interruption handling
Firestore
Sessions, Segments, Chunks, Geo
Cloud Storage
PDFs, Images, Videos
Document AI
Multilingual OCR
Vertex AI
Imagen 3, Veo 2
Gemini Live API
2.5 Flash Native Audio
Slide 2 of 8

The 11-Phase Pipeline

A SequentialAgent orchestrating the document-to-documentary pipeline

Global Phases (run once)
I
Translation & Scan
document_analyzer — OCR, chunking, semantic curation
gemini-2.0-flash
II
Field Research
scene_research — ParallelAgent with N google_search sub-agents
gemini-2.0-flash
Per-Segment Streaming (Scene 0 first, then 1–N in parallel)
III
Synthesis
script_orch — generates SegmentScript with narration + visual descriptions
gemini-2.5-flash
IV
Creative Direction
narrative_director — TEXT+IMAGE interleaved storyboard
gemini-2.5-flash-image
V
Interleaved Composition
beat_illustration — pre-generates narration beats with images
gemini-2.5-flash-image
VI
Visual Interleave
visual_interleave — assigns illustration / cinematic / video per beat
gemini-2.0-flash
VII
Fact Validation
fact_validator — hallucination firewall, cross-references research
gemini-2.0-flash
VIII
Geographic Mapping
geo_location_agent — geocoding with Google Maps grounding
gemini-2.0-flash
IX
Visual Storyboard
narrative_visual_planner — plans unique visual territory per scene
gemini-2.0-pro
X
Visual Composition
visual_research_orch — 6-stage visual detail micro-pipeline
gemini-2.0-flash
XI
Generation
visual_director_orch — Imagen 3 frames + Veo 2 video clips
imagen-3 + veo-2
Slide 3 of 8

Phase I & II

Document Analysis & Parallel Research

Phase I — Translation & Scan
Document AI OCR
Any language, any script
Semantic Chunker
Page breaks, headings, topics
Parallel Summarizer
Semaphore(10) · Flash
Narrative Curator
gemini-2.0-pro Agent
Document AI OCR
  • Processes any format: PDF, images, scanned manuscripts
  • Supports Latin, Arabic, Cyrillic, CJK, Ottoman Turkish, dead scripts
  • Extracts layout structure (headings, paragraphs, tables)
  • Async client via process_document
Narrative Curator
  • ADK Agent running gemini-2.0-pro via run_async(ctx)
  • Produces 4–8 SceneBriefs: title, era, location, cinematic hook
  • Generates Visual Bible: Imagen 3 style guide for entire documentary
  • Writes DocumentMap: structural overview for downstream agents
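The Parallel Summarizer's bounded fan-out can be sketched as follows, a minimal illustration of the Semaphore(10) pattern; summarize_chunk is a stand-in for the real Gemini Flash call, not the project's actual function:

```python
import asyncio

# Sketch of the Parallel Summarizer: chunk summaries run concurrently, but a
# Semaphore(10) caps in-flight model calls. summarize_chunk is a placeholder
# for the real Gemini Flash request.
async def summarize_chunk(chunk: str, sem: asyncio.Semaphore) -> str:
    async with sem:                    # at most 10 concurrent calls
        await asyncio.sleep(0)         # placeholder for the model round-trip
        return chunk[:40] + "…"        # placeholder summary

async def summarize_all(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(10)
    return await asyncio.gather(*(summarize_chunk(c, sem) for c in chunks))

summaries = asyncio.run(summarize_all([f"chunk {i} text" for i in range(25)]))
```

asyncio.gather preserves input order, so summaries line up with their source chunks regardless of completion order.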
Phase II — Field Research
SceneBriefs
from Phase I
Inject Context
per-scene state keys
ParallelAgent
researcher_0
google_search
researcher_1
google_search
researcher_2
google_search
researcher_N
google_search
Aggregator
merges all research

ADK Constraint: the google_search tool cannot be combined with other tools in the same agent. Each researcher is therefore a search-only Agent with its own output_key. The Aggregator reads all {research_0} … {research_N} state keys via ADK template substitution.
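The state-key substitution can be illustrated with a minimal sketch; the regex-based fill_template and the state dict shape are assumptions standing in for ADK's own template mechanism:

```python
import re

# Illustrative ADK-style template substitution: the Aggregator's prompt
# references {research_i} keys, which are replaced with each researcher's
# output from session state. fill_template is a stand-in, not the ADK API.
def fill_template(template: str, state: dict[str, str]) -> str:
    return re.sub(r"\{(\w+)\}", lambda m: state.get(m.group(1), ""), template)

state = {
    "research_0": "siege dates verified",
    "research_1": "cannon foundry sources found",
}
prompt = fill_template("Merge the findings:\n{research_0}\n{research_1}", state)
```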

3-Round Search Protocol: Core verification → visual references → secondary corroboration. Sources graded by trust tier, highest to lowest: .edu / .gov / JSTOR → Wikipedia → blogs / news.
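A source grader for this tier ranking might look like the sketch below; the exact tier boundaries are an assumption based on the ordering above:

```python
from urllib.parse import urlparse

# Illustrative 3-tier trust grader for research sources; the boundaries are
# an assumption inferred from the slide's ranking, not the agent's code.
def trust_tier(url: str) -> int:
    host = urlparse(url).hostname or ""
    if host.endswith((".edu", ".gov")) or "jstor" in host:
        return 1                       # highest trust: academic / government
    if "wikipedia.org" in host:
        return 2                       # mid trust: encyclopedic
    return 3                           # lowest trust: blogs / news

ranked = sorted(
    [
        "https://blog.example.com/siege",
        "https://en.wikipedia.org/wiki/Fall_of_Constantinople",
        "https://www.jstor.org/stable/123",
    ],
    key=trust_tier,
)
```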

Slide 4 of 8

Phase III & VII

Script Generation & Hallucination Firewall

Scene Briefs
Phase I output
+
Aggregated Research
Phase II output
Script Agent
gemini-2.5-flash
Fact Validator
hallucination firewall
Firestore
segments collection
Script Generation (Phase III)

Each segment becomes a SegmentScript:

{
  "id": "segment_0",
  "title": "The Fall of Constantinople",
  "narration_script": "On the morning of May 29...",
  "visual_descriptions": ["Wide: siege walls...", ...],
  "veo2_scene": "Cannon smoke rising...",
  "mood": "tension_crescendo",
  "sources": ["jstor.org/...", ...]
}
Fact Validation (Phase VII)

LLM-judge classifies every sentence:

  • Supported — keep exact, backed by research
  • Unsupported Specific — remove, write bridging sentence
  • Unsupported Plausible — soften with hedging language
  • Non-Factual — keep exact (atmospheric, rhetorical)

Safety: only overwrites script if validated segment count matches original (prevents data loss)
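The safety check reduces to a count comparison, sketched below with illustrative data shapes (apply_validation is not the project's actual function name):

```python
# Sketch of the Phase VII safety rule: the validated script replaces the
# original only when segment counts match, so a truncated LLM response can
# never silently drop segments. Shapes and names are illustrative.
def apply_validation(original: list[dict], validated: list[dict]) -> list[dict]:
    if len(validated) != len(original):
        return original                # count mismatch: keep original intact
    return validated

orig = [{"id": "segment_0"}, {"id": "segment_1"}]
truncated = [{"id": "segment_0"}]                  # model dropped a segment
result = apply_validation(orig, truncated)         # rejected, original kept
```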

Per-Segment Streaming: Scene 0 runs through Phases III–XI first (fast path). Only after Scene 0 emits segment_update(complete) do Scenes 1–N start in parallel with asyncio.Semaphore(2) to respect API rate limits.
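The scheduling rule above can be sketched with plain asyncio; process_scene is a placeholder for the full Phase III–XI chain:

```python
import asyncio

# Sketch of per-segment streaming: Scene 0 runs alone first (fast path to the
# first playable segment), then Scenes 1–N fan out under a Semaphore(2) rate
# cap. process_scene stands in for Phases III–XI.
completed: list[int] = []

async def process_scene(i: int) -> None:
    await asyncio.sleep(0)             # placeholder for the real work
    completed.append(i)

async def run_pipeline(n_scenes: int) -> None:
    await process_scene(0)             # fast path: Scene 0 completes first
    sem = asyncio.Semaphore(2)         # API rate-limit guard

    async def limited(i: int) -> None:
        async with sem:
            await process_scene(i)

    await asyncio.gather(*(limited(i) for i in range(1, n_scenes)))

asyncio.run(run_pipeline(5))
```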

Slide 5 of 8

Phases IV, V & VI

Interleaved Multimodal Generation — Gemini TEXT+IMAGE

These three phases use Gemini's native interleaved TEXT+IMAGE output (response_modalities=["TEXT","IMAGE"]) — a single model call produces both narration text and illustrations simultaneously. This is the core Creative Storytellers innovation.

🎨
IV: Creative Direction

One Gemini call per scene:

  • Input: narration + visual bible + scene brief
  • Output: creative direction text + storyboard image
  • Image uploaded to GCS as storyboard PNG
  • Guides all downstream visual decisions
gemini-2.5-flash-image
V: Beat Illustration

Decomposes narration into 3–4 dramatic beats:

  • Each beat: composition hint + TEXT+IMAGE call
  • Beat 0 generated first (fast path)
  • Beats 1–N concurrent via asyncio.gather
  • Beat images are the primary visual path for the player
gemini-2.5-flash-image
VI: Visual Interleave

Assigns generation path per beat:

illustration — keep Phase V image
cinematic — Imagen 3 photorealistic
video — Veo 2 short clip

Constraint: if ≥3 beats, must include at least 1 of each type
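The coverage constraint can be expressed as a small assignment function; the default ordering below is an illustrative choice, not the agent's actual policy:

```python
# Sketch of the Phase VI constraint: with 3+ beats, the plan must include at
# least one illustration, one cinematic frame, and one video. The seeding
# order here is illustrative only.
def assign_beat_types(n_beats: int) -> list[str]:
    types = ["illustration", "cinematic", "video"]
    if n_beats < 3:
        return ["illustration"] * n_beats   # too few beats to require variety
    # seed one of each type, then fill remaining beats with illustrations
    assigned = types + ["illustration"] * (n_beats - 3)
    return assigned[:n_beats]

plan = assign_beat_types(4)
```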

Script
Phase III
Narrative Director
storyboard
Beat Illustration
3–4 beats × image
Visual Interleave
assign types
Phase XI
generation
Slide 6 of 8

Phases X & XI

Visual Research Micro-Pipeline & Image/Video Generation

Phase X — 6-Stage Visual Research
Stage 0
Query Generation
Stage 1
Google Search
Stage 2
Source Evaluation
Stage 3
Content Fetch
Stage 4
LLM Evaluation
Stage 5
Merge & Output
🔎
Visual Research Strategy
  • Fast path (Scene 0): 3 sources, early exit at 2 accepted
  • Deep path (Scenes 1–N): 10 sources, full evaluation
  • Content fetch: webpage scrape (httpx + BeautifulSoup), Wikipedia REST API, PDF via Document AI, image via Gemini multimodal
  • Output: VisualDetailManifest with enriched prompts
📷
No ADK Sub-Agents
  • All direct client.aio.models.generate_content calls
  • Bypasses ADK overhead for speed
  • Concurrent via asyncio.gather
  • Uses google_search grounding for URL discovery
  • Evaluates source authority, detail density, relevance
Phase XI — Imagen 3 + Veo 2 Generation
🌄
Imagen 3 (4 frames/segment)
  • Frame modifiers: wide establishing, medium human, close-up detail, dramatic atmosphere
  • Prompt priority: enriched manifest > script visual_descriptions > generic fallback
  • Period-specific negative_prompt: electricity, plastic, modern objects
  • 4 frames generated concurrently via asyncio.gather
  • ~5s per frame → uploaded to GCS immediately
🎥
Veo 2 (optional video)
  • Triggered for segments with veo2_scene description
  • Fire-and-forget: generate_videos(output_gcs_uri=...)
  • Long-running operation — polled every 20s (max 10 min)
  • client.operations.get is sync-only → run_in_executor
  • On completion: updates Firestore videoUrl + SSE event
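The polling loop can be sketched as below; FakeOp stands in for the real long-running operation handle, and the 20-second interval is collapsed to zero so the sketch runs instantly:

```python
import asyncio

# Sketch of the Veo polling loop: the operations client is sync-only, so each
# poll runs in a thread via run_in_executor while the event loop stays free.
# FakeOp is an illustrative stand-in for the real operation handle.
class FakeOp:
    def __init__(self, polls_needed: int):
        self.polls = 0
        self.polls_needed = polls_needed

    def refresh(self) -> bool:         # blocking, like client.operations.get
        self.polls += 1
        return self.polls >= self.polls_needed

async def wait_for_video(op: FakeOp, interval: float = 0.0) -> bool:
    loop = asyncio.get_running_loop()
    for _ in range(30):                # max attempts (~10 min at 20s each)
        done = await loop.run_in_executor(None, op.refresh)
        if done:
            return True
        await asyncio.sleep(interval)  # 20s in production
    return False

finished = asyncio.run(wait_for_video(FakeOp(polls_needed=3)))
```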
Slide 7 of 8

Live Voice Layer

WebSocket Relay, Interruption Handling & Live Illustration

Browser
AudioWorkletNode
PCM 16kHz, 16-bit, mono
1024-byte chunks → base64
AudioBufferSourceNode
PCM 24kHz playback queue
AnalyserNode → Ken Burns
caption display
Word-by-word blur reveal
interrupted
Stop audio queue + Ken Burns
Switch to listening mode
PCM
audio
text
live-relay (Node.js)
Fetch documentary context
from Firestore (15m cache)
Build system instruction:
persona + all segments
+ sources + visual bible
generate_illustration
Fire-and-forget (5–7s)
FunctionResponse < 10ms
resumption token
Saved to Firestore
Valid 2 hours
WS
WS
Gemini Live API
BidiGenerateContent
gemini-2.5-flash-native-audio
VAD detection →
serverContent.interrupted
modelTurn
Interleaved text + audio parts
sessionResumptionUpdate
goAway handling

Interruption Flow: User speaks → VAD detects speech → Gemini sets interrupted=true → relay forwards to browser → audio queue flushed → Ken Burns paused → waveform + caption mode. On turn_complete, documentary resumes at next beat.
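The relay-side flow reduces to a small state machine; the event field names mirror the Live API messages described above, but the class itself is an illustrative sketch:

```python
# Sketch of relay-side interruption handling. Field names (interrupted,
# turn_complete) mirror the Live API messages above; the class is
# illustrative, not the relay's actual code.
class VoiceSession:
    def __init__(self):
        self.audio_queue: list[bytes] = []
        self.mode = "narrating"        # narrating | listening

    def on_server_content(self, msg: dict) -> None:
        if msg.get("interrupted"):     # VAD fired: user started speaking
            self.audio_queue.clear()   # flush queued playback
            self.mode = "listening"    # pause Ken Burns, show waveform
        elif msg.get("turn_complete"):
            self.mode = "narrating"    # resume documentary at next beat

s = VoiceSession()
s.audio_queue.append(b"pcm-frame")
s.on_server_content({"interrupted": True})
```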

🎤
Audio In

16kHz PCM, 16-bit mono
1024-byte chunks

🔊
Audio Out

24kHz PCM, 16-bit mono
Buffer queue playback

🎨
Live Illustration

Gemini calls function
mid-narration (fire & forget)
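The upstream audio framing (16 kHz 16-bit mono PCM in 1024-byte base64 chunks) can be sketched as follows; frame_pcm is an illustrative name:

```python
import base64

# Sketch of the browser-to-relay audio framing: raw PCM split into 1024-byte
# chunks, each base64-encoded for the WebSocket message. frame_pcm is an
# illustrative helper, not the app's actual code.
def frame_pcm(pcm: bytes, chunk_size: int = 1024) -> list[str]:
    return [
        base64.b64encode(pcm[i:i + chunk_size]).decode("ascii")
        for i in range(0, len(pcm), chunk_size)
    ]

# 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio
frames = frame_pcm(b"\x00\x01" * 1600)
```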

Slide 8 of 8

Real-Time Data Flow

SSE Streaming, Firestore Schema & Frontend State

SSE Event Types
  • pipeline_phase — { phase, label, message }
  • agent_status — { agentId, status, query, facts[] }
  • segment_update — { segmentId, status, title, imageUrls[] }
  • stats_update — { sourcesFound, factsVerified }
  • narration_beat — { segmentId, beatIndex, imageUrl }
  • geo_update — { segmentId, center, events, routes }
  • agent_source_eval — { agentId, url, accepted, reason }

Drip buffer (150ms): Prevents UI overload from parallel agent bursts. Events queued in pendingRef, released at intervals.
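The drip buffer's queue-and-release behavior can be sketched in a few lines; the frontend's real hook is TypeScript, so this Python version, with illustrative names, only models the logic:

```python
import asyncio

# Sketch of the 150 ms drip buffer: bursts of SSE events queue up and are
# released one per tick, so the UI never handles a parallel-agent burst in a
# single frame. Names (pending, delivered) are illustrative.
class DripBuffer:
    def __init__(self, interval: float = 0.15):
        self.pending: list[dict] = []
        self.delivered: list[dict] = []
        self.interval = interval

    def push(self, event: dict) -> None:
        self.pending.append(event)     # burst lands here, not in the UI

    async def drain(self) -> None:
        while self.pending:
            self.delivered.append(self.pending.pop(0))  # release one event
            if self.pending:
                await asyncio.sleep(0)  # self.interval in production

buf = DripBuffer()
for i in range(5):
    buf.push({"type": "agent_status", "i": i})
asyncio.run(buf.drain())
```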

🗃
Firestore Schema
  • /sessions/{id} — status, gcsPath, language, visualBible
  • /sessions/{id}/chunks/{chunkId} — raw_text, summary, scene_ids[]
  • /sessions/{id}/segments/{segId} — title, narration_script, mood, imageUrls[4], videoUrl?, sources[]
  • /sessions/{id}/geos/{segId} — center, zoom, events[], routes[]
  • /sessions/{id}/liveSession — resumptionToken, persona, voiceState
Frontend State Architecture
SSE Stream
EventSource
useSSE Hook
150ms drip buffer
sessionStore
researchStore
playerStore
voiceStore
📄

Expedition Log

Pipeline phases narrated as expedition journal entries

🔍

Agent Cards

5-state visual machine: queued → searching → … → done

🎬

Ken Burns Player

Audio-reactive pan/zoom driven by AnalyserNode

🎤

Voice Button

Canvas waveform, interruption, resumption

Zustand 5 selector-based subscriptions ensure components re-render only when their exact slice changes. Combined with the SSE drip buffer and Vite code-splitting (<200KB gzipped), the frontend maintains 60fps even during peak parallel agent activity.