Architecture Deep Dive

AI Historian

A Real-Time Multimodal Research & Documentary Engine

11 Pipeline Phases · 6 Gemini Models · 3 Cloud Run Services · 4 Output Modalities
Input
Any historical document, any language
Output
Self-generating cinematic documentary + live voice historian
Slide 1 of 8

System Overview

Three services, one seamless experience

Frontend
Browser
React 19 · TypeScript · Vite 6 · Tailwind v4
PDF Viewer · Research Panel · Documentary Player · Voice Button
REST + SSE
WebSocket
Service 1
historian-api
FastAPI · Python 3.12 · 2Gi / 2 CPU
HTTP gateway, session lifecycle,
signed URL generation, semantic retrieval
Service 2
agent-orchestrator
ADK + FastAPI · Python 3.12 · 4Gi / 4 CPU
11-phase agent pipeline, SSE streaming,
parallel research, visual generation
Service 3
live-relay
Node.js 20 · 1Gi / 1 CPU
WebSocket proxy to Gemini Live API,
bidirectional audio, interruption handling
Firestore
Sessions, Segments, Chunks, Geo
Cloud Storage
PDFs, Images, Videos
Document AI
Multilingual OCR
Vertex AI
Imagen 3, Veo 2
Gemini Live API
2.5 Flash Native Audio
Slide 2 of 8

The 11-Phase Pipeline

A SequentialAgent orchestrating the document-to-documentary pipeline

Global Phases (run once)
I
Translation & Scan
document_analyzer — OCR, chunking, semantic curation
gemini-2.0-flash
II
Field Research
scene_research — ParallelAgent with N google_search sub-agents
gemini-2.0-flash
Per-Segment Streaming (Scene 0 first, then 1–N in parallel)
III
Synthesis
script_orch — generates SegmentScript with narration + visual descriptions
gemini-2.5-flash
IV
Creative Direction
narrative_director — TEXT+IMAGE interleaved storyboard
gemini-2.5-flash-image
V
Interleaved Composition
beat_illustration — pre-generates narration beats with images
gemini-2.5-flash-image
VI
Visual Interleave
visual_interleave — assigns illustration / cinematic / video per beat
gemini-2.0-flash
VII
Fact Validation
fact_validator — hallucination firewall, cross-references research
gemini-2.0-flash
VIII
Geographic Mapping
geo_location_agent — geocoding with Google Maps grounding
gemini-2.0-flash
IX
Visual Storyboard
narrative_visual_planner — plans unique visual territory per scene
gemini-2.0-pro
X
Visual Composition
visual_research_orch — 6-stage visual detail micro-pipeline
gemini-2.0-flash
XI
Generation
visual_director_orch — Imagen 3 frames + Veo 2 video clips
imagen-3 + veo-2
Slide 3 of 8

Phase I & II

Document Analysis & Parallel Research

Phase I — Translation & Scan
Document AI OCR
Any language, any script
Semantic Chunker
Page breaks, headings, topics
Parallel Summarizer
Semaphore(10) · Flash
Narrative Curator
gemini-2.0-pro Agent
Document AI OCR
  • Processes any format: PDF, images, scanned manuscripts
  • Supports Latin, Arabic, Cyrillic, CJK, Ottoman Turkish, dead scripts
  • Extracts layout structure (headings, paragraphs, tables)
  • Async client via process_document
Narrative Curator
  • ADK Agent running gemini-2.0-pro via run_async(ctx)
  • Produces 4–8 SceneBriefs: title, era, location, cinematic hook
  • Generates Visual Bible: Imagen 3 style guide for entire documentary
  • Writes DocumentMap: structural overview for downstream agents
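The Parallel Summarizer's bounded fan-out can be sketched as follows, a minimal illustration of the Semaphore(10) pattern; summarize_chunk is a stand-in for the real Gemini Flash call, not the project's actual function:

```python
import asyncio

# Sketch of the Parallel Summarizer: chunk summaries run concurrently, but a
# Semaphore(10) caps in-flight model calls. summarize_chunk is a placeholder
# for the real Gemini Flash request.
async def summarize_chunk(chunk: str, sem: asyncio.Semaphore) -> str:
    async with sem:                    # at most 10 concurrent calls
        await asyncio.sleep(0)         # placeholder for the model round-trip
        return chunk[:40] + "…"        # placeholder summary

async def summarize_all(chunks: list[str]) -> list[str]:
    sem = asyncio.Semaphore(10)
    return await asyncio.gather(*(summarize_chunk(c, sem) for c in chunks))

summaries = asyncio.run(summarize_all([f"chunk {i} text" for i in range(25)]))
```

asyncio.gather preserves input order, so summaries line up with their source chunks regardless of completion order.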
Phase II — Field Research
SceneBriefs
from Phase I
Inject Context
per-scene state keys
ParallelAgent
researcher_0
google_search
researcher_1
google_search
researcher_2
google_search
researcher_N
google_search
Aggregator
merges all research

ADK Constraint: the google_search tool cannot be combined with other tools in the same agent. Each researcher is therefore a search-only Agent with its own output_key. The Aggregator reads all {research_0} … {research_N} state keys via ADK template substitution.
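The state-key substitution can be illustrated with a minimal sketch; the regex-based fill_template and the state dict shape are assumptions standing in for ADK's own template mechanism:

```python
import re

# Illustrative ADK-style template substitution: the Aggregator's prompt
# references {research_i} keys, which are replaced with each researcher's
# output from session state. fill_template is a stand-in, not the ADK API.
def fill_template(template: str, state: dict[str, str]) -> str:
    return re.sub(r"\{(\w+)\}", lambda m: state.get(m.group(1), ""), template)

state = {
    "research_0": "siege dates verified",
    "research_1": "cannon foundry sources found",
}
prompt = fill_template("Merge the findings:\n{research_0}\n{research_1}", state)
```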

3-Round Search Protocol: Core verification → visual references → secondary corroboration. Sources graded by trust tier, highest to lowest: .edu / .gov / JSTOR → Wikipedia → blogs / news.
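A source grader for this tier ranking might look like the sketch below; the exact tier boundaries are an assumption based on the ordering above:

```python
from urllib.parse import urlparse

# Illustrative 3-tier trust grader for research sources; the boundaries are
# an assumption inferred from the slide's ranking, not the agent's code.
def trust_tier(url: str) -> int:
    host = urlparse(url).hostname or ""
    if host.endswith((".edu", ".gov")) or "jstor" in host:
        return 1                       # highest trust: academic / government
    if "wikipedia.org" in host:
        return 2                       # mid trust: encyclopedic
    return 3                           # lowest trust: blogs / news

ranked = sorted(
    [
        "https://blog.example.com/siege",
        "https://en.wikipedia.org/wiki/Fall_of_Constantinople",
        "https://www.jstor.org/stable/123",
    ],
    key=trust_tier,
)
```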

Slide 4 of 8

Phase III & VII

Script Generation & Hallucination Firewall

Scene Briefs
Phase I output
+
Aggregated Research
Phase II output
Script Agent
gemini-2.5-flash
Fact Validator
hallucination firewall
Firestore
segments collection
Script Generation (Phase III)

Each segment becomes a SegmentScript:

{
  "id": "segment_0",
  "title": "The Fall of Constantinople",
  "narration_script": "On the morning of May 29...",
  "visual_descriptions": ["Wide: siege walls...", ...],
  "veo2_scene": "Cannon smoke rising...",
  "mood": "tension_crescendo",
  "sources": ["jstor.org/...", ...]
}
Fact Validation (Phase VII)

LLM-judge classifies every sentence:

  • Supported — keep exact, backed by research
  • Unsupported Specific — remove, write bridging sentence
  • Unsupported Plausible — soften with hedging language
  • Non-Factual — keep exact (atmospheric, rhetorical)

Safety: only overwrites script if validated segment count matches original (prevents data loss)
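The safety check reduces to a count comparison, sketched below with illustrative data shapes (apply_validation is not the project's actual function name):

```python
# Sketch of the Phase VII safety rule: the validated script replaces the
# original only when segment counts match, so a truncated LLM response can
# never silently drop segments. Shapes and names are illustrative.
def apply_validation(original: list[dict], validated: list[dict]) -> list[dict]:
    if len(validated) != len(original):
        return original                # count mismatch: keep original intact
    return validated

orig = [{"id": "segment_0"}, {"id": "segment_1"}]
truncated = [{"id": "segment_0"}]                  # model dropped a segment
result = apply_validation(orig, truncated)         # rejected, original kept
```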

Per-Segment Streaming: Scene 0 runs through Phases III–XI first (fast path). Only after Scene 0 emits segment_update(complete) do Scenes 1–N start in parallel with asyncio.Semaphore(2) to respect API rate limits.
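The scheduling rule above can be sketched with plain asyncio; process_scene is a placeholder for the full Phase III–XI chain:

```python
import asyncio

# Sketch of per-segment streaming: Scene 0 runs alone first (fast path to the
# first playable segment), then Scenes 1–N fan out under a Semaphore(2) rate
# cap. process_scene stands in for Phases III–XI.
completed: list[int] = []

async def process_scene(i: int) -> None:
    await asyncio.sleep(0)             # placeholder for the real work
    completed.append(i)

async def run_pipeline(n_scenes: int) -> None:
    await process_scene(0)             # fast path: Scene 0 completes first
    sem = asyncio.Semaphore(2)         # API rate-limit guard

    async def limited(i: int) -> None:
        async with sem:
            await process_scene(i)

    await asyncio.gather(*(limited(i) for i in range(1, n_scenes)))

asyncio.run(run_pipeline(5))
```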

Slide 5 of 8

Phases IV, V & VI

Interleaved Multimodal Generation — Gemini TEXT+IMAGE

These three phases use Gemini's native interleaved TEXT+IMAGE output (response_modalities=["TEXT","IMAGE"]) — a single model call produces both narration text and illustrations simultaneously. This is the core Creative Storytellers innovation.

🎨
IV: Creative Direction

One Gemini call per scene:

  • Input: narration + visual bible + scene brief
  • Output: creative direction text + storyboard image
  • Image uploaded to GCS as storyboard PNG
  • Guides all downstream visual decisions
gemini-2.5-flash-image
V: Beat Illustration

Decomposes narration into 3–4 dramatic beats:

  • Each beat: composition hint + TEXT+IMAGE call
  • Beat 0 generated first (fast path)
  • Beats 1–N concurrent via asyncio.gather
  • Beat images are the primary visual path for the player
gemini-2.5-flash-image
VI: Visual Interleave

Assigns generation path per beat:

illustration — keep Phase V image
cinematic — Imagen 3 photorealistic
video — Veo 2 short clip

Constraint: if ≥3 beats, must include at least 1 of each type
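The coverage constraint can be expressed as a small assignment function; the default ordering below is an illustrative choice, not the agent's actual policy:

```python
# Sketch of the Phase VI constraint: with 3+ beats, the plan must include at
# least one illustration, one cinematic frame, and one video. The seeding
# order here is illustrative only.
def assign_beat_types(n_beats: int) -> list[str]:
    types = ["illustration", "cinematic", "video"]
    if n_beats < 3:
        return ["illustration"] * n_beats   # too few beats to require variety
    # seed one of each type, then fill remaining beats with illustrations
    assigned = types + ["illustration"] * (n_beats - 3)
    return assigned[:n_beats]

plan = assign_beat_types(4)
```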

Script
Phase III
Narrative Director
storyboard
Beat Illustration
3–4 beats × image
Visual Interleave
assign types
Phase XI
generation
Slide 6 of 8

Phases X & XI

Visual Research Micro-Pipeline & Image/Video Generation

Phase X — 6-Stage Visual Research
Stage 0
Query Generation
Stage 1
Google Search
Stage 2
Source Evaluation
Stage 3
Content Fetch
Stage 4
LLM Evaluation
Stage 5
Merge & Output
🔎
Visual Research Strategy
  • Fast path (Scene 0): 3 sources, early exit at 2 accepted
  • Deep path (Scenes 1–N): 10 sources, full evaluation
  • Content fetch: webpage scrape (httpx + BeautifulSoup), Wikipedia REST API, PDF via Document AI, image via Gemini multimodal
  • Output: VisualDetailManifest with enriched prompts
📷
No ADK Sub-Agents
  • All direct client.aio.models.generate_content calls
  • Bypasses ADK overhead for speed
  • Concurrent via asyncio.gather
  • Uses google_search grounding for URL discovery
  • Evaluates source authority, detail density, relevance
Phase XI — Imagen 3 + Veo 2 Generation
🌄
Imagen 3 (4 frames/segment)
  • Frame modifiers: wide establishing, medium human, close-up detail, dramatic atmosphere
  • Prompt priority: enriched manifest > script visual_descriptions > generic fallback
  • Period-specific negative_prompt: electricity, plastic, modern objects
  • 4 frames generated concurrently via asyncio.gather
  • ~5s per frame → uploaded to GCS immediately
🎥
Veo 2 (optional video)
  • Triggered for segments with veo2_scene description
  • Fire-and-forget: generate_videos(output_gcs_uri=...)
  • Long-running operation — polled every 20s (max 10 min)
  • client.operations.get is sync-only → run_in_executor
  • On completion: updates Firestore videoUrl + SSE event
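The polling loop can be sketched as below; FakeOp stands in for the real long-running operation handle, and the 20-second interval is collapsed to zero so the sketch runs instantly:

```python
import asyncio

# Sketch of the Veo polling loop: the operations client is sync-only, so each
# poll runs in a thread via run_in_executor while the event loop stays free.
# FakeOp is an illustrative stand-in for the real operation handle.
class FakeOp:
    def __init__(self, polls_needed: int):
        self.polls = 0
        self.polls_needed = polls_needed

    def refresh(self) -> bool:         # blocking, like client.operations.get
        self.polls += 1
        return self.polls >= self.polls_needed

async def wait_for_video(op: FakeOp, interval: float = 0.0) -> bool:
    loop = asyncio.get_running_loop()
    for _ in range(30):                # max attempts (~10 min at 20s each)
        done = await loop.run_in_executor(None, op.refresh)
        if done:
            return True
        await asyncio.sleep(interval)  # 20s in production
    return False

finished = asyncio.run(wait_for_video(FakeOp(polls_needed=3)))
```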
Slide 7 of 8

Live Voice Layer

WebSocket Relay, Interruption Handling & Live Illustration

Browser
AudioWorkletNode
PCM 16kHz, 16-bit, mono
1024-byte chunks → base64
AudioBufferSourceNode
PCM 24kHz playback queue
AnalyserNode → Ken Burns
caption display
Word-by-word blur reveal
interrupted
Stop audio queue + Ken Burns
Switch to listening mode
PCM
audio
text
live-relay (Node.js)
Fetch documentary context
from Firestore (15m cache)
Build system instruction:
persona + all segments
+ sources + visual bible
generate_illustration
Fire-and-forget (5–7s)
FunctionResponse < 10ms
resumption token
Saved to Firestore
Valid 2 hours
WS
WS
Gemini Live API
BidiGenerateContent
gemini-2.5-flash-native-audio
VAD detection →
serverContent.interrupted
modelTurn
Interleaved text + audio parts
sessionResumptionUpdate
goAway handling

Interruption Flow: User speaks → VAD detects speech → Gemini sets interrupted=true → relay forwards to browser → audio queue flushed → Ken Burns paused → waveform + caption mode. On turn_complete, documentary resumes at next beat.
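The relay-side flow reduces to a small state machine; the event field names mirror the Live API messages described above, but the class itself is an illustrative sketch:

```python
# Sketch of relay-side interruption handling. Field names (interrupted,
# turn_complete) mirror the Live API messages above; the class is
# illustrative, not the relay's actual code.
class VoiceSession:
    def __init__(self):
        self.audio_queue: list[bytes] = []
        self.mode = "narrating"        # narrating | listening

    def on_server_content(self, msg: dict) -> None:
        if msg.get("interrupted"):     # VAD fired: user started speaking
            self.audio_queue.clear()   # flush queued playback
            self.mode = "listening"    # pause Ken Burns, show waveform
        elif msg.get("turn_complete"):
            self.mode = "narrating"    # resume documentary at next beat

s = VoiceSession()
s.audio_queue.append(b"pcm-frame")
s.on_server_content({"interrupted": True})
```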

🎤
Audio In

16kHz PCM, 16-bit mono
1024-byte chunks

🔊
Audio Out

24kHz PCM, 16-bit mono
Buffer queue playback

🎨
Live Illustration

Gemini calls function
mid-narration (fire & forget)
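The upstream audio framing (16 kHz 16-bit mono PCM in 1024-byte base64 chunks) can be sketched as follows; frame_pcm is an illustrative name:

```python
import base64

# Sketch of the browser-to-relay audio framing: raw PCM split into 1024-byte
# chunks, each base64-encoded for the WebSocket message. frame_pcm is an
# illustrative helper, not the app's actual code.
def frame_pcm(pcm: bytes, chunk_size: int = 1024) -> list[str]:
    return [
        base64.b64encode(pcm[i:i + chunk_size]).decode("ascii")
        for i in range(0, len(pcm), chunk_size)
    ]

# 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio
frames = frame_pcm(b"\x00\x01" * 1600)
```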

Slide 8 of 8

Real-Time Data Flow

SSE Streaming, Firestore Schema & Frontend State

SSE Event Types
  • pipeline_phase — { phase, label, message }
  • agent_status — { agentId, status, query, facts[] }
  • segment_update — { segmentId, status, title, imageUrls[] }
  • stats_update — { sourcesFound, factsVerified }
  • narration_beat — { segmentId, beatIndex, imageUrl }
  • geo_update — { segmentId, center, events, routes }
  • agent_source_eval — { agentId, url, accepted, reason }

Drip buffer (150ms): Prevents UI overload from parallel agent bursts. Events queued in pendingRef, released at intervals.
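The drip buffer's queue-and-release behavior can be sketched in a few lines; the frontend's real hook is TypeScript, so this Python version, with illustrative names, only models the logic:

```python
import asyncio

# Sketch of the 150 ms drip buffer: bursts of SSE events queue up and are
# released one per tick, so the UI never handles a parallel-agent burst in a
# single frame. Names (pending, delivered) are illustrative.
class DripBuffer:
    def __init__(self, interval: float = 0.15):
        self.pending: list[dict] = []
        self.delivered: list[dict] = []
        self.interval = interval

    def push(self, event: dict) -> None:
        self.pending.append(event)     # burst lands here, not in the UI

    async def drain(self) -> None:
        while self.pending:
            self.delivered.append(self.pending.pop(0))  # release one event
            if self.pending:
                await asyncio.sleep(0)  # self.interval in production

buf = DripBuffer()
for i in range(5):
    buf.push({"type": "agent_status", "i": i})
asyncio.run(buf.drain())
```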

🗃
Firestore Schema
  • /sessions/{id} — status, gcsPath, language, visualBible
  • /sessions/{id}/chunks/{chunkId} — raw_text, summary, scene_ids[]
  • /sessions/{id}/segments/{segId} — title, narration_script, mood, imageUrls[4], videoUrl?, sources[]
  • /sessions/{id}/geos/{segId} — center, zoom, events[], routes[]
  • /sessions/{id}/liveSession — resumptionToken, persona, voiceState
Frontend State Architecture
SSE Stream
EventSource
useSSE Hook
150ms drip buffer
sessionStore
researchStore
playerStore
voiceStore
📄

Expedition Log

Pipeline phases narrated as expedition journal entries

🔍

Agent Cards

5-state visual machine: queued → searching → … → done

🎬

Ken Burns Player

Audio-reactive pan/zoom driven by AnalyserNode

🎤

Voice Button

Canvas waveform, interruption, resumption

Zustand 5 selector-based subscriptions ensure components re-render only when their exact slice changes. Combined with the SSE drip buffer and Vite code-splitting (<200KB gzipped), the frontend maintains 60fps even during peak parallel agent activity.