Hackathon Showcase Finalist

Capturia

YouTube Video

Project Description

Capturia — generative UI for live video

Capturia composes broadcast-grade overlays on a live webcam feed in real time, driven entirely by an AI agent. There is no chat sidebar. The user’s video fills the screen, and the agent’s “replies” are spatial UI components — lower-thirds, metrics panels, sparkline charts, big counters, ticker bands, keyword chips, donut rings, letterbox bars — that materialize, animate, mutate, and move on the live feed. The agent never speaks prose. Every utterance is either a tool call that changes the on-screen state or silence. The interface IS the agent’s output.

Beyond text-based chat

Most generative-UI demos are chat-augmented forms or sidecar copilots: a panel suggests, the user clicks, the form fills. Capturia inverts that. The user talks naturally — “my name is Andres, founder of Pixalia”, “our Q4 revenue is 1.8 million up 24 percent”, “bump revenue to 2.1”, “move the chart to the top right”, “add a big counter for twelve thousand viewers” — and watches an animated broadcast-grade overlay appear, update, slide between anchor positions, flash green on a positive delta, or burst a halo when a counter crosses 10K. There is no transcript pane to read; the value lives entirely in the rendered, animated screen.

Originality

Three things make this distinct from a typical CopilotKit/AG-UI showcase:

Spatial generative UI. The catalog isn’t a list of forms — every overlay has anchor positions (top-left, bottom-center, full-bottom, etc.) and the agent reasons about layout, density, and presentation order. A move_overlay tool triggers a FLIP-style transform between previous and new bounding boxes; a LowerThird defaults to bottom-left; a cinematic Letterbox slides black bars from the screen edges.
Incremental tools over regeneration. The agent has dedicated incremental actions — bump_metric, append_chart_data, move_overlay — and is prompted to prefer them over modify_overlay. So “bump revenue to 1.8M” count-up tweens the row, flashes it green, and appends a new point to the inline sparkline — instead of redrawing the panel. The animation system is hand-tuned (digit-rolls, ring ripples, milestone halos, gradient hue-cycle progress bars, sparkle dots on radial rings ≥85%, ambient floating particles when voice is live).
Voice-first. Web Speech API streams transcripts into the AG-UI session as [VOICE]-prefixed messages. A 3-rule prompt distinguishes explicit commands (“add a lower third”) from implicit cues (“my name is Andres” → infer LowerThird) from pure filler (“you know what I mean” → silence). Same agent for typed and spoken input.

Technical execution

Stack: Next.js 16 (App Router, Turbopack) · React 19.2 · TypeScript · Tailwind v4 · CopilotKit 1.57.1 · @copilotkit/runtime/v2 · @copilotkit/a2ui-renderer · @ai-sdk/google · Gemini 3.1 Flash-Lite · Web Speech API · MediaRecorder + getDisplayMedia for capture · deployed on Vercel.

CopilotKit + AG-UI as the agent runtime. The backend (app/api/copilotkit/[[...slug]]/route.ts) runs a BuiltInAgent from @copilotkit/runtime/v2 in single-route mode, driving Gemini 2.5 Flash-Lite via the Vercel AI SDK with maxSteps: 1 and temperature: 0 for sub-second TTFT. The frontend (app/page.tsx) registers six tools the agent can call — add_overlay, modify_overlay, remove_overlay, move_overlay, append_chart_data, bump_metric — via useCopilotAction, and shares the current overlay list back to the agent through useCopilotReadable (AG-UI shared state), so the agent always knows what’s on screen and can target updates by id. useCopilotChat().appendMessage pipes voice transcripts into the same session as [VOICE] user messages.

A2UI catalog as schema-of-truth. lib/catalog.ts defines all 12 components as Zod schemas. lib/a2ui-catalog.tsx invokes the real createCatalog from @copilotkit/a2ui-renderer, registering each Zod definition with its React adapter renderer. The catalog object is exposed on window for live inspection and forms the foundation for the upcoming <A2UIRenderer> surface mode (where the agent will push entire UI surfaces, not just per-component tool calls).

Defensive runtime layer. Agent JSON is untrusted at runtime. A shared normalizeProps(type, props) helper applied to both add_overlay and modify_overlay coerces malformed metrics (objects → {label, value, delta} with stringified fields), keywords ([{text:"x"}] → string[]), chart data ([{value:n}] → number[]), steps, items, and currentStep. Each component also guards its own iteration so partial agent props can’t crash the tree.

How the interface responds to agent output

Single-tool-call latency budget for “add a metrics panel — revenue 1.2M, users 18K, churn 2.1%”:

Web Speech onresult fires (interim + final transcript).
Transcript is appended to the AG-UI session as a [VOICE] add a metrics panel... user message.
Gemini 2.5 Flash-Lite emits an add_overlay tool call (~150 ms TTFT).
CopilotKit dispatches the call to useCopilotAction("add_overlay").
setOverlays(prev => [...prev, { id, type, position, props }]) mutates React state.
OverlayLayer reconciles, applying a 60 ms-staggered entrance per new item.
Each overlay’s hand-authored CSS keyframe plays — overlay-enter, overlay-enter-scale, letterbox-enter-top, digit-roll, border-breathe — with bespoke easing for broadcast feel.

Subsequent updates use the same loop but trigger different visual responses: bump_metric does not unmount the panel — it count-up tweens the targeted row, flashes it green or red based on direction, and appends a new value to the inline sparkline history; append_chart_data morphs the polyline via useNumberArrayTween; move_overlay triggers a FLIP-style position transition; BigCounter rolls each digit independently and bursts a milestone halo on 10K/100K/1M crossings; KeywordHighlight rotates a palette across chips when color="auto". Recording is one click — getDisplayMedia + getUserMedia muxed into a single VP9+Opus webm download.

Generative-UI thesis

Capturia is a deliberate test of how far generative UI can go past chat: a 12-component spatial catalog, voice as the primary input, incremental tools that respect existing on-screen state, and an animation system that makes the agent’s output feel cinematic rather than utilitarian. CopilotKit + AG-UI handle the transport and shared state; A2UI’s createCatalog is the typed contract between agent and renderers. The result is a working live-broadcast composer where the agent is invisible and the screen is the conversation.