Abstract
A single set of weights now covers replacement, removal, style transfer, and reference-driven insertion, but only if the request is already complete. Aurora turns that if into when.
Recent video editing models converge on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit.
We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, resolving textual and visual underspecification before generation. We introduce AgentEdit-Bench to evaluate the framework along these two axes of underspecification, textual and visual, and we show that the agent transfers to compatible frozen video editing models.
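As a rough illustration of what "underspecification" means here, the sketch below checks a draft edit plan for the three inputs the abstract says real requests often omit: model-ready text, reference images, and spatial grounding. The names `EditPlan`, `missing_channels`, `LOCAL_TASKS`, and `REFERENCE_TASKS`, and the assignment of tasks to those sets, are assumptions made for this example and are not taken from the Aurora paper or code.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: which edit types need spatial grounding or a reference image.
# The task names come from the abstract; the grouping is illustrative only.
LOCAL_TASKS = {"replacement", "removal", "insertion"}
REFERENCE_TASKS = {"insertion"}


@dataclass
class EditPlan:
    """Hypothetical structured edit plan, one field per conditioning channel."""
    task: str
    prompt: Optional[str] = None                        # model-ready text
    reference_images: List[str] = field(default_factory=list)
    mask: Optional[str] = None                          # spatial grounding


def missing_channels(plan: EditPlan) -> List[str]:
    """Return the channels an agent would still have to resolve."""
    missing = []
    if not plan.prompt:
        missing.append("prompt")
    if plan.task in REFERENCE_TASKS and not plan.reference_images:
        missing.append("reference_images")
    if plan.task in LOCAL_TASKS and plan.mask is None:
        missing.append("mask")
    return missing
```

Under these assumptions, a bare insertion request reports all three channels as missing, while a style-transfer plan that already carries a prompt reports none.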
Edits, end to end
Drag the aurora line to reveal either the source or the Aurora result; switch to scrub mode to step through both clips frame by frame.
Capability matrix
Hover over any tile to bring up the prompt; click to open a side-by-side comparison of the source, Kiwi-Edit and UniVideo under the raw request, and Aurora once its agent has resolved the missing conditions.
Benchmarks
Transfer across models
The Aurora agent stays frozen and model-agnostic: it rewrites the user's request, invokes tools when needed, and emits a conditioning tuple that different unified video editing models can consume. The gain therefore comes from the agentic front end itself, not from pairing it with one particular downstream model.
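One way to picture this decoupling is a shared conditioning tuple passed unchanged to interchangeable backends. The sketch below is a minimal illustration under that assumption. `Conditioning`, `VideoEditor`, `EchoEditor`, and `run_on_all` are hypothetical names, and the stand-in backend only returns a string where a real model would produce a video.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, Sequence


@dataclass(frozen=True)
class Conditioning:
    """Hypothetical conditioning tuple emitted once by the agent."""
    text: str
    source_video: str
    reference_images: Sequence[str] = ()
    mask: Optional[str] = None


class VideoEditor(Protocol):
    """Assumed interface that any compatible editing model would expose."""
    def edit(self, cond: Conditioning) -> str: ...


class EchoEditor:
    """Stand-in backend used only for this example; it edits nothing."""
    def __init__(self, name: str) -> None:
        self.name = name

    def edit(self, cond: Conditioning) -> str:
        return f"{self.name}:{cond.text}"


def run_on_all(cond: Conditioning, editors: Sequence[VideoEditor]) -> List[str]:
    """Feed the same agent output to every backend without modification."""
    return [editor.edit(cond) for editor in editors]
```

Because the tuple is immutable and built before any backend is chosen, swapping the downstream model in this sketch changes nothing on the agent side.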
Per-task breakdown
Cite
Aurora and AgentEdit-Bench are described in the same paper. If you use either, please cite the work.