Agentic · Unified · Video Editing

Aurora
Unified Video Editing with
a Tool-Using Agent

Yongsheng Yu¹
Ziyun Zeng¹
Zhiyuan Xiao¹
Zhenghong Zhou¹
Hang Hua²
Wei Xiong³
Jiebo Luo¹

¹University of Rochester
²MIT-IBM Watson AI Lab
³NVIDIA


Abstract

Closing the gap between
a user's request and the model's conditions.

A single set of weights now covers replacement, removal, style transfer, and reference-driven insertion, but only when the request is already complete. Aurora turns that if into when.

Recent video editing models converge on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit.

We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, resolving textual and visual underspecification before generation. We introduce AgentEdit-Bench to evaluate the framework along these two axes, textual and visual underspecification, and we show that the agent transfers to compatible frozen video editing models.
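To make the idea of a structured edit plan concrete, here is a minimal sketch of what such a plan could look like and how one might detect which conditions a raw request leaves unresolved. Everything in it is an assumption made for illustration: the `EditPlan` fields, the `BBox` type, and the `missing_conditions` helper are hypothetical names, not the schema or code used by Aurora.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; the paper's actual schema may differ.
BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class EditPlan:
    """One slot per conditioning channel named in the abstract."""
    prompt: str                                   # model-ready text
    source_video: str                             # path to the source clip
    reference_images: List[str] = field(default_factory=list)
    region: Optional[BBox] = None                 # spatial grounding for local edits


def missing_conditions(plan: EditPlan, local_edit: bool, needs_reference: bool) -> List[str]:
    """Return the names of the channels that the request leaves unresolved."""
    missing = []
    if not plan.prompt.strip():
        missing.append("prompt")
    if needs_reference and not plan.reference_images:
        missing.append("reference_images")
    if local_edit and plan.region is None:
        missing.append("region")
    return missing


# A raw request for a local, reference-driven edit that supplies only text.
raw = EditPlan(prompt="put my dog on the sofa", source_video="clip.mp4")
print(missing_conditions(raw, local_edit=True, needs_reference=True))
```

In this toy view, the agent's job is to keep calling tools until the list of missing conditions is empty, and only then hand the plan to the generator.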

Edits, end to end

From raw instruction to finished frame.


Capability matrix

Every benchmark case, grouped by what the model had to edit.

Each case is shown as a side-by-side comparison: the source, Kiwi-Edit and UniVideo under the raw request, and Aurora after its agent resolves the missing conditions.

Benchmarks

Three benchmarks,
one transferable lift.

Transfer across models

One agentic front-end,
multiple unified video editing models.

The Aurora agent stays frozen and model-agnostic: it rewrites the user's request, invokes tools when needed, and emits a conditioning tuple that different unified video editing models can consume. The gain comes from the front-end paradigm, not from pairing it with just one downstream model.
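The separation described above can be sketched as a small interface: the front-end produces one conditioning tuple, and any backend that accepts that tuple can consume it unchanged. This is a toy illustration under stated assumptions, not Aurora's implementation. `Conditioning`, `VideoEditor`, `StubEditor`, `run_agent`, and the `"ground"` tool key are all hypothetical names, and the request "rewriting" here is a trivial string cleanup standing in for the VLM.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Protocol


class Conditioning(NamedTuple):
    """Hypothetical conditioning tuple shared by all backends."""
    text: str
    source_video: str
    reference_images: List[str]
    mask: Optional[str]


class VideoEditor(Protocol):
    """Any downstream model that can consume the tuple."""
    def edit(self, cond: Conditioning) -> str: ...


class StubEditor:
    """Stand-in backend; a real one would run a video diffusion model."""
    def __init__(self, name: str) -> None:
        self.name = name

    def edit(self, cond: Conditioning) -> str:
        return f"{self.name}:{cond.text}"


def run_agent(request: str, video: str,
              tools: Dict[str, Callable[[str], str]]) -> Conditioning:
    """Toy front-end: clean up the text and call a grounding tool if one is available."""
    text = request.strip().rstrip(".")
    mask = tools["ground"](video) if "ground" in tools else None
    return Conditioning(text, video, [], mask)


# The same tuple is passed, unchanged, to two different backends.
cond = run_agent(" Remove the car. ", "clip.mp4",
                 {"ground": lambda v: v + ".mask.png"})
backends: List[VideoEditor] = [StubEditor("model_a"), StubEditor("model_b")]
results = [b.edit(cond) for b in backends]
print(results)
```

The design point is that `run_agent` never inspects which backend will run, which is what lets the front-end stay fixed while the downstream model is swapped.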

Panels: Kiwi-Edit (raw request vs. with the Aurora agent); UniVideo (raw request vs. with the Aurora agent).

Per-task breakdown

Where the Aurora agent helps the most.

Legend: Aurora without the agent; Aurora with the agent.

Cite

BibTeX.

Aurora and AgentEdit-Bench are described in the same paper. If you use either, please cite the work.
