Agentic · Unified · Video Editing

Aurora: Unified Video Editing with a Tool-Using Agent

  • Yongsheng Yu¹
  • Ziyun Zeng¹
  • Zhiyuan Xiao¹
  • Zhenghong Zhou¹
  • Hang Hua²
  • Wei Xiong³
  • Jiebo Luo¹

  • ¹ University of Rochester
  • ² MIT-IBM Watson AI Lab
  • ³ NVIDIA

Abstract

Closing the gap between a user's request and the model's conditions.

A single set of weights now covers replacement, removal, style transfer, and reference-driven insertion, but only if the request is already complete. Aurora turns that if into when.

Recent video editing models converge on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, which real requests often omit.

We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, resolving textual and visual underspecification before generation. We introduce AgentEdit-Bench to evaluate the framework along these two axes, textual and visual underspecification, and we show that the agent transfers to compatible frozen video editing models.
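As a rough illustration of what a structured edit plan could look like, the sketch below models the conditioning channels named above (text, source video, reference images, spatial grounding) and reports which ones a raw request leaves unfilled. Every name here (`EditPlan`, `missing_conditions`, `LOCAL_TASKS`) is hypothetical, and the rule for which tasks need which channel is an assumption made for the example, not the paper's specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: these tasks are treated as "local edits" that need spatial grounding.
LOCAL_TASKS = {"replacement", "removal", "insertion"}


@dataclass
class EditPlan:
    """Hypothetical structured plan, one field per conditioning channel."""
    task: str                      # e.g. "removal", "style_transfer"
    prompt: str                    # model-ready text
    source_video: str              # path to the clip being edited
    reference_images: List[str] = field(default_factory=list)
    region_mask: Optional[str] = None  # spatial grounding for local edits


def missing_conditions(plan: EditPlan) -> List[str]:
    """Return the channels an agent would still have to resolve."""
    missing = []
    if not plan.prompt.strip():
        missing.append("prompt")
    # Assumption: only reference-driven insertion requires a reference image.
    if plan.task == "insertion" and not plan.reference_images:
        missing.append("reference_images")
    if plan.task in LOCAL_TASKS and plan.region_mask is None:
        missing.append("region_mask")
    return missing


if __name__ == "__main__":
    plan = EditPlan(task="removal", prompt="remove the dog", source_video="clip.mp4")
    print(missing_conditions(plan))
```

In this toy setup, a non-empty result is the signal that a tool call (for example, a grounding or retrieval tool) would be needed before generation.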

Pipeline: User → Agent plan → Tools → Conditioning → Video DiT → Output

Edits, end to end

From raw instruction to finished frame.


Benchmarks

Three benchmarks, one transferable lift.

Transfer across models

One agentic front-end, multiple unified video editing models.

The Aurora agent stays frozen and model-agnostic: it rewrites the user's request, invokes tools when needed, and emits a conditioning tuple that different unified video editing models can consume. The gain comes from the front-end paradigm, not from pairing it with just one downstream model.
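The model-agnostic idea can be sketched as a shared interface: the front-end produces one conditioning tuple, and any back-end that accepts it can be swapped in. The code below is a minimal sketch under stated assumptions. `Conditioning`, `VideoEditor`, `StubEditor`, `toy_plan`, and `run_with_agent` are hypothetical names, and the stub back-ends are placeholders rather than the real interfaces of Kiwi-Edit or UniVideo.

```python
from typing import Callable, List, NamedTuple, Optional, Protocol, Sequence


class Conditioning(NamedTuple):
    """Hypothetical conditioning tuple emitted by the front-end."""
    prompt: str
    source_video: str
    reference_images: Sequence[str]
    region_mask: Optional[str]


class VideoEditor(Protocol):
    """Anything that can consume the tuple counts as a compatible back-end."""
    def edit(self, cond: Conditioning) -> str: ...


class StubEditor:
    """Placeholder back-end; a real model would run diffusion here."""
    def __init__(self, name: str) -> None:
        self.name = name

    def edit(self, cond: Conditioning) -> str:
        return f"{self.name}:{cond.source_video}:{cond.prompt}"


def toy_plan(raw_request: str, source_video: str) -> Conditioning:
    # Stand-in for the agent: it only tidies the text. A real agent would
    # also call tools to fill reference images and masks when needed.
    return Conditioning(raw_request.strip().capitalize(), source_video, (), None)


def run_with_agent(
    raw_request: str,
    source_video: str,
    editors: Sequence[VideoEditor],
    plan_fn: Callable[[str, str], Conditioning],
) -> List[str]:
    cond = plan_fn(raw_request, source_video)   # planned once, by the front-end
    return [e.edit(cond) for e in editors]      # reused by every back-end


if __name__ == "__main__":
    backends = [StubEditor("backend-a"), StubEditor("backend-b")]
    print(run_with_agent("  make it snowy ", "clip.mp4", backends, toy_plan))
```

The point of the design is that planning happens once and the back-ends are untouched, which is what makes it possible to compare raw requests against agent-planned ones on the same frozen model.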

Kiwi-Edit: raw request → w/ Aurora agent
UniVideo: raw request → w/ Aurora agent

Per-task breakdown

Where the Aurora agent helps the most.

Compared settings: Aurora w/o agent vs. Aurora w/ agent

Cite

BibTeX.

Aurora and AgentEdit-Bench are described in the same paper. If you use either, please cite the work.