EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

John Flynn¹, Wolfgang Paier¹, Dimitar Dinev¹, Sam Nhut Nguyen¹, Hayk Poghosyan¹, Manuel Toribio¹, Sandipan Banerjee², Guy Gafni¹

¹Pipio AI, ²Amazon



EditYourself is a diffusion-based video editing model for talking heads, enabling transcript-driven lip-syncing, insertion, removal and retiming of speech while preserving identity and visual fidelity.

Example results: Image-to-Video and Video-to-Video.

Abstract

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require motion preservation, temporal coherence, speaker identity and lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven Video-to-Video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition and removal of spoken content. By adapting a pre-trained Image-to-Video (I2V) diffusion model to support audio-conditioned V2V inference, EditYourself enables precise lip synchronization while maintaining visual fidelity to the original video. This work represents a foundational step toward transforming generative video models into practical tools for professional video post-production.

Video

Method Overview

Our proposed pipeline. A global audio projection layer and audio cross-attention layers are added to the network architecture. For V2V lip syncing, we noise the tokens corresponding to the mouth area and task the model with spatio-temporally inpainting them. Purple and orange flame symbols denote LoRA and full training, respectively.
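The selective noising described in the caption can be illustrated with a small sketch. It assumes a toy latent layout (a list of frames, each a flat list of scalar tokens) and a linear blend between the clean token and Gaussian noise; the function name `noise_mouth_tokens`, the `noise_level` parameter and the blend itself are illustrative assumptions, not the schedule or code used by EditYourself.

```python
import random


def noise_mouth_tokens(latents, mouth_mask, noise_level, seed=0):
    """Blend Gaussian noise into the masked tokens only (toy example).

    latents:     list of frames, each a list of float token values
    mouth_mask:  same shape, True where a token covers the mouth area
    noise_level: 0.0 keeps the token, 1.0 replaces it with pure noise
                 (assumed linear blend, not the paper's schedule)
    """
    rng = random.Random(seed)
    noised = []
    for frame, mask in zip(latents, mouth_mask):
        row = []
        for value, is_mouth in zip(frame, mask):
            if is_mouth:
                eps = rng.gauss(0.0, 1.0)
                row.append((1.0 - noise_level) * value + noise_level * eps)
            else:
                # Tokens outside the mouth area stay clean, so they can act
                # as context while the masked tokens are inpainted.
                row.append(value)
        noised.append(row)
    return noised


latents = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask = [[False, True, False], [False, True, True]]
noised = noise_mouth_tokens(latents, mask, noise_level=1.0)
```

Only the three masked tokens change; a denoising model would then be asked to fill them in from the audio and the untouched context.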

Video Editing

EditYourself supports transcript-based video editing, including the addition of new footage and seamless removal of unwanted segments.

Addition

Insertion of new content at arbitrary temporal locations, seamlessly adhering to surrounding boundary frames (when present) while lip-syncing to modified audio.
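As a rough illustration of the bookkeeping behind an insertion, the sketch below lays out an edited timeline and reports which boundary frames exist on each side of the new segment. The helper `plan_insertion`, the 25 fps default and the `("source", i)` / `("generated", j)` tags are hypothetical choices made for this example; they are not part of EditYourself or the Pipio API.

```python
def plan_insertion(num_frames, insert_at, new_audio_seconds, fps=25.0):
    """Lay out a timeline with a generated segment inserted at `insert_at`.

    Returns (timeline, left_boundary, right_boundary). A boundary is the
    index of the neighbouring source frame, or None when the insertion
    sits at the very start or end of the clip.
    """
    if not 0 <= insert_at <= num_frames:
        raise ValueError("insert_at must lie within the clip")
    # Number of frames needed to cover the new audio (assumed fixed fps).
    n_new = round(new_audio_seconds * fps)
    timeline = (
        [("source", i) for i in range(insert_at)]
        + [("generated", j) for j in range(n_new)]
        + [("source", i) for i in range(insert_at, num_frames)]
    )
    left = insert_at - 1 if insert_at > 0 else None
    right = insert_at if insert_at < num_frames else None
    return timeline, left, right


# Insert 2 seconds of new speech after the fourth frame of a 10-frame clip.
timeline, left, right = plan_insertion(10, 4, 2.0)
# Appending at the end leaves no right-hand boundary frame.
tail_timeline, tail_left, tail_right = plan_insertion(10, 10, 0.2)
```

The `None` case mirrors the "when present" qualifier above: a segment appended at the end of a clip has only one boundary to adhere to.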

Removal

Deletion of existing content while smoothing the resulting temporal discontinuity to avoid visible jump cuts.
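A removal can be thought of as two steps: drop the frames covered by the deleted speech, then re-synthesize a short window around the resulting seam. The sketch below only does the index arithmetic. The function `plan_removal`, the 25 fps default and the `blend` window size are assumptions for illustration; the page does not state how the smoothing window is chosen.

```python
import math


def plan_removal(num_frames, cut_start_s, cut_end_s, fps=25.0, blend=4):
    """Return (kept, regen) for deleting the span [cut_start_s, cut_end_s).

    kept:  indices of source frames that survive the cut
    regen: half-open (start, end) range in the shortened clip around the
           seam, i.e. the frames a model would re-synthesize to hide the
           jump cut (window size is a hypothetical parameter)
    """
    first = max(0, math.floor(cut_start_s * fps))
    last = min(num_frames, math.ceil(cut_end_s * fps))
    kept = list(range(0, first)) + list(range(last, num_frames))
    seam = first  # position in the shortened clip where both sides meet
    regen = (max(0, seam - blend), min(len(kept), seam + blend))
    return kept, regen


# Delete one second of speech from a 4-second, 100-frame clip.
kept, regen = plan_removal(100, 1.0, 2.0)
```

Here frames 25 to 49 are dropped, and the eight frames straddling the seam are marked for regeneration.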

Rerender

Selective re-rendering of video content over specified spatial and temporal regions, conditioned on updated audio and text prompt (e.g., correcting an awkward facial expression or regenerating a hand gesture).
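One way to picture a rerender request is as a boolean mask over time, height and width that marks what should be regenerated. The following sketch builds such a mask from a frame range and a bounding box; `region_mask` and its argument layout are hypothetical and chosen for this example only.

```python
def region_mask(num_frames, height, width, frames, box):
    """Build a nested-list mask[t][y][x] that is True inside the region.

    frames: half-open (t0, t1) frame range to re-render
    box:    half-open (y0, x0, y1, x1) spatial bounding box
    """
    t0, t1 = frames
    y0, x0, y1, x1 = box
    return [
        [
            [t0 <= t < t1 and y0 <= y < y1 and x0 <= x < x1
             for x in range(width)]
            for y in range(height)
        ]
        for t in range(num_frames)
    ]


# Mark a 2x3 patch in frames 1 and 2 of a 4-frame, 4x6 grid.
mask = region_mask(4, 4, 6, frames=(1, 3), box=(1, 2, 3, 5))
selected = sum(cell for frame in mask for row in frame for cell in row)
```

Everything outside the mask would be kept from the source video, so only the selected region (for example a hand or the face) is resynthesized.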

Video-to-Video

Qualitative comparisons between EditYourself and leading open-source and proprietary solutions.

Open Source

InfiniteTalk, LatentSync, MuseTalk.

Closed Source

Creatify Lip Sync, PixVerse, SyncLabs React-1, SyncLabs lipsync-2-pro, VEED Fabric.

Image-to-Video

Qualitative comparisons between EditYourself and leading open-source and proprietary solutions.

Open Source

Hallo3, InfiniteTalk, Sonic, StableAvatar.

Closed Source

Aurora, VEED Fabric, Kling AI, OmniHuman 1.5.

Long Video Inference

EditYourself can generate minutes-long videos without noticeable identity drift. This section also provides a comparison with open-source and proprietary solutions: MultiTalk AI Avatar, Aurora, HunyuanAvatar, Kling AI Avatar v2 Pro, OmniHuman 1.5, StableAvatar.

About Us

Pipio is a human-first, AI-powered video editor that keeps creators in control. While others replace, Pipio enhances. Seamlessly adjust your performance without ever having to re-record your footage.

EditYourself represents our latest advancement in audio-driven video generation and editing, bringing state-of-the-art diffusion transformer architectures to practical video production workflows.


