Below we describe the steps needed to reproduce the results reported in our paper "Deepfake for the Good: Generating Avatars through Face-Swapping with Implicit Deepfake Generation". In that work we present a novel approach for obtaining a 3D deepfake representation of a person from a single 2D photo and a set of images of a base face avatar. Below is an illustrative video showing the outcome of such an attempt for Ms. Céline Dion.
Result video:
celine.mp4
Our solution can now also produce a 4D deepfake representation. As in the 3D version, we use a single 2D photo together with a video of a base face avatar. Below we demonstrate the capabilities of our solution.
Original video:
original_video.mp4
Face to swap with:
Output video:
output_video.mp4
It is also possible to change facial expressions:
| Paper | After ImplicitDeepfake | Changed expression |
|---|---|---|
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
- Download dataset from link
This dataset consists of the CelebA picture of Ms. Céline Dion (file: famous.jpg) and a directory with train, validation, and test pictures of a base face avatar. Each subdirectory has an associated .json file containing the camera positions from which the photos were taken. We stress that the face avatar used in this example comes from this link, and we are thankful for that piece of work.
- Convert every photo from the dataset to a 2D deepfake
Use a 2D deepfake method of your choice to convert all the pictures from the dataset directory to their deepfake versions, using the file famous.jpg as the target photo. For the experiments we conducted in the paper, we used the GHOST deepfake (see citations).
- Pick a 3D rendering model to obtain a 3D deepfake representation of the target person from step 2
Both NeRF and Gaussian Splatting solutions (see citations) work fine; our pipeline does not require any specific model. The result in the short illustrative video comes from the Gaussian Splatting model.
We created a notebook that covers steps 2 and 3 above, assuming Gaussian Splatting as the 3D rendering technique. Its content is based on similar notebooks from the relevant repositories. In case of any doubts, feel free to ask us. The notebook needs no further requirements when run on Google Colab.
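The per-split .json files mentioned above follow the NeRF-style "transforms" convention: a camera field of view plus a 4×4 camera-to-world matrix per frame. A minimal loader sketch, assuming that convention and those field names:

```python
import json

import numpy as np

def load_camera_poses(transforms_path):
    """Load camera-to-world matrices from a NeRF-style transforms .json file.

    Assumes the Blender/NeRF-synthetic convention: a top-level
    "camera_angle_x" (horizontal FOV in radians) and a "frames" list whose
    entries hold "file_path" and a 4x4 "transform_matrix".
    """
    with open(transforms_path) as f:
        meta = json.load(f)
    fov_x = meta["camera_angle_x"]
    poses = {
        frame["file_path"]: np.array(frame["transform_matrix"], dtype=np.float32)
        for frame in meta["frames"]
    }
    return fov_x, poses
```

The returned matrices can then be handed to whichever renderer (NeRF or Gaussian Splatting) you picked in step 3.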
- Download the dataset (as noted here).
This dataset consists of a directory with train, validation, and test pictures of a base face avatar. Each subdirectory has an associated .json file containing the camera positions from which the photos were taken.
- Download the photo famous.jpg.
- Convert every photo from the dataset to a 2D deepfake. Use a 2D deepfake method of your choice to convert all the pictures from the dataset directory to their deepfake versions, using the file famous.jpg as the target photo. For the experiments we conducted in the paper, we used the GHOST deepfake (see citations).
- Use 4D Facial Avatars to get a 4D avatar facial reconstruction. Attention: this model requires at least 80 GB of RAM!
To reproduce the results of this experiment, you will need to install the following tools:
- Render Images from Blender. Render a series of images from your 3D model in Blender; these images will serve as the input for the diffusion process. We recommend performing a 360-degree video render to ensure consistency while applying diffusion. An example open source 3D model -
- Apply Diffusion using Stable Diffusion. Use Stable Diffusion to apply transformations defined by a prompt to the rendered frames. The procedure that ensures the highest consistency of the result across different angles, along with optimal parameters, is described here and requires the use of EbSynth.
- Convert Rendered Images to a 3D Model using Gaussian Splatting. Feed the transformed images into the Gaussian Splatting process to generate the final 3D model.
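The final step amounts to pointing the Gaussian Splatting trainer at the folder of transformed frames. A sketch of building that command, assuming the flag names of the reference 3D Gaussian Splatting implementation (`train.py -s/-m/--iterations`); verify them against your checkout:

```python
def gaussian_splatting_cmd(data_dir, output_dir, iterations=30_000):
    """Build the training command for the reference 3D Gaussian Splatting repo.

    Assumes `data_dir` holds the transformed images plus camera metadata in a
    layout the repo accepts (COLMAP or NeRF-synthetic style).
    """
    return [
        "python", "train.py",
        "-s", str(data_dir),        # source path with images + cameras
        "-m", str(output_dir),      # where checkpoints / point clouds are written
        "--iterations", str(iterations),
    ]
```

The list can be passed directly to `subprocess.run` from a driver script.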
After completing the experiment, the following results were observed:
Used prompts:
- positive - "Photo of a bronze bust of a woman, detailed and lifelike, in the style of Auguste Rodin, polished bronze, classical sculpture aesthetics, 32k uhd, timeless and elegant, intricate details, full head and shoulders, museum quality, realistic texture, warm bronze tones, photorealistic, black background."
- negative - "Deformed, disfigured, ugly."
Used prompts:
- positive - "Photo of a head with realistic facial features, hair color changed to vibrant red, smooth and lifelike skin texture, sharp and expressive eyes, natural human proportions, high-definition detail, consistent appearance from all angles (front, side, back view), cinematic composition, trending on ArtStation."
- negative - "Unrealistic colors, distorted proportions, blurred details, heavy shadows, lack of detail."
Used prompts:
- positive - "An elf, with pointed ears, ethereal and elegant features, detailed and lifelike, in the style of Alan Lee, smooth and flawless skin, sharp and expressive eyes, long and flowing hair, otherworldly and mystical appearance, 32k uhd, high-definition detail, wearing simple yet stylish elven attire, black background, cinematic lighting, photorealistic, studio portrait."
- negative - "Distorted, disfigured, ugly, human features, unrealistic proportions, poor lighting, low detail."
- Install the software mentioned in the Tools section (we provide an example dataset in the link, so you don’t need Blender).
- Copy all scripts from this repository into the `Automatic1111` folder.
- Run Stable Diffusion with the `--api` parameter.
- Open the folder in CMD Terminal (PowerShell may not work properly).
- Activate the virtual environment: `.\venv\Scripts\activate.bat`
- Run the script: `python loop.py /path/to/main_folder "My Prompt" --start 0 --end 3 [optional params]`
- Make sure the main folder contains:
  - a video render of the object
  - a `transforms.json` file
- The code will generate folders named `iter1`, `iter2`, etc., containing the Gaussian Splatting model and generated frames.
- Video results (360-degree renders) will appear in the main folder as `iter1.mp4`, `iter2.mp4`, etc.
- During generation, the program will pause and wait for you to manually use EbSynth to propagate changes from diffusion to all frames.
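Internally, a script like `loop.py` drives the WebUI through the HTTP API that the `--api` flag exposes. A minimal sketch of one img2img call, assuming the standard Automatic1111 `/sdapi/v1/img2img` endpoint on the default port; only a small subset of the payload fields is shown:

```python
import base64
import json
from urllib import request

A1111_URL = "http://127.0.0.1:7860"  # default WebUI address when run with --api

def build_img2img_payload(image_path, prompt, negative_prompt="", denoising_strength=0.5):
    """Build a request body for the /sdapi/v1/img2img endpoint.

    Field names follow the Automatic1111 API; only a minimal subset is set here.
    """
    with open(image_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return {
        "init_images": [encoded],
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "denoising_strength": denoising_strength,
    }

def run_img2img(payload):
    """POST the payload; requires the WebUI to be running with --api."""
    req = request.Request(
        f"{A1111_URL}/sdapi/v1/img2img",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=600) as resp:
        # The API returns the transformed frames as base64-encoded images.
        return json.loads(resp.read())["images"]
```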
Results from the second experiment:
Latent Diffusion Models were introduced by Rombach et al. in “High-Resolution Image Synthesis with Latent Diffusion Models” (arXiv:2112.10752, 20 Dec 2021).
The authors propose to embed the diffusion process inside a pretrained auto-encoder so that all noisy forward / denoising reverse steps run in a much lower-dimensional latent space rather than in pixel space. This design slashes memory and compute while preserving visual fidelity, and it underpins Stable Diffusion.
Pipeline overview:
- Encode – an image $x$ is compressed by an encoder $E$ to a latent tensor $z = E(x)$.
- Diffuse & Denoise in Latent Space – the DDPM/SDE process operates on $z$, training a U-Net to predict noise in the latent domain.
- Decode – after the reverse diffusion yields $\hat{z}$, a decoder $D$ reconstructs the final high-resolution image $\hat{x} = D(\hat{z})$.
Because the diffusion runs in this compact latent space rather than on full-resolution pixels, every training and sampling step is far cheaper, which is what makes high-resolution synthesis practical.
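The encode, diffuse/denoise, decode pipeline can be illustrated with a toy NumPy sketch. The real model uses a learned VAE and U-Net; here simple down/upsampling stands in for the autoencoder, just to show where the savings come from:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the pretrained autoencoder: a real LDM uses a learned
# VAE; here we downsample/upsample by 8x averaging to illustrate the idea.
def encode(x):          # (H, W) pixels -> (H//8, W//8) latent
    h, w = x.shape
    return x.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def decode(z):          # latent -> pixels (nearest-neighbour upsample)
    return np.repeat(np.repeat(z, 8, axis=0), 8, axis=1)

def q_sample(z0, alpha_bar, eps):
    """Forward diffusion step: the noise is added to the latent, not the pixels."""
    return np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps

x = rng.random((512, 512))        # a 512x512 "image"
z = encode(x)                     # 64x64 latent: 64x fewer values to denoise
zt = q_sample(z, alpha_bar=0.5, eps=rng.standard_normal(z.shape))
x_hat = decode(z)                 # decoder maps the latent back to pixel space
```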
Before introducing ControlNet, it is useful to recall that modern diffusion models can already be steered by natural-language prompts.
The mechanism was formalised in the OpenAI paper “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models” (Nichol et al., 2022). The authors demonstrate that if you pass a prompt through a text encoder—initially CLIP ViT-L/14—and concatenate the resulting embedding to the U-Net’s latent or cross-attention layers, the denoising process learns to minimise noise while satisfying the text condition.
Two guidance strategies proved especially effective:
| Strategy | Idea |
|---|---|
| CLIP Guidance | During sampling, use the CLIP image encoder to rank intermediate images by semantic similarity to the prompt and nudge the diffusion trajectory towards higher-ranked samples. |
| Classifier-Free Guidance (CFG) | Train the model with and without a prompt (empty string) and, at inference, interpolate between the two predictions to trade diversity for fidelity. |
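The CFG row of the table boils down to one line of arithmetic: extrapolate from the unconditional noise prediction towards the conditional one.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: blend the unconditional and
    prompt-conditioned noise predictions.

    guidance_scale = 1 reproduces the conditional prediction;
    larger values trade sample diversity for prompt fidelity.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With a guidance scale around 7-8 (a common Stable Diffusion default), the conditional direction is strongly amplified.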
This text-conditioning recipe underpins Stable Diffusion:
- v1.x checkpoints inherit the pre-trained OpenAI CLIP encoder used in GLIDE and DALL·E 2.
- v2.x replaces it with OpenCLIP—a from-scratch replication trained on the LAION-2B dataset—which improves prompt adherence and removes the need for proprietary weights.
In short, a prompt such as "Photo of a bronze bust, polished, museum lighting" is converted into a CLIP embedding that conditions every denoising step, yielding images that match the described content even before any additional control mechanisms (e.g., ControlNet) are applied.
ControlNet adds an extra, trainable branch to a frozen text-guided Latent Diffusion Model so that generation can be steered by pixel-aligned inputs such as edge maps, depth, pose, or segmentation. A duplicate of the U-Net encoder–decoder receives the condition map (c); its layers are connected to the frozen backbone through zero-initialised 1 × 1 convolutions, which output zeros at the start of training and therefore leave the base model’s behaviour untouched. During fine-tuning, these “ZeroConvs” gradually learn a residual that injects just enough spatial information to satisfy (c), allowing robust training even on datasets as small as 50 k pairs and preventing catastrophic drift. Official checkpoints cover Canny edges, depth, OpenPose skeletons, normal maps, and more, and the ControlNet 1.1 release adds “guess-mode” and cached feature variants for ~45 % faster inference.
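The zero-convolution trick can be demonstrated with a toy NumPy sketch: a 1×1 convolution with all-zero weights contributes nothing at the start of training, so the frozen backbone's output is initially unchanged.

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution initialised to all zeros (ControlNet's 'ZeroConv').

    On a (C, H, W) feature map a 1x1 conv is just a channel-mixing matrix.
    With zero weights its output is zero, so adding it as a residual leaves
    the frozen backbone untouched at the start of fine-tuning.
    """
    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))

    def __call__(self, feat):  # feat: (C, H, W)
        return np.einsum("oc,chw->ohw", self.weight, feat)

def controlled_block(backbone_feat, control_feat, zero_conv):
    # Frozen backbone output plus the (initially zero) control residual.
    return backbone_feat + zero_conv(control_feat)
```

As the weights move away from zero during fine-tuning, the control branch injects progressively more spatial information.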
Role in this repo: we use ControlNet (edge or depth) to lock geometry across the chosen 360° renders, which are then transformed by Stable Diffusion into new images according to the prompt. This happens before EbSynth propagates the style and Gaussian Splatting rebuilds the 3D model, ensuring both structural fidelity and temporal coherence.
EbSynth is a patch-based, example-guided algorithm that propagates an artist-painted key frame across the remaining frames of a video while preserving both local texture details and global temporal coherence. The method was first presented as “Stylizing Video by Example” at SIGGRAPH 2019 and builds on earlier work such as StyLit (SIGGRAPH 2016) and the PatchMatch family of nearest-neighbour algorithms.
- Key-frame stylisation – The user paints or edits one (or more) reference frames with any 2D tool.
- Guidance map computation – Dense correspondences between the key frame(s) and each target frame are estimated (usually with optical flow).
- PatchMatch transfer – For every patch in the target frame, EbSynth finds the best-matching patch in the key frame and copies its pixels; a confidence map weights the blending.
- Edge-aware blending & refinement – Overlaps are resolved with guided filtering; an optional temporal pass enforces consistency over successive frames.
Because the algorithm works in image space, it inherits the exact style of the painted key frame, including brush strokes and high-frequency detail, which neural style transfer often washes out. The process is fast (real-time or faster per frame) and runs entirely on the GPU.
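A greatly simplified sketch of the patch-transfer step: for every patch of a target frame, find the most similar patch of the key frame and copy the corresponding stylised pixels. Real EbSynth uses the PatchMatch search with guided blending; a brute-force SSD search stands in for it here.

```python
import numpy as np

def transfer_style(key_guide, key_styled, target_guide, patch=4):
    """Copy stylised pixels patch-by-patch (brute-force stand-in for
    EbSynth's PatchMatch search). All images are 2-D grayscale arrays
    whose sides are multiples of `patch`.
    """
    h, w = target_guide.shape
    out = np.zeros((h, w), dtype=key_styled.dtype)
    # Pre-cut the key frame's guide image into patches once.
    keys = [
        (y, x, key_guide[y:y+patch, x:x+patch])
        for y in range(0, key_guide.shape[0] - patch + 1, patch)
        for x in range(0, key_guide.shape[1] - patch + 1, patch)
    ]
    for ty in range(0, h, patch):
        for tx in range(0, w, patch):
            tpatch = target_guide[ty:ty+patch, tx:tx+patch]
            # Best match = smallest sum of squared differences.
            ky, kx, _ = min(keys, key=lambda k: ((k[2] - tpatch) ** 2).sum())
            out[ty:ty+patch, tx:tx+patch] = key_styled[ky:ky+patch, kx:kx+patch]
    return out
```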
| Step | Recommendation |
|---|---|
| Key-frame count | 9–16 well-chosen views usually suffice for a 360° turntable; add more only when topology changes drastically. |
| Resolution | Keep the rendered frames and painted key frames at the same native resolution to avoid resampling artefacts. |
| Integration with A1111 | The community extension CiaraStrawberry/TemporalKit can automate the call from Stable Diffusion to EbSynth if you prefer a single-click workflow. |
After Stable Diffusion + ControlNet generates high-quality but per-key-frame stylised renders, EbSynth sweeps through the sequence and harmonises consistency of transformed frames across time. This step is crucial before we pass the imagery to Gaussian Splatting, because temporal consistency directly improves the quality of the reconstructed 3D point-cloud.
Another 3D model we used was expertly created by Author here.
We would like to express our gratitude to the authors of Gaussian Splatting and the NeRF model, along with the PyTorch implementation of the latter. We used Gaussian Splatting and NeRF to achieve the 3D rendering results.
@misc{mildenhall2020nerf,
title={NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis},
author={Ben Mildenhall and Pratul P. Srinivasan and Matthew Tancik and Jonathan T. Barron and Ravi Ramamoorthi and Ren Ng},
year={2020},
eprint={2003.08934},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
@misc{lin2020nerfpytorch,
title={NeRF-pytorch},
author={Yen-Chen, Lin},
publisher = {GitHub},
journal = {GitHub repository},
howpublished={\url{https://github.com/yenchenlin/nerf-pytorch/}},
year={2020}
}
@Article{kerbl3Dgaussians,
author = {Kerbl, Bernhard and Kopanas, Georgios and Leimk{\"u}hler, Thomas and Drettakis, George},
title = {3D Gaussian Splatting for Real-Time Radiance Field Rendering},
journal = {ACM Transactions on Graphics},
number = {4},
volume = {42},
month = {July},
year = {2023},
url = {https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/}
}
Big thanks to the authors of the 4D Facial Avatars model. We used it to obtain 4D rendering results.
@InProceedings{Gafni_2021_CVPR,
author = {Gafni, Guy and Thies, Justus and Zollh{\"o}fer, Michael and Nie{\ss}ner, Matthias},
title = {Dynamic Neural Radiance Fields for Monocular 4D Facial Avatar Reconstruction},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021},
pages = {8649-8658}
}
Last but not least, we hereby cite the 2D deepfake GHOST work that was used in our original pipeline.
@article{9851423,
author={Groshev, Alexander and Maltseva, Anastasia and Chesakov, Daniil and Kuznetsov, Andrey and Dimitrov, Denis},
journal={IEEE Access},
title={GHOST—A New Face Swap Approach for Image and Video Domains},
year={2022},
volume={10},
number={},
pages={83452-83462},
doi={10.1109/ACCESS.2022.3196668}
}
Thanks also to the authors of Stable Diffusion, ControlNet, and EbSynth for their incredible tools that made this work possible.
@misc{rombach2022highresolutionimagesynthesislatent,
title={High-Resolution Image Synthesis with Latent Diffusion Models},
author={Robin Rombach and Andreas Blattmann and Dominik Lorenz and Patrick Esser and Björn Ommer},
year={2022},
eprint={2112.10752},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2112.10752},
}
@misc{zhang2023addingconditionalcontroltexttoimage,
title={Adding Conditional Control to Text-to-Image Diffusion Models},
author={Lvmin Zhang and Anyi Rao and Maneesh Agrawala},
year={2023},
eprint={2302.05543},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2302.05543},
}
@misc{Jamriska2018,
author = {Jamriska, Ondrej},
title = {Ebsynth: Fast Example-based Image Synthesis and Style Transfer},
year = {2018},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/jamriska/ebsynth}},
}