Generative adversarial networks (GANs) are commonly used to create deep fakes: manipulated videos or images that look real. Deep fakes and generative machine learning are likely to become a growing threat to society, as they can be used to spread disinformation and cause harm, for example by manipulating public opinion, impersonating individuals, or committing fraud. Understanding how they work is the best way to fight them. We cover:
How we can make celebrities say whatever we want
How to learn to dance like a pro in 5 minutes: Everybody Dance Now!
Learn more about Deep Fakes
How we can make celebrities say whatever we want
Deep Fakes caught the public's attention a few years ago and are typically powered by Generative Adversarial Networks (GANs). GANs might be the best example of what makes deep learning special compared to traditional machine learning: the goal is not to predict or optimize, but to generate new data that can fool humans! Diffusion models and generative language models have taken over the media spotlight these days, but Deep Fakes are still trending. By the way, if you need a refresher on what GANs are and how they work:
When it comes to Deep Fakes, there are two major paradigms:
Replacement, where, for example, you move your face onto another body.
Reenactment, where, for example, someone's expression drives another person's facial expression.
ReenactGAN is an excellent example of a process where anyone can control someone else's expression. Here is a YouTube video showing how it looks:
Here is the process for training such a model:
We have videos of multiple people and extract their facial features (boundaries) using basic computer vision techniques (see the landmark-extraction sketch after this list).
We use one encoder to encode everybody's faces into the latent space and person-specific decoders to regenerate those faces. The latent space is constrained to be as similar as possible to the boundary data, so that the encoder learns to generate boundary data from images and the decoders learn to generate realistic images from boundary data. The encoder-decoders are trained end-to-end as a generator in an adversarial manner against a discriminator (a Pix2Pix-type GAN: “Image-to-Image Translation with Conditional Adversarial Networks”). The discriminator's job is to tell real images apart from regenerated ones, while the generator's job is to fool the discriminator by generating images that look as real as possible (a simplified training-loop sketch follows this list).
Because one person's facial features differ from another's, we need to learn how to transform one person's boundary data into another's. Again, they train a boundary transformer as a generator in an adversarial manner against a discriminator. The discriminator's job is to tell real boundary data apart from transformed boundary data, while the generator's job is to generate boundary data as close as possible to the target person's boundary data. Because there are no paired examples in this case, they use a CycleGAN (“Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks”) to learn to project from one domain to the other and back to the original domain (see the cycle-consistency sketch below).
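To make the first step concrete, here is a minimal sketch of extracting facial boundary points from a video. It assumes OpenCV and dlib with the standard 68-point landmark model; the model file path, helper function, and frame-sampling rate are illustrative choices, not ReenactGAN's actual boundary estimator.

```python
import cv2
import dlib
import numpy as np

# Standard dlib face detector + 68-point landmark model (file path is a placeholder).
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_boundaries(video_path, every_n_frames=5):
    """Return one (68, 2) array of landmark coordinates per detected face in sampled frames."""
    boundaries = []
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for face in detector(gray):
                shape = predictor(gray, face)
                points = np.array([(p.x, p.y) for p in shape.parts()])
                boundaries.append(points)
        frame_idx += 1
    cap.release()
    return boundaries

# Example usage (path is hypothetical):
# landmarks = extract_boundaries("videos/person_a.mp4")
```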
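The second step, the adversarial encoder-decoder training, can be sketched roughly as follows. The tiny networks, loss weights, and heatmap channel count are placeholder assumptions; the real ReenactGAN architectures and objectives are more involved.

```python
import torch
import torch.nn as nn

# Tiny stand-in networks; the real encoder/decoders/discriminator are much larger.
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 15, 3, padding=1))            # image -> boundary-like heatmaps
decoder = nn.Sequential(nn.Conv2d(15, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())  # heatmaps -> image
discriminator = nn.Sequential(nn.Conv2d(3, 16, 4, 2, 1), nn.LeakyReLU(0.2),
                              nn.Conv2d(16, 1, 4, 2, 1))            # PatchGAN-style real/fake map

adv_loss = nn.BCEWithLogitsLoss()  # real-vs-fake objective
l1_loss = nn.L1Loss()              # boundary / pixel consistency

g_opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def training_step(images, boundaries):
    # Generator pass: encode the image into a boundary-like latent, decode it back to an image.
    latent = encoder(images)
    fakes = decoder(latent)

    # Generator loss: fool the discriminator, keep the latent close to the boundary data,
    # and keep the regenerated image close to the original (weights are illustrative).
    pred_fake = discriminator(fakes)
    g_loss = (adv_loss(pred_fake, torch.ones_like(pred_fake))
              + 10.0 * l1_loss(latent, boundaries)
              + 10.0 * l1_loss(fakes, images))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    # Discriminator loss: real images -> 1, regenerated images -> 0.
    pred_real = discriminator(images)
    pred_fake = discriminator(fakes.detach())
    d_loss = 0.5 * (adv_loss(pred_real, torch.ones_like(pred_real))
                    + adv_loss(pred_fake, torch.zeros_like(pred_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    return g_loss.item(), d_loss.item()

# Example with random tensors standing in for face crops and boundary heatmaps:
images = torch.randn(4, 3, 64, 64)
boundaries = torch.randn(4, 15, 64, 64)
print(training_step(images, boundaries))
```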
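For the last step, here is a sketch of the cycle-consistency idea behind the boundary transformer: two mapping networks are trained so that going from person A's boundaries to person B's and back recovers the original. The adversarial terms are omitted for brevity, and the networks and weights are again illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

# G_ab maps person A's boundary heatmaps to person B's, G_ba does the reverse (tiny stand-ins).
def make_transformer(channels=15):
    return nn.Sequential(nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(32, channels, 3, padding=1))

G_ab, G_ba = make_transformer(), make_transformer()
l1 = nn.L1Loss()

def cycle_consistency_loss(boundaries_a, boundaries_b, lam=10.0):
    """CycleGAN-style cycle term: A -> B -> A and B -> A -> B should reconstruct the originals."""
    recon_a = G_ba(G_ab(boundaries_a))
    recon_b = G_ab(G_ba(boundaries_b))
    return lam * (l1(recon_a, boundaries_a) + l1(recon_b, boundaries_b))

# Example with random tensors standing in for the two people's boundary heatmaps:
a, b = torch.randn(4, 15, 64, 64), torch.randn(4, 15, 64, 64)
print(cycle_consistency_loss(a, b).item())
```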