
Use Case: DeepFake Origin Detection

8 min read
Yury Zhauniarovich

Image by [Matt Groh](https://www.media.mit.edu/projects/detect-fakes/overview/) from [MIT Media Lab](https://www.media.mit.edu/)

In the last several years, we have constantly heard that we are close to living in a post-truth era. Every TV set speaks about fake news, confirmation bias and, more recently, about deepfakes: videos processed with deep learning techniques where the face of one person is replaced with another's.

If several years ago we were just laughing at the first deepfake attempts, now the technology has matured, and we have to find a way to live in this new reality.

Introduction

We, human beings, have already learned that we should not trust text and image sources (because they can be modified quite easily); however, we still tend to trust video. We should not anymore, because of the development of deepfake technology. While the first deepfake videos, born in research labs several years ago, could easily be detected even by a non-technical person, the more recent examples look so natural that even a professional cannot tell that a video has been manipulated.

Indeed, the technology develops very rapidly. Every year, hundreds of scientific papers are published on this topic, achieving more and more realistic results. Moreover, the adoption of this research also blows my mind. Projects such as DeepFaceLab, actively supported by the community and incorporating the most recent state-of-the-art techniques, now allow anyone to create a deepfake video clip on almost commodity hardware: you just need a good dataset of images of the person being faked and a powerful GPU.

Not surprisingly, detecting deepfakes has become a paramount task. For instance, the Kaggle competition to detect facial or voice manipulation held a year ago offered the platform's highest prize of 1 million US dollars. Interestingly, none of the 2,114 teams was able to reach 70% accuracy on an unseen validation set. This shows that the problem of deepfake detection is acute and not yet solved.

At Vertx, we do not try to address the problem of deepfake detection itself. However, we have found that our algorithm allows us to attack an accompanying issue, namely deepfake source detection: if the original video clip has already been indexed by our algorithm, our system will be able to find it given a deepfake sample. This is possible because our algorithm relies on motion detection to identify copies. Deepfake algorithms substitute faces, but the movements in the video remain the same.
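To illustrate why motion survives a face swap, here is a toy sketch (emphatically not the Vertx algorithm, just an illustration using OpenCV; the file names in the comments are hypothetical): it reduces every frame to a tiny grayscale image and records the mean absolute difference between consecutive frames. Because a face swap repaints only a small region of each frame, this global motion signal of a deepfake stays very close to that of the original.

```python
# Toy illustration only: NOT the Vertx algorithm. It shows why a
# motion-based signature survives a face swap: the per-frame global
# motion energy barely changes when only the face region is repainted.
import cv2
import numpy as np

def motion_signature(video_path, size=(32, 32)):
    """Mean absolute difference between consecutive downscaled grayscale frames."""
    cap = cv2.VideoCapture(video_path)
    signature, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        tiny = cv2.resize(gray, size).astype(np.float32)
        if prev is not None:
            signature.append(float(np.abs(tiny - prev).mean()))
        prev = tiny
    cap.release()
    return np.array(signature)

# Signatures of an original clip and its deepfake should correlate
# strongly, because the face swap alters only a small image region:
# sig_orig = motion_signature("original.mp4")   # hypothetical files
# sig_fake = motion_signature("deepfake.mp4")
```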

To show how our system can be used to find the origins of deepfake videos, I downloaded several popular deepfake clips and analyzed them using our platform. I chose the most popular videos from the users mentioned on the DeepFaceLab GitHub page.

note

Because some of the deepfake video clips considered here are assembled from short pieces of the original movies, we reduced the minimum match time from the default 15 seconds to 5 (you can do this via the command line client parameters).

DeepFake Examples

Joe Biden in World War Z

The first video I chose to analyze is one where Joe Biden's face replaces a zombie's in World War Z. It seems quite simple to analyze because, at first glance, it consists of one continuous piece of the movie processed with DeepFaceLab. You can watch the video sample below:

As a result of the analysis, our algorithm produced the following JSON document:

Vertx Search Results

```json
[
  {
    "matches": [
      {
        "metadata": {
          "album": null,
          "artist": null,
          "bucket": null,
          "cover_url": null,
          "imdb_id": 816711,
          "label": null,
          "title": "World War Z",
          "type": "movie",
          "uid": "6144699547833740691",
          "year": 2013
        },
        "segments": [
          { "duration": 22.4375, "que_offset": 22.6875, "ref_offset": 6382.3125 },
          { "duration": 6.5, "que_offset": 48.625, "ref_offset": 6407.9375 },
          { "duration": 7.0, "que_offset": 65.125, "ref_offset": 6450.0 },
          { "duration": 16.6875, "que_offset": 72.0, "ref_offset": 6479.6875 },
          { "duration": 10.375, "que_offset": 89.4375, "ref_offset": 6506.0625 },
          { "duration": 13.8125, "que_offset": 114.875, "ref_offset": 6550.125 },
          { "duration": 5.6875, "que_offset": 128.0, "ref_offset": 6586.4375 },
          { "duration": 5.6875, "que_offset": 145.75, "ref_offset": 6642.8125 }
        ],
        "uid": 6144699547833740691
      },
      {
        "metadata": {
          "album": null,
          "artist": null,
          "bucket": null,
          "cover_url": null,
          "imdb_id": 58700,
          "label": null,
          "title": "The Last Man on Earth",
          "type": "movie",
          "uid": "5890475500996554810",
          "year": 1964
        },
        "segments": [
          { "duration": 7.625, "que_offset": 3.0, "ref_offset": 2171.375 }
        ],
        "uid": 5890475500996554810
      }
    ],
    "media_type": "audio",
    "reason": null,
    "source_path": "WWZ.mkv",
    "source_uid": "15247207028104530174",
    "status": "succeeded"
  },
  {
    "matches": [
      {
        "metadata": {
          "album": null,
          "artist": null,
          "bucket": null,
          "cover_url": null,
          "imdb_id": 816711,
          "label": null,
          "title": "World War Z",
          "type": "movie",
          "uid": "6144699547833740691",
          "year": 2013
        },
        "segments": [
          { "duration": 35.3125, "que_offset": 22.9375, "ref_offset": 6382.625 },
          { "duration": 9.375, "que_offset": 62.8125, "ref_offset": 6447.8125 },
          { "duration": 17.0, "que_offset": 71.4375, "ref_offset": 6479.3125 },
          { "duration": 11.625, "que_offset": 88.4375, "ref_offset": 6505.3125 },
          { "duration": 12.75, "que_offset": 115.375, "ref_offset": 6550.4375 },
          { "duration": 6.0, "que_offset": 128.5, "ref_offset": 6587.0625 },
          { "duration": 6.625, "que_offset": 133.6875, "ref_offset": 6597.9375 }
        ],
        "uid": 6144699547833740691
      }
    ],
    "media_type": "video",
    "reason": null,
    "source_path": "WWZ.mkv",
    "source_uid": "15247207028104530174",
    "status": "succeeded"
  }
]
```

Unfortunately, this document is not easy for a human to analyze. To facilitate the analysis, we use a timeline representation of the data: a graph where the x axis represents the time in the query sample (in our case, the deepfake video clip), while on the y axis we list the titles of all the matches (we call them reference videos) we have managed to find. Each bar represents a match. By analyzing such a graph, it is possible to see which parts of the query sample have been found in the indexed data. Because our algorithm analyzes video and audio separately, for each query video we produce a pair of timeline graphs: one for the video modality and one for the audio. If you hover the mouse pointer over a bar, you will see the details of the corresponding match: the start of the match in the query (deepfake clip) and in the reference (source/original) video, and the duration of the match.
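To give a concrete idea of how such a timeline can be produced from the JSON document above, here is a minimal sketch using matplotlib. This is not our production visualization; the file name results.json is an assumption for the illustration.

```python
# Minimal sketch (not the production Vertx visualization): render the
# JSON search results shown above as per-modality timeline charts.
import json
import matplotlib.pyplot as plt

with open("results.json") as f:      # hypothetical dump of the document above
    results = json.load(f)

for result in results:               # one entry per modality (audio, video)
    fig, ax = plt.subplots(figsize=(10, 2 + len(result["matches"])))
    titles = []
    for row, match in enumerate(result["matches"]):
        titles.append(match["metadata"]["title"])
        # Each segment becomes a bar: x = offset in the query (deepfake)
        # clip, width = duration of the match, both in seconds.
        spans = [(s["que_offset"], s["duration"]) for s in match["segments"]]
        ax.broken_barh(spans, (row - 0.3, 0.6))
    ax.set_yticks(range(len(titles)))
    ax.set_yticklabels(titles)
    ax.set_xlabel("query time, s")
    ax.set_title(f"{result['source_path']} ({result['media_type']})")
    plt.tight_layout()
    plt.show()
```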

The graph below shows the results of the video track analysis. As you can see, our algorithm managed to detect that almost the whole deepfake is taken from the movie "World War Z". However, the results also show that the deepfake clip is not continuous: it consists of several pieces so well assembled that you would think it is one undivided part of the movie.

[Timeline chart: video-track matches between the deepfake clip and "World War Z"]

The next chart shows the matches our algorithm managed to find in the audio modality. As you can see, the audio is taken from two different sources: a short part at the beginning is from "The Last Man on Earth", while the rest is taken from "World War Z".

[Timeline chart: audio-track matches, a short piece from "The Last Man on Earth" followed by "World War Z"]

If you consider both the video and audio charts, you will notice that our system cannot detect the originals in the period from the 100th to the 115th second. This part seems to be a false negative: the algorithm cannot recognize the original. Perhaps there are several small pieces of the originals glued together, or the movements in the picture are negligible.

John Travolta is Forrest Gump

Even though I have watched the following deepfake video of John Travolta playing Forrest Gump many times, in my opinion Tom Hanks fits this role better. However, the deepfake is of very high quality, and it indeed shows how John Travolta could look in this role:

Although this video is assembled from several short pieces of the original movie, our algorithm is still able to find the original. The following chart for the video modality confirms this:

[Timeline chart: video-track matches between the deepfake clip and "Forrest Gump"]

As you can see, the algorithm finds many pieces of the original movie. However, it also fails in some cases: in the ranges of 40-65, 73-82, 89-102 and 137-145 seconds. I am not 100% sure, but it seems the algorithm fails during these periods because several short pieces (less than 5 seconds each) are glued together there.
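Such gap ranges can be read directly off the JSON document. A small hypothetical helper like the one below (the function name is my own; the 5-second threshold mirrors the minimum match duration we set earlier) computes the uncovered ranges of the query clip from the reported segments:

```python
# Hypothetical helper (not part of the Vertx client): compute the time
# ranges of the query clip that no matched segment covers.
def uncovered_ranges(match, clip_length, min_gap=5.0):
    # Sort covered intervals by their start position in the query clip.
    covered = sorted(
        (s["que_offset"], s["que_offset"] + s["duration"])
        for s in match["segments"]
    )
    gaps, cursor = [], 0.0
    for start, end in covered:
        if start - cursor >= min_gap:
            gaps.append((cursor, start))    # a hole before this segment
        cursor = max(cursor, end)
    if clip_length - cursor >= min_gap:
        gaps.append((cursor, clip_length))  # a hole after the last segment
    return gaps
```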

Interestingly, at 156-166 seconds the algorithm was able to detect the original movie, even though the picture is blurred and the channel's advertisement is overlaid in the foreground.

For the audio modality, we see a similar graph:

[Timeline chart: audio-track matches between the deepfake clip and "Forrest Gump"]

You can still see the gaps at 73-82 and 89-102 seconds, which makes me think the assumption that several pieces are glued together there is correct. However, for the 137-145 second period the algorithm managed to detect the original audio, so I conclude that this range is a false negative of our video matching algorithm. The remaining gap between the 13th and 40th seconds exists because the original soundtrack is drowned out by the narrator's voice-over.

Tom Cruise in Iron Man

Finally, let's consider a deepfake example where our algorithm does not perform well. Here is a video where deepfake technology has been used to show Tom Cruise playing Iron Man:

This video is difficult for our algorithm to analyze. Indeed, the deepfake consists of very short pieces glued together. Moreover, additional effects (like side-by-side pictures) have been added that make our algorithm fail. Still, it manages to find the original movie, but only a small part of it:

[Timeline chart: video-track matches between the deepfake clip and "Iron Man"]

The audio modality brought even more surprising results: there are four matches taken from three different movies. However, if you listen carefully to the soundtrack, you will hear background music playing all the time. Moreover, Robert Downey Jr.'s voice is replaced with Tom Cruise's. In my opinion, that is why the algorithm fails.

[Timeline chart: audio-track matches, four matches taken from three different movies]

Conclusion

As we have shown in this article, our algorithm can be useful for finding the sources of deepfake videos. Our platform can be a good accompanying solution for companies developing deepfake detection algorithms, allowing them not only to detect that a video has been manipulated but also to find the origin of the deepfake.