Artificial Intelligence

The latest impressive deepfake videos from Google's Veo 3: how are they made?

Videos generated with Google's new artificial intelligence model are spreading across social networks and look extremely realistic. How are they made?

29 May 2025

4 min read

In 2018, a video went viral in which former President Barack Obama called Donald Trump a 'complete idiot'. In reality, it was not Obama speaking: the video was a fabrication created by actor Jordan Peele and technicians at BuzzFeed.

The 2018 Obama deepfake video

In that case, the intent was educational: to show the power and risks of deepfakes, videos altered by artificial intelligence to the point of being indistinguishable from the real thing.

Today, just a few years later, that same technology has made great strides. Deepfakes are not only more widespread, they have also reached a level of realism that fools even the most attentive observers.

In recent days, social media have been flooded with videos generated with Veo 3, the new artificial intelligence model presented by Google during the annual Google I/O conference in Mountain View.

The clips produced with this tool are incredibly realistic.

Like this one, titled "The Sailor and the Sea", made by providing this prompt: "A medium shot frames an old sailor, whose blue woollen sailor's hat casts a shadow over his eyes, while a thick grey beard hides his chin. He holds his pipe in one hand, pointing with it to the rough grey sea beyond the ship's railing. 'This ocean is a force, a wild and untamed power. And it arouses your awe, with every light that breaks.'"

The video of the sailor and the sea created with Veo 3

Or like this video of street interviews, inspired by the 'Hawk Tuah girl' meme that went viral on social media.

Deepfake street interviews made with Google's AI

Vision according to machines

The heart of this revolution is called computer vision, a branch of artificial intelligence that teaches computers to 'see' and 'understand' video images.

This is explained by Kai-Fu Lee, former president of Google China, and the writer Chen Qiufan in 'AI 2041', a collection of ten stories set in the near future that describe how artificial intelligence will change the world.

Seeing, for an algorithm, is not just about recording images: it means interpreting them, understanding them, and acting accordingly. Computer vision breaks reality down into sequences of pixels, recognises objects, tracks them over time, and analyses their movements, their gestures, and even the relationships implicit in a scene.

These processes require enormous computing power. While it takes us a fraction of a second to see a scene, a neural network has to learn to do this from scratch, trained on millions of images. This is why convolutional neural networks were born, inspired by our visual cortex.

Structured in hierarchical layers, they first analyse lines and colours, then shapes, and finally complex objects. This mechanism makes it possible to recognise a face in a crowd, distinguish a zebra from a horse, or track the moving hands of a dancer.
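The first layer of that hierarchy, the one that picks out lines, can be illustrated with a minimal Python sketch. It is not taken from any real vision system: the toy image, the `convolve2d` helper and the hand-written edge filter are assumptions made purely for illustration. In an actual convolutional network the filter values are learned from training images rather than written by hand.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` and sum the element-wise products.

    Only positions where the kernel fits entirely inside the image are
    computed. As in most neural-network libraries, the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 5x6 greyscale image: dark on the left (0), bright on the right (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written filter that reacts to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)
```

Every row of the result is `[0, 3, 3, 0]`: the filter stays silent over the uniform areas and responds only where dark meets bright. Deeper layers apply the same operation to the outputs of earlier ones, which is how lines are combined into shapes and shapes into objects.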

"The researchers," the authors write, "were inspired by the human brain to improve deep learning. Our visual cortex uses many neurons corresponding to the many restricted sub-regions (known as receptive fields) within what our eyes see at any given time. These receptive fields identify basic features, such as shapes, lines, colours or angles. These detectors are connected to the neocortex, the outermost layer of our brain. The neocortex stores information hierarchically, processing the outputs of these receptive fields into a more complex understanding of the scene."

How to build a deepfake

The same technologies that allow cars to drive themselves or an iPhone to recognise a face are also used to create deepfakes. To make one, a video is divided into thousands of individual frames. In each one, the face, hands, eyes and mouth are identified. Then the face is replaced and the mouth is synchronised with fake audio. The result is a video in which a person appears to say or do something that never actually happened.

Underlying the most sophisticated deepfakes is a technology called a generative adversarial network (GAN), consisting of two competing neural networks: one generates content, the other evaluates it. It is an ongoing contest: the forger improves in order to fool the detector, which in turn refines itself to expose the deception. The process can be repeated millions of times, until the video produced is indistinguishable from a real one.
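The contest between the two networks can be sketched in a few lines of Python. This is a deliberately tiny toy, not how Veo 3 or any real deepfake tool works: instead of video frames, the 'real' data are numbers scattered around 4, the forger has a single adjustable value, and the detector is a one-neuron classifier. The names `Detector`, `Forger` and `train` and all the settings are hypothetical, chosen only for illustration.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


class Detector:
    """Scores a number: near 1 means 'looks real', near 0 'looks fake'."""

    def __init__(self):
        self.w, self.c = 0.0, 0.0

    def score(self, x):
        return sigmoid(self.w * x + self.c)

    def learn(self, real, fake, lr):
        # Gradient ascent on log D(real) + log(1 - D(fake)).
        p_real, p_fake = self.score(real), self.score(fake)
        self.w += lr * ((1 - p_real) * real - p_fake * fake)
        self.c += lr * ((1 - p_real) - p_fake)


class Forger:
    """Produces fakes as offset + noise; `offset` is its only parameter."""

    def __init__(self):
        self.offset = 0.0

    def forge(self, noise):
        return self.offset + noise

    def learn(self, fake, detector, lr):
        # Gradient ascent on log D(fake): shift the offset in the
        # direction that raises the detector's score for the fake.
        self.offset += lr * (1 - detector.score(fake)) * detector.w


def train(steps=200, lr=0.05, real_mean=4.0, seed=0):
    rng = random.Random(seed)
    detector, forger = Detector(), Forger()
    history = []
    for _ in range(steps):
        real = real_mean + rng.gauss(0, 0.5)
        fake = forger.forge(rng.gauss(0, 0.5))
        detector.learn(real, fake, lr)    # the detector refines itself
        forger.learn(fake, detector, lr)  # the forger adapts to it
        history.append(forger.offset)
    return detector, forger, history


detector, forger, history = train()
print("forger offset after training:", round(history[-1], 2))
```

Each round, the detector first learns to separate the real sample from the fake, and the forger then nudges its output towards whatever the detector currently scores as real. In a toy this simple the forger's value may swing back and forth around the real one rather than settle on it, a known difficulty of adversarial training that real systems handle with far larger networks and additional stabilising techniques.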

"In the long run," reads 'AI 2041', "the main problem is that the GAN has internal mechanisms that 'update' the forger network. Suppose you have trained a GAN's forger network, and then someone comes along with a new detector algorithm that recognises your deepfake. At this point, you simply retrain your forger network with the aim of fooling the detector algorithm. The result is an arms race to see which side trains a better model on a more powerful computer."

Applications

Computer vision has now fully permeated our reality. It can be found, for example, in Amazon Go supermarkets, where video cameras recognise the products placed in the trolley, in cars that monitor the driver's attention, and in airport facial recognition systems. It is also used in medicine to analyse X-rays, in content moderation on social media, and in the military to distinguish civilians from combatants. Or, as in this video created with Veo 3, to create entire films using technology alone, without the need to employ flesh-and-blood actors.

Hyper-realistic film scenes made with Google's AI

Today, there are systems capable of detecting deepfakes, such as those developed by Facebook and Google. But they are expensive, slow and ineffective on a large scale. The paradox is obvious: the same GAN used to generate fakes can be retrained to bypass the detectors. It is a digital arms race in which whoever has the most data, the most time and the most computing power wins.