How to protect your business from deepfakes
Is your favourite singer giving you a set of luxury pots and pans as a gift? In an ad, Taylor Swift tells you in her own voice that you only have to pay the delivery costs. Of course, the set you were going to use to prepare your MasterChef application while dancing to ‘It’s over now’ never arrived. This fraud cost each victim about 10 euros. In other cases, such as that of an employee at a Hong Kong company, the fraudsters made rather more: 23 million euros.
Before you conclude that the employee was naïve, consider how devious the criminals’ plan was. First, the employee received an email from his CFO but, distrusting the order it contained, he proposed a phone call. Over the course of a week, several video calls took place, not only with the CFO but also with other managers in the company. However, no one was who they appeared to be: the criminals used voice and video deepfakes to recreate the executives.
But there is more: voice deepfakes used to contract services in your name that you never requested, calls from supposedly kidnapped relatives, company executives badmouthing their own company’s product… The creativity of criminals using video, images or audio created with artificial intelligence seems to know no bounds.
One of the most frequent scams of this type is the so-called ‘CEO scam’, and it does not need to be carried out by video call as in the Hong Kong case mentioned above. A telephone call is enough: an employee receives a call from what sounds like his or her boss, ordering a transaction to be carried out. In another variant, the caller poses as an investor or a regular customer of a bank, calls the branch and orders certain transactions or transfers. Because of the pre-existing relationship, they are executed. Both scenarios have three things in common: a bank, a huge amount of money and a deepfake, in these cases a deepvoice.
Types of deepfake
According to Incibe, the Spanish Cybersecurity Institute, deepfake is a technique that allows the face of one person to be superimposed on another person’s face in a video, adding their voice and gestures to make them look like those of the person being impersonated. The name comes from Deep Learning, a field of Artificial Intelligence that uses artificial neural networks to mimic the learning process of the human brain. There are two types of deepfakes depending on the type of multimedia content they generate:
- Deepvoice: fragments of the victim’s voice are spliced together, and the victim’s voice is replicated to say another message. This is how you get the voice of a CEO asking for a transfer or a family member asking for help with a ransom.
- Deepface: fragments of multimedia content featuring the victim are spliced together and the victim’s face and gestures are impersonated. This is how videos were created, such as the ad where Lola Flores was brought back to life, but also the video call with the fake board of directors that made a haul of 23 million euros, the deepface in which a CEO talks badly about his company’s product or a famous person in a compromising situation.
There are four types of deepface:
- FaceSwapping: one source face replaces another in the final image.
- Facial expression changes: replacement of an expression in the target video with another expression obtained from another video, to achieve gestures or expressions that can, for example, match lip movements and expressions to those needed to include a different speech than the real one in the video. It can also be applied to images, i.e. creating a video from an image.
- Synthetic generation: identities created from scratch using a generative adversarial network or diffusion model.
- Morphing: combining similar-looking faces to produce an identity that contains the characteristics of the sources.
That’s not happening to me
When news of a deepfake scam is reported, it is easy to think that the victims fell for it because they did not pay attention to the right signals, or that they were not cautious enough. However, the human eye is not infallible, let alone the ear.
According to a study by the University of Texas, at a glance we are only able to detect 50% of fake images when it comes to AI-generated photographs. In the case of deepvoice, another study that tested more than 500 people found that one in four times a person was unable to identify a deepvoice. Even when half of the group received training beforehand, the result only improved by 3%.
Detection solutions needed
As cybercriminals devise ever more ways to make fraudulent use of technological advances, the need for technologies to detect these crimes is growing. As we have seen, detecting a deepfake-generated video or image with the naked eye is not easy. In the case of deepvoice, it is even more difficult. Researchers at University College London found that participants in their study mentioned the same characteristics about the voices they heard, regardless of whether the voices were fake or not; they attributed this to subjectivity. Therefore, to prevent fraud using deepfakes, it is imperative to continue developing technological tools capable of detecting them.
The main approach to combating deepfakes relies on artificial intelligence technology itself. However, the constant evolution of deepfake generation techniques means that current detection tools quickly become obsolete. Effective solutions therefore need to incorporate mechanisms for continuous improvement and updating.
Gradiant and Councilbox solution
In GICTEL, a project co-funded by GAIN and carried out in collaboration between Gradiant and Councilbox, we have developed a hybrid solution for detecting video/image and voice deepfakes. It shows great generalisation capabilities thanks to novel data selection, curation and augmentation techniques applied in the training phase, and to a multimodal analysis that fuses the predictions of the detection systems on audio, image, and video. The AI models we develop at GICTEL can focus on details that go unnoticed by the human eye and ear. During the training phase, they learn these peculiarities of the data so that they can correctly classify a video, image or audio as real or fake.
The case we presented earlier, the Hong Kong company executives’ video call fraud, could have been avoided if Councilbox’s products had been used. These solutions include video conferencing as a fundamental part of decision-making processes in boards, assemblies, and other corporate meetings, as well as for carrying out procedures with public bodies and private companies. In GICTEL we merge visual and auditory detection to have a double verification of the veracity of the data, a peculiarity that distinguishes this system.
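To make the idea of fusing per-modality predictions more concrete, here is a minimal sketch of one common approach, a weighted average of the scores produced by separate audio, image and video detectors (often called late fusion). It is only an illustration under stated assumptions and not the GICTEL implementation: the function name `fuse_scores`, the equal weights and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical equal weights per modality; not taken from the GICTEL system.
DEFAULT_WEIGHTS: Dict[str, float] = {"audio": 1.0, "image": 1.0, "video": 1.0}


def fuse_scores(
    scores: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[float, bool]:
    """Combine per-modality fake probabilities with a weighted average.

    `scores` maps a modality name to the probability (0.0 to 1.0) that the
    content is fake, or to None when that modality is not available.
    Returns the fused probability and whether it reaches the threshold.
    """
    weights = weights or DEFAULT_WEIGHTS

    # Keep only the modalities for which a detector produced a score.
    available = {m: s for m, s in scores.items() if s is not None}
    if not available:
        raise ValueError("at least one modality score is required")

    for modality, score in available.items():
        if modality not in weights:
            raise ValueError(f"no weight defined for modality: {modality}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {modality} must be within [0, 1]")

    total_weight = sum(weights[m] for m in available)
    fused = sum(weights[m] * s for m, s in available.items()) / total_weight
    return fused, fused >= threshold


if __name__ == "__main__":
    # Example: the audio detector is suspicious, the video detector less so.
    fused, is_fake = fuse_scores({"audio": 0.75, "image": None, "video": 0.25})
    print(f"fused score: {fused:.2f}, flagged as fake: {is_fake}")
```

A weighted average is only one of several possible fusion strategies. A real system could just as well train a separate model on top of the individual detector outputs, or flag the content whenever any single modality exceeds its own threshold.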

