DeepL Voice-to-Voice and Translator: voice is the new heart of linguistic AI

by Gianni Rusconi

22 April 2026

4 min read

Translated by AI

A more than significant acceleration: that is how various experts describe the path machine translation systems have taken in recent years, an acceleration driven (obviously) by the evolution of artificial intelligence models and (less predictably) by ever closer integration into mainstream digital work environments.

From consumer services to enterprise platforms, tools from Google, Microsoft and other tech companies have progressively shifted the centre of gravity from simple text translation to more advanced forms of interaction, including voice, context and real-time collaboration.

According to various market analyses, AI applied to natural language is among the fastest-growing segments, driven by companies' need to operate on a global scale without linguistic friction and by a growing perception that translation is turning from an ancillary service into an infrastructural element of digital processes, with direct impacts on productivity and decision-making speed.

Low-latency real-time translation

It is in this context that DeepL, the German start-up that rose to prominence a few years ago thanks to its translation tool of the same name, takes its next step forward. Its latest announcement is Voice-to-Voice, a suite that tackles one of the most technologically complex areas of linguistic AI: real-time spoken communication, and more precisely the 'end-to-end' process that combines speech recognition (speech-to-text), neural translation and speech synthesis (text-to-speech) in a continuous, low-latency flow.

The central issue on which several solutions have run aground in the past is latency: to make a multilingual conversation feel natural, a translation system must be able to capture speech, transcribe it, translate it and return it in vocal form within a few seconds, while maintaining semantic coherence and fluency.

Hence the need for models optimised not only for accuracy but also for inference speed (the time it takes an already trained AI model to analyse new data and produce a result) and for the ability to handle unstructured input, e.g. rapid speech, different accents or background noise.
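To make the latency problem concrete, the chain described above can be sketched as three stages run in sequence, each one timed. This is only an illustrative sketch, not DeepL's implementation: the functions `recognize`, `translate`, `synthesize` and `run_pipeline` are hypothetical stand-ins, and the "audio" is simply text encoded as bytes.

```python
import time
from typing import Dict, Tuple


# Hypothetical stand-ins for the three stages. A real system would call
# speech recognition, neural translation and speech synthesis models here.
def recognize(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def translate(text: str) -> str:
    toy_lexicon = {"hello": "ciao", "world": "mondo"}  # toy English-Italian table
    return " ".join(toy_lexicon.get(word, word) for word in text.lower().split())


def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend these bytes are audio


def run_pipeline(audio: bytes) -> Tuple[bytes, Dict[str, float]]:
    """Run the stages in order and record the seconds spent in each one."""
    timings: Dict[str, float] = {}
    value = audio
    for name, stage in (("stt", recognize), ("mt", translate), ("tts", synthesize)):
        start = time.perf_counter()
        value = stage(value)
        timings[name] = time.perf_counter() - start
    # End-to-end latency is the sum of the three stage latencies.
    timings["total"] = timings["stt"] + timings["mt"] + timings["tts"]
    return value, timings


speech, timings = run_pipeline(b"Hello world")
```

Because the stages run one after another, the delay perceived by the listener is the sum of the three, which is why inference speed matters at every step and not only in translation.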

Integration with collaboration platforms

The DeepL suite tries to meet these requirements through various application modules that reflect specific usage scenarios. Integration with collaboration platforms such as Microsoft Teams, Google Meet and Zoom inserts translation directly into the flow of meetings, while web and mobile applications extend it to broader operational contexts, enabling immediate interactions even in less structured settings.

From an architectural point of view, what makes the difference (at least on paper) is the availability of APIs that open up voice translation functionality within enterprise applications such as contact centres and customer service tools. A further distinguishing feature of the new suite, according to those involved, is the management of terminology and the complexity of specialised language.

In fact, DeepL extends its glossaries to the voice component, making it possible to tie translations to domain-specific lexicons. At a technical level, this upgrade implies integrating customised dictionaries into the neural translation models and post-processing systems, so as to ensure terminological consistency and accuracy in real time, including in enterprise environments, where semantic accuracy is essential to avoid errors or ambiguities that may have significant operational impact.
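The post-processing side of this idea can be illustrated with a small sketch. It is not DeepL's method or API: the functions `enforce_glossary` and `missing_terms`, and the sample glossary, are hypothetical, and the approach shown (plain string substitution after translation) is only one simple way to bind output to a domain lexicon.

```python
import re
from typing import Dict, List


def enforce_glossary(translation: str, variants: Dict[str, List[str]]) -> str:
    """Replace non-preferred variants with the preferred glossary term."""
    for preferred, alternatives in variants.items():
        for alt in alternatives:
            pattern = r"\b" + re.escape(alt) + r"\b"
            # A function is used as the replacement so that the preferred
            # term is inserted literally, without escape processing.
            translation = re.sub(
                pattern, lambda _m: preferred, translation, flags=re.IGNORECASE
            )
    return translation


def missing_terms(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """List source terms whose required target term is absent from the translation."""
    src, tgt = source.lower(), translation.lower()
    return [s for s, t in glossary.items() if s.lower() in src and t.lower() not in tgt]


# Hypothetical contact-centre glossary: Italian source term -> required English term.
glossary = {"centralino": "PBX"}
draft = "Please restart the switchboard"
fixed = enforce_glossary(draft, {"PBX": ["switchboard"]})
```

Here `fixed` is "Please restart the PBX", and `missing_terms("Riavvia il centralino", fixed, glossary)` returns an empty list, whereas the same check on `draft` flags "centralino" as not rendered with the required term.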

From translation as a service to translation as infrastructure

Alongside the announcement of Voice-to-Voice, DeepL also unveiled the evolution of its Translator platform, in a direction that reflects the broader transformation of translation from a standalone application and service into an infrastructural component integrated into corporate workflows and technology stacks, supported by an architecture that combines next-generation neural translation models with flow orchestration mechanisms and integration via APIs.

The approach, in a nutshell, is that of an AI-first platform that plugs into existing workflows, reducing the need for manual steps and separate tools, and that aims not only to save time but also to improve the overall quality of multilingual communications. Translation thus becomes a transversal layer within corporate systems, capable of operating seamlessly and automatically within the tools already in use. Content can be intercepted directly in corporate systems (CRM, collaboration platforms) and translated automatically with dynamic application of linguistic rules, tone and terminology, while evaluation mechanisms make it possible to estimate the reliability of the result in advance (rather than a posteriori, as in traditional models) and to intervene with revisions only when necessary.
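The "estimate first, revise only when necessary" logic can be pictured as a simple routing step. The sketch below rests on stated assumptions: the `Segment` type, its `confidence` field, the `route` function and the 0.8 threshold are all hypothetical, and how a reliability score would actually be computed is not described in the announcements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    source: str
    translation: str
    confidence: float  # hypothetical reliability estimate in [0, 1]


def route(
    segments: List[Segment], threshold: float = 0.8
) -> Tuple[List[Segment], List[Segment]]:
    """Split segments into auto-approved ones and ones sent to human review."""
    approved: List[Segment] = []
    review: List[Segment] = []
    for segment in segments:
        if segment.confidence >= threshold:
            approved.append(segment)
        else:
            review.append(segment)
    return approved, review


batch = [
    Segment("Buongiorno", "Good morning", 0.95),
    Segment("Clausola di manleva", "Hold clause", 0.55),
]
approved, review = route(batch)
```

With this sample batch, the greeting is approved automatically and the legal term lands in the review queue, so human effort goes only to the segments the estimate marks as doubtful.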

Continuous and progressive learning is, according to DeepL's managers, a further strength of the new Translator. Corrections made by users are used to update the models or to refine customisation levels, creating a sort of 'linguistic memory' specific to each organisation (with security and data governance implications that still need to be managed). What seems certain, reading between the lines of DeepL's announcements, is that AI is increasingly moving from individual tools to core processes, becoming an integral part of an enterprise's operational infrastructure.