ChatMinerva, the Italian AI with real-time web access, arrives
It is a multimodal AI assistant capable of reading text, interpreting images, analysing documents and browsing the Web in real time, all while conversing in Italian with a level of reliability unprecedented for a model developed entirely in Italy.
Imagine a system to which you can upload photos of pages in a foreign language and have them translated, and perhaps even summarised, into Italian in real time. Or a model that can be asked to analyse scientific articles in detail. These are not absolute novelties in the world of artificial intelligence, but they are in the Italian landscape. In Italy, the novelty comes from ChatMinerva, just presented by the Sapienza NLP research group at Sapienza University of Rome, led by Professor Roberto Navigli, in collaboration with Babelscape, an academic spin-off founded ten years ago.
The project stands out for a feature that, in the current landscape, is far from a given: transparency and direct control over the entire life cycle of the system, from pre-training to fine-tuning to content moderation mechanisms.
From voice to OCR, up to 32 thousand tokens
There are several technical innovations. On the multimodal understanding front, the model can now process photographs, scanned pages, reports and scientific articles, combining visual and textual information and performing optical character recognition (OCR) on digitised documents. Users can also interact with the system by voice.
On the information access front, ChatMinerva integrates a Web RAG (Retrieval-Augmented Generation) system based on the open search engine DuckDuckGo, which allows the model to draw on up-to-date sources in real time, overcoming the typical limitation of models trained on static data.
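The general idea behind retrieval-augmented generation can be sketched in a few lines of Python. This is an illustrative sketch only, not ChatMinerva's actual implementation: the `Snippet` type, the `build_rag_prompt` function and the `fake_search` stand-in are all hypothetical names, and a real system would replace `fake_search` with a call to a web search service.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    """One retrieved web passage (hypothetical structure)."""
    url: str
    text: str


def build_rag_prompt(
    question: str,
    search: Callable[[str], List[Snippet]],
    max_snippets: int = 3,
) -> str:
    """Retrieve passages for the question and place them before it in the prompt."""
    snippets = search(question)[:max_snippets]
    parts = ["Answer in Italian using only the sources below."]
    for i, snippet in enumerate(snippets, start=1):
        # Number each source so the model can cite it.
        parts.append(f"[{i}] {snippet.url}\n{snippet.text}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)


def fake_search(query: str) -> List[Snippet]:
    # Assumption: canned results standing in for a real web search call.
    return [
        Snippet("https://example.org/a", "First retrieved passage."),
        Snippet("https://example.org/b", "Second retrieved passage."),
    ]


prompt = build_rag_prompt("Che tempo fa a Roma?", fake_search)
print(prompt)
```

The language model then receives the assembled prompt, so its answer can rest on freshly retrieved text rather than only on what it saw during training.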
Also noteworthy is the extension of the context window to 32,000 tokens, achieved through continued training: a threshold that allows long documents and extended conversations to be handled without loss of coherence. The whole system is overseen by a dedicated safety component, which analyses inputs and outputs to filter out unwanted, unreliable or sensitive content.
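To make the two ideas concrete, here is a minimal Python sketch of how an application might keep a conversation within a 32,000-token budget and check both input and output against a filter. Everything in it is an assumption for illustration: `count_tokens` uses whitespace-separated words as a crude proxy for real model tokens, and `fit_to_window`, `guarded_reply` and the `is_allowed` predicate are hypothetical names, not part of ChatMinerva.

```python
from typing import Callable, List

CONTEXT_LIMIT = 32_000  # tokens, the figure cited in the article


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words as a rough proxy for tokens.
    return len(text.split())


def fit_to_window(messages: List[str], limit: int = CONTEXT_LIMIT) -> List[str]:
    """Keep the most recent messages whose combined size fits the limit."""
    kept: List[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > limit:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def guarded_reply(
    user_input: str,
    generate: Callable[[str], str],
    is_allowed: Callable[[str], bool],
    refusal: str = "Mi dispiace, non posso rispondere.",
) -> str:
    """Check the input before generation and the output after it."""
    if not is_allowed(user_input):
        return refusal
    reply = generate(user_input)
    return reply if is_allowed(reply) else refusal


history = ["a b c", "d e", "f"]
print(fit_to_window(history, limit=3))
```

With a limit of 3, the oldest message is dropped because adding it would exceed the budget. A production filter would be a trained classifier rather than a simple predicate, but the placement is the same: once before the model sees the input, once before the user sees the output.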

