TL;DR
Italy’s Minerva LLM, built from scratch with extensive Italian data, outperforms comparable multilingual models on Italian benchmarks, yet Minerva-3B scored only 4.9% on the INVALSI Italian school-exam benchmark. The result challenges assumptions about scale and investment in sovereign models.
Italy’s Minerva is a family of large language models trained entirely from scratch on up to 2.5 trillion tokens with approximately 50% Italian content. Despite that technical effort, its 3-billion-parameter variant, Minerva-3B, scored only 4.9% on the INVALSI Italian school-exam benchmark.
The Minerva project, led by Sapienza University of Rome and supported by Italy’s national AI strategy, built a 7-billion-parameter model using the CINECA supercomputer with 128 GPUs. It publicly released weights, training data, and code, making it a notable case of European sovereign AI infrastructure.
While Minerva outperforms comparable multilingual models on Italian benchmarks, its performance on the INVALSI exam indicates a significant gap in understanding complex academic content. Researchers concluded that dataset size and parameters are more crucial than language-specific data volume for complex tasks, highlighting a structural challenge for sovereign models.
This low exam score is a key empirical finding, suggesting that even substantial native-language investment may not suffice at current model scales to achieve deep country-specific knowledge, raising questions about the effectiveness of scale alone in sovereign LLM development.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as a fully open release from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding that the press coverage has been quiet about, and its implications extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose to train from scratch on a substantial native-language foundation. Portugal chose continued pre-training of a multilingual model. Comparing the two shows what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.
As an affiliate, we earn on qualifying purchases.
4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the Italian school-exam benchmark. The result was unambiguous. This is not a critique of Minerva — it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
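To make the headline figure concrete, here is a minimal sketch of how an exam-style benchmark score can be computed. It assumes the score is plain exact-match accuracy over the questions. The article does not describe how INVALSI answers were actually scored, so the function names, the matching rule, and the toy data below are all hypothetical. The second helper gives the chance level for a multiple-choice question with a given number of options, which is a useful yardstick when reading a low score.

```python
from typing import Sequence


def exam_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of answers that match the answer key (case-insensitive).

    Assumption: exact-match scoring; the real benchmark may score differently.
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty exam")
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on multiple-choice items."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return 1.0 / num_options


# Toy data, invented purely for illustration (not the real exam):
# 1000 questions whose key is always "a", with 49 answered correctly.
toy_gold = ["a"] * 1000
toy_predictions = ["a"] * 49 + ["b"] * 951

toy_score = exam_accuracy(toy_predictions, toy_gold)
print(f"toy accuracy: {toy_score:.1%}")
print(f"chance on 4-option items: {chance_accuracy(4):.1%}")
```

On this toy exam, 49 correct answers out of 1000 corresponds to the same 4.9% rate reported for Minerva-3B. Whether that sits above or below chance depends on the question formats in the real benchmark, which the article does not specify.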
350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: training corpus by model scale. Only these cell fragments survive: “Italian + English”, “100B English”, “~50% English”, “+ 200B code”.]
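The corpus figures quoted in this article (2.5 trillion tokens, roughly 50% Italian) can be turned into a rough token budget. The sketch below is back-of-the-envelope arithmetic only. The helper names are hypothetical, and because the article does not state unambiguously which parameter tier was trained on the full 2.5 trillion tokens, the parameter count is treated as an input rather than a fact.

```python
def language_token_budget(total_tokens: float, language_share: float) -> float:
    """Tokens attributable to one language, given its share of the corpus."""
    if not 0.0 <= language_share <= 1.0:
        raise ValueError("language_share must be between 0 and 1")
    return total_tokens * language_share


def tokens_per_parameter(total_tokens: float, parameters: float) -> float:
    """Training tokens seen per model parameter."""
    if parameters <= 0:
        raise ValueError("parameters must be positive")
    return total_tokens / parameters


TOTAL_TOKENS = 2.5e12   # 2.5 trillion tokens, as quoted in the article
ITALIAN_SHARE = 0.5     # "approximately 50% Italian content"

italian_tokens = language_token_budget(TOTAL_TOKENS, ITALIAN_SHARE)
print(f"Italian tokens: ~{italian_tokens / 1e12:.2f} trillion")

# Illustrative parameter counts taken from the tiers named in the article.
for params in (3e9, 7e9):
    ratio = tokens_per_parameter(TOTAL_TOKENS, params)
    print(f"{params / 1e9:.0f}B parameters: ~{ratio:.0f} tokens per parameter")
```

Under these assumptions, the Italian portion comes to about 1.25 trillion tokens. The tokens-per-parameter ratio is shown only to illustrate how the same corpus translates into very different training regimes at different model sizes.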
Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets.
Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as the norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications of Scale and Investment in Sovereign LLMs
The results from Minerva demonstrate that simply increasing native-language data and model size may not guarantee proficiency in complex language tasks. This has profound implications for European countries investing heavily in sovereign AI, emphasizing the need to reconsider scaling strategies and resource allocation.
The findings challenge the narrative that training from scratch with large datasets automatically leads to high-level language understanding, suggesting instead that more nuanced approaches or larger scales are necessary for meaningful country-specific AI capabilities.
European Sovereign LLM Strategies and Challenges
The Italian Minerva project represents a significant effort in European sovereign AI, involving extensive data collection, high-performance computing, and open dissemination of models and data. It contrasts with other approaches like Portugal’s AMÁLIA, which layered specialization onto multilingual foundations.
Despite technical successes, Minerva’s low exam score underscores ongoing debates about the optimal scale and methodology for developing effective national language models. The project’s empirical results add a critical data point to these discussions, highlighting the importance of scale and investment levels.
“Minerva’s results challenge the assumption that native-language data and scale automatically produce country-knowledge depth.”
— Thorsten Meyer, AI researcher
Unclear Impact of Scaling on Country-Specific AI
It remains uncertain whether increasing model size further or adopting different training methodologies will significantly improve Minerva’s performance on complex national tasks. The exact threshold of necessary investment and scale is still unconfirmed, and ongoing research may alter current understanding.
Next Steps in Evaluating Sovereign Language Models
The Minerva team plans to continue iterating on training methodologies, including ongoing experiments in 2025, to determine if larger models or refined approaches can close the performance gap. Further empirical testing on diverse language tasks will inform future investment strategies and model development in Italy and Europe.
Key Questions
Why did Minerva score so low on the Italian exam despite large-scale training?
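The low score suggests that a large share of native-language data is not sufficient, at the 3B scale, for mastering complex academic content. The researchers point to overall dataset size and parameter count as the more decisive factors; training methodology and data quality may also be critical.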
Does this mean sovereign models are ineffective?
Not necessarily; it indicates that current scaling strategies may need to be complemented with different approaches or larger investments to achieve desired proficiency levels.
What are the implications for European AI policy?
The findings suggest that European countries should reconsider their investment levels and strategies, emphasizing that more scale alone may not suffice for country-specific AI capabilities.
Will increasing model size improve performance?
This remains uncertain; ongoing research aims to determine whether larger models or different training approaches can overcome current limitations.
How does Minerva compare to other European sovereign models?
Minerva’s approach of training from scratch with extensive native-language data contrasts with models like Portugal’s AMÁLIA, which layered specialization onto multilingual foundations. Its performance raises questions about the efficacy of different strategies.
Source: ThorstenMeyerAI.com