
"The race to make AI as multilingual as Europe
Can Europe stop AI from becoming English by default?


June 30, 2025 - 6:00 am


Image by: AbsolutVision
The European Union has 24 official languages and dozens more unofficial ones spoken across the continent. Add in the European countries outside the union, and that brings at least a dozen more into the mix. Include dialects, endangered languages, and languages brought by migrants, and you end up with hundreds of languages spoken in Europe.


One thing most of us in technology can agree on is that the US dominates, and that dominance extends to online languages. The main reason is that American institutions, standards bodies, and companies defined how computers, their operating systems, and the software they run worked in their nascent days. This is changing, but in the short term at least, it remains the norm. It has also left the web predominantly English: an astounding 50% of websites are in English, despite it being the native tongue of only about 6% of the world’s population. Spanish, German, and Japanese come next, but a long way behind, each accounting for just 5-6% of the web.


The new wave of AI-powered applications and services is largely driven by data in large language models (LLMs). As much of that data is scraped (controversially, in many cases) from the web, LLMs predominantly understand and respond in English. With the rapid growth of AI tools putting us at the start, or in the midst, of a technological paradigm shift, this is a problem, and one we risk carrying into a new age.


Europe already boasts several high-profile AI companies and projects, such as Mistral and Hugging Face. Google DeepMind also originated as a European company. The continent has research projects that develop language models to enhance how AI tools comprehend less commonly spoken languages.


This article explores some of these initiatives, questions their effectiveness, and asks whether their efforts are worthwhile or if many users default to using English versions of tools. As Europe seeks to build its independence in AI and ML, does the continent have the companies and skills necessary to achieve its goals?


Terminology and technology primer
To make sense of what follows, you don’t need to understand how models are created, trained, or function. But it’s helpful to understand a couple of basics about models and their human language support.


Unless a model’s documentation explicitly mentions that it is multilingual or cross-lingual, prompting it or requesting a response in an unsupported language may cause it to translate back and forth, or to respond in a language it does understand. Both strategies can produce unreliable and inconsistent results, especially in low-resource languages.


While high-resource languages, such as English, benefit from abundant training data, low-resource languages, such as Gaelic or Galician, have far less, which often leads to inferior performance.


The harder model-related concept to explain is “open”. This is unusual, as software in general has had a fairly clear definition of “open source” for a while. I don’t want to delve too deeply into the topic, as the exact definition is still in flux and controversial. In short: even when a model calls itself “open” and is referenced as “open”, the meaning of “open” isn’t always the same.


Here are two other useful terms to know:


Training teaches a model to make predictions or decisions based on input data.


Parameters are variables learned during model training that define how the model maps inputs to outputs. In other words, how it understands and responds to your questions. The larger the number of parameters, the more complex the model is.
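
To make “parameters” concrete, here’s a minimal sketch in Python, assuming the Hugging Face transformers library and using a small, openly available model as a stand-in: it loads the model and counts the variables learned during training.

```python
from transformers import AutoModelForCausalLM

# Load a small, openly available model and count its learned parameters.
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
total = sum(p.numel() for p in model.parameters())
print(f"distilgpt2 has {total:,} parameters")  # on the order of 80 million
```

A model advertised as 7B or 9B is the same idea at scale: billions of these learned variables rather than millions.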


With that brief explanation done, how are European AI companies and projects working to enhance these processes to improve European language support?


Hugging Face
When someone wants to share code, they typically provide a link to their GitHub repository. When someone wants to share a model, they typically provide a Hugging Face link. Founded in 2016 by French entrepreneurs in New York City, the company is an active community builder and a strong proponent of open models. In 2024, it started an AI accelerator for European startups and partnered with Meta to develop translation tools based on Meta’s “No Language Left Behind” model. It is also one of the driving forces behind BLOOM, a groundbreaking multilingual model that set new standards for international collaboration, openness, and training methodologies.


Hugging Face is a useful tool for getting a rough idea of the language support in models. At the time of writing, it lists 1,743,136 models and 298,927 datasets. Look at its leaderboard for monolingual models and datasets, and you see the following ranking for models and datasets that developers have tagged (added metadata to) as supporting European languages:


| Language | Language code | Datasets | Models |
| --- | --- | --- | --- |
| English | en | 27,702 | 205,459 |
| English | eng | 1,370 | 1,070 |
| French | fra | 1,933 | 850 |
| Spanish (Español) | es | 1,745 | 10,028 |
| German (Deutsch) | de | 1,442 | 9,714 |
You can already see some issues here. These tags aren’t set in stone; the community can add values freely. While developers follow the conventions for the most part, there is some duplication, such as English appearing under both “en” and “eng”.


As you can see, the models are dominated by English. A similar issue applies to the datasets on Hugging Face, which lack non-English data.
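
You can explore these tags yourself. Here’s a minimal sketch, assuming Python and a recent release of the huggingface_hub library, that queries the same community-supplied language tags the table above is built from:

```python
from huggingface_hub import HfApi

api = HfApi()

# List a few models for each language tag; the counts and results will have
# moved on since the table above was captured.
for code in ["en", "eng", "fr", "fra", "es", "de"]:
    models = api.list_models(language=code, limit=3)  # limit keeps the query cheap
    print(code, [m.id for m in models])
```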


What does this mean?


Lucie-Aimée Kaffee, EU Policy Lead at Hugging Face, said that the tags indicate that a model has been trained to understand and process this language or that the dataset contains materials in that language. She added that confusion about language support often arises during training. “When training a large model, it’s common for other languages to accidentally get caught in training because there were some artefacts of it in that dataset,” she said. “The language a model is tagged with is usually what the developers intended the model to understand.”


As one of the main and busiest destinations for model developers and researchers, Hugging Face not only hosts much of their work, but also lets them create outward-facing communities to tell people how to use them.


Thomas Wolf, co-founder of Hugging Face, described BLOOM as “the world’s largest open multilingual language model.” Credit: Shauna Clinton/Web Summit via Sportsfile
Mistral AI
Perhaps the best-known Europe-based AI company is France’s Mistral AI, which unfortunately declined an interview. Its multilingual challenges partly inspired this article: at the FOSDEM developer conference in February 2024, linguistics researcher Julie Hunter asked one of Mistral’s models for a recipe in French, but it responded in English. However, 16 months is an eternity in AI development, and in recent tests, neither the company’s “Le Chat” chat interface nor its 7B model running locally reproduced the error. Interestingly, though, the 7B model did produce a spelling error in the opening line of its response (“boueef”), and more errors may follow.


While Mistral sells several commercial models, tools, and services, its free-to-use models are popular, and I personally tend to use Mistral 7B for running tasks through local models.
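
To re-run something like Hunter’s recipe test locally, a setup along these lines works. This is a hedged sketch: it assumes Python with the transformers library, hardware that can hold the 7B weights, and Mistral’s instruct variant on Hugging Face as the model ID.

```python
from transformers import pipeline

# Load Mistral 7B (instruct variant) for local text generation.
chat = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

# Prompt in French ("Give me a simple beef bourguignon recipe") and check
# whether the reply stays in French.
prompt = "Donne-moi une recette simple de bœuf bourguignon."
reply = chat(prompt, max_new_tokens=250, return_full_text=False)
print(reply[0]["generated_text"])
```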


Until recently, the company wasn’t explicit about its models having multilingual support, but its announcement of the Magistral model at London Tech Week in June 2025 confirmed support for several European languages.


EuroLLM
EuroLLM was created as a partnership between Portuguese AI platform Unbabel and several European universities, with the goal of building models that understand and generate text in all official European Union languages. The models also cover non-European languages widely spoken by immigrant communities and major trading partners, such as Hindi, Chinese, and Turkish.


Like some of the other open model projects in this article, its work was partly funded by the EU’s High Performance Computing Joint Undertaking program (EuroHPC JU). Many of these projects share similar names and aims, which makes them hard to tell apart. EuroLLM was one of the first, and as Ricardo Rei, Senior Research Scientist at Unbabel, told me, the team has learned a lot from the projects that have come since.


As Unbabel’s core business is language translation, and translation is a key task for many multilingual models, the work on EuroLLM made sense for the Portuguese platform. Before EuroLLM, Unbabel had already been refining existing models to build its own, and found them all too English-centric.


One of the team’s biggest challenges was finding sufficient training data for low-resource languages. Ultimately, the availability of training material reflects the number of people who speak the language. One of the common data sources used to train European language models is Europarl, which contains transcripts of the European Parliament’s activities translated into all official EU languages. It’s also available as a Hugging Face dataset, thanks to ETH Zürich.
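
Europarl is straightforward to work with directly. Here’s an illustrative sketch, assuming Python and the datasets library; the repository ID and config name are assumptions, so check the dataset card for the exact identifiers and available language pairs.

```python
from datasets import load_dataset

# Load an English-French slice of the Europarl parallel corpus.
europarl = load_dataset("Helsinki-NLP/europarl", "en-fr", split="train")

# Each record is a sentence pair aligned across the two languages.
pair = europarl[0]["translation"]
print(pair["en"])
print(pair["fr"])
```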


Currently, the project has a 1.7B parameter model and a 9B parameter model, and is working on a 22B parameter model. In all cases, the models can translate, but are also general-purpose, meaning you can chat with them in a similar way to ChatGPT, mixing and matching languages as you do.
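
As an illustration of that chat-style use, the sketch below mixes languages in a single message. The repository ID is an assumption, so check EuroLLM’s Hugging Face page for the current instruct model name, and note that the 9B weights need capable hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "utter-project/EuroLLM-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Switch languages mid-message: Finnish ("Hi! Answer in Finnish:") plus English.
messages = [{"role": "user", "content": "Hei! Vastaa suomeksi: what is the capital of Estonia?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```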


OpenLLM Europe
OpenLLM Europe isn’t building anything directly, but it’s fostering a Europe-wide community of LLM projects, specifically those covering medium- and low-resource languages. Don’t let the one-page GitHub repository fool you: the Discord server is lively and active.


OpenEuroLLM, Lumi, and Silo
A joint project between several European universities and companies, OpenEuroLLM is one of the newer and larger entrants on the list of projects funded by EuroHPC. Being newer means it has no public models as of yet, but it involves many of the institutions and individuals behind the Lumi family of models, which focus on Scandinavian and Nordic languages. It aims to create a multilingual model, provide more datasets for other models, and conform to the EU AI Act.


I spoke with Peter Sarlin of AMD Silo, one of the companies involved in the project and a key figure in Finnish and European AI development, about the plans. He explained that Finland especially has several institutes with significant AI research programs, as well as Lumi, one of the EuroHPC supercomputers. Silo, through its SiloGen product, offers open source models to customers, with a strong focus on supporting European languages. Sarlin pointed out that while sovereignty is an important motivation for him and Silo to create and maintain models that support European languages, the better reason is expanding the business and helping companies build solutions for small markets such as Estonia.


“Open models are great building blocks, but they aren’t as performant as closed ones, and many businesses in the Nordics and Scandinavia don’t have the resources to build tools based on open models,” he said. “So Silo and our models can step in to fill the gaps.”


Under Sarlin’s leadership, Silo AI built a Nordic LLM family to protect the region’s linguistic diversity. Credit: Silo AI
The Lumi models use a “cross-lingual training” technique in which the model shares its parameters between high-resource and low-resource languages.
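
As a conceptual sketch of the data side of that technique (an illustration of the general idea, not Lumi’s actual pipeline), each training batch mixes high- and low-resource text so a single shared set of parameters is updated by both, letting rarer languages piggyback on patterns learned from abundant ones:

```python
import random

# Example pools; real training would stream tokenised corpora instead.
high_resource = ["An English sentence.", "Ein deutscher Satz."]
low_resource = ["Lause eesti keeles.", "Setning på nynorsk."]

def sample_batch(batch_size=8, low_share=0.25):
    """Draw a mixed-language batch, deliberately oversampling low-resource text."""
    return [
        random.choice(low_resource if random.random() < low_share else high_resource)
        for _ in range(batch_size)
    ]

print(sample_batch())
```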


All this prior work led to the OpenEuroLLM project, which Sarlin describes as “Europe’s largest open source AI initiative ever, including pretty much all AI developers in Europe apart from Mistral.”


While many efforts are underway and performing well, the training data issue for low-resource languages remains the biggest challenge, especially amid the move towards more nuanced reasoning models. Translations and cross-lingual training are options, but can create responses that sound unnatural to native speakers. As Sarlin said, “We don’t want a model that sounds like an American speaking Finnish.”


OpenLLM France
France is one of the more active countries in AI development, with Mistral and Hugging Face leading the way. From a community perspective, the country also has OpenLLM France. The project (unsurprisingly) focuses on French language models, offering several models of different parameter sizes, plus datasets that help other projects train and improve their French support. The datasets include a mix of political discourse, meeting recordings, theatre shows, and casual conversations. The project also maintains a leaderboard of French models on Hugging Face, one of the few (active) European language model benchmark pages.


Do Europeans care about multilingual AI?
Europe is full of people and projects working on multilingual language models. But do consumers care? Unfortunately, getting language usage rates for proprietary tools such as ChatGPT or Mistral is almost impossible. I created a poll on LinkedIn asking whether people use AI tools in their native language, in English, or in a mixture of both. The results were a 50/50 split between English and a mixture of languages. This could indicate that the number of people using AI tools in a non-English language is higher than you might think.


Typically, people use AI tools in English for work and in their own language for personal tasks.


Kaffee, a German and English speaker, said: “I use them mostly in English because I speak English at work and with my partner at home. But then, for personal tasks… I use German.”


Kaffee mentioned that Hugging Face was working on a soon-to-be-published research project that fully analysed the usage of multilingual models on the platform. She also noted anecdotally that their usage is on the rise.


“Users have a conception that models are now more multilingual. And with the accessibility through large models like Llama, for example, being multilingual, I think that made a big impact on the research world regarding multilingual models and the number of people wanting to now use them in their own language.”


The internet was always supposed to be global and for everyone, but the damning statistic that 50% of sites are in English shows it never really worked out that way. We’re entering a new phase in how we access information and who controls it. Maybe this time, the (AI) revolution will be international.


STORY BY
Chris Chinchilla
Technology writer, podcaster, and video maker by day. Fiction, games, and music by night. chrischinchilla.com


https://thenextweb.com/news/making-multilingual-ai-in-europe