India’s many languages pose a challenge to the development of its large language model

"...India currently has 22 officially recognised languages and more ... local languages, making it complicated to code an AI model that can process all these languages seamlessly.


Ishan Garg & Kevin Lam
26 May 2025 06:32PM


NEW DELHI: India is building its own large language model that it hopes may one day rival OpenAI’s chatbot ChatGPT, but the country’s countless languages and dialects have made training it a challenge.


Some languages, like Marathi, share common roots with others such as Hindi and Gujarati, while those spoken in South India, such as Kannada, Telugu, Tamil and Malayalam, are completely different.


A large language model has to process these multiple languages seamlessly, and coding an AI model capable of understanding most of them, if not all, remains complicated.


TRAINING AI ON LOCAL LANGUAGES


One challenge faced by BharatGen, a consortium funded by India’s government, in training its large language model is a lack of online content in Indian languages.


The consortium said that while roughly half of all the data available on the internet is in English, Indian languages make up barely 1 per cent.


Literary works in many Indian languages have never been digitised, while a raft of cultural and traditional information has been passed down orally for generations without being stored online.


On a more positive note, experts said that the diversity of languages and data collected from local sources could help create AI models with fewer biases.


Ganesh Ramakrishnan, a professor at the Indian Institute of Technology Bombay, told CNA his work involved reaching out to magazines, data sources, foundations and non-governmental organisations that have been gathering data in their local languages.
...
Experts said platforms like BharatGen need to invest billions of dollars in graphics processing units and data centres to achieve made-in-India generative AI at scale.


The hefty outlay would be a small price to pay to transform India from a major tech service provider into a major tech disruptor, in what could soon be a trillion-dollar market.


“India is all about scale and complexity,” said Shekar Sivasubramanian, head of the LEHS-AI unit at non-profit AI institute Wadhwani AI.


“If it is solved in India, and if it works in India, chances are, it will work in the world. That’s the opportunity.”"


https://www.channelnewsasia.com/asia/india-ai-language-model-chatbots-bharatgen-5153516