
Published on January 14, 2024

CognitiveLab Unveils Ambari, Bilingual Language Models in Kannada-English


CognitiveLab has introduced Ambari, an open-source series of bilingual Kannada-English large language models (LLMs). The initiative addresses the challenges posed by the dynamic landscape of LLMs, with a primary focus on bridging the linguistic gap between Kannada and English.

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset. The models are available on Hugging Face.

In the blog post, CognitiveLab shares insights into the purpose behind Ambari and the meticulous approach taken during its development. The project is driven by the need to pioneer language adaptability within LLMs, pushing the boundaries of efficiency by training and fine-tuning on a modest 1-billion-token dataset.

Ambari’s training process involves distinct stages, including pre-training, bilingual next-token prediction/translation, instruct fine-tuning, and more. Efficient tokenization, a critical component, is achieved through a specialized model built with SentencePiece, addressing the challenges Kannada text poses for open-source LLMs.
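For readers curious what such tokenizer extension typically looks like, here is a minimal sketch of training a Kannada SentencePiece model and adding its pieces to an existing Llama tokenizer. The file paths, vocabulary size, base checkpoint, and merging approach are illustrative assumptions, not CognitiveLab's actual pipeline.

```python
# Minimal sketch (assumed, not CognitiveLab's actual pipeline): train a SentencePiece
# model on a Kannada corpus and extend a base Llama tokenizer with the new pieces.
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Train a Kannada-specific SentencePiece model (hypothetical corpus file).
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",      # plain-text Kannada corpus, one sentence per line
    model_prefix="kannada_sp",
    vocab_size=16000,                # illustrative size
    character_coverage=1.0,          # keep all Kannada characters
    model_type="bpe",
)

# 2. Collect the newly learned subword pieces.
sp = spm.SentencePieceProcessor(model_file="kannada_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# 3. Add pieces missing from the base tokenizer and resize the model's embeddings.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
num_added = tokenizer.add_tokens([p for p in new_pieces if p not in tokenizer.get_vocab()])

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} Kannada tokens; new vocab size: {len(tokenizer)}")
```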

The blog highlights continual pre-training on a curated dataset of 500 million tokens, and underscores the commitment to open-source knowledge sharing by making the fully fine-tuned model weights available on Hugging Face.
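Continual pre-training of this kind is usually run as a standard causal-language-modelling job over the new corpus. The sketch below shows one plausible setup with the Hugging Face Trainer; the dataset file, sequence length, and hyperparameters are assumptions for illustration, not the values CognitiveLab used.

```python
# Hedged sketch of continual pre-training as an ordinary causal-LM training run.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical bilingual corpus file (~500 million tokens in the article's account).
dataset = load_dataset("text", data_files={"train": "kannada_english_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ambari-continual-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language modelling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```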

A pivotal addition to the training strategy is the bilingual next-token prediction phase, inspired by the Hathi series. Challenges in translation and fine-tuning are acknowledged, emphasizing the commitment to refining bilingual capabilities within Ambari.
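The core idea of bilingual next-token prediction is that parallel sentence pairs are laid out as ordinary text, so the model learns to produce the translation simply by continuing the sequence. The formatting below is one plausible illustration; the article does not disclose the exact template CognitiveLab used.

```python
# Illustrative only: one way to turn parallel Kannada-English pairs into plain
# next-token-prediction text (the actual Ambari format is not published here).
def to_bilingual_example(english: str, kannada: str) -> str:
    # The model continues the text after "Kannada:", so translation is learned
    # as ordinary next-token prediction rather than via a separate objective.
    return f"English: {english}\nKannada: {kannada}"

pairs = [
    ("Hello", "àČšàČźàČžàłàČ•àČŸàČ°"),
]
training_texts = [to_bilingual_example(en, kn) for en, kn in pairs]
print(training_texts[0])
```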

The blog details supervised fine-tuning with low-rank adaptation (LoRA), introducing a chat template structure for bilingual instruct fine-tuning. The final phase explores Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, which is being evaluated for its impact on performance.
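As a rough picture of what LoRA-based instruct fine-tuning involves, the sketch below configures low-rank adapters with the peft library and a simple bilingual chat template. The rank, target modules, and template markers are assumptions for illustration; CognitiveLab's exact configuration is not given in the post.

```python
# Minimal sketch of low-rank adaptation (LoRA) for the instruct fine-tuning stage;
# rank, target modules, and the chat template are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                        # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small adapter matrices are trained

# Hypothetical bilingual chat template: the instruction may be in English or
# Kannada, and the response is expected in the requested language.
def format_chat(instruction: str, response: str) -> str:
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
```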

Learnings and observations include occasional hallucinations, translation nuances, and the dilemma of full weight fine-tuning. The future roadmap for Ambari includes the incorporation of Romanized Kannada, refinement of data pipelines, and scaling the training dataset for continuous learning and model enhancement.

Interestingly, this is the second Kannada-based LLM. Recently, Mumbai-based software development company Tensoic released Kannada Llama, also known as Kan-LLaMA [àȕàČšàł-LLama], a 7B Llama 2 model LoRA pre-trained and fine-tuned on Kannada tokens.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.