
Published on January 14, 2024

CognitiveLab Unveils Ambari, Bilingual Language Models in Kannada-English


CognitiveLab has introduced Ambari, an open-source series of bilingual Kannada-English large language models (LLMs). The initiative addresses the challenges posed by the dynamic landscape of LLMs, with a primary focus on bridging the linguistic gap between Kannada and English.

Its inaugural models, Ambari-7B-base-v0.1 and Ambari-7B-instruct-v0.1, achieve impressive results despite being trained in multiple stages on a compact 1-billion-token dataset. The models are available on Hugging Face.

In the blog post, CognitiveLab shares insights into the purpose behind Ambari and the meticulous approach taken during its development. The project is driven by the need to pioneer language adaptability within LLMs, pushing the boundaries of efficiency by training and fine-tuning on a modest 1-billion-token dataset.

Ambari’s training process involves distinct stages, including pre-training, bilingual next-token prediction/translation, instruct fine-tuning, and more. Efficient tokenization, a critical component, is achieved through a specialized model built with SentencePiece, addressing the challenges Kannada text poses for open-source LLMs.
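For readers curious what such tokenizer extension typically looks like, here is a minimal sketch of training a Kannada SentencePiece model and adding its pieces to an existing Llama tokenizer. The file paths, vocabulary size, base checkpoint, and merging approach are illustrative assumptions, not CognitiveLab's actual pipeline.

```python
# Minimal sketch (assumed, not CognitiveLab's actual pipeline): train a SentencePiece
# model on a Kannada corpus and extend a base Llama tokenizer with the new pieces.
import sentencepiece as spm
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Train a Kannada-specific SentencePiece model (hypothetical corpus file).
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",      # plain-text Kannada corpus, one sentence per line
    model_prefix="kannada_sp",
    vocab_size=16000,                # illustrative size
    character_coverage=1.0,          # keep all Kannada characters
    model_type="bpe",
)

# 2. Collect the newly learned subword pieces.
sp = spm.SentencePieceProcessor(model_file="kannada_sp.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# 3. Add pieces missing from the base tokenizer and resize the model's embeddings.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
num_added = tokenizer.add_tokens([p for p in new_pieces if p not in tokenizer.get_vocab()])

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} Kannada tokens; new vocab size: {len(tokenizer)}")
```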

The blog highlights continual pre-training on a curated dataset of 500 million tokens, and underscores the commitment to open-source knowledge sharing by making the fully fine-tuned model weights available on Hugging Face.
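Continual pre-training of this kind is usually run as a standard causal-language-modelling job over the new corpus. The sketch below shows one plausible setup with the Hugging Face Trainer; the dataset file, sequence length, and hyperparameters are assumptions for illustration, not the values CognitiveLab used.

```python
# Hedged sketch of continual pre-training as an ordinary causal-LM training run.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token          # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Hypothetical bilingual corpus file (~500 million tokens in the article's account).
dataset = load_dataset("text", data_files={"train": "kannada_english_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ambari-continual-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # mlm=False gives plain next-token (causal) language modelling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```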

A pivotal addition to the training strategy is the bilingual next-token prediction phase, inspired by the Hathi series. Challenges in translation and fine-tuning are acknowledged, emphasizing the commitment to refining bilingual capabilities within Ambari.
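The core idea of bilingual next-token prediction is that parallel sentence pairs are laid out as ordinary text, so the model learns to produce the translation simply by continuing the sequence. The formatting below is one plausible illustration; the article does not disclose the exact template CognitiveLab used.

```python
# Illustrative only: one way to turn parallel Kannada-English pairs into plain
# next-token-prediction text (the actual Ambari format is not published here).
def to_bilingual_example(english: str, kannada: str) -> str:
    # The model continues the text after "Kannada:", so translation is learned
    # as ordinary next-token prediction rather than via a separate objective.
    return f"English: {english}\nKannada: {kannada}"

pairs = [
    ("Hello", "àČšàČźàČžàłàČ•àČŸàČ°"),
]
training_texts = [to_bilingual_example(en, kn) for en, kn in pairs]
print(training_texts[0])
```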

The blog details supervised fine-tuning with low-rank adaptation (LoRA), introducing a chat template structure for bilingual instruct fine-tuning. The final phase explores Direct Preference Optimization (DPO) using the Anthropic/hh-rlhf dataset, which is being evaluated for its impact on performance.
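As a rough picture of what LoRA-based instruct fine-tuning involves, the sketch below configures low-rank adapters with the peft library and a simple bilingual chat template. The rank, target modules, and template markers are assumptions for illustration; CognitiveLab's exact configuration is not given in the post.

```python
# Minimal sketch of low-rank adaptation (LoRA) for the instruct fine-tuning stage;
# rank, target modules, and the chat template are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                        # low-rank dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the small adapter matrices are trained

# Hypothetical bilingual chat template: the instruction may be in English or
# Kannada, and the response is expected in the requested language.
def format_chat(instruction: str, response: str) -> str:
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
```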

Learnings and observations include occasional hallucinations, translation nuances, and the dilemma of full weight fine-tuning. The future roadmap for Ambari includes the incorporation of Romanized Kannada, refinement of data pipelines, and scaling the training dataset for continuous learning and model enhancement.

Interestingly, this is the second Kannada-based LLM. Recently, Mumbai-based software development company Tensoic released Kannada Llama, also known as Kan-LLaMA [àȕàČšàł-LLama], a 7B Llama 2 model LoRA pre-trained and fine-tuned on Kannada tokens.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism, putting forward ideas worth pondering in the era of artificial intelligence.