AI Singapore Collaborates With Google To Enhance Datasets Used To Train LLMs in Southeast Asian Languages

Kyle Chua
Mar 12, 2024

AI Singapore (AISG) and Google Research are teaming up to improve inclusivity in Southeast Asian Large Language Models (LLMs).

The two announced today the launch of the research collaboration called Project SEALD (Southeast Asian Languages in One Network Data), which is meant to enhance datasets that can be used to train LLMs in languages spoken across Southeast Asia (SEA).

For those unfamiliar, LLMs are artificial intelligence (AI) models that can understand and generate human language text.

The first languages to be covered include Indonesian, Thai, Tamil, Filipino and Burmese. These languages are expected to help build a diverse corpus of data in languages spoken in the SEA region, supporting the training of models under SEA-LION (Southeast Asian Languages in One Network), an AISG initiative that seeks to develop LLMs pre-trained and instruction-tuned to be more representative of SEA's cultural contexts and linguistic nuances.

As part of Project SEALD, AISG and Google Research Asia Pacific (APAC) also plan to develop translocalisation and translation models, establish best practices for instruction-tuning datasets, create tools to enable translocalisation at scale and publish pre-training recipes for SEA languages.

"Google is proud to be partnering with AISG to put Singapore and SEA on the map of AI model development," said Yolyn Ang, Vice President, Knowledge and Information Partnerships, Google Asia Pacific. "By focusing on languages spoken and used in SEA and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages."

The collaboration also intends to release the datasets as open source to advance the SEA LLM ecosystem and broaden the applicability of the technology across the region. For example, the datasets can improve communications with under-represented populations of migrant workers in Singapore by better capturing linguistic nuances within these communities, providing a foundation for enhanced engagement by both the Singapore Government and employers.

In addition, when the datasets are integrated into one of the generative AI solutions first developed under the AI Trailblazers initiative by the Singapore Government and Google Cloud, they can aid outreach across important domains, such as the redressal of worker grievances and the extension of assistance schemes.

Project SEALD also plans to engage with academia, as well as industry and government partners, for data collection, curation and quality checks, among other tasks.

Apart from Project SEALD, AISG is working with Google's Cloud division to make its SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, which provides organisations with access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards. Vertex AI allows organisations to use enterprise-grade tools to customise these models to address relevant use cases and integrate them into their own applications.

AISG has also initiated collaborations across Singapore and other SEA countries. For example, AISG has signed Memorandums of Understanding (MOUs) or Letters of Intent (LOIs) with Indonesian, Malaysian, and Vietnamese entities for the development of datasets and applications for regional LLMs. AISG has also been working with its partners in Thailand, the Philippines, and Indonesia to build resources on regional language syntax and semantics. Meanwhile, in Singapore, AISG works closely with public sector and R&D stakeholders on safety alignment and multimodality.

"The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve SEA-LION’s capabilities. We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community," said Leslie Teo, Senior Director of AI Products at AISG.
