Beijing AI Academy Unveils Multimodal Model

tech360.tv
Oct 22, 2024

Beijing Academy of Artificial Intelligence (BAAI) unveils Emu3, a groundbreaking multimodal AI model. Emu3 can interpret text, images, and video simultaneously, showcasing China's technological advancement. BAAI's innovative approach simplifies model training by using a unified AI architecture.

The launch positions Chinese AI developers at the forefront of innovation and narrows the gap with leading US counterparts in the AI sector.

Facing challenges like restricted access to advanced chips and limited capital compared to US companies, Chinese AI startups are striving to match the rapid pace of model development set by industry giants like OpenAI and Google. BAAI, a non-profit organisation, plays a pivotal role in fostering growth within China's AI community.

At a recent event in Beijing, BAAI showcased Emu3, its latest multimodal model. Emu3 uses a streamlined architecture that lets a single model both comprehend images and generate video clips. Unlike traditional models that focus on a single data type, multimodal models can process several kinds of input, such as text, video, and audio, at the same time.

Wang Zhongyuan, the head of BAAI, also known as the Zhiyuan Institute, hailed Emu3 as the "largest technological contribution in recent years" from the organisation. Emu3 employs a unified AI architecture that converts text, images, and video clips into tokens, the fundamental units of data processed by AI models.

This approach eliminates the need for separate models to handle different data types, streamlining training and making it more efficient to develop versatile AI models. BAAI reported that Emu3 surpasses established task-specific models such as Stable Diffusion XL in image generation and the vision-language model LLaVA in image understanding.
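
For readers who want a concrete picture of what "converting text, images, and video clips into tokens" can mean, here is a minimal toy sketch. It is not BAAI's code and does not reflect how Emu3's tokenizer actually works: the vocabulary sizes, the byte-level text scheme, the pixel-bucketing "codebook", and every function name are assumptions made purely for illustration. The point is only that once every input type maps into one shared range of integer ids, a single model can be trained on one sequence.

```python
# Toy illustration only: not BAAI's code. All names and sizes are hypothetical.
TEXT_VOCAB_SIZE = 256       # assumption: one id per byte of UTF-8 text
IMAGE_CODEBOOK_SIZE = 16    # assumption: 16 coarse "visual codes"
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # hypothetical begin-of-image marker
EOI = BOI + 1                                # hypothetical end-of-image marker


def tokenize_text(text: str) -> list[int]:
    """Map text to ids in [0, 256) using its UTF-8 bytes."""
    return list(text.encode("utf-8"))


def tokenize_image(pixels: list[list[int]]) -> list[int]:
    """Bucket 8-bit grayscale pixels into codebook ids placed after the text ids."""
    bucket = 256 // IMAGE_CODEBOOK_SIZE
    ids = [TEXT_VOCAB_SIZE + (p // bucket) for row in pixels for p in row]
    return [BOI, *ids, EOI]


def tokenize_video(frames: list[list[list[int]]]) -> list[int]:
    """Treat a clip as consecutive frames, each tokenized like an image."""
    ids: list[int] = []
    for frame in frames:
        ids.extend(tokenize_image(frame))
    return ids


def build_sequence(caption: str, frames: list[list[list[int]]]) -> list[int]:
    """Concatenate text and visual ids into one sequence for a single model."""
    return tokenize_text(caption) + tokenize_video(frames)


if __name__ == "__main__":
    frame = [[0, 255], [16, 128]]  # a made-up 2x2 grayscale "image"
    print(build_sequence("hi", [frame]))
```

In this sketch, text and visual ids never collide because the visual ids are shifted past the text range, so one model vocabulary covers both. A real system would replace the pixel bucketing with a learned visual tokenizer.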

Source: SCMP