Google announced its 1,000 Languages Initiative in November 2022, months before making headlines with its Bard AI chatbot.
The project seeks to create a machine-learning model that supports the world's one thousand most widely spoken languages.
Machine learning is an application of artificial intelligence (AI): the use of mathematical models built from data so that a computer can learn without being explicitly programmed.
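To make that idea concrete, here is a minimal sketch in Python (using the scikit-learn library, chosen purely for illustration). The program is never given the rule behind the data; it infers one from examples:

```python
# A minimal illustration of machine learning: the program is never told
# the rule y = 2x + 1; it infers it from example data.
from sklearn.linear_model import LinearRegression

X = [[0], [1], [2], [3], [4]]   # inputs
y = [1, 3, 5, 7, 9]             # outputs that happen to follow y = 2x + 1

model = LinearRegression()
model.fit(X, y)                 # the "learning" step: fit the model to data

print(model.predict([[10]]))    # ~[21.0], generalizing beyond the examples
```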
Google says the initiative aims to increase global inclusion for billions of people, and with its latest update, the project is now gaining ground.
Translating 1,000 Languages with Machine Learning
Google's team spent several months developing the Universal Speech Model (USM), a family of advanced speech models with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text spanning more than 300 languages.
This model will support the 1,000 Languages Initiative by performing automatic speech recognition (ASR) on widely spoken but under-resourced languages such as Ethiopia's Amharic, the Philippines' Cebuano, India's Assamese, and Azerbaijani.
However, TechXplore reports that because less common languages are spoken by relatively few people, and therefore lack the datasets needed for training, Google has temporarily limited the number of languages it is attempting to support to 100.
Google has also published a paper on the arXiv preprint server describing the USM.
Challenges with Making the Model
In a recent blog post, Google explains that obtaining enough data to train high-quality models for all these languages is one of the primary challenges in developing this technology.
Google's solution to this issue is self-supervised learning, which enables the model to learn from audio data without requiring manual labeling or transcription.
This method is more scalable because it can draw on far larger quantities of audio-only data across languages.
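To give a flavor of how that works, below is a toy masked-prediction objective in Python/PyTorch: parts of the audio are hidden, and the model is trained to reconstruct them, so the "labels" come from the signal itself. This is only an illustration of the general self-supervised idea, not Google's actual training code; the model size, masking rate, and feature shapes are all made up.

```python
# Illustrative self-supervised objective: mask random audio frames and train
# an encoder to reconstruct them. No human transcriptions are needed; the
# targets are the original (hidden) frames themselves.
import torch
import torch.nn as nn

frames = torch.randn(32, 100, 80)   # 32 clips, 100 frames each, 80 mel features

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=2,
)

mask = torch.rand(32, 100) < 0.15   # hide ~15% of frames at random
corrupted = frames.clone()
corrupted[mask] = 0.0               # zero out the masked frames

predicted = encoder(corrupted)      # the model must infer the missing audio
loss = nn.functional.mse_loss(predicted[mask], frames[mask])
loss.backward()                     # gradients flow with no labels involved
```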
The USM training pipeline consists of three steps: self-supervised learning on speech audio covering hundreds of languages, an optional pre-training step on text data to improve quality and language coverage, and fine-tuning on downstream tasks such as ASR or automatic speech translation using a small amount of supervised data.
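In code form, the pipeline can be summarized as below. This sketch is only a structural outline of the three stages described in Google's blog post; every function name is a hypothetical placeholder, not part of any real Google API.

```python
# High-level outline of the three-stage USM training pipeline.
# All functions are hypothetical placeholders.

def pretrain_self_supervised(model, audio_corpus):
    """Step 1: learn speech representations from unlabeled audio
    covering hundreds of languages (no transcriptions needed)."""
    ...

def pretrain_on_text(model, text_corpus):
    """Step 2 (optional): inject text data to improve quality and
    extend coverage to languages with little audio."""
    ...

def finetune(model, labeled_pairs, task):
    """Step 3: adapt to a downstream task such as ASR or speech
    translation, using a small amount of supervised (audio, text) data."""
    ...

# Order of operations, per the blog post:
# pretrain_self_supervised -> pretrain_on_text (optional) -> finetune
```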
Testing the USM on YouTube
According to Google, the USM has already been tested on YouTube Captions' multilingual speech data, which covers 73 languages with less than 3,000 hours of data per language on average.
Despite the limited supervised data, Google claims the model achieved an average word error rate of less than 30% across the 73 languages. On 18 of those languages, the USM outperformed OpenAI's recently released Whisper model, with a word error rate 32.7% lower.
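For context on those figures: word error rate (WER) counts the word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. A minimal Python implementation looks like this:

```python
# Word error rate (WER): the minimum number of word substitutions,
# deletions, and insertions needed to turn the hypothesis into the
# reference, divided by the reference length.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```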
Bringing People Closer with Inclusive Language
Google's Universal Speech Model is an advanced speech recognition tool designed to recognize one thousand of the world's most widely spoken languages.
Its use of self-supervised learning, followed by fine-tuning on a small amount of supervised data, is a crucial step toward solving the tool's scalability and computational-efficiency challenges while expanding language coverage and quality.
This technology could increase global inclusion by making communication more accessible to billions of people.
Stay posted here at Tech Times.