In collaboration with the University of California, Santa Barbara, researchers at the Amazon Web Services Artificial Intelligence Lab have uncovered a substantial prevalence of faulty machine translations across the web, raising concerns about the reliability and quality of content generated through artificial intelligence (AI).
"The low quality of these ... translations indicates they were likely created using machine translation," the authors wrote. "Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web."
Analyzing 6 Billion Sentences Online
According to Tech Xplore, after analyzing over six billion sentences online, the researchers discovered that more than half had undergone translation into two or more languages, with a significant portion exhibiting poor translation quality.
Moreover, the study highlighted a concerning trend: as these translations underwent further iterations, into as many as eight or nine languages, the quality deteriorated markedly.
In their report titled "A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism," the researchers expressed apprehension about training multilingual large language models on both monolingual and bilingual data scraped from the web.
The study revealed that texts are not only being translated by AI but also being created by AI. The share of AI-generated translations was particularly high in lower-resource languages such as Wolof and Xhosa, both of which are African languages.
The researchers found that highly multi-way parallel translations are of significantly lower quality than two-way parallel translations. This means regions under-represented on the web, such as African countries and other nations with more obscure languages, will face greater challenges in establishing reliable AI large language models.
Lacking native resources to draw upon, these regions must rely heavily on the tainted translations flooding the market.
Mehak Dhaliwal, a former applied science intern at Amazon Web Services, noted that colleagues working on machine translation in low-resource languages observed a pervasive presence of machine-generated content in their native languages on the internet. Dhaliwal cautioned users to be aware that content they encounter on the web might be machine-generated.
Bias in Selecting Content for AI Training
The researchers also identified bias in selecting content for AI training, with machine-generated, multi-way parallel translations dominating the total translated content in lower-resource languages.
According to the researchers, this content is often simpler and lower in quality. They speculate that it is produced to generate ad revenue, which contributes to the potential spread of inaccurate information.
The study's findings underscore the challenges posed by machine-generated translations, highlighting concerns about the accuracy, fluency, and reliability of content generated through AI systems.
As the prevalence of machine-generated content continues to grow, it becomes crucial to address the associated issues to ensure the integrity of information accessible on the web. The study's findings were published on arXiv.