MIT Unveils New Algorithm Capable of Learning Language Just by Watching Videos

This new algorithm could potentially decode animal communication.

The Massachusetts Institute of Technology (MIT) has introduced an innovative algorithm that can learn language solely by watching videos.

Mark Hamilton, a PhD student in electrical engineering and computer science, is leading this project alongside his colleagues at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL). Their long-term goal is to use machines to decode animal communication, starting with how human language is acquired.

Inspired by Penguins: Creating DenseAV

The inspiration for this novel algorithm came from an unexpected source: the film "March of the Penguins." In one scene, a penguin falls and emits a groan as it tries to get up. Hamilton observed that this groan seemed to imply a word, sparking the idea that audio and video could be used together to teach language to an algorithm.

This idea led to the creation of DenseAV, a model designed to learn language by predicting visual content from audio and vice versa. For instance, hearing the phrase "bake the cake at 350" would prompt the model to expect visuals of a cake or an oven.

To match audio to video across millions of clips, DenseAV must learn what people are actually talking about. After training the model on this matching task, the research team examined which pixels it focused on when processing sounds.
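The matching task described above can be illustrated with a small contrastive sketch: true audio-video pairs are pulled together and mismatched pairs pushed apart. This is a generic InfoNCE-style objective, not DenseAV's actual implementation; the embeddings, shapes, and temperature here are toy stand-ins.

```python
# Toy sketch of contrastive audio-video matching (illustrative only,
# not the DenseAV codebase): each audio clip should be most similar
# to the embedding of its own video.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(audio_embs, video_embs, temperature=0.1):
    """InfoNCE-style loss: audio clip i should match video clip i."""
    loss, n = 0.0, len(audio_embs)
    for i in range(n):
        sims = [cosine(audio_embs[i], v) / temperature for v in video_embs]
        m = max(sims)  # stabilize the log-sum-exp
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += -(sims[i] - log_denom)  # penalize when own pair isn't top match
    return loss / n

# Hand-picked 2-D embeddings: aligned pairs score lower loss than shuffled ones
audio = [[1.0, 0.0], [0.0, 1.0]]
video_aligned = [[1.0, 0.1], [0.1, 1.0]]
video_shuffled = [[0.1, 1.0], [1.0, 0.1]]

assert contrastive_loss(audio, video_aligned) < contrastive_loss(audio, video_shuffled)
```

Training on this kind of objective over enough video forces the model to associate sounds and words with what is on screen, which is what makes the pixel-attention probing described above possible.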

When the word "dog" was mentioned, the algorithm searched for dog images in the video stream, indicating its understanding of the word's meaning. Similarly, when it heard a dog barking, it looked for dogs in the video.

The team was curious whether DenseAV could differentiate between the word "dog" and the sound of a dog barking. By giving DenseAV a two-branch, "dual-brain" design, they discovered that one side naturally focused on language, like the word "dog," while the other concentrated on sounds, like barking.
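The dual-brain idea can be sketched as two competing scoring heads, where whichever head is more confident handles a given input. The head weights and the two-dimensional features below are hand-picked stand-ins for what a trained model would learn; this is an illustration of the routing idea, not DenseAV's architecture.

```python
# Illustrative "dual-brain" routing (toy example): one head specializes
# in speech-like audio, the other in non-speech sounds.

def head_score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

# Toy 2-D audio features: [speech-likeness, sound-effect-likeness]
spoken_word_dog = [0.9, 0.1]   # someone saying the word "dog"
barking_sound   = [0.1, 0.9]   # a dog barking

language_head = [1.0, 0.0]     # responds to speech-like input
sound_head    = [0.0, 1.0]     # responds to non-speech audio

def route(features):
    scores = {"language": head_score(language_head, features),
              "sound": head_score(sound_head, features)}
    return max(scores, key=scores.get)  # the more confident head wins

assert route(spoken_word_dog) == "language"
assert route(barking_sound) == "sound"
```

In training, this kind of specialization emerges on its own: neither head is told which signals to handle, yet one settles on language and the other on environmental sound.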

Learning language without any text input was a major challenge: the team aimed to rediscover language from scratch, deliberately avoiding pre-trained language models. The approach is inspired by how children learn language by observing and listening to their environment.

DenseAV's Potential Applications

One potential application of this technology is learning from the vast amount of video content uploaded to the internet daily. Hamilton and his team aim to create systems that can learn from instructional videos and other online content.

Another intriguing application is understanding new languages, such as dolphin or whale communication, which lack a written form. The team hopes DenseAV can assist in translating these languages, which have long eluded human understanding.

"Our hope is that DenseAV can help us understand these languages that have evaded human translation efforts since the beginning. Finally, we hope that this method can be used to discover patterns between other pairs of signals, like the seismic sounds the earth makes and its geology," Hamilton said in a statement.

The team's findings were published on the preprint server arXiv.

ⓒ 2024 TECHTIMES.com All rights reserved. Do not reproduce without permission.