As impressive as generative AI looks, researchers at Harvard, MIT, the University of Chicago, and Cornell concluded that LLMs are not as reliable as we believe. Even a major company like Nintendo has distanced itself from the technology in its game development.
Despite tremendous growth, these AI systems remain inconsistent and inaccurate in unpredictable real-world conditions.
Why GenAI Models Aren't Fully Reliable Yet
Although they perform impressively at generating text, writing code, and many other tasks, LLMs falter when the task or environment changes. This drawback calls into question the trustworthiness of these models in real-world applications, where adaptability and reliability matter most, Interesting Engineering reports.
It was recently shown that GenAI models lack an internal "understanding" of the data they process when faced with dynamic tasks.
Examining AI Performance in Real-World Scenarios
In one experiment, researchers set out to determine how well a popular LLM could provide directions through New York City. Under normal conditions, the model's directions were nearly flawless, and it seemed very capable on the surface. However, when the researchers introduced roadblocks and detours, its accuracy plummeted.
It could not adapt to the altered street layout; it failed to navigate at all, revealing a serious flaw in its "understanding" of the city's geography.
This suggests that although LLMs may "learn" facts about the real world, they do not build the kind of robust, flexible knowledge structures that humans or other sophisticated systems do.
Structural Weakness in LLMs' World Models
LLMs such as GPT-4 are built on an AI architecture known as the transformer. Transformers are trained on enormous language datasets to predict the next word in a sequence, which is what lets them produce human-like responses.
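To make that training objective concrete, here is a minimal sketch of next-token prediction using the small, publicly available GPT-2 model from the Hugging Face transformers library. This illustrates the mechanism described above; it is not the researchers' setup.

```python
# Minimal next-token prediction with a small pretrained transformer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "To drive from Times Square to Wall Street, head"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# The model's "knowledge" is just a probability distribution over the
# next token, given everything it has seen so far.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode([next_token_id]))
```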
The researchers, however, determined that being very good at prediction does not mean these models truly know the world they are describing.
For example, a transformer model can be extremely effective at making valid moves in the board game Connect 4 while still having no understanding of how the game actually works.
To test this, the authors developed two new metrics to check whether such AI models learn coherent "world models," that is, structured knowledge that enables them to act appropriately across diverse scenarios. They applied these metrics to two tasks: navigating the streets of New York City and playing the board game Othello.
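The paper defines its metrics precisely; as a rough, hypothetical sketch of the underlying idea, one can ask how often a model's notion of which moves are valid agrees with the true rules of the environment. All names below are illustrative stand-ins, not the study's code or metrics.

```python
# Hypothetical world-model check: compare the moves a model believes
# are valid in each state with the moves the true rules allow.

def world_model_agreement(states, model_valid_moves, true_valid_moves):
    """Fraction of states where the model's predicted valid moves
    exactly match the ground-truth rules."""
    agree = sum(
        model_valid_moves(s) == true_valid_moves(s) for s in states
    )
    return agree / len(states)

# Toy two-state environment.
TRUE_RULES = {"start": {"a", "b"}, "end": set()}

def true_valid_moves(state):
    return TRUE_RULES[state]

def model_valid_moves(state):
    # A model can predict plausible moves without matching the rules:
    # this one always guesses the same set, right only in "start".
    return {"a", "b"}

print(world_model_agreement(["start", "end"],
                            model_valid_moves, true_valid_moves))  # 0.5
```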
Random Models Outperform Predictive AI
Interestingly, the researchers found that transformer models making random choices often produced more accurate world models than models with higher prediction accuracy. This alone suggests that AI models trained only to predict sequences may not be learning any genuine understanding of the tasks they perform.
When the researchers closed just 1% of the streets on New York City's map, the AI model's accuracy fell from nearly 100% to just 67%, revealing a deep failure of adaptability.
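Here is a rough sketch of this kind of perturbation test, assuming a toy grid "street map" built with the networkx library and shortest paths standing in for the model's proposed routes. The graph, the routes, and the closure rate are illustrative; none of this is the study's code.

```python
# Close ~1% of streets in a graph and measure how many previously
# proposed routes remain drivable.
import random
import networkx as nx

random.seed(0)
G = nx.grid_2d_graph(20, 20)  # toy stand-in for a street map
nodes = list(G)

# Stand-in for "model output": shortest paths between random pairs.
pairs = [(random.choice(nodes), random.choice(nodes)) for _ in range(200)]
routes = [nx.shortest_path(G, s, t) for s, t in pairs]

def route_is_valid(graph, route):
    return all(graph.has_edge(u, v) for u, v in zip(route, route[1:]))

print("before closures:",
      sum(route_is_valid(G, r) for r in routes) / len(routes))

# Close roughly 1% of streets (edges) at random.
edges = list(G.edges)
G.remove_edges_from(random.sample(edges, max(1, len(edges) // 100)))

# A route memorized for the old map can silently become undrivable.
print("after closures: ",
      sum(route_is_valid(G, r) for r in routes) / len(routes))
```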
In the Othello task, one model did manage to form a coherent "world model" of valid moves, but none of the models formed a sound model of New York City navigation.
Implications For Future AI Development
These results suggest that current approaches to building and evaluating LLMs are inadequate for developing reliable, real-world AI systems.
"Often, we see these models do impressive things and think they must have understood something about the world. I hope we can convince people that this is a question to think very carefully about, and we don't have to rely on our own intuitions to answer it," said one researcher, but they emphasized that new approaches that cannot be reduced to predictive accuracy need to be developed if one would like to construct models that really understand the context in which they are deployed.
The scientists hope to apply their new metrics to scientific and real-world problems to find ways of making LLMs more adaptable and reliable.
Acting on these insights in AI engineering can produce systems better suited to real-world applications while laying a stronger foundation for future improvements in artificial intelligence.