AI models, including the widely used ChatGPT, struggle to analyze Securities and Exchange Commission (SEC) filings, according to a new study conducted by startup Patronus AI.
According to CNBC, the study found that even the best-performing AI model, OpenAI's GPT-4-Turbo, achieved only a 79% accuracy rate when answering questions based on SEC filings.
AI Performance Rate Is 'Absolutely Unacceptable'
The test included scenarios where the AI had access to nearly the entire filing alongside the question. Researchers cited instances wherein the AI models would either refuse to answer or provide inaccurate information not present in the SEC filings.
Patronus AI co-founder Anand Kannappan expressed dissatisfaction with the results, calling that level of performance "absolutely unacceptable" for AI models that are expected to be reliable in automated, production-ready applications.
The study sheds light on AI models' challenges, especially in regulated industries like finance, where accuracy and reliability are crucial for decision-making processes.
The finance industry has shown a keen interest in incorporating AI models, like ChatGPT, for tasks such as summarizing SEC filings and quickly extracting essential financial data.
However, AI's entry into the sector has faced challenges, including inaccurate summaries of earnings press releases and the generation of incorrect figures, CNBC reported.
Patronus AI co-founders highlighted a significant challenge in incorporating Large Language Models (LLMs) into products: their nondeterministic nature.
LLMs do not guarantee consistent output for identical inputs, which makes rigorous testing essential to ensure accurate functionality and dependable outcomes.
The founders underscored the necessity for advanced testing methodologies to validate AI models' proper operation, topical coherence, and reliability, especially in industries subject to regulation.
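To see what that nondeterminism looks like in practice, below is a minimal sketch of a repeated-query consistency check. It is not Patronus AI's tooling; it assumes the OpenAI Python client (openai>=1.0), an API key in the environment, and a purely illustrative question, and the model name is only an example.

```python
# Minimal sketch of a repeated-query consistency check (illustrative, not Patronus AI's code).
# Assumes the OpenAI Python client (openai>=1.0) and OPENAI_API_KEY set in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "Based on the attached 10-K excerpt, what was total revenue for fiscal 2022? "
    "Answer with a single dollar figure."
)

def ask_once() -> str:
    """Send the identical prompt once and return the model's answer text."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",          # model name is illustrative
        messages=[{"role": "user", "content": QUESTION}],
        temperature=1.0,              # default-style sampling, so runs may differ
    )
    return response.choices[0].message.content.strip()

# Ask the same question several times; more than one distinct answer reveals nondeterminism.
answers = Counter(ask_once() for _ in range(5))
for answer, count in answers.most_common():
    print(f"{count}x  {answer}")
```

Because sampling at nonzero temperature is nondeterministic, the counter will often show several distinct answers to the very same prompt, which is exactly the behavior the co-founders say rigorous testing needs to catch.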
FinanceBench
Patronus AI developed a comprehensive test it calls FinanceBench, consisting of more than 10,000 questions and answers drawn from the SEC filings of major publicly traded companies. The dataset includes the correct answers and indicates where in each filing they can be found.
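For illustration only, here is a rough sketch of what a single entry in such a benchmark might look like. The field names, company, and figures are assumptions, not the actual FinanceBench schema.

```python
# Sketch of one question-answer record in a FinanceBench-style dataset (assumed schema).
from dataclasses import dataclass

@dataclass
class FilingQA:
    company: str            # the publicly traded firm the filing belongs to
    filing: str             # which SEC filing the question is drawn from (10-K, 10-Q, ...)
    question: str           # the question posed to the model
    gold_answer: str        # the correct answer
    evidence_location: str  # where in the filing the answer can be found

example = FilingQA(
    company="ExampleCorp",                      # hypothetical company
    filing="10-K (fiscal 2022)",
    question="What was total revenue in fiscal 2022?",
    gold_answer="$12.3 billion",                # illustrative figure, not real data
    evidence_location="Item 8, Consolidated Statements of Operations",
)
print(example.question, "->", example.gold_answer)
```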
Through its testing framework, Patronus AI aims to set a "minimum performance standard" for language AI in the financial sector. The study evaluated four language models: GPT-4 and GPT-4-Turbo from OpenAI, Claude 2 from Anthropic, and Llama 2 from Meta.
The testing involved several configurations and prompts, such as "Oracle" mode, in which models were given the exact relevant source text along with the question, and "long context," in which nearly the entire SEC filing was included alongside the question.
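As a rough, assumption-based sketch (not Patronus AI's evaluation harness), the prompt for each configuration, including the "closed book" setting mentioned below, might be assembled along these lines:

```python
# Sketch of how the three prompt configurations might be built for one benchmark entry.
def build_prompt(question: str, relevant_passage: str, full_filing: str, mode: str) -> str:
    if mode == "closed_book":
        # Question only: the model must rely on whatever it learned during training.
        return question
    if mode == "oracle":
        # The exact relevant source text is handed to the model with the question.
        return f"Context:\n{relevant_passage}\n\nQuestion: {question}"
    if mode == "long_context":
        # Nearly the entire filing is included alongside the question.
        return f"Filing:\n{full_filing}\n\nQuestion: {question}"
    raise ValueError(f"unknown mode: {mode}")

# Usage with placeholder text; real filings run to hundreds of pages.
prompt = build_prompt(
    question="What was total revenue in fiscal 2022?",
    relevant_passage="Total revenue for fiscal 2022 was $12.3 billion ...",
    full_filing="(full 10-K text here)",
    mode="oracle",
)
print(prompt)
```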
GPT-4-Turbo struggled in the "closed book" test but showed improvement in "Oracle" mode. Llama 2 demonstrated significant inaccuracies, while Claude 2 performed well with "long context."
Despite some models performing relatively well, the co-founders emphasized that there is no acceptable margin for error, particularly in regulated industries.
They believe that language models like GPT have significant potential in the finance industry but stress the importance of continuous improvement in AI models to meet the required standards of accuracy and reliability.
"Models will continue to get better over time. We're very hopeful that in the long term, a lot of this can be automated. But today, you will definitely need to have at least a human in the loop to help support and guide whatever workflow you have," Kannapan told CNBC.