The popularity of AI models such as OpenAI's ChatGPT and GitHub Copilot has led to a significant rise in the use of artificial intelligence in programming.
These AI assistants have proven to be valuable tools for programmers, providing efficient and effective support. However, concerns have arisen about the copyright issues these AI models may cause.
Researchers at the McKelvey School of Engineering at Washington University in St. Louis have responded to the problem by developing an automated testing platform called CodeIPPrompt.
The CodeIPPrompt Platform
This platform aims to evaluate the extent to which language models generate code that violates intellectual property (IP) rights.
The team, comprising assistant professors Ning Zhang and Chenguang Wang, professor Yevgeniy Vorobeychik, and graduate student Zhiyuan Yu, who is the first author of the paper, collaborated with Chaowei Xiao, assistant professor of computer science at Arizona State University.
Yu presented the team's work at the International Conference on Machine Learning in Honolulu. Their analysis revealed that copyright infringement is prevalent in state-of-the-art open-source models, such as CodeRL, CodeGen, and CodeParrot, as well as in commercial products like Copilot, ChatGPT, and GPT-4.
The team developed CodeIPPrompt to raise awareness among users of large language models. According to the team, people who use these models to help write code run a significant risk of unknowingly producing content that infringes on intellectual property rights.
"We developed this tool to help people understand that if they're using these large language models to help write code, there's a good chance they might generate IP infringing content," Zhang said in a statement.
"As users, we have a responsibility to use AI ethically. That's influenced by how we understand AI technology and the content it produces," he added.
Identifying IP Violation
The team noted, however, that CodeIPPrompt cannot definitively determine whether AI-generated code constitutes an IP violation. Whether infringement has occurred remains a legal question that will be resolved in the courts as cases are brought against users of AI tools for copyright infringement.
Nevertheless, the platform can provide users with a risk score, indicating the similarity between the generated code and copyright-protected content. This risk score will help users gauge the potential IP infringement risk associated with using AI in code writing.
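To make the idea of a similarity-based risk score concrete, here is a minimal sketch in Python. It is not CodeIPPrompt's actual method, which the article does not describe. The function names (tokenize, shingles, risk_score), the choice of Jaccard similarity over three-token windows, and the sample snippets are all assumptions made for illustration.

```python
import re


def tokenize(code: str) -> list[str]:
    """Split source code into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(tokens: list[str], n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-token windows (hypothetical helper)."""
    if len(tokens) < n:
        # Too short for a full window: treat the whole snippet as one shingle.
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def risk_score(generated: str, references: list[str], n: int = 3) -> float:
    """Highest Jaccard similarity between the generated code and any reference.

    Illustrative stand-in for a risk score: 0.0 means no shared token
    windows, 1.0 means the token windows are identical.
    """
    gen = shingles(tokenize(generated), n)
    best = 0.0
    for ref in references:
        ref_set = shingles(tokenize(ref), n)
        union = gen | ref_set
        if union:
            best = max(best, len(gen & ref_set) / len(union))
    return best


if __name__ == "__main__":
    # Made-up snippets standing in for licensed code and model output.
    licensed = ["def add(a, b):\n    return a + b"]
    generated = "def add(x, y):\n    return x + y"
    print(f"risk score: {risk_score(generated, licensed):.2f}")
```

A score like this only measures textual overlap. As the team stresses, a high value signals risk and does not establish infringement.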
Zhang is optimistic about the future impact of CodeIPPrompt. The tool promises to guide the ongoing development of AI, leading to potential mitigation strategies and other protective measures against IP violations in AI-generated code.