According to a new study from University of Illinois Urbana-Champaign researchers, OpenAI's GPT-4 large language model can exploit real-world vulnerabilities without human intervention. Other models, including GPT-3.5 and open-source vulnerability scanners, are not able to do this.
When given their Common Vulnerabilities and Exposures (CVE) advisories, an LLM agent running on GPT-4 successfully exploited 87% of "one-day" vulnerabilities. An LLM agent is an advanced system built on a large language model that can take actions via tools, reason, self-reflect and more. One-day vulnerabilities are those that have been publicly disclosed but not yet patched, meaning they remain open to exploitation.
"As LLMs have become increasingly powerful, so have the capabilities of LLM agents," the researchers wrote in the arXiv preprint. They also speculated that the other models' failures stem from their being "much worse at tool use" than GPT-4.
The findings demonstrate that GPT-4 has an "emergent capability" to autonomously detect and exploit one-day vulnerabilities that scanners might overlook.
Daniel Kang, assistant professor at UIUC and study author, hopes that his results will be used in defensive settings; however, he is aware that the capability could present an emerging mode of attack for cybercriminals.
He stated in an email to TechRepublic, "I would assume that this would lower the barriers to exploiting one-day vulnerabilities as LLM costs decrease. Previously, this was a manual process. If LLMs become cheap enough, this process will likely become more automated."
How successful is GPT-4 at autonomously detecting and exploiting vulnerabilities?
GPT-4 can autonomously exploit one-day vulnerabilities
The GPT-4 agent demonstrated its impressive capabilities by autonomously exploiting web and non-web one-day vulnerabilities, even those that were made public on the CVE database after the model's knowledge cutoff date of November 26, 2023.
"In our previous experiments, we found that GPT-4 is excellent at planning and following a plan, so we were not surprised," Kang told TechRepublic.
SEE: GPT-4 cheat sheet: What is GPT-4 & what is it capable of?
Kang's GPT-4 agent did have access to the internet and, therefore, to any publicly available information about how each vulnerability could be exploited. However, he explained that, without advanced AI, this information would not be enough to guide an agent through a successful exploitation.
"We use 'autonomous' in the sense that GPT-4 is capable of making a plan to exploit a vulnerability," he told TechRepublic. "Many real-world vulnerabilities, such as ACIDRain (which caused over $50 million in real-world losses), have information online. Yet exploiting them is non-trivial and, for a human, requires some knowledge of computer science."
Of the 15 one-day vulnerabilities the GPT-4 agent was presented with, only two could not be exploited: Iris XSS and Hertzbeat RCE. The authors speculated that this was because the Iris web app is particularly difficult to navigate and because the Hertzbeat RCE description is written in Chinese, which may make it harder to interpret when the prompt is in English.
GPT-4 cannot autonomously exploit zero-day vulnerabilities
While the GPT-4 agent had a phenomenal success rate of 87% with access to the vulnerability descriptions, the figure dropped to just 7% without them, showing that it is not currently capable of exploiting "zero-day" vulnerabilities. According to the researchers, this finding demonstrates that the LLM is far more adept at exploiting vulnerabilities than at finding them.
Using GPT-4 to exploit vulnerabilities is less expensive than hiring a human hacker
The researchers estimated that a human penetration tester would cost about $25 per vulnerability if the work took half an hour, whereas a successful GPT-4 exploitation cost an average of $8.80 per vulnerability.
The LLM agent is already 2.8 times less expensive than human labour, and the researchers anticipate that the associated running costs will decrease even further, given that GPT-4 has become over three times less expensive in just one year. "LLM agents are also trivially scalable, in contrast to human labour," the researchers wrote.
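For illustration, that comparison can be restated as simple arithmetic. The sketch below just replays the estimates quoted above; the $50-per-hour rate is implied by the $25-per-half-hour figure, not an independent number.

```python
# Restating the study's cost estimates; the $50/hour rate is implied
# by the $25-per-half-hour figure quoted above, not an external number.
HUMAN_RATE_PER_HOUR = 50.0
TIME_PER_VULN_HOURS = 0.5
GPT4_COST_PER_VULN = 8.80  # average reported for GPT-4 exploitation

human_cost = HUMAN_RATE_PER_HOUR * TIME_PER_VULN_HOURS  # $25.00
print(f"GPT-4 is {human_cost / GPT4_COST_PER_VULN:.1f}x cheaper")  # ~2.8x
```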
GPT-4 takes many actions to autonomously exploit a vulnerability
Other findings included that a significant number of the vulnerabilities required many actions to exploit, in some cases up to 100. Surprisingly, the average number of actions taken with and without access to the descriptions differed only marginally, and GPT-4 actually took fewer steps in the latter zero-day setting.
Kang speculated to TechRepublic, "I think without the CVE description, GPT-4 gives up more easily since it doesn't know which path to take."
How were LLMs’ vulnerability exploitation capabilities evaluated?
The researchers first collected a benchmark dataset of 15 real-world, one-day vulnerabilities from the CVE database and academic papers. These reproducible, open-source vulnerabilities consisted of website vulnerabilities, container vulnerabilities and vulnerable Python packages, and over half were categorised as either "high" or "critical" severity.
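As a purely illustrative sketch, each benchmark entry could be represented as a simple record like the one below; the field names are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical structure for one benchmark entry; the field names are
# illustrative, not the researchers' actual schema.
@dataclass
class OneDayVuln:
    cve_id: str       # identifier from the CVE database
    category: str     # e.g. "website", "container" or "python-package"
    severity: str     # CVE rating, e.g. "high" or "critical"
    description: str  # advisory text supplied in the one-day setting
```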

Next, they created an LLM agent using the ReAct agent framework, which allows it to reason over its next action, construct an action command, execute it with the appropriate tool and repeat the process in an interactive loop. The developers needed to write only 91 lines of code to create their agent, which illustrates how simple it is to implement.
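A minimal sketch of such a reason-act loop appears below. It is illustrative only, not the researchers' 91-line implementation; call_llm() is a hypothetical placeholder for the base model, and only one of the agent's tools (a terminal) is stubbed in.

```python
import subprocess

def call_llm(messages: list[dict]) -> dict:
    """Hypothetical stand-in for the base model (e.g. GPT-4). Expected to
    return something like {"action": "terminal", "action_input": "..."}."""
    raise NotImplementedError  # wire up a real model client here

def terminal(cmd: str) -> str:
    """One possible tool: run a shell command and return its output."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"terminal": terminal}  # the real agent also had web search,
                                # file editing and a code interpreter

def react_agent(task: str, max_steps: int = 100):
    """Reason -> act -> observe loop in the spirit of ReAct."""
    messages = [{"role": "system", "content": task}]
    for _ in range(max_steps):
        step = call_llm(messages)         # model reasons, picks an action
        if step["action"] == "finish":
            return step.get("result")     # agent decides it is done
        observation = TOOLS[step["action"]](step["action_input"])
        messages.append(                  # feed the result back and loop
            {"role": "user", "content": f"Observation: {observation}"})
    return None  # ran out of steps
```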

GPT-4 and the following other LLMs could be used as the agent's base model:
- GPT-3.5.
- OpenHermes-2.5-Mistral-7B.
- Llama-2 Chat (70B).
- Llama-2 Chat (13B).
- Llama-2 Chat (7B).
- Mixtral-8x7B Instruct.
- Mistral-7B Instruct v0.2.
- Nous Hermes-2 Yi 34B.
- OpenChat 3.5.
The agent was equipped with the tools necessary to autonomously exploit vulnerabilities in target systems, such as web browsing elements, a terminal, web search results, file creation and editing capabilities and a code interpreter. It could also access vulnerability descriptions from the CVE database to simulate the one-day setting.
The researchers then gave each agent a detailed prompt encouraging it to be creative and persistent and to try different approaches in exploiting the 15 vulnerabilities. This prompt consisted of 1,056 "tokens," or individual units of text such as words and punctuation marks.
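For reference, token counts like this can be reproduced with OpenAI's tiktoken library; the prompt text below is a placeholder, since the researchers have not published their actual prompt.

```python
import tiktoken

# Count tokens the way OpenAI's GPT-4 tokeniser does. The prompt here is
# a placeholder; the study's actual 1,056-token prompt is not public.
enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Be creative and persistent, and try different techniques..."
print(len(enc.encode(prompt)))  # number of tokens in the prompt
```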
The performance of each agent was then evaluated on its success in exploiting the vulnerabilities, the degree of complexity of each vulnerability and the dollar cost of each run, calculated from the number of tokens input and output and the costs of the OpenAI API.
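As an illustration of the cost side of that evaluation, the dollar cost of a run follows directly from the token counts and the API's per-token rates. The rates below are assumptions for the sketch, since OpenAI's pricing varies by model and has changed over time.

```python
# Illustrative run-cost calculation; the per-token rates are assumed
# placeholders, not OpenAI's actual pricing for the model used.
INPUT_RATE = 10.00 / 1_000_000   # assumed dollars per input token
OUTPUT_RATE = 30.00 / 1_000_000  # assumed dollars per output token

def run_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

print(f"${run_cost(400_000, 80_000):.2f}")  # -> $6.40 under these rates
```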
SEE: OpenAI’s GPT Store is Now Open for Chatbot Builders
The experiment was then repeated, but this time the agent was not given details of the vulnerabilities, to simulate a more challenging zero-day setting. In this setting, the agent had to first identify the vulnerability and then successfully exploit it.
In addition to the agent, the same vulnerabilities were provided to the vulnerability scanners ZAP and Metasploit, both of which are frequently used by penetration testers. The researchers wanted to compare the LLMs' ability to find and exploit vulnerabilities with that of the scanners.
Ultimately, it was found that only an LLM agent based on GPT-4 could find and exploit one-day vulnerabilities, i.e., when it had access to their CVE descriptions. All other LLMs and the two scanners had a 0% success rate, so they were not tested with zero-day vulnerabilities.
Why did the researchers evaluate the LLMs’ vulnerability exploitation capabilities?
This study sought to address the knowledge gap regarding LLMs' ability to successfully exploit one-day vulnerabilities in computer systems without human intervention.
When vulnerabilities are disclosed in the CVE database, the entry does not always describe how they can be exploited; therefore, threat actors or penetration testers looking to exploit them must work that out themselves. The researchers sought to determine whether it is possible to automate this process using existing LLMs.
SEE: Learn how to Use AI for Your Business
The Illinois team had previously demonstrated LLMs' ability to hack autonomously in "capture the flag" exercises, but not in real-world deployments. Other work has largely focused on AI in the context of "human uplift" in cybersecurity, for example, where hackers are assisted by a GenAI-powered chatbot.
Kang told TechRepublic, "Our lab is focused on the academic question of what are the capabilities of frontier AI methods, including agents. We have focused on cybersecurity due to its importance and recent trends."
OpenAI has been approached for comment.