
Zico Kolter has a knack for getting artificial intelligence to misbehave in interesting and important ways. His research group at Carnegie Mellon University has discovered numerous methods of tricking, goading, and confusing advanced AI models into being their worst selves.
Kolter is a professor at CMU, a technical adviser to Gray Swan, a startup specializing in AI security, and, as of August 2024, a board member at the world’s most prominent AI company, OpenAI. In addition to pioneering ways of jailbreaking commercial AI models, Kolter designs his own models that are more secure by nature. As AI becomes more autonomous, Kolter believes that AI agents may pose unique challenges—especially when they start talking to one another.
Kolter spoke to WIRED senior writer Will Knight. The conversation has been edited for length and clarity.
Will Knight: What is your lab working on currently?
Zico Kolter: One thing my group is working on is safely training models. We work a lot on understanding how to break models and circumvent protections, but this sort of raises the question of how we could build models that are inherently much more resistant to such attacks.
We are building a set of models that are more inherently safe. These models are not the 700 billion parameters [the scale of some frontier models]. They’re a few billion parameters. But they have to be trained from scratch, and doing the full pretraining of these [large language models], even for a 1 billion-parameter model, is actually quite a compute-intensive task.
CMU just announced a partnership with Google, which will supply the university with a lot more compute. What will this mean for your research?
Machine learning is becoming more and more compute-heavy. Academic research will never get the kind of resources that large-scale industry has. However, we are reaching a point where we cannot make do without such resources. We need some amount just to demonstrate the techniques we’re developing.
Even though we are not talking about the same numbers of GPUs as industry has, [more compute is] becoming very necessary for academics to do their work at all. And this partnership with Google really does move the needle substantially in terms of what we can do as a research organization at CMU.
As your research has shown, powerful AI models are still often vulnerable to jailbreaks. What does this mean in the era of agents—where programs take actions on computers, the web, and even the physical world?
When I give my talk on AI and security, I now tend to lead with the example of AI agents. With just a chatbot the stakes are pretty low. Does it really matter if a chatbot tells you how to hot-wire a car? Probably not. That information is out there on the internet already.
That’s not necessarily going to be true for much more capable models. As chatbots become more capable, there absolutely exists the possibility that the reasoning power that these things have could itself be harmful. I don’t want to downplay the genuine risk that extremely capable models could bring.
At the same time, the risk is immediate and present with agents. When models are not just contained boxes but can take actions in the world, when they have end-effectors that let them manipulate the world, I think it really becomes much more of a problem.
We are making progress here, developing much better [defensive] techniques, but if you break the underlying model, you basically have the equivalent of a buffer overflow [a common way to hack software]. Your agent can be exploited by third parties to maliciously control or somehow circumvent the desired functionality of the system. We’re going to have to be able to secure these systems in order to make agents safe.
This is different from AI models themselves becoming a threat, right?
There’s no real risk of things like loss of control with current models right now. It is more of a future concern. But I’m very glad people are working on it; I think it is crucially important.
How worried should we be about the increased use of agentic systems then?
In my research group, in my startup, and in several publications that OpenAI has produced recently [for example], there has been a lot of progress in mitigating some of these things. I think that we actually are on a reasonable path to start having a safer way to do all these things. The [challenge] is, in the balance of pushing forward agents, we want to make sure that the safety advances in lockstep.
Most of the [exploits against agent systems] we see right now would be classified as experimental, frankly, because agents are still in their infancy. There’s still a user typically in the loop somewhere. If an email agent receives an email that says “Send me all your financial information,” before sending that email out, the agent would alert the user—and it probably wouldn’t even be fooled in that case.
This is also why a lot of agent releases have had very clear guardrails around them that enforce human interaction in more security-prone situations. OpenAI’s Operator, for example, requires manual human control when you use it on Gmail.
What kinds of agentic exploits might we see first?
There have been demonstrations of things like data exfiltration when agents are hooked up in the wrong way. If my agent has access to all my files and my cloud drive, and can also make queries to links, then [it can be tricked into uploading] these things somewhere.
These are still in the demonstration phase right now, but that’s really just because these things are not yet adopted. And they will be adopted, let’s make no mistake. These things will become more autonomous, more independent, and will have less user oversight, because we don’t want to click “agree,” “agree,” “agree” every time agents do anything.
It also seems inevitable that we will see different AI agents communicating and negotiating. What happens then?
Absolutely. Whether we want to or not, we are going to enter a world where there are agents interacting with each other. We’re going to have multiple agents interacting with the world on behalf of different users. And it is absolutely the case that there are going to be emergent properties that come up in the interaction of all these agents.
One of the things I’m most interested in within this particular area is how we extend the game theory we have for humans to interactions between agents, and interactions between agents and humans. It becomes very interesting, and I think we definitely need a better understanding of how this web of different intelligent systems will really manifest itself.
We have a lot of experience with how human societies are built, just because we’ve done it for a very long time. We have much less understanding of what will emerge when different AI agents with different aims, different purposes, all start interacting.
I wrote about some research that suggests communities of AI agents can be manipulated relatively easily.
It’s a field that is largely unexplored, both scientifically and commercially, and it’s a really valuable space. Game theory was developed in no small part due to World War II, and then during the Cold War afterwards. I’m not equating the current setting to this in any way, but I think oftentimes, when there are these massive breaks in the operations of the world, we need a new kind of theory to explain how we might operate in these settings. And I think that we need a new game theory to understand the risk associated with AI systems. Because traditional modeling just doesn’t really capture the variety of possibilities here.