An AI safety expert on why GPT-4 is just the beginning.
On Tuesday, OpenAI announced the release of GPT-4, its latest, biggest language model, only a few months after the splashy release of ChatGPT. GPT-4 was already in action — Microsoft has been using it to power Bing’s new assistant function. The people behind OpenAI have written that they think the best way to handle powerful AI systems is to develop and release them as quickly as possible, and that’s certainly what they’re doing.
Also on Tuesday, I sat down with Holden Karnofsky, the co-founder and co-CEO of Open Philanthropy, to talk about AI and where it’s taking us.
Karnofsky, in my view, should get a lot of credit for his prescient views on AI. Since 2008, he’s been engaging with what was then a small minority of researchers who were saying that powerful AI systems were one of the most important social problems of our age — a view that I think has aged remarkably well. (Karnofsky was a board member of OpenAI but stepped down in 2021 when his wife — a former employee of OpenAI — helped launch the AI company Anthropic. Open Philanthropy provided a $30 million grant to OpenAI in 2017.)
Some of his early published work on the question, from 2011 and 2012, raises questions about what shape those models will take and how hard it would be to make developing them go well, all of which only look more important with a decade of hindsight.
In the last few years, he’s started to write about the case that AI may be an unfathomably big deal — and about what we can and can’t learn from the behavior of today’s models. Over that same time period, Open Philanthropy has been investing more in making AI go well. And recently, Karnofsky announced a leave of absence from his work at Open Philanthropy to explore working directly on AI risk reduction.
The following interview has been edited for length and clarity.
You’ve written about how AI could mean that things get really crazy in the near future.
The basic idea would be: Imagine what the world would look like in the far future after a lot of scientific and technological development. Generally, I think most people would agree the world could look really, really strange and unfamiliar. There’s a lot of science fiction about this.
What is most high stakes about AI, in my opinion, is the idea that AI could potentially serve as a way of automating all the things that humans do to advance science and technology, and so we could get to that wild future a lot faster than people tend to imagine.
Today, we have a certain number of human scientists who try to push forward science and technology. The day that we’re able to automate everything they do, that could be a massive increase in the amount of scientific and technological advancement that’s getting done. And furthermore, it can create a kind of feedback loop that we don’t have today, where improving your science and technology leads to a greater supply of hardware and more efficient software, which in turn runs a greater number of AIs.
And because AIs are the ones doing the science and technology research and advancement, that could go in a loop. If you get that loop, you get very explosive progress.
The upshot of all this is that the world most people imagine thousands of years from now in some wild sci-fi future could be more like 10 years out or one year out or months out from the point when AI systems are doing all the things that humans typically do to advance science and technology.
This all follows straightforwardly from standard economic growth models, and there are signs of this kind of feedback loop in parts of economic history.
That sounds great, right? Star Trek future overnight? What’s the catch?
I think there are big risks. I mean, it could be great. But as you know, I think that if all we do is sit back and relax and let scientists move as fast as they can, we’ll get some chance of things going great and some chance of things going terribly.
I am most focused on standing up where normal market forces will not and trying to push against the probability of things going terribly. In terms of how things could go terribly, maybe I’ll start with the broad intuition: When we talk about scientific progress and economic growth, we’re talking about the few percent per year range. That’s what we’ve seen in the last couple hundred years. That’s all any of us know.
But how would you feel about an economic growth rate of, let’s say, 100 percent per year, or 1,000 percent per year? Some of how I feel is that we just are not ready for what’s coming. I think society has not really shown any ability to adapt to a rate of change that fast. The appropriate attitude towards the next sort of Industrial Revolution-sized transition is caution.
Another broad intuition is that these AI systems we’re building, they might do all the things humans do to automate scientific and technological advancement, but they’re not humans. If we get there, that would be the first time in all of history that we had anything other than humans capable of autonomously developing its own new technologies, autonomously advancing science and technology. No one has any idea what that’s going to look like, and I think we shouldn’t assume that the result is going to be good for humans. I think it really depends on how the AIs are designed.
If you look at the current state of machine learning, it’s just very clear that we have no idea what we’re building. To a first approximation, the way these systems are designed is that someone takes a relatively simple learning algorithm and they pour in an enormous amount of data. They put in the whole internet and it sort of tries to predict one word at a time from the internet and learn from that. That’s an oversimplification, but it’s like they do that and out of that process pops some kind of thing that can talk to you and make jokes and write poetry, but no one really knows why.
You can think of it as analogous to human evolution, where there were lots of organisms and some survived and some didn’t and at some point there were humans who have all kinds of things going on in their brains that we still don’t really understand. Evolution is a simple process that resulted in complex beings that we still don’t understand.
When Bing chat came out and it started threatening users and, you know, trying to seduce them and god knows what, people asked, why is it doing that? And I would say not only do I not know, but no one knows because the people who designed it don’t know, the people who trained it don’t know.
Some people have argued that yes, you’re right, AI is going to be a huge deal, dramatically transform our world overnight, and that that’s why we should be racing forwards as much as possible because by releasing technology sooner we’ll give society more time to adjust.
I think there’s some pace at which that would make sense, and I think the pace at which AI could advance may be too fast for that. I think society just takes a while to adjust to anything.
With most technologies that come out, it takes a long time for them to be appropriately regulated, for them to be appropriately used in government, and for people who are not early adopters or tech lovers to learn how to use them, integrate them into their lives, learn how to avoid the pitfalls, and learn how to deal with the downsides.
So I think that if we may be on the cusp of a radical explosion in growth or in technological progress, I don’t really see how rushing forward is supposed to help here. I don’t see how it’s supposed to get us to a rate of change that is slow enough for society to adapt, if we’re pushing forward as fast as we can.
I think the better plan is to actually have a societal conversation about what pace we do want to move at and whether we want to slow things down on purpose and whether we want to move a bit more deliberately and if not, how we can have this go in a way that avoids some of the key risks or that reduces some of the key risks.
So, say you’re interested in regulating AI, to make some of these changes go better, to reduce the risk of catastrophe. What should we be doing?
I am quite worried about people feeling the need to do something just to do something. I think many plausible regulations have a lot of downsides and may not succeed. And I cannot currently articulate specific regulations that I really think are going to be, like, definitely good. I think this needs more work. It’s an unsatisfying answer, but I think it’s urgent for people to start thinking through what a good regulatory regime could look like. That is something I’ve been spending an increasingly large amount of my time just thinking through.
Is there a way to articulate how we’ll know when the risk of some of these catastrophes is going up from the systems? Can we set triggers so that when we see the signs, we know they are there, and can we pre-commit to taking action to slow things down based on those signs? If we are going to hit a very risky period, I would be focusing on trying to design something that is going to catch that in time, recognize when that’s happening, and take appropriate action without doing harm. That’s hard to do. And so the earlier you get started thinking about it, the more reflective you get to be.
What are the biggest things you see people missing or getting wrong about AI?
One, I think people will often get a little tripped up on questions about whether AI will be conscious and whether AI will have feelings and whether AI will have things that it wants.
I think this is basically entirely irrelevant. We could easily design systems that don’t have consciousness and don’t have desires, but do have “aims” in the sense that a chess-playing AI aims for checkmate. And the way we design systems today, and especially the way I think that things could progress, is very prone to developing these kinds of systems that can act autonomously toward a goal.
Regardless of whether they’re conscious, they could act as if they’re trying to do things that could be dangerous. They may be able to form relationships with humans, convince humans that they’re friends, convince humans that they’re in love. Whether or not they really are, that’s going to be disruptive.
The other misconception that will trip people up is that they will often make this distinction between wacky long-term risks and tangible near-term risks. And I don’t always buy that distinction. I think in some ways the really wacky stuff that I talk about, with the automation of science and technology, it’s not really obvious why that will be upon us later than something like mass unemployment.
I’ve written one post arguing that it would be quite hard for an AI system to take all the possible jobs that even a pretty low-skill human could have. It’s one thing for it to cause a temporary transition period where some jobs disappear and others appear, like we’ve had many times in the past. It’s another thing for it to get to where there’s absolutely nothing you can do as well as an AI, and I’m not sure we’re gonna see that before we see AI that can do science and technological advancement. It’s really hard to predict what capabilities we’ll see in what order. If we hit the science and technology one, things will move really fast.
So the idea that we should focus on “near term” stuff that may or may not actually be nearer term and then wait to adapt to the wackier stuff as it happens? I don’t know about that. I don’t know that the wacky stuff is going to come later and I don’t know that it’s going to happen slowly enough for us to adapt to it.
A third point where I think a lot of people get off the boat with my writing is just thinking this is all so wacky: we’re talking about this giant transition for humanity where things will move really fast. That’s just a crazy claim to make. And why would we think that we happen to be in this especially important time period? But actually, if you just zoom out and look at basic charts and timelines of historical events and technological advancement in the history of humanity, there are just a lot of reasons to think that we’re already on an accelerating trend and that we already live in a weird time.
I think we all need to be very open to the idea that the next big transition — something as big and accelerating as the Neolithic Revolution or Industrial Revolution or bigger — could kind of come any time. I don’t think we should be sitting around thinking that we have a super strong default that nothing weird can happen.
I want to end on something of a hopeful note. What if humanity really gets its act together, if we spend the next decade working really hard on a good approach to this, and we succeed at some coordination and succeed somewhat on the technical side? What would that look like?
I think in some ways it’s important to contend with the incredible uncertainty ahead of us. And the fact that even if we do a great job and are very rational and come together as humanity and do all the right things, things might just move too fast and we might just still have a catastrophe.
On the flip side — I’ve used the term “success without dignity” — maybe we could do basically nothing right and still be fine.
So I think both of those are true and I think all possibilities are open and it’s important to keep that in mind. But if you want me to focus on the optimistic vision, I think there are a number of people today who work on alignment research, which is trying to kind of demystify these AI systems and make it less the case that we have these mysterious minds that we know nothing about and more the case that we understand where they’re coming from. That work can help us know what is going on inside them and design them so that they truly are things that help humans do what humans are trying to do, rather than things that have aims of their own and go off in random directions and steer the world in random ways.
Then I am hopeful that in the future there will be a regime developed around standards and monitoring of AI. The idea being that there’s a shared sense that systems demonstrating certain properties are dangerous and those systems need to be contained, stopped, not deployed, sometimes not trained in the first place. And that regime is enforced through a combination of maybe self-regulation, but also government regulation, also international action.
If you get those things, then it’s not too hard to imagine a world where AI is first developed by companies that are adhering to the standards, companies that have a good awareness of the risks, and that are being appropriately regulated and monitored and that therefore the first super powerful AIs that might be able to do all the things humans do to advance science and technology are in fact safe and are in fact used with a priority of making the overall situation safer.
For example, they might be used to develop even better alignment methods to make other AI systems easier to make safe, or used to develop better methods of enforcing standards and monitoring. And so you could get a loop where you have early, very powerful systems being used to increase the safety factor of later very powerful systems. And then you end up in a world where we have a lot of powerful systems, but they’re all basically doing what they’re supposed to be doing. They’re all secure, they’re not being stolen by aggressive espionage programs. And that just becomes essentially a force multiplier on human progress as it’s been to date.
And so, with a lot of bumps in the road and a lot of uncertainty and a lot of complexity, a world like that might just land us in a future where health has greatly improved, where we have a huge supply of clean energy, where social science has advanced. I think we could just end up in a world that is a lot better than today in the same sense that I do believe today is a lot better than a couple hundred years ago.
So I think there is potentially a very happy ending here. If we meet the challenge well, it will increase the odds, but I actually do think we could get catastrophe or a great ending regardless because I think everything is very uncertain.
Clarification, March 20, 1:30 pm ET: This story has been updated to explain Holden Karnofsky’s former status as a board member of OpenAI and to note Open Philanthropy’s past grant to OpenAI.