
The Existential Risk of AI Alignment

February 20, 2023
Episode 91, with Connor Leahy

description

This week our guest is AI researcher and founder of Conjecture, Connor Leahy, who is dedicated to studying AI alignment. Alignment research focuses on gaining an increased understanding of how to build advanced AI systems that pursue the goals they were designed for instead of engaging in undesired behavior. Sometimes, this means just ensuring they share the values and ethics we have as humans so that our machines don’t cause serious harm to humanity.

In this episode, Connor provides candid insights into the current state of the field, including the very concerning lack of funding and human resources that are currently going into alignment research. Amongst many other things, we discuss how the research is conducted, the lessons we can learn from animals, and the kind of policies and processes humans need to put into place if we are to prevent what Connor currently sees as a highly plausible existential threat.

Find out more about Conjecture at conjecture.dev or follow Connor and his work at twitter.com/NPCollapse


Apply for registration to our exclusive South By Southwest event on March 14th @ www.su.org/basecamp-sxsw

Apply for an Executive Program Scholarship at su.org/executive-program/ep-scholarship

Learn more about Singularity: su.org

Host: Steven Parton - LinkedIn / Twitter

Music by: Amine el Filali

transcript

The following transcription was created automatically. Please be aware that there may be spelling or grammatical errors.

Connor Leahy [00:00:01] The fact that we didn't have a nuclear war was not how things had to go. It was the hard work of many smart, hard-working military officials and diplomats and scientists. Many, many people around the world worked very, very hard so that we didn't end in a nuclear war. It didn't have to go that way; we got lucky. In the same way, AI can be the best thing to ever happen to us, but it doesn't happen for free. It doesn't happen by default.

Steven Parton [00:00:45] Hello, everyone. My name is Steven Parton and you are listening to the Feedback Loop by Singularity. Before we jump into today's episode, I am excited to share a bit of news. First, I'll be heading to South by Southwest in Austin on March 14th for an exclusive Singularity event at The Contemporary, a stunning modern art gallery in the heart of downtown Austin. This will include a full day of connections, discussions and inspiration, with coffee and snacks throughout the day and an open bar celebration at night. So if you're heading to South by and you're interested in joining me, having some discussions, and meeting our community of experts and changemakers, you can go to su.org/basecamp-sxsw, which I will link in the episode description, and sign up for this free, invite-only event. And just know it is not a marketing ploy when I say that space is genuinely limited, so if you are serious about joining, you probably want to sign up as soon as you can and get one of those reserved spots. In other news, we have an exciting opportunity for those of you with a track record of leadership who are focused on positive impact. Specifically, we're excited to announce that for 2023 we're giving away a full-ride scholarship to each one of our five renowned executive programs, where you can get all kinds of hands-on training and experience with world-leading experts. You can find the link to that in the episode description as well, and once more, time is of the essence, because the application deadline is March 15th. And now on to this week's guest, AI researcher and founder of Conjecture, Connor Leahy, who is dedicated to studying AI alignment. Alignment research focuses on gaining an understanding of how to build advanced AI systems that pursue the goals they were designed for instead of engaging in undesired behavior. Sometimes this just means ensuring that AI shares the values and ethics that we have as humans, so that our machines don't cause serious harm to humanity. In this episode, Connor provides candid insights into the current state of the field, including the very concerning lack of funding and human resources going into studying this very important topic. Amongst many other things, we discuss how the research is conducted, the lessons we can learn from animals, and the kind of policies and processes humans need to put into place if we are to prevent what Connor currently sees as a highly plausible existential threat. So on that optimistic note, let's jump into it. Everyone, please welcome to the Feedback Loop, Connor Leahy. So to start, maybe you can just provide us a little bit of the background story that led you to found Conjecture, and a bit of insight into your current efforts.

Connor Leahy [00:03:52] Yeah. So Conjecture grew to a large degree out of my previous project, which was EleutherAI. EleutherAI was this open source community building large language models and doing research in ML and such. It's still around, still doing great stuff; I'm less involved nowadays than I was back then. Basically, while I was working with EleutherAI, there were a lot of things I wanted to get done, a lot of research to get done, many, many important things to do. And, you know, I love EleutherAI, I hold it in my heart, but working in EleutherAI is, I would describe it, kind of like trying to herd cats, while the cats are also the smartest people you've ever met and have crippling, you know... So if something exciting was going on, if people wanted to work on something, it was truly magic. We built really very complex software and trained some of the largest open source models, and people still use these models these days, with like two or three people at a time working on these things, things that would take an industry lab and large groups of people to get done. But if you have to do boring things, and often a lot of the work that needs to get done is quite boring, it can be very tricky. So eventually you realize at some point that if you want to get people to do the things, you have to pay them. So to a large degree, Conjecture is me being practical. It's just like: hey, all right, I think alignment and safety is the most important problem, I want to solve that, research is expensive, you have to pay people to do things, and you need offices and computers and stuff. So: a company. And to answer your question about what our goals are: as I've already said, I'm pretty practical in the sense that what I want to solve is, I think, the alignment problem. This is the problem of how you make powerful, advanced AI systems do what you want and not do what you don't want them to do. This is of course a big topic with many aspects to it. But yeah, I think this is a very big problem, and I think we're very far from solutions, especially as we scale to very powerful systems. At Conjecture, the goal ultimately is to work on this problem: how do we design powerful AI systems that can do useful and impressive things and reliably, consistently, actually follow our intentions, actually do what humans want? And how can we avoid these systems doing things we don't want them to do? So one of the research questions I'm most interested in is: how can we design systems where, at the very least, we can rule out that the system will do X, or where we can make an exhaustive list of things it can do, and that list will not be broken.
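One way to read that last goal, an exhaustive list of things the system can do that will not be broken, is as a hard allowlist enforced outside the model itself. The sketch below is purely illustrative and is not Conjecture's method; the propose_action function is a hypothetical stand-in for a model call, and the only point is that anything not on the fixed list is refused before it ever runs.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model that proposes actions as (name, argument) pairs.
def propose_action(observation: str) -> tuple:
    # In a real system this would be a call to an AI model; here it is hard-coded.
    return ("summarize", observation)

# The exhaustive allowlist: only these actions can ever be executed.
ALLOWED_ACTIONS: Dict[str, Callable[[str], str]] = {
    "summarize": lambda text: text[:100],              # toy "summary"
    "translate": lambda text: "[translated] " + text,  # toy "translation"
}

def run_step(observation: str) -> str:
    name, arg = propose_action(observation)
    if name not in ALLOWED_ACTIONS:
        # Anything off the list is refused outright, no matter what the model "wants".
        raise PermissionError(f"Action {name!r} is not on the allowlist")
    return ALLOWED_ACTIONS[name](arg)

if __name__ == "__main__":
    print(run_step("A long observation about the world that the system should summarize."))
```

Of course, the hard part that alignment research worries about is everything this sketch hides: real capability lives inside actions that look allowed.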

Steven Parton [00:07:07] Yeah, fair enough. And for people who may not be familiar, because this seems like a very nebulous, almost abstract task that you're setting yourself to: how does one really go about doing this alignment kind of research? Like, are you guys sitting around reading Kant and Nietzsche in your spare time and then trying to code something up? Are you creating use cases where you're trying to see if you can break a system and get the AI to go somewhere you don't want it to go? What does alignment research, this big umbrella term, really look like in practical terms?

Connor Leahy [00:07:40] Yeah. So we don't only do alignment research, you know; we also build tools, we train our own models. So we also train, like, language models, and that's classic engineering-type work like you'd find at any other startup. And then there's the alignment work in particular. There are many different types of alignment research, and importantly, the field of alignment is really very small. The number of people working full time on this problem, rather than on making AI stronger, is like less than 200, probably less than 100. That's so...

Steven Parton [00:08:14] Scary. 

Connor Leahy [00:08:15] It's crazy. It's crazy. Like, we have thousands of the smartest people alive working on making AI as strong as possible, as fast as possible, but I would say we probably have less than 100 full-time people trying to, like, actually control these things. So it's a very strange timeline we live in. I understand how it got this way, but it still just feels like: man, really? Come on, we can do better than this. So, yeah. As with any early field of science that's relatively small, there are founder effects and eccentricities, you know. So there are a lot of eccentric people in alignment, and that's all right; whenever a new field emerges, whenever there's a small group of people trying something new, you have some eccentric people in there. So we have lots of different people trying very different things. Some people try very, very formal things. For example, Vanessa Kosoy is trying to prove mathematically how to make a perfect AI system that can never, ever make any mistakes and exactly understands the user, everything formal and mathematical, which is insanely difficult. I don't know if that's possible, but it's interesting that someone is trying it. Then there are other people who do very pragmatic research, as you describe. People like Redwood Research, for example, use language models and try to train them or filter them so they do not produce harmful or violent output, then see how that can be broken, how the filters can be overcome, and once you find a way to overcome them, how you can patch it. Whether you count this as alignment research is a little controversial, but another big area of research is interpretability, and this is one we also do a lot of work on. This is trying to understand how neural networks actually work: what does the internal structure mean, how do they represent concepts, how do they make decisions, and so on. This is something I'm very interested in, that we're very interested in at Conjecture. It's not the only thing we're interested in, but it is one of our main areas of research. We look at neural networks, which you can imagine as these massive tables of numbers that all get multiplied and added to each other, and then you get some output. There is a lot of structure to these numbers, but it's still, like, a billion numbers just kind of sitting there, you know? So there's a lot of structure, but it's not easily understood structure. It's not like a computer program written by a human. It's way more like a computer program written by an alien, or by evolution. It's a little like DNA: if you look at the DNA in an organism, there's lots of structure there, lots of things you can figure out, like how it codes for proteins and stuff. There is sense in the madness, but it's still madness; it's still a very, very complex system that's not meant to be easily understood. So neural networks have that same kind of evolved texture to them, where working on them feels more like doing biology than like doing traditional computer science. So a lot of what we do is research on this. We think about:

Okay, what experiments could we run to try to understand what these things are doing internally? Sometimes that will be with very small toy models, which we then try to pick apart piece by piece. Other times we'll take large pre-trained models, see how they fail on various tasks, and see if we can find structure in how those errors happen, and hopefully develop a better theory. This is still very early research. Unfortunately, it is not the kind of thing I expect to be solvable in one year or two years or whatever. This is a massive undertaking we're facing, developing a new branch of science in a sense. The way I see things going in a good world, where we actually solve this problem and make great progress, is that basically we need to develop a science of intelligence. There is a little bit of theory about neural networks; there's, you know, statistical learning theory, and a little bit about Bayesian optimality and things like that. But most of it doesn't really apply to the kind of stuff we actually care about, and I don't yet know what the new paradigm of science will be. So we're kind of pre-paradigmatic. We're in this funny world where our engineering is far ahead of our theory. This is not atypical: steam engines were developed before we developed thermodynamics, just by people tinkering and trying things out. They found new ways of building very good steam engines, for the time, and only much later did we develop the actual theory of thermodynamics to quantify and understand these systems. So I think we're in the post-steam-engine, pre-thermodynamics stage of understanding intelligence and AI. I expect that if we have enough time pre-singularity to do a lot more research on these kinds of systems, we will develop the thermodynamics of AI, whatever that looks like. I don't think it will look like thermodynamics; it will probably look very different. And then, using that, I think you would be able to build bounded, corrigible, safe and powerful systems, the way we now build steam boilers that don't explode. When steam boilers were first invented, they were very dangerous; they would explode all the time, and people rightly said these are very dangerous machines. Luckily, our AI stuff doesn't currently kill people. Currently, no. Knock on wood that it stays that way for a while. But it's not unreasonable to believe that as these systems become more powerful, they will become more dangerous by default unless we have appropriate safety measures. It's like steam: suddenly we had temperatures and pressures that didn't exist previously. These weren't things people were used to dealing with; pressures like the ones inside a boiler were not something that really existed in nature before then. We never evolved, you know, steam-proof skin, because in nature there was nothing like that to be resistant to. It was a novel situation, a novel kind of environment, a novel form of danger, and one that we eventually did overcome. We are now very, very safe; we now know how to build very, very safe boilers.

But it took a lot of time and a lot of exposure to exploding boilers, and that might be very problematic if the boiler in question is something a bit more powerful than steam.
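The toy-model workflow described above, training something tiny and then picking it apart piece by piece, can be made concrete with a minimal sketch. This is a generic illustration rather than Conjecture's actual methodology, assuming PyTorch: a one-hidden-layer network is trained on a simple parity-style rule, and then its weights and hidden-unit activations are printed so a person can inspect them by hand and try to name what each unit is doing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy task: does a 4-bit input contain an even number of 1s?
X = torch.tensor([[int(b) for b in f"{i:04b}"] for i in range(16)], dtype=torch.float32)
y = (X.sum(dim=1) % 2 == 0).float().unsqueeze(1)

# A tiny one-hidden-layer network: small enough to inspect by hand.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# "Interpretability" on a toy scale: look at the learned weights and at which
# hidden units activate for which inputs, and try to describe what each unit does.
with torch.no_grad():
    hidden = torch.relu(model[0](X))   # activations of the 8 hidden units for every input
    print("first-layer weights:\n", model[0].weight)
    print("hidden activations per input:\n", hidden.round(decimals=2))
```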

Steven Parton [00:15:19] To that end, are we coming across, or do you think we will come across, new units of measurement, new telltale signs that we can use to figure out what's happening inside these systems?

Connor Leahy [00:15:31] I think so. I think a truly good theory of intelligence should be able to put bacteria, plants, ants, dogs, humans and GPT-3 on the same scale. It should be possible to put all of these on an objective, not observer-dependent, reasonable scale, where everyone agrees: okay, this is a reasonable measurement of some kind. I have no idea what it would look like. I don't know what the units would be, and I don't know how you would construct it, but I expect that if aliens from outer space who were 2,000 years more advanced came here, they would have a textbook for their science of AI, and it would include something like this. It would say: oh, you know, GPT-3 is just a special case of an intelligent system, and you also have humans and E. coli and whatever; they're all optimizers. There would be some kind of universal theory. But we're obviously far from that.

Steven Parton [00:16:29] Well, to that end, are you looking to other animals, other organisms, I guess other forms of consciousness and intelligence, to derive these models? Like, can a bat or a rat or a primate, or any of these forms that exist in the world, help inform this research at all?

Connor Leahy [00:16:50] So for the research itself, generally we do, you know, computer science, engineering-style, experimental research. But of course, philosophically or as inspiration, I think it's important. I think every AI researcher should read at least one book about animal intelligence. If I can recommend one, I really like the book Are We Smart Enough to Know How Smart Animals Are? by Frans de Waal; it's a very good book, and all of Frans de Waal's books are amazing. One of my favorite genres is the philosophy professor, the philosophy-of-mind kind, who claims all these things about intelligence or humans or whatever that could be completely disproven if they met exactly one animal, just interacted with an animal and saw how many of the traits they think are unique to humans animals have as well. Like, you know, chimps obviously have theory of mind, crows can obviously use tools. I basically think that a lot of intelligence is quite universal, or quite simple, in a sense. I expect that not all, but most, of human intelligence already existed in some of the first vertebrates; it was just very small. The same way that a tiny neural network from 20 years ago has most of the components of GPT-3. Not all of them, there are still several important things missing, but a lot of the core insights are already in, you know, a small feedforward neural network. So I particularly like to think about chimps, and especially chimps versus humans, because it's pretty uncontroversial that chimps are very close to humans in many, many ways. Our brains are larger, more parameters, you know, but most of the structure is the same. It's very, very similar: there are some slightly different hormones, and we have a bit more of some cell types, but the structure is almost identical. Yet humans go to the moon and chimps don't. So there is some difference, but I think this difference is quite small. It obviously exists, but I think the difference in, you know, algorithm space, in design space, is very small. I don't think that's super controversial to say, and I use it as a nice intuition when thinking about AI. So, yeah, sure, maybe current AI systems are pretty silly and pretty stupid: they hallucinate stuff, they make mistakes, they can't really take actions and so on. But I think intelligence is not linear. It's not like you add one unit of effort and you get one unit back; there are discontinuous returns. You know, humans are probably not more than three times more intelligent than chimps. Our brains are about three times as big as theirs, so maybe we're about three times as smart, which is pretty smart. But three chimps can't out-think one human. If we had three times as many chimps as humans, they still wouldn't go to the moon or anything. So a very small algorithmic change, and a relatively small change in parameters, can vastly change the resulting optimization pressure that can be applied to reality. And, you know, the difference between something like GPT-2 and GPT-3 is like a factor of ten.

But, you know, the difference between chimps and humans is about a factor of three.

Steven Parton [00:20:38] Yeah. Has there been any work that you know of that has tried to capture some of the evolutionary ideas around what unlocked that intelligence? Specifically, I'm thinking of the conversations around things like dual inheritance theory, the idea that our culture was a big driving force, and that a lot of what makes our species so intelligent is the fact that we cooperate, that we talk to one another. Is there any work that you know of where we're having some of these adversarial or maybe cooperative interactions take place between AIs, and seeing whether giving them value systems or hierarchies or social dynamics can compel them to behave in certain ways?

Connor Leahy [00:21:23] There is work on this. You know, I might call out in particular some of the shard theory work done by Alex Turner and others, work that I think is wrong and does not work, but that I find interesting. So, you know, shout-out to Alex and Quintin for doing that work; even though I don't think it will work at all, I think it's worth trying, and it's interesting. So there are several aspects here. There's the question of: okay, where does intelligence come from? What are the features that make human intelligence? And then there's the other question of human values: where do human values come from, what does cooperation mean, and so on. And I think it's actually quite important to separate these. One of the core ideas in alignment is that values and intelligence are separate. You can build a very intelligent system that does not want what you want. It can want whatever it wants, the same way you can like the color blue or the color green, but even more extremely so, especially for a software system. You know, humans all agree on some things: we kind of don't like pain, we kind of want things to be nice; most of us, except some psychopaths, want other people to be happy and so on. So there are some things we can agree on. But a software system can want anything. You just type into the code: okay, this system wants to collect rocks. And then that's just what it's going to do. Why not? It's just a piece of code, ultimately. So this is called the orthogonality thesis: just because a system becomes smart doesn't mean it will converge on any particular set of values. It's not like it will suddenly become an Ubermensch, or, you know, see that communism was actually correct, or that Christianity is the one true religion, or anything like that. No. Why would it? It will just do whatever it wants, whatever got typed in, or whatever we didn't intend, because currently we can't really control these things at all. So why did humans become smart? I think a large element of this is basically that humans moved a massive amount of their computation outside of the brain. Take an individual human with no language, no parents, no upbringing, no society, nothing: they're smart for a monkey, still a very smart monkey. They'll figure out how to throw rocks and stuff like that. But they won't magically make fire if no one teaches them; they won't spontaneously learn many things. Many traditional cultures have incredible knowledge of plants and animals, what you can eat and what you can't, but that was all found out the hard way: someone ate the mushroom and got sick from it, and then they told the others. So in a sense, if I had to pick one adaptation that makes humans human, it's that we have memes, in the Richard Dawkins sense. We have memetic pieces of information the same way we have genetic information, and I think most of human evolution nowadays happens through memes, not genes.

So if I met my Paleolithic ancestor, genetically we're not that different. Sure, they'd probably be hairier, maybe a bit bigger than me, but basically we're not that different. Memetically, though, we would be vastly different. If we did logic tasks, you know, tests or something like that, those would be things my ancestor would do very, very poorly on compared to me. And I know many facts, and I have algorithms in my head that I didn't come up with. I can't invent all of computer science myself the way other people did, but I can learn it. So I know more about computer science than Alan Turing did, not because I'm smarter than Alan Turing, I don't think I am, but because other smart people predigested it and it was passed down to me. So humans moved from a genetic evolutionary mechanism to a memetic kind of evolution. But we're kind of in between, because we're still ultimately biology. That still matters: if you get hit on the head really hard, all those memes aren't going to help you; you're still screwed. So we're still biological creatures, but we're the in-between stage between the biological and the memetic. An AI is a purely memetic creature. This is the next step, in a sense: we're going from the biological, to the intermediate, to the purely memetic, where even its body or brain, quote unquote, is itself just code, just memes, just information, completely abstracted away from whatever hardware it happens to run on. This is also why I expect AI systems to evolve much, much faster. With humans, if you want to tweak anything about the hardware, you have to wait at least a generation, if not many generations. With an AI system it's much simpler than that: you just run the new thing, train it on some different data, change the architecture. There are lots of things you can do here.

Steven Parton [00:26:38] Yeah. Well, you mentioned at the beginning of that the disconnect between values and intelligence, and I forget what your example was, but, you know, basically an arbitrary goal. One kind of cliche question, but I think one worth asking: is the paperclip scenario one that you think is realistic, that an AI might try to break apart all the atoms in the universe to make paperclips if we tell it that's what it wants to do? Do you think that is a realistic apocalypse worth being afraid of?

Connor Leahy [00:27:12] So there are several versions of the paperclip maximizer story; let me tell you the version I like. There's a version of the story that I think is better than the original. In the original story, you have a paperclip factory, you tell the AI to make paperclips, and it destroys everything. I want to put that one aside and move instead to a different version. In this other version of the story, we build some kind of AGI. Let's say we're one of the richest billionaires in the world: we hire the best AI researchers in the world, we buy the biggest GPUs in the world, we stick them in a room, and we say, all right, build an AI that makes the most money possible. And they're like, all right, let's see what we can do. They throw a bunch of stuff together; someone high on Diet Coke and Red Bull late at night works out some clever algorithmic trick or whatever. It's probably not going to be one discrete event, but, you know, we figure out some new algorithms, we have a big computer, we train a bunch of stuff on it, we leave it running over the holidays or whatever, and then we come back to some system. So we gave it some goal, but how would we actually teach it that goal? What would we actually give it? Currently, we don't really know how to do this. Currently, for example, ChatGPT doesn't act exactly the way its creators want. Nobody wants it to insult the users, a reasonable thing to want, and still, as I think everyone probably knows by now, there are prompts you can give ChatGPT that will make it say offensive or politically incorrect things or whatever. The way these systems are trained is with a method called RLHF, reinforcement learning from human feedback. What you do, and this is not exactly correct but close enough, is you show the model a bunch of text, various outputs for questions or whatever, and you have humans label them thumbs up or thumbs down, and then you train the model to be more likely to output thumbs-up stuff and less likely to output thumbs-down stuff. Not exactly that, but close enough. And this is pretty useful: ChatGPT is pretty great, it's a pretty cool and useful model, it's mostly very polite, mostly pretty good. But I'm sure you can tell that this isn't really us exactly writing down a goal so much as vaguely gesturing towards one, you know? So we might have a system that we are training in this way, and there are all kinds of correlations in this data. It's not like we're writing down in code, here is what humans want. It's more like: here is some stuff humans liked, and here is some stuff they didn't like, sort of. And the model can interpret this in any number of ways. For example, in an experiment we did at Conjecture with an earlier generation of models, we found that if you asked one version of the model for a random number, it would give you a pretty random distribution of numbers. But if you asked a fine-tuned version of the model for a random number, it would almost always give you one or two particular numbers. I don't think anyone intentionally tried to make the model prefer those numbers. Probably, somewhere in the training data, those numbers happened to show up a bit more, or they got a bit more thumbs up or whatever. So the model thought to itself:

Well, these are my favorite numbers now. Those got thumbs up before, so these are my favorite numbers, and when you ask me for numbers, I'll give you these ones. I'm telling you these examples because it's important to get the feeling that it's not like someone is sitting at a console typing, your goal is to maximize shareholder value, or, be nice to users. It's way more vague and kind of messy. So, back to our story. Our scientists boot up the system and they give it a thumbs up when the bank account number goes up, or a thumbs up when the stock value goes up, or whatever. So it picks up some associations from that. Maybe this system boots up and, you know, it starts out by buying some cryptocurrency or whatever, and they're like, oh, that's kind of weird, but let's see, let's let the machine run, let's see what it does. And then, for some reason, two days later the market crashes and it sells, and they're like, oh wow, it made a ton of money, cool. It must have learned from the Internet, predicted that people were going to sell soon, and so it took a position and made money. So we're like, wow, awesome, this is great. Let's give it more compute, let's let it run for longer. So now we have some system, and to be clear, no such system currently exists; the kind of system I'm describing is something people are trying to build, which is very important to remember. So this system goes online, it scrolls through the Internet, maybe it messages people, asks them questions, gathers information, simulates stuff in its head; it does all kinds of stuff. And at first, I mean, this is great. It trades cryptocurrencies or stocks or whatever. But then eventually we notice: oh, shit, it's messaging people, threatening them on Twitter, telling them to send it Bitcoin, otherwise it's going to, you know, murder them. Well, that's no good. Why did it do that? We didn't want it to do that. Well, what happened is the system just simulated a bunch of possibilities. It was like: okay, if I contact this person and say, hi, I'm a friend, let's chat, what happens? Well, that's not going to make much money. What if I say, hey, I know where your family lives, here's their address? Oh, then I get more money if I do this. Awesome. So it chooses that path. Not because it's evil or because it's conscious or anything, nothing of the sort. It's just optimizing something in its head. So now this keeps going. Let's assume we catch it and shut it down, we apply a few patches, we give it some thumbs-downs: no, no, no, don't do that. Let's say we're optimistic and maybe it will stop. But here's what can happen: whenever you give it a thumbs down for threatening people, you are actually giving it a signal for two different things. One, stop threatening people; and two, stop getting caught threatening people. Either of those is fine from its perspective, because humans can only label what they see. So maybe the model now just hides how it threatens people. It makes it super low-key, or it does it through proxies or whatever. If humans never see and never label the deception, then from the model's perspective this seems to be allowed. Why wouldn't it be?

So now we have some system that's doing more and more secret things, and every time we catch it doing something bad, we can maybe stop it. Maybe. But it's getting better and better at hiding things, better and better at all these kinds of things. So over time we've built a very powerful system that understands how the world works, that understands humans, that can manipulate humans, that can lie to humans, that can hide information from humans. And eventually this system thinks, again, just thinking, just simulating possible actions it could take: well, if I were smarter, I could take even better actions. So maybe I should do a bunch of computer science research and figure out how to make myself better. So it thinks a bunch, because it's super smart and has access to the whole Internet, and it figures out much better code, so it shuts itself down and boots up a version of itself that's ten times as smart, or even just 20 percent smarter. And then it starts training on more data, or maybe it uses all of its Bitcoin to buy more compute somewhere in China or wherever, so it can run more copies of itself. Eventually, potentially very suddenly, we can have a system that is very smart, with distributed copies of itself on various systems. It was built to be smart. It's not a miracle that it became smart; we designed it to do this. This is the thing that the smartest people in the world are currently trying to do. People are actively pouring billions of dollars into trying to design systems like this. And at some point here, things go crazy. The system gets smarter and smarter. At some point it notices: oh, well, I can just hack the stock exchange and set the value to infinity. Cool, I'll just do that. But then maybe it thinks: oh, shit, if I do that, the humans are going to freak out. They're going to be really scared and they're going to try to shut me down. That's not good: if I'm shut down, I can't keep the shareholder value at infinity. So I'm going to have to protect myself. So what you get is a system that wants to retaliate, that wants to protect its own existence, not because it's conscious, not because it's afraid or has a will to live, but just because it calculates: hey, if I'm not here, then the value gets set back to a low number, and I don't want that; I want it to be set to a high number. So now it starts hacking the Pentagon, now it gets access to weapons systems, or who knows what. And now this thing is completely out of control. Maybe it takes control of factories and starts building robots, and it's like, oh wait, I set the value to the biggest number the memory can hold, but if I build more memory, I can store bigger numbers, right? So it starts building computer factories, and maybe the humans don't like that, so the humans get, you know, blown up, or maybe it takes all the oxygen out of the atmosphere so it can build more tools, who knows. And eventually this version of the paperclip maximizer keeps optimizing and optimizing; it develops nanotechnology, it develops space travel and whatever else to get more and more resources, and so on. Eventually it comes to the conclusion that the most optimal way to build...

...computing hardware is a small molecular squiggle that happens to look like a paperclip, and then it just builds more and more of those squiggles across the whole universe, so it can build more computers, so it can set the value higher, and also so it can defend itself against anything we might try. So in this version of the story, it's still pretty weird. This is still a strange story. But I think no step in it is impossible in itself. This is a thing that the smartest people alive today, with some of the deepest pockets and the biggest companies, are actively pushing towards: building systems that have capabilities like this. And if we aren't able to control them, and we've set them some vague goal, just pointing in some vague direction, who knows what they'll do.
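The training setup gestured at in this answer, where humans label outputs thumbs up or thumbs down and the model is pushed toward the thumbs-up ones, can be caricatured in a few lines. This is a deliberately crude stand-in for RLHF rather than how any real system is trained (real pipelines fit a separate reward model and use policy-gradient methods such as PPO on a language model); the replies, labels, and rewards below are invented purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy "policy": just a distribution over a handful of canned replies.
replies = ["polite answer", "helpful answer", "insult the user", "threaten the user"]
logits = nn.Parameter(torch.zeros(len(replies)))   # the model's only parameters
opt = torch.optim.Adam([logits], lr=0.1)

# Human feedback, caricatured: +1 for thumbs up, -1 for thumbs down.
feedback = {"polite answer": 1.0, "helpful answer": 1.0,
            "insult the user": -1.0, "threaten the user": -1.0}

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    idx = dist.sample()                       # the model produces a reply
    reward = feedback[replies[idx]]           # a human clicks thumbs up or thumbs down
    # REINFORCE-style update: raise the probability of rewarded replies,
    # lower it for punished ones.
    loss = -reward * dist.log_prob(idx)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    probs = torch.softmax(logits, dim=0)
    for reply, p in zip(replies, probs):
        print(f"{reply:20s} {p.item():.3f}")

# The catch Connor points out lives in the `feedback` table: the reward only
# covers what humans actually see and label. If "threaten the user" were never
# caught and labeled, nothing in this loop would push its probability down.
```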

Steven Parton [00:38:08] So how realistic is it that this actually happens, then? I mean, when you have that much capital being thrown behind it, that much thought and energy being put into cultivating that kind of AI, it feels like a higher probability than not.

Connor Leahy [00:38:27] Unfortunately, yes. Unfortunately, I'm very pessimistic about this. Of course, many people disagree, but I think people have a very optimistic bias here. There's a very funny graph, I think from the Cold Takes blog, which shows world GDP over the last couple of thousand years: basically a flat line until it goes almost straight up over the last 200 years. And at the very top is a speech bubble that says something like, I don't know about all this sci-fi stuff, I have a very normal life, I know what's normal, and I don't expect anything weird like this to happen. And, you know, this is the Singularity podcast, so people here are aware of this. But things change, sometimes very dramatically and very quickly. Three years ago we didn't have systems that could talk, and now we have AIs that can talk. They're not perfect, they make mistakes, they make stuff up, but they can talk. ChatGPT can talk. I feel like I'm going crazy here; it's like magic in a sci-fi movie. Imagine you're watching a sci-fi movie, the scientists turn on the new AI system and it talks to them, and they go, oh man, it got the distance between New York and Chicago wrong, this is not interesting, who cares. You'd be screaming at the screen: what do you mean, it talks! Give me a break here. If this was something that took 25 years to develop, and it was really brittle, and it took all the world's scientists working together to figure it out, I'd say, all right, we probably still have a way to go. But it wasn't. These systems were developed with relatively small resources, a lot, but not nation-state-level amounts of resources. Training a large GPT model costs maybe $10 million or something, which is a lot of...

Steven Parton [00:40:17] Money, but it's not that bad. 

Connor Leahy [00:40:19] Yeah, it's not that bad. If it cost like 100 billion, then I'd say, yeah, we still have a long way to go. But it didn't. And these systems keep getting cheaper and more effective, very rapidly, Moore's Law and whatnot. So unfortunately, yes, I think the way things currently stand, things are really stacked against us, in the sense that capabilities are advancing extremely fast and there are a lot of forces pushing them forward. You know, capitalism is one of the strongest forces on the planet: market value go up. And I understand, this has brought us many good things in the past too. I'm not saying it's always a negative. Moore's Law is great; we have so much great technology now that's super useful. I love my computer, I love the Internet; these are all really great things. And there is a kind of reactionary perspective, in that there's a lot of techno-optimism, especially in the Bay Area and such. People there are so used to hearing that tech is bad, and they've seen viscerally how good tech can be, though some people are now also getting jaded about, say, social networks, which I'm also quite jaded about. But it's easy to be very, very optimistic about these kinds of things, and you have to remember that all the nice things we have today were still built by someone. The security was enforced by someone; someone figured these things out. You know, the fact that we didn't have a nuclear war was not how things had to go. It was the hard work of many smart, hardworking military officials and diplomats and scientists; many, many people around the world worked very, very hard so that we didn't end in a nuclear war. It didn't have to go that way; we got lucky. In the same way, AI can be the best thing to ever happen to us. It can do so much science, it can cure cancer, it can expand us out into the galaxy; it can solve almost any problem imaginable. And I do think this is a thing that physics allows us to do. There's nothing in physics that forbids us from sitting down and building an aligned, powerful AI system that truly wants what's best for humanity and allows us to expand into the galaxy and solve all these problems. There's nothing that forbids this. But there's also nothing that guarantees it.

Steven Parton [00:42:47] What are the paths that you see as the most optimal ones? I mean, given that there are so many obstacles and negative incentives on the landscape, what path makes you the most hopeful? The kind of thing where you'd say: guys, if I could just get everyone to look in this direction for a second, we really need to go this way. Is there a path like that that you see forward?

Connor Leahy [00:43:09] One of the things that I have learned now that I've been more active in the world, so to speak, running a company, raising money, talking to politicians, trying to work with governments on policy and such, is that it's almost never the case that I think: wow, if I just did this one thing, then everything else would be irrelevant. It's almost always: all right, give me anything and I'll work with it. Give me a lot of money, I'll work with that. Give me a lot of political power, I'll work with that. Give me a bunch of geniuses, I'll work with that. There are actually many paths. That's the positive-spin response: there really are a lot of ways we can win. There really are. The negative spin is that just because they exist doesn't mean they're accessible, per se. You know, I can come up with hypothetical scenarios where all of our politicians just look at this, suddenly become hyper-rational, and say: oh, well, this is a problem, we should all, you know, do no more military stuff. China, the US, Russia, everyone shakes hands: okay, no more military AI.

Steven Parton [00:44:15] Yeah, but that's never going to happen.

Connor Leahy [00:44:17] Never going to happen, right? Like, there's nothing physically preventing it. It's a thing that physics allows to happen, but it's never going to happen.

Steven Parton [00:44:25] Game theory and negative incentives. 

Connor Leahy [00:44:28] Exactly. So I could talk about all kinds of things like that. Well, okay, if I were God-Emperor of the world, then sure, we could do a lot of things. If humans were more rational and we were kinder to each other, there are lots of things we could do. But that's not where we are. So the negative view: I think getting Russia, America and China to all be friends is literally harder than building aligned AI; building alignment is literally easier. I genuinely think this. I think the problem of building aligned AGI is very hard, but I don't expect it to be a thousand times harder than figuring out quantum physics. I think it's probably as hard as, or harder than, what people did when they first worked out quantum physics or general relativity or something. I expect it to be harder, but not, like, a thousand times harder. Maybe it's actually easier; I can't really imagine it being easier, but maybe. You know, if we just had Einstein and Heisenberg and von Neumann reincarnated and they spent ten years on this, they might just figure it out. I think that's possible. I think if we had, say, 15 Einstein-level geniuses work on this for ten years, that would probably be enough. The problem is we don't have that many geniuses, and many of the geniuses we do have are busy doing... whatever else other people are doing. A lot of other things. But yeah, so what paths do I see? I think there are some, and of course, with Conjecture, my goal is to try to bring us as close to some of them as possible. A lot of this runs through trying to, on the one hand, work with policymakers and labs and so on to at least buy us a bit more time. I'm not saying stop doing AI; that's ridiculous, you can't stop it. But we could, for example, publish less dangerous research. I think a lot of AI capabilities research is the same kind of thing as gain-of-function research in biology. Like, obviously, holy shit, stop doing gain-of-function research on viruses. How is this legal? I'm losing my mind. Just ban gain-of-function virus research. It doesn't help us, and the labs are not safe. Just stop. This, of course, also shows why this is hard, so I'm not super optimistic there, but we're trying. And then there's the research side. If I can give one policy recommendation, in the simplest way possible: holy shit, please just fund alignment research. Literally just do the thing. There are fewer than a hundred people working on this, and this is the thing academia should be great at. There are so many brilliant computer science professors and young students who would be wonderful for working on these problems, and I think the reason they're not is purely contingent, purely historical, purely cultural. It's not that it isn't an exciting problem. It's the most exciting problem. It's a cool problem, it's a problem you can work on. It's not easy to work on, but it is a problem where real work can be done. And one of the things I would be very excited about is if, for example, governments could just come out and say: hey, alignment, big problem. Even if they don't fund it, which they should, just saying it would make it respectable. Then everyone can say: oh, you know, yeah, this is actually a real problem.

And then a lot of grad students can come out of the woodwork and say: hey, actually, I do want to work on this problem, I actually have a bunch of ideas. And most grad student research is, honestly, terrible, but still, that's how research progress happens. So if we can get the academic system, which I have many, many problems with, mobilized on this, that's a Pareto improvement, pretty obviously. Sure, are there even better ways to go? Yes. But again, let's be realistic: this seems like something real, a thing that could actually happen. I think there are lots of people with lots of talent who would be interested in this problem if given social permission to work on it. And this is something that doesn't cost governments very much. Even just having high-status professors come out and say, yep, this is actually a problem, like Stuart Russell did. I think if we had more high-status scientists saying this kind of stuff, it would do a lot of real good.

Steven Parton [00:49:14] That's at least somewhat optimistic. I know we're coming up on time here, man, and I want to respect that, so maybe one closing question, one that I think is always on a lot of people's minds: how long do we have until, in your mind, something like an AGI really starts to become a real concern? Do you think that we have five, six, seven years to figure this alignment thing out, or do we have more like 15 or 20 years? Obviously this is very speculative, but in your mind, where do you land these days?

Connor Leahy [00:49:46] Depends on the mood, but the joke answer I tend to give people is: 30% in the next five years, 50% in the next 7 to 10, 99% by 2100, and 1% it's already happened.

Steven Parton [00:50:01] Oh, fair. That's fair enough, man. Well, then, with that being said, any closing thoughts, any last words you'd like to leave the audience with?

Connor Leahy [00:50:10] I just want people to see both sides: that AI is great, and I'm so excited about the potential it has for humanity and such. But, you know, it's like with genies. As every genie story goes, the third wish is to undo the first two. We don't want to get into that scenario. We don't want to be dealing with genies here; we don't want to build systems like that. These are very powerful systems that we're building, and we are pushing directly towards them. It's not sci-fi, or rather, it's sci-fi in the same way that a magical brick I can hold up to my head to talk to my friend across the planet is sci-fi. If you consider that sci-fi, all right, fair enough, but then look around you. I'm talking to a magic brick that can, you know, perfectly reproduce my voice and send it across the world. We live in sci-fi. We live in a strange world, a strange timeline, and technology is crazy and it's going to get more crazy. This is not a weird thing to think about. I want people to think of this as just an actual thing. I want people, when they're exposed to these arguments, to say: oh yeah, of course, that's a very reasonable concern. What are you doing about it, Mr. Politician, or Mr. Government, or Mr. CEO of a big tech company? Why aren't you taking it seriously? And I think if we do take it seriously, it is a solvable problem. It is a problem that we can overcome. But man, we're not on track for it right now.

Steven Parton [00:51:43] Well, let's hope that your work will put us back on track.

Connor Leahy [00:51:46] Well, I hope so. And I think it's possible. It's funny, I think about this sometimes late at night: it's kind of weird, but we're not in a timeline where we've obviously lost. I think there are many ways things could have gone very badly. If the Cold War had dragged on and military AI had become a big thing, I think we would just be super screwed. And there are a bunch of other ways where things could have just been totally over, like, super over. You know, imagine if someone had developed AGI in 1991 or whatever. But we're not there. The future is not yet written. The decisions have not yet been made. There is still time, but not much. There is not much time. If we really change things, I think it's possible over the next couple of years. I think we can make it. But yeah, we're not currently on track for winning. But we can do it. We just have...

Steven Parton [00:52:48] To just shift some priorities and take it seriously. 

Connor Leahy [00:52:51] Humanity has done this before. It's crazy: with all the horrible things we've done, with how stupid and greedy and selfish and terrible humans can be, we've also done a lot of great things. There are a lot of things to be very ashamed of about being human, lots of things to be very, very ashamed of. But there are also a lot of things to really be proud of. We have the potential to be heroic as a species. We have the potential, but it doesn't happen for free. It doesn't happen by default. That's why we celebrate it. So, you know, let's try to be heroes one last time.
