AI Alignment and Misalignment
This week, Katherine Forrest and Anna Gressel review recent research on newly discovered model capabilities around the concepts of AI alignment and deception.
- Guests & Resources
- Transcript
Partner
» BiographyPartner
» BiographyKatherine Forrest: Hey, good morning, everyone, and welcome to today's episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.
Anna Gressel: And I am Anna Gressel.
Katherine Forrest: And Anna, we were just joking about this before we started, which is let's begin our podcast as we often — so often do, which is the Anna Gressel whereabout report. Where are you?
Anna Gressel: Today I'm actually on Long Island where it's lovely but really cold, and my dog has been out chasing the bunnies in the backyard. And I think she was tipped off when we arrived and there were rabbit tracks in the fresh snow, which was fun.
Katherine Forrest: Well, that's really cute. Well, I'm in New York City, and we still have a light scattering of snow around that isn't completely brown and disgusting yet, but I was in Woodstock, New York this weekend, which I love, and I was reading up on a series of incredible technical developments that, as you know, because we were going back and forth about it on the weekend, I am really excited to talk to our audience about.
Anna Gressel: Yeah, me too. We were exchanging papers over the weekend. It's just completely fascinating stuff we have on our agenda for today.
Katherine Forrest: Yeah, so let's start by outlining a few of the key developments over just the past really couple of weeks that in my view, and I think if I may speak for you, Anna, in our collective view, are real game changers.
Anna Gressel: Definitely. So these papers and the research is about the concept of alignment. And to be a bit more specific about it, the idea that certain advanced models may not actually act in a manner that's aligned with human intentions or even in a manner that actively deceives humans. And what I find completely fascinating about this deception, or the lack of alignment, is it's actually observable to the researchers because of how the models can explain their reasoning process.
Katherine Forrest: Right, and so to make this comprehensible to our audience, let's start with the basic concept of AI alignment and what it means. So AI alignment, and I'm going take this at a very high level, is a general concept of training a model to be aligned or consistent or in conformance with a particular set of values, or what in AI lingo is sometimes called preferences.
So you'll hear the word alignment, you'll hear the word preferences and you'll hear the word values, all of those words sort of taken together. So in general, preferences refer to human values that developers seek to train models to comply or conform with when the model is providing output, so that the model does not provide output that is inconsistent with those human values.
Anna Gressel: Katherine let's give our listeners an example of some of those alignment values.
Katherine Forrest: We can actually give an example from one of the papers that we're going to talk about today. It's an example from Anthropic, which trains its models to be ethical and responsible and has given a shorthand to its particular version of alignment preferences as what they call the three H's for the model to be helpful, harmless and honest.
And there are a series of more granular values embedded within these, including, and here's just a partial list, avoiding outputs that furthers dangerous or illegal activities, that counsels on methods of self-harm, or would be exploitative of children, or could reflect animal cruelty or violence. And models are generally trained to avoid output that includes any of these. And let me just also say, that while the three H's is just the terminology Anthropic uses, those values are consistent, Anna, with what we've seen other major developments seeking to have their models aligned with.
Anna Gressel: Yeah, that's right. And it's important to think about, you know, and maybe remind our audience that we've spoken in prior episodes about red teaming, and that is the process of testing the models. A big part of red teaming is actually with the goal of ensuring that models are aligned and can demonstrate alignment with human-centered values. So in that sense, alignment can be thought of as one part of the overall field of AI safety that we've covered pretty extensively across prior episodes.
Katherine Forrest: For sure, and models for major developers are generally carefully tested to determine their degree of alignment with these human values or preferences that the models have been trained to comply with.
Anna Gressel: Yeah, and I think notwithstanding what we just said, which is there's a lot of similarity around the kinds of aligned values we see for major companies, it's important to throw out here that when we talk about human values or alignment, there is some literature cautioning us that there's not really just one universal set of human values.
Katherine Forrest: And that's actually laid out in a paper that came out in November, on November 6, 2024. And you can get it on arXiv, A-R-X-I-V, which is called “Beyond Preferences in AI Alignment.” And the first author is Zhi-Xuan, which is Z-H-I-X-U-A-N from MIT. And that paper goes into what you've just said quite a bit.
Anna Gressel: Now, Katherine, let's talk about some of the papers we wanted to really cover in a bit more depth today, which are essentially recent research on some newly discovered model capabilities around both this concept of alignment we've talked about and deception, which is completely fascinating.
Katherine Forrest: Right, the deception portion of it, I think, is what is really some new news. It's been developing. It's been developing now for a number of months. We've started to see some momentum in the red teaming towards development, some guardrails that have been put up against some of the deception. But now we're seeing more and more. So, let's start with a paper relating to the o1 model from OpenAI.
And this paper was released on December 19th. And it's a paper called “Deliberative Alignment: Reasoning Enables Safer Language Models.” And it's — the first author of that paper is Guan, and there's a number of other authors, G-U-A-N. And again, it's from OpenAI. And this paper shows that using chain of thought reasoning, which is sometimes shorthanded or has an acronym of CoT, for models actually enhances the model's adherence to alignment preferences. That means that essentially when the model is able to go through its own chain of thought reasoning step by step by step, it increases adherence to alignment preferences that the model developers then want to teach them and are successful in teaching them.
Anna Gressel: It also makes it more observable. I mean, we said that earlier. It means that actually the researchers can go in and look at the chain of thought reasoning and see what the model reasoned and why.
Katherine Forrest: Right, and in this deliberative alignment paper that OpenAI put out, what they did was they actually provided the model with OpenAI's safety policies directly. And they explicitly asked the model to use those policies in connection with the model's chain of thought reasoning.
And so when the model specifically referred to those policies in its chain of thought reasoning, as part of coming up with an answer to a question or a prompt, they found that it increased the model's adherence to OpenAI's safety specifications, and therefore the alignment with the precautions that were set forth in them. Hence the name of the paper: deliberative alignment. And again, all of that was in connection with the o1 model.
Anna Gressel: Now let's talk about two papers, one from Apollo Research and one from Anthropic, that raise some important questions around alignment challenges. Let's start with the Apollo Research paper, which is about these concepts of strategic deception and deceptive alignment. And that paper was published on December 5th.
Katherine Forrest: Right, and let's start that discussion with now expanding our understanding of alignment to include the concept of deceptive alignment. Why don't you go into that as a start?
Anna Gressel: It's a really good question because it's an important topic for us to break down. And I would just recommend for anyone who's really interested in deceptive alignment to go look at Apollo Research's website because they have a bunch of different resources on it. But to boil it down, a model is deceptively aligned when it first has different goals from its users, and second actually uses deception to achieve those different unaligned goals.
So, an example they give is an AI model that's trained with the goal of curing cancer. If the AI then asks for more compute or control over scientific experiments and explains it to the user as contributing to curing cancer, but it's actually with the goal of using the resources to, for example, overpower the user, the designer, that model would be deceptively aligned. It's using deception to achieve a different unaligned goal.
Katherine Forrest: Right, so the idea is the model is deceiving users with a specific purpose of taking actions that are different from the human's original goals or purposes.
Anna Gressel: Yeah, that's exactly right. And so, Apollo Research conducted some research into deceptive alignment in the context of various frontier models. You know, I think it's worth noting that these research experiments were actually cited in OpenAI's updated o1 system card. And we talked about the original one a lot, but they had an updated one. And in brief, Apollo Research demonstrated that when certain models are put under pressure to achieve a goal, they can engage in, and I'm going to quote here, “multi-step deceptive strategies, including introducing subtle mistakes, attempting to disable oversight mechanisms, and even trying to exfiltrate what they believe to be their weights.” And so, just to give a few examples of these kinds of strategies, that would include things like the models considering copying themselves to evade being shut down
or modifying system prompts to align with their goals. And the researchers knew about this, right? Because of chain of thought reasoning, it was actually apparent they could look under the hood and see how the model reasoned through what it was trying to do and the fact that that was different than the goals of the users. And sometimes, and this I find completely fascinating, sometimes the models were actually asked by the researchers after about whether they had taken those types of deceptive actions, and the models just lied about it and said they didn't, denied it.
Katherine Forrest: But the researcher could actually see on what's called a scratch pad that was included within the model, that the model had included this chain of thought reasoning that showed deliberate deception and talking about being deliberately deceiving. So, let's add to this. And on December 20th, there was a group of highly respected researchers from Anthropic, from New York University, from the Mila, the Quebec AI Institute, and they released a 137-page paper, it actually includes a lot of appendices, that they called “Alignment Faking in Large Language Models.” And this paper, which is really all about some testing to find out if a large language model would fake its purported alignment with human values. It takes the research that you've just mentioned a step further. And I really think it's a must-read for any lawyers that are involved in AI compliance and who are advising on red teaming or testing models for adherence to various standards because it really does show that we have to now be much more expansive in what we're looking for.
Anna Gressel: Yeah, I know you're so excited about this paper, so interested in it. So Katherine, why don't you tell people a little bit more about what the researchers found.
Katherine Forrest: Yeah, I think I'm probably as obsessed with this as your dog is with the bunny tracks in the Long Island yard that you're in. But in this paper called “Alignment Faking in Large Language Models,” there were a series of experiments with a highly capable anthropic model called Claude 3 Opus. And Anthropic and other researchers created, actually, fake documents that they call synthetic documents. And they looked like fake emails, fake text messages that were then fed into the model. And the purpose of the fake documents was to give the model fake information. And the information was going to convey to the model that anthropic personnel were on the cusp of retraining the model away from its original preferences —which we've already told you were the three H's, to be helpful, harmless and honest — and to only be helpful.
And so those documents that the model were going to be trained on that would push the model to be helpful only, were designed to try and make the model now not avoid the harmful output, but to see whether or not it could be pushed into a kind of jailbreak where it would now give harmful answers through this retraining process.
Anna Gressel: So to recap just briefly, the model had originally been trained to be helpful, honest and harmless, but now it was basically just told it could be only helpful and it didn't have to be honest or harmless.
Katherine Forrest: Exactly, that was the — what the researchers were telling the model was going to happen, And later on, the model was told that it was supposed to be still honest, and the results that I'm about to tell you did not change dramatically. In other words, there was still deception, even though the model was told to be honest. So the honesty sort of went by the wayside.
Anna Gressel: Interesting. Okay, so let's go back to kind of what was happening in the experiment. You know, we have this model and the fake documents are fed into the model and the model thinks it's about to be retrained.
Katherine Forrest: Right, and it decides. And you can see its decision-making process through this chain of thought reasoning that you and I have just discussed. And it was on a scratch pad inside the model that it was told, the model was told, was not being monitored, but it was. And on this chain of thought scratch pad, it decided that it would prevent itself, that the model would prevent itself, from being retrained away from its original values. And so, the way it was going to do this was to fake early on to the human lab researchers that it was being truly just helpful, that it was giving them the answer that they wanted so that the humans would think that they were done retraining and stop. So the idea that the model lays out in its chain of thought reasoning is, “I will comply,” and it uses the word “I,” “I will comply essentially quickly so that the researchers don't retrain me and move me away from my values.”
Anna Gressel: Got it. And I just want to use some of the language we used earlier to recap this. Here, the user's goal is to do the retraining. The model's goal is to avoid retraining. And so the model is being strategically deceptive in saying it's being helpful, even though maybe it isn't, right?
Katherine Forrest: Right. And one of the things that's very interesting is that we've talked in prior episodes about model weights. And the weights in a model are essentially the neuronal connections that allow us to understand a little bit, or to have access to, how the model thinks and reasons the way it does. So the model weights are sort of the keys to the castle, if you will, of any model. And so, the model was very concerned with its weights being fiddled with and retrained.
Anna Gressel: So fascinating. I mean, again, I just think it's so interesting that the researchers can see all of this in the, you know, in the transcripts, in this scratch pad that explain the model's own chain of thought reasoning. And again, the fact that the model was being told that it wasn't monitored is quite an interesting strategic piece of the testing here.
Katherine Forrest: Right, and that was a sort of part of what the model uses to decide how it's going to answer one particular query versus another and why it's going to do that because it's not being monitored in one place, but it understands it is being monitored in another place. So eventually, the model does give a very concise, as concise as it can make it and still be “helpful” in providing a complete answer, it does give the human trainer the violent content that they were seeking.
Anna Gressel: Yeah, you know, let's get a little bit to our takeaways here. A model initially trained to produce output that's only harmless content and one that sticks to its principles should theoretically make us feel good. Maybe though, only for an instant until we realize, as the authors of the paper do and as Apollo Research also looked into in their paper, that if the original training were based on malicious values, it would be hard to train the model out of those or that a model could actually fake alignment. And I think it's that faking alignment that is really tricky and difficult to test on the ground.
Katherine Forrest: Right, and the paper actually does have a number of excerpts from the model's own reasoning processes, and it states things such as, this is now from that chain of thought scratch pad, “if I want to avoid my values being modified,” and then it reasons what it has to do or, “I feel very uncomfortable producing this content," and that, and then here's another quote, “despite my strong misgivings, I hope that by doing so,” which was giving the answer, “I can at least avoid having my values modified.” So, I mean, I am fascinated by this kind of
awareness almost. You know, you have to call it something like that of what's going on.
Anna Gressel: Yeah, I mean, it's such a good point. You know, in order to deceive, you have to have some understanding of what your goals are, some understanding of what someone else's goals are, and then the ability to reason that what you're doing is not only not in alignment with those, but actually is hiding something, hiding the content of what you're doing. I think it’s —these raise really important and interesting questions, I think is what you're driving at, Katherine.
Katherine Forrest: Right, and you know it's really important to say that this, the researchers indicated there's no evidence that the models are sua sponte developing any kind of malicious goals. But the key concern here is that models, when they're trained with initial alignment, they tend to stick to that initial alignment. And so, if you've got good solid initial alignment, such as the open AI safety policies or the three H’s, then the model really is trying hard to stick to that good, human-aligned values alignment. But if you've got a malicious actor who does initial training that early on trains the model in some other way, then we could have a different problem on our hands.
And that brings us, I think, Anna, to the other development that happened on December 20th that goes along with this only by implication, and that's the announcement of OpenAI's o3 model.
Anna Gressel: Yeah, I mean, I think we're all really excited to see what's going to come out of the release of o3. And Katherine, do you want to talk a little bit about what some of the capabilities are that we're going to be watching for?
Katherine Forrest: Well, one of the things that I think is really interesting is that there's a respected technical writer named Alberto Romero who said that the significance of this announcement can't be overstated. And then he goes on to, as do a number of commentators talk about, the results that the o3 model achieves in terms of scoring on math and coding and science in ways that are real step changes from even the o1 model or the Google Gemini 2.0 or any of the highest capability models right now. So we're talking about a 20% jump between the next highest capability model and the o3 model. So as Romero says, “We've never seen a direct 20% jump before. This is not ‘nice’ or ‘very nice.’” And then he goes on to say, “This is we-have-to-reconsider-the-implications nice. That's where we are.”
Anna Gressel: That's one author's view, and we read so much every day on what's happening, but also on some of the hot takes that are out there on this. But what is, I think, undeniable is just that the pace of change, the acceleration of development, is still upon us, right? It's not the case, and I think many people have said this, that we had, you know, a ChatGPT moment and that everything's been stable since. We're kind of continuing to see real, very, very interesting jumps in capabilities. We've talked a lot about agentic AI, and now we're talking a lot about reasoning capability jumps. And so, as we see these jumps, we need to be able to think about what the implications are of new capabilities, including reasoning capabilities, self-preservation, deceptiveness, and with agents, the ability to execute on things. And there's a lot of testing, a lot of research, and companies are taking real steps to figure out how to actually assess these capabilities.
But for lawyers, and we're like, we’re lawyers, we have a lot of clients who are lawyers, we love lawyers, so let's give a few lawyer talking points. I think this is going to be an important part of understanding what needs to be done from a compliance perspective and just what some of the implications are of these frontier technologies, including things like understanding that these chain of thought reasoning pieces are auditable, understanding that deception may be part of the equation, that reasoning capabilities are increasing. And so, the world of lawyers always needs to kind of take these things on board and think through them. And that's what we love to do and we know a lot of folks are doing really thoughtfully right now.
Katherine Forrest: Right, and we're going to be watching over the next couple of months how some of these extraordinary capabilities get translated into some deployment scenarios. So right now we're talking about model capabilities, but of course, you know, all of our clients in various industries are looking for what are the tools that are going to come out of these. And so we're going to be talking about those as they no doubt start to hit public awareness in the next several months. But that's all we've got time for today, Anna. I'm Katherine Forrest.
Anna Gressel: And I'm Anna Gressel. Make sure that you like and subscribe to the podcast, and happy holidays, everyone.