AI Interpretability
In this week’s episode, Katherine and Anna unravel interpretability, an evolving field of AI safety research, while exploring its potential significance for regulators as well as its relation to concepts like explainability and transparency.
Katherine Forrest: Hey, good morning, everyone, and welcome to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.
Anna Gressel: And I’m Anna Gressel.
Katherine Forrest: And Anna, since we always talk about our world travels in these podcasts, I'm just back from Denmark, where I was attending my brother's wedding, and I was thinking a little bit about the Danish concept of hygge, which is sort of a cultural philosophy about taking time to be among those who are close to you, away from the daily rush of life. It's also, by the way, something that's really good for the month of August.
Anna Gressel: I have to say, I hope our podcast offers that little daily break for some people. You know, selfishly, I'd like to think we're all sitting around over a cup of coffee together, hanging out and talking about AI, and that's a nice little moment of respite from the days that we sometimes each have.
Katherine Forrest: Although I would say to those who are actually listening to this at some time other than coffee time, that's okay too. I want to tell you, I want to affirm, that you are accepted.
All right, so today we're going to talk about a concept that really ties into some of our earlier episodes, and that's interpretability.
Anna Gressel: Yeah, I'm looking forward to this discussion today, Katherine, because I think interpretability is a really tricky concept. And I know you and I were talking about that beforehand. So we should really break it down and talk about what interpretability is and why it's such a big concern for regulators, particularly now. Why is it coming up? And why are we hearing about it more in connection with the EU AI Act, the Executive Order and other kinds of regulatory levers?
Katherine Forrest: And, you know, I think you and I agree, even if we might disagree on some of the nuances of interpretability, that with narrow AI, which was the kind of single-task AI that was the generation before generative AI, interpretability wasn't as complicated or nuanced an issue as it is today, now that we're dealing very frequently with neural networks. With narrow AI, you could interpret how the model worked in a way that was, frankly, a whole lot easier than it is today. I mean, you'd agree with that, Anna, wouldn't you?
Anna Gressel: Definitely. I mean, I think the same is true for concepts like explainability and transparency. With traditional machine learning, it was actually easier to understand not only what the data sets were and what the outcomes were, but how the data sets were actually turned into the outcomes. That process wasn't necessarily completely transparent or completely interpretable, but there was a sense of how the inputs were driving real consequences, and that was helpful, and easier, than what the generative AI landscape looks like today.
Katherine Forrest: All right, so let's talk, and I'll go first and then I want to hear what you've got to say, about our concepts of interpretability and how it differs today from the concepts of transparency and explainability. And the importance of this, just to sort of preview it for our audience, is that each of these three words has taken on some independent meaning with regulators, and so having interpretability alone, without transparency or explainability, isn't enough. Particularly when you've got a model that has some form of human impact, you've got to have interpretability, transparency and explainability.
So, interpretability in my mind today, based upon the literature I've seen, is looking at really three different things. It's looking at the ability to understand or interpret, sometimes inferentially, what the model is intended to do; how it does it, which can be incredibly complicated with a neural network; and whether it does it in the way that it was originally intended. And I differentiate that from explainability because I think of interpretability as a preliminary step to explainability. You've got to be able to first interpret and have some basic understanding of the how before you can get to the explanatory component that allows you to put that in words that people will then understand and can use, or understand, as the rationale for certain decisions. And then transparency is how easy it is to get to either of those things: whether you're able to see into the model, or understand the model, in a way that's clear.
Today, it's very difficult to have transparency with a neural network because you've got a black box. But you can have some form of transparency on the input end and the output end. So that's how I think of interpretability. Why don't you give us the Anna Gressel version now?
Anna Gressel: I'm just going to be honest: I think it's challenging to tease apart these definitions, in part because the technology is changing as we've been working through these concepts. I mean, what interpretability may have meant in 2018 is different, I think, from some of the concepts around it today. And Katherine, I thought you did a great job with the definitions.
I mean, for me, with interpretability, if you think about the concept of the black box, we've always thought in the AI space that someday we might be able to get to something like a glass box. I don't know if people listening have heard of that glass box concept. But the glass box is this idea that, with AI as the engine, suddenly everything becomes clear and you can see the workings inside. It's kind of what you were saying about transparency, right? That layer is the transparency piece. You can see the gears; you can see all of the mechanisms whirring behind the model.
Interpretability, to my mind, is the why. Like, why does this actually all fit together and work? And do we understand the mechanism? It's not just that we can say, there's a gear over there. I mean, look, you can see I'm not a mechanic. There's a gear over there, there's a gear over there, they fit together. But interpretability is understanding that there's a combustion engine in there and how the combustion engine works. And so, to my mind, interpretability is really that deepest level of understanding, and it's also sometimes the hardest to ascertain. But at the same time, Katherine, as you said, you can infer elements of interpretability, the how and why of the way this works, from the inputs and the outputs as well. Katherine, would you agree with my engineering?
Katherine Forrest: I totally would, and I'm actually going to talk about a couple of papers that go directly to the point that you've just made. But I want to pause for a moment on the irony of having a debate about what interpretability means, because that sort of suggests that some more clarity might be needed. What you were just talking about, in terms of this mechanistic aspect of it, is something that has actually become a subfield of AI research, really of AI safety research, and it's called mechanistic interpretability. And there have recently been a number of papers issued that are really, that are very complex. I don't want to use the word really too many times in the same sentence. But one of them is from Anthropic, and that's something called “Scaling Monosemanticity,” and OpenAI has some papers and DeepMind has some papers. DeepMind has one called “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2.”
But this field is clearly picking up traction, and when you boil it down, it's doing what you were just saying, Anna. It's trying to find a way to sort of crack the code: large language models are a bit like human brains in that they've got these artificial neurons firing, and the question is whether you can actually interpret those firings in some way that would allow us to understand how the models think and how they, quote, reason.
Anna Gressel: Yeah, and I mean, I'll jump in as the former neuroscientist in the room. That's incredibly difficult to do in the brain. And the question is, can we even think about doing this from a neural network perspective? I thought the Anthropic paper was super interesting, and I think it's worth a read for folks who want to nerd out on interpretability, because part of what they're asking is this question: is a node actually relevant to a particular concept? Can you locate a concept in a neural network in a particular node? That's not so different from the question that people ask about the brain, right? Can we locate certain kinds of concepts in certain neurons or certain areas of the brain? So, we're reasoning through these in similar ways. And the idea would be that if you can locate a concept in a particular place, maybe you can actually take it out of the model if you needed to. Or you could look for it in the model if you needed to. Again, we're getting down to kind of the deepest level of reasoning and understanding what these models do and why.
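To make that idea a bit more concrete for readers of the transcript, here is a minimal, purely illustrative sketch of the sparse-autoencoder approach those papers describe. The activations below are random stand-ins rather than activations captured from a real language model, and the dimensions and training settings are arbitrary assumptions; the point is only that a sparsity penalty pushes each input to excite a small number of features, which researchers can then try to match to human-understandable concepts.

```python
# Toy sparse autoencoder over synthetic "activations," loosely in the spirit of
# the sparse-autoencoder interpretability work discussed above. Illustrative only:
# the data is random, whereas real work trains on activations from an actual model.
import torch
import torch.nn as nn

torch.manual_seed(0)

D_MODEL, D_FEATURES, N_SAMPLES = 64, 256, 4096

# Stand-in for hidden activations collected from a model (hypothetical data).
activations = torch.randn(N_SAMPLES, D_MODEL)

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder(D_MODEL, D_FEATURES)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty: encourages each input to excite only a few features

for step in range(200):
    features, recon = sae(activations)
    loss = ((recon - activations) ** 2).mean() + l1_coeff * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, one would inspect which feature fires on which inputs,
# e.g. the most active feature for a given activation vector.
features, _ = sae(activations[:1])
print("most active feature index:", int(features.argmax()))
```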
Katherine Forrest: Right, and these are, I think, concepts that are particularly important for companies that are making LLMs or are really trying to understand the inner workings of their MLLMs or LLMs. But the point that you've just raised, Anna, is a really good one, which is the possibility of using interpretability, and sort of mechanistic interpretability, as a method of control, and that can be a safety feature for some of these models. You know, we're getting increasingly aware these days of the ability of these models to produce outputs that can lead to various kinds of difficult safety issues. The CBRN risks that we talked about in one of our prior episodes are one of them. But mechanistic interpretability can actually help us understand where we can sort of adjust the levers for some control.
Anna Gressel: Yeah, and people have been talking for a long time about this concept of pruning models, pruning out things from models that you don't necessarily want. But if you don't know where to prune and how to use those clippers, it may not actually be functionally valuable. So, this idea of interpretability can overlay with that. Ideally, you could potentially pull out concepts you didn't even want in there, or functions you didn't want in there.
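For a concrete, if simplified, picture of what pruning means mechanically, here is a toy sketch using PyTorch's built-in pruning utilities on a randomly initialized network. This prunes weights by magnitude, which is obviously not the concept-level pruning being discussed; knowing which weights or features actually encode an unwanted behavior is exactly the part that interpretability research is trying to supply.

```python
# Minimal magnitude-pruning sketch: zero out the smallest weights in one layer.
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy network standing in for whatever model you might want to prune.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))

# Zero out the 30% smallest-magnitude weights in the first layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Confirm the resulting sparsity of that layer's weight matrix.
sparsity = float((model[0].weight == 0).float().mean())
print(f"fraction of pruned weights in first layer: {sparsity:.2f}")
```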
Katherine Forrest: And so, I think that for our listeners, if what they've done is license in a model, then the question of interpretability will be one that you'll need to ask the licensor of the model about. You may not need to understand all the deep, dark engineering aspects of it, but you'll want enough of the language to be able to speak to the regulators.
And for those who are the actual model developers, you're going to need to be able to understand the model in some way so that you can check the box, if you will, of the interpretability that leads you to the explainability and, maybe or maybe not, the transparency.
I think that's all we've got time for today, Anna.
Anna Gressel: I think so too. Thanks, everyone, for joining us for this really interesting discussion, and please like and subscribe to the podcast.