AI Inference: Connecting the Dots
Following last week’s deep dive into semiconductor chips, Katherine and Anna revisit the concept of AI inference. They explore AI reasoning capabilities, why inference quality matters for lawyers, and recent advancements that can improve how models infer.
Katherine Forrest: Hey, welcome everyone to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.
Anna Gressel: And I'm Anna Gressel.
Katherine Forrest: And Anna, I can see you. Nobody else can see you, but I can see that you've got your home background behind you. And you're back from Silicon Valley.
Anna Gressel: I know, I went and I sought out sunnier shores. I was up in San Francisco and the peninsula, and then actually up in Seattle, which was lovely. And it was great to just meet with a ton of fun folks, but I would say, in particular, just so many amazing women lawyers who are working on AI and chips these days. And so shout out to all of our friends. It was great to see you guys.
Katherine Forrest: Well, that's terrific. And we're glad to have you back on this side of the time change. That is, so that there is no time change.
Anna Gressel: Definitely.
Katherine Forrest: And I'm thrilled that in today's episode we get to follow up on an aspect of the one that I did solo last week, Anna.
Anna Gressel: Mm-hmm.
Katherine Forrest: That was on the exciting, scintillating topic of chips and GPUs. Actually, it was a good episode, but I had to keep reminding our audience not to turn me off. Don't turn me off.
Anna Gressel: Now I have to wait and listen to it with everyone else. I'm so on the edge of my seat.
Katherine Forrest: And it will keep you on the edge of your seat. But today we're planning to talk about another concept that's thrown around all the time in the AI world, and that is a word called inference. I-N-F-E-R-E-N-C-E, inference. And as a former judge, I have to say that I'm used to the word inference meaning sort of an educated view where you arrive at a fact based upon putting together other facts. So it's like, in a way, like circumstantial evidence, where you use almost like a connect-the-dot kind of process to infer a fact based on other facts. And so, we used to use an example when I was a judge about inference, which was, you wake up in the morning and you see that the sidewalk outside your building is wet, and you can infer that it rained the night before. But if you look around further, and you add additional information to just the wetness, and you see that there's a hose nearby, then that additional information would allow you to add to your inference and to predict that the water may have even come from the hose. And so you could test both hypotheses with additional information in the mix. You'd look for wetness, say, in the street, if it rained, or whether or not the hose was dripping if it was the hose.
And so, I used to say to juries all the time that an inference is not speculation, it's not guesswork, but it's, again, using one or more facts to make a determination as to another. And today we're going to be talking about the special meaning of the word inference here.
Anna Gressel: Yeah, in the AI space, inference is really a very particular term of art. It has some relationship to the way you used the term when you were a judge. I definitely think there are some similarities, but there's also a particular difference between the everyday usage of the word and its specialized meaning in the AI context. So let's talk about that a little bit. When we talk about inferences from an AI tool, we're essentially talking about the output from the model. So if I query a model, then based on all of the information at its disposal, the model infers an answer.
Katherine Forrest: All right, so let's break that down into pieces for our listeners. So when an AI model is inferring something, it's basically taking your query, it's taking your input, if you will, in terms of the query, not the information that was input, but the query. And based upon all of the information that had been input, that had been ingested into the model during the training process, the model is then arriving at an answer, at a predictive inferred answer. And that's inference.
Anna Gressel: In another episode, we'll talk about all of the different ways the models can be trained and why it matters. But for now, I think what we want to focus on is just that process of inferring an output, the process of having queries go in and outputs come out. And in order for that process to work, you basically take a model that's already been trained. And once that training is complete, the model is queried by a user and asked to produce output based on data, usually data that's not seen before during training. And to do that, it has to make computations to infer the output. So basically, the computing process that we call the inference process is the process of computing that particular output in response to a particular query. And that's a little bit different than the computing process that goes through to train the model in the first place.
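To make that query-in, output-out process concrete, here is a minimal sketch of running inference against a model whose training is already complete. It assumes the Hugging Face transformers library and a small local model (gpt2, chosen purely for illustration); nothing here is specific to any model discussed in the episode.

```python
# Minimal sketch of inference: a trained model takes a query as input
# and computes an output. Assumes the transformers library is installed
# and can load a small model (gpt2, purely illustrative).
from transformers import pipeline

# Training is already complete; we only load the finished model.
generator = pipeline("text-generation", model="gpt2")

# The query is the only new input the model sees.
query = "What temperature is it today?"
result = generator(query, max_new_tokens=20)

# The generated text is the model's inference in response to the query.
print(result[0]["generated_text"])
```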
Katherine Forrest: And when you say input it has never seen before, that input could even just be the query itself.
Anna Gressel: Exactly.
Katherine Forrest: When you talk about input, I always think of it as having to be more than a query, like a news article or a book or some sort of content, but it can just be the query.
Anna Gressel: Mm-hmm, like, what temperature is it today?
Katherine Forrest: Right, exactly. Well, in last week's episode, when I talked about chips and compute power, we briefly mentioned that there are different aspects of using computing power or compute. And one is to train the model. And that is one use of compute or computing power. And the other is for inference. And so you need compute or computing power for both the training aspect and the inference process for the model. So, chips and semiconductors and GPUs, all of that, as we talked about in the last episode, that's all relevant for both processes.
Anna Gressel: Yeah, so let's talk about the fact that inference is such a buzzword these days, but why should we care about this particularly as lawyers?
Katherine Forrest: Well, a primary thing that we care about with inference is the quality of the inference, the quality of what's coming out of the model. And quality comes in a couple of different pieces. There can be inferences which are frankly incorrect, and those could be errors or they could be hallucinations. But we also separately care about the sophistication of the reasoning. So, even if you've got a correct inference, you may want that inference to actually have sophistication, and it can be basic or it can have sophisticated reasoning. And so, inference is on a spectrum of different qualities of inference.
Anna Gressel: Yeah, and one version of this that we've talked about in the past relating to inference issues is algorithmic bias, actually. Those are inferences that can occur, among other reasons, because of overly narrow data sets that might result in conclusions that exhibit some bias. And we can imagine a facial recognition tool that's trained on a really limited set of facial pigmentations. The error rate for that tool could be a lot higher, for example, when it's making inferences outside of its training data set, on different kinds of pigmentation, than if the data set had been more inclusive to start.
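As a rough illustration of what it means for an error rate to differ across groups, here is a small sketch of the kind of per-group check a tester might run. The records and group labels are invented purely to show the arithmetic; this is not the evaluation method of any particular tool.

```python
# Sketch of comparing a model's error rate across groups, e.g. by skin
# tone in a facial recognition evaluation set. The data is made up.
results = [
    {"group": "A", "correct": True},
    {"group": "A", "correct": True},
    {"group": "A", "correct": False},
    {"group": "B", "correct": False},
    {"group": "B", "correct": False},
    {"group": "B", "correct": True},
]

for group in ("A", "B"):
    group_results = [r for r in results if r["group"] == group]
    errors = sum(1 for r in group_results if not r["correct"])
    print(group, "error rate:", errors / len(group_results))
```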
Katherine Forrest: Right, exactly. And then there's another, more technical issue with inference, which is called overtraining. And in some cases, overtraining a model can lead to something called overfitting. So you can overtrain to lead to an overfit. I like to think of it in terms of exercise, which would be the opposite. You could overtrain so that you could underfit, but whatever. Let's put that to the side. In the AI context, it would be overtraining leading to overfitting, where a model performs well on training data, but has been so overtrained that it actually performs relatively poorly on new data. And so we care about what inference error rates the AI tools that our clients are using, or asking us about, actually have. And one thing that is done to determine error rates is testing the models from time to time. And as I mentioned, there are also different kinds of tests for the level of sophistication of the reasoning process.
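Overfitting is easy to see in a quick sketch. The example below uses scikit-learn (our choice for illustration; it isn't mentioned in the episode) to fit an unconstrained decision tree that scores near-perfectly on its training data but noticeably worse on held-out data.

```python
# Sketch of overfitting: a model that performs well on training data
# but worse on data it has not seen before.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No depth limit: the tree can effectively memorize the training set.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("training accuracy:", model.score(X_train, y_train))  # typically near 1.0
print("held-out accuracy:", model.score(X_test, y_test))    # noticeably lower
```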
Anna Gressel: Right, and there are a number of papers that, I think, helpfully discuss the reasoning ability of models, and that reasoning is contained in the model output, and that's an inference.
Katherine Forrest: And one of the ways that we understand some of the reasoning that the models do today is by actually using something called chain-of-thought reasoning. And that chain-of-thought reasoning is a way of asking the model to tell us, step by step, how it's actually reaching the inference. So if you want to understand the inference, you can use this kind of chain-of-thought reasoning. But interestingly, if you have the model do that and tell you step by step how it's reaching its inference, researchers have found that the inferences are actually better. The quality and the sophistication level of the inference are actually better. And there are a couple of papers that really go into this. There are actually now a number of papers, so I'm just going to mention two.
One is from Li, L-I, et al., from September 2024, by a group of researchers from Stanford, the Toyota Technological Institute and Google. It's called “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems.” And then there's a second paper that just came out from Xiang, et al. on January 8, 2025, from SynthLabs.ai, Stanford and Berkeley, called “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought.” So those are papers that I would really recommend to our listeners. And there's more and more sophisticated reasoning with this chain-of-thought coming out of LLMs these days. The o3 model, which hasn't been fully released but is now starting to be discussed, with some very preliminary results out about it, does use chain-of-thought reasoning.
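To show the prompting pattern being described (not the specific methods studied in the Li et al. or Xiang et al. papers), here is a sketch of the same question asked with and without a chain-of-thought instruction. The send_to_model function is a hypothetical stand-in for whatever model API you actually use.

```python
# Sketch of chain-of-thought prompting: the same question asked directly
# and with a step-by-step instruction. send_to_model() is a hypothetical
# placeholder, not a real library call.

def send_to_model(prompt: str) -> str:
    raise NotImplementedError("connect this to your model provider")

question = (
    "A train leaves at 2:15 pm and the trip takes 3 hours 50 minutes. "
    "When does it arrive?"
)

# Direct query: the model answers in one shot.
direct_prompt = question

# Chain-of-thought query: the model is asked to lay out its steps
# before committing to a final answer, which the papers above associate
# with better inferences.
cot_prompt = question + "\nThink through this step by step, then give the final answer."
```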
Anna Gressel: I have to say, what lawyer doesn't relate a little bit to chain-of-thought reasoning? Doesn't it remind you, Katherine, a bit of the Socratic method and having to really walk through all of the different links in that inferential chain we even use as lawyers? I think it's so relatable, and it's so interesting as lawyers to think about what it means to have that all written down and available for us to examine, as we think about looking under the hood of that black box. So...
Katherine Forrest: Well, hold on, because I just want to say one thing about the Socratic method. So you know, I have taught for 10 years a class…
Anna Gressel: I do.
Katherine Forrest: …as an adjunct at NYU Law on quantitative methods. And I have tried the Socratic method, and I want to tell you it does not always work, because of some of the Gen X, Gen Z, Gen ABC, whatever the Gen is, whatever they are doing. You say to one of them, “John Doe, what's the answer?” And John says, “I wasn't able to do the reading.” You know, the chain-of-thought reasoning has maybe gone out the window at that point. But I think that with the machines, they at least are doing the reading, because that's what the ingestion process is all about.
Anna Gressel: Soon enough, your law students will say, “o3 was the reasoning behind my answer.”
Katherine Forrest: Right, right, right. Quickly, refer to o3. Call home.
Anna Gressel: Well, so, I think it helps to think a little bit about what actually accounts for some of these advances like the o3 model and the chain-of-thought reasoning. And that includes a lot of different things. Some of it is the process of generating chain-of-thought tokens, which is a big, big part of the architecture of those models. A big part of it, as well, is what Katherine talked about last time, the compute power and the chips.
You know, there's obviously the GPUs that Katherine talked about before, which are great at performing all kinds of massively parallel computations for AI needs, and for training in particular. Really, GPUs are so, so critical. But there's an additional fact about the computations that are performed to produce inferences: they can be counted. And in fact, they are counted. And the word for this is another technical one, called FLOPS. I think we've used this before, right, Katherine?
Katherine Forrest: FLOPS, like he flopped over, only it's not the whole flopped business, it's just the FLOP.
Anna Gressel: It's just the FLOPS, plural. That stands for floating-point operations. And I know, for folks who have listened before, we talked about this from a bit of a regulatory perspective. So, an AI model's weights are floating-point numbers, and FLOPS are essentially the number of operations done during training to tune the model's many weights.
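For a rough sense of the numbers involved, a commonly used back-of-the-envelope approximation (our addition here, not something from the episode) is about six floating-point operations per parameter per training token. The model size and token count below are purely illustrative, and a figure like this is the kind of number that gets compared against the regulatory thresholds that come up next.

```python
# Back-of-the-envelope training-compute estimate using the common
# approximation: FLOPs ≈ 6 * parameters * training tokens.
# The parameter and token counts are illustrative only.
params = 70e9    # 70 billion parameters
tokens = 15e12   # 15 trillion training tokens

training_flops = 6 * params * tokens
print(f"{training_flops:.1e} FLOPs")  # about 6.3e+24, just under 10^25
```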
Katherine Forrest: One of the things about FLOPS is that the EU AI Act has a certain number of FLOPS. It's 10^25, I think…
Anna Gressel: Mm-hmm, that's right.
Katherine Forrest: …for the FLOPS for the high-capability models. And the White House executive order had 10^26 for its frontier models. But another way that inferences are being made substantively better is through something called in-context learning. And that's another AI phrase. You know, there are all these AI phrases. But in-context learning is where, even if the model doesn't directly train off of user prompts, it can use examples from past prompts to help make its response to the current prompt better. So it can help make its inferences better. It uses the context, if you will, of the conversation to make the inference, the output, better.
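Here is a small sketch of what in-context learning can look like in practice: earlier question-and-answer pairs are packed into the prompt so the model can condition on them at inference time, with no change to its weights. The examples are invented for illustration.

```python
# Sketch of in-context learning: prior examples are included in the
# prompt itself, so the model can use them when inferring an answer to
# the current query. No retraining or weight updates are involved.
past_examples = [
    ("Summarize: 'The parties agreed to extend the deadline by 30 days.'",
     "Deadline extended by 30 days by mutual agreement."),
    ("Summarize: 'The license granted is non-exclusive and worldwide.'",
     "Non-exclusive, worldwide license."),
]

current_query = "Summarize: 'Either party may terminate on 60 days' notice.'"

prompt_parts = [f"Q: {q}\nA: {a}" for q, a in past_examples]
prompt_parts.append(f"Q: {current_query}\nA:")

prompt = "\n\n".join(prompt_parts)
print(prompt)  # this combined prompt is what gets sent to the model
```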
Anna Gressel: Yeah, and one key point for folks who are trying to understand this holistic market for AI and for chips is that the future of inference is really going to require chips that are custom designed for that purpose, not just for training. And that is driving a lot of the differentiation that we're beginning to see in the market today: options that are not just GPUs. And remember, GPUs are also pretty heavily controlled these days from an export control perspective. So it's worth really giving a call out to a few companies here. One is Cerebras. They produce a chip that's not a GPU, that's designed from the outset to handle AI inference. Their chip is called a wafer-scale engine, or WSE. And the measurement for comparison here is tokens per second, which assesses how fast the response from a model gets fed to you as a user. It's very, very fast, super impressive stuff.
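Tokens per second is a simple throughput measure: the number of tokens a model streams back divided by the wall-clock time it took to produce them. A tiny sketch of the arithmetic, with made-up numbers:

```python
# Sketch of the tokens-per-second measure: tokens generated divided by
# elapsed wall-clock time. The numbers are illustrative only.
def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    return num_tokens / elapsed_seconds

# e.g. 480 tokens streamed back in 0.3 seconds -> 1600 tokens per second
print(tokens_per_second(480, 0.3))
```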
Katherine Forrest: Tokens per second. That's interesting. Alright, and Groq, which is spelled G-R-O-Q, not to be confused with a different Grok company, which has a K on the end. But Groq with a Q has also made a number of advances here, and they've developed an inference chip called the Language Processing Unit, or LPU, which can be up to 10 times more energy efficient than a traditional GPU without giving up anything on speed.
Anna Gressel: So you might be wondering what the “so what?” is here. And if you're running AI at scale at your business, or even thinking about it, you need to consider whether the model you're using is running on the right chips to get you the right answer as quickly and cheaply as possible, particularly if you're processing really big data at scale, not just asking what the weather is, per my example earlier.
Katherine Forrest: Right, although I'm going to go to Cartagena soon, and I really am interested in what the weather is. Okay, but here's one more thing to chew on. If you can shave off, say, 10 seconds from how long it takes a model to infer its first potential answer to you, its first, let's call it a draft answer, you could give the model back five of those seconds to allow it to double check itself before giving you a final answer. And the models now, some models, are architected to do that, to actually double check themselves on their answer before they actually provide it to the user in their final inference response. So OpenAI's o1 model, for instance, has time to, as they call it, think and reason at inference time, rather than spitting out its first guess very, very quickly. So the model is actually taking time to make its inferences.
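As a sketch of the trade-off Katherine is describing, here is one hypothetical way a system might spend a saved latency budget on a self-check before answering. The draft_answer and review_answer functions are placeholders we've invented, and this is not a description of how o1 or any other model is actually architected.

```python
# Sketch of spending part of a latency budget on double-checking:
# draft quickly, then review the draft if time remains. The helper
# functions are hypothetical placeholders.
import time

def draft_answer(query: str) -> str:
    return f"draft answer to: {query}"        # placeholder

def review_answer(query: str, draft: str) -> str:
    return draft + " (double-checked)"        # placeholder

def answer_with_budget(query: str, budget_seconds: float) -> str:
    start = time.perf_counter()
    draft = draft_answer(query)
    # Use any remaining budget to double-check before responding.
    if time.perf_counter() - start < budget_seconds:
        return review_answer(query, draft)
    return draft

print(answer_with_budget("What is 17 * 24?", budget_seconds=5.0))
```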
Anna Gressel: Yeah, and I think this isn't the end of the story at all for improving how models infer. We'll be keeping a really close eye, too, on how regulators, including the EU AI Act’s enforcers, might adapt if inference sophistication becomes a key way in which models set themselves apart. And if inference chips become as important as, or more important than, GPU training chips, which might happen in the future, we're going to start potentially seeing some reactions from regulators around things like export controls. And we're going to be really curious to see how this whole ecosystem evolves. And I'd say this is a particularly interesting point on a day like today, when there were actually some very new, important export controls proposed around both the chip and the AI space. And maybe we'll cover those, Katherine, in another upcoming podcast episode.
Katherine Forrest: Right. And the day where we're recording this episode is January 13th, 2025. So that's what we're referring to for some of you out there who are wondering what we're talking about, in terms of those new export controls. But there's been a lot out there. So even from our last episode to this one, we've got more new news that we'll have to then take you back to in the future. But for now, that's all we've got time for. I'm Katherine Forrest.
Anna Gressel: And I'm Anna Gressel. Make sure to like and share the podcast.