AI and Scaling: Its Meaning and Limits
Join Katherine Forrest and Anna Gressel as they explore the challenges and opportunities of scaling AI models, from data walls to Moore's law, in this episode of "Waking Up With AI."
Katherine Forrest: Welcome to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.
Anna Gressel: And I’m Anna Gressel.
Katherine Forrest: And Anna, it's Thanksgiving week, and while this particular episode, as we know, is going to be played after Thanksgiving has come and gone, when we're into the holiday season and people are trying to organize the eight gazillion holiday parties, I just wanted to say that I love Thanksgiving week.
Anna Gressel: Me too, actually. One thing that not everyone knows about me is I used to be quite a good baker and I don't bake everything equally well, but I'm particularly good at pie baking. So I love this week, I love baking different kinds of pies and scheming up which ones I'm going to bake every year.
Katherine Forrest: I just want to say that I am one of the people who did not know that you could bake, because I have never been the recipient of a pie or even a single baked good. Nothing! I have not gotten a cookie. There has not been a muffin. No cupcakes. Nada. A total injustice. But one thing I wanted to say before we get on to something more serious is that I was looking at the front page of The New York Times today, and it has the picture of the two turkeys who are going to be pardoned, right?
Anna Gressel: Yep.
Katherine Forrest: And I was struck by the fact, having been, or actually, I still am a former judge, that these poor turkeys are being pardoned for no crime, right? They have not committed a crime. And so I suddenly had this moment of dissonance between a turkey that has committed no crime, apart from existing and having a lot of feathers, and the idea of it being pardoned.
Anna Gressel: Katherine, I feel like that should be in the New Yorker caption contest. You should just enter that right in. It's ready to go.
Katherine Forrest: All right, all right, all right. Well, let's actually sort of move on to our topic today. Well, actually, even before we get to our topic, I want to say one more thing about the week, because I'm not ready for our topic yet. You know, one thing I'm going to be doing this week is trying to finish this book that I'm co-authoring with Amy Zimmerman, called “Of Another Mind.” It's a fascinating topic, and one thing I ran across while doing some research, and I always want to make sure I'm up on the last-minute research for this book, because I don't want the book to come out and have something big appear just at the very end, is a really interesting book that talks about AGI, Artificial General Intelligence.
Anna Gressel: Do you want to remind people what AGI is and why it's such an important concept?
Katherine Forrest: AGI, which is a term that you're hearing more and more often right now in the press, is actually the point at which the intelligence or reasoning abilities of AI meet or exceed those of humans. And so you can think of it as almost a point of superintelligence, although there's something else I would distinguish, called the singularity, which in my view at least is not AGI, though AGI is on the route to it. There are right now a lot of press reports about model developers actually seeking to achieve AGI with their models, and we can talk more about this another time, but I wanted to recommend to folks Ray Kurzweil, that's K-U-R-Z-W-E-I-L. It's Ray Kurzweil's new book, which is called “The Singularity is Nearer.” He'd written an earlier book called “The Singularity is Near,” and this one is called “The Singularity is Nearer.” And whatever you think about AGI, I just wanted to recommend it as a good read, a really fascinating read for folks. And AGI is actually a building block relating to something that we're going to talk about today called scaling.
Anna Gressel: And I think the word scaling can be a little confusing. So, Katherine, why don't we start with a refresher on that because we don't want to just lose our audience before we even jump fully into the topic today.
Katherine Forrest: Right, and we had mentioned scaling a couple of episodes back, but we didn't dive into it like we're going to dive into it today. So scaling, first of all, is not descaling a fish, all right? So for those folks who picked out this episode called scaling thinking they were going to get a fish episode, that is not so. But the word scaling does get used in the AI context quite a lot. It's one of those confusing terms. I love the term “agentic,” because it's sort of, I think, unnecessarily confusing. But scaling also gets used in a confusing way, and I think everybody sort of assumes that everybody else understands it, when actually only a very few people do. So because of these multi-context uses of the word scaling, it's worth diving in and defining it.
Anna Gressel: Yeah, so I mean, I think our listeners might be familiar with some of the ways that the term scaling is used, like as a size metric. So scaling something from a pilot program to a more general program, making it bigger, or spreading a use case out over a larger group of people, that can be kind of a concept of scaling up or scaling out.
Katherine Forrest: Right, or scaling can refer to increasing the size or the capacity of an AI model or system to maybe improve its performance, its capabilities, its efficiencies, a variety of different aspects of a model.
Anna Gressel: So let's just pause on that a little bit because as a general matter, scaling can refer to scaling deployment or usage of a model. That's what we just talked about a minute ago. And it can also refer to scaling model size. And so we'll talk about that particular point also in a minute.
Katherine Forrest: And we can also talk about scaling data, which might mean using larger or more diverse datasets. But equally, it can refer to increasing the compute resources that are used for processing.
Anna Gressel: Right, and compute is the hot topic these days. I mean, I think there's very little that's more on folks' minds than trying to get their hands on some of those GPUs, or graphics processing units, that NVIDIA makes, or TPUs, tensor processing units. So, you know, when we think about scaling compute, we're really talking about the amount of computing or processing power used to train a model, or that the model has access to during use.
Katherine Forrest: Absolutely, and then you can also scale tasks, which means to expand the range of tasks that a model can perform.
Anna Gressel: Yeah, so for our focus today, let's talk about something that's getting a lot of attention in the press and has implications for compliance and governance issues. That's this question about scaling the size of the model, or put another way, scaling its capabilities. Maybe we can talk about why those are connected.
Katherine Forrest: Yeah, and for this topic, we're going to back up and start with what it means to actually scale a model, beyond the brief introduction we've just done. What we're referring to are the efforts to make a model more capable, or at least theoretically more capable, and more powerful with regard to certain tasks. We'll talk about some of the limitations in a moment. And in part, this can be done by training the model in a way that makes it larger overall.
Anna Gressel: So let's talk about what larger means. Larger here, in the context of scaling, is measured by the number of parameters in the model and the investment needed to actually create that huge number of parameters. So we'll do a short refresher on parameters, which is another tricky, sometimes kind of arcane term. Parameters are adjustable values within a neural network that capture the relationship of one piece of data or information to another.
So for instance, a parameter might suggest there's more of a relationship between one piece of data and another. For example, cats and dogs might be related as concepts. And then trucks and machinery might also be related as concepts, and less related to the concepts of cats or dogs. So there is a network of information about any piece of data that connects it to other pieces of data, and the values of the parameters, or its connections, change as the model gets trained, and can actually increase as the model is trained up.
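To make that idea concrete, here is a minimal sketch of how relatedness between concepts can be expressed numerically. The embedding vectors are invented toy values, not outputs of any real model:

```python
import numpy as np

# Toy four-dimensional "embeddings": the numbers are invented for
# illustration and do not come from any real model.
embeddings = {
    "cat":   np.array([0.9, 0.8, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.9, 0.2, 0.1]),
    "truck": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two vectors: near 1.0 means closely
    # related concepts, near 0.0 means largely unrelated ones.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))    # high: related concepts
print(cosine_similarity(embeddings["cat"], embeddings["truck"]))  # low: unrelated concepts
```

In a real model, values like these are learned from data rather than written by hand, and there are billions of them rather than a dozen.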
Katherine Forrest: In the context of AI, “weights” is another word that's related to “parameters,” but it's not entirely equivalent. Parameters are the internal values that a model learns during the training process, and the weights are the parameters that define the connections between the neurons within the neural network where all of this is occurring. So you've got parameters as sort of the larger grouping, and then weights within it.
But going back to scaling, as more and more data comes into the model, there is more and more training that can occur within the model on the data that's there. And the model then learns the nuances and what connections there might be between the data. Like the cats and dogs example that you just gave: there can be a relationship between cats and dogs and, say, human beings, because they're all mammals, but there might be a different weight and connection between house cats and, say, jungle cats in the rainforest than between house cats and humans. They might all be mammals, but there can be a different kind of weighting between one point of information and another.
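As a rough sketch of that grouping, in a simple feed-forward network the weights are the connection strengths between neurons, and together with bias terms they make up the parameters. The layer widths below are hypothetical, chosen purely for illustration:

```python
# Counting parameters in a tiny feed-forward network: weights (one per
# connection between neurons) plus biases (one per receiving neuron).
layer_sizes = [512, 256, 10]  # hypothetical layer widths, for illustration only

total_params = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    weights = n_in * n_out  # connection strengths between adjacent layers
    biases = n_out          # one bias per neuron in the receiving layer
    total_params += weights + biases

print(total_params)  # 133898 parameters for this toy network
```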
Anna Gressel: Right, so as these models grow, as they become bigger, they usually take on additional parameters, a larger parameter count for the model. And larger models, models that are actually scaled up with all of that additional training, like so much training, are often, not always, but often, the models that perform best at various tasks.
Katherine Forrest: Right, and the scaling itself is a complex engineering task that isn't done in just a single way. It's not just a matter of turning up the compute, throwing in more data, and extending the length of time that you do the training. There can be, and there are, very sophisticated approaches, proprietary to many companies, for how they actually scale a model. And different companies strike different balances between the levers that can be used to scale a model, whether it be data or compute or any other aspect of the model.
And last summer, or maybe it was even October when we finally aired it, we talked about a publication called “Can AI Scaling Continue Through 2030?” and I'm still recommending it. It's from Epoch AI, and it really lays out the different aspects of what is needed to scale, whether things will be scalable continuously through 2030, and what would happen then.
Anna Gressel: And is it Epic AI or Epoch AI that published that, Katherine? I know you love this paper.
Katherine Forrest: Well, it's only you who would call it epoch. Like, what kind of word is epoch? There is no such word as epoch. There's only the word epic. So I don't know.
Anna Gressel: Okay, we'll take this debate offline, Katherine, but let's talk about the debate in the paper on whether scaling necessarily leads to better models. And that depends on a lot of things, like the architecture of the model, how it's trained, what it's trained on and the type of task that the developers intend it for.
Katherine Forrest: Right, for sure. And without getting lost in all of the details, so we can talk more about scaling itself, the paper talks about the possibility that by 2030 we'll be able to scale a model to 10,000 times larger than GPT-4. So that's 10,000 times. Think about something that is twice as large, or four times as large, or 10 times, or 100 times as large as GPT-4. We're talking about 10,000 times as large as GPT-4. Now, there's something in the world of computer science called Moore's law, which relates to the observation from Gordon Moore, who co-founded Intel, that the number of transistors on a microchip would double roughly every two years while the cost decreased. But there really isn't an equivalent of Moore's law in the AI world. And so it's not the case that as the model size grows, or the dataset size grows, or the compute power grows, there necessarily has to be an increase in capability.
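A quick back-of-the-envelope calculation shows why a 10,000-times scale-up outruns Moore's-law-style doubling. This is illustrative arithmetic only, not a projection:

```python
import math

# How many doublings does a 10,000x scale-up require, and how long would
# that take at Moore's-law pace (one doubling roughly every two years)?
target_factor = 10_000
doublings = math.log2(target_factor)  # about 13.3 doublings
years = doublings * 2                 # about 27 years at Moore's-law pace

print(f"{doublings:.1f} doublings, roughly {years:.0f} years at Moore's-law pace")
```

In other words, reaching 10,000 times GPT-4 by 2030 can't come from denser chips alone; it has to come from more chips, more data and more spend, and even then, as the discussion here notes, a matching increase in capability isn't guaranteed.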
Anna Gressel: Yeah, and I mean, there's so much debate around this, and I think we can keep talking about it. Some people talk about “more than Moore,” the idea that as architectures get even more complex, or quantum computing comes to the fore, we're going to surpass Moore's law in really interesting ways. But at the same time, there are some people saying that this kind of exponential increase in capability and size can't go on, you know, infinitely, and so there might be some diminishing returns on scale. That's a hotly debated question right now.
Katherine Forrest: Right, and there are a number of articles actually on exactly this topic, that some developers are not seeing the capability improvements that they would expect. But also, and you'll see this a lot with some of the newer releases, you can actually see models that have fewer parameters than some of the big flagship, you know, huge models, but that have really impressive capabilities.
Anna Gressel: So some people, like a16z's Ben Horowitz, have focused attention on the hardware used to train AI, saying we're not getting the intelligence improvements at all out of the increasing number of GPUs used to train AI. But the one factor everyone points to right now in this debate is data.
Katherine Forrest: We'll come back to that in just a second, but going back to something I just said, when we look at the Llama herd of models, which is a very powerful model set, you see a one billion parameter model, a three billion parameter model, but also the 405 billion parameter model, which is sort of the flagship model. All of these were released between July, for the 405 billion, and September, for the one and three billion parameter models, and all of them have incredibly impressive capabilities. So even with the smaller models, we do get these impressive capabilities.
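For a rough sense of what those parameter counts mean in practice, here is a back-of-the-envelope sketch of the memory needed just to hold the weights, assuming 16-bit precision, or two bytes per parameter. Real deployments vary with precision and overhead:

```python
# Memory to store model weights at 16-bit precision (2 bytes per parameter).
BYTES_PER_PARAM = 2

for params_in_billions in (1, 3, 405):  # the Llama model sizes discussed above
    gigabytes = params_in_billions * 1e9 * BYTES_PER_PARAM / 1e9
    print(f"{params_in_billions}B parameters -> ~{gigabytes:,.0f} GB of weights")
```

That gap, a couple of gigabytes versus hundreds, is part of why smaller models can live on a device while flagship models need a data center.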
Anna Gressel: Yeah, I think that's right. Smaller models and larger models can be good at different things. They can be used for different things. One thing that's great about small models is you can stick them on devices. We'll talk about that probably in a different episode. But there are all these questions, again, about how you measure these capabilities. But let's pause and go back a little bit to the computing power piece. One way developers have scaled models over the past few years has been by amassing computing power. And so that's, you know, for some developers, buying thousands of GPUs, which are kind of, you know, the scarce resource these days, and funding new data centers to house those GPUs, and using new data sources to train models. So all of those things kind of come together to help with scale.
But there are also now choices about when scale helps and when additional scale is unnecessary. It may just use up additional resources without return. So there's kind of this complex balancing act between what you're actually training the model to do, how it's going to be used, what the design is for, again, like, is it supposed to be on-device, is it supposed to be this large model powering new kinds of capabilities, and so companies are making these really interesting trade-offs and choices just given the cost constraints on one side and the value, the ROI on the other.
Katherine Forrest: Right, and then let's go back to that data point that you made a moment ago and assume that a developer wants to scale, but there really is a finite amount of data on the internet or available on or off the internet. And recent model releases are expected to have trained on pretty much all of the data that can be accessed. There are some who say that we're about to hit or have hit or are past hitting something called a data wall.
Anna Gressel: Yeah, it reminds me of a kind of famous quote, I think from as early as 2023, Katherine, saying, you know, we no longer have a moat because everyone has the same data. There are different kinds of moats that people are thinking about now, and differential value that's going to come from new areas. One of those could be new kinds of content partnerships to find more data, to get different kinds of data into the models. You know, some developers or bespoke organizations are actually working to develop more data from scratch. So they might have PhDs write new papers to train a model, or provide really sophisticated answers to template questions. They can also use something called synthetic data, which is a super interesting concept, and could include things like data generated by another model.
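The synthetic data idea can be sketched very simply: one trained model generates examples that become training data for another. In the sketch below, `generate` is a hypothetical placeholder for whatever text-generation call a developer actually uses, not a real API:

```python
# A minimal sketch of synthetic data generation: a "teacher" model answers
# templated prompts, and the prompt/response pairs become training data.

def generate(prompt: str) -> str:
    # Hypothetical placeholder for a call to an existing trained model.
    raise NotImplementedError("wire this up to a real text-generation model")

TEMPLATES = [
    "Explain {topic} at a graduate-student level.",
    "Write a worked example involving {topic}.",
]

def synthesize_examples(topics):
    # Yields prompt/response pairs suitable for use as training data.
    for topic in topics:
        for template in TEMPLATES:
            prompt = template.format(topic=topic)
            yield {"prompt": prompt, "response": generate(prompt)}
```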
Katherine Forrest: Right, and we haven't seen this next generation of models yet, the ones that will be trained on, let's just call it, data beyond the wall. Hey, we could do a whole episode on data beyond the wall. Right, it sounds like a sci-fi series, “Data Beyond the Wall.” But we're going to see how it all plays out. We know for sure that there's going to be additional development, and undoubtedly people are going to work on the ways to best solve this problem. So we'll come back to these developments, but I just wanted to say, in terms of governance and compliance issues, as these models scale, it's very important that in-house legal departments, for instance, understand the implications of that scaling, whether it's to roll out certain kinds of capabilities within their organization, which would be task scaling, or, if it's capability scaling, whether or not it actually triggers some of the EU AI Act risk categories. And so I think that there are going to be scaling implications for the in-house folks in our audience.
Anna Gressel: Yeah, I mean, it's such an interesting time. We haven't really seen the full scope of this next generation of technologies. And it's hard to see exactly how it's going to pan out. But one thing we can point to with certainty is that companies are really beginning to explore this seriously. And we're going to be looking at models that are trying to surpass any plateaus in different ways. And that'll be its own kind of competitive exercise.
Katherine Forrest: All right, that's all we've got time for today. And I hope folks are enjoying their increasingly cool holiday season that's starting now. And I am Katherine Forrest signing off.
Anna Gressel: And I'm Anna Gressel. Make sure to like and share the podcast.