Demystifying the Mixture-of-Experts Approach
Katherine and Anna break down the concept of “Mixture of Experts,” an innovative AI technique that enhances efficiency and performance and powers some of today’s most advanced large language models.
Katherine Forrest: Hey, good morning everyone and welcome to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel.
Katherine Forrest: And Anna, today I am actually in Maine. I’m in Maine with my travel mic. It’s a new version—I got it off of Amazon—from the one that my dog ate. A new travel mic. And I’ve got it now wedged against a nice little stand here and I’m ready to go in this stunningly beautiful place in Maine.
Anna Gressel: I can look out your window and it actually does look stunningly beautiful.
Katherine Forrest: All right. So now, what’s on our plate for this episode?
Anna Gressel: So we have a really interesting topic for us today, Katherine: Mixture-of-Experts models. This one is much more of a technical and engineering deep dive than we normally do.
Katherine Forrest: Okay, we’re not going to—for the audience, don’t despair. We’re not going to lose you. What we’re going to try to do, Anna, is as we take this dive down, we’ll baseline our listeners on what a Mixture of Experts is and why we should think about them. So why don’t you go ahead and just kick us off.
Anna Gressel: Awesome, let’s do it. So at a high level, Mixture-of-Experts or MoE models, as they’re often called, use a set of techniques to get a larger foundation model at a lower cost while making the decision or prediction processes, sometimes called inference processes, more efficient. And one reason listeners should care about this is that many of the leading models in the world, though it often goes unreported, are thought to be Mixture-of-Experts models.
Katherine Forrest: All right. And for those listening, and for organizations that develop or license a model, it’s really useful to think about this thing that we’re calling “Mixture of Experts,” because you effectively get a larger model at a lower cost with more efficient predictions. So you’re going to have your business people starting to talk about these things, and you’re going to be asking questions about how you can get your compute costs down while keeping the quality of the AI model the same or even enhancing its abilities. And that’s why we’re talking about it.
Anna Gressel: Yeah, and I used the term inference earlier, so why don’t we just talk a little bit about that for a moment. That’s really what we call the amount of computation necessary to get a finished model to output responses. Those are the things that come back at you once you put in a prompt. And this is different from the computation required to actually train a model, which is really computationally heavy, as we’ve talked about in other episodes.
Katherine Forrest: Right, and another thing to do here, as we sort of take that dive into the technology down, down, down, is to unpack the name “Mixture of Experts.” I like the name because it’s something to aspire to, to have a mixture of experts within any organization, but here it’s a mixture of AI experts within the tool or the model itself.
Anna Gressel: Absolutely. That gets right to the heart of what’s so interesting about these models for any of the engineers in your life. But Katherine, I think a good place to start this topic today would be on your recipe analogy for understanding transformers and the attention mechanism. Can you talk us through that again?
Katherine Forrest: Okay, so I’ll give you a short version. And in this short version, you’re looking for the best bread recipe. So imagine that you’re starting with a data set composed of all of the bread recipes that have ever been published. And you’ve got essentially a bread LLM. And you take the recipes and you feed them into a neural network based on the transformer architecture, which is the architecture at the base of models like ChatGPT and Claude. The model then analyzes the relationships between the various words in these bread recipes and the context in which they occur—for instance, what stage of the bread recipe—and it then weights them. And the weights the model learns for those relationships are called parameters.
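A quick aside for readers who like to see the idea in code: below is a minimal, illustrative Python sketch of that weighting step. The “recipe” vectors are made up, and real transformers use learned query, key and value projections across many layers, so treat this as a toy illustration rather than the actual architecture of any model.

```python
import numpy as np

def toy_attention(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (num_words, dim) toy vectors for the words of a recipe."""
    scores = embeddings @ embeddings.T                # how strongly each word relates to every other word
    scores = scores / np.sqrt(embeddings.shape[-1])   # scale, as in scaled dot-product attention
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax: each row becomes attention weights
    return weights @ embeddings                       # each word becomes a weighted mix of the others

# Five "words" of a bread recipe, each a made-up 4-dimensional vector.
recipe = np.random.default_rng(0).random((5, 4))
print(toy_attention(recipe).shape)  # (5, 4)
```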
Anna Gressel: So let’s really extend that recipe analogy here and imagine if the model was one expert. That is to say, let’s pretend the bread LLM is actually a baker doing all of the work, and let’s put a little anthropomorphized spin on it. The baker takes your question, “what’s the best bread recipe?” and spits out a response. It’s just one very busy, very talented baker doing everything.
Katherine Forrest: Okay, so we’ve got that. Now, how is that process different in the context of Mixture of Experts?
Anna Gressel: Right, imagine the Mixture-of-Experts model as a whole kitchen of bakers with a head chef. The head chef gets the order, your input question, and then she routes the different components of that question to different bakers in the kitchen. You know, for everyone watching “The Bear,” I’m sure you are right on the same page with me.
Katherine Forrest: Yes, chef.
Anna Gressel: So say she has eight bakers and each baker specializes in a different thing. One baker is the fluffiness of the bread, another is the healthfulness, another is taste. So for each order, only a few of the bakers are going to be used at a time, but all of them might be used for the full order.
Katherine Forrest: All right, so if I understand it correctly, Chef, a key value add of a Mixture-of-Experts model is that you have a whole bunch of these little experts inside the model that together can handle anything. But for a particular task, only a subset of them have to be activated, like a sous chef.
Anna Gressel: Yep, that’s right.
Katherine Forrest: Okay, so what’s the point? The point is that if you’re only using a few of the experts to answer a question, you’re limiting where the model has to go to look for the answer. So let me give you an example. Let’s say you want a fluffy loaf of bread with a lot of butter. So you’re looking for that recipe. What happens is the model then uses the experts about fluffiness and about butteriness, and it may not be knocking on the door of the healthfulness expert at all. So it’s knocking on fewer doors and therefore saving some compute.
So in the Mixture-of-Experts models, you effectively have a head chef that’s really good at dividing labor and picking which sous chefs to call on.
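For readers who want to see the head-chef idea in code, here is a minimal, illustrative Python sketch of that kind of routing: a router scores eight hypothetical “bakers” and activates only the top two for each input, mixing their outputs by the router’s weights. The sizes here (eight experts, top two, a 16-dimensional vector) simply mirror the kitchen analogy; they are not the configuration of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, dim = 8, 2, 16   # eight "bakers," only two used per order

router_w = rng.normal(size=(dim, num_experts))                       # the head chef's scoring weights
experts = [rng.normal(size=(dim, dim)) for _ in range(num_experts)]  # each expert is a tiny linear layer here

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (dim,) a single token's hidden vector; returns the mixed expert output."""
    logits = x @ router_w                          # score every expert for this input
    chosen = np.argsort(logits)[-top_k:]           # pick the top-k experts
    gate = np.exp(logits[chosen])
    gate = gate / gate.sum()                       # softmax over only the chosen experts
    # Only the chosen experts do any work, which is where the compute savings come from.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, chosen))

print(moe_layer(rng.normal(size=dim)).shape)  # (16,)
```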
All right, folks, so that’s mixture of experts at a high level. And I want to leave our listeners with a couple of papers that they may or may not want to look at. I think they’re really interesting. First is one called “A Closer Look into Mixture-of-Experts in Large Language Models” with Ka Man Lo as the first author, dated 26 June, 2024, and also a paper by Mistral, which is actually called “Mixtral of Experts.” Both of these papers are terrific explanations of this Mixture of Experts concept that we’ve been talking about today.
So that’s all we’ve got time for. I’m Katherine Forrest.
Anna Gressel: And I’m Anna Gressel.
Katherine Forrest: All right everybody, listen to our podcast. You heard it here.