
Mamba and AI Architectures

In this week’s episode of “Waking Up With AI,” Katherine Forrest and Anna Gressel delve into different types of AI architectures, from Transformer models to “Mamba” — a type of architecture that’s making significant headway in 2024.

Katherine Forrest: All right. Good morning, everyone, and welcome to another episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest.

Anna Gressel: And I'm Anna Gressel.

Katherine Forrest: And Anna, it's August. And so, while we're here recording this, we're not in foreign countries, but we are in vacation kinds of spots, yet we're still working.

Anna Gressel: That's true. Katherine, do you want to tell folks where you are today?

Katherine Forrest: Well, I'm in Maine. I think I've been in Maine for other episodes before. Even if I haven't mentioned it, I have in fact been in Maine.

Anna Gressel: That's true, that's true. And I'm out in Long Island, and it's a perfectly blue sky today. So, it's just really (…) really a lovely time to record a podcast and nerd out on some of our favorite AI topics.

Katherine Forrest: Yeah, and today's going to be a particularly nerdy one. You got to love technology that not only allows us to work remotely, but to record podcasts remotely too, and then talk about some extraordinary technological developments.

Anna Gressel: Yep, I'm a fan. My dog's a fan. We're all fans of our traveling podcast.

Katherine Forrest: All right. Let's jump right in, and what we're going to do is we're going to talk today about some important developments in AI models.

Anna Gressel: That's right. And we'll start with a basic point underneath it all, which is that every generative AI model relies on a choice of architecture. And today, most of the generative AI models that have been commercialized, like OpenAI's GPT models, Meta's Llama or Google's Gemini, are built on an architecture called the Transformer architecture.

Katherine Forrest: And we mentioned that Transformer concept in one of our early episodes.

Anna Gressel: Right, definitely. And some of our listeners will have heard of it. That architecture dates back to a now quite famous 2017 paper out of Google called “Attention Is All You Need.” For our purposes, I think we can describe Transformer architectures as neural networks where data is broken down into chunks, or tokens, and ingested into the model. And the architecture is based on the model analyzing that data by paying attention to the tokens and their relationships to one another and to various concepts.

Katherine Forrest: Yeah, for instance, if a Transformer model ingests a story about a cat, it takes the word “cat” and whatever context surrounds it in the story, turns them into tokens, and then pays attention to how that word “cat” relates to the context. It might relate it to other cats within the neural network, or it might relate it to dogs, and those relationships between the different tokens become weighted parameters, relationships between the words.
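
For the technically curious, here is a minimal, illustrative sketch of that attention idea in Python. The embeddings and projection matrices are random stand-ins for what a real model learns; the point is only the mechanics, in which tokens become vectors and attention weights mix information across them.

```python
import numpy as np

tokens = ["the", "cat", "sat", "on", "the", "mat"]
d = 8                                      # toy embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(len(tokens), d))      # random stand-ins for learned token embeddings

# In a real Transformer these projections are learned; random here just to show the mechanics.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)              # how strongly each token "looks at" every other token
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
context = weights @ V                      # each token's output mixes information from all tokens

print(np.round(weights[1], 2))             # attention distribution for the token "cat"
```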

Anna Gressel: So, let's contrast that with another big architectural paradigm that listeners may be familiar with: diffusion models. If you remember the early days of image generation, a lot of those image generation models, like Midjourney and Stable Diffusion, were actually diffusion models, and they're often used to generate graphics or images. But we're not going to talk about those in detail today.

Katherine Forrest: Right, those were text-to-image models where you enter a prompt. I did this once: I entered “a lady drinking tea in the park with a lion,” and the model output a photorealistic image of exactly that, a lady drinking tea in the park with a lion. And I did it several different times and came out with several different versions of that image.

Anna Gressel: Right. And so, let's focus today on these Transformer models, which have been immensely successful; most of the generative AI models that companies are working with today are built on them. However, despite that success, there are challenges that people are talking about and working through. And one of those is compute. That's because a crucial ingredient in Transformer success, the attention mechanism that we talked about a few minutes ago, imposes relatively high compute costs, the cost it takes to train or run the model. And that can translate into a very high financial cost.
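
To make the compute point concrete: standard Transformer attention scores every token against every other token, so the number of pairwise scores grows roughly with the square of the context length. A back-of-the-envelope sketch with toy numbers:

```python
# Standard attention builds a score matrix with one entry per pair of tokens,
# so the work to build it grows with the square of the sequence length.
for seq_len in (1_000, 10_000, 100_000):
    pairwise_scores = seq_len * seq_len    # entries in the attention matrix, per layer per head
    print(f"{seq_len:>7,} tokens -> {pairwise_scores:>18,} pairwise scores")
```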

Katherine Forrest: Yeah, and there's been a lot of work done to try to reduce those compute costs, because the financial implications can be very significant. So you've got researchers who are experimenting with different architectures. And I should also say, it's not just about compute costs. There are a variety of additional, complex reasons why other architectures for AI and generative AI have been the subject of a lot of experimentation.

Anna Gressel: So, one of the really fun things about working with Katherine is that we actually do spend our non-podcast coffee time talking about things like different architectures that are coming into play right now and that people are thinking about. And one of the architectures that Katherine, you and I have been talking about for quite some time is called Mamba. And that's a non-Transformer language model. It's actually called a “selective state space sequence model,” which is quite a mouthful.

Katherine Forrest: Right. It really is a mouthful. And if you can say it four times really quickly, you actually get a gold star. So let me give the audience a quick sense, a very quick sense, of how Mamba works. There's a selection mechanism within the model that enables it to selectively focus on the most important information in the input. So you can think of it as a model that doesn't have to look at every word in a paragraph or in a sequence in order to understand the overall context, the information, the semantic structure. The model is able to make judgments about the input and discard what it deems irrelevant. That lets it get a lot of bang for the buck, focusing on what matters rather than processing every single word, whereas the Transformer architecture really does attend to every single token.
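
To give a rough flavor of that selection idea, here is a deliberately simplified sketch: a recurrent state is carried along the sequence, and input-dependent gates decide how much of each token to absorb or ignore, so the work grows linearly with sequence length. This is not the actual Mamba algorithm, which uses structured state matrices, a discretization step and a hardware-aware parallel scan; it is only a toy illustration of the “selective” intuition.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_scan(inputs, Wa, Wb):
    """Toy 'selective' recurrence: the gates depend on the current input token."""
    h = np.zeros(Wa.shape[0])            # hidden state carried along the sequence
    outputs = []
    for x in inputs:                     # x: embedding vector for one token
        forget = sigmoid(Wa @ x)         # input-dependent: how much old state to keep
        write = np.tanh(Wb @ x)          # input-dependent: candidate update from this token
        h = forget * h + (1.0 - forget) * write
        outputs.append(h.copy())
    return np.stack(outputs)             # one pass over the sequence: linear time

rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 4))            # 6 tokens with 4-dim embeddings (toy numbers)
Wa = rng.normal(size=(8, 4))             # 8-dim hidden state
Wb = rng.normal(size=(8, 4))
print(selective_scan(seq, Wa, Wb).shape)  # -> (6, 8)
```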

Anna Gressel: That's exactly right, Katherine. And a model based on the Mamba architecture was just released by Abu Dhabi's Technology Innovation Institute, or TII, called the Falcon Mamba 7B. And it's really one of the first major releases, particularly on an open source basis, of a Mamba-based model.

Katherine Forrest: And it's an iteration of their Falcon series of models.
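
Because the weights are openly released, models like this can typically be loaded through standard tooling. The sketch below assumes Hugging Face transformers support and a repository id of “tiiuae/falcon-mamba-7b”; check TII's model card for the exact name and version requirements.

```python
# Hedged sketch: the repository id and library support are assumptions based on TII's
# announcement of Falcon Mamba 7B; consult the model card before relying on them.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"        # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "State space models differ from Transformer models because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```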

Anna Gressel: Now, Mamba goes back to a December 2023 paper from Albert Gu and Tri Dao, out of Carnegie Mellon and Princeton University respectively. And that paper is called “Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” Again, kind of a mouthful. And the word Mamba literally refers to a venomous African snake and was apparently chosen because of all of those “S” words in selective state space sequence model. See, I can't even say it, but you can just imagine it like a “ssss” and then think of a mamba.

Katherine Forrest: Right, well, thankfully, we're not going to go too far into the weeds on how these selective state space models work and how they differ from Transformers, though that's a really interesting topic. The headline for our audience is that this different kind of architecture is making significant headway in 2024, that it offers real computational efficiencies, and that some companies may be tempted to start working with and adopting these models.

Anna Gressel: I don't think we're saying that companies are going to stop using models based on Transformer architecture. They're robust, they're very advanced and they include some of the most powerful models today. But other model architectures are making headway, and companies are experimenting with them. And that's a really interesting, exciting moment to be in.

Katherine Forrest: And part of the story here is that generative AI, at the most fundamental level, is not a monolith. We've been talking a lot in some of our prior episodes, here and there at least, about the black box problem of interpretability. But when it comes to architectures, you can know the type of architecture a model uses and therefore understand some of the basic ways that model is supposed to work.

Anna Gressel: And particular architectures may also have strengths and be good at certain tasks. So, Mamba's original authors talked about things like audio and genomics tasks as potential strengths of that architecture.

Katherine Forrest: Okay, so the legal takeaway of all of this is not only that generative AI is not one type of monolith and that several types of architectures are actually going to be utilized in the near future, but also that regulators are going to be interested in this. They're going to be looking at some of the differences in the way those architectures impact the decision making of a particular model and whether or not different kinds of models are able to interact with one another.

Anna Gressel: And it's not just regulators that may be paying attention to architectural choices. For our listeners that are licensing in or considering licensing in generative AI applications, you too may want to know more about what architectural choices the developer has made and why they made those choices, because that might actually have implications for the benefits of the technology and whether it's fit for purpose for the application you're envisioning.

Katherine Forrest: Right, and we're going to come back to these models. I think there's going to be a lot of developments with the Mamba model, as well as other alternatives to Transformer architecture, but that's all we've got time for today. I'm Katherine Forrest.

Anna Gressel: And I'm Anna Gressel. Make sure you like and share the podcast if you've been enjoying it.


© 2024 Paul, Weiss, Rifkind, Wharton & Garrison LLP
