
DeepSeek Rising

In this episode of “Waking Up With AI,” Katherine Forrest delves into the groundbreaking advancements of AI newcomer DeepSeek's R1 and V3 models. She explores how this Chinese tech company is challenging the status quo and making waves in the AI space.


Katherine Forrest: So hello, everyone, and welcome to today's episode of “Waking Up With AI,” a Paul, Weiss podcast. I'm Katherine Forrest. And you're going to have to put up with me solo today because Anna is off on more of her AI adventures in far-flung places around the world, doing incredibly interesting things that she's going to come and tell us all about next week. But in the meantime, she did the last episode solo, and I'm doing this one solo. We don't mean to make it a solo podcast. It's just the way it's worked out. We like our banter.

What we're going to do today, and I'm going to take all the fun so that Anna, who also is very interested in this topic, is just going to have to listen to this podcast. So Anna, if you're listening to this podcast, I just want to say I got to DeepSeek first because that's what today's episode is going to be about. It's going to be about DeepSeek and the models that it's recently released, in particular the R1 model, although it's also released a couple of other models in the last month. V3 is another one, and we'll talk about that. And the reason that we're going to be talking about DeepSeek is because everybody's talking about DeepSeek. And so, while I don't want to just jump on the bandwagon of what everybody else is talking about, it is actually newsworthy. And that's why we've got so many commentators and so many news articles talking about DeepSeek. So I want to add my voice to theirs and, for this audience, give you sort of an overview of what's happening in R1 model world, this DeepSeek world that has got everybody's attention.

So it's worth saying that I'm recording this particular episode on January 31st, 2025. And the reason I say that is, as you folks know, it'll be released after that. And there might be interesting new developments between today, January 31st, and next week when this is finally released, because the R1 model and the DeepSeek novelty are now getting so much attention, and so many people are really digging into the code, that there might be even more developments. So just bear with me and take this as an overview.

All right. So one starting point for this discussion about DeepSeek is that Marc Andreessen, who's a tech investor, called the DeepSeek release of the R1 model and the related models “AI's Sputnik moment.” My interpretation of what he meant by that, for those of you who may not remember Sputnik firsthand (and I don't, because I was not yet born, but I learned about it in history class), is that there was a moment in American history when there was a realization that the Russians were actually ahead of, or at least appeared to be ahead of, the Americans in the space race. Americans had believed, in the 50s, that our space programs were chugging along and that we were ahead of the Russians in all things technical, military and space-related. And then in 1957, the Russians successfully launched a satellite into space. That satellite, Sputnik, was able to transmit back to Earth, and it got there before anything the Americans had launched. It was a real wake-up call, and many people point to the fact that the Russians were able to put this transmitting satellite into orbit ahead of the Americans as the thing that caused this sort of awakening.

And so when Marc Andreessen says this is AI's Sputnik moment when we're talking about DeepSeek, we're talking about the fact that we have been in the United States ahead, and really very proudly so, of many parts of the world in terms of our AI development efforts, with a variety of different companies located in different parts of the country, but a lot of it in Silicon Valley, but not all of it in Silicon Valley. And we've got these companies that are really well funded and coming up with highly capable models and lots and lots and lots of technical advancements.

But we now have DeepSeek, and by the way, there are other companies in the world as well. DeepSeek is not the only non-U.S. company, but this is the one that's making the big splash because of the capabilities and some of the things that we'll talk about today. But what we've got is this company, it's a Chinese company, that has released an open source series of models. And remember, there's a difference between open source and closed models. Open source models like the Meta Llama herd of models are open source. Open source models allow people to see the code, get into the code, work with the code, and the DeepSeek models are also open source. And so that's really a big advancement.

So the Sputnik moment happens a couple of weeks ago, and let's dig into that now and talk about why that happened. On December 26th, there was an initial release of an early DeepSeek model, the V3, and the technical paper relating to the V3 model came out. But that's not what caused the big headlines. What caused the big headlines was on January 20th, 2025, just 11 days before today when I'm recording this: the release of the R1 large language model, the R1 DeepSeek LLM, and that is a chatbot. You can actually download it onto your phone. And almost immediately, within days, it became the Apple App Store's most downloaded app, and it continues to be one of its most downloaded apps, although there are now some restrictions on getting it, not regulatory restrictions, just that it's being rolled out in a different way.

But before this, no one really had been paying a huge amount of attention to DeepSeek. They were a tech company, but one that only came into existence in May of 2023. DeepSeek is the subsidiary of a company called High-Flyer Capital Management, which is a Chinese hedge fund. That hedge fund is associated with, and I don't know exactly what the ownership structure is, an individual named Liang Wenfeng, W-E-N-F-E-N-G, from Hangzhou, China. In May of 2023, High-Flyer Capital Management started the DeepSeek subsidiary, which was dedicated to AI model development. So they start working after ChatGPT has already been released. If you recall the chronology, ChatGPT was released in the late fall of 2022. Kevin Roose of The New York Times publishes his article about his unsettling Valentine's Day conversation with Bing's chatbot in mid-February of 2023. So we're in May of 2023 when this company gets started, and DeepSeek starts working away on its various models. And it comes out now in late December and then into January with these V3 and R1 DeepSeek models.

So what's surprising about them? Why do we care so much about the release of these Chinese models that are now available for people to download? Well, first of all, they're highly capable. So let's just start with that, and then we'll get to one of the big punchlines. They're highly capable models, and in the technical report that DeepSeek itself put out, they're compared to the OpenAI o1 model, among others. That technical report is dated December 27th, 2024, and you can get it on arXiv, A-R-X-I-V. DeepSeek actually describes its own models as similar in capability to, and in some respects it claims more capable than, the Claude Sonnet family of models and the Llama herd of models, and it also compares them to the o1 model.

So you've got this highly capable set of models, particularly the R1, which I'll be really focusing on, and they're reasoning models. And so what does that mean? Well, like the o1 model, R1 is a model that can engage in chain-of-thought reasoning, meaning that the model is able to, in a sort of step-by-step way, explain how it gets to a particular answer. And this chain-of-thought reasoning, as you may recall from a prior episode of this podcast, is at least one way that a number of engineers believe actually increases the accuracy and the reasoning capabilities of certain models. When the model is asked to essentially slow down, take things step by step and explain its reasoning, then for some reasons we understand and some reasons frankly we don't, the model does better with the actual reasoning. So the R1 model is a reasoning model, all right? It's highly capable, and when its metrics are presented against other models, it scores, it claims, above them in math reasoning. Math reasoning is a particularly complicated area of reasoning for LLMs, and it scores either at or above other models in other kinds of reasoning. And so, highly capable.
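To make the chain-of-thought idea a bit more concrete for the technically inclined: R1 emits its step-by-step reasoning between `<think>` and `</think>` tags before giving its final answer. Here is a minimal Python sketch of how one might separate that reasoning trace from the answer in an R1-style response; the response text itself is invented for illustration.

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into its chain-of-thought and final answer.

    DeepSeek-R1 emits its step-by-step reasoning between <think> and
    </think> tags before the final answer; this helper pulls the two apart.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if not match:
        # No reasoning trace found: treat the whole response as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()
    return reasoning, answer

# Invented example of what such a response might look like:
raw = "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>The answer is 408."
cot, final = split_reasoning(raw)
print(cot)    # the step-by-step trace
print(final)  # The answer is 408.
```

The point is simply that the reasoning is part of the model's visible output, which is why you can read along as R1 "thinks."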

Now, what's the big deal? We've gotten sort of used to these highly capable models being out there, and we'll talk about some security implications in a moment. The big deal is that this model is allegedly, and we're not able to fully test this, a lot, a lot cheaper to make than the other models I've been mentioning during this episode. There is a claim in the technical paper that it cost $5.57 million to train the DeepSeek V3 model, and similarly a figure of $6 million has been thrown around for R1. That sounds like a lot of money, and it is, but you should know that many of the models we've been talking about, like the OpenAI models or the Meta Llama herd of models or the Claude Sonnet models, can cost many tens of millions of dollars to train. Why? Why are models expensive to train, and why is the DeepSeek model group not as expensive to train? Well, as you recall, there are three big components of AI that have made it into the juggernaut we see today. You've got to have architecture advances; we got that with transformer architecture. You then have to have compute, which is made up of two things: highly advanced semiconductor chips, and energy, whether that comes from solar power, fossil fuels or nuclear power, whatever the source, to allow the computational processes to occur. And you've got to have data. So one of the huge costs is in that compute area, the semiconductor chips plus the energy. And what the DeepSeek models do, allegedly, this is what the paper says, is use far fewer chips and much less energy. And that combination, how it does that and why it's nonetheless able to be so highly capable, that's the magic sauce.
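For a sense of where that headline number comes from: the DeepSeek-V3 technical report states roughly 2.788 million H800 GPU-hours for the full training run, priced at an assumed rental rate of $2 per GPU-hour. Those two figures are the report's own assumptions, not independently verified numbers, but the arithmetic is a quick back-of-envelope check:

```python
# Back-of-envelope check of the DeepSeek-V3 training-cost claim, using
# the figures assumed in the V3 technical report.
gpu_hours = 2.788e6          # reported H800 GPU-hours for the full run
dollars_per_gpu_hour = 2.00  # assumed rental price per H800 GPU-hour
training_cost = gpu_hours * dollars_per_gpu_hour
print(f"${training_cost / 1e6:.3f}M")  # $5.576M, the ~$5.57M figure
```

Notably, that figure covers only the final training run, not research, staff or prior experiments, which is one reason the cost comparison with U.S. labs is contested.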

So what we've got is a situation where previously the world had come to expect that Nvidia, Intel and the other chip makers, and there are many different chip makers, but Nvidia is certainly one that is very, very entrenched in the AI area, would see their stock prices continue to go up and up and up, because the need and the appetite for chips would continue to increase and increase. Now, DeepSeek sort of throws a monkey wrench into that. And again, we don't exactly know, because people are still looking at these models, how the facts will really sort themselves out. But it does look like DeepSeek is using fewer chips and less compute. And there are two papers that I'm going to refer you to for the technical aspects of why, architecturally, they're able to do this cheaply; I'll go into a little of it, but I won't bore you with all of it.

One paper is this technical report that I keep talking about, the “DeepSeek-V3 Technical Report” that came out on December 27th, 2024. But there's another paper that came out on January 25th, 2025, from several Apple researchers and an MIT researcher, with lead authors Abnar, A-B-N-A-R, and Shah, S-H-A-H. That article is called “Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models,” and it is also available on arXiv. I would recommend both of those papers to you.

So what the papers suggest, and I'll focus on the DeepSeek technical paper itself because it's its own explanatory paper (the Apple and MIT paper talks about similar concepts and explains them in additional ways), is that they've been able to achieve efficiency in their model by taking the transformer architecture, which is the architecture we're familiar with from the ChatGPT model, the o1 model and the Meta models. The models we've been talking about on prior episodes are all part of the transformer set of architectures, where you've got data that's broken into tokens that then get ingested into the model. The model then analyzes the tokens of data and creates relationships between different pieces of data, and that's all happening within the neural network. What the DeepSeek model does that's different is that it doesn't predict just one token at a time at lightning speed, as some of these other models do. It engages in what's called multi-token prediction. That was actually proposed for the first time, I believe, by Meta researchers, but it has now found its sort of public moment in the DeepSeek models. And so part of the efficiency is achieved through this multi-token prediction. DeepSeek also uses something called a Mixture-of-Experts architecture, which we've mentioned in prior episodes. That's when different tasks are broken down, and you have the portion of the model that is an expert in one type of analysis conduct that type of analysis, so only part of the model does the work for any given input. That's also at work. And then there are various ways in which DeepSeek trains its models that create yet additional efficiencies.
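To give the technically curious a flavor of the Mixture-of-Experts idea: a small gating function scores the experts for each input, and only the top-scoring expert (or a few of them) actually runs, so most of the model's parameters sit idle on any given token. Here is a toy sketch in Python; the experts and gate weights are invented for illustration and bear no relation to DeepSeek's actual architecture, where the experts are neural sub-networks rather than simple functions.

```python
import math

def softmax(xs):
    # Turn raw gate scores into a probability distribution over experts.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_weights, top_k=1):
    # Gate: one score per expert for this particular input.
    scores = softmax([w * x for w in gate_weights])
    # Route to the top_k experts only; the rest are never evaluated,
    # which is where the compute savings come from.
    ranked = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    # Combine the chosen experts' outputs, weighted by gate score.
    total = sum(scores[i] for i in chosen)
    return sum(scores[i] / total * experts[i](x) for i in chosen)

# Three toy "experts", each good at a different (made-up) transformation.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: x * x]
gate_weights = [0.1, -0.2, 0.5]

print(moe_forward(3.0, experts, gate_weights, top_k=1))  # only expert 2 runs
```

With `top_k=1`, two of the three experts never execute for this input, which is the sense in which a sparse model can have a huge parameter count but a much smaller per-token compute cost.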

So here's an example that was given to me that I think is actually really useful to help explain this. Say you want to make wine, and I like wine; I actually like Sauvignon Blanc. If you want to make wine more cheaply, but with a taste as good as a wine made very expensively, you can do one of two things. You can figure out a way to make the grape, say, juicier, or give it the sugar content and the other taste qualities that you want. Or you can figure out a way to make the machinery that extracts the ingredients you want from the grapes more effective. So you've got two ways of doing it: you can invest in ever juicier, genetically refined grapes, or you can take the grape you already have, without making it ever more genetically complicated, and just extract the contents from that grape more efficiently. It's that kind of extraction efficiency that's at work here. What's at work with DeepSeek is taking a transformer architecture, taking the data, taking the tokens, and doing the process of analysis, and of training, much, much more efficiently.

There are a couple of issues that I want to raise with the DeepSeek models. It's early days, so we're going to learn a lot more about all of this. But we do know that there are certain ways in which the content that the DeepSeek model will produce in response to a query can actually end up differing from the major and most popular U.S. models. For instance, there have been a couple of studies now, and these are just studies I'm looking at; if you look at bgr.com/tech/deepseek-r1-harmful-content and the Enkrypt study, E-N-K-R-Y-P-T, you'll see what I'm talking about. One analysis says that the DeepSeek R1 model shows three times more algorithmic bias than Claude 3 Opus. Or that the DeepSeek model is four times more vulnerable to generating insecure code than OpenAI's o1. Or that the DeepSeek model is more likely to output toxic content than the GPT-4o model. And the article goes on and on. So that's going to be interesting to watch. I think there are going to be a lot of folks studying this model to see whether or not those biases and that harmful content are in fact there. We do know that the DeepSeek model has been trained not to give certain answers that relate to certain Chinese political events. But, you know, all models, frankly, are often trained to stay away from certain kinds of content in the U.S. as well: a lot of it harmful content, biased content, CSAM content, things like that. Those guardrails are rather routine.

So one of the other things, before I end my monologue about DeepSeek, is that we're going to be watching and seeing what happens in the United States, where we have the Trump administration that's declared a desire to be dominant in AI, but now we've got this DeepSeek series of models that have come out. And we know that the U.S. has, for a little while now, been exploring sanctions and export controls designed to hinder or slow AI growth in certain parts of the world for geopolitical reasons. And actually, on the eve of the presidential transition, the Bureau of Industry and Security, which is part of the Department of Commerce, came out with something called the Artificial Intelligence Diffusion Framework. You've got to love a title like that, right? The Artificial Intelligence Diffusion Framework, whatever. That actually extended and unified certain kinds of export controls relating, in part, to a number of things, but in part to semiconductor chips. Those are, going back to what we've talked about, part of the essential compute requirement needed to power AI models. But what DeepSeek has said is that while export controls have been a problem for it, the company had actually already acquired thousands of state-of-the-art GPUs, and as we know, you can use that term interchangeably with chips, before some of these controls were put into place. But we're going to be watching what the United States might do next.

Now, one last piece. I think I said the last piece was the last piece, but this is actually the last piece, which is that we also know that the user queries and user data for the DeepSeek models are stored in China. So there are going to be a number of questions about various kinds of national security implications of the DeepSeek models, what's being put into them and how they're being used, and whether or not there's something in the code that could carry spyware or other interesting payloads. All of that will be looked at, and you'll be reading an awful lot about it. What this does say is that the United States needs to double down if it wants to maintain, or regain, depending on your view of the DeepSeek models, its lead in AI models. And what is happening with the DeepSeek models suggests, at the very least, very significant engineering and architectural advances, different ways of training, and a potentially very significant cost difference, benefits that can then be used by companies all over the world to make AI models that are less expensive as well as highly capable.

So that's my DeepSeek monologue for now. Like many topics we cover, we're going to come back to it. And that's it for today. Next week we hope to have Anna back and telling us all about her travels. Actually, she'll probably be recording from her travels, wherever they may have taken her. Maybe she'll click her ruby slippers together and say, “there's no place like home,” and come on home, Anna. And so we can do our banter again. But in the meantime, thanks for joining us today. Please make sure to like and share the podcast, and we'll talk to you next week.


© 2025 Paul, Weiss, Rifkind, Wharton & Garrison LLP
