Did DeepSeek out-innovate American AI because of restrictions?
The more I learn about the founder of DeepSeek (Liang Wenfeng), the more I question what I wrote yesterday about DeepSeek’s training costs. I expressed skepticism that r1 was trained for roughly $6 million rather than the hundreds of millions spent by American companies building foundation models.
What if what DeepSeek stated is actually the truth?
One of Liang’s biggest decisions was to make his code open-source, meaning anyone can access it. He said he wanted DeepSeek to break the monopoly of big tech companies.
“For technologists, having others follow your work gives a great sense of accomplishment,” he said in an interview last year with 36Kr. “Open source is more of a culture rather than a commercial behavior, and contributing to it earns us respect.”
We assume that Liang has a mental makeup similar to Sam Altman’s. His actions and words indicate that his motivations and mindset are fundamentally different.
“It is like buying a piano,” Liang told Chinese tech publication 36Kr in 2023, talking about the chip purchases. “Firstly, it’s because you can afford it. And secondly, it’s because you have a group of people who are eager to play music on it.”
American tech leaders assume that DeepSeek is using unauthorized H100s, despite DeepSeek stating that it used H800s.
H100s vs H800s
Nvidia’s H800 is a nerfed H100 designed to meet U.S. export restrictions created by the outgoing Biden administration and lobbyists representing some of the current market leaders. The core difference between the H800 and the H100 is drastically reduced chip-to-chip interconnect (NVLink) bandwidth, which makes large-scale, multi-GPU AI training less efficient on the H800 than on the H100.
Here’s the thing: a huge number of the innovations I explained above are about overcoming the reduced interconnect bandwidth implied by using H800s instead of H100s. Moreover, if you did the math on the previous question, you would realize that DeepSeek had an excess of compute; that’s because DeepSeek programmed 20 of the 132 processing units (streaming multiprocessors) on each H800 specifically to manage cross-chip communications. This is impossible to do in CUDA. DeepSeek engineers had to drop down to PTX, a low-level instruction set for Nvidia GPUs that is basically like assembly language. That is an insane level of optimization that only makes sense if you are using H800s.
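To give a rough sense of what dedicating processing units to communication means, here is a simplified, hypothetical CUDA sketch. It is not DeepSeek’s implementation (which reportedly required PTX-level control); it only illustrates the general idea of reserving a fixed subset of thread blocks for communication-style work while the rest do compute. The buffer names and placeholder arithmetic are invented for illustration.

```cuda
// Hypothetical sketch: reserve a fixed share of thread blocks for
// "communication" duty. NOT DeepSeek's code; their scheme reportedly
// operates at the SM/PTX level and drives real cross-GPU transfers.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TOTAL_BLOCKS = 132;  // the article cites 132 processing units per H800
constexpr int COMM_BLOCKS  = 20;   // share reserved for communication duty

__global__ void partitioned_kernel(float *comm_buf, float *compute_buf, int n) {
    if (blockIdx.x < COMM_BLOCKS) {
        // "Communication" blocks: each walks its own slice of a staging
        // buffer, standing in for driving cross-chip transfers.
        int stride = COMM_BLOCKS * blockDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            comm_buf[i] += 1.0f;
    } else {
        // "Compute" blocks: each walks its own slice of the work buffer,
        // standing in for the actual training math.
        int cb = blockIdx.x - COMM_BLOCKS;  // compute-block index
        int stride = (TOTAL_BLOCKS - COMM_BLOCKS) * blockDim.x;
        for (int i = cb * blockDim.x + threadIdx.x; i < n; i += stride)
            compute_buf[i] = compute_buf[i] * 2.0f + 1.0f;
    }
}

int main() {
    const int n = 1 << 20;
    float *comm_buf, *compute_buf;
    cudaMalloc(&comm_buf, n * sizeof(float));
    cudaMalloc(&compute_buf, n * sizeof(float));
    cudaMemset(comm_buf, 0, n * sizeof(float));
    cudaMemset(compute_buf, 0, n * sizeof(float));

    partitioned_kernel<<<TOTAL_BLOCKS, 128>>>(comm_buf, compute_buf, n);
    cudaDeviceSynchronize();

    printf("launched %d blocks, %d reserved for communication\n",
           TOTAL_BLOCKS, COMM_BLOCKS);
    cudaFree(comm_buf);
    cudaFree(compute_buf);
    return 0;
}
```

In reality, pinning work to specific SMs takes persistent kernels and occupancy tricks well beyond this; the point is only that a fixed slice of the GPU’s parallelism is carved out to keep data moving between chips.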
Friction Creates Innovation
What the passage above implies is that being limited to H800s pushed the DeepSeek team to become efficient in ways that US AI companies have not had to. Had chip restrictions not been in place, DeepSeek would likely have pursued model training the same way American AI companies do.
CUDA is a high-level programming framework for Nvidia GPUs. It provides APIs and libraries that make it straightforward to write parallel code. PTX (Parallel Thread Execution) is a lower-level intermediate language for Nvidia GPUs.
CUDA: Offers abstractions (kernels, built-in functions, libraries) and a friendlier interface for most developers.
PTX: Gives direct, fine-grained control over the GPU, allowing you to program features or optimizations that aren’t exposed at the CUDA level, such as managing communication lanes on the H800 to compensate for its reduced interconnect bandwidth. See the sketch below for how the two fit together.
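To make the distinction concrete, here is a minimal, hypothetical sketch of how inline PTX is embedded in ordinary CUDA code. It is not DeepSeek’s code and does nothing clever; it only shows the mechanism: CUDA’s asm statement lets you drop individual PTX instructions into a kernel when you need control the high-level API doesn’t expose.

```cuda
// Minimal illustration (not DeepSeek's code) of mixing CUDA and inline PTX.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_one(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    int v = data[idx];
    int out;
    // The plain-CUDA version of this line would simply be: out = v + 1;
    // Inline PTX gives instruction-level control over what the GPU executes.
    asm volatile("add.s32 %0, %1, 1;" : "=r"(out) : "r"(v));
    data[idx] = out;
}

int main() {
    const int n = 8;
    int host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    int *dev;
    cudaMalloc(&dev, n * sizeof(int));
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<1, n>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < n; ++i) printf("%d ", host[i]);  // prints 1 2 ... 8
    printf("\n");
    return 0;
}
```

Replacing a single add this way buys you nothing; the payoff comes when entire communication and scheduling paths are hand-tuned at this level, which is the kind of work DeepSeek reportedly did.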
DeepSeek is an interesting company not because of the r1 model but because of its culture of curiosity, which allows it to attract the best talent and positions the organization to compete long-term. An exchange from one of Liang’s interviews illustrates this:
Waves: Why do Chinese companies — including the huge tech giants — default to rapid commercialization as their #1 priority?
Liang Wenfeng: In the past 30 years, we’ve emphasized only making money while neglecting innovation. Innovation isn’t entirely business-driven; it also requires curiosity and a desire to create. We’re just constrained by old habits, but this is tied to a particular economic phase.
Waves: But you’re ultimately a business organization, not a public-interest research institution — so where do you build your moat when you choose to innovate and then open source your innovations? Won’t the MLA architecture you released in May be quickly copied by others?
Liang Wenfeng: In the face of disruptive technologies, moats created by closed source are temporary. Even OpenAI’s closed source approach can’t prevent others from catching up. So we anchor our value in our team — our colleagues grow through this process, accumulate know-how, and form an organization and culture capable of innovation. That’s our moat.
This same culture may have led the DeepSeek team to do something unfathomable to American companies that didn’t face the same constraints. It’s a possibility worth considering.