Scaling Logic: Tracing the Eager Mind
Note: This is part one of my takeaways from the PyTorch Conference 2025. You should be able to find part two here
It’s all about Scaling
It’s been a few weeks since the 2025 PyTorch Conference, and it’s becoming more and more obvious that everyone is placing big bets on machine learning compilers. A few years ago it wasn’t necessary for a machine learning engineer to even think about squeezing the last bit of performance out of the underlying hardware. Nowadays you can’t even run some moderately sized models on top-tier retail hardware.
For example, the cheapest way to run inference on 4-bit quantized DeepSeek R1, with 671B parameters, is a Mac Studio with 512GB of unified memory, which costs ~$9,500 as of this writing. It costs an eye-watering $1.9 million to get one’s hands on the latest NVIDIA GB200 NVL72 data center system, a rack-scale solution featuring 72 NVIDIA Blackwell GPUs. And it’s not just the money: there is a 12-15 month backlog, since demand from cloud and data center providers has skyrocketed.
But scaling is hard. Hardware engineers keep churning out new hardware features, and software people keep playing catch-up. It takes the entire CUDA software stack approximately 6 months to approach roofline efficiency after each new GPU release, and the next release is only 6 months away. Can we build a scalable solution to this ultimate hardware-software dilemma?
Compiler to the rescue
In my mind, the compiler approach is always graph-oriented, in contrast to the eager/interpreter approach where you run your code line by line. The earliest debate along these lines one could think of is RISC vs. CISC. The ultimate question is whether to perform a sequence of computations in one shot, with the benefit of optimizing them as a whole, or to perform them in many shots, which leaves the optimization burden (and flexibility) to the user.
In other words, would one rather write this code and let the compiler handle the loop unrolling:
for (uint32_t i = 0; i < 100; ++i) {
    do_something();
}
Or would one prefer to manually unroll the loop to align it better with the CPU pipeline?
for (uint32_t i = 0; i < 100; i += 4) {
    do_something();
    do_something();
    do_something();
    do_something();
}
In an ideal world, a machine learning engineer shouldn’t have to care about the underlying hardware architecture at all, since the compiler would handle everything and generate one fused, optimized kernel for the entire model so that it runs at peak efficiency on the target hardware. One can argue that this vision is pretty close to the reality of the CPU world: unless you are extremely performance-focused, it’s common to just use stock LLVM/GCC with a few tweaks to how your programs are compiled. However, this doesn’t hold in the current white-hot GPU world.
Take XLA and torch.compile, two of the best machine learning compilers out there for JAX/TPU and PyTorch/CUDA respectively, as examples. With its tight integration with the TPU ecosystem, XLA is able to deliver tiling and pipelining on TPU, which is nice. But you need to define your model in JAX, compile it with a closed-source XLA distribution, and run your compute on Google’s proprietary TPUs, which are not currently sold and can only be rented on Google Cloud. In other words, you run the risk of vendor lock-in in exchange for quasi-best performance delivered out of the box.
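To make the "whole graph in one shot" idea concrete, here is a minimal sketch (the toy function and shapes are my own, purely for illustration) of how JAX hands an entire function to XLA via `jax.jit`, which can then fuse and tile the whole thing:

```python
import jax
import jax.numpy as jnp

@jax.jit  # trace once; XLA sees the whole function as one graph to fuse
def mlp(x, w):
    # matmul + element-wise relu + reduction, all candidates for fusion
    return jax.nn.relu(x @ w).sum()

x = jnp.ones((4, 8))
w = jnp.ones((8, 2))
out = mlp(x, w)  # first call compiles; later calls reuse the XLA binary
```

The same code runs unchanged on CPU, GPU, or TPU; the compiler, not the user, decides the kernel boundaries.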
On the other hand, torch.compile takes a completely different approach, with PyTorch being open source and NVIDIA GPUs being offered off the shelf. It can essentially be viewed as a two-part system: one part is a graph-tracing framework, and the other is a code generator for the traced graph. There can still be graph breaks, which limit how much of a model can be optimized later. And most of the time, people still write their own CUDA kernels and wrap them as custom torch ops when they want good kernel-level tiling and pipelining, which are arguably the most crucial optimizations for large models.
The In-Between Stack
One cannot blame torch.compile for not delivering some of the most crucial optimizations right out of the box for the latest generation of NVIDIA GPUs. Look at the definitions of Tensor Core operations for each GPU generation: the Tensor Memory Accelerator (TMA) was only introduced with Hopper, and on Blackwell there is already per-SM Tensor Memory (TM), making Tensor Core programming even more complicated. Even NVIDIA’s own software stack needs to catch up with the fast-evolving hardware spec. It’s a never-ending game.