- Summary:
- Tesla's Dojo supercomputer was announced with fanfare - and a specific use case in mind. But how does it stack up against other high-performance computers? Will Dojo constitute a step forward for the HPC industry, and for Tesla's own automotive AI pursuits?
Elon Musk had been hinting for a couple of years that Tesla was developing its own supercomputer. At Tesla's AI Day, the company announced the arrival of Dojo, a supercomputer designed entirely in-house.
Dojo earns the label by virtue of its complexity and speed, though it differs from other supercomputers in quite a few ways; strictly speaking, it isn't technically a supercomputer yet, because it hasn't been entirely built out.
Tesla's Senior Director of Autopilot Hardware, Ganesh Venkataramanan, is the head of the project and was the point-person for the presentation.
The heart of the design is the Dojo D1 chip, which provides stunning bandwidth and compute performance. Tesla found existing computing platforms lacking for its primary problem: training the massive neural networks behind its self-driving technology. The company also hinted that it might make Dojo available to others developing AI in the near future.
Tesla's motivation for Dojo springs from the massive amount of video data captured by its large fleet of existing vehicles, which it uses to train its neural nets. Tesla was not satisfied with existing HPC (High-Performance Computing) options for training its computer-vision neural nets and decided it could create a better platform.
It's unusual for a supercomputer to be designed for just one problem. Time will tell whether Dojo's design is general-purpose enough to suit other industries and applications, particularly deep learning, optimization, simulation and NLP.
Existing supercomputers are more general-purpose than Dojo. HPC systems are optimized for very complex mathematical models of physical problems or designs, such as climate, cosmology, nuclear weapons and nuclear reactors, novel chemical and material compounds, pharmaceutical research and cryptology.
Just for historical reference, the first supercomputer was the 1964 Control Data Corporation 6600, capable of executing 3 million floating-point operations per second (FLOPS). Fast forward to 2020, and the PlayStation 5 has hardware capable of up to 10.28 teraFLOPS, roughly three million times faster. The fastest supercomputer today is clocked at about 450 petaFLOPS, roughly 44,000 times faster than the PlayStation, and Tesla claims Dojo will reach exascale: an exaFLOP is one quintillion (10¹⁸) double-precision floating-point operations per second. I can't be sure whether this is real or hype, because below we'll dig into some data that places Dojo far below exascale.
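These scales are hard to hold in your head, so here is a quick back-of-the-envelope check of the ratios in Python, using only the figures quoted above (bearing in mind that the PS5 number is single-precision while supercomputer ratings are double-precision, so it is a rough comparison at best):

```python
# Rough FLOPS comparison using the figures quoted in the text.
cdc_6600 = 3e6        # 1964: CDC 6600, 3 megaFLOPS
ps5      = 10.28e12   # 2020: PlayStation 5, 10.28 teraFLOPS (single precision)
fastest  = 450e15     # today's fastest supercomputer, ~450 petaFLOPS
exa      = 1e18       # Tesla's exascale target: 1 exaFLOP

print(f"PS5 vs CDC 6600:     {ps5 / cdc_6600:,.0f}x")  # ~3.4 million
print(f"Fastest vs PS5:      {fastest / ps5:,.0f}x")   # ~44,000
print(f"Exascale vs fastest: {exa / fastest:.1f}x")    # ~2.2
```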
There is also some controversy about how Dojo is measured. According to the TOP500 list, compiled twice a year, "Fugaku" in Kobe, Japan, holds the #1 spot as the undisputed fastest supercomputer in the world, with a demonstrated 442 petaFLOPS (it is widely believed that Fugaku is just getting started and could exceed an exaFLOP in its current configuration). That is a staggering three times faster than the #2 entrant, "Summit," at Oak Ridge National Laboratory in Tennessee, with a top speed of 149 petaFLOPS. Dojo, with its roughly 68.75 petaFLOPS, would then be in 6th place. But because the next few supercomputers are close behind at 61.4 to 64.59 petaFLOPS, and because Dojo's figure is not a like-for-like benchmark result, it could just as easily end up in seventh, eighth or even ninth place.
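To see how fragile that placement is, here is a toy sketch in Python. Only Fugaku, Summit and the 61.4-64.59 petaFLOPS cluster come from the text; the entries marked hypothetical are placeholders for unnamed machines, and a real TOP500 ranking would require a comparable Linpack run:

```python
# Figures in petaFLOPS; entries marked "hypothetical" are placeholders
# for unnamed machines, not real TOP500 data.
machines = {
    "Fugaku": 442.0,
    "Summit": 149.0,
    "third": 95.0,         # hypothetical
    "fourth": 93.0,        # hypothetical
    "fifth": 70.0,         # hypothetical
    "cluster high": 64.59,
    "cluster mid": 63.0,   # hypothetical
    "cluster low": 61.4,
}

def rank(flops, field):
    """1-based rank of a new machine inserted into an existing field."""
    return sum(1 for v in field.values() if v > flops) + 1

print(rank(68.75, machines))  # -> 6: the placement quoted above
# If Tesla's number is single precision and the list is double precision,
# the comparable figure could be much lower, e.g. roughly halved:
print(rank(34.0, machines))   # -> 9: the pessimistic end of the range
```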
The top supercomputers today cost $500 million or more and often occupy two 5,000-square-foot buildings. They are designed to process very complex mathematical calculations at scale, so it would rarely make sense to build one for a single application.
While the top 10 or so are devoted to defense and intelligence applications, at least a third of the machines on the TOP500 list are dedicated to healthcare, and many support crucial drug-related research. I know of only two supercomputers used in a private-enterprise setting, one of them in an oil- and gas-related study. It is no secret that most of the most powerful machines are used for nuclear-weapons research and cybersecurity in the US, EU, Russia and China, and possibly elsewhere (although some of that capacity has been made available for studying COVID-19). Others have advanced the science of weather and climate in significant ways. Some examples are:
- Cambridge-1, the fastest supercomputer in the UK, was designed and assembled by Nvidia and is rated at roughly 400 petaFLOPS of AI performance. It is applied to medical research (as far as we know).
- Summit, the aforementioned IBM-designed computer at Oak Ridge National Laboratory (ORNL), is currently the fastest supercomputer in the US. Still, its 148.8 petaFLOPS will be eclipsed in 2022 by three computers provided by HPE/Cray. It has 4,356 nodes, each with two 22-core Power9 CPUs and six Nvidia Tesla V100 graphics processing units (GPUs). It is already obsolete: it weighs a staggering 340 tons and will be decommissioned and cut up for scrap to make way for a new exascale computer. This raises an important question: will Dojo show significant miniaturization and lower power consumption?
Summit and the other HPC "elephants" get their performance by scaling. The design of the chips, the construction of the nodes, the configuration and, of course, the interconnect all matter, but its 200 quadrillion FLOPS and 250 petabytes of storage are achieved by sheer volume. Summit requires 20-30 MW of electricity to run, enough to light a small town.
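The arithmetic behind that "volume" is straightforward. As a sketch, assuming the commonly cited ~7.8 teraFLOPS of FP64 peak per V100 (an assumption, not a figure from the text) and ignoring the CPUs' modest contribution, Summit's node counts multiply out to roughly the figure above:

```python
# Back-of-the-envelope peak throughput for Summit from the counts above.
# The 7.8 teraFLOPS FP64 peak per V100 is a commonly cited figure and an
# assumption here; the Power9 CPUs' contribution is ignored.
nodes = 4356
gpus_per_node = 6
fp64_tflops_per_gpu = 7.8

peak_pflops = nodes * gpus_per_node * fp64_tflops_per_gpu / 1000
print(f"~{peak_pflops:.0f} petaFLOPS peak")
# ~204 petaFLOPS: in line with the ~200 quadrillion FLOPS quoted above.
# The measured Linpack score (148.8 petaFLOPS) is, as always, lower.
```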
Summit's "sister" computer, Sierra, was installed at Lawrence Livermore National Laboratory in California. Sierra is air-gapped and applied to predictive applications in nuclear-weapon stockpile stewardship, a US DOE program for simulating and maintaining nuclear weapons.
Scrutinizing Tesla's Dojo
Tesla has not developed Dojo from commodity components. It created a unique architecture and several chip designs that were produced, most likely, by Samsung.
For example, instead of using multicore chips cut from a wafer and mounted on motherboards, Tesla uses the entire wafer (chips are normally produced on a wafer and then cut apart). Tesla claimed that its GPU count is higher than that of the top five supercomputers in the world, but this is a misstatement: it meant that it has more GPUs than the fifth-fastest supercomputer in the world, Selene, which has 4,480 NVIDIA A100 GPUs.
Andrej Karpathy, Senior Director of AI at Tesla, revealed in a presentation that the largest cluster is made up of NVIDIA's new A100 GPUs, which would put it in fifth position in the world, but there is some fudging in these figures. FP32 measures how many single-precision floating-point operations per second a machine can produce, while the typical measure for supercomputers is FP64, double-precision floating-point calculations. There was some confusion in the presentation about which metric was used.
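The distinction matters for more than marketing: single and double precision differ in what they can represent, not just in throughput. A minimal illustration in Python, using NumPy:

```python
import numpy as np

# FP32 (single precision) cannot resolve increments that FP64
# (double precision) handles easily.
print(np.float32(1.0) + np.float32(1e-8) == np.float32(1.0))  # True
print(np.float64(1.0) + np.float64(1e-8) == np.float64(1.0))  # False

# A "petaFLOPS" claim is therefore ambiguous without the precision:
# AI training is usually quoted in FP32 (or lower), while the TOP500's
# Linpack benchmark is run in FP64.
```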
My take
Whether Tesla was able to produce such a powerful computer is not nearly as interesting as what the company intends to do with it. It remains to be seen whether Dojo is a new architecture for supercomputers, exceeding the characteristics of the current machines in production, or a one-off device for Tesla's own application. We don't know just how clever Dojo is yet.
Tesla could potentially make Dojo the new most powerful supercomputer in the world. But if that's Tesla's plan, the company has its work cut out for it. The history of computing is littered with technical breakthroughs that didn't achieve market traction, much less dominance. As General George Patton once said, "No good decision was ever made in a swivel chair." The same holds true for tech: press conferences don't amount to field victories, and Tesla's HPC field victory has not yet been achieved.