The Wafer Scale Engine measures 8 inches by 8 inches, considerably larger than a typical GPU die of 1 to 1.5 inches per side. Where a GPU has about 5,000 cores, the WSE has 850,000 cores and 40 GB of on-chip SRAM, which is 10 times faster than the HBM memory used in GPUs. That works out to 20 PB/sec of memory bandwidth and 6.25 petaflops of processing power on dense matrices, or 62.5 petaflops on sparse matrices.
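A rough way to see why on-chip SRAM matters is to compare bandwidth per unit of compute. In the sketch below, the WSE figures come from the numbers above, while the GPU figures are assumed, illustrative values for a typical HBM-based data-center part, not vendor specifications.

```python
# Back-of-envelope comparison of memory bandwidth available per floating-point operation.
# WSE figures are taken from the article; GPU figures are assumed and only illustrative.

wse_bandwidth_pb_s = 20      # PB/s of on-chip SRAM bandwidth (from the article)
wse_dense_pflops = 6.25      # dense petaflops (from the article)

gpu_bandwidth_pb_s = 0.003   # ~3 TB/s of HBM bandwidth (assumed, typical data-center GPU)
gpu_dense_pflops = 1.0       # ~1 dense petaflop (assumed, typical data-center GPU)

def bytes_per_flop(bandwidth_pb_s: float, pflops: float) -> float:
    """Bytes of memory traffic available per floating-point operation."""
    return bandwidth_pb_s / pflops

print(f"WSE: {bytes_per_flop(wse_bandwidth_pb_s, wse_dense_pflops):.2f} bytes/FLOP")
print(f"GPU: {bytes_per_flop(gpu_bandwidth_pb_s, gpu_dense_pflops):.3f} bytes/FLOP")
# A higher bytes-per-FLOP ratio means the cores spend less time stalled waiting on
# memory, which is the advantage the article attributes to SRAM over off-chip HBM.
```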
In another benchmark, this one running Meta’s Llama 3.1-405B model to generate responses to human input, Cerebras produced 969 tokens per second, far outpacing the number-two performer, SambaNova, which generated 164 tokens per second. That makes Cerebras’s throughput 12 times faster than AWS’s AI instance and roughly six times faster than its closest competitor, SambaNova.
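The "six times faster" figure follows directly from the two throughput numbers quoted above; a quick check:

```python
# Speedup ratio implied by the quoted benchmark numbers.
cerebras_tps = 969    # tokens/s on Llama 3.1-405B (from the article)
sambanova_tps = 164   # tokens/s for the next-fastest system (from the article)

print(f"Cerebras vs. SambaNova: {cerebras_tps / sambanova_tps:.1f}x")  # ~5.9x, i.e. roughly six times
```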
Cerebras isn’t shy about the secret to its success. According to James Wang, director of product marketing at Cerebras, it’s the giant Wafer Scale Engine with its 850,000 cores that can all talk to each other at high speeds.
“Supercomputers today are great for weak scaling,” said Wang. “You can do more work, more volume of work, but you can’t make the same work go faster. Typically it tapers out at the max number of GPUs you have per node, which is around eight or 16, depending on configuration. Beyond that, you can do more volume, but you can’t go faster. And we don’t have this problem. We literally, because our chip itself is so large, move the strong scaling curve up by one or two orders of magnitude.”
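Wang’s distinction between weak and strong scaling is usually framed in terms of Gustafson’s and Amdahl’s laws. The sketch below is a generic illustration of those two curves, not Cerebras’s own model, and the serial fraction is an assumed value chosen only to show the shape of the tapering Wang describes.

```python
# Strong scaling (Amdahl): the problem size is fixed, so speedup is capped by the serial fraction.
# Weak scaling (Gustafson): the problem grows with processor count, so total volume keeps scaling
# even though a single fixed-size job does not get much faster.

SERIAL_FRACTION = 0.05   # assumed, illustrative share of work that cannot be parallelized

def strong_scaling_speedup(n_procs: int, serial: float = SERIAL_FRACTION) -> float:
    """Amdahl's law: speedup on a fixed-size problem spread over n_procs processors."""
    return 1.0 / (serial + (1.0 - serial) / n_procs)

def weak_scaling_speedup(n_procs: int, serial: float = SERIAL_FRACTION) -> float:
    """Gustafson's law: scaled speedup when the problem grows with n_procs."""
    return serial + (1.0 - serial) * n_procs

for n in (8, 16, 128, 1024):
    print(f"{n:>5} procs | strong: {strong_scaling_speedup(n):6.1f}x | "
          f"weak: {weak_scaling_speedup(n):7.1f}x")
# Strong scaling flattens quickly (with this serial fraction it can never exceed 20x),
# which is the "tapers out" behavior Wang describes for GPU clusters, while weak scaling
# keeps growing because more processors simply handle more volume of work.
```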
Inside a single server with eight GPUs, the GPUs use NVLink to share data and communicate, so they can be programmed to look roughly like a single processor, Wang adds. But once a system goes beyond eight GPUs, in any supercomputer configuration, the interconnect changes from NVLink to InfiniBand or Ethernet, and at that point, “they can’t be programmed like a single unit,” Wang says.
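The node boundary Wang points to can be made concrete with a rough communication-cost model. The bandwidth and tensor-size figures below are assumed, illustrative values, not vendor specifications.

```python
# Rough model of why the programming model changes at the node boundary.
# All figures here are assumed, illustrative values.

INTRA_NODE_GB_S = 900    # assumed NVLink-class bandwidth between GPUs inside one server
INTER_NODE_GB_S = 50     # assumed InfiniBand/Ethernet bandwidth between servers

def transfer_time_ms(megabytes: float, bandwidth_gb_s: float) -> float:
    """Time to move a tensor of the given size over one link, in milliseconds."""
    return megabytes / 1024 / bandwidth_gb_s * 1000

tensor_mb = 512  # assumed size of an activation/gradient tensor exchanged each step
print(f"within a node : {transfer_time_ms(tensor_mb, INTRA_NODE_GB_S):.2f} ms")
print(f"across nodes  : {transfer_time_ms(tensor_mb, INTER_NODE_GB_S):.2f} ms")
# Once traffic has to cross the slower inter-node link, communication rather than compute
# starts to dominate each step, which is why a multi-node cluster can no longer be treated
# as a single large processor the way eight NVLink-connected GPUs can.
```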
Earlier this month, Cerebras announced that Sandia National Laboratories is deploying a Cerebras CS-3 testbed for AI workloads.