Intel’s first 4004 processor in 1971 had 2,300 transistors, and a recent Advanced Micro Devices processor has 32 billion transistors. Now artificial intelligence startup Cerebras Systems is unveiling the largest semiconductor chip ever built: the Cerebras Wafer Scale Engine, with 1.2 trillion transistors.
The Cerebras chip is a single processor built on a single wafer, with interconnections designed to keep everything running at high speed so the 1.2 trillion transistors all work together as one.
In this way, the Cerebras Wafer Scale Engine is the largest processor ever built, and it has been designed specifically for artificial intelligence work.
Samsung has actually built a flash memory chip, the eUFS, with 2 trillion transistors. But the Cerebras chip is built for processing, and it boasts 400,000 cores on 46,225 square millimeters. It is 56.7 times larger than the largest Nvidia graphics processing unit, which measures 815 square millimeters and has 21.1 billion transistors.
The WSE also contains 3,000 times more high-speed, on-chip memory than that GPU and delivers 10,000 times more memory bandwidth.
Chip size is profoundly important in AI, as big chips process information more quickly, producing answers in less time. Reducing the time to insight, or “training time,” allows researchers to test more ideas, use more data, and solve new problems. Google, Facebook, OpenAI, Tencent, Baidu, and many others argue that the fundamental limitation of today’s AI is that it takes too long to train models. Reducing training time thus removes a major bottleneck to industrywide progress.
These performance gains are accomplished by accelerating all the elements of neural network training. A neural network is a multistage computational feedback loop. The faster inputs move through the loop, the faster the loop learns, or “trains.” The way to move inputs through the loop faster is to accelerate the calculation and communication within the loop.
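To make that loop concrete, here is a minimal training iteration sketched in plain NumPy. This is a generic illustration of the concept, not Cerebras software: each pass computes outputs, measures the error, and updates the weights, and the speed of the calculation and communication inside this loop is what sets the training speed.

```python
# Schematic sketch of the training feedback loop: forward pass, loss,
# gradient, weight update. Accelerating these steps accelerates training.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))        # a batch of inputs
y = rng.normal(size=(256, 1))         # target outputs
W = rng.normal(size=(64, 1)) * 0.01   # model weights
lr = 1e-3

for step in range(100):
    pred = X @ W                      # forward pass (calculation)
    err = pred - y
    loss = float(np.mean(err ** 2))   # how wrong the model currently is
    grad = X.T @ err / len(X)         # backward pass (more calculation)
    W -= lr * grad                    # update, then the loop repeats
```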
The 46,225 square millimeters of silicon in the Cerebras WSE house 400,000 AI-optimized, no-cache, no-overhead compute cores and 18 gigabytes of local, distributed, superfast SRAM memory as the one and only level of the memory hierarchy. Memory bandwidth is 9 petabytes per second. The cores are linked together with a fine-grained, all-hardware, on-chip mesh-connected communication network that delivers an aggregate bandwidth of 100 petabits per second. More cores, more local memory, and a low-latency high-bandwidth fabric together create the optimal architecture for accelerating AI work.
The WSE contains 400,000 AI-optimized compute cores. Called SLAC, for Sparse Linear Algebra Cores, the compute cores are flexible, programmable, and optimized for the sparse linear algebra that underpins all neural network computation. SLAC’s programmability ensures the cores can run all neural network algorithms in the constantly changing machine learning field.
Because the Sparse Linear Algebra Cores are optimized for neural network compute primitives, they achieve industry-best utilization — often triple or quadruple that of a graphics processing unit. In addition, the WSE cores include Cerebras-invented sparsity harvesting technology to accelerate computational performance on sparse workloads like deep learning.
Zeros are prevalent in deep learning calculations. Often, the majority of the elements in the vectors and matrices that are to be multiplied together are zero. And yet multiplying by zero wastes silicon, power, and time, since it produces no new information.
Because graphics processing units and tensor processing units are dense execution engines — engines designed to never encounter a zero — they multiply every element even when it is zero. When 50-98% of the data is zeros, as is often the case in deep learning, most of the multiplications are wasted. Imagine trying to run forward quickly when most of your steps don’t move you toward the finish line. As the Cerebras Sparse Linear Algebra Cores never multiply by zero, all zero data is filtered out and can be skipped in the hardware, allowing useful work to be done in its place.
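A small sketch makes the point concrete. The code below is illustrative only, not how the SLAC hardware is implemented: it compares a dense dot product, which touches every element, with one that filters out the zero entries first. With 90 percent zeros, the sparse version does about a tenth of the multiplications and produces the same result.

```python
# Dense vs. zero-skipping ("sparsity harvesting" in spirit) dot product.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
x[rng.random(x.size) < 0.9] = 0.0     # ~90% zeros, typical after ReLU
w = rng.normal(size=x.size)

dense_multiplies = x.size             # a dense engine multiplies everything
nz = np.flatnonzero(x)                # keep only the non-zero entries
sparse_multiplies = nz.size
dot = float(np.dot(x[nz], w[nz]))     # same answer, far fewer multiplies

print(f"dense: {dense_multiplies} multiplies, "
      f"sparse: {sparse_multiplies} "
      f"({sparse_multiplies / dense_multiplies:.0%} of the work), dot={dot:.3f}")
```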
The Cerebras Wafer Scale Engine includes more cores, with more local memory, than any chip to date and has 18 gigabytes of on-chip memory accessible by its cores in one clock cycle. The collection of core-local memory aboard the WSE delivers an aggregate of 9 petabytes per second of memory bandwidth: 3,000 times more on-chip memory and 10,000 times more memory bandwidth than the leading graphics processing unit.
Swarm, the interprocessor communication fabric used on the WSE, achieves breakthrough bandwidth and low latency at a fraction of the power draw of traditional communication techniques. Swarm provides a low-latency, high-bandwidth 2D mesh that links all 400,000 cores on the WSE with an aggregate 100 petabits per second of bandwidth. Swarm supports single-word active messages that can be handled by receiving cores without any software overhead.
Routing, reliable message delivery, and synchronization are handled in hardware. Every arriving message automatically activates its application handler. Swarm provides a unique, optimized communication path for each neural network. Software configures the optimal communication path through the 400,000 cores to connect processors according to the structure of the particular user-defined neural network being run.
Typical messages traverse one hardware link with nanosecond latency. The aggregate bandwidth across a Cerebras WSE is 100 petabits per second. Communication software such as TCP/IP and MPI is not needed, so their performance penalties are avoided. The energy cost of communication in this architecture is well under 1 picojoule per bit, which is nearly two orders of magnitude lower than in graphics processing units. With a combination of massive bandwidth and exceptionally low latency, the Swarm communication fabric enables the Cerebras WSE to learn faster than any currently available solution.
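To make the mesh idea concrete, here is a toy model, an assumption made purely for illustration rather than Cerebras’s actual routing: each core sits at a grid coordinate, and a message’s cost is the number of mesh links it crosses. In these terms, the placement software’s job is to keep communicating cores a single hop apart.

```python
# Toy 2D-mesh model (illustrative assumption, not the real Swarm fabric):
# message cost = number of mesh links crossed between two core coordinates.
def hops(src, dst):
    """Manhattan distance between two (row, col) core positions on the mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

# Adjacent cores exchange a message over a single hardware link, the common
# case described above.
print(hops((10, 10), (10, 11)))   # 1 hop
# A poor placement forces many hops, which good layer placement avoids.
print(hops((0, 0), (100, 250)))   # 350 hops
```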
Cerebras has started shipping the hardware to a small number of customers. It has not yet revealed how much the chips cost.