In 2016, Google announced they had developed their own chip, the Tensor Processing Unit (TPU), to handle machine learning workloads. They now use these custom-made chips in all their major data centers for their most important daily functions.
Google engineers had identified the need for machine-learning-specific hardware as early as 2006, but the need became acute in 2013, when the explosion of data to be mined meant they might need to double the size of their data centers. In just under two years they hired a team, designed the chip, and built the first versions of their TPUs, a period one of the chief engineers described as ‘hectic’.
The units were marvels of simplicity, designed around careful observation of a neural network workload. A neural network consists of the following repeating steps (sketched in code after the list):
- Multiply the input data (x) by the weights (w) to represent the signal strength
- Add the results to aggregate the neuron’s state into a single value
- Apply an activation function (f), such as ReLU, sigmoid, or tanh, to modulate the artificial neuron’s activity.
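For the curious, here is a minimal sketch of those three steps for a single fully connected layer, written in NumPy. The layer sizes and the choice of ReLU are purely illustrative, not anything specific to Google's networks:

```python
import numpy as np

def relu(z):
    # Activation function f: pass positive signals through, clamp negatives to zero
    return np.maximum(z, 0.0)

def dense_layer(x, W, b):
    # Steps 1 and 2: multiply inputs by weights and add up each neuron's total signal
    z = W @ x + b
    # Step 3: modulate the aggregated signal with the activation function
    return relu(z)

# A tiny example layer: 4 inputs feeding 3 neurons
x = np.array([1.0, 0.5, -0.2, 0.3])   # input data
W = np.random.randn(3, 4)             # weights
b = np.zeros(3)                        # biases
y = dense_layer(x, W, b)               # the neuron activations
```

Every prediction a network makes is essentially this multiply-add-activate loop repeated layer after layer, which is why the designers focused so heavily on the multiply-add part.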
The chip was designed to handle the linear algebra at the heart of this loop, after Google analyzed the key neural net configurations used in their production workloads. They found they needed to handle anywhere from 5 to 100 million weights at a time, far more than the dozens of multiply units in a CPU, or even the few thousand in a GPU, could serve efficiently.
They then created a pipelined chip with a massive bank of 65,536 8-bit integer multipliers (a 256 × 256 grid, compared with just a few thousand multipliers in most GPUs and dozens in a CPU), a unified buffer of 24 MB of SRAM that works as registers, followed by activation units hardwired for neural net tasks.
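Since a 256 × 256 grid of multipliers can only chew on one block of that size per pass, a large weight matrix has to be fed through in tiles. The loop below is my own rough illustration of that idea in NumPy, not the TPU's actual dataflow or instruction set:

```python
import numpy as np

TILE = 256  # 256 x 256 = 65,536 multiply-accumulate cells

def tiled_matmul(A, B, tile=TILE):
    # Multiply A (m x k) by B (k x n) one tile-sized block at a time,
    # the way a large weight matrix has to be streamed through a
    # fixed-size matrix unit, accumulating partial products across passes.
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=np.int32)                # wide accumulators
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                a = A[i:i+tile, p:p+tile].astype(np.int32)
                b = B[p:p+tile, j:j+tile].astype(np.int32)
                C[i:i+tile, j:j+tile] += a @ b          # one pass through the unit
    return C

# Hypothetical 8-bit weight and input matrices, larger than one tile
A = np.random.randint(-128, 128, size=(512, 384), dtype=np.int8)
B = np.random.randint(-128, 128, size=(384, 512), dtype=np.int8)
assert np.array_equal(tiled_matmul(A, B), A.astype(np.int32) @ B.astype(np.int32))
```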
To reduce the complexity of the design, they took advantage of the fact that their algorithms work fine with quantization, so they could use 8-bit integer multiplication units instead of full floating-point ones.
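The exact quantization scheme isn't spelled out here, so treat the following as a generic sketch of the idea: map floating-point values onto 8-bit integers with a scale factor, do the heavy multiplication in cheap integer arithmetic, and fold the scales back in at the end.

```python
import numpy as np

def quantize(x, num_bits=8):
    # Map floats symmetrically onto signed 8-bit integers.
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for int8
    scale = np.max(np.abs(x)) / qmax                    # real value of one integer step
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

w_f = np.random.randn(256)                              # original float weights
x_f = np.random.randn(256)                              # original float inputs
w_q, w_s = quantize(w_f)
x_q, x_s = quantize(x_f)

# Cheap int8 multiplies, accumulated in a wider int32, then rescaled to real units
acc = np.dot(w_q.astype(np.int32), x_q.astype(np.int32))
approx = acc * (w_s * x_s)
error = abs(approx - np.dot(w_f, x_f))                  # typically a small relative error
```

Integer multipliers are far smaller and cheaper in silicon than floating-point ones, which is part of what made a bank of 65,536 of them feasible on a single chip.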
Looking further at the workload, they adopted a ‘systolic’ design in which the results of one step flow directly into the next without ever being written out to memory, pumping data through the chip much like the chambers of a heart.
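To make the pumping image concrete, here is a deliberately tiny model of the idea (my own simplification; the real matrix unit pumps values through the full 256 × 256 grid): each cell does one multiply-accumulate and hands its running partial sum straight to the next cell, so intermediate results never touch memory.

```python
import numpy as np

def systolic_dot(weights, inputs):
    # Toy model of a systolic chain of multiply-accumulate cells.
    # Each cell holds one weight, multiplies it by the input passing
    # through, adds the partial sum handed down from the previous cell,
    # and pushes the new partial sum to the next cell. Only the final
    # value leaves the chain; nothing is written back to memory in between.
    partial = 0
    for w, x in zip(weights, inputs):   # one "cycle" per cell in the chain
        partial = partial + w * x       # multiply-accumulate inside the cell
    return partial

w = np.array([1, 2, 3, 4])
x = np.array([5, 6, 7, 8])
assert systolic_dot(w, x) == np.dot(w, x)   # both give 70
```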
The results: astounding. A single TPU can make 225,000 predictions in the time it takes a GPU to make 13,000, or a CPU to make only 5,500.
The streamlined design and the systolic flow that minimizes memory writes also greatly reduce power usage, which matters when you have thousands of these units in every data center.
They have since built two more versions of the TPU, one in 2017 and another in 2018. The second version improved on the first after they realized the original was memory-bandwidth limited, so they added 16 GB of high-bandwidth memory per chip to reach 45 TFLOPS. They then expanded its reach by combining four chips into a single module and 64 modules into a single pod (256 individual chips in all).
I highly suggest giving the links a further read. It’s not only fascinating from an AI perspective, but also a wonderful story of looking at a problem with outside eyes and engineering the best solution given today’s technology.