Google Tensor Processing Units – version 4

I’ve written about Google’s custom TPU silicon before (Google’s Tensor Processing Units – v1).

One of the big reasons Google and other web-scale companies develop their own custom chips is that general-purpose CPUs are flexible but power-hungry. That power costs a lot of money in electricity and cooling across huge data centers. So why buy chips full of features you don’t need when you can build your own – and save millions of dollars a year in power and cooling costs?

In just six years, Google has designed and built four generations of increasingly capable AI data center chips. From somewhat humble beginnings, they have become seriously powerful, and Google has just published information about TPU version 4.

What is this new chip capable of?

  • a nearly 10x leap in scaling ML system performance over TPU v3, 
  • boosting energy efficiency ~2-3x compared to contemporary ML DSAs (domain-specific architectures), and 
  • reducing CO2e by as much as ~20x over these DSAs in typical on-premise data centers (a back-of-envelope sketch of how that multiplies out follows this list)
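To see how those multipliers could compound to ~20x, here is a back-of-envelope sketch in Python. All of the numbers below are illustrative assumptions of mine, not figures from the paper: the idea is simply that a ~2-3x more efficient chip running in a data center with a much cleaner energy mix multiplies out to a far larger CO2e reduction.

```python
# Back-of-envelope sketch of how a ~2-3x chip efficiency gain can
# compound into a ~20x CO2e reduction. All numbers are illustrative
# assumptions, not figures from the TPU v4 paper.

ON_PREM_ENERGY_KWH = 1_000.0                   # assumed energy for a training run on a contemporary DSA
TPU_V4_ENERGY_KWH = ON_PREM_ENERGY_KWH / 2.5   # ~2-3x more energy-efficient hardware

ON_PREM_CARBON = 0.40     # assumed kg CO2e per kWh for a typical on-premise grid mix
GOOGLE_DC_CARBON = 0.05   # assumed kg CO2e per kWh with cleaner energy and better PUE

on_prem_co2e = ON_PREM_ENERGY_KWH * ON_PREM_CARBON
tpu_v4_co2e = TPU_V4_ENERGY_KWH * GOOGLE_DC_CARBON

print(f"on-prem DSA: {on_prem_co2e:.0f} kg CO2e")
print(f"TPU v4:      {tpu_v4_co2e:.0f} kg CO2e")
print(f"reduction:   ~{on_prem_co2e / tpu_v4_co2e:.0f}x")  # 2.5 * (0.40 / 0.05) = 20x
```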

Even crazier, it’s built around purely optical switching.

TPU v4 is the first supercomputer to deploy reconfigurable OCS (optical circuit switches). OCSes dynamically reconfigure their interconnect topology, and they are much cheaper, lower-power, and faster than InfiniBand. The paper’s figure shows how an OCS works, using two MEMS mirror arrays: no optical-to-electrical-to-optical conversion or power-hungry network packet switches are required, saving power.
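As a mental model (not Google’s actual control software), an OCS is essentially a dynamically reprogrammable permutation of ports: each MEMS mirror steers light from one input fiber to one output fiber, so “reconfiguring the topology” just means installing a new port mapping. A minimal toy sketch:

```python
# Toy model of an optical circuit switch: a reconfigurable one-to-one
# mapping of input ports to output ports. Purely illustrative; the real
# OCS steers light with MEMS mirror arrays and has no packet buffers.

class OpticalCircuitSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        # Identity mapping to start: port i connects straight through to port i.
        self.mapping = {i: i for i in range(num_ports)}

    def reconfigure(self, mapping: dict[int, int]) -> None:
        """Install a new circuit configuration. It must be a permutation:
        each input lights up exactly one output."""
        if sorted(mapping) != list(range(self.num_ports)) or \
           sorted(mapping.values()) != list(range(self.num_ports)):
            raise ValueError("mapping must be a permutation of all ports")
        self.mapping = dict(mapping)

    def route(self, in_port: int) -> int:
        # Light entering in_port exits here; no O-E-O conversion, no buffering.
        return self.mapping[in_port]


# Example: rewire a 4-port switch from a ring-like pattern to its reverse,
# the way a scheduler might match the interconnect to a new job's shape.
ocs = OpticalCircuitSwitch(4)
ocs.reconfigure({0: 1, 1: 2, 2: 3, 3: 0})  # "ring" circuits
print([ocs.route(p) for p in range(4)])     # [1, 2, 3, 0]
ocs.reconfigure({0: 3, 1: 2, 2: 1, 3: 0})  # new topology, no fibers touched
print([ocs.route(p) for p in range(4)])     # [3, 2, 1, 0]
```

As I understand the paper, this reconfigurability is what lets a 4,096-chip TPU v4 pod be carved into slices with different 3D torus shapes, and lets the scheduler route around failed racks without anyone physically rewiring a fiber.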

On top of this, Google claims the newest version is 1.2-1.7x faster and up to 1.9x more power-efficient than Nvidia’s A100 chips.

Worth a read.
