
Should We All Embrace Systolic Arrays?

On Scalability of the Matrix Multiply Unit in Google’s TPU

CP Lu, PhD
Apr 28, 2017 · 8 min read


In the highly anticipated paper In-Datacenter Performance Analysis of a Tensor Processing Unit, Google disclosed the technical details and performance metrics of its Tensor Processing Unit (TPU). The TPU is built around a matrix multiply unit based on a systolic array. In this post, I argue that systolic arrays are not scalable and that an alternative solution is preferable.

Why Matrix Multiplication?

Google claimed that the TPU is orders of magnitude better in performance and energy efficiency than CPUs and GPUs, and attributed this success to its domain-specific design. In the blog post AI Drives the Rise of Accelerated Computing in Data Centers, NVIDIA responded to these claims with its own performance metrics and concluded that TPUs and GPUs share the common theme of domain-specific acceleration of tensor computations. Tensors are high-dimensional data arrays used to represent the layers of Deep Neural Networks, and a Deep Learning task can be described as a Tensor Computation Graph.

A TPU is domain-specific to the matrix computations in Deep Learning. A GPU, in contrast, has evolved from being domain-specific to 3D graphics into a general-purpose parallel computing machine; what makes the GPU domain-specific to Deep Learning is its highly optimized matrix libraries, originally developed for HPC.

When a GPU is used for Deep Learning, tensors are unfolded into 2-dimensional matrices, and matrix computations are handled by calling matrix kernels from the host CPU; matrix kernels are GPU programs implementing different types of matrix computations. Matrix multiplication comprises many MAC (multiply-accumulate) operations, and these MAC operations are the most time-consuming part of Deep Learning. Even though the GPU environment allows programmers to write their own matrix kernels, they predominantly use the pre-built matrix multiplication kernels as black boxes. Such a matrix-centric approach (as described in cuDNN: Efficient Primitives for Deep Learning) might be the inspiration behind the TPU design team’s choice to treat matrices as the basic primitives and to build the TPU around a matrix multiply unit.
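To make the MAC-centric view concrete, here is a minimal Python sketch of my own (not code from cuDNN or the TPU paper) that spells out a matrix multiplication as a triple loop of multiply-accumulate operations; production matrix kernels implement exactly this arithmetic, only heavily optimized:

    # Matrix multiplication spelled out as nested multiply-accumulate (MAC) operations.
    # An M-by-K matrix times a K-by-N matrix performs M*N*K MACs in total.
    def matmul_as_macs(A, B):
        M, K = len(A), len(A[0])
        N = len(B[0])
        assert K == len(B), "inner dimensions must match"
        C = [[0.0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                acc = 0.0
                for k in range(K):
                    acc += A[i][k] * B[k][j]   # one MAC: multiply, then accumulate
                C[i][j] = acc
        return C

    # A 2x3 times 3x2 multiply performs 2*2*3 = 12 MACs.
    print(matmul_as_macs([[1, 2, 3], [4, 5, 6]],
                         [[7, 8], [9, 10], [11, 12]]))   # [[58, 64], [139, 154]]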

Why Systolic Arrays?

What’s eye-catching is the TPU design team’s choice of a systolic array. A systolic array is a collection of Processing Elements (PEs), typically arranged in a 2-dimensional grid, in which each PE works in lockstep with its neighbors. Each PE in a systolic array is basically a MAC unit with some glue logic to store and forward data. In comparison, a PE in a mesh-connected parallel processor is a full-featured processor core with its own frontend and necessary peripherals, whereas a PE in a GPU is a simplified processor core sharing a common frontend and peripherals with other PEs in the same compute cluster. Among the three solutions, the density of MAC units is highest in a systolic array. These differences are shown in the following figure:
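To illustrate the lockstep, store-and-forward behavior, here is a toy Python simulation of my own of an output-stationary systolic multiplication; the TPU’s unit is weight-stationary, but the point is the same: each PE performs one MAC per clock and forwards its operands to its right and bottom neighbors:

    # A toy output-stationary systolic array: every PE multiplies the operands arriving
    # from its left and top neighbors, adds the product to a local accumulator, and
    # latches the operands to forward them right and down on the next clock.
    def systolic_matmul(A, B):
        n, k, m = len(A), len(A[0]), len(B[0])   # A: n-by-k, B: k-by-m, PE grid: n-by-m
        acc   = [[0.0] * m for _ in range(n)]    # one accumulator per PE
        a_reg = [[0.0] * m for _ in range(n)]    # operand latched by each PE, moving right
        b_reg = [[0.0] * m for _ in range(n)]    # operand latched by each PE, moving down
        for t in range(k + n + m - 2):           # lockstep clocks, including fill and drain
            a_in = [[0.0] * m for _ in range(n)]
            b_in = [[0.0] * m for _ in range(n)]
            for i in range(n):
                for j in range(m):
                    if j == 0:                   # left edge: rows of A stream in, skewed by row index
                        a_in[i][j] = A[i][t - i] if 0 <= t - i < k else 0.0
                    else:                        # interior: take what the left neighbor latched
                        a_in[i][j] = a_reg[i][j - 1]
                    if i == 0:                   # top edge: columns of B stream in, skewed by column index
                        b_in[i][j] = B[t - j][j] if 0 <= t - j < k else 0.0
                    else:                        # interior: take what the top neighbor latched
                        b_in[i][j] = b_reg[i - 1][j]
                    acc[i][j] += a_in[i][j] * b_in[i][j]   # one MAC per PE per clock
            a_reg, b_reg = a_in, b_in            # store and forward
        return acc

    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]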

A systolic array offers several claimed advantages: a simple and regular design, concurrency and communication, and a balance of computation with I/O. However, until now there has been no commercially successful processor based on a systolic array. The TPU is the first, and an impressive one: arguably the largest systolic array ever implemented, or even conceived. Its design is reminiscent of an idea introduced by H. T. Kung in 1982.

Do Systolic Arrays Scale?

Although the details of the TPU design were not fully disclosed, the authors of the TPU paper made some puzzling statements that bring into question the scalability of systolic arrays.

  • A bigger matrix multiply unit (512-by-512) doesn’t help any DNN
  • A small fraction of a big number (256-by-256) can nonetheless be relatively large

Let me try to use the following hypothetical chart to interpret these two statements:

The first statement seems to indicate that 256-by-256 and 512-by-512 units sit on the plateau section of the speedup curve. The second statement implies that 128-by-128 might already be on the plateau. It suggests that, given the on-chip buffer size and memory bandwidth, Google could have designed the TPU with a 128-by-128 or even smaller unit without significant performance loss, except for compute-bound applications. Such a wide performance plateau from 128-by-128 to 512-by-512 (a factor of 16) could be due to an unbalanced design and/or fundamental limitations of systolic arrays. As the first successfully deployed implementation of a systolic array, the TPU might help us identify the issues.

In a typical systolic array for multiplying matrices A and B, the second matrix B is partitioned into tiles of the same square shape as the array. A sub-matrix of matrix A is loaded once and multiplied against many tiles of B. For this reason, the bandwidth for loading A is assumed to be relatively small.
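A minimal sketch of this tiling scheme, in my own notation: B is cut into T-by-T tiles matching the array, and each sub-matrix of A is loaded once and reused against every tile in the corresponding block-row of B, which is why the bandwidth for loading A is assumed to be comparatively small:

    # Tiling as described above: B is partitioned into T-by-T tiles (the shape of the
    # array), and a sub-matrix of A is loaded once and reused against every tile in the
    # matching block-row of B, so the bandwidth spent on A is amortized.
    def tiled_matmul(A, B, T):
        M, K, N = len(A), len(B), len(B[0])
        C = [[0.0] * N for _ in range(M)]
        for k0 in range(0, K, T):                     # pick a block-row of B
            A_sub = [row[k0:k0 + T] for row in A]     # load this sub-matrix of A once ...
            for j0 in range(0, N, T):                 # ... reuse it for every tile in the block-row
                for i in range(M):
                    for j in range(j0, min(j0 + T, N)):
                        acc = C[i][j]                 # partial sums accumulate across block-rows
                        for k in range(k0, min(k0 + T, K)):
                            acc += A_sub[i][k - k0] * B[k][j]
                        C[i][j] = acc
        return C

    print(tiled_matmul([[1, 2, 3], [4, 5, 6]],
                       [[7, 8], [9, 10], [11, 12]], 2))   # [[58, 64], [139, 154]]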

In the context of the TPU, the input matrix is A, and the weight matrix is B. Let’s analyze the information from the TPU paper based on our understanding of a systolic array:

Latency scales linearly with the side length of the array

Latency is hailed in the paper as a key performance indicator demanded by customers. However, the latency to pass through the array doubles when the side length of the array is doubled. It takes 256 clocks for the first element of the first row of input to traverse across the array and another 256 clocks for the first partial sum to travel down the array, accumulating along the way. When the side length of the systolic array is doubled from 256 to 512, the latency to pass through the array doubles from 512 to 1024 clocks, as shown below:
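As a back-of-the-envelope formula (the exact pipeline depth depends on implementation details the paper does not disclose), the pass-through latency is roughly twice the side length:

    # Rough pass-through latency of an N-by-N systolic array, in clocks: about N clocks
    # for an input to traverse across, plus about N more for the partial sum to travel
    # down and accumulate.  (The exact pipeline depth is implementation-specific.)
    def pass_through_latency(side):
        return 2 * side

    print(pass_through_latency(256))   # 512 clocks
    print(pass_through_latency(512))   # 1024 clocks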

Overhead to load the weights worsens in two dimensions

When an input sub-matrix is relatively short, the execution time is dominated by the time to load the tiles of weights. Let’s assume it takes the same time to prepare for a fully occupied tile as a partially occupied one. In the figure below, the overhead shown as the total area of the unoccupied portions of the tiles is bigger with a 512-by-512 unit than with a 256-by-256 one:

This is consistent with what the TPU paper says:

The issue is analogous to internal fragmentation of large pages, only worse since it’s in two dimensions. Consider the 600x600 matrix used in LSTM1. With a 256x256 matrix unit, it takes 9 steps to tile 600x600, for a total of 18 us of time. The larger 512x512 unit requires only four steps, but each step takes four times longer, for 32 us of time.
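The arithmetic in that quote is easy to reproduce; in the sketch below, the 2 µs per 256-wide step is implied by the quoted 18 µs total rather than stated explicitly in the paper:

    import math

    # Reproducing the internal-fragmentation arithmetic quoted above for a 600x600 matrix.
    def tiling_steps(matrix_side, tile_side):
        per_dim = math.ceil(matrix_side / tile_side)   # tiles needed along each dimension
        return per_dim * per_dim

    steps_256 = tiling_steps(600, 256)              # 3 * 3 = 9 steps
    steps_512 = tiling_steps(600, 512)              # 2 * 2 = 4 steps
    step_time_256 = 18.0 / steps_256                # 2 us per step, implied by the quoted total
    step_time_512 = 4 * step_time_256               # each 512-wide step takes four times longer
    print(steps_256, steps_256 * step_time_256)     # 9 steps, 18 us
    print(steps_512, steps_512 * step_time_512)     # 4 steps, 32 us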

Input bandwidth requirement increases with the square of the side length of the array

Let’s examine the case in which the input sub-matrix is sufficiently tall. A 512-by-512 unit is expected to be four times faster than a 256-by-256 one, multiplying a B-by-512 input sub-matrix with a 512-by-512 tile of weights in B clocks. Since the number of relevant 512-by-512 tiles is half the number of 256-by-256 tiles, a B-by-512 sub-matrix stays in the array for half the time of a B-by-256 one. Loading the next B-by-512 sub-matrix in time to achieve the desired fourfold speedup therefore requires four times the input bandwidth.
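The same argument in back-of-the-envelope code (a sketch of my own; "elements per clock" stands in for bytes per second, and the 2048-wide weight matrix and batch of 128 are hypothetical numbers, not figures from the paper):

    # Input bandwidth needed to keep an N-by-N unit busy, in elements per clock.
    # The next B-by-N sub-matrix of A (batch*N elements) must arrive while the current
    # one is being consumed, which takes batch clocks per tile times the number of tiles
    # in a block-row of the weights.  The weight width of 2048 and batch of 128 are
    # hypothetical numbers for illustration only.
    def input_bandwidth(side, batch, weight_width):
        tiles_per_block_row = weight_width // side   # fewer, larger tiles as the side grows
        residence_clocks = batch * tiles_per_block_row
        elements_to_load = batch * side              # the next sub-matrix of A
        return elements_to_load / residence_clocks   # simplifies to side*side / weight_width

    print(input_bandwidth(256, 128, 2048))   # 32 elements per clock
    print(input_bandwidth(512, 128, 2048))   # 128 elements per clock -- four times as much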

Occupancy of the array decreases with the side length of the array

This typically happens to a layer of a Convolutional Neural Network (CNN) where the length of the inner dimension of the corresponding matrix multiplication is related to the feature depth. A bigger array is more likely to be poorly occupied as shown in the following figure:
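A rough occupancy estimate of my own, assuming the array rows beyond the feature depth simply carry no useful work (the feature depth of 192 is a hypothetical example):

    # Fraction of an N-by-N array doing useful work when the inner dimension of the
    # multiplication (the feature depth) is shallower than the array.  The depth of 192
    # is a hypothetical example.
    def occupancy(side, depth):
        return min(depth, side) / side

    for side in (128, 256, 512):
        print(side, occupancy(side, 192))
    # 128 -> 1.0, 256 -> 0.75, 512 -> 0.375: the bigger array is the more poorly occupied.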

In summary, a four-times larger 512-by-512 unit results in a four-times larger tile size for the weights, which might work against scalability in the following ways:

  1. It causes the latency to double
  2. It makes the overhead to load the weights worse
  3. It requires four times the input bandwidth to achieve a fourfold speedup
  4. It makes the feature depths of a CNN relatively shallow

Based on these observations, here are possible reasons why a 512-by-512 unit does not help even compute-bound applications such as CNN0:

  1. A 512-by-512 unit must be matched with fourfold input bandwidth to achieve the desired fourfold speedup. However, the input is fed to the TPU through the Peripheral Component Interconnect Express (PCIe) bus, and a standard PCIe bus might not meet the fourfold bandwidth requirement. As a consequence, the compute-bound CNN0 might become PCIe-bound.
  2. CNN1 performs poorly on the TPU due to its shallow feature depths, and shallowness is relative to the side length of the systolic array. The feature depths of CNN0 might likewise become relatively shallow on a 512-by-512 unit.

Should We All Embrace Systolic Arrays?

The success of the TPU points to the opportunity, and the direction, of using matrices as basic primitives at the right level of domain specialization to accelerate Deep Learning. However, a systolic array is not scalable, because it requires a proportional increase in bandwidth to sustain the desired speedup. This works against Moore’s Law and the technology trend of memory speed lagging behind logic speed. It also makes latency worse, which goes against the trend of user demand claimed in the paper. Even though the matrix multiply unit in the TPU is just one among different options for systolic matrix multiplication, I believe similar scalability issues will manifest in any commercial implementation due to

The curse of the square shape of a systolic array.

Finding a scalable alternative for on-chip matrix multiplication is necessary not only to build a bigger matrix multiply unit for better performance, but also to achieve the same level of performance with one that is smaller, yet better utilized.

According to the TPU paper:

As reading a large SRAM uses much more power than arithmetic, the matrix unit uses systolic execution to save energy by reducing reads and writes of the Unified Buffer.

In a systolic array, the horizontal systolic movements implement data broadcasts, whereas the vertical ones implement accumulates, as shown in the following figure:
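Stripped of the systolic movement, what the array computes is a broadcast followed by an accumulate. The NumPy sketch below (an illustration, not a hardware proposal) expresses the same product directly in those terms:

    import numpy as np

    # The same matrix product expressed directly as broadcasts and accumulates:
    # every element of A is broadcast across a row of partial products, and the
    # partial products are accumulated (reduced) along the shared dimension.
    A = np.random.rand(4, 3)                    # input (hypothetical sizes)
    B = np.random.rand(3, 5)                    # weights
    partial = A[:, :, None] * B[None, :, :]     # broadcast: shape (4, 3, 5)
    C = partial.sum(axis=1)                     # accumulate along the shared dimension
    assert np.allclose(C, A @ B)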

It will be interesting to see if we can find a scalable and more energy-efficient alternative to achieve the desired broadcasts and accumulates without sacrificing latency.
