Disrupting the Giants

As we get close to launching our first chip, one of the most frequent questions I get asked is, “How will you guys compete against other AI chips, including those from giant companies like Intel and Nvidia?”  It’s a question I love, because it leads to great discussions about computing architectures, chip design, and how AI will be deployed to improve people’s lives. These can all be lengthy discussions, but for this post let me focus on some of the design considerations that have led NovuMind to a unique, disruptive approach.

We are just at the beginning of an AI revolution, brought about by deep neural networks (DNNs). As this revolution unfolds, it is not surprising that many design approaches build on existing technologies. The processing units that power most of today’s DNN solutions are CPUs, GPUs, and DSPs. These are all derivatives of the von Neumann model: they execute an instruction set, with support for loops, subroutines, arithmetic functions, and so on.

But DNNs are forcing us to think about processor architectures in new ways. A DNN is not organized around the traditional instruction-set model; it is a graph of interconnected layers, with a model describing the operations between them. The fundamental primitives are tensor operations. We can implement those primitives on traditional computing cores, and that is what is often done, but it doesn’t lead to efficient implementations. Problems arise, such as having to spend extra cycles moving the same data in and out of memory over and over.
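To make the data-movement problem concrete, here is a minimal sketch in Python with NumPy of a convolution written as nested loops of multiply-accumulates. It is purely illustrative (the shapes and names are my own, not anyone’s production code); the point is that every output element re-reads an overlapping input window, so the same values travel between memory and the arithmetic units again and again.

```python
import numpy as np

def conv2d_naive(x, w):
    """Direct 2D convolution as nested loops of multiply-accumulates.

    x: input feature map, shape (C, H, W)
    w: filter bank, shape (K, C, R, S)
    Returns output of shape (K, H-R+1, W-S+1).
    """
    C, H, W = x.shape
    K, _, R, S = w.shape
    out = np.zeros((K, H - R + 1, W - S + 1), dtype=x.dtype)
    for k in range(K):                  # output channels
        for i in range(H - R + 1):      # output rows
            for j in range(W - S + 1):  # output columns
                acc = 0.0
                for c in range(C):      # input channels
                    for r in range(R):
                        for s in range(S):
                            # Each MAC re-fetches input data that
                            # neighboring output positions also need.
                            acc += x[c, i + r, j + s] * w[k, c, r, s]
                out[k, i, j] = acc
    return out

# Tiny example: 3-channel 8x8 input, four 3x3 filters.
x = np.random.rand(3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
print(conv2d_naive(x, w).shape)  # (4, 6, 6)
```

On a conventional core, each of those inner-loop MACs is a handful of instructions plus the loads and stores around it, which is exactly the overhead that motivates purpose-built tensor hardware.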

Smart researchers have been working on this, and one of the noteworthy results of recent years has been the TPU (Tensor Processing Unit), based on the “systolic array” architecture. Basically, you lay out a highly regular grid of multiply-accumulate (MAC) processing elements and optimize to pack a large, dense array of MACs into silicon. This architecture has been an important step forward in handling the types of tensor operations inherent in DNNs.
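For intuition, here is a toy cycle-level model of one common systolic-array dataflow: an output-stationary matrix multiply, where operands flow through the array in a skewed wavefront and each processing element accumulates one output. The array size, schedule, and Python framing are my own illustrative assumptions, not a description of any particular TPU implementation.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy cycle-level model of an output-stationary N x N systolic array.

    PE (i, j) accumulates C[i, j]. Rows of A enter from the left and
    columns of B from the top, each skewed by one cycle, so the k-th
    product term reaches PE (i, j) at cycle t = i + j + k.
    """
    N = A.shape[0]
    C = np.zeros((N, N))
    # The skewed wavefront needs 3N - 2 cycles to sweep the whole array.
    for t in range(3 * N - 2):
        for i in range(N):
            for j in range(N):
                k = t - i - j  # which product term arrives this cycle
                if 0 <= k < N:
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

Notice that even this tiny model makes the pipeline behavior visible: no output is complete until the wavefront has crossed the array, which is where the latency cost discussed next comes from.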

Our vision has been to bring the benefits of AI to the edge of the network, where the data resides and where there are huge opportunities to deploy AI. These applications usually require low latency, low power consumption, and low cost. Here, the TPU architecture is not optimal. Consider latency: when a tensor is unfolded into matrices and “shifted” through a 256×256 systolic array, it takes 256 clock cycles before the first column of data entering the array reaches the far edge, and another 256 clock cycles before the last column drains out. That’s a big latency penalty. In our chip, we have a totally different architecture, designed specifically for the 3D tensor calculations inherent in convolutional neural networks. Inspired by techniques supercomputers use to distribute data across many nodes for parallel processing, our chip design distributes tensor data to more MAC processing elements and gets much more computation done relative to the time spent moving data around.
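The fill-and-drain arithmetic is easy to check. Here is a back-of-the-envelope sketch; the 1 ns clock period is a hypothetical figure I chose for illustration, not a number from any datasheet.

```python
N = 256       # systolic array dimension, from the text
cols = 256    # columns of data pushed through, for illustration

fill = N               # cycles before the first column reaches the far edge
total = N + cols - 1   # last column enters at cycle cols-1, exits N cycles later

cycle_ns = 1.0         # hypothetical 1 GHz clock -> 1 ns per cycle
print(f"fill: {fill} cycles ({fill * cycle_ns:.0f} ns), "
      f"total: {total} cycles ({total * cycle_ns:.0f} ns)")
# fill: 256 cycles (256 ns), total: 511 cycles (511 ns)
```

For large batches those fixed fill and drain cycles amortize away, but edge workloads often process one input at a time, which is why the penalty matters there.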

Earlier this month, we were very pleased when the US Patent Office granted one of our patents, which specifically covers this unique native tensor processing approach. We are very proud of this validation.

When you take a disruptive approach, you raise eyebrows and have to answer a lot of questions, like how you can possibly compete against the giant companies. I don’t mind the questions, and I love seeing people’s reactions when I explain our view of the market, our unique design approach, and how we will enable entirely new markets for AI. It is a very exciting time for us at NovuMind.