How to Inference Your Dragon: Survival Guide for Neurochip Makers

In 2018 we took on our first major contract to develop a software devkit for a neural processing unit. At the time, I knew the AI market was growing rapidly and already contained hundreds of companies. But I could never have imagined that by 2021 another hundred or two chipmakers developing AI accelerators would emerge, that we would become an Arm AI partner, or that our projects with neuro-chipmakers would evolve into a separate line of business. This article is an attempt to formalize the knowledge we gained from several projects and about a hundred conversations with chipmakers. I hope it proves helpful to someone.

Traditionally, our clients are CPU, MCU, or DSP manufacturers and IP core developers from various countries. The most famous of them is Samsung Electronics: back in 2000 we ported our C compiler for them and maintained it for 15 years. Then LLVM arrived and Samsung moved to Arm-based solutions, but that's a story for another time.

When we started the neural network compiler development project for a tensor chip, it seemed like a one-time job. Then came a second client. And now there are dozens of potential clients. The number of manufacturers and developers of neural network accelerators is growing rapidly: every year at least ten new companies emerge. The market is populated by giants such as Arm, STMicroelectronics, Achronix, and Andes, as well as big and small startups like Kneron, AreanaAI, and AlphalCs. Because the neurochip market is still taking shape, these manufacturers face problems that no one has solved before.

New possibilities — new problems

Every developer or manufacturer of proprietary neurochips strives to expand its client base. To do this, they try to give their users the most efficient solutions with minimal ownership costs, along with appropriate and convenient development tools and technical support. The difficulty is that there are many ML platforms on the market: PyTorch, MXNet, Caffe2, ONNX, TensorFlow, and others, and each of them supports only a limited set of target hardware.

This is why neural chip developers must either provide their own tools or use translators (from ONNX, for example). Both options are not only extremely cost-intensive but also come with technical restrictions and require extra spending on support and maintenance. For example, when a new version of TensorFlow is released, the developer is forced to release updates, which in turn may require new optimization algorithms. This makes it practically impossible for a chip developer to support multiple frameworks at once, especially if the company is small.

The problem is that when choosing an inference environment, users usually base their choice on networks that already exist and have been trained. As a result, if a manufacturer chooses to support only one platform, it ends up with a limited user base, and attracting new users, or convincing them to design devices around a new chip, is extremely difficult.

Fig.1. Target Audience Limitations


In order to overcome this restriction, we searched for the most universal solution for our projects, which would support as many ML frameworks as possible.

In our very first project, we chose TVM and got our money’s worth.

Fig.2. Apache TVM enlarges client range

To put it simply, TVM provides inference of trained models in various formats on target hardware, including special-purpose and unique chips, once TVM has been customized accordingly. It aims to let machine learning engineers optimize and run computations efficiently on any hardware backend.
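To make the compilation flow concrete, here is a toy sketch of the stages such a compiler passes through: framework model, high-level IR, optimized IR, target code. This is not the real TVM API; the graph format, the pass, and the code generator below are all invented for illustration only.

```python
# Toy sketch of a TVM-style compilation flow (invented names, NOT the TVM API).
# A "model" is a list of (op, args) nodes; node i's result is value %i.

model = [
    ("const", 2.0),          # %0
    ("const", 3.0),          # %1
    ("mul", (0, 1)),         # %2 = %0 * %1
    ("add", (2, 2)),         # %3 = %2 + %2
]

def constant_fold(graph):
    """Optimization pass: evaluate nodes whose inputs are all constants."""
    values = {}
    folded = []
    for i, (op, args) in enumerate(graph):
        if op == "const":
            values[i] = args
            folded.append((op, args))
        elif all(a in values for a in args):
            a, b = (values[x] for x in args)
            values[i] = a * b if op == "mul" else a + b
            folded.append(("const", values[i]))
        else:
            folded.append((op, args))
    return folded

def codegen(graph):
    """'Backend': emit pseudo-C for the (already optimized) graph."""
    lines = [f"float v{i} = {args};" if op == "const"
             else f"float v{i} = v{args[0]} {'*' if op == 'mul' else '+'} v{args[1]};"
             for i, (op, args) in enumerate(graph)]
    return "\n".join(lines)

optimized = constant_fold(model)
print(codegen(optimized))
```

Real TVM does the same thing at a vastly larger scale: frontends import the model into Relay, optimization passes rewrite the IR, and a backend lowers it to code for the target device.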

The next picture represents TVM's transformation stages and the flow of data structures. The part on the right shows possible ways of extending TVM with existing tools and algorithms. Taken together, it shows the points that can be modified as part of the proposed solution.

Fig.3. Stages of Processing Initial Model
A typical backend implementation can combine several of these extension points.

Thanks to its modular structure and flexibility, TVM allows implementing different backend types for various architectures, and even combining them, to provide the best possible implementation depending on what solutions a chip manufacturer already has.

There are several TVM backend implementation approaches that can be used to support a new architecture effectively or to improve support for an existing one; the integration options below outline them.


Integration options

TVM upgrading projects can therefore vary, depending on which chipmaker-specific SDK elements already exist.

If a manufacturer has a neural network compiler, it can use TVM as a frontend extender and implement a converter from Relay to its compiler's internal representation.
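Such a converter is, at its core, a lowering table plus an IR walk. The sketch below shows the idea with a flat op list and invented vendor instruction names; a real converter would traverse the Relay AST with a visitor, and none of these names come from an actual chip SDK.

```python
# Toy sketch of a Relay-to-vendor-IR converter.  The mapping table and the
# NPU_* mnemonics are hypothetical, invented for this illustration.

RELAY_TO_VENDOR = {
    "nn.conv2d": "NPU_CONV",
    "nn.relu":   "NPU_ACT_RELU",
    "add":       "NPU_EWISE_ADD",
}

def convert(relay_ops):
    """Map each high-level op to the vendor compiler's instruction."""
    unsupported = [op for op in relay_ops if op not in RELAY_TO_VENDOR]
    if unsupported:
        # A real converter would fall back to TVM's own codegen here.
        raise NotImplementedError(f"no lowering for: {unsupported}")
    return [RELAY_TO_VENDOR[op] for op in relay_ops]

print(convert(["nn.conv2d", "nn.relu", "add"]))
# -> ['NPU_CONV', 'NPU_ACT_RELU', 'NPU_EWISE_ADD']
```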

Usually, such projects take a team of two developers up to four months.

If a manufacturer has a C/C++, OpenCL, or CUDA compiler, it can use the existing TVM backend to produce source code, or integrate its LLVM-based compiler into TVM's LLVM backend and adjust the scheduling stage to suit the architecture's specific features.
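The source-producing route means the backend emits C text that the manufacturer's existing toolchain compiles. Here is a deliberately trivial sketch of that idea; the function name and the kernel are invented, while real TVM lowers full tensor expressions through its own C/CUDA printers.

```python
# Toy sketch of a source-emitting backend: generate C for an element-wise
# add kernel and hand it to the vendor's own C compiler.

def emit_ewise_add_c(name, length):
    """Emit a C function computing out[i] = a[i] + b[i] for `length` floats."""
    return (
        f"void {name}(const float* a, const float* b, float* out) {{\n"
        f"    for (int i = 0; i < {length}; ++i)\n"
        f"        out[i] = a[i] + b[i];\n"
        f"}}\n"
    )

print(emit_ewise_add_c("ewise_add_16", 16))
```

The attraction of this option is that the vendor's compiler keeps doing the low-level work (register allocation, instruction selection), so the TVM integration stays thin.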

Resource-wise and time-wise such projects are similar to the previous one.

Finally, if a manufacturer doesn't have any compiler at all, it can implement a new TVM backend for the given architecture.

These are usually considerably more complex projects that require a larger team and take up to six months, compared to the previous options.

Why we chose TVM

As I said earlier, the first reason is the number of supported ML platforms: Keras, MXNet, PyTorch, TensorFlow, CoreML, DarkNet. This solves the first problem by increasing the number of potential clients: the more diverse the tools the accelerator developer's clients can use, the more clients there can be.

The second reason is the TVM community. It currently includes more than 500 contributors, and development is sponsored by giants such as Amazon, Arm, AMD, Microsoft, and Xilinx. Thanks to that, a chip developer automatically gets all new TVM optimizations and components, including support for new frontend versions. At the same time, TVM is under the Apache license, which allows its use in any commercial project while ensuring independence from big corporations. The developer of a new neural chip owns all the IP and decides what to share with the community.

What is more, the TVM ecosystem is constantly expanding, with companies like OctoML, Imagination Technologies, and ours contributing. This lets chipmakers give their users a variety of tools and services to maximize their devices' performance, which in turn gives them an advantage over competitors. For example, OctoML offers a service for improving neural network performance. It is compatible with most ML frameworks and platforms: Arm (A-class CPU/GPU, Arm M-class microcontrollers), Google Cloud Platform, AWS, AMD, Azure, and others. Clients get the ability to automatically tune their models on target hardware, which saves engineers the time usually spent on manual optimization and performance testing and gets them to market faster.

The third reason is really a combination of reasons: TVM's technical capabilities. Here is a list of the most important ones, based on our experience:

Fig.4. TVM Structure and Components.

Relay is a statically typed, purely functional, differentiable high-level intermediate representation. AutoTVM and the Auto Scheduler are machine-learning-based frameworks for optimizing tensor programs using evolutionary search and a learned cost model.
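The evolutionary-search idea behind AutoTVM and the Auto Scheduler can be sketched in a few lines. Everything below is a toy: the schedule space is just a tile size, and the cost function is invented; in the real system the cost comes from a learned model plus on-device measurements.

```python
# Toy sketch of evolutionary schedule search (invented cost model, NOT the
# real AutoTVM / Auto Scheduler implementation).

import random

random.seed(0)
CANDIDATE_TILES = [1, 2, 4, 8, 16, 32, 64]

def cost(tile):
    # Stand-in cost model: pretend tile 16 suits our imaginary cache best.
    return abs(tile - 16) + 1

def evolve(generations=10, population=4):
    """Keep the cheapest half each generation, refill with random mutations."""
    pop = random.sample(CANDIDATE_TILES, population)
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: population // 2]                         # selection
        children = [random.choice(CANDIDATE_TILES) for _ in survivors]  # mutation
        pop = survivors + children
    return min(pop, key=cost)

print(evolve())  # prints the cheapest tile size found by the search
```

The real systems explore vastly larger spaces (loop orders, vectorization, thread bindings), which is exactly why a learned cost model is needed: measuring every candidate on hardware would be far too slow.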

All of this makes it possible to create a full-fledged SDK for a new neural chip while staying on budget and on schedule. And the client gets even more features for the same money.

Potentially, the client cuts SDK development expenses by using a rapidly developing open-source framework, and can bring in highly qualified outsourced teams (which turned out to be crucial for our clients) while keeping their core technology in-house. This also brings the cost of support and maintenance down compared to a proprietary custom solution.

Most importantly, the client increases its potential user base by avoiding the need to tie the solution to a specific ML platform, and even gains a new channel for attracting users: the TVM community.

Let me take this opportunity to announce a captivating online conference taking place soon, on December 15-17: TVMCon 2021. TVMCon will cover the state of the art in deep learning compilation and optimization, with a range of tutorials, research talks, case studies, and industry presentations. The conference will discuss recent advances in frameworks, compilers, systems and architecture support, security, training, and hardware acceleration. Speakers from Microsoft, Xilinx, Google, Qualcomm, Arm, AWS, Intel Labs, AMD, NTT, EdgeCortix, Synopsys, OctoML, and others will take part. In short, if TVM has caught your attention, welcome! It promises to be interesting. Registration is free:


This may read like a eulogy for TVM. Honestly, that was not my intention: I set out to share the experience and best practices gathered over three years of working with TVM, which we use in our projects along with other tools, e.g. MLIR. Still, I consider TVM a great tool worthy of praise, so I left this introductory article as it is. If you are interested in this topic, I can share other TVM materials dedicated to particular projects and technical issues.