Design of AI may change with the open-source Apache TVM and a little help from startup OctoML

In recent years, artificial intelligence programs have been prompting change in the design of computer chips, and novel computers have likewise made possible new kinds of neural networks in AI. The result is a powerful feedback loop.

At the center of that sits the software technology that converts neural net programs to run on novel hardware. And at the center of that sits a recent open-source project gaining momentum.

Apache TVM is a compiler that operates differently from other compilers. Instead of turning a program into typical chip instructions for a CPU or GPU, it studies the “graph” of compute operations in a neural net written in TensorFlow or PyTorch, such as convolutions and other transformations, and figures out how best to map those operations to hardware based on the dependencies between them.
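As a rough illustration of what working from a dependency graph means, the following Python sketch groups the operations of a toy network into levels, where every operation in a level depends only on earlier levels and so could in principle be placed on hardware independently of its peers. This is not TVM's API or its actual algorithm; the graph, the operation names and the `schedule_levels` helper are assumptions made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical toy network: each operation maps to the set of
# operations whose outputs it needs (its dependencies).
graph = {
    "conv1": {"input"},
    "relu1": {"conv1"},
    "conv2": {"relu1"},
    "conv3": {"relu1"},
    "add": {"conv2", "conv3"},
}


def schedule_levels(deps):
    """Group operations into levels using a topological sort.

    Operations in the same level do not depend on one another.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    levels = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # ops whose inputs are all done
        levels.append(ready)
        sorter.done(*ready)
    return levels


levels = schedule_levels(graph)
for i, ops in enumerate(levels):
    print(i, ops)
```

In this toy graph, `conv2` and `conv3` land in the same level because both depend only on `relu1`, which is the kind of structural information a graph-level compiler can exploit.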

At the heart of that operation sits a two-year-old startup, OctoML, which offers Apache TVM as a service. As explored in March by ZDNet’s George Anadiotis, OctoML is in the field of MLOps, helping to operationalize AI. The company uses TVM to help companies optimize their neural nets for a wide variety of hardware.

In the latest development in the hardware and research feedback loop, TVM’s process of optimization may already be shaping aspects of how AI is developed.

“Already in research, people are running model candidates through our platform, looking at the performance,” said OctoML co-founder Luis Ceze, who serves as CEO, in an interview with ZDNet via Zoom. The detailed performance metrics mean that ML developers can “actually evaluate the models and pick the one that has the desired properties.”

Today, TVM is used exclusively for inference, the part of AI where a fully developed neural network is used to make predictions based on new data. But down the road, TVM will expand to training, the process of first developing the neural network.

Pictured: Luis Ceze, co-founder and CEO of startup OctoML, which is commercializing the open-source Apache TVM compiler for machine learning by turning it into a cloud service.

“Training and architecture search is in our roadmap,” said Ceze, referring to the process of designing neural net architectures automatically, by letting neural nets search for the optimal network design. “That’s a natural extension of our land-and-expand approach” to selling the commercial service of TVM, he said. 

Will neural net developers then use TVM to influence how they train?

“If they aren’t yet, I suspect they will start to,” said Ceze. “Someone who comes to us with a training job, we can train the model for you” while taking into account how the trained model would perform on hardware. 

That expanding role for TVM and the OctoML service follows from the fact that the technology is a broader platform than a typical compiler.

“You can think of TVM and OctoML by extension as a flexible, ML-based automation layer for acceleration that runs on top of all sorts of different hardware where machine learning models run: GPUs, CPUs, TPUs, accelerators in the cloud,” Ceze told ZDNet.

“Each of these pieces of hardware, it doesn’t matter which, have their own way of writing and executing code,” he said. “Writing that code and figuring out how to best utilize this hardware today is done today by hand across the ML developers and the hardware vendors.” 

The compiler and the service replace that hand tuning: today at the inference stage, when the model is ready for deployment, and tomorrow, perhaps, during development and training itself.

The crux of TVM’s appeal is greater performance in terms of throughput and latency, and efficiency in terms of computer power consumption. That is becoming more and more important for neural nets that keep getting larger and more challenging to run. 

“Some of these models use a crazy amount of compute,” observed Ceze, especially natural language processing models such as OpenAI’s GPT-3 that are scaling to a trillion neural weights, or parameters, and more. 

As such models scale up, they come with “extreme cost,” he said, “not just in the training time, but also the serving time” for inference. “That’s the case for all the modern machine learning models.”

As a consequence, without optimizing the models “by an order of magnitude,” said Ceze, the most complicated models aren’t really viable in production; they remain merely research curiosities.

But performing optimization with TVM involves its own complexity. “It’s a ton of work to get results the way they need to be,” observed Ceze. 

OctoML simplifies things by making TVM more of a push-button affair. 

“It’s an optimization platform,” is how Ceze characterizes the cloud service. 

“From the end user’s point of view, they upload the model, they compare the models, and optimize the values on a large set of hardware targets,” he said.

“The key is that this is automatic — no sweat and tears from low-level engineers writing code,” said Ceze. 

OctoML does the development work of making sure the models can be optimized for an increasing constellation of hardware.  

“The key here is getting the best out of each piece of hardware,” said Ceze. That means “specializing the machine code to the specific parameters of that specific machine learning model on a specific hardware target.” Something like an individual convolution in a typical convolutional neural network may become optimized to suit a particular hardware block of a particular hardware accelerator.
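One simple way to picture that kind of specialization is tile-size selection: the same matrix multiply gets broken into differently sized blocks depending on how much fast on-chip memory a target has. The Python sketch below is only an illustration of the idea, not TVM's tuning procedure. The `Target` class, the `buffer_bytes` figures, the candidate tile sizes and the `pick_tile` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    name: str
    buffer_bytes: int  # hypothetical on-chip buffer size, not a real spec


def pick_tile(m, n, k, target, elem_bytes=4, candidates=(8, 16, 32, 64, 128)):
    """Pick the largest square tile that suits both the model and the target.

    A tile size t qualifies if three t-by-t tiles (two inputs, one output)
    fit in the target's buffer and t evenly divides all matrix dimensions.
    Returns None if no candidate qualifies.
    """
    best = None
    for t in candidates:  # ascending, so the last match is the largest
        fits = 3 * t * t * elem_bytes <= target.buffer_bytes
        divides = m % t == 0 and n % t == 0 and k % t == 0
        if fits and divides:
            best = t
    return best


small = Target("small-accelerator", buffer_bytes=16 * 1024)
large = Target("large-accelerator", buffer_bytes=256 * 1024)

print(pick_tile(256, 256, 256, small))
print(pick_tile(256, 256, 256, large))
```

The same 256-by-256 multiply ends up with a different tile size on each hypothetical target, and a layer with different dimensions can change the answer again, which is why the choice depends on both the model and the hardware.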

The results are demonstrable. In benchmark tests published in September for the MLPerf test suite for neural net inference, OctoML had a top score for inference performance for the venerable ResNet image recognition algorithm in terms of images processed per second.

The OctoML service has been in a pre-release, early access state since December of last year.

To advance its platform strategy, OctoML earlier this month announced it had received $85 million in a Series C round of funding from hedge fund Tiger Global Management, along with existing investors Addition, Madrona Venture Group and Amplify Partners. The round of funding brings OctoML’s total funding to $132 million. 

The funding is part of OctoML’s effort to spread the influence of Apache TVM to more and more AI hardware. Also this month, OctoML announced a partnership with ARM Ltd., the U.K. company that is in the process of being bought by AI chip powerhouse Nvidia. That follows partnerships announced previously with Advanced Micro Devices and Qualcomm. Nvidia is also working with OctoML.

The ARM partnership is expected to spread use of OctoML’s service to the licensees of the ARM CPU core, which dominates mobile phones, networking and the Internet of Things.

The feedback loop will probably lead to other changes besides the design of neural nets. It may affect more broadly how ML is commercially deployed, which is, after all, the whole point of MLOps.

As optimization via TVM spreads, the technology could dramatically increase portability in ML serving, Ceze predicts. 

Because the cloud offers all kinds of trade-offs with all kinds of hardware offerings, being able to optimize on the fly for different hardware targets ultimately means being able to move more nimbly from one target to another.

“Essentially, being able to squeeze more performance out of any hardware target in the cloud is useful because it gives more target flexibility,” is how Ceze described it. “Being able to optimize automatically gives portability, and portability gives choice.”

That includes not only running on any available hardware in a cloud configuration but also choosing the hardware that happens to be cheaper for the same service-level agreements (SLAs), such as latency, throughput and cost in dollars.

With two machines that have equal latency on ResNet, for example, “you’ll always take the highest throughput per dollar,” meaning the machine that’s more economical, said Ceze. “As long as I hit the SLAs, I want to run it as cheaply as possible.”
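The selection rule Ceze describes can be written down in a few lines. The Python sketch below filters candidate machines by a latency SLA and then takes the one with the highest throughput per dollar. The machine names and all of the numbers are invented for illustration and are not benchmark results; `pick_target` is a hypothetical helper, not part of any OctoML or TVM interface.

```python
def pick_target(candidates, max_latency_ms):
    """Return the candidate with the best throughput per dollar
    among those that meet the latency SLA, or None if none do."""
    eligible = [c for c in candidates if c["latency_ms"] <= max_latency_ms]
    if not eligible:
        return None
    return max(
        eligible,
        key=lambda c: c["images_per_sec"] / c["dollars_per_hour"],
    )


# Invented figures for three hypothetical cloud machines.
machines = [
    {"name": "A", "latency_ms": 5.0, "images_per_sec": 1000, "dollars_per_hour": 2.0},
    {"name": "B", "latency_ms": 5.0, "images_per_sec": 1400, "dollars_per_hour": 3.5},
    {"name": "C", "latency_ms": 12.0, "images_per_sec": 3000, "dollars_per_hour": 1.0},
]

choice = pick_target(machines, max_latency_ms=8.0)
print(choice["name"] if choice else "no machine meets the SLA")
```

Machines A and B have equal latency, so the tie is broken by throughput per dollar, while machine C is excluded for missing the latency bound even though it is cheapest per image.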