
Cost-effective Hyper-parameter Tuning using NNI and AdaptDL


February 23, 2021

Acknowledgement: Microsoft NNI Team

Hyper-parameter tuning (HPT) is an essential step in deep learning workflows, allowing teams to push a model’s quality higher by systematically evaluating many different sets of hyper-parameters (each called a “trial”) and picking the best outcome. HPT is appealing because it is easy to automate and requires little engineering or coding. At Petuum, we use HPT to tune our models for Healthcare Report Writing, Industrial Process Optimization and Image Annotation, running dozens of trials per deployed model.

However, HPT requires large amounts of computing — proportional to the number of trials you run — and quickly becomes expensive in time and dollar cost. That’s especially challenging when factoring in the size of modern models. Take Natural Language Processing (NLP) as an example — the figure below shows recent language models reaching 100s of millions to billions of parameters, with training times measured in thousands of GPU-hours or more. Multiply that by tens or hundreds of HPT trials, and the whole HPT workflow may take days or weeks to complete, not to mention thousands of dollars or more in cloud compute costs.

[Figure: parameter counts of recent language models. Source: Sanh et al., 2019]

To tackle the problem of long and expensive HPT workflows, our team at Petuum collaborated with Microsoft to integrate AdaptDL with NNI. AdaptDL is an open-source tool in the CASL (Composable, Automatic, and Scalable Learning) ecosystem that provides adaptive resource management for distributed clusters and reduces the cost of deep learning workloads ranging from a few training/tuning trials to thousands. Specific benefits of AdaptDL include:

  • Elastic job scheduling for trials: while each trial is running, AdaptDL dynamically adjusts the number of GPUs assigned to it. Each trial may be distributed across several machines, and AdaptDL ensures each trial is only allocated GPUs it can efficiently utilize (a sketch of what this looks like inside a trial's training code follows this list).

  • Automatic cluster defragmentation: AdaptDL also automatically defragments the cluster to eliminate slow-downs due to network interference between concurrent trials.
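To make this concrete, here is a rough sketch of what an AdaptDL-aware PyTorch training loop looks like, based on the adaptdl.torch API described in the AdaptDL documentation. It is illustrative only: the model, optimizer, dataset, and epochs arguments are placeholders, it is not the exact CIFAR-10 example code, and API details may differ between AdaptDL versions.

import torch
import adaptdl.torch  # AdaptDL's elasticity wrappers around PyTorch

def train(model, optimizer, dataset, epochs):
    # Join (or re-join) the elastic process group; the AdaptDL scheduler may
    # restart this trial on a different number of GPUs at any time.
    adaptdl.torch.init_process_group(
        "nccl" if torch.cuda.is_available() else "gloo")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Wrap the model and optimizer so gradients are synchronized across
    # however many replicas are currently assigned to this trial.
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)

    # The adaptive data loader re-partitions the dataset and adjusts the
    # per-replica batch size when replicas are added or removed.
    dataloader = adaptdl.torch.AdaptiveDataLoader(
        dataset, batch_size=128, shuffle=True)

    # remaining_epochs_until() resumes from the last checkpoint after a
    # rescale or preemption instead of restarting from epoch 0.
    for epoch in adaptdl.torch.remaining_epochs_until(epochs):
        for inputs, targets in dataloader:
            inputs, targets = inputs.to(device), targets.to(device)
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()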

Neural Network Intelligence (NNI), from the Microsoft open-source community, is a toolkit for automatic machine learning (AutoML) and hyper-parameter tuning. NNI provides a frontend for managing AutoML experiments and a rich library of HPT and Neural Architecture Search (NAS) algorithms, and it dispatches and runs the trial jobs generated by these tuning algorithms to search for the best neural architecture and/or hyper-parameters. By running NNI trials using AdaptDL, we’ve been able to perform HPT 1.5x faster in our clusters and 3x cheaper on AWS. If you’re already using NNI in your workflow (or thinking about it), you can now plug in AdaptDL to make HPT faster, more efficient, and cheaper!
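For readers new to NNI, a trial is simply a script that asks NNI for a hyper-parameter set and reports metrics back. Below is a minimal sketch using NNI's standard trial API (nni.get_next_parameter, nni.report_intermediate_result, nni.report_final_result); the search-space keys and the fake accuracy metric are stand-ins for a real model's training and evaluation.

import random
import nni

def main():
    # The tuner proposes a hyper-parameter set for this trial, drawn from
    # the experiment's search space (the keys below are illustrative).
    params = nni.get_next_parameter() or {}
    lr = params.get("lr", 0.01)
    epochs = int(params.get("epochs", 4))

    best_accuracy = 0.0
    for epoch in range(epochs):
        # A real trial would train the model for one epoch with `lr` and
        # evaluate it; we substitute a random metric to keep the sketch short.
        accuracy = random.random()
        best_accuracy = max(best_accuracy, accuracy)

        # Intermediate results let NNI's assessors stop unpromising trials early.
        nni.report_intermediate_result(accuracy)

    # The final metric is what the tuner optimizes across trials.
    nni.report_final_result(best_accuracy)

if __name__ == "__main__":
    main()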

Getting Started

It’s straightforward to get started. You will need:

  • a Kubernetes cluster, either in the cloud (AWS EKS, Azure AKS, etc.) or on-premises, and
  • a local machine with Python 3, which will run NNI and manage the experiment.

If you don’t have a Kubernetes cluster and just want to try AdaptDL+NNI, you can follow this guide to set up a simple MicroK8s instance on your local machine.

Next steps:

  • Helm-install AdaptDL onto the Kubernetes instance:
$ helm install adaptdl adaptdl-sched \
  --repo https://github.com/petuum/adaptdl/raw/helm-repo \
  --namespace adaptdl --create-namespace \
  --set docker-registry.enabled=true

Please refer to the AdaptDL installation page for detailed instructions.

  • Pip-install NNI onto your local machine:
$ python3 -m pip install --upgrade nni

Please see the latest (2.0+) NNI release installation guide for detailed instructions.

Now AdaptDL+NNI should be ready to go! For more details, refer to the NNI AdaptDL Experiment page to verify that both are installed successfully and to get started with the examples.

Clone the NNI repository:

$ git clone -b v2.0 https://github.com/Microsoft/nni.git
$ cd nni

The NNI repository provides several AdaptDL examples; for instance:

  1. CIFAR-10: Configurations defined in
    examples/trials/cifar10_pytorch/config_adl.yml

The CIFAR-10 configuration file is shown below.

[CIFAR-10 configuration file]

Before running the CIFAR-10 example, modify the configuration file by setting the “nniManagerIp” field to the IP address of your local machine. You will also need to choose an appropriate Kubernetes storage class so that AdaptDL can checkpoint the model; for example, on MicroK8s you can use the storage class name “microk8s-hostpath” (as provided in config_adl.yml).

Then simply use nnictl to start your HPT experiment:

$ nnictl create --config examples/trials/cifar10_pytorch/config_adl.yml

Open the NNI GUI to watch your experiment run with AdaptDL!

[An experiment viewed in the NNI GUI]

What’s Next

Beyond the convenience this integration brings to ML experiments, the CASL open-source community at Petuum has other projects that are already compatible with NNI and AdaptDL. For example, Tuun can work together with NNI to offer more flexible tuning model choices; see the CASL Tuun page for an overview, and this Tuun documentation page for more details on how Tuun + NNI work together, including a couple of small Tuun + NNI examples.

About CASL

CASL provides a unified toolkit for composable, automatic, and scalable machine learning systems, including distributed training, resource-adaptive scheduling, hyperparameter tuning, and compositional model construction. CASL consists of many powerful open-source components that were built to work in unison or be leveraged individually for specific tasks, providing flexibility and ease of use.

Thanks for reading! Please visit the CASL website (https://www.casl-project.ai/) to stay up to date on additional CASL and NNI announcements in the near future. If you’re interested in working professionally on CASL, visit our careers page at Petuum!
