Blog
Open Source
Research

Introducing AdaptDL, an Open Source resource adaptive deep-learning framework

5 min

September 2, 2020

What is AdaptDL?

Petuum is very excited to announce the launch of our newest open source offering, AdaptDL, a resource-adaptive deep learning (DL) training and scheduling framework. The goal of AdaptDL is to make distributed DL easy and efficient in dynamic-resource environments such as shared clusters and the cloud. During our benchmark studies when using AdaptDL with Amazon Web Services (AWS), we recorded a reduction in cost by up to 80% when AdaptDL was set to automatically provision spot instances on AWS when available.

AdaptDL can automatically determine the optimal number of resources given a job’s need. It will efficiently add or remove resources dynamically to ensure the highest-level performance. Using a scheduler to leverage elasticity, AdaptDL quickly scales resources in and out of clusters to adapt to a changing availability pool allowing for faster job completion and more efficient resource allocation. A more detailed description can be found in our technical paper at https://arxiv.org/pdf/2008.12260.pdf.

Some core features offered by AdaptDL are:

  • Elastically schedule distributed DL training jobs in shared clusters.
  • Cost-aware resource auto-scaling in cloud computing environments (e.g. AWS).
  • Automatic batch size and learning rate scaling for distributed training.

Why Use AdaptDL?

AdaptDL’s state-of-the-art scheduling algorithm directly optimizes cluster-wide training performance and resource utilization. By elastically re-scaling jobs, co-adapting batch sizes and learning rates, and avoiding network interference, AdaptDL improves shared-cluster training compared with alternative schedulers.

In the cloud (e.g. AWS), AdaptDL auto-scales the size of the cluster based on how well those cluster resources are utilized. AdaptDL automatically provisions spot instances when available to reduce cost by up to 80%.

Efficient distributed training requires careful selection of the batch size and learning rate, which can be tricky to find manually. AdaptDL offers automatic batch size and learning rate scaling, which enables efficient distributed training without requiring manual tuning effort. To do this, each AdaptDL job measures its system performance and statistical properties during training, and adaptively selects the most efficient batch size and learning rate.

How do I use it?

AdaptDL with PyTorch

AdaptDL offers an easy-to-use API to make existing PyTorch training code elastic with adaptive batch sizes and learning rates.

Standard distributed PyTorch consists of a few key lines of code:

  # Initializing the distributed PyTorch process.
  torch.distributed.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

  # Enabling data-parallel distributed training.
  model = torch.nn.parallel.DistributedDataParallel(model)

  # Configuring the training data loader.
  dataloader = torch.utils.data.DataLoader(dataset, batch_size=64, drop_last=True)

  # Main epoch loop.
  for epoch in range(100):
  ...

Enabling AdaptDLis as simple as changing a few lines of code (shown below):

  # Initializing the elastic AdaptDL+PyTorch process.
  adaptdl.torch.init_process_group("nccl" if torch.cuda.is_available() else "gloo")

  # Enabling elastic training with performance profiling.
  model = adaptdl.torch.AdaptiveDataParallel(model, optimizer)

  # Configuring the elastic training data loader.
  dataloader = adaptdl.torch.AdaptiveDataLoader(dataset, batch_size=64, drop_last=True)

  # Enable the batch size to automatically be scaled up to 1024.
  # AdaptDL automatically scales the learning rate as well.
  data_loader.autoscale_batch_size(max_batch_size=1024)

  # Main epoch loop which handles checkpointing and resuming training.
  for epoch in adaptdl.torch.remaining_epochs_until(100):
  ...

See https://adaptdl.readthedocs.io/en/latest/adaptdl-pytorch.html for more details.

With a Shared Cluster

It’s easy to deploy the AdaptDL scheduler on Kubernetes, using just a few commands.

  $ helm repo add adaptdl https://raw.githubusercontent.com/petuum/adaptdl/helm-repo
  $ helm repo update
  $ helm install adaptdl adaptdl/adaptdl -n adaptdl --create-namespace --set docker-registry.enabled=true

Once the scheduler is installed and running, training jobs can be submitted using the AdaptDL CLI.

  $ adaptdl submit /path/to/your/code

The AdaptDL scheduler will automatically figure out the most efficient number of GPUs to allocate to your job, based on its scalability. When the cluster load is low, your job can dynamically expand to take advantage of more GPUs. When combined with adaptive batch size training, your job will also use the most efficient batch size and learning rate to best utilize those GPUs.

With Kubernetes, the AdaptDL scheduler can also be deployed to Amazon EKS clusters and take advantage of cheap spot instances.

See https://adaptdl.readthedocs.io/en/latest/installation/index.html for more details on training in Kubernetes clusters.

With Standalone Training

When you don’t need to share a cluster with anyone, AdaptDL can adaptively scale up the batch size and learning rate for individual training jobs, without using Kubernetes at all. Standalone Training can be useful for:

  1. Distributed training with adaptive batch sizes in a dedicated cluster.
  2. Local testing the training code before submitting to an AdaptDL cluster.

To implement Petuum AdaptDL Standalone Training, simply run your training script on each of your nodes, with a few environments variable set. For example,

On node-0:

  $ ADAPTDL_MASTER_ADDR=node-0 ADAPTDL_MASTER_PORT=47000 \  ADAPTDL_NUM_REPLICAS=2 ADAPTDL_REPLICA_RANK=0 python3 mnist.py

And on node-1:

  $ ADAPTDL_MASTER_ADDR=node-0 ADAPTDL_MASTER_PORT=47000 \  ADAPTDL_NUM_REPLICAS=2 ADAPTDL_REPLICA_RANK=1 python3 mnist.py

See https://adaptdl.readthedocs.io/en/latest/standalone-training.html for more details on standalone distributed training.

What’s Next?

Here at Petuum we set out to build an elegant and cutting-edge solution to overcome the rigidity of one-time allocation of cluster resources from start to finish. In this article we detailed how we solved this limitation using our solution AdaptDL Currently AdaptDL supports PyTorch training programs and we will be releasing, AdaptDL for TensorFlow soon. Please stay tuned.  AdaptDL from Petuum is part of the ongoing work of the Scalable ML team at Petuum. We are currently working hard on a series of Open Source projects that will be launching in the upcoming weeks.

To learn more about AdaptDL welcome to check out our resources page: https://adaptdl.readthedocs.io/en/latest/

Thanks for reading! We would love to know your experiences and feedback on using AdaptDL. Please check us out at Petuum to be kept in the loop on new scalable ML project launches in the near future.

Related

AutoDist Logo

AutoDist, an Open Source distributed deep learning training engine from Petuum

July 21, 2020

Introducing Texar-PyTorch: An ML Library Integrating the Best of TensorFlow into PyTorch

October 16, 2019