The model training process in big data machine learning is both computation- and memory-intensive. Many parallel machine learning algorithms consist of iterating a computation over a training dataset and updating the related model parameters until the model converges. In the Big Data era, both the volume of a dataset and the number of model parameters can be huge. To accelerate the iterative computation, it’s common to cache the training data and model parameters in memory. However, memory is limited, and in many scenarios the data and parameters do not all fit. It’s a challenge to run machine learning training algorithms efficiently under memory constraints.
Currently there are several approaches to addressing the memory limit challenge. These solutions operate at the application level, the runtime level, or the OS level.
By designing an algorithm with out-of-core execution for a particular training application, it’s possible to partition the dataset and model parameters so that part of them is cached in memory and the rest is kept on disk. With this smaller working set, the program can still execute using only the data and model parameters that are resident in memory. Since this solution is quite algorithm-specific, it places a heavy burden on developers to implement the algorithms correctly.
Another issue with this approach is that in algorithms such as Latent Dirichlet allocation with Gibbs sampling, memory usage varies across the stages of the training process. Since the model keeps shrinking, the total memory used also decreases and, as a result, the program can cache more of the dataset and model parameters in the later stages than at the beginning of training. If this type of out-of-core execution algorithm is used to partition the dataset and model parameters, it wastes memory in the later stages of execution, when memory consumption has decreased.
Frameworks such as Spark provide a set of transparent dataset APIs that leverage both memory and disks for data storage. Compared to algorithm-specific solutions, this releases developers from the burden of memory management when implementing training algorithms. However, the out-of-core execution is not fully transparent: it still requires job- or cluster-wide configuration of the storage levels, and it deals with persistent data storage but not with memory allocation during task execution.
Using OS swap space to address the memory limit challenge has the benefits of being adaptive, since it is only used when memory resources are constrained, and transparent to the application, as there is no requirement for additional programming. Additionally, using swap space can bypass the overhead of the file system.
However, this solution also has limitations. The amount of swap space is usually limited: its size is difficult to reconfigure dynamically, since doing so requires privileged access to the operating system and can only be done while swap is disabled. It’s also difficult to achieve high performance. Though it’s possible to tune swap by adjusting the swapping granularity and the eviction policy, doing so requires knowledge of virtual memory addresses, which are often hidden beneath many layers of system abstractions.
Adaptive, Transparent, and High-Performing Out-of-Core Execution
We’ve chosen to work at the runtime level so that we can provide a transparent interface between the user code of the training algorithm and the execution runtime, which hides the logic of the out-of-core execution from the application code. We use an abstraction of parallel execution runtime (Figure 1) in which each executor (a word we’re using to describe the parallel components) caches its related part of the dataset and maintains its own computation state. During iterations, these executors are either scheduled with computation tasks or coordinated for synchronization.
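This executor abstraction can be illustrated with a minimal Python sketch (the class and method names here are ours, invented for illustration; they are not part of any particular runtime):

```python
from dataclasses import dataclass, field

@dataclass
class Executor:
    """One parallel component: it caches its part of the training
    dataset and maintains its own computation state."""
    partition: list                              # this executor's slice of the dataset
    state: dict = field(default_factory=dict)    # e.g., partial model parameters

    def run_task(self):
        # Stand-in for one iteration's computation over the cached partition.
        self.state["sum"] = self.state.get("sum", 0) + sum(self.partition)

# There can be more executors than physical parallel units; the runtime
# simply schedules each one when a computation slot becomes available.
executors = [Executor(partition=[i, i + 1]) for i in range(0, 8, 2)]
for ex in executors:      # the scheduling order is up to the runtime;
    ex.run_task()         # correctness does not depend on it
```

Because each executor carries its own data and state, the runtime is free to run them in any order, or on any available worker, without affecting the result.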
This abstraction allows us to separate the conceptual parallelism from the real parallel execution ability of the hardware. When there are more executors than physical parallel units, the runtime will only execute them when the computation resource is available. That means that no matter what policy the runtime takes, as long as the parallel algorithm is expressed through executors, the correctness of parallel execution won’t be affected by how the execution is scheduled by the runtime.
The abstraction also means that the runtime can evict executors that are not currently running from memory. As long as the state and data of an executor are preserved, the correctness of the parallel execution is maintained when it’s swapped out. As a result, we can build an out-of-core execution policy inside the runtime engine: when parallel execution reaches the memory limit, the runtime can save executors that are not running to the filesystem and load them back later when their turn for execution comes.
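A minimal sketch of this swap-out/swap-in cycle in Python (using pickle files to stand in for the runtime’s serialization format; the function names are illustrative):

```python
import os
import pickle
import tempfile

class Executor:
    def __init__(self, data, state):
        self.data = data        # cached partition of the dataset
        self.state = state      # this executor's computation state

def swap_out(executor, directory):
    """Save a non-running executor to the filesystem so its memory
    can be released."""
    path = os.path.join(directory, "executor.bin")
    with open(path, "wb") as f:
        pickle.dump(executor, f)
    return path                 # caller drops its in-memory reference

def swap_in(path):
    """Load the executor back when its turn for execution comes."""
    with open(path, "rb") as f:
        return pickle.load(f)

workdir = tempfile.mkdtemp()
ex = Executor(data=[1, 2, 3], state={"iteration": 7})
path = swap_out(ex, workdir)
del ex                          # memory freed; state is preserved on disk
restored = swap_in(path)        # identical state and data after reload
```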
Our out-of-core execution is implemented with the following features:
- Adaptive to memory limit: We set the memory limit for the program built on our runtime. Initially, all of the executors are loaded into memory. Then, when the memory limit is hit, we save some executors that are not running to the filesystem and allow the current execution to continue with the released memory. Since the filesystem is much larger than swap space, our solution does not require pre-configuration, which offers more flexibility. In the later execution stages, if memory usage shrinks, all of the executors are automatically returned to memory.
- Transparent interfacing: When developers write parallel algorithms based on executors, they don’t need to implement any logic for out-of-memory detection or for swapping executors out and in. All of this is handled by the runtime. Developers only need to supply code that tells the runtime how to save and load their executors.
- High performance: Although framework-level out-of-core execution precludes optimizations that bypass the overhead of using the filesystem, it enables other powerful optimizations that are application-aware. Knowing the dependencies between the application’s tasks lets the runtime select and save the executor that has no assigned tasks, or that has the lowest task priority, in order to avoid memory thrashing.
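As an example of how little executor-specific code the transparent interface asks for, the developer might only supply a save/load pair like the following (a hypothetical interface sketched in Python; the runtime decides when these hooks are invoked):

```python
import io
import json

class KMeansExecutor:
    """Hypothetical executor for K-means: the developer describes only
    how to (de)serialize its data and state; eviction timing, OOM
    detection, and reloading are entirely the runtime's responsibility."""
    def __init__(self, points, centroids):
        self.points = points          # cached partition of the dataset
        self.centroids = centroids    # this executor's model state

    def save(self, f):
        json.dump({"points": self.points, "centroids": self.centroids}, f)

    @classmethod
    def load(cls, f):
        d = json.load(f)
        return cls(d["points"], d["centroids"])

# Round trip, as the runtime would perform it during eviction and reload.
buf = io.StringIO()
ex = KMeansExecutor(points=[[0.0, 1.0]], centroids=[[0.5, 0.5]])
ex.save(buf)
buf.seek(0)
restored = KMeansExecutor.load(buf)
```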
OOM Trap and Eviction
To detect out-of-memory conditions, we use “ulimit -v” to set the program’s upper virtual memory limit; when the limit is hit, memory allocation calls return null pointers or error codes. We also wrap the memory allocation APIs so that we can capture out-of-memory errors and add an extra eviction routine. These wrappers use “dlsym” to link to the memory allocation library and perform the actual allocation. If an allocation fails, the wrappers execute the eviction routine, which moves executors that are not in execution to the filesystem. If there is no executor to swap out and the allocation still fails, the program fails (Figure 2).
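The control flow of the wrapper can be sketched as follows. This is a Python simulation of the C-level logic, with a fake capacity-limited allocator standing in for malloc hitting the ulimit; the names are illustrative:

```python
class FakeHeap:
    """Stand-in for malloc under a ulimit: returns None (NULL) past
    a fixed capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, size):
        if self.used + size > self.capacity:
            return None            # malloc returns NULL at the limit
        self.used += size
        return size

    def free(self, size):
        self.used -= size

def wrapped_alloc(heap, size, idle_executors):
    """Wrapper around the allocation call: on failure, evict an idle
    executor to the filesystem and retry; fail only when nothing is
    left to evict."""
    while True:
        result = heap.alloc(size)
        if result is not None:
            return result
        if not idle_executors:
            raise MemoryError("no executor left to swap out")  # program fails
        evicted = idle_executors.pop()     # serialization to disk elided here
        heap.free(evicted["size"])         # the evicted executor's memory is released

heap = FakeHeap(capacity=100)
heap.alloc(90)                             # memory nearly exhausted
idle = [{"name": "executor-3", "size": 50}]
granted = wrapped_alloc(heap, 40, idle)    # triggers one eviction, then succeeds
```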
In the process of eviction, executors are serialized to disk as files. Since free memory may not be available at this moment, memory allocation calls are redirected to a reserved memory area where we manage all allocation activity during executor serialization. Because multiple threads may fail to allocate memory and reach the eviction routine concurrently, we avoid contention by putting the threads that fail to gain access to the executors to sleep; the thread that completes the eviction routine then wakes the sleeping threads. Since some memory allocation libraries may retain virtual memory even after it has been freed, meaning an allocation after the eviction may still fail, we use the Jemalloc library and ask it to return unused virtual memory.
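The sleep/notify coordination can be sketched with a condition variable (a simplified Python analog of the C-level mechanism; only one failing thread performs the eviction while the rest sleep until it finishes):

```python
import threading

lock = threading.Lock()
cond = threading.Condition(lock)
evicting = False
results = []                       # records what each thread ended up doing

def eviction_routine(thread_id):
    global evicting
    with cond:
        if evicting:
            cond.wait()            # another thread is already evicting: sleep
            results.append((thread_id, "woken"))
            return
        evicting = True            # this thread becomes the evictor
    # ... serialize idle executors to disk here (outside the lock) ...
    with cond:
        evicting = False
        cond.notify_all()          # wake the threads that went to sleep
        results.append((thread_id, "evicted"))

threads = [threading.Thread(target=eviction_routine, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```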
Since the runtime knows all of the task-scheduling information, we maintain the executors that are ready for swapping out in a queue ordered by task-execution priority: executors with pending tasks are placed at the head of the queue, in task-dispatching order, and executors without tasks at the tail. During the swap-out routine, the executor at the tail of the queue is picked for swapping out and removed from the queue. In this way, memory thrashing is avoided and, because the task execution plan is known, there is potential for additional out-of-core execution policies.
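That queue discipline can be sketched as follows (Python, with illustrative field names):

```python
from collections import deque

def build_eviction_queue(executors):
    """Executors with pending tasks go to the head, in task-dispatching
    order; executors without tasks go to the tail."""
    with_tasks = sorted((e for e in executors if e["tasks"]),
                        key=lambda e: e["dispatch_order"])
    without_tasks = [e for e in executors if not e["tasks"]]
    return deque(with_tasks + without_tasks)

def pick_for_swap_out(queue):
    """The tail holds the executor least likely to run soon, so evicting
    it avoids memory thrashing."""
    return queue.pop()

executors = [
    {"name": "A", "tasks": ["t1"], "dispatch_order": 1},
    {"name": "B", "tasks": [],     "dispatch_order": 0},
    {"name": "C", "tasks": ["t0"], "dispatch_order": 0},
]
queue = build_eviction_queue(executors)   # head-to-tail: C, A, B
victim = pick_for_swap_out(queue)         # "B": it has no pending tasks
```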
Our out-of-core execution can also be integrated into a scheduler for distributed computing to achieve auto-scaling. When executor eviction is detected, as long as resources are still available, the scheduler can spawn more processes to run the evicted executors so that the execution can continue at greater speed.
We compared our prototype with the system swap using mini-batch K-means on the AWS cloud with two types of instances (Figure 3): m4.2xlarge with 8 vCPUs and 32GB of memory, and i3.xlarge with 4 vCPUs and 32GB of memory. We set the number of computation threads to eight and four, respectively, and used 256GB general-purpose SSD (gp2) EBS volumes. We used a dataset derived from the full ImageNet ILSVRC2012 dataset, containing 1,140,297 images with 10K features each. We ran 100 batches to train 1,000 classes, with each batch consisting of 40 tasks and each task covering 1,100 images. For our system, we limited the virtual memory size to a value close to the total physical memory size (“ulimit -v 31000000”); for the system swap, we let the OS schedule 96GB of virtual memory between the physical memory and the swap space.
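To see why the memory limit matters for this workload: assuming the features are stored as 32-bit floats (our assumption; the element type is not stated above), the dataset alone substantially exceeds the 32GB of physical memory on either instance:

```python
images = 1_140_297
features_per_image = 10_000
bytes_per_feature = 4            # assumption: 32-bit floats

total_bytes = images * features_per_image * bytes_per_feature
total_gb = total_bytes / 1e9     # ~45.6 GB, well over 32 GB of RAM
```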
We found that on i3.xlarge, our system took 2.13 hours while the system swap took 12.30 hours. On m4.2xlarge, our system took 1.52 hours while the system swap took 13.10 hours. The system swap was so much slower because its execution generated memory thrashing, which produced high IOPS (input/output operations per second) and hit the performance upper limit of the EBS volumes.
These preliminary results show that our approach to out-of-core execution significantly decreases time costs when compared to the system swap method. With transparent interfacing and adaptive memory management, we anticipate that this method will help lower the difficulty of developing and deploying parallel model training applications in big data machine learning while still achieving high performance. Since not every AI/ML user has access to the unlimited computing resources of big companies or national laboratories, this method helps to democratize access to big data AI/ML applications.