Petascale Machine Learning

Deep learning and AI benefit greatly from very large training sets, but the software needed to use them remains complex, and few companies or academic institutions are currently able to handle problems at this scale. This project aims to develop simple, easy-to-use, efficient tools that allow deep learning and machine learning to scale to training datasets that are petabytes in size, without having to hire an entire IT staff.

The initial target of this work is a typical CPU/GPU cluster with 2 PB of rotational storage, 144 V100 GPUs, 100 Gbit point-to-point connectivity, and datasets of up to 1 PB. The software stack is based on Kubernetes and PyTorch. On this hardware, we achieve linear scaling of I/O performance up to the drives' maximum aggregate bandwidth of 40 GB/s.

Together with Alex Aizman, I gave a tutorial on some of these issues in large-scale deep learning at IEEE BigData 2019.

We have been developing a number of server-side and client-side tools to make working with very large datasets easier:

  • WebDataset is a drop-in replacement for the PyTorch Dataset class; it permits training PyTorch and TensorFlow models with minimal changes on very large datasets served from web servers, cloud storage, object stores, and local disk (see the sketch after this list).
  • AIStore is a specialized, distributed web server, object store, and content distribution network geared towards serving very large training sets to distributed, large-scale deep learning jobs.
  • tarproc and tarp are simple tools for performing map-reduce-style computations on large datasets.
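
The snippet below is a minimal sketch of how WebDataset plugs into a standard PyTorch input pipeline; the shard URL pattern and the "jpg"/"cls" field names inside the tar files are illustrative assumptions, not part of the project itself.

    # Minimal WebDataset sketch; shard URLs and sample field names are assumed.
    import torch
    import webdataset as wds

    # Shards are ordinary tar files served over HTTP, from cloud storage,
    # from an object store such as AIStore, or from local disk.
    urls = "http://server/imagenet/train-{000000..000146}.tar"

    dataset = (
        wds.WebDataset(urls)
        .shuffle(1000)            # shuffle samples in an in-memory buffer
        .decode("torchrgb")       # decode images to CHW float tensors
        .to_tuple("jpg", "cls")   # select the image and label from each sample
    )

    # WebDataset is an IterableDataset, so it works with the standard DataLoader.
    loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=4)

    for image, label in loader:
        ...  # ordinary PyTorch training step

Because shards are plain tar files, training reads them sequentially rather than seeking for individual samples, which is what makes inexpensive rotational storage practical for this workload.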

Using these tools, we have demonstrated linear scaling with negligible overhead for very large training jobs. Furthermore, these tools provide high-speed I/O using inexpensive rotational storage.

All tools can be deployed directly onto Kubernetes clusters. For batch scheduling and deployment, we use Argo and ArgoCD.