Scaling Deep Learning on Databricks

Training modern deep learning models in a timely fashion requires GPUs to accelerate the process. Ensuring that this expensive hardware is properly utilised and scales efficiently is complex, however. Every step, from data storage and loading through preprocessing to distributing the model training itself, requires careful thought. To reduce the cost of training a model, we need to make the best use of our hardware resources. The GPUs we rely on are typically memory constrained, with much less VRAM available than CPU RAM. As such, we need to leverage a variety of libraries to keep our GPUs busy. By using Petastorm to handle data loading, and PyTorch Lightning and Horovod to handle distributed training, we can leverage commodity Spark clusters to accelerate the training of our deep learning models.
