
Asynchronous Multiprocess Large Model Training on PyTorch

Formal Metadata

Title
Asynchronous Multiprocess Large Model Training on PyTorch
Series Title
Number of Parts
8
Author
Contributors
License
CC Attribution 4.0 International:
You may use, modify, reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
"With the increasing popularity of large machine learning models capable of solving complicated tasks in the sphere of natural language processing, computer vision, etc., the need for distributed computation has rocketed significantly. We would like to provide the "surgery" of parallelization methods from one of the most popular deep learning frameworks - PyTorch. Particularly, we would like to demonstrate two main approaches: data parallelization (when the single module is trained asynchronically in streams) and model parallelization (both horizontal – with several models trained simultaneously, and vertical – when the model parameters are split into groups). Moreover, we will guide through the cases of different resources availability, i.e. what could be done when having only CPUs, a single GPU, or multiple GPUs. Our showing is to be done on an example of urban planning problem solution, where we are creating synthetic cities with deep convolutional generative adversarial neural networks. These models have complicated architecture and billions of parameters when generating images starting from mid-resolution like 256x256, which makes them perfect instances for distributed computation demonstration.