
Efficient Large-Scale Language Model Training on GPU Clusters

Author: Databricks

Uploaded: 2021-08-26

Views: 7408

Description:

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required to train these models can result in unrealistically long training times. New methods of model parallelism such as tensor and pipeline parallelism have been proposed to address these challenges; unfortunately, naive usage leads to fundamental scaling issues at thousands of GPUs due to various reasons, e.g., expensive cross-node communication or idle periods waiting on other devices.
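To make the memory argument concrete, here is a back-of-the-envelope sketch (not from the talk) of how much state training a large model requires. It assumes fp16 weights with an Adam optimizer keeping fp32 master weights and moments, i.e. roughly 16 bytes of state per parameter; the model sizes and the 40-80 GB single-GPU capacity are illustrative.

```python
# Rough estimate of why a large model cannot fit on one GPU (or one server).
# Assumption: ~16 bytes of training state per parameter (fp16 weights plus
# Adam's fp32 master weights and moment estimates).

def training_state_gb(num_params, bytes_per_param=16):
    """Approximate memory needed just for weights + optimizer state, in GB."""
    return num_params * bytes_per_param / 1e9

for params in (175e9, 1e12):  # a GPT-3-scale model and a 1-trillion-parameter model
    need = training_state_gb(params)
    print(f"{params / 1e9:>6.0f}B params -> ~{need:,.0f} GB of state "
          f"(vs. 40-80 GB on a single A100)")
```

Even before counting activations, the state alone runs to terabytes, which is why the weights themselves must be partitioned across many GPUs rather than merely replicated.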

In this work, we show how to compose different types of parallelism methods (tensor, pipeline, and data parallelism) to scale to thousands of GPUs, achieving a two-order-of-magnitude increase in the sizes of models we can efficiently train compared to existing systems. We discuss various implementations of pipeline parallelism and propose a novel schedule that can improve throughput by more than 10% with comparable memory footprint compared to previously-proposed approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. The composition of these techniques allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of peak; previous efforts to train similar-sized models achieve much lower throughput (36% of theoretical peak). Our code has been open-sourced at https://github.com/nvidia/megatron-lm.
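The quoted aggregate throughput can be cross-checked with a few lines of arithmetic. The sketch below (Python) reproduces the roughly 52%-of-peak figure, assuming A100-class GPUs with a 312 TFLOP/s half-precision peak, and shows how a fixed GPU count factors into the three parallelism degrees; the particular 8 x 64 x 6 split is an illustrative assumption, not taken from the abstract.

```python
# Sanity-check of the throughput numbers quoted above, plus a sketch of how
# a GPU count decomposes into tensor-, pipeline-, and data-parallel degrees.

TOTAL_GPUS = 3072
AGGREGATE_PFLOPS = 502        # achieved aggregate throughput, from the abstract
A100_PEAK_TFLOPS = 312        # fp16/bf16 tensor-core peak of an A100 (assumed hardware)

per_gpu_tflops = AGGREGATE_PFLOPS * 1000 / TOTAL_GPUS
print(f"Per-GPU throughput: {per_gpu_tflops:.0f} TFLOP/s "
      f"({per_gpu_tflops / A100_PEAK_TFLOPS:.0%} of peak)")  # ~163 TFLOP/s, ~52%

# Every GPU belongs to exactly one tensor-, pipeline-, and data-parallel group,
# so the three degrees must multiply to the total GPU count.
tensor_parallel, pipeline_parallel = 8, 64        # hypothetical degrees
data_parallel = TOTAL_GPUS // (tensor_parallel * pipeline_parallel)
assert tensor_parallel * pipeline_parallel * data_parallel == TOTAL_GPUS
print(f"TP x PP x DP = {tensor_parallel} x {pipeline_parallel} x {data_parallel}")
```

The constraint that the three degrees multiply to the total GPU count is what makes configuring such a job a trade-off study rather than a free choice, which is the question the talk addresses quantitatively.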


Related videos

Distributed ML Talk @ UC Berkeley

Training LLMs at Scale - Deepak Narayanan | Stanford MLSys #83

How to Train Your Own Large Language Models

Scale ANY Model: PyTorch DDP, ZeRO, Pipeline & Tensor Parallelism Made Simple (2025 Guide)

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Jared Casper

vLLM Office Hours - Distributed Inference with vLLM - January 23, 2025

A friendly introduction to distributed training (ML Tech Talks)

Multi GPU Fine tuning with DDP and FSDP

Making GPUs Really Fast: A Deep Dive into Training Efficiency

Mastering LLM Inference Optimization: From Theory to Cost-Effective Deployment: Mark Moyou

Accelerating Data Ingestion with Databricks Autoloader

Ultimate Guide To Scaling ML Models - Megatron-LM | ZeRO | DeepSpeed | Mixed Precision

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

The Evolution of Multi-GPU Inference in vLLM | Ray Summit 2024

Microsoft DeepSpeed introduction at KAUST

LLM Foundations (LLM Bootcamp)

Wizard Driven AI Anomaly Detection with Databricks in Azure

ChatGPT vs Thousands of GPUs! || How ML Models Train at Scale!

How Fully Sharded Data Parallel (FSDP) works?

Kubernetes Explained Simply with a Clear Example
