Creating a large dataset for pretraining LLMs by Guilherme Penedo

Author: Data Makers Fest

Uploaded: 2025-04-24

Views: 117

Description:

How do you build a dataset capable of training a powerful Large Language Model (LLM)? In this Data Makers Fest talk, Guilherme Penedo explores the essential steps in creating large-scale pretraining datasets for LLMs.

The session covers key insights from recent dataset projects such as RefinedWeb, Dolma, and Yi, as well as open-source tools such as Datatrove that streamline collecting and scaling massive text datasets.

Watch the full video to understand the fundamentals of dataset curation and how it impacts LLM performance.

::::::
If you love watching content like this, consider joining us in person at the next event: www.datamakersfest.com

👉 FOLLOW US
Instagram: / datamakersfest
LinkedIn: / data-makers-fest

Our channel features talks for anyone building products and services with and around data. Subscribe to our channel for videos on Data Science, Machine Learning, AI, Data Engineering, and more.

Data Makers Fest videos may be used for non-commercial purposes under a Creative Commons Attribution-NonCommercial-NoDerivatives license (CC BY-NC-ND 4.0 International). To use the talk for other purposes, please contact us at [email protected].

#datamakersfest #datascience #ai #machinelearning #dataengineering
