DuckDB and recommenders: a lightning-fast synergy ft. Khalil Muhammad
Author: MotherDuck
Uploaded: 2024-02-19
Views: 3156
Talk from the DuckDB user meetup that happened in Dublin on 23 January 2024!
Future events: https://motherduck.com/events/
☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck : https://hubs.la/Q02QnFR40
📓 Resources
Slides : https://docs.google.com/presentation/...
Khalil Linkedin : / mihai-bojin
➡️ Follow Us
LinkedIn: / motherduck
Twitter : / motherduck
Blog: https://motherduck.com/blog/
#datascience #dataengineering #duckdb
--------------------------------------
Discover how DuckDB revolutionizes machine learning workflows, particularly for building recommender systems. This video moves beyond simple SQL queries to showcase DuckDB's power in accelerating development. We start with a primer on recommender systems, explaining how they learn user preferences using "positive samples" (what users interact with) and the often-elusive "negative samples." You'll understand the common challenges in ML projects, such as ensuring reproducibility for your data science team, managing scalability with growing data, and avoiding GPU IO bottlenecks during model training.
Learn how DuckDB acts as the central glue in your data engineering pipeline to solve collaboration and scale. We demonstrate a practical architecture using a "dataset spec" to create reproducible snapshots of your training data from various cloud data sources, so that every team member trains on the same data. For handling datasets larger than memory, we dive into a key technique for PyTorch and TensorFlow model training: creating an iterable dataset. By using DuckDB's `fetch_record_batch` method, you can stream query results to your model in fixed-size batches, keeping the GPU fed and enabling training on datasets that do not fit in memory.
Unlock incredible speed with DuckDB performance tuning and advanced features. We'll show you why DuckDB is significantly faster than Pandas for many data manipulation tasks and how proper memory configuration is key. A major highlight is implementing a custom negative sampling algorithm directly within SQL using a DuckDB Python UDF (User-Defined Function), a task that is often complex in other systems. Through concrete benchmarks, you'll see a potential 10x performance gain. We also share practical DuckDB optimization tips, including how to analyze the memory impact of window functions and set memory limits to prevent errors.
Finally, we cover essential best practices for productionizing your DuckDB-powered ML pipeline. Learn the importance of data hygiene and establishing a single, configured entry point for your DuckDB connections to ensure consistency. This video illustrates that by adopting DuckDB, you gain not just raw speed but also the convenience and cost-savings needed for modern machine learning tasks, making it a powerful tool for any data professional looking to build and deploy recommender systems efficiently.