DuckDB & Python | End-To-End Data Engineering Project (1/3)
Author: MotherDuck
Uploaded: 2024-02-02
Views: 33923
In this video, @mehdio walks through a fun end-to-end data engineering project: getting usage insights for a Python library using Python, SQL, and DuckDB! This is the first part of the series. Check the links below to learn about transformation and dashboarding with DuckDB!
🎥 Part 2 of the end-to-end data engineering project: • DuckDB & dbt | End-To-End Data Engineering...
🎥 Part 3: • DuckDB & dataviz | End-To-End Data Enginee...
☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck : https://hubs.la/Q02QnFR40
📓 Resources
GitHub repo of the tutorial: https://github.com/mehd-io/pypi-duck-...
BigQuery performance issue with certain libraries: https://github.com/googleapis/python-...
DuckDB for beginners video: • DuckDB Tutorial For Beginners In 12 min
➡️ Follow Us
LinkedIn: / motherduck
Twitter : / motherduck
Blog: https://motherduck.com/blog/
0:00 Intro
1:06 Architecture
3:13 Ingestion Pipeline Python & DuckDB
41:08 Wrapping up & what's next
#duckdb #dataengineering #sql #python
Learn how to build a complete, end-to-end data engineering project using Python, SQL, and DuckDB. This video guides you through creating a robust Python data pipeline to ingest and analyze PyPI download statistics, providing valuable insights into any Python library's adoption. We'll cover the full architecture, from sourcing raw data in Google BigQuery to preparing it for transformation and visualization, making this a perfect tutorial for anyone looking to apply data engineering best practices in a real-world scenario.
We kick off the data ingestion phase by demonstrating how to efficiently query massive public datasets in BigQuery without incurring high costs, focusing on partition filtering for optimization. You'll learn how to set up a professional development environment using Docker and VS Code dev containers, and we'll install all the necessary libraries, including the Google Cloud SDK, Pandas for data manipulation, and of course, the DuckDB Python package. This setup ensures your data pipeline is reproducible and isolated.
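To make the partition-filtering idea concrete, here is a minimal sketch of pulling PyPI download rows from BigQuery into a Pandas DataFrame. The table path, column names, and the example package are assumptions based on the public bigquery-public-data.pypi.file_downloads dataset, not the exact query used in the video.

from google.cloud import bigquery

def fetch_pypi_downloads(start_date: str, end_date: str, package: str):
    # Uses the default GCP credentials/project configured via the Google Cloud SDK
    client = bigquery.Client()
    sql = """
        SELECT timestamp, country_code, project, file.version AS version
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project = @package
          -- Partition filter: BigQuery only scans (and bills) the requested days
          AND TIMESTAMP_TRUNC(timestamp, DAY)
              BETWEEN TIMESTAMP(@start_date) AND TIMESTAMP(@end_date)
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("package", "STRING", package),
            bigquery.ScalarQueryParameter("start_date", "STRING", start_date),
            bigquery.ScalarQueryParameter("end_date", "STRING", end_date),
        ]
    )
    return client.query(sql, job_config=job_config).to_dataframe()

df = fetch_pypi_downloads("2024-01-01", "2024-01-07", "duckdb")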
Discover Python data pipeline best practices as we structure our code for maintainability and robustness. We use Pydantic to define clear data models for our job parameters and, critically, for schema validation against the source data from BigQuery. This prevents data quality issues from breaking your pipeline downstream. We also leverage the Fire library to automatically generate a powerful and flexible command-line interface (CLI) from our Pydantic models, making the pipeline easy to parameterize and run.
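Here is a hedged sketch of that Pydantic + Fire pattern: a model for the job parameters and a CLI generated from a plain function. The model name, fields, and defaults are illustrative, not the repo's actual code.

from pydantic import BaseModel
import fire

class PypiJobParameters(BaseModel):
    start_date: str                 # e.g. "2024-01-01"
    end_date: str                   # e.g. "2024-01-07"
    pypi_project: str = "duckdb"    # library whose downloads we ingest
    destination: str = "local"      # "local", "s3", or "md" (MotherDuck)

def main(**kwargs):
    # Pydantic validates the CLI arguments: wrong types or missing fields fail fast
    params = PypiJobParameters(**kwargs)
    print(f"Ingesting {params.pypi_project} downloads "
          f"from {params.start_date} to {params.end_date} -> {params.destination}")
    # ...fetch from BigQuery, validate the schema, load into DuckDB, export...

if __name__ == "__main__":
    # Fire turns main() into a CLI, e.g.:
    #   python ingest.py --start_date=2024-01-01 --end_date=2024-01-07 --pypi_project=duckdb
    fire.Fire(main)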
See how DuckDB acts as the powerful core of our ingestion logic. After fetching data into a Pandas DataFrame, we seamlessly load it into an in-memory DuckDB instance. This simplifies complex tasks like creating reliable test fixtures for schema validation and exporting the validated data to multiple destinations. Learn the simple SQL commands to write data locally, push to a data lake on AWS S3 with efficient Hive partitioning, or load it directly into MotherDuck for a serverless cloud data warehouse experience.
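As a rough sketch of that export step (under the assumption of placeholder bucket, database, and table names, with the httpfs extension and S3/MotherDuck credentials already configured), registering the DataFrame with an in-memory DuckDB connection and running COPY could look like this:

import duckdb
import pandas as pd

# Tiny stand-in for the validated BigQuery result; real runs use the full DataFrame
df = pd.DataFrame({
    "project": ["duckdb", "duckdb"],
    "version": ["0.9.2", "0.10.0"],
    "year":    [2024, 2024],
    "month":   [1, 2],
})

con = duckdb.connect()              # in-memory DuckDB instance
con.register("pypi_downloads", df)  # DuckDB can query the DataFrame directly

# 1) Write locally as Parquet
con.sql("COPY pypi_downloads TO 'pypi_file_downloads.parquet' (FORMAT PARQUET)")

# 2) Push to a data lake on S3 with Hive partitioning
#    (assumes 'year'/'month' columns exist; bucket name is a placeholder)
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("""
    COPY pypi_downloads TO 's3://my-bucket/pypi_file_downloads'
    (FORMAT PARQUET, PARTITION_BY (year, month))
""")

# 3) Or load straight into MotherDuck (assumes MOTHERDUCK_TOKEN is set)
md = duckdb.connect("md:my_db")
md.register("pypi_downloads", df)
md.sql("CREATE OR REPLACE TABLE pypi_file_downloads AS SELECT * FROM pypi_downloads")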
By the end of this tutorial, you'll have built a fully functional raw data ingestion pipeline, ready for the next step. This video sets the foundation for the series, where we'll next use dbt and DuckDB to build the transformation layer. You'll gain practical skills in data engineering, schema management, and building efficient pipelines with modern developer tools.