DuckDB & Python | End-To-End Data Engineering Project (1/3)
Author: MotherDuck
Uploaded: 2024-02-02
Views: 33923
In this video, @mehdio walks through a fun end-to-end data engineering project: getting usage insights for a Python library using Python, SQL, and DuckDB! This is the first part of the series. Check the links below to learn about transformation and dashboarding with DuckDB!
🎥 Part 2 of the end-to-end data engineering project: • DuckDB & dbt | End-To-End Data Engineering...
🎥 Part 3: • DuckDB & dataviz | End-To-End Data Enginee...
☁️🦆 Start using DuckDB in the Cloud for FREE with MotherDuck : https://hubs.la/Q02QnFR40
📓 Resources
GitHub repo of the tutorial: https://github.com/mehd-io/pypi-duck-...
BigQuery performance issue with certain libraries: https://github.com/googleapis/python-...
DuckDB for beginners video: • DuckDB Tutorial For Beginners In 12 min
➡️ Follow Us
LinkedIn: / motherduck
Twitter : / motherduck
Blog: https://motherduck.com/blog/
0:00 Intro
1:06 Architecture
3:13 Ingestion Pipeline Python & DuckDB
41:08 Wrapping up & what's next
#duckdb #dataengineering #sql #python
Learn how to build a complete, end-to-end data engineering project using Python, SQL, and DuckDB. This video guides you through creating a robust Python data pipeline to ingest and analyze PyPI download statistics, providing valuable insights into any Python library's adoption. We'll cover the full architecture, from sourcing raw data in Google BigQuery to preparing it for transformation and visualization, making this a perfect tutorial for anyone looking to apply data engineering best practices in a real-world scenario.
We kick off the data ingestion phase by demonstrating how to efficiently query massive public datasets in BigQuery without incurring high costs, focusing on partition filtering for optimization. You'll learn how to set up a professional development environment using Docker and VS Code dev containers, and we'll install all the necessary libraries, including the Google Cloud SDK, Pandas for data manipulation, and of course, the DuckDB Python package. This setup ensures your data pipeline is reproducible and isolated.
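To make the partition-filtering idea concrete, here is a minimal sketch of pulling PyPI download rows from BigQuery into a Pandas DataFrame. The table path, column names, and the example package are assumptions based on the public bigquery-public-data.pypi.file_downloads dataset, not the exact query used in the video.

from google.cloud import bigquery

def fetch_pypi_downloads(start_date: str, end_date: str, package: str):
    # Uses the default GCP credentials/project configured via the Google Cloud SDK
    client = bigquery.Client()
    sql = """
        SELECT timestamp, country_code, project, file.version AS version
        FROM `bigquery-public-data.pypi.file_downloads`
        WHERE project = @package
          -- Partition filter: BigQuery only scans (and bills) the requested days
          AND TIMESTAMP_TRUNC(timestamp, DAY)
              BETWEEN TIMESTAMP(@start_date) AND TIMESTAMP(@end_date)
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("package", "STRING", package),
            bigquery.ScalarQueryParameter("start_date", "STRING", start_date),
            bigquery.ScalarQueryParameter("end_date", "STRING", end_date),
        ]
    )
    return client.query(sql, job_config=job_config).to_dataframe()

df = fetch_pypi_downloads("2024-01-01", "2024-01-07", "duckdb")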
Discover Python data pipeline best practices as we structure our code for maintainability and robustness. We use Pydantic to define clear data models for our job parameters and, critically, for schema validation against the source data from BigQuery. This prevents data quality issues from breaking your pipeline downstream. We also leverage the Fire library to automatically generate a powerful and flexible command-line interface (CLI) from our Pydantic models, making the pipeline easy to parameterize and run.
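Here is a hedged sketch of that Pydantic + Fire pattern: a model for the job parameters and a CLI generated from a plain function. The model name, fields, and defaults are illustrative, not the repo's actual code.

from pydantic import BaseModel
import fire

class PypiJobParameters(BaseModel):
    start_date: str                 # e.g. "2024-01-01"
    end_date: str                   # e.g. "2024-01-07"
    pypi_project: str = "duckdb"    # library whose downloads we ingest
    destination: str = "local"      # "local", "s3", or "md" (MotherDuck)

def main(**kwargs):
    # Pydantic validates the CLI arguments: wrong types or missing fields fail fast
    params = PypiJobParameters(**kwargs)
    print(f"Ingesting {params.pypi_project} downloads "
          f"from {params.start_date} to {params.end_date} -> {params.destination}")
    # ...fetch from BigQuery, validate the schema, load into DuckDB, export...

if __name__ == "__main__":
    # Fire turns main() into a CLI, e.g.:
    #   python ingest.py --start_date=2024-01-01 --end_date=2024-01-07 --pypi_project=duckdb
    fire.Fire(main)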
See how DuckDB acts as the powerful core of our ingestion logic. After fetching data into a Pandas DataFrame, we seamlessly load it into an in-memory DuckDB instance. This simplifies complex tasks like creating reliable test fixtures for schema validation and exporting the validated data to multiple destinations. Learn the simple SQL commands to write data locally, push to a data lake on AWS S3 with efficient Hive partitioning, or load it directly into MotherDuck for a serverless cloud data warehouse experience.
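As a rough sketch of that export step (under the assumption of placeholder bucket, database, and table names, with the httpfs extension and S3/MotherDuck credentials already configured), registering the DataFrame with an in-memory DuckDB connection and running COPY could look like this:

import duckdb
import pandas as pd

# Tiny stand-in for the validated BigQuery result; real runs use the full DataFrame
df = pd.DataFrame({
    "project": ["duckdb", "duckdb"],
    "version": ["0.9.2", "0.10.0"],
    "year":    [2024, 2024],
    "month":   [1, 2],
})

con = duckdb.connect()              # in-memory DuckDB instance
con.register("pypi_downloads", df)  # DuckDB can query the DataFrame directly

# 1) Write locally as Parquet
con.sql("COPY pypi_downloads TO 'pypi_file_downloads.parquet' (FORMAT PARQUET)")

# 2) Push to a data lake on S3 with Hive partitioning
#    (assumes 'year'/'month' columns exist; bucket name is a placeholder)
con.sql("INSTALL httpfs")
con.sql("LOAD httpfs")
con.sql("""
    COPY pypi_downloads TO 's3://my-bucket/pypi_file_downloads'
    (FORMAT PARQUET, PARTITION_BY (year, month))
""")

# 3) Or load straight into MotherDuck (assumes MOTHERDUCK_TOKEN is set)
md = duckdb.connect("md:my_db")
md.register("pypi_downloads", df)
md.sql("CREATE OR REPLACE TABLE pypi_file_downloads AS SELECT * FROM pypi_downloads")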
By the end of this tutorial, you'll have built a fully functional raw data ingestion pipeline, ready for the next step. This video sets the foundation for the series, where we'll next use dbt and DuckDB to build the transformation layer. You'll gain practical skills in data engineering, schema management, and building efficient pipelines with modern developer tools.