Git for Data Lakes—How lakeFS Scales Data Versioning to Billions of Objects

Author: Databricks

Uploaded: 2022-07-19

Views: 4059

Description:

Modern data lake architectures rely on object storage as the single source of truth. We use object stores to hold a growing amount of data that is increasingly complex and interconnected. While scalable, these object stores provide few safety guarantees: they lack semantics for atomicity, rollbacks, and reproducibility, all of which are needed for data quality and resilience.

lakeFS, an open source data version control system designed for data lakes, solves these problems by introducing concepts borrowed from Git: branching, committing, merging, and rolling back changes to data.

In this talk you'll learn about the challenges of using object storage for data lakes and how lakeFS helps you solve them.

By the end of the session you’ll understand how lakeFS scales its Git-like data model to petabytes of data, across billions of objects, without affecting throughput or performance. We will also demo creating a branch, writing data to it using Spark, and merging it, all on a billion-object repository.
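
To make the Git-like operations named above (branch, commit, merge, roll back) more concrete, here is a toy in-memory sketch in Python. It is not the lakeFS API and does not reflect how lakeFS is implemented: the class `ToyRepo` and every method name are hypothetical, and the sketch assumes that a commit is just an immutable snapshot mapping object paths to content addresses. Under that assumption, creating a branch copies a pointer rather than any data.

```python
import hashlib


class ToyRepo:
    """Hypothetical toy model of Git-like versioning over an object store."""

    def __init__(self):
        self.blobs = {}                 # content address -> bytes (stands in for object storage)
        self.commits = {}               # commit id -> (parent id, {path: address})
        self.branches = {"main": None}  # branch name -> head commit id
        self.staged = {"main": {}}      # branch name -> uncommitted {path: address}

    def snapshot(self, commit_id):
        # Return a copy of the path -> address mapping for a commit.
        return dict(self.commits[commit_id][1]) if commit_id else {}

    def branch(self, name, source):
        # Branching copies a pointer to a commit; no object data is copied.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def put(self, branch, path, data):
        # Writes are staged on one branch and invisible to other branches.
        address = hashlib.sha256(data).hexdigest()
        self.blobs[address] = data
        self.staged[branch][path] = address

    def commit(self, branch):
        # A commit freezes the parent snapshot plus the staged changes.
        parent = self.branches[branch]
        snap = self.snapshot(parent)
        snap.update(self.staged[branch])
        key = repr((parent, sorted(snap.items()))).encode()
        commit_id = hashlib.sha256(key).hexdigest()
        self.commits[commit_id] = (parent, snap)
        self.branches[branch] = commit_id
        self.staged[branch] = {}
        return commit_id

    def merge(self, source, dest):
        # Simplification: the source's entries win; no conflict detection or deletes.
        self.staged[dest] = self.snapshot(self.branches[source])
        return self.commit(dest)

    def rollback(self, branch, commit_id):
        # Rolling back moves the branch pointer to an earlier commit.
        self.branches[branch] = commit_id
        self.staged[branch] = {}

    def read(self, branch, path):
        # Reads see only committed data on the given branch.
        return self.blobs[self.snapshot(self.branches[branch])[path]]
```

In this sketch, a write committed on an `experiment` branch stays invisible on `main` until `merge("experiment", "main")` is called, and `rollback("main", earlier_commit_id)` restores the previous view of the data without rewriting any objects.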
