Recent Parquet Improvements in Apache Spark
Author: Databricks
Uploaded: 2022-07-19
Views: 3228
Apache Parquet is a very popular columnar file format supported by Apache Spark. In a typical Spark job, scanning Parquet files is sometimes one of the most time-consuming steps, as it incurs high CPU and I/O overhead. Optimizing Parquet scan performance is therefore crucial to job latency and cost efficiency.
Spark currently has two Parquet reader implementations: a vectorized one and a non-vectorized one. The former was implemented from scratch and offers much better performance than the latter. However, it does not yet support complex types (e.g., array, map, struct) and falls back to the non-vectorized reader when encountering them. In addition to the reader implementation, predicate pushdown is also crucial to Parquet scan performance, as it enables Spark to skip data that does not satisfy the predicates before the scan. Currently, Spark constructs the predicates itself and relies on Parquet-MR to do the heavy lifting: filtering based on information such as statistics, dictionaries, bloom filters, and column indexes.
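To make the pushdown behavior concrete, here is a minimal spark-shell-style sketch (illustrative, not from the talk). The /tmp/events path and the toy rows are hypothetical; spark.sql.parquet.enableVectorizedReader and spark.sql.parquet.filterPushdown are real Spark configs that default to true and are set explicitly here only for clarity.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parquet-pushdown-demo")
  .master("local[*]")
  // Both configs default to true in recent Spark releases; set explicitly for clarity.
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .config("spark.sql.parquet.filterPushdown", "true")
  .getOrCreate()
import spark.implicits._

// Hypothetical path and toy rows, used only for this illustration.
Seq((1L, "click"), (2L, "view"), (3L, "click"))
  .toDF("id", "event")
  .write.mode("overwrite").parquet("/tmp/events")

// Spark constructs the predicate and hands it to Parquet-MR, which can
// skip row groups and pages using statistics and column indexes.
val clicks = spark.read.parquet("/tmp/events").where($"id" > 1L)

// The FileScan Parquet node in the physical plan lists the pushed predicates,
// e.g. PushedFilters: [IsNotNull(id), GreaterThan(id,1)].
clicks.explain()
clicks.show()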
This talk will go through two recent improvements for Parquet scan performance: 1) vectorized read support for complex types, which allows Spark to achieve 10x+ improvement when reading Parquet data of complex types, and 2) Parquet column index support, which enables Spark to leverage Parquet column index feature during predicate pushdown. Last but not least, Chao go over some future work items that can further enhance Parquet read performance.
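As a sketch of the first improvement, the snippet below (again illustrative, not from the talk) reads complex-typed data on the vectorized path. It assumes Spark 3.3+, where the spark.sql.parquet.enableNestedColumnVectorizedReader flag controls vectorized reads of nested columns; the /tmp/complex path and the sample rows are hypothetical.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("nested-vectorized-demo")
  .master("local[*]")
  // Spark 3.3+: enable the vectorized reader for array/map/struct columns
  // instead of falling back to the row-based parquet-mr reader.
  .config("spark.sql.parquet.enableNestedColumnVectorizedReader", "true")
  .getOrCreate()
import spark.implicits._

// Hypothetical path and toy rows, used only for this illustration.
Seq((1, Seq("a", "b"), Map("k" -> 1)), (2, Seq("c"), Map("k" -> 2)))
  .toDF("id", "tags", "attrs")
  .write.mode("overwrite").parquet("/tmp/complex")

// With the flag on, this scan of nested columns stays on the vectorized path.
spark.read.parquet("/tmp/complex").select($"tags", $"attrs").show()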
Connect with us:
Website: https://databricks.com
Facebook: / databricksinc
Twitter: / databricks
LinkedIn: / data. .
Instagram: / databricksinc