95% reduction in Apache Spark processing time with correct usage of repartition() function

Author: Rajesh Jakhotia

Uploaded: 2020-12-16

Views: 10163

Description:

Hello Friends,

In this video I demonstrate how to reduce processing time by more than 95% through correct use of the repartition() function in Apache Spark.

If we repartition() the data before running join or aggregation queries, the amount of shuffle read / write is reduced, and as a result the queries run much faster.
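The effect on shuffle volume can be pictured with a small toy model in plain Python. This is only a sketch and not Spark code: the helper names (`target_partition`, `round_robin`, `partition_by_key`, `records_to_shuffle`), the CRC32-based partitioner, and the synthetic records are all assumptions made for illustration. In PySpark the corresponding call would be along the lines of `df.repartition(8, "symbol")`, where the column name is hypothetical. Note that repartition() is itself a shuffle, so the saving shows up in the joins and aggregations that run afterwards on the same key.

```python
import zlib
from collections import defaultdict


def target_partition(key, num_partitions):
    """Deterministic hash partitioner (a toy stand-in, not Spark's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def round_robin(records, num_partitions):
    """Spread records evenly while ignoring their keys."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts


def partition_by_key(records, num_partitions):
    """Place every record in the partition its key hashes to."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[target_partition(key, num_partitions)].append((key, value))
    return parts


def records_to_shuffle(parts):
    """Count records that a group-by-key step would still have to move."""
    n = len(parts)
    return sum(
        1
        for idx, part in enumerate(parts)
        for key, _ in part
        if target_partition(key, n) != idx
    )


def sum_by_key(parts):
    """Aggregate values per key across all partitions."""
    totals = defaultdict(int)
    for part in parts:
        for key, value in part:
            totals[key] += value
    return dict(totals)


# Synthetic data: 10,000 records spread over 50 hypothetical keys.
records = [(f"sym{i % 50}", i) for i in range(10_000)]
unkeyed = round_robin(records, 8)
keyed = partition_by_key(records, 8)

print("records to move, round-robin layout:", records_to_shuffle(unkeyed))
print("records to move, key-partitioned layout:", records_to_shuffle(keyed))
```

Both layouts produce the same totals per key; the difference is only in how many records have to cross partition boundaries before each key can be aggregated.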

Also, by increasing the number of partitions, we make each aggregation task smaller and more manageable for the processor, and thereby reduce the processing time.
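As a rough back-of-the-envelope sketch of why the partition count matters, the arithmetic below estimates how many rows each task handles and how many rounds ("waves") of tasks the cluster needs. The function name `task_plan`, the row count, and the core count are hypothetical values chosen for illustration, not figures from the video, and the model assumes evenly sized partitions.

```python
import math


def task_plan(total_rows, num_partitions, total_cores):
    """Estimate rows per task and scheduling waves, assuming even partitions."""
    rows_per_task = math.ceil(total_rows / num_partitions)
    waves = math.ceil(num_partitions / total_cores)
    return rows_per_task, waves


# Hypothetical workload: 10 million rows on a cluster with 12 cores in total.
for parts in (4, 48, 200):
    rows, waves = task_plan(10_000_000, parts, 12)
    print(f"{parts:>4} partitions: about {rows:,} rows per task, {waves} wave(s)")
```

With very few partitions some cores sit idle and each task is large; with more partitions each task is smaller, at the cost of more scheduling rounds, so the best count depends on the data and the cluster.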

The data file used in the demo can be downloaded from our website, https://k2analytics.co.in, under the Resources tab, in the Complimentary Resources section. If you are running the code on a standalone cluster, you may have to change the HDFS file path to a local file system path.

Thanks.
