Soumil Shah
As a Data Engineer and an expert in Apache Hudi and Iceberg, I navigate the vast landscape of AWS Big Data and data lakes with a focus on building scalable data ingestion pipelines.
I developed the "LakeBoost" framework, integrating Apache Hudi with AWS Glue ETL to enhance efficiency and significantly reduce costs for large-scale data operations. With strong skills in Spark and data platforms, I design systems that support robust, high-performance data workflows.
Beyond my engineering work, I am a dedicated content creator, running a YouTube channel with 44,000 subscribers and over 1,600 videos on big data technologies. My passion for data engineering extends through both technical innovation and educational outreach, making complex concepts accessible to a global audience.

Learn How to Use Trino Iceberg Compaction for S3 Tables

Parallel Iceberg Table Compaction with AWS Step Functions and Athena

Partition-Aware Compaction: A Fail-Safe Strategy for Streaming Data Lakes with Apache Iceberg

Leveraging Spark Connect with S3 Tables (Managed Iceberg): A Comprehensive Guide

Getting started with LakeFS and Apache Iceberg Running Locally

Iceberg Table Metrics Collector: Python tool to collect and analyze Apache Iceberg table metrics

AWS Summit Mumbai - June 19, 2025

Learn How to Use Kinesis Data Firehose to Write Data into S3 Tables | Step-by-Step Guide

Let's Migrate Iceberg Tables from One Catalog to Another | Simple Hands-on Lab

Migrating Hive Tables to Apache Iceberg on AWS: A Metadata-Only Approach

Multi-Tenant Data Ingestion with Apache Iceberg Views: A Spark-Powered Single Table Design

Querying S3 Tables with StarRocks Step by Step Guide

S3 Tables give you the flexibility to run table maintenance your way—via Spark when you need control

Building Per-Tenant Iceberg Tables Using Spark | Hands on Labs #2

Building Per-Tenant Iceberg Tables Using Spark | Hands on Labs #1

Single Table Design vs. Multiple Table Design: A Comparison for Tenant-Based Data Processing #1

Multiple DuckDB Arrow Flight Servers and Clients Using Consistent Hashing to Write Data to Nodes

Concurrent Reader & Writer in DuckDB with Arrow Flight Server | High-Performance Data Analytics Service

Let’s Talk S3 Tables with Anupriti (Sr. Product Manager at AWS) & Soumil (Zeta Global)

Query S3 Table Buckets in Snowflake via AWS Glue IRCC Endpoints!

DuckDB Now Supports Querying New S3 Table Buckets via Glue IRCC Endpoints

Incremental File Processing from S3 with Spark: Avoid LastModified Timestamp Pitfalls!

Exploring SmallPond Fork for S3 by Mike-Luabase | Lightweight Data Processing with DuckDB

Real Time Streaming Pipeline from Kafka to New S3 Table Buckets with Checkpointing | Hands on Labs

Learn How to Perform Dual Write: S3 Table Buckets and Unmanaged Iceberg on EMR EC2, Sync with Glue

Learn How to Query S3 Table Buckets (Managed Iceberg) with Trino | Hands on Labs

Optimize EMR on EC2 with AWS Step Functions | Control Job Concurrency Using Map Iterator

How to Resolve AccessDeniedException: Insufficient Lake Formation Permission for S3 Table Bucket

Read & Write Data into New S3 Table Buckets with PyIceberg Using Glue REST Endpoint – No Spark Needed

S3 Incremental File Processing with Pessimistic Locking using S3 Lock