Wrangle PDFs with Custom User Defined Functions (UDF) in Daft
Author: Daft Engine
Uploaded: 2025-07-30
Views: 265
Wrangle PDFs from start to finish with custom User Defined Functions (UDFs) in Daft. Software Engineer Malcolm Greaves walks you through every step of a PDF processing pipeline. By the end of the video, you will have a fully functional pipeline that:
• Starts with downloading PDFs from an S3 bucket
• Extracts text boxes using OCR or by reading the file format
• Performs spatial layout analysis to group text boxes into lines or paragraphs
• Computes embeddings using a lightweight LLM, running locally
• Saves everything to Parquet
Build a single PDF processing pipeline and keep complete control over every step. No more stitching together fragmented tools for these types of workloads.
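The spatial layout analysis step in the list above can be illustrated with a small standalone sketch. This is not the code from the video or the notebook: `TextBox`, `group_into_lines`, and the `min_overlap` threshold are hypothetical names chosen for illustration, and the approach (grouping boxes whose vertical extents overlap) is only one simple way to turn text boxes into lines. It assumes coordinates where y grows downward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text box: a string plus its bounding box (y grows downward)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_into_lines(boxes, min_overlap=0.5):
    """Group text boxes into lines by vertical overlap.

    A box joins an existing line when at least `min_overlap` of its height
    overlaps that line's vertical extent; otherwise it starts a new line.
    Returns one string per line, with boxes ordered left to right.
    """
    lines = []  # each entry is the list of boxes assigned to one line
    for box in sorted(boxes, key=lambda b: (b.y0, b.x0)):
        height = box.y1 - box.y0
        placed = False
        for line in lines:
            top = min(b.y0 for b in line)
            bottom = max(b.y1 for b in line)
            overlap = min(bottom, box.y1) - max(top, box.y0)
            if height > 0 and overlap / height >= min_overlap:
                line.append(box)
                placed = True
                break
        if not placed:
            lines.append([box])
    return [
        " ".join(b.text for b in sorted(line, key=lambda b: b.x0))
        for line in lines
    ]


sample = [
    TextBox("world", 60, 10, 100, 20),
    TextBox("Hello", 10, 11, 50, 21),
    TextBox("Second", 10, 30, 60, 40),
    TextBox("line", 70, 31, 100, 41),
]
print(group_into_lines(sample))  # ['Hello world', 'Second line']
```

In a Daft pipeline, a function like this would typically be wrapped in a UDF and applied to the column holding each document's extracted text boxes; the exact wiring is covered in the video and notebook rather than here.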
Notebook to follow along: https://docs.daft.ai/en/stable/resour...
Try it yourself and get started today: pip install daft
🩷 Get to know Daft
‣ Learn more about Daft: https://www.daft.ai
‣ Join our Distributed Data Slack Community: https://www.daft.ai/slack
‣ Star Daft on GitHub: https://github.com/Eventual-Inc/Daft
‣ Subscribe to Daft Engineering Blog: https://www.daft.ai/blog
📲 Follow us
‣ LinkedIn: daftengine
‣ X/Twitter: daftengine
#daft #distributed #multimodal #data #dataengineering
00:00 Introduction
00:35 Download Daft & Dependencies
00:58 Pull S3 URLs of PDFs
01:53 Download PDFs from S3
02:38 Use Pydantic classes
04:39 Generating Daft Datatypes from Pydantic
05:10 Load & Parse PDFs Using UDFs
07:53 Perform OCR and Extract Text on First PDF
08:49 Document Processing
11:10 Text Embedding with SentenceTransformer
12:00 Entire End-to-End Pipeline
12:52 Step 1: Enumerate S3 Keys
13:06 Step 2: Download PDFs
13:12 Step 3: Load PDFs, Maybe Apply OCR
13:46 Explaining Daft UDF Application
14:31 Step 4: Text Box Processing
16:27 Explaining Structure Access Expressions
18:09 Step 5: Text Embeddings
19:07 Execute and Write to Parquet