Gianluca Campanella: The unreasonable effectiveness of feature hashing | PyData London 2019
Author: PyData
Uploaded: 2019-07-18
Views: 4844
Feature hashing is a computationally efficient pre-processing technique for sparse, high-dimensional features. Starting from an overview of the method, this talk covers: the impact of hash functions, hash size and collisions on statistical performance; three libraries for model training with feature hashing; hash reversibility and its implications for model interpretability.
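As a rough illustration of the method the talk covers, here is a minimal sketch of feature hashing with a sign function. It is not the speaker's code: the helper names (`_stable_hash`, `hash_features`), the salts, and the choice of MD5 are assumptions made for this example. MD5 is used only because it is deterministic across runs, whereas Python's built-in `hash()` for strings is randomized per process; production libraries typically use a faster non-cryptographic hash such as MurmurHash3.

```python
import hashlib


def _stable_hash(token: str, salt: str) -> int:
    """Deterministic 64-bit hash of a Unicode string (hypothetical helper)."""
    # Strings must be encoded to bytes before hashing; UTF-8 is assumed here.
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def hash_features(tokens, n_buckets=16):
    """Project a bag of tokens into a fixed-size vector of length n_buckets."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # The bucket index comes from one hash, reduced modulo the hash size.
        idx = _stable_hash(tok, "index:") % n_buckets
        # A second, independently salted hash picks the sign (+1 or -1),
        # so colliding tokens tend to cancel rather than always add up.
        sign = 1.0 if _stable_hash(tok, "sign:") % 2 == 0 else -1.0
        vec[idx] += sign
    return vec


if __name__ == "__main__":
    print(hash_features(["london", "pydata", "hashing", "pydata"], n_buckets=8))
```

A smaller `n_buckets` saves memory but makes collisions more likely, which is the trade-off between hash size and statistical performance that the talk discusses.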
www.pydata.org
PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
00:00 - Welcome!
1:00 - Introduction
1:53 - Background - Supervised ML
3:29 - Categorical Features
4:00 - One-hot encoding
4:36 - Bag of words
5:20 - High-dimensional feature space
9:20 - Feature Hashing
11:26 - Hash function
12:24 - Feature Hashing in Python
13:27 - Hashing of Unicode Strings
15:05 - Projection
15:06 - Collisions
17:58 - Sign Functions
20:26 - Feature Hashing - Example
26:40 - Feature Hashing - Use Case
30:10 - Library Support
31:27 - Recap
32:20 - Q&A
S/o to https://github.com/Cyborg-vs-Droids for the video timestamps!
Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...