Mike Mull: The Art and Science of Data Matching

Author: PyData

Uploaded: Dec 4, 2015

Views: 16,390

Description:

PyData NYC 2015

Data matching is the process of finding records in one or more data sources that refer to the same item. Variants of this process include de-duplication (one data source), record linkage (two data sources), and entity resolution (2+ data sources). This talk will discuss Python tools and libraries that can be applied to data matching, as well as various tricks of the trade.

Data matching enriches existing data sources, leading to new data products or clean input for further analysis. Correct matching is also a crucial aspect of information quality for enterprise data. Although there are many commercial tools for data matching, the Python ecosystem has components that make it relatively simple to build domain-specific matching applications or to incorporate matching into products and services.

Data matching uses basic computer science, NLP, statistics, and machine learning, combined with a variety of hacks to deal with notoriously messy data such as human names and street addresses. This talk will work through a test case, covering the following specific areas:

Using pandas as a framework for pre-processing and merging data
Profiling data to assess how difficult the matching process is likely to be and how successful it might be
Similarity metrics for approximate string matching
Techniques for parsing and matching human names
Techniques for handling address data, including geo-coding
Using blocking or indexing to reduce the number of comparisons
Probabilistic methods for optimal matching, such as the Fellegi-Sunter method
Using scikit-learn classifiers for record linkage
A demonstration of the open-source dedupe tool
Information quality metrics

00:00 Welcome!
00:10 Help us add time stamps or captions to this video! See the description for details.

Want to help add timestamps to our YouTube videos to help with discoverability? Find out more here: https://github.com/numfocus/YouTubeVi...
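Two of the areas listed in the description, approximate string matching and blocking, can be illustrated with a minimal sketch. This is not code from the talk: it uses only the Python standard library, with `difflib.SequenceMatcher` standing in for whatever similarity metrics the talk covers. The sample records, the ZIP-code blocking key, the 0.85 threshold, and the helper names (`normalize`, `similarity`, `candidate_pairs`, `find_matches`) are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def candidate_pairs(records, block_key):
    """Yield only pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def find_matches(records, block_key, threshold=0.85):
    """Return (id_a, id_b, score) for candidate pairs at or above the threshold."""
    matches = []
    for a, b in candidate_pairs(records, block_key):
        score = similarity(a["name"], b["name"])
        if score >= threshold:
            matches.append((a["id"], b["id"], round(score, 2)))
    return matches


# Made-up records for illustration only (hypothetical data).
records = [
    {"id": 1, "name": "Jonathan Smith", "zip": "10001"},
    {"id": 2, "name": "Jonathon Smith", "zip": "10001"},
    {"id": 3, "name": "Maria Garcia", "zip": "10001"},
    {"id": 4, "name": "Jonathan Smith", "zip": "94105"},
]

# Block on ZIP code (an assumed key), then compare names within each block.
matches = find_matches(records, block_key=lambda rec: rec["zip"])
print(matches)
```

With these records, blocking cuts the six possible pairs down to three comparisons, and only records 1 and 2 clear the threshold. Record 4 has the same name as record 1 but is never compared with it because the ZIP codes differ, which shows the trade-off that blocking introduces: fewer comparisons at the risk of missing true matches that fall into different blocks.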
