
Python Tutorial : Feature engineering and overfitting

Designing Machine Learning Workflows in Python


Author: DataCamp

Uploaded: April 16, 2020

Views: 415

Description:

Want to learn more? Take the full course at https://learn.datacamp.com/courses/de... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---

Feature engineering uses domain knowledge and common sense to describe an object with numbers. Although adding more features can improve performance, it can also increase the risk of overfitting. In this lesson, you will learn more about this interesting trade-off.

Sometimes, the raw data cannot fit into the form of a table. For example, consider electrocardiogram (or ECG) traces for a number of individuals. Each ECG trace is a time series, possibly of variable length, that cannot fit in one cell of a table.

Instead, in the dataset shown here, experts extracted over 250 one-dimensional numerical summaries from each ECG. These range from simple summaries like heart rate to very complex properties of the signal with weird names like T-wave-amp, all of which can be useful in detecting a medical condition known as arrhythmia.

Even if the data are tabular, some of the columns might be non-numeric. Here is an example from the credit scoring dataset: the purpose of the loan takes values such as "buy a new car", "education" or "retraining". LabelEncoder will map these values onto a range of numbers.

But the classifier is then confused. It thinks that the categories have a natural ordering. For example, a decision tree might try to split the range in two. If it splits at 4, it is putting loans for business together with loans for a microwave oven!
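
For illustration, here is a minimal sketch of that ordering problem, using invented loan-purpose values rather than the actual course data (the column name "purpose" is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical slice of the credit data; column name "purpose" is assumed.
credit = pd.DataFrame({"purpose": ["buy a new car", "education", "retraining",
                                   "business", "buy a microwave oven"]})

# LabelEncoder maps each category onto an integer 0..K-1 (alphabetical order here).
le = LabelEncoder()
credit["purpose_encoded"] = le.fit_transform(credit["purpose"])
print(credit)

# The integers suggest an ordering that does not exist: a tree splitting on
# purpose_encoded <= 2 lumps "business" and "buy a microwave oven" together.
```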

A different approach is to use one-hot encoding, implemented by the pandas get_dummies() function. This creates one new dummy variable for each category, taking the value 1 for each example that falls in that category and 0 otherwise. You can see the first row of the data on the left, printed vertically for readability. No artificial ordering is introduced.
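
A quick sketch of one-hot encoding with the same invented values:

```python
import pandas as pd

credit = pd.DataFrame({"purpose": ["buy a new car", "education", "retraining"]})

# One binary indicator column per category; no ordering is implied.
dummies = pd.get_dummies(credit["purpose"], prefix="purpose")

# First row printed vertically, as on the slide.
print(dummies.iloc[0])
```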

How about capturing semantic similarity? Notice that similar categories share keywords: for example, all consumer loans feature the keyword "buy". You can count common keywords using CountVectorizer from the feature_extraction module.

First, replace underscores with spaces for easier tokenization.

Then, apply the encoder using its .fit_transform() method.

Finally, convert the resulting matrix to a DataFrame, naming the columns using the .get_feature_names() method of the CountVectorizer object.
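
Put together, the three steps might look like this (a sketch with invented loan-purpose strings; note that recent scikit-learn releases rename the last method to .get_feature_names_out()):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

credit = pd.DataFrame({"purpose": ["buy_a_new_car", "buy_a_microwave_oven", "education"]})

# Step 1: replace underscores with spaces for easier tokenization.
purpose = credit["purpose"].str.replace("_", " ")

# Step 2: apply the encoder; each row becomes a vector of keyword counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(purpose)

# Step 3: convert the sparse matrix to a DataFrame named after the keywords.
keywords = pd.DataFrame(counts.toarray(),
                        columns=vectorizer.get_feature_names_out())
print(keywords)   # loans sharing the keyword "buy" now share a nonzero column
```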

Note that as we improve our feature engineering pipeline, the dimension of our DataFrame increases! The question arises: how many features is too many?

Well, with more columns, the algorithm has more opportunity to mistake coincidental patterns for real signal. We can test this by adding columns to the data containing purely random numbers totally unrelated to the class. As we add more columns on the horizontal axis, overfitting kicks in! Accuracy improves in-sample but deteriorates out-of-sample.
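
One way to reproduce this effect is the following sketch on synthetic data (the dataset, classifier, and column counts are illustrative assumptions, not the course's exact experiment):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Two genuinely informative features and a binary label.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for n_fake in [0, 10, 50, 200]:
    # Append columns of pure noise, totally unrelated to the class.
    X_aug = np.hstack([X, rng.normal(size=(200, n_fake))])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=1)
    clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    print(n_fake,
          accuracy_score(y_tr, clf.predict(X_tr)),   # in-sample accuracy
          accuracy_score(y_te, clf.predict(X_te)))   # out-of-sample accuracy
```

In-sample accuracy stays near perfect, while out-of-sample accuracy tends to drift downward as more noise columns are added.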

A popular solution is to add features freely, and then select the "best" ones using some feature selection technique.

Let's try the trick from the previous slide, and augment the credit scoring dataset with 100 fake variables.

Then, we use the SelectKBest algorithm from the feature_selection module to select the 20 highest-scoring columns. We use the chi2 scoring method. The feature selector has a .fit() method to fit it to the data, and a .get_support() method that returns a mask (or, with indices=True, the indices) of the selected columns.

Thankfully, only a handful of fake columns remain in the selected features.
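
A self-contained sketch of this selection step on synthetic, non-negative data (chi2 requires non-negative features; all column names and counts below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Stand-in for the real data: 30 informative, non-negative columns whose values
# shift with the class label.
y = rng.integers(0, 2, size=300)
X_real = pd.DataFrame(rng.uniform(size=(300, 30)) + 0.5 * y[:, None],
                      columns=[f"real_{i}" for i in range(30)])

# Augment with 100 fake variables of pure noise.
X_fake = pd.DataFrame(rng.uniform(size=(300, 100)),
                      columns=[f"fake_{i}" for i in range(100)])
X = pd.concat([X_real, X_fake], axis=1)

# Keep the 20 columns with the highest chi2 score.
selector = SelectKBest(chi2, k=20).fit(X, y)
selected = X.columns[selector.get_support()]

print([c for c in selected if c.startswith("fake_")])   # ideally few or none survive
```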

So remember this: every decision you make in your pipeline might affect other aspects of it, and in particular the risk of overfitting. The following exercises confirm this insight.


#DataCamp #PythonTutorial #DesigningMachineLearningWorkflowsinPython

