
Python Tutorial : Feature engineering and overfitting

Designing Machine Learning Workflows in Python


Author: DataCamp

Uploaded: April 16, 2020

Views: 415

Description:

Want to learn more? Take the full course at https://learn.datacamp.com/courses/de... at your own pace. More than a video, you'll learn hands-on coding & quickly apply skills to your daily work.

---

Feature engineering uses domain knowledge and common sense to describe an object with numbers. Although adding more features can improve performance, it can also increase the risk of overfitting. In this lesson, you will learn more about this interesting trade-off.

Sometimes, the raw data cannot fit into the form of a table. For example, consider electrocardiogram (or ECG) traces for a number of individuals. Each ECG trace is a time series, possibly of variable length, that cannot fit in one cell of a table.

Instead, in the dataset shown here, experts extracted over 250 one-dimensional numerical summaries from each ECG. These range from simple summaries like heart rate to very complex properties of the signal with weird names like T-wave-amp, all of which can be useful in detecting a medical condition known as arrhythmia.

Even if the data are tabular, some of the columns might be non-numeric. Here is an example from the credit scoring dataset: the purpose of the loan takes values such as "buy a new car", "education" or "retraining". LabelEncoder will map these values onto a range of numbers.

But the classifier is then confused. It thinks that the categories have a natural ordering. For example, a decision tree might try to split the range in two. If it splits at 4, it is putting loans for business together with loans for a microwave oven!
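
For illustration, here is a minimal sketch of that ordering problem, using invented loan-purpose values rather than the actual course data (the column name "purpose" is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical slice of the credit data; column name "purpose" is assumed.
credit = pd.DataFrame({"purpose": ["buy a new car", "education", "retraining",
                                   "business", "buy a microwave oven"]})

# LabelEncoder maps each category onto an integer 0..K-1 (alphabetical order here).
le = LabelEncoder()
credit["purpose_encoded"] = le.fit_transform(credit["purpose"])
print(credit)

# The integers suggest an ordering that does not exist: a tree splitting on
# purpose_encoded <= 2 lumps "business" and "buy a microwave oven" together.
```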

A different approach is to use one-hot encoding, implemented by the pandas get_dummies() function. This creates one new dummy variable for each category, taking the value 1 for each example that falls in that category and 0 otherwise. You can see the first row of the data on the left, printed vertically for readability. No artificial ordering is introduced.
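
A quick sketch of one-hot encoding with the same invented values:

```python
import pandas as pd

credit = pd.DataFrame({"purpose": ["buy a new car", "education", "retraining"]})

# One binary indicator column per category; no ordering is implied.
dummies = pd.get_dummies(credit["purpose"], prefix="purpose")

# First row printed vertically, as on the slide.
print(dummies.iloc[0])
```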

How about capturing semantic similarity? Notice that similar categories share keywords: for example, all consumer loans feature the keyword "buy". You can count common keywords using CountVectorizer from the feature_extraction module.

First, replace underscores with spaces for easier tokenization.

Then, apply the encoder using its .fit_transform() method.

Finally, convert the resulting matrix to a DataFrame, naming the columns using the .get_feature_names() method of the CountVectorizer object.
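
Put together, the three steps might look like this (a sketch with invented loan-purpose strings; note that recent scikit-learn releases rename the last method to .get_feature_names_out()):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

credit = pd.DataFrame({"purpose": ["buy_a_new_car", "buy_a_microwave_oven", "education"]})

# Step 1: replace underscores with spaces for easier tokenization.
purpose = credit["purpose"].str.replace("_", " ")

# Step 2: apply the encoder; each row becomes a vector of keyword counts.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(purpose)

# Step 3: convert the sparse matrix to a DataFrame named after the keywords.
keywords = pd.DataFrame(counts.toarray(),
                        columns=vectorizer.get_feature_names_out())
print(keywords)   # loans sharing the keyword "buy" now share a nonzero column
```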

Note that as we improve our feature engineering pipeline, the dimension of our DataFrame increases! The question arises: how many features is too many?

Well, with more columns, the algorithm has more opportunity to mistake coincidental patterns for real signal. We can test this by adding columns to the data containing purely random numbers totally unrelated to the class. As we add more columns on the horizontal axis, overfitting kicks in! Accuracy improves in-sample but deteriorates out-of-sample.
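
One way to reproduce this effect is the following sketch on synthetic data (the dataset, classifier, and column counts are illustrative assumptions, not the course's exact experiment):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Two genuinely informative features and a binary label.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

for n_fake in [0, 10, 50, 200]:
    # Append columns of pure noise, totally unrelated to the class.
    X_aug = np.hstack([X, rng.normal(size=(200, n_fake))])
    X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=1)
    clf = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
    print(n_fake,
          accuracy_score(y_tr, clf.predict(X_tr)),   # in-sample accuracy
          accuracy_score(y_te, clf.predict(X_te)))   # out-of-sample accuracy
```

In-sample accuracy stays near perfect, while out-of-sample accuracy tends to drift downward as more noise columns are added.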

A popular solution is to add features freely, and then select the "best" ones using some feature selection technique.

Let's try the trick from the previous slide, and augment the credit scoring dataset with 100 fake variables.

Then, we use the SelectKBest algorithm from the feature_selection module to select the 20 highest-scoring columns. We use the chi2 scoring method. The feature selector has a .fit() method to fit it to the data, and a .get_support() method that returns a mask (or, with indices=True, the indices) of the selected columns.

Thankfully, only a handful of fake columns remain in the selected features.
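
A self-contained sketch of this selection step on synthetic, non-negative data (chi2 requires non-negative features; all column names and counts below are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)

# Stand-in for the real data: 30 informative, non-negative columns whose values
# shift with the class label.
y = rng.integers(0, 2, size=300)
X_real = pd.DataFrame(rng.uniform(size=(300, 30)) + 0.5 * y[:, None],
                      columns=[f"real_{i}" for i in range(30)])

# Augment with 100 fake variables of pure noise.
X_fake = pd.DataFrame(rng.uniform(size=(300, 100)),
                      columns=[f"fake_{i}" for i in range(100)])
X = pd.concat([X_real, X_fake], axis=1)

# Keep the 20 columns with the highest chi2 score.
selector = SelectKBest(chi2, k=20).fit(X, y)
selected = X.columns[selector.get_support()]

print([c for c in selected if c.startswith("fake_")])   # ideally few or none survive
```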

So remember this: every decision you make in your pipeline might affect other aspects of it, and in particular the risk of overfitting. The following exercises confirm this insight.


#DataCamp #PythonTutorial #DesigningMachineLearningWorkflowsinPython

