Data pre-processing for Machine Learning in Python

How to transform a dataset for a machine learning model

Language: English

Rating: 4.7/5 (15 ratings), 702 students

Instructor(s): Gianluca Malato

Last update: 2021-04-23

What you’ll learn

  • How to fill in missing values in numerical and categorical variables
  • How to encode the categorical variables
  • How to transform the numerical variables
  • How to scale the numerical variables
  • Principal Component Analysis and how to use it
  • How to apply oversampling using SMOTE
  • How to use several useful objects in the scikit-learn library



Requirements

  • Basic knowledge of the Python programming language



Description

In this course, we are going to focus on pre-processing techniques for machine learning.

Pre-processing is the set of manipulations that transform a raw dataset so that it can be used by a machine learning model. It makes the data suitable for the model, reduces dimensionality, helps identify the relevant features, and improves model performance. It’s the most important part of a machine learning pipeline and it can strongly affect the success of a project. In fact, if we don’t feed a machine learning model data in the form it expects, it may not work at all.
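As a small illustration of that last point, here is a minimal sketch (not taken from the course material) in which a scikit-learn estimator refuses raw data containing missing values and then trains once the gaps are filled. The toy array, the choice of `LogisticRegression`, and the mean-imputation strategy are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy data (made up for this sketch): two numerical features with gaps.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, 5.0],
              [6.0, np.nan]])
y = np.array([0, 0, 1, 1])

# LogisticRegression does not accept NaN values, so fitting on the
# raw array raises a ValueError.
try:
    LogisticRegression().fit(X, y)
    fit_on_raw_failed = False
except ValueError:
    fit_on_raw_failed = True

# Fill each missing value with the mean of its column, then fit again.
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
model = LogisticRegression().fit(X_filled, y)
```

Mean imputation is only one option; the right strategy depends on the dataset.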

Aspiring data scientists sometimes start by studying neural networks and other complex models and neglect to learn how to prepare a dataset for their algorithms. As a result, they struggle to build good models and realize only at the end that good pre-processing would have saved them a lot of time and improved the performance of their algorithms. Mastering pre-processing techniques is therefore a very important skill. That’s why I have created an entire course that focuses only on data pre-processing.

With this course, you are going to learn:

  1. Data cleaning

  2. Encoding of the categorical variables

  3. Transformation of the numerical features

  4. Scikit-learn Pipeline and ColumnTransformer objects

  5. Scaling of the numerical features

  6. Principal Component Analysis

  7. Filter-based feature selection

  8. Oversampling using SMOTE

All the examples use the Python programming language and its powerful scikit-learn library. The environment used is Jupyter, which is a standard in the data science industry. All the sections of this course end with practical exercises, and all the Jupyter notebooks are downloadable.
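To give a rough idea of how several of these topics fit together in scikit-learn, here is a minimal sketch that chains imputation, one-hot encoding, scaling, and PCA using `Pipeline` and `ColumnTransformer`. It is not the course's own code: the toy DataFrame, its column names, and the chosen strategies are assumptions made for illustration. SMOTE is left out because it is provided by the separate imbalanced-learn package and not by scikit-learn itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy dataset: two numerical columns and one categorical column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 38.0, 29.0],
    "income": [30.0, 42.0, np.nan, 88.0, 61.0, 35.0],
    "city": pd.Series(["Rome", "Milan", np.nan, "Rome", "Turin", "Milan"],
                      dtype=object),
})

# Numerical branch: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column to the right branch based on its dtype.
# sparse_threshold=0.0 forces a dense output for the PCA step.
preprocess = ColumnTransformer(
    [
        ("num", numeric, make_column_selector(dtype_include=np.number)),
        ("cat", categorical, make_column_selector(dtype_exclude=np.number)),
    ],
    sparse_threshold=0.0,
)

# Full pipeline: pre-processing followed by dimensionality reduction.
full = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),
])

X_reduced = full.fit_transform(df)
```

Wrapping every step in a single `Pipeline` means the same transformations, fitted on the training data, can later be applied to new data with one `transform` call.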


Who this course is for

  • Python developers
  • Aspiring data scientists
  • People interested in machine learning and artificial intelligence


Course content

  • Introduction
    • Introduction to the course
    • Numerical and categorical variables
    • The dataset
    • Required Python packages
    • Jupyter notebooks
  • Data cleaning
    • Introduction to data cleaning
    • Selecting numerical and categorical variables
    • Cleaning the numerical features
    • Cleaning the categorical features
    • KNN blank filling
    • ColumnTransformer and make_column_selector
    • Exercises
  • Encoding of the categorical features
    • Introduction to the encoding of categorical variables
    • One-hot encoding
    • Ordinal encoding
    • Label encoding of the target variable
    • Exercise
  • Transformations of the numerical features
    • Introduction to transformations
    • Power Transformation
    • Binning
    • Binarizing
    • Applying an arbitrary transformation
    • Exercise
    • About power transformations
  • Pipelines
    • Define a transformation pipeline
    • Pipelines and ColumnTransformer together
    • Exercises
  • Scaling
    • Introduction to scaling
    • Normalization, Standardization, Robust scaling
    • Exercise
  • Principal Component Analysis
    • Introduction to PCA
    • How to perform PCA
    • Exercise
  • Filter-based feature selection
    • Introduction to feature selection
    • Numerical features, numerical target
    • Numerical features, categorical target
    • Categorical features, numerical target
    • Categorical features, categorical target
    • Feature importance according to a model
    • A comment on mutual information
    • A comment on feature selection with categorical variables
    • Exercises
  • A complete pipeline
    • An example of a complete pipeline
  • Oversampling
    • Introduction to SMOTE
    • How to perform SMOTE
    • Exercise
  • General guidelines
    • Practical suggestions

