Data Preprocessing using Scikit Learn

Dhruv Shah
5 min read · Oct 26, 2021

This article is an educational walkthrough of how to preprocess data using the scikit-learn library.

Data Preprocessing

Why is data preprocessing required?

Data preprocessing is crucial in any data mining process, as it directly impacts a project's success rate. It reduces the complexity of the data under analysis, since data in the real world is rarely clean.

Data is said to be unclean if it has missing attributes or attribute values, contains noise or outliers, or includes duplicate or wrong data. The presence of any of these degrades the quality of the results.

What is data preprocessing?

  • Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
  • Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviors or trends, and is likely to contain many errors.
  • Data preprocessing is a proven method of resolving such issues.

In this blog, we will learn about data preprocessing using scikit-learn: the different preprocessing methods, and when to use each one.

There are many preprocessing methods, but we will focus mainly on the following:

(1) Encoding the Data

(2) Normalization

(3) Standardization

(4) Imputing the Missing Values

(5) Discretization

Data Encoding

Encoding is the conversion of categorical features to numeric values, as machine learning models cannot handle text data directly. The performance of most machine learning algorithms varies with how the categorical data is encoded. Two popular techniques for converting categorical values to numeric values are described below.

  1. Label Encoding
  • Label Encoding refers to converting the labels into numeric form so that they become machine-readable. Machine learning algorithms can then better decide how those labels should be handled.
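A minimal sketch of the idea in plain Python (the size labels are made up for illustration):

```python
# Label encoding simply maps each category to an integer code.
sizes = ["small", "medium", "large", "medium"]            # hypothetical labels
mapping = {label: code for code, label in enumerate(sorted(set(sizes)))}
print(mapping)                      # {'large': 0, 'medium': 1, 'small': 2}
print([mapping[s] for s in sizes])  # [2, 1, 0, 1]
```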

  2. One Hot Encoding

  • Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms as having some sort of hierarchy or order. This ordering issue is addressed by another common approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column that is assigned a 1 or 0 (notation for true/false) value.
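A minimal sketch of the column expansion with pandas (the color column is hypothetical); scikit-learn's OneHotEncoder, used in Step 5 below, achieves the same effect:

```python
import pandas as pd

# One indicator column per category; each row is marked only in its own column.
df = pd.DataFrame({"color": ["red", "green", "blue"]})
print(pd.get_dummies(df["color"]))
```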

Normalization

Normalization is used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms. When multiple attributes have values on different scales, data mining operations can produce poor models, so the attributes are normalized to bring them all onto the same scale.

Standardization

Standardization is another scaling technique, where the values are centered around the mean with a unit standard deviation: each value x is replaced by z = (x - μ) / σ, so the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.

Discretization

Discretization is the process of putting values into buckets so that there are a limited number of possible states. The buckets themselves are treated as ordered and discrete values. You can discretize both numeric and string columns.

Imputation of missing values

Missing data are values that are not recorded in a dataset. A single value can be missing from a single cell, or an entire observation (row) can be missing. Missing data can occur both in a continuous variable (e.g. height of students) and in a categorical variable (e.g. gender of a population).

We can handle missing values in two ways:

  1. Remove the data (whole rows) that contain missing values, as sketched below.
  2. Fill in the values using some strategy, or use an imputer (see Step 9).
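A minimal sketch of the first option, using a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame with missing cells.
df = pd.DataFrame({"height": [160.0, np.nan, 172.5],
                   "gender": ["F", "M", None]})
print(df.dropna())   # keeps only the rows with no missing values
```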

Let’s get started

Step 1: Import the required libraries and dataset

  • Load the dataset into the data frame.
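Something like the following, where "data.csv" is a placeholder for whichever dataset you are working with:

```python
import numpy as np
import pandas as pd

# "data.csv" is a placeholder -- substitute your own dataset here.
df = pd.read_csv("data.csv")
print(df.head())   # peek at the first five rows
```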

Step 2: Get the dataset info
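Continuing with the df loaded in Step 1:

```python
df.info()             # column names, non-null counts, and dtypes
print(df.describe())  # summary statistics for the numeric columns
```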

Step 3: Import LabelEncoder and OneHotEncoder from the scikit-learn library
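```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
```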

Step 4: Now, using the label encoder, we convert the labels into numeric, machine-readable form.
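A sketch of this step, where "Gender" stands in for whichever categorical column your dataset has:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df["Gender"] = le.fit_transform(df["Gender"])
print(le.classes_)   # the original labels, indexed by their numeric codes
```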

Step 5: Using the one-hot encoder, we represent the categorical data in a more expressive form
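A sketch of the same column one-hot encoded (again, "Gender" is a placeholder; get_feature_names_out assumes scikit-learn >= 1.0):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()                          # output is sparse by default
encoded = ohe.fit_transform(df[["Gender"]])    # note the 2-D [[...]] selection
onehot = pd.DataFrame(encoded.toarray(),
                      columns=ohe.get_feature_names_out(["Gender"]))
df = pd.concat([df.drop(columns=["Gender"]), onehot], axis=1)
```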

Step 6: Normalize the numeric attributes so that their values are rescaled to a common range, such as 0.0 to 1.0
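For example, with MinMaxScaler ("Age" and "Salary" are placeholder numeric columns):

```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()    # rescales each column to the range [0, 1] by default
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])
```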

Step 7: Standardize the numeric attributes so that each is centered around the mean with a unit standard deviation
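The equivalent sketch with StandardScaler:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()  # per column: subtract the mean, divide by the std
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])
```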

Step 8: The goal of discretization is to reduce the number of values a continuous variable assumes by grouping them into a number, b, of intervals or bins.
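A sketch using KBinsDiscretizer with b = 5 equal-width bins (again, "Age" is a placeholder column):

```python
from sklearn.preprocessing import KBinsDiscretizer

kbins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
df[["Age"]] = kbins.fit_transform(df[["Age"]])
```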

Step 9: Imputation is the process of replacing missing data with substituted values.
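For example, with SimpleImputer replacing each missing numeric value by its column's mean:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])
```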

Step 10: Final Output

★ You can find the whole code of data preprocessing here or on my GitHub.
