Intro to ML Part 8: Preprocessing
Data Preprocessing
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the dataset
In [2]:
df = pd.read_csv('demo-data.csv')
df
Out[2]:
 | Country | Age | Salary | Purchased |
---|---|---|---|---|
0 | France | 44.0 | 72000.0 | No |
1 | Spain | 27.0 | 48000.0 | Yes |
2 | Germany | 30.0 | 54000.0 | No |
3 | Spain | 38.0 | 61000.0 | No |
4 | Germany | 40.0 | NaN | Yes |
5 | France | 35.0 | 58000.0 | Yes |
6 | Spain | NaN | 52000.0 | No |
7 | France | 48.0 | 79000.0 | Yes |
8 | Germany | 50.0 | 83000.0 | No |
9 | France | 37.0 | 67000.0 | Yes |
Splitting the dataset into dependent and independent variables
In [3]:
df.iloc[0]
Out[3]:
Country       France
Age             44.0
Salary       72000.0
Purchased         No
Name: 0, dtype: object
In [4]:
df.iloc[-1]
Out[4]:
Country       France
Age             37.0
Salary       67000.0
Purchased        Yes
Name: 9, dtype: object
In [5]:
df.iloc[2:5]
Out[5]:
 | Country | Age | Salary | Purchased |
---|---|---|---|---|
2 | Germany | 30.0 | 54000.0 | No |
3 | Spain | 38.0 | 61000.0 | No |
4 | Germany | 40.0 | NaN | Yes |
In [6]:
df.iloc[:]
Out[6]:
 | Country | Age | Salary | Purchased |
---|---|---|---|---|
0 | France | 44.0 | 72000.0 | No |
1 | Spain | 27.0 | 48000.0 | Yes |
2 | Germany | 30.0 | 54000.0 | No |
3 | Spain | 38.0 | 61000.0 | No |
4 | Germany | 40.0 | NaN | Yes |
5 | France | 35.0 | 58000.0 | Yes |
6 | Spain | NaN | 52000.0 | No |
7 | France | 48.0 | 79000.0 | Yes |
8 | Germany | 50.0 | 83000.0 | No |
9 | France | 37.0 | 67000.0 | Yes |
In [7]:
df.iloc[:,:-1]
Out[7]:
 | Country | Age | Salary |
---|---|---|---|
0 | France | 44.0 | 72000.0 |
1 | Spain | 27.0 | 48000.0 |
2 | Germany | 30.0 | 54000.0 |
3 | Spain | 38.0 | 61000.0 |
4 | Germany | 40.0 | NaN |
5 | France | 35.0 | 58000.0 |
6 | Spain | NaN | 52000.0 |
7 | France | 48.0 | 79000.0 |
8 | Germany | 50.0 | 83000.0 |
9 | France | 37.0 | 67000.0 |
In [8]:
df.iloc[:,:1]
Out[8]:
 | Country |
---|---|
0 | France |
1 | Spain |
2 | Germany |
3 | Spain |
4 | Germany |
5 | France |
6 | Spain |
7 | France |
8 | Germany |
9 | France |
In [9]:
df.iloc[:,1:]
Out[9]:
 | Age | Salary | Purchased |
---|---|---|---|
0 | 44.0 | 72000.0 | No |
1 | 27.0 | 48000.0 | Yes |
2 | 30.0 | 54000.0 | No |
3 | 38.0 | 61000.0 | No |
4 | 40.0 | NaN | Yes |
5 | 35.0 | 58000.0 | Yes |
6 | NaN | 52000.0 | No |
7 | 48.0 | 79000.0 | Yes |
8 | 50.0 | 83000.0 | No |
9 | 37.0 | 67000.0 | Yes |
In [10]:
df.iloc[:,1:2]
Out[10]:
 | Age |
---|---|
0 | 44.0 |
1 | 27.0 |
2 | 30.0 |
3 | 38.0 |
4 | 40.0 |
5 | 35.0 |
6 | NaN |
7 | 48.0 |
8 | 50.0 |
9 | 37.0 |
In [11]:
df.iloc[1:5,2:4]
Out[11]:
 | Salary | Purchased |
---|---|---|
1 | 48000.0 | Yes |
2 | 54000.0 | No |
3 | 61000.0 | No |
4 | NaN | Yes |
Independent Variables
In [5]:
x = df.iloc[:, :-1].values
x
Out[5]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
Dependent Variable
In [6]:
y = df.iloc[:, -1].values
y
Out[6]:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Handling null/missing values
In [14]:
df0 = df
df00 = df.copy()
df000 = df.copy(deep=False)
print(id(df))
print(id(df0))
print(id(df00))
print(id(df000))
140288963705728
140288963705728
140288963703520
140288963703184
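The id comparison above shows that plain assignment aliases the same object, while `copy()` and `copy(deep=False)` each create a new object. A minimal sketch of the three cases, using a small hypothetical frame:

```python
import pandas as pd

# Toy frame standing in for the notebook's df (hypothetical values)
df = pd.DataFrame({'Salary': [72000.0, None, 54000.0]})

alias = df                      # plain assignment: same object, changes are shared
deep = df.copy()                # deep copy: fully independent data
shallow = df.copy(deep=False)   # new object that still shares the underlying data

print(alias is df)    # True
print(deep is df)     # False
print(shallow is df)  # False
```

The practical takeaway: only `alias` is literally the same object; the two copies get fresh ids, which is exactly what the differing `id()` values above demonstrate.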
Using pandas' fillna()
In [15]:
df1 = df.copy(deep=False)
# Filling missing values with mean
print(df1.iloc[4])
df1['Salary'] = df1['Salary'].fillna(df1['Salary'].mean())
print(df1.iloc[4])
Country      Germany
Age             40.0
Salary           NaN
Purchased        Yes
Name: 4, dtype: object
Country         Germany
Age                40.0
Salary     63777.777778
Purchased           Yes
Name: 4, dtype: object
In [16]:
df2 = df.copy(deep=False)
# Filling missing values with the mode (most frequent value)
print(df2.iloc[4])
df2['Salary'] = df2['Salary'].fillna(df2['Salary'].value_counts().index[0])
print(df2.iloc[4])
Country      Germany
Age             40.0
Salary           NaN
Purchased        Yes
Name: 4, dtype: object
Country      Germany
Age             40.0
Salary       72000.0
Purchased        Yes
Name: 4, dtype: object
In [18]:
df3 = df.copy(deep=False)
# Filling missing values with some constant value
print(df3.iloc[4])
df3['Salary'] = df3['Salary'].fillna(-1)
print(df3.iloc[4])
Country      Germany
Age             40.0
Salary           NaN
Purchased        Yes
Name: 4, dtype: object
Country      Germany
Age             40.0
Salary          -1.0
Purchased        Yes
Name: 4, dtype: object
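Besides filling with the mean, mode, or a constant, pandas can also fill from neighbouring rows or drop incomplete rows altogether. A small sketch with a hypothetical salary series:

```python
import pandas as pd

# Hypothetical salary series with gaps
s = pd.Series([72000.0, None, 54000.0, None, 83000.0])

print(s.ffill().tolist())   # forward-fill: propagate the previous value
print(s.bfill().tolist())   # backward-fill: pull the next value back
print(s.dropna().tolist())  # or drop the incomplete rows entirely
```

Forward/backward fill mainly makes sense for ordered data (e.g. time series); for a table like this one, the mean/mode/constant strategies above are usually more appropriate.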
Using sklearn's SimpleImputer
In [15]:
import numpy as np
from sklearn.impute import SimpleImputer
In [16]:
# Imputing missing values with the mean; other allowed strategies are `median`, `most_frequent` and `constant`
imputer_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# imputer_median = SimpleImputer(missing_values=np.nan, strategy='median')
# imputer_most_frequent = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# imputer_mean = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=-1)
# fitting numerical data
imputer_mean.fit(x[:, 1:3])
# updating dataset
x[:, 1:3] = imputer_mean.transform(x[:, 1:3])
x
Out[16]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
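SimpleImputer is not limited to numerical columns: the `most_frequent` (and `constant`) strategies also work on string data. A minimal sketch on a hypothetical country column:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
countries = np.array([['France'], ['Spain'], [np.nan], ['France']], dtype=object)

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imp.fit_transform(countries)
print(filled.ravel())  # the NaN becomes 'France', the most frequent value
```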
Encoding categorical data
Categorical data are non-numerical variables, also known as nominal variables.
Encoding the independent variable
In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
- Applying one-hot encoding via the built-in ColumnTransformer.
- Setting remainder='passthrough' to keep the remaining columns intact.
- Here each transformer is a three-element tuple: (name, transformer, columns).
In [8]:
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
print(x[0])
x = np.array(ct.fit_transform(x))
print(x[0])
x
['France' 44.0 72000.0]
[1.0 0.0 0.0 44.0 72000.0]
Out[8]:
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, nan],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, nan, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
Encoding the Dependent Variable
In [9]:
from sklearn.preprocessing import LabelEncoder
- Applying label encoding to the dependent variable: it contains only two classes, so one-hot encoding would be overkill.
In [10]:
le = LabelEncoder()
print(y[0], y[1])
y = le.fit_transform(y)
print(y[0], y[1])
y
No Yes
0 1
Out[10]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
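The encoding is reversible: `classes_` shows which label maps to which integer, and `inverse_transform` recovers the original strings. A minimal sketch:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['No', 'Yes', 'No', 'Yes'])

print(le.classes_)                          # ['No' 'Yes'] -> 'No' is 0, 'Yes' is 1
print(list(encoded))                        # [0, 1, 0, 1]
print(list(le.inverse_transform(encoded)))  # ['No', 'Yes', 'No', 'Yes']
```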
Splitting the dataset into training and testing sets
- Allocating 25% of the whole dataset to testing and the rest to training.
In [21]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 1)
In [22]:
x_train
Out[22]:
array([[0.0, 1.0, 0.0, 40.0, nan],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)
In [23]:
y_train
Out[23]:
array([1, 0, 0, 1, 1, 0, 1])
In [24]:
x_test
Out[24]:
array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, nan, 52000.0]], dtype=object)
In [25]:
y_test
Out[25]:
array([0, 1, 0])
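On small or imbalanced datasets, a random split can leave one class under-represented in the test set. `train_test_split` accepts a `stratify` argument that preserves the class ratio in both splits; a sketch with hypothetical balanced labels:

```python
from sklearn.model_selection import train_test_split

# Hypothetical balanced labels: four of each class
X = [[i] for i in range(8)]
y = [0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)
print(sorted(y_test))  # [0, 1] — one sample per class in the test split
```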
Feature Scaling
Feature scaling is a method used to normalize the range of independent variables or features of data. Two common methods are:

Normalization (also known as min-max scaling) rescales values into the range [0, 1]:

    x' = (x - min(x)) / (max(x) - min(x))

where x is the original value and x' is the normalized value.

Standardization rescales values to have mean 0 and standard deviation 1:

    x' = (x - x̄) / σ

where x is the original value, x̄ is the mean and σ is the standard deviation.
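Both methods are available in scikit-learn. As a minimal sketch of min-max scaling on a hypothetical age column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical age column
ages = np.array([[27.0], [35.0], [44.0], [50.0]])

scaled = MinMaxScaler().fit_transform(ages)
print(scaled.ravel())  # 27 -> 0.0 and 50 -> 1.0; the rest land in between
```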
In [26]:
from sklearn.preprocessing import StandardScaler
In [28]:
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
In [29]:
x_train
Out[29]:
array([[0.0, 1.0, 0.0, -0.03891021128204815, nan],
       [1.0, 0.0, 0.0, 0.5058327466666259, 0.42119409708279354],
       [0.0, 0.0, 1.0, -0.31128169025638514, -0.4755417225128314],
       [0.0, 0.0, 1.0, -1.8093248246152385, -1.53532041839857],
       [1.0, 0.0, 0.0, 1.0505757046152997, 0.9918441640981912],
       [0.0, 1.0, 0.0, 1.3229471835896367, 1.3179299166784184],
       [1.0, 0.0, 0.0, -0.7198389087178906, -0.7201060369480019]], dtype=object)
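One detail worth emphasizing: the scaler must be fitted on the training set only, and the test set transformed with the same learned statistics — otherwise information from the test set leaks into preprocessing. A minimal sketch with hypothetical train/test ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test ages
x_train = np.array([[40.0], [44.0], [38.0]])
x_test = np.array([[30.0]])

sc = StandardScaler()
x_train_s = sc.fit_transform(x_train)  # learn mean and std from training data only
x_test_s = sc.transform(x_test)        # reuse them: no fitting on the test set
print(sc.mean_)                        # statistics come from x_train alone
```

So when scaling x_test here, the call should be `sc.transform(...)`, not `sc.fit_transform(...)`.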
In [ ]: