Normalization
Major tasks of data preprocessing are:
- Data cleaning
  - filling in missing values
  - smoothing noisy data
  - identifying and removing outliers
  - resolving inconsistencies
- Data integration
  - integrating data from multiple databases, data files, and cubes
- Data transformation
  - normalization
  - aggregation
- Data reduction
  - obtaining a reduced representation of the data that still yields the same (or almost the same) analytical results
- Data discretization
  - part of data reduction, but of particular importance, especially for numeric data
Normalization:
The goal of normalization is to make an entire set of values have a particular property. There are three different ways to perform normalization:
- min-max normalization
  X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
  X_scaled = X_std * (max - min) + min
  where max and min are the bounds of the new range
- z-score normalization
  z = (x - u) / s
  where u is the mean of the training samples and s is the standard deviation
- normalization by decimal scaling
  v_new = v / pow(10, j)
  where j is the number of digits in the largest number of the attribute
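Before the full program, here is a minimal sketch (my own, just for illustration) that applies the three formulas by hand with NumPy to a single column of the example data used below:

import numpy as np

x = np.array([20, 8, 0, 1, 5], dtype=float)  # column-1 of the example data

# min-max normalization to the default range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())
print(x_minmax)  # [1.   0.4  0.   0.05 0.25]

# z-score normalization (population standard deviation, as sklearn uses)
z = (x - x.mean()) / x.std()
print(z)  # approximately [ 1.835  0.167 -0.945 -0.806 -0.250], matching StandardScaler below

# decimal scaling: j = number of digits in the largest value (20 -> j = 2)
j = len(str(int(abs(x).max())))
print(x / 10**j)  # [0.2  0.08 0.   0.01 0.05]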
I will demonstrate all three in a single program, as shown below:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import numpy as np

# taking this as an example
data = [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]
# here 20, 8, 0, 1, 5 belong to column-1 and 2, 3, 10, 7, 7 belong to column-2
print(data)

# min-max normalization
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
scaler = MinMaxScaler()  # default feature_range=(0, 1)
print(scaler.fit(data))  # fit() finds the per-column min and max
print(scaler.data_max_, scaler.data_min_)
tran = scaler.transform(data)
print("transformed : ", tran)
invert = scaler.inverse_transform(tran)  # maps the scaled values back to the original data
print(invert)

# now StandardScaler (z-score normalization)
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
scalerstd = StandardScaler()
print(scalerstd.fit(data))
print("mean : ", scalerstd.mean_)
print("var_ : ", scalerstd.var_)  # standard deviation is the square root of variance
transtd = scalerstd.transform(data)
print(transtd)
invertstd = scalerstd.inverse_transform(transtd)  # invert the z-scored values, not the min-max ones
print(invertstd)

# decimal scaling as a function
print("Decimal Scaling")
def Dec_scale(df):
    p = max(df)               # largest value in the column
    q = len(str(abs(p)))      # number of digits j in that value
    for x in range(len(df)):
        df[x] = df[x] / 10**q  # v_new = v / 10**j

data1 = []
for j in range(len(data[0])):  # scale each column separately
    data2 = [data[i][j] for i in range(len(data))]
    Dec_scale(data2)
    data1.append(data2)
# data1 holds the columns, so reshape to (2, 5) and transpose back to rows
# (I have taken (2, 5) since I know the shape of the data)
pw = np.reshape(data1, (2, 5))
print("decimal scaled : ", pw.T)
[[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]
MinMaxScaler(copy=True, feature_range=(0, 1))
[20. 10.] [0. 2.]
transformed : [[1. 0. ]
[0.4 0.125]
[0. 1. ]
[0.05 0.625]
[0.25 0.625]]
[[20. 2.]
[ 8. 3.]
[ 0. 10.]
[ 1. 7.]
[ 5. 7.]]
StandardScaler(copy=True, with_mean=True, with_std=True)
mean : [6.8 5.8]
var_ : [51.76 8.56]
[[ 1.83474958 -1.29881326]
[ 0.16679542 -0.9570203 ]
[-0.94517403 1.43553045]
[-0.80617785 0.41015156]
[-0.25019312 0.41015156]]
[[20. 2.]
[ 8. 3.]
[ 0. 10.]
[ 1. 7.]
[ 5. 7.]]
Decimal Scaling
decimal scaled : [[0.2 0.02]
[0.08 0.03]
[0. 0.1 ]
[0.01 0.07]
[0.05 0.07]]
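As a small extension (my own addition, not part of the program above), MinMaxScaler also accepts a custom feature_range, which plays the role of max and min in the first formula, and StandardScaler exposes the standard deviation it uses directly as scale_:

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy as np

data = [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]

# rescale into [-1, 1] instead of the default (0, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))

# scale_ is the standard deviation used for z-scoring: sqrt(var_)
scalerstd = StandardScaler().fit(data)
print(np.allclose(scalerstd.scale_, np.sqrt(scalerstd.var_)))  # True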
Those interested can explore this comparison of the different scalers:
https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
Feel free to comment about mistakes and doubts.