Normalization

The major tasks of data preprocessing are:

  1. Data cleaning
    1. filling in missing values
    2. smoothing noisy data
    3. identifying and removing outliers
    4. resolving inconsistencies
  2. Data integration
    1. combining data from multiple databases, data files, and data cubes
  3. Data transformation
    1. normalization
    2. aggregation
  4. Data reduction
    1. obtaining a reduced representation of the data that yields the same (or nearly the same) analytical results
  5. Data discretization
    1. part of data reduction, of particular importance for numeric data
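As a quick sketch of the data-cleaning step, missing values can be filled and noisy values smoothed with pandas (the column name here is hypothetical, and the rolling-median window is an arbitrary choice):

```python
import pandas as pd
import numpy as np

# toy column with a missing value and a noisy spike
df = pd.DataFrame({"age": [21, 25, np.nan, 30, 95]})

# fill the missing value with the column mean
df["age"] = df["age"].fillna(df["age"].mean())

# smooth noise with a centered rolling median of window 3
df["age_smoothed"] = df["age"].rolling(3, center=True, min_periods=1).median()
print(df)
```

Other fill strategies (median, interpolation, a constant) work the same way; which one is appropriate depends on the data.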
Normalization:
The goal of normalization is to make an entire set of values share a particular property, such as a common range or a common mean and spread. There are three common ways to perform normalization:
  1. min-max normalization
     X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))  
     X_scaled = X_std * (max - min) + min  
    here max and min define the new target range
  2. z-score normalization
     z = (x - u) / s  
    
    where u is the mean of the training samples and s is their standard deviation
  3. normalization by decimal scaling
     v_new = v / pow(10, j)  
    
    here j is the number of digits in the largest absolute value of the attribute, i.e. the smallest integer such that max(|v_new|) < 1
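Before turning to scikit-learn, the three formulas can be checked by hand with NumPy (a sketch using the same toy matrix as the program below; the digit count for decimal scaling is computed via log10, which matches counting digits for positive integers):

```python
import numpy as np

X = np.array([[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]], dtype=float)

# 1. min-max normalization to the default [0, 1] range
x_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# 2. z-score normalization (population std, as StandardScaler uses)
x_zscore = (X - X.mean(axis=0)) / X.std(axis=0)

# 3. decimal scaling: divide by 10^j, j = digits in the largest absolute value
j = np.ceil(np.log10(np.abs(X).max(axis=0) + 1)).astype(int)
x_decimal = X / 10.0 ** j

print(x_minmax[0], x_zscore[0], x_decimal[0])
```

These manual results agree column by column with what the scikit-learn program below prints.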
I will demonstrate all three in a single program, as shown below:

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import numpy as np
# taking this as an example
data = [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]
# here 20, 8, 0, 1, 5 belong to column 1 and 2, 3, 10, 7, 7 belong to column 2
print(data)
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
scaler = MinMaxScaler()  # default feature_range=(0, 1)
print(scaler.fit(data))  # fit() finds the per-column min and max
print(scaler.data_max_, scaler.data_min_)
tran = scaler.transform(data)
print("transformed : ", tran)
invert = scaler.inverse_transform(tran)
print(invert)
# now StandardScaler
data = [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
scalerstd = StandardScaler()
print(scalerstd.fit(data))
print("mean : ", scalerstd.mean_)
print("var_ : ", scalerstd.var_)  # the standard deviation is the square root of the variance
transtd = scalerstd.transform(data)
print(transtd)
invertstd = scalerstd.inverse_transform(transtd)  # invert the standardized data, not the min-max output
print(invertstd)
# decimal scaling, written as a function
print("Decimal Scaling")
def Dec_scale(df):
    p = max(abs(v) for v in df)
    q = len(str(int(p)))  # number of digits in the largest absolute value
    for x in range(len(df)):
        df[x] = df[x] / 10 ** q
data1 = []
for j in range(len(data[0])):
    data2 = [data[i][j] for i in range(len(data))]
    Dec_scale(data2)
    data1.append(data2)
# print(data1)
# reshape to (2, 5), since the shape of the data is known
pw = np.reshape(data1, (2, 5))
print("decimal scaled : ", pw.T)
Output:
 [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]  
 MinMaxScaler(copy=True, feature_range=(0, 1))  
 [20. 10.] [0. 2.]  
 transformed : [[1.  0.  ]  
  [0.4  0.125]  
  [0.  1.  ]  
  [0.05 0.625]  
  [0.25 0.625]]  
 [[20. 2.]  
  [ 8. 3.]  
  [ 0. 10.]  
  [ 1. 7.]  
  [ 5. 7.]]  
 StandardScaler(copy=True, with_mean=True, with_std=True)  
 mean : [6.8 5.8]  
 var_ : [51.76 8.56]  
 [[ 1.83474958 -1.29881326]  
  [ 0.16679542 -0.9570203 ]  
  [-0.94517403 1.43553045]  
  [-0.80617785 0.41015156]  
  [-0.25019312 0.41015156]]  
 [[20. 2.]  
  [ 8. 3.]  
  [ 0. 10.]  
  [ 1. 7.]  
  [ 5. 7.]]  
 Decimal Scaling  
 decimal scaled : [[0.2 0.02]  
  [0.08 0.03]  
  [0.  0.1 ]  
  [0.01 0.07]  
  [0.05 0.07]]  
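scikit-learn has no built-in decimal-scaling transformer, but MaxAbsScaler is the closest equivalent: it divides each column by its maximum absolute value rather than by the next power of 10 (a sketch, not part of the program above):

```python
from sklearn.preprocessing import MaxAbsScaler
import numpy as np

data = np.array([[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]], dtype=float)

# MaxAbsScaler divides each column by its maximum absolute value,
# so results land in [-1, 1] -- close in spirit to decimal scaling,
# which instead divides by a power of 10
scaled = MaxAbsScaler().fit_transform(data)
print(scaled)
```

For this data it divides column 1 by 20 and column 2 by 10, whereas decimal scaling divides both by 100.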
