Normalization

Major tasks of preprocessing are:

  1. Data cleaning
    1. filling missing values
    2. smoothing of noisy data
    3. identifying and removing outliers
    4. resolving inconsistencies
  2. Data integration
    1. integrating data from multiple databases, data files, and data cubes
  3. Data transformation
    1. normalization
    2. aggregation
  4. Data reduction
    1. obtaining a reduced representation of the data that still yields the same (or nearly the same) analytical results
  5. Data discretization
    1. part of data reduction, but of particular importance, especially for numeric data
Normalization:
The goal of normalization is to rescale all the values of an attribute so that the whole set has a particular property, such as a fixed range or zero mean and unit variance. There are three common ways to perform normalization:
  1. min-max normalization
     X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
     X_scaled = X_std * (max - min) + min
    where min and max are the lower and upper bounds of the new range (feature_range in the output below, here (0, 1))
  2. z-score normalization
     z = (x - u) / s
    where u is the mean of the training samples and s is their standard deviation
  3. normalization by decimal scaling
     v_new = v / pow(10, j)
    where j is the number of digits in the largest absolute value of the attribute, so every scaled value has a magnitude below 1
I will demonstrate all three in a single program, run on a small dataset of five rows and two columns:
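The scalers named in the output (MinMaxScaler, StandardScaler) are scikit-learn classes. As a rough sketch of what they compute, here are the three formulas in plain Python, applied column by column to the same five rows. The helper names (min_max, z_score, decimal_scale, apply_per_column) are my own for illustration and not part of any library. The sketch assumes the population standard deviation (dividing by n), which is what the var_ values in the output correspond to, and it assumes integer-valued, non-constant columns.

```python
import math

# Same five rows, two columns, as the first line of the output.
data = [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]


def min_max(col, new_min=0.0, new_max=1.0):
    """Rescale a column linearly into [new_min, new_max].

    Assumes the column is not constant (otherwise hi - lo is zero).
    """
    lo, hi = min(col), max(col)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in col]


def z_score(col):
    """Centre a column on its mean and divide by its standard deviation."""
    u = sum(col) / len(col)
    # Population standard deviation (divide by n), an assumption that
    # matches the var_ values shown in the output.
    s = math.sqrt(sum((x - u) ** 2 for x in col) / len(col))
    return [(x - u) / s for x in col]


def decimal_scale(col):
    """Divide a column by 10**j, j = digit count of its largest absolute value.

    Assumes integer-valued attributes, so counting digits is well defined.
    """
    j = len(str(int(max(abs(x) for x in col))))
    return [x / 10 ** j for x in col]


def apply_per_column(fn, rows):
    """Apply fn to each column of rows and return the result as rows again."""
    scaled_cols = [fn(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*scaled_cols)]


minmax_rows = apply_per_column(min_max, data)
zscore_rows = apply_per_column(z_score, data)
decimal_rows = apply_per_column(decimal_scale, data)

for name, rows in [("min-max", minmax_rows),
                   ("z-score", zscore_rows),
                   ("decimal scaling", decimal_rows)]:
    print(name)
    for row in rows:
        print("  ", [round(v, 8) for v in row])
```

The rounded values should agree with the min-max, z-score, and decimal-scaled blocks of the output, which makes the sketch a handy cross-check when reading the numbers.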

output:
 [[20, 2], [8, 3], [0, 10], [1, 7], [5, 7]]  
 MinMaxScaler(copy=True, feature_range=(0, 1))  
 [20. 10.] [0. 2.]  
 tranformed : [[1.  0.  ]  
  [0.4  0.125]  
  [0.  1.  ]  
  [0.05 0.625]  
  [0.25 0.625]]  
 [[20. 2.]  
  [ 8. 3.]  
  [ 0. 10.]  
  [ 1. 7.]  
  [ 5. 7.]]  
 StandardScaler(copy=True, with_mean=True, with_std=True)  
 mean : [6.8 5.8]  
 var_ : [51.76 8.56]  
 [[ 1.83474958 -1.29881326]  
  [ 0.16679542 -0.9570203 ]  
  [-0.94517403 1.43553045]  
  [-0.80617785 0.41015156]  
  [-0.25019312 0.41015156]]  
 [[13.9944423  5.8    ]  
  [ 9.67777692 6.16571847]  
  [ 6.8     8.72574777]  
  [ 7.15972211 7.62859235]  
  [ 8.59861057 7.62859235]]  
 Decimal Scaling  
 decimal scaled : [[0.2 0.02]  
  [0.08 0.03]  
  [0.  0.1 ]  
  [0.01 0.07]  
  [0.05 0.07]]  
