How to Handle Noisy Data During Data Preprocessing?


Binning method (one of several methods):
  • First sort the data and partition it into (equi-depth) bins.
  • Then smooth the values in each bin by the bin mean, the bin median, the bin boundaries, etc.
Equal-width (distance) partitioning:
  • It divides the range into N intervals of equal size (a uniform grid).
  • If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N.
  • It is the most straightforward approach.
  • But outliers may dominate the presentation: one extreme value stretches the range, so most values crowd into a few bins.
  • Skewed data is not handled well.
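As a rough illustration of the points above, here is a minimal sketch of equal-width partitioning using W = (B - A)/N. The function name equal_width_bins and the sample values are assumptions made for this example, not part of the original code.

```python
def equal_width_bins(values, n_bins):
    """Assign values to n_bins intervals of equal width W = (B - A) / N."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        if width == 0:
            # All values are identical, so everything goes into the first bin.
            index = 0
        else:
            # The highest value would land at index n_bins; clamp it into the last bin.
            index = min(int((v - low) / width), n_bins - 1)
        bins[index].append(v)
    return width, bins


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(data, 3)
print(width)  # 10.0
print(bins)   # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]

# A single outlier stretches the range and leaves the middle bin empty.
print(equal_width_bins([4, 8, 9, 15, 21, 200], 3)[1])  # [[4, 8, 9, 15, 21], [], [200]]
```

The second call shows the outlier problem: with 200 in the data, five of the six values end up in the first bin.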
Equal-depth (frequency) partitioning:
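  • It divides the range into N intervals, each containing approximately the same number of samples.
  • It scales well with the data: bin boundaries follow the distribution, so skewed data and outliers do not leave bins nearly empty.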
  • Managing categorical attributes can be tricky.
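Equal-depth partitioning can be sketched as below for the case where the number of samples does not divide evenly by N, so bins differ in size by at most one. The function name equal_depth_bins and the sample values are assumptions for illustration only.

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins whose sizes differ by at most one."""
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins each take one additional item.
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins


data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 200]
print(equal_depth_bins(data, 3))
# [[4, 8, 9, 15, 21], [21, 24, 25, 26], [28, 29, 34, 200]]
```

Unlike equal-width bins, the outlier 200 only affects the last bin here; every bin still holds four or five values.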
Code for binning (if needed, it can be edited to take user input instead of random numbers):
"""
How to Handle Noisy Data in preprocessing of data?
Binning method
"""
import random
import statistics

size = int(input("enter the size : "))
numbers = [random.randrange(100) for i in range(size)]
# sorting
numbers = sorted(numbers)
print("chosen numbers : ", numbers)
bins = int(input("How many bins(input should divide the size perfectly)? : "))
if size > 0 and bins > 0 and size % bins == 0:
    # number of items in each bin
    c = size // bins
    # Partition into (equi-depth) bins
    equi_depth = [numbers[i:i + c] for i in range(0, size, c)]
    # Smoothing by bin means: every value is replaced by the mean of its bin
    smooth_bin_means = [[statistics.mean(b)] * c for b in equi_depth]
    # Smoothing by bin boundaries: every value is replaced by the closer
    # of the bin's minimum and maximum (ties go to the minimum)
    smooth_bin_boundary = []
    for b in equi_depth:
        min_num = b[0]
        max_num = b[-1]
        item = []
        for j in b:
            if j - min_num <= max_num - j:
                item.append(min_num)
            else:
                item.append(max_num)
        smooth_bin_boundary.append(item)
    print("Partition into (equi-depth) bins : ", equi_depth)
    print("Smoothing by bin means : ", smooth_bin_means)
    print("Smoothing by bin boundaries : ", smooth_bin_boundary)
else:
    print("incorrect input")
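The script above covers smoothing by bin means and by bin boundaries. Smoothing by bin median, the third option mentioned earlier, could be sketched along these lines; the function name smooth_by_bin_medians and the fixed sample bins are assumptions for this example.

```python
import statistics


def smooth_by_bin_medians(bins):
    """Replace every value in each bin with that bin's median."""
    return [[statistics.median(b)] * len(b) for b in bins]


# Bins already partitioned into equal depth (4 values each).
equi_depth = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_bin_medians(equi_depth))
# [[8.5, 8.5, 8.5, 8.5], [22.5, 22.5, 22.5, 22.5], [28.5, 28.5, 28.5, 28.5]]
```

The median is less sensitive than the mean to a single extreme value inside a bin.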
Feel free to comment about mistakes and doubts.
