How to Handle Noisy Data in Data Preprocessing


Binning method (one of several methods):
  • First sort the data and partition it into (equi-depth) bins.
  • Then smooth the values in each bin: by bin means, by bin medians, by bin boundaries, etc.
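The two steps above can be sketched in plain Python. This is only an illustrative sketch, not the code originally attached to this post: the function names and the sample values are made up for the example, and it assumes there are at least as many values as bins. For boundary smoothing, a value that is equally close to both boundaries is sent to the lower one, which is a choice made here for simplicity.

```python
from statistics import mean, median

def equi_depth_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value when the split is uneven.
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the median of that bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max (ties go to the min)."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

# Hypothetical sample data, 12 values split into 3 bins of 4 values each.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

Smoothing by means or medians flattens each bin to a single value, while smoothing by boundaries keeps the bin's minimum and maximum and pulls the other values toward whichever one is nearer.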
Equal-width (distance) partitioning:
  • It divides the range into N intervals of equal size (a uniform grid).
  • If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B-A)/N.
  • It is the most straightforward approach.
  • But outliers may dominate the presentation: a single extreme value stretches the range, so most of the data ends up in just a few intervals.
  • Skewed data is not handled well.
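Here is a small sketch of equal-width partitioning using the formula W = (B-A)/N. The function name and the sample data are hypothetical, chosen to show how one outlier can dominate the result. The highest value B is placed in the last interval, and when all values are equal (W = 0) everything goes into interval 0; both are choices made for this sketch.

```python
def equal_width_assign(values, n_bins):
    """Return, for each value, the index (0..n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)      # A and B in the formula
    width = (hi - lo) / n_bins             # W = (B - A) / N
    if width == 0:
        # All values are identical, so there is only one meaningful interval.
        return [0] * len(values)
    # The maximum would land at index n_bins, so clamp it into the last interval.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical data with one outlier (100).
data = [1, 2, 3, 4, 5, 100]
labels = equal_width_assign(data, 3)
counts = [labels.count(i) for i in range(3)]
print(labels)
print(counts)
```

With this data the width is 33, so the five small values share the first interval, the middle interval is empty, and the outlier sits alone in the last one.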
Equal-depth (frequency) partitioning:
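  • It divides the range into N intervals, each containing approximately the same number of samples.
  • It scales well with the data, because the interval boundaries follow where the values are actually concentrated.
  • Managing categorical attributes can be tricky.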
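Equal-depth partitioning can be sketched by ranking the values and cutting the ranks into N equal groups. The function name and data below are hypothetical. Note one simplification: because the split is purely by rank, identical values can end up in different intervals.

```python
def equal_depth_assign(values, n_bins):
    """Return, for each value, the index of its equal-depth (equal-frequency) interval."""
    n = len(values)
    # Positions of the values in ascending order of value.
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        # Map rank 0..n-1 evenly onto interval indices 0..n_bins-1.
        labels[i] = rank * n_bins // n
    return labels

# Hypothetical data with one outlier (100), given in unsorted order.
data = [100, 1, 4, 2, 5, 3]
labels = equal_depth_assign(data, 3)
counts = [labels.count(i) for i in range(3)]
print(labels)
print(counts)
```

Each interval receives two of the six values, so the outlier no longer leaves the other intervals empty the way it can with equal-width partitioning.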
Code for binning (it uses random input; if needed, it can be edited to take user input instead). Feel free to comment with any mistakes or doubts.
