
How to remove outliers in a large dataset with pandas

I would say that the pandas box plot is one of the best tools for visualizing outliers.

    import matplotlib.pyplot as plt

    df.plot(kind='box', sharex=False, sharey=False)
    plt.show()

Getting a number that tells you whether a given data point really is an outlier is slightly more complicated.

The definition of an outlier is, per se, already quite fuzzy, but a common one is any value that falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where IQR = Q3 - Q1 is the interquartile range (the Tukey method). In this case, both the standard deviation method and the Tukey method are valid options. We just need to try both and see which gives better results.
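To see why it is worth trying both rules, here is a minimal sketch on a made-up series (the values are purely illustrative, not from any real dataset). It applies the Tukey fences and the mean +/- 3 standard deviations rule to the same data.

```python
import pandas as pd

# Toy data (hypothetical): nine ordinary values and one extreme value.
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100], dtype=float)

# Tukey fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
tukey_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Standard deviation rule: mean +/- 3 * std
mean, std = s.mean(), s.std()
std_mask = (s < mean - 3 * std) | (s > mean + 3 * std)

print("Tukey outliers:", s[tukey_mask].tolist())
print("3-sigma outliers:", s[std_mask].tolist())
```

On this toy series the Tukey fences flag the value 100, while the 3-sigma rule flags nothing: the extreme value inflates the standard deviation enough to hide itself. In small samples the two rules can disagree like this.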

    # Tukey method
    from collections import Counter
    import numpy as np

    # A row counts as an outlier if at least n of its numerical values are outliers.
    # The optimal value for this parameter can be determined later through cross-validation.
    n = 2
    indexes = []

    for col in df.columns[0:14]:
        Q1 = np.percentile(df[col], 25)
        Q3 = np.percentile(df[col], 75)
        IQR = Q3 - Q1

        limit = 1.5 * IQR

        # Index labels of the rows that are outliers for feature col
        list_outliers = df[(df[col] < Q1 - limit) | (df[col] > Q3 + limit)].index

        # Add the outlier indices found for col to the overall list
        indexes.extend(list_outliers)

    # Count how many features flagged each row, then keep rows flagged at least n times
    indexes = Counter(indexes)
    multiple_outliers = [k for k, v in indexes.items() if v >= n]
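On a large dataset, the per-column loop above can also be written with column-wise pandas operations. The following is a minimal sketch, assuming the selected columns are all numeric; the function name `rows_with_multiple_outliers` and the demo frame are hypothetical and only serve the illustration.

```python
import pandas as pd

def rows_with_multiple_outliers(df, columns, n=2, k=1.5):
    """Return index labels of rows with at least n values outside the Tukey fences."""
    sub = df[columns]
    q1 = sub.quantile(0.25)
    q3 = sub.quantile(0.75)
    limit = k * (q3 - q1)
    # Boolean frame: True where a value lies outside the fences of its own column
    outside = sub.lt(q1 - limit, axis=1) | sub.gt(q3 + limit, axis=1)
    counts = outside.sum(axis=1)  # number of outlying values per row
    return counts[counts >= n].index.tolist()

# Hypothetical demo: row 9 is extreme in "a" and "b", row 8 only in "c".
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [2, 3, 4, 5, 6, 7, 8, 9, 10, -100],
    "c": [5, 5, 6, 6, 7, 7, 8, 8, 200, 9],
}, dtype=float)

flagged = rows_with_multiple_outliers(demo, ["a", "b", "c"], n=2)
flagged_any = rows_with_multiple_outliers(demo, ["a", "b", "c"], n=1)
print(flagged, flagged_any)
```

Lowering `n` to 1 flags every row with at least one outlying value, so the threshold directly controls how aggressive the filter is.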

Once you detect the outliers, you can either remove them or replace them with max/min limit values. In the first case, you just need to do this:

    # drop() returns a new DataFrame, so assign the result back
    df = df.drop(multiple_outliers, axis=0).reset_index(drop=True)

But for the second case, you should do this:

    # Cap outliers at the min/max limits using the standard deviation
    factor = 3  # The optimal value for this parameter can be determined later through cross-validation
    for col in df.columns[0:14]:
        upper_lim = df[col].mean() + df[col].std() * factor
        lower_lim = df[col].mean() - df[col].std() * factor

        # Replace values beyond the limits with the limits themselves (no rows are dropped)
        df[col] = df[col].clip(lower=lower_lim, upper=upper_lim)
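Replacing outliers with the limit values, rather than dropping rows, can be sketched with `DataFrame.clip`. This is only an illustration under assumptions: the helper name `cap_outliers` and the demo column are hypothetical, and the limits are computed once from the data before any capping takes place.

```python
import pandas as pd

def cap_outliers(df, columns, factor=3.0):
    """Return a copy of df with the given columns clipped to mean +/- factor * std."""
    capped = df.copy()
    mean = capped[columns].mean()
    std = capped[columns].std()
    # Series limits are aligned with the column labels when axis=1
    capped[columns] = capped[columns].clip(
        lower=mean - factor * std, upper=mean + factor * std, axis=1
    )
    return capped

# Hypothetical demo: nineteen zeros and one extreme value.
demo = pd.DataFrame({"x": [0.0] * 19 + [100.0]})
capped = cap_outliers(demo, ["x"])
print(len(demo), len(capped), capped["x"].max())
```

The row count stays the same; only the extreme value is pulled back to the upper limit.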

Finally, you can also use Isolation Forest or LocalOutlierFactor (more appropriate for Anomaly/Fraud Detection Problems).
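As a rough sketch of that last option, assuming scikit-learn is installed: both estimators return -1 for points they consider outliers and 1 for inliers. The synthetic data, the `contamination` value and the `n_neighbors` value are arbitrary choices for the illustration, not recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical data: a Gaussian cloud plus one point placed far away from it.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 2)), columns=["f1", "f2"])
X.loc[0] = [8.0, 8.0]

# contamination is the expected share of outliers (assumed to be 1% here)
iso = IsolationForest(contamination=0.01, random_state=0)
iso_labels = iso.fit_predict(X)  # -1 = outlier, 1 = inlier

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
lof_labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# Keep only the rows each model considers inliers
clean_iso = X[iso_labels == 1]
clean_lof = X[lof_labels == 1]
print(len(X), len(clean_iso), len(clean_lof))
```

Unlike the per-column rules, both models look at all features together, so they can flag rows that are unusual only in combination.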

By Ebonee Waage
