Statistics in Data Science - Interview Questions
How to screen for outliers in a data set?
There are many ways to screen and identify potential outliers in a data set. Two key methods are described below –

Standard deviation/z-score : The z-score, or standard score, expresses how many standard deviations a data point lies from the mean. For data that is roughly normally distributed, a common rule is to calculate the standard deviation, multiply it by 3, and flag any data point that falls outside the range mean ± 3 standard deviations. If the z-score is positive, the data point is above average.
If the z-score is negative, the data point is below average.

If the z-score is close to zero, the data point is close to average.

If the z-score is above 3 or below −3, the data point is considered unusual and is treated as a potential outlier.

The formula for calculating a z-score is –
z = (data point − mean) / standard deviation, or z = (x − μ) / σ
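
The z-score rule can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `zscore_outliers`, the sample data, and the choice of the population standard deviation are assumptions made for the example, and the cutoff of 3 follows the rule described above.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        # All values are identical, so no point can be an outlier.
        return []
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Hypothetical sample: nineteen typical readings and one extreme value.
data = [10] * 19 + [100]
print(zscore_outliers(data))  # [100]
```

In very small samples a single extreme value inflates the standard deviation so much that its own z-score may never reach 3, so this rule works best on larger data sets.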

Interquartile range (IQR) : The IQR, also called the midspread, is the range covered by the middle 50% of a data set. It is the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile), not the difference between the two most extreme data points.
IQR = Q3 − Q1
To screen for outliers, a common rule flags any data point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
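
The IQR approach can be sketched in the same way. The function name `iqr_outliers`, the sample data, and the "inclusive" quartile method are assumptions made for illustration; the multiplier 1.5 is the conventional fence and can be changed. Different quartile conventions can give slightly different bounds on small samples.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles() with n=4 returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample with one extreme value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 50]
print(iqr_outliers(data))  # [50]
```

Because quartiles are barely affected by a few extreme values, this rule tends to hold up better than the z-score on skewed or small data sets.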

Other methods to screen for outliers include Isolation Forests, Robust Random Cut Forests, and DBSCAN clustering.