Is there a numpy builtin to reject outliers from a list?


crosscheck 2020. 9. 23. 07:15


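Is there a numpy builtin that does something like the following? That is, take a list d and return a list filtered_d with any outlying elements removed, based on some assumed distribution of the points in d.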

import numpy as np

def reject_outliers(data):
    m = 2
    u = np.mean(data)
    s = np.std(data)
    # Keep only the points within m standard deviations of the mean.
    filtered = [e for e in data if (u - m * s < e < u + m * s)]
    return filtered

>>> d = [2,4,5,1,6,5,40]
>>> filtered_d = reject_outliers(d)
>>> print(filtered_d)
[2, 4, 5, 1, 6, 5]

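I say "something like" because the function might allow for different distributions (Poisson, Gaussian, etc.) and different outlier thresholds within those distributions (like the m I have used here).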

This method is almost identical to yours, just more numpyst (it also works on numpy arrays only):

def reject_outliers(data, m=2):
    return data[abs(data - np.mean(data)) < m * np.std(data)]
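
As a quick check of the one-liner above, here is a minimal sketch that applies it to the list from the question. The np.asarray conversion is an addition made here, since boolean-mask indexing does not work on a plain Python list.

```python
import numpy as np

def reject_outliers(data, m=2):
    # Keep points closer than m standard deviations to the mean.
    return data[abs(data - np.mean(data)) < m * np.std(data)]

# Convert the list first; a plain list cannot be indexed with a boolean mask.
d = np.asarray([2, 4, 5, 1, 6, 5, 40])
filtered_d = reject_outliers(d)
print(filtered_d)
```

Here the mean is 9 and the standard deviation is about 12.8, so only 40 falls outside the 2-sigma band.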

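Something important when dealing with outliers is that the estimators you use should be as robust as possible. The mean of a distribution is biased by outliers, whereas the median, for example, is affected much less.

Building on eumiro's answer: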

def reject_outliers(data, m = 2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d/mdev if mdev else 0.
    return data[s<m]

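Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.

Note that for the data[s<m] syntax to work, data must be a NumPy array.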
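
To make the scaling concrete, the following sketch prints nothing new but walks through the intermediate quantities for the data from the question. The variable names follow the function above; using the question's data here is a choice made for illustration.

```python
import numpy as np

data = np.array([2, 4, 5, 1, 6, 5, 40])

med = np.median(data)       # robust centre of the data
d = np.abs(data - med)      # absolute distance of each point to the median
mdev = np.median(d)         # median of those distances
s = d / mdev                # distances expressed in units of mdev
kept = data[s < 2.0]        # same cut as m = 2 in the function above

print(med, mdev)
print(kept)
```

Note that m counts median absolute deviations here, not standard deviations. For Gaussian data one median absolute deviation is roughly 0.6745 standard deviations, so the same m is a stricter cut than in the mean/std version, which is why 1 and 2 are also dropped in this example.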

Benjamin Bannier's answer yields a pass-through when the median of the distances from the median is 0, so I found this modified version more useful in cases like the example below.

def reject_outliers_2(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / (mdev if mdev else 1.)
    return data[s < m]

Example:

data_points = np.array([10, 10, 10, 17, 10, 10])
print(reject_outliers(data_points))
print(reject_outliers_2(data_points))

Gives:

[[10 10 10 17 10 10]]  # 17 is not filtered
[10 10 10 10 10]  # 17 is filtered (its distance, 7, is greater than m)

Building on Benjamin's answer, using pandas.Series, and replacing MAD with IQR:

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1-pcnt])
    iqr = qhigh - qlow
    return sr[ (sr - median).abs() <= iqr]

For instance, if you set iq_range=0.6, the percentiles of the interquartile range become 0.20 <--> 0.80, so the accepted band widens and more of the outlying values are retained.
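
The effect of iq_range can be seen on a small series. This is a minimal sketch that repeats the function above and applies it to the data from the question; the names narrow and wide are introduced here for illustration only.

```python
import pandas as pd

def reject_outliers(sr, iq_range=0.5):
    pcnt = (1 - iq_range) / 2
    qlow, median, qhigh = sr.dropna().quantile([pcnt, 0.50, 1 - pcnt])
    iqr = qhigh - qlow
    # Keep values whose distance to the median is at most one range width.
    return sr[(sr - median).abs() <= iqr]

sr = pd.Series([2, 4, 5, 1, 6, 5, 40])

narrow = reject_outliers(sr)                # quantiles 0.25 and 0.75
wide = reject_outliers(sr, iq_range=0.6)    # quantiles 0.20 and 0.80

print(narrow.tolist())
print(wide.tolist())
```

With the default setting the range width is 2.5, so only values within 2.5 of the median 5 survive. With iq_range=0.6 the width grows to 3.4 and the value 2 is kept as well.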

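An alternative is to make a robust estimate of the standard deviation (assuming Gaussian statistics). Looking at online calculators, I see that the 90th percentile corresponds to 1.2815σ and the 95th to 1.645σ ( http://vassarstats.net/tabs.html?#z ).

As a simple example: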
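
The two constants can also be reproduced without a lookup table. The sketch below assumes SciPy is available and uses scipy.stats.norm.ppf, the inverse of the standard normal CDF; the names z90 and z95 are introduced here.

```python
from scipy.stats import norm

# Number of standard deviations above the mean at which a standard
# normal distribution reaches the given cumulative probability.
z90 = norm.ppf(0.90)
z95 = norm.ppf(0.95)

print(round(z90, 4), round(z95, 4))
```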

import numpy as np

# Create some random numbers
x = np.random.normal(5, 2, 1000)

# Calculate the statistics
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Add a few large points
x[10] += 1000
x[20] += 2000
x[30] += 1500

# Recalculate the statistics
print()
print("Mean= ", np.mean(x))
print("Median= ", np.median(x))
print("Max/Min=", x.max(), " ", x.min())
print("StdDev=", np.std(x))
print("90th Percentile", np.percentile(x, 90))

# Measure the percentile intervals and then estimate Standard Deviation of the distribution, both from median to the 90th percentile and from the 10th to 90th percentile
p90 = np.percentile(x, 90)
p10 = np.percentile(x, 10)
p50 = np.median(x)
# p50 to p90 is 1.2815 sigma
rSig = (p90-p50)/1.2815
print("Robust Sigma=", rSig)

rSig = (p90-p10)/(2*1.2815)
print("Robust Sigma=", rSig)

The output I get is:

Mean=  4.99760520022
Median=  4.95395274981
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.976629928
90th Percentile 7.52065379649

Mean=  9.64760520022
Median=  4.95667658782
Max/Min= 2205.43861943   -2.15388472011
StdDev= 88.6263902244
90th Percentile 7.60646688694

Robust Sigma= 2.06772555531
Robust Sigma= 1.99878292462

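This is close to the expected value of 2.

If we want to remove points more than 5 standard deviations above or below the median (with 1000 points we would expect 1 value > 3 standard deviations):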

y = x[abs(x - p50) < rSig*5]

# Print the statistics again
print("Mean= ", np.mean(y))
print("Median= ", np.median(y))
print("Max/Min=", y.max(), " ", y.min())
print("StdDev=", np.std(y))

This gives:

Mean=  4.99755359935
Median=  4.95213030447
Max/Min= 11.1226494654   -2.15388472011
StdDev= 1.97692712883

I have no idea which approach is more efficient or more robust.

I wanted to do something similar, except setting the number to NaN rather than removing it from the data, since if you remove it you change the length which can mess up plotting (i.e. if you're only removing outliers from one column in a table, but you need it to remain the same as the other columns so you can plot them against each other).

To do so I used numpy's masking functions:

def reject_outliers(data, m=2):
    stdev = np.std(data)
    mean = np.mean(data)
    maskMin = mean - stdev * m
    maskMax = mean + stdev * m
    mask = np.ma.masked_outside(data, maskMin, maskMax)
    print('Masking values outside of {} and {}'.format(maskMin, maskMax))
    return mask
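
If actual NaN values are needed rather than a masked array, the mask can be filled afterwards. This is a minimal sketch using the data from the question; it assumes a float array, since NaN cannot be stored in an integer array, and the names lo, hi, and with_nan are introduced here.

```python
import numpy as np

data = np.array([2, 4, 5, 1, 6, 5, 40], dtype=float)
m = 2

lo = np.mean(data) - m * np.std(data)
hi = np.mean(data) + m * np.std(data)

# Mask everything outside [lo, hi]; the array keeps its original length.
masked = np.ma.masked_outside(data, lo, hi)

# Replace the masked entries with NaN to get a plain ndarray back.
with_nan = masked.filled(np.nan)

print(masked)
print(with_nan)
```

Statistics computed on the masked array, such as masked.mean(), skip the masked entries, whereas np.mean(with_nan) would return NaN.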

I would like to provide two methods in this answer: one based on the "z score" and one based on the "IQR".

The code provided in this answer works on both one-dimensional and multi-dimensional numpy arrays.

Let's import some modules first.

import collections
import numpy as np
import scipy.stats as stat
from scipy.stats import iqr

z score based method

This method tests whether each value lies more than three standard deviations from the mean (the threshold is set by bar). It returns True for values that are outliers under this rule and False otherwise.

def sd_outlier(x, axis = None, bar = 3, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_z = stat.zscore(x, axis = axis)

    if side == 'gt':
        return d_z > bar
    elif side == 'lt':
        return d_z < -bar
    elif side == 'both':
        return np.abs(d_z) > bar

IQR based method

This method will test if the value is less than q1 - 1.5 * iqr or greater than q3 + 1.5 * iqr, which is similar to SPSS's plot method.

def q1(x, axis = None):
    return np.percentile(x, 25, axis = axis)

def q3(x, axis = None):
    return np.percentile(x, 75, axis = axis)

def iqr_outlier(x, axis = None, bar = 1.5, side = 'both'):
    assert side in ['gt', 'lt', 'both'], 'Side should be `gt`, `lt` or `both`.'

    d_iqr = iqr(x, axis = axis)
    d_q1 = q1(x, axis = axis)
    d_q3 = q3(x, axis = axis)
    iqr_distance = np.multiply(d_iqr, bar)

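    # Shape of the reduced statistics, kept broadcastable against x.
    stat_shape = list(x.shape)

    if axis is None:
        # Statistics were computed over the whole array.
        stat_shape = [1] * x.ndim
    elif np.iterable(axis):
        for single_axis in axis:
            stat_shape[single_axis] = 1
    else:
        stat_shape[axis] = 1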

    if side in ['gt', 'both']:
        upper_range = d_q3 + iqr_distance
        upper_outlier = np.greater(x - upper_range.reshape(stat_shape), 0)
    if side in ['lt', 'both']:
        lower_range = d_q1 - iqr_distance
        lower_outlier = np.less(x - lower_range.reshape(stat_shape), 0)

    if side == 'gt':
        return upper_outlier
    if side == 'lt':
        return lower_outlier
    if side == 'both':
        return np.logical_or(upper_outlier, lower_outlier)

Finally, if you want to filter out the outliers, use a numpy selector.
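
A "numpy selector" here means indexing with the boolean mask. The sketch below is self-contained: instead of calling sd_outlier it computes the z score directly with numpy, which matches scipy's zscore with its default ddof=0. The sample data and the names is_outlier and cleaned are made up for illustration.

```python
import numpy as np

# Twenty well-behaved points plus one extreme value.
x = np.array([9.0, 11.0] * 10 + [100.0])

# z score of every point (population standard deviation, ddof=0).
z = (x - x.mean()) / x.std()

# Boolean mask: True where the point is an outlier (|z| > 3).
is_outlier = np.abs(z) > 3

# Select everything that is not flagged.
cleaned = x[~is_outlier]

print(is_outlier.sum(), cleaned.max())
```

Keep in mind that the largest possible |z| in a sample of n points is (n - 1) / sqrt(n), so with very small samples a threshold of 3 can never be exceeded.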

Have a nice day.

Consider that all the above methods fail when your standard deviation gets very large due to huge outliers.

(Similarly, the average calculation fails and one should rather calculate the median. Though, the average is "more prone to such an error as the stdDv".)

You could try to apply your algorithm iteratively, or you could filter using the interquartile range (here "factor" relates to an n*sigma range, but only when your data follows a Gaussian distribution):

import numpy as np

def sortoutOutliers(dataIn,factor):
    quant3, quant1 = np.percentile(dataIn, [75 ,25])
    iqr = quant3 - quant1
    iqrSigma = iqr/1.34896
    medData = np.median(dataIn)
    dataOut = [ x for x in dataIn if ( (x > medData - factor* iqrSigma) and (x < medData + factor* iqrSigma) ) ] 
    return(dataOut)
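
The constant 1.34896 is approximately the width of the interquartile range of a standard normal distribution in units of sigma. The following sketch checks that with scipy.stats.norm.ppf (assuming SciPy is available) and then runs the function above on the data from the question; the names iqr_in_sigma and result are introduced here.

```python
import numpy as np
from scipy.stats import norm

# Distance between the 25th and 75th percentiles of a standard normal.
iqr_in_sigma = norm.ppf(0.75) - norm.ppf(0.25)

def sortoutOutliers(dataIn, factor):
    quant3, quant1 = np.percentile(dataIn, [75, 25])
    iqr = quant3 - quant1
    iqrSigma = iqr / 1.34896          # IQR converted to a sigma estimate
    medData = np.median(dataIn)
    return [x for x in dataIn
            if medData - factor * iqrSigma < x < medData + factor * iqrSigma]

result = sortoutOutliers([2, 4, 5, 1, 6, 5, 40], 3)

print(round(iqr_in_sigma, 4))
print(result)
```

For this data the quartiles are 3.0 and 5.5, so the sigma estimate is about 1.85 and a factor of 3 keeps everything within roughly 5.6 of the median.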

Reference URL: https://stackoverflow.com/questions/11686720/is-there-a-numpy-builtin-to-reject-outliers-from-a-list
