Finding an element's index in a pandas Series

crosscheck 2020. 7. 17. 19:16

I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a specific element of a Series in Python pandas? (The first occurrence would suffice.)

That is, I'd like something like:

import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
print(myseries.find(7))  # desired behavior (Series has no such method): should output 3

Certainly, such a method can be defined with a loop:

def find(s, el):
    for i in s.index:
        if s[i] == el: 
            return i
    return None

print(find(myseries, 7))

But I assume there should be a better way. Is there?


>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

I admit there should be a better way to do this, but it at least avoids iterating and looping over the object in Python and moves the work to the C level.
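
The boolean-mask idea above can be wrapped so that a missing value does not raise an IndexError. This is only a sketch; `find_label` and its `default` parameter are hypothetical names, not part of pandas.

```python
import pandas as pd

def find_label(s, value, default=None):
    """Return the index label of the first element equal to value (hypothetical helper)."""
    # Vectorized comparison; no Python-level loop over the elements.
    mask = (s == value).to_numpy()
    matches = s.index[mask]
    return matches[0] if len(matches) > 0 else default

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(find_label(myseries, 7))   # label of the first 7
print(find_label(myseries, 10))  # falls back to default when the value is absent
```

Returning a default rather than raising is a design choice; raising KeyError would be equally reasonable if a missing value indicates a bug.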


If you convert it to an Index, you can use get_loc

In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])

In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3

In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10

Handling duplicates

In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)

If the duplicates are not contiguous, it returns a boolean array

In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False,  True, False, False,  True, False], dtype=bool)
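
Because get_loc can return an integer, a slice, or a boolean array depending on the data, callers may want to normalize the result. The sketch below is one way to do that; `positions_of` is a hypothetical helper name. Note that get_loc reports positions within the values, which coincide with the Series labels in the first example only because its index is 0..4; the last lines map positions back to labels.

```python
import numpy as np
import pandas as pd

def positions_of(values, target):
    """Return all integer positions of target as a 1-D array (hypothetical helper)."""
    idx = pd.Index(values)
    loc = idx.get_loc(target)  # int, slice, or boolean mask; KeyError if absent
    # Indexing a range of positions with loc handles all three return types.
    return np.atleast_1d(np.arange(len(idx))[loc])

s = pd.Series([1, 1, 2, 1, 3, 2, 4], index=list("abcdefg"))
pos = positions_of(s, 2)
labels = s.index[pos]  # translate positions into index labels
print(pos.tolist(), labels.tolist())
```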

It uses a hash table internally, so it is very fast.

In [7]: s = pd.Series(np.random.randint(0,10,10000))

In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop

In [12]: i = pd.Index(s)

In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop

As Viktor points out, creating an Index carries a one-time overhead (it is incurred when you actually do something with the index, e.g. is_unique)

In [2]: s = pd.Series(np.random.randint(0,10,10000))

In [3]: %timeit pd.Index(s)
100000 loops, best of 3: 9.6 µs per loop

In [4]: %timeit pd.Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
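
Given that overhead, it may pay to build the Index once and reuse it when many values are looked up. A minimal sketch, assuming the values are unique so that get_loc returns a plain integer position; the names `lookup` and `wanted` are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=list("wxyz"))

lookup = pd.Index(s)   # built once, outside the loop
wanted = [20, 40, 10]

# Each get_loc call reuses the same Index object instead of rebuilding it.
labels = [s.index[lookup.get_loc(v)] for v in wanted]
print(labels)
```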

In [92]: (myseries==7).argmax()
Out[92]: 3

This works if you know 7 is there in advance. You can check this with (myseries==7).any()
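
The caveat matters because argmax on an all-False mask returns 0, which is indistinguishable from a match at the first position. A hedged sketch of a guarded version follows; `first_label` is a hypothetical name, and the NumPy array's argmax is used so that the result is a position regardless of pandas version.

```python
import pandas as pd

def first_label(s, value):
    """Return the label of the first match, or None if value is absent (hypothetical helper)."""
    mask = (s == value).to_numpy()
    if not mask.any():
        return None
    # argmax of a boolean array is the position of the first True.
    return s.index[mask.argmax()]

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])
print(first_label(myseries, 7))
print(first_label(myseries, 10))
```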

Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is

In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']

Another way to do this, although equally unsatisfying, is:

s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])

list(s).index(7)

returns: 3

Timing tests on a dataset I'm currently working with (consider it random):

In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop

In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop


In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop

If you use numpy, you can get an array of the indices where your value is found:

import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)

This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:

(array([3], dtype=int64),)
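
np.where reports positions, not index labels, so with a non-default index the result needs one more step. A small sketch using a string-labelled series; `positions` and `labels` are just illustrative variable names.

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=['a', 'b', 'c', 'd', 'e'])

# Take the first (and only) array out of the tuple returned by np.where.
positions = np.where(myseries.to_numpy() == 7)[0]

# Positional lookup on the index converts positions into labels.
labels = myseries.index[positions]
print(positions.tolist(), labels.tolist())
```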

You can use Series.idxmax(), provided the value you are looking for is the maximum of the series:

>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>> 
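
Series.idxmax() returns the label of the largest value, so the call above finds 7 only because 7 is the maximum of that series. To search for an arbitrary value, idxmax can be applied to the boolean mask instead. This is a sketch under the assumption that the value is present (an all-False mask would still return the first label); the series here is altered so that 7 is no longer the maximum.

```python
import pandas as pd

myseries = pd.Series([1, 4, 0, 9, 7], index=[0, 1, 2, 3, 4])

label_of_max = myseries.idxmax()        # label of the largest value (9)
label_of_7 = (myseries == 7).idxmax()   # label of the first True in the mask

print(label_of_max, label_of_7)
```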

Reference Viktor Kerkez (Aug 20 '13 at 5:52) Jonathan Eunice (Nov 7 '16 at 14:03)

>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3

Using index[0] returns the first occurrence only. Using .index without a positional subscript gives the index labels of all occurrences if the series has more than one 7. Either way, it presumes you know which number you are looking for.

Often your value occurs at multiple indices:

>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')

I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.

Here are the speed tests on a 2013 MacBook Pro in Python 3.7.

In [1]: import pandas as pd                                                

In [2]: import numpy as np                                                 

In [3]: data = [406400, 203200, 101600,  76100,  50800,  25400,  19050,  12700, 
   ...:          9500,   6700,   4750,   3350,   2360,   1700,   1180,    850, 
   ...:           600,    425,    300,    212,    150,    106,     75,     53, 
   ...:            38]                                                                               

In [4]: myseries = pd.Series(data, index=range(1,26))                                                

In [5]: myseries[21]                                                                                 
Out[5]: 150

In [7]: %timeit myseries[myseries == 150].index[0]                                                   
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit myseries[myseries == 150].first_valid_index()                                        
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: %timeit myseries.where(myseries == 150).first_valid_index()                                  
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]                                     
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]                                    
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: %timeit myseries.index[list(myseries).index(150)]
14.1 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@Jeff's answer seems to be the fastest - although it doesn't handle duplicates.

Correction: Sorry, I missed one, @Alex Spangher's solution using the list index method is by far the fastest.

Hope this helps.
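
For anyone wanting to rerun the comparison on their own machine, the sketch below uses the standard timeit module on the same series. The `candidates` dict and the loop counts are arbitrary choices for illustration, and no particular ordering of results is implied; numbers will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
        9500, 6700, 4750, 3350, 2360, 1700, 1180, 850,
        600, 425, 300, 212, 150, 106, 75, 53, 38]
myseries = pd.Series(data, index=range(1, 26))

# Each candidate returns the index label whose value is 150.
candidates = {
    "boolean mask": lambda: myseries[myseries == 150].index[0],
    "np.where": lambda: myseries.index[np.where(myseries == 150)[0][0]],
    "Index.get_loc": lambda: myseries.index[pd.Index(myseries).get_loc(150)],
    "list.index": lambda: myseries.index[list(myseries).index(150)],
}

# Sanity check first: every method should agree on the answer.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    best = min(timeit.repeat(fn, number=200, repeat=3)) / 200
    print(f"{name:15s} {best * 1e6:8.1f} us per call")
```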

Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.

Reference URL: https://stackoverflow.com/questions/18327624/find-elements-index-in-pandas-series
