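Find element's index in pandas Series
I know this is a very basic question, but for some reason I can't find an answer. How can I get the index of a certain element of a Series in python pandas? (The first occurrence would suffice.)
That is, I'd like something like: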
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
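print(myseries.find(7))  # should output 3
Certainly, it is possible to define such a method with a loop:
def find(s, el):
    # Walk the index labels and return the first one whose value matches
    for i in s.index:
        if s[i] == el:
            return i
    return None
print(find(myseries, 7))
but I assume there should be a better way. Is there?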
>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index[0]
3
Though I admit that there should be a better way to do that, this at least avoids iterating and looping through the object and moves the work to the C level.
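The boolean-mask lookup above raises an IndexError on `.index[0]` when the value is absent. A minimal sketch of a wrapper that returns None instead, assuming that is the behavior you want; the name `first_index_of` is hypothetical and not part of pandas:

```python
import pandas as pd

def first_index_of(series, value):
    """Return the index label of the first element equal to value, or None."""
    # Boolean mask selects the labels whose values match
    matches = series.index[series == value]
    return matches[0] if len(matches) > 0 else None

myseries = pd.Series([1, 4, 0, 7, 5], index=[0, 1, 2, 3, 4])
print(first_index_of(myseries, 7))   # 3
print(first_index_of(myseries, 10))  # None
```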
Converting to an Index, you can use get_loc
In [1]: myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
In [3]: pd.Index(myseries).get_loc(7)
Out[3]: 3
In [4]: pd.Index(myseries).get_loc(10)
KeyError: 10
Duplicate handling
In [5]: pd.Index([1,1,2,2,3,4]).get_loc(2)
Out[5]: slice(2, 4, None)
It returns a boolean array if the duplicates are not contiguous
In [6]: pd.Index([1,1,2,1,3,2,4]).get_loc(2)
Out[6]: array([False, False, True, False, False, True, False], dtype=bool)
It uses a hash table internally, so it is very fast.
In [7]: s = pd.Series(np.random.randint(0,10,10000))
In [9]: %timeit s[s == 5]
1000 loops, best of 3: 203 µs per loop
In [12]: i = pd.Index(s)
In [13]: %timeit i.get_loc(5)
1000 loops, best of 3: 226 µs per loop
As Viktor points out, creating an index has a one-time overhead (it is incurred when you actually do something with the index, e.g. is_unique)
In [2]: s = pd.Series(np.random.randint(0,10,10000))
In [3]: %timeit pd.Index(s)
100000 loops, best of 3: 9.6 µs per loop
In [4]: %timeit pd.Index(s).is_unique
10000 loops, best of 3: 140 µs per loop
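Because get_loc can return an integer, a slice, or a boolean array depending on how the duplicates are laid out, callers usually need to normalize the result. A rough sketch of one way to do that, assuming a list of integer positions is the desired output; `positions_of` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def positions_of(values, target):
    """Return a list of integer positions where target occurs (empty if absent)."""
    try:
        loc = pd.Index(values).get_loc(target)
    except KeyError:
        return []
    if isinstance(loc, slice):
        # Contiguous duplicates come back as a slice
        return list(range(*loc.indices(len(values))))
    if isinstance(loc, np.ndarray):
        # Non-contiguous duplicates come back as a boolean mask
        return np.flatnonzero(loc).tolist()
    # A single match comes back as a plain integer
    return [int(loc)]

print(positions_of([1, 4, 0, 7, 5], 7))        # [3]
print(positions_of([1, 1, 2, 2, 3, 4], 2))     # [2, 3]
print(positions_of([1, 1, 2, 1, 3, 2, 4], 2))  # [2, 5]
```

These are positions in the values, so with a non-default index you would still map them through `series.index[...]` to get labels.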
In [92]: (myseries==7).argmax()
Out[92]: 3
This works if you know 7 is there in advance. You can check this with (myseries==7).any()
Another approach (very similar to the first answer) that also accounts for multiple 7's (or none) is
In [122]: myseries = pd.Series([1,7,0,7,5], index=['a','b','c','d','e'])
In [123]: list(myseries[myseries==7].index)
Out[123]: ['b', 'd']
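For the argmax approach, note that an all-False mask still yields the first element, so the any() check matters. The sketch below combines the two; it swaps in idxmax (which returns the index label) because in recent pandas releases argmax returns an integer position rather than a label. The function name `first_label_of` is hypothetical:

```python
import pandas as pd

def first_label_of(series, value):
    """Return the label of the first match, or None if value is absent."""
    mask = series == value
    if not mask.any():
        # Without this guard, idxmax on an all-False mask would
        # silently return the first label
        return None
    return mask.idxmax()

myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])
print(first_label_of(myseries, 7))   # b
print(first_label_of(myseries, 10))  # None
```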
Another way to do this, although equally unsatisfying is:
s = pd.Series([1,3,0,7,5],index=[0,1,2,3,4])
list(s).index(7)
returns: 3
On time tests using a current dataset I'm working with (consider it random):
In [64]: %timeit pd.Index(article_reference_df.asset_id).get_loc('100000003003614')
10000 loops, best of 3: 60.1 µs per loop
In [66]: %timeit article_reference_df.asset_id[article_reference_df.asset_id == '100000003003614'].index[0]
1000 loops, best of 3: 255 µs per loop
In [65]: %timeit list(article_reference_df.asset_id).index('100000003003614')
100000 loops, best of 3: 14.5 µs per loop
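Two caveats with the list approach: list.index returns a position rather than an index label, and it raises ValueError when the value is missing. A small sketch that handles both, assuming None is an acceptable "not found" result; `first_label_via_list` is a hypothetical name:

```python
import pandas as pd

def first_label_via_list(series, value):
    """Find value with list.index and map the position back to its label."""
    try:
        pos = series.tolist().index(value)
    except ValueError:
        return None
    return series.index[pos]

s = pd.Series([1, 3, 0, 7, 5], index=[10, 11, 12, 13, 14])
print(first_label_via_list(s, 7))  # 13
print(first_label_via_list(s, 9))  # None
```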
If you use numpy, you can get an array of the indices where your value is found:
import numpy as np
import pandas as pd
myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
np.where(myseries == 7)
This returns a one-element tuple containing an array of the indices where 7 is the value in myseries:
(array([3], dtype=int64),)
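The array from np.where holds integer positions, which only coincide with the index labels when the index is 0, 1, 2, and so on. A short sketch, using an illustrative string index, of mapping those positions back to labels:

```python
import numpy as np
import pandas as pd

myseries = pd.Series([1, 7, 0, 7, 5], index=["a", "b", "c", "d", "e"])

# np.where returns a tuple; element 0 holds the matching positions
positions = np.where(myseries == 7)[0]
# Look the positions up in the index to recover the labels
labels = myseries.index[positions]

print(positions.tolist())  # [1, 3]
print(labels.tolist())     # ['b', 'd']
```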
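You can use Series.idxmax(). Note that it returns the index label of the maximum value, so it only answers the question here because 7 happens to be the largest element: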
>>> import pandas as pd
>>> myseries = pd.Series([1,4,0,7,5], index=[0,1,2,3,4])
>>> myseries.idxmax()
3
>>>
Reference Viktor Kerkez (Aug 20 '13 at 5:52) Jonathan Eunice (Nov 7 '16 at 14:03)
>>> myseries[myseries == 7]
3    7
dtype: int64
>>> myseries[myseries == 7].index
Int64Index([3], dtype='int64')
Using index[0] returns the first occurrence only. Using index without a subscript gives you the index labels of all occurrences if the series has more than one 7. It still presumes you know which number you are looking for.
Often your value occurs at multiple indices:
>>> myseries = pd.Series([0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1])
>>> myseries.index[myseries == 1]
Int64Index([3, 4, 5, 6, 10, 11], dtype='int64')
I'm impressed with all the answers here. This is not a new answer, just an attempt to summarize the timings of all these methods. I considered the case of a series with 25 elements and assumed the general case where the index could contain any values and you want the index value corresponding to the search value which is towards the end of the series.
Here are the speed tests on a 2013 MacBook Pro in Python 3.7.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: data = [406400, 203200, 101600, 76100, 50800, 25400, 19050, 12700,
...: 9500, 6700, 4750, 3350, 2360, 1700, 1180, 850,
...: 600, 425, 300, 212, 150, 106, 75, 53,
...: 38]
In [4]: myseries = pd.Series(data, index=range(1,26))
In [5]: myseries[21]
Out[5]: 150
In [7]: %timeit myseries[myseries == 150].index[0]
416 µs ± 5.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: %timeit myseries[myseries == 150].first_valid_index()
585 µs ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: %timeit myseries.where(myseries == 150).first_valid_index()
652 µs ± 23.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [10]: %timeit myseries.index[np.where(myseries == 150)[0][0]]
195 µs ± 1.18 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [11]: %timeit myseries.index[pd.Index(myseries).get_loc(150)]
77.4 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [12]: %timeit myseries.index[list(myseries).index(150)]
14.1 µs ± 42.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
@Jeff's answer seems to be the fastest - although it doesn't handle duplicates.
Correction: Sorry, I missed one, @Alex Spangher's solution using the list index method is by far the fastest.
Hope this helps.
Amazing that such a simple operation requires such convoluted solutions and many are so slow. Over half a millisecond in some cases to find a value in a series of 25.
Reference URL: https://stackoverflow.com/questions/18327624/find-elements-index-in-pandas-series