Is gcc's std::unordered_map implementation slow? If so - why?
We are developing highly performance-critical software in C++. There we need a concurrent hash map, and we implemented one. So we wrote a benchmark to figure out how much slower our concurrent hash map is compared with std::unordered_map.
But std::unordered_map seems to be incredibly slow... So this is our micro-benchmark (for the concurrent map we spawned a new thread to make sure that the locking does not get optimized away, and note that I never insert 0 because I also benchmark with google::dense_hash_map, which needs a null value):
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <limits>
#include <unordered_map>
#include <vector>
#include <boost/random.hpp>

// SIZE is defined in the full source (see the pastebin link below)

boost::random::mt19937 rng;
// the distribution must be instantiated for uint64_t; the default
// template argument (int) would silently truncate the range
boost::random::uniform_int_distribution<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                                       std::numeric_limits<uint64_t>::max());
std::vector<uint64_t> vec(SIZE);
for (int i = 0; i < SIZE; ++i) {
    uint64_t val = 0;
    while (val == 0) {   // 0 is reserved as the empty key for google::dense_hash_map
        val = dist(rng);
    }
    vec[i] = val;
}
std::unordered_map<uint64_t, long double> map;   // keys are 64-bit, so the key type must be uint64_t
auto begin = std::chrono::high_resolution_clock::now();
for (int i = 0; i < SIZE; ++i) {
    map[vec[i]] = 0.0;
}
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "inserts: " << elapsed.count() << std::endl;
std::random_shuffle(vec.begin(), vec.end());
begin = std::chrono::high_resolution_clock::now();
long double val;
for (int i = 0; i < SIZE; ++i) {
    val = map[vec[i]];
}
end = std::chrono::high_resolution_clock::now();
elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin);
std::cout << "get: " << elapsed.count() << std::endl;
(EDIT: the full source code can be found here: http://pastebin.com/vPqf7eya )
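For the concurrent-map runs, the driver looked roughly like the following sketch (run_benchmark is a hypothetical stand-in for the loops above; the only point is that the work happens on a separately spawned thread, so the locking inside the map is really exercised):

#include <thread>

void run_benchmark();  // hypothetical: the insert/get loops shown above

void run_on_spawned_thread() {
    std::thread worker(run_benchmark);  // fresh thread, so the locking cannot be optimized away
    worker.join();                      // wait for the benchmark to finish
}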
The results for std::unordered_map are:
inserts: 35126
get: 2959
For google::dense_hash_map:
inserts: 3653
get: 816
For our hand-backed concurrent map (which does locking, although the benchmark is single-threaded - but in a separately spawned thread):
inserts: 5213
get: 2594
If I compile the benchmark program without pthread support and run everything in the main thread, I get the following results for the hand-backed concurrent map:
inserts: 4441
get: 1180
I compile with the following command:
g++-4.7 -O3 -DNDEBUG -I/tmp/benchmap/sparsehash-2.0.2/src/ -std=c++11 -pthread main.cc
So especially the inserts seem to be extremely expensive with std::unordered_map - 35 seconds vs. 3-5 seconds for the other maps. Also the lookup time seems to be quite high.
My question: why is this? I read another question on stackoverflow where someone asks why std::tr1::unordered_map is slower than his own implementation. There the highest rated answer explains that std::tr1::unordered_map needs to implement a more complicated interface. But I cannot see this argument: we use a bucket approach in our concurrent_map, and std::unordered_map uses a bucket approach too (google::dense_hash_map does not, but std::unordered_map should be at least as fast as our hand-backed concurrency-safe version?). Apart from that, I cannot see anything in the interface that forces a feature which makes the hash map perform badly.
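To make the bucket-vs-dense distinction concrete, here is a minimal sketch of the two lookup loops (heavily simplified: the real dense_hash_map uses quadratic probing, and libstdc++ indexes with a prime modulus instead of a mask):

#include <cstddef>
#include <cstdint>
#include <vector>

// Chaining (the std::unordered_map approach): each bucket heads a linked
// list of heap-allocated nodes, so every collision costs a pointer chase.
struct Node { uint64_t key; long double value; Node* next; };

Node* find_chained(const std::vector<Node*>& buckets, uint64_t key) {
    Node* n = buckets[key & (buckets.size() - 1)];
    while (n && n->key != key) n = n->next;  // walk the collision chain
    return n;
}

// Open addressing (the google::dense_hash_map approach): collisions probe
// within one flat array, which is much more cache-friendly. The reserved
// empty key is exactly why the benchmark never inserts 0.
struct Slot { uint64_t key; long double value; };  // key == 0 marks an empty slot

Slot* find_open(std::vector<Slot>& slots, uint64_t key) {
    std::size_t i = key & (slots.size() - 1);
    while (slots[i].key != 0) {
        if (slots[i].key == key) return &slots[i];
        i = (i + 1) & (slots.size() - 1);  // linear probe to the next slot
    }
    return nullptr;  // reached an empty slot: key is not present
}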
So my question: is it true that std::unordered_map seems to be very slow? If no: what is wrong? If yes: what is the reason for that?
And my main question: why is inserting a value into a std::unordered_map so terribly expensive (even if we reserve enough space at the beginning, it does not perform much better - so rehashing does not seem to be the problem)?
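For reference, "reserving enough space" means a sketch like this (reusing the SIZE constant from the benchmark; note that reserve() takes an element count, while the constructor argument is a bucket-count hint):

std::unordered_map<uint64_t, long double> map;
map.reserve(SIZE);  // bucket array sized for SIZE elements: no rehash during the fill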
EDIT:
First of all: yes, the presented benchmark is not flawless - this is because we played around a lot with it and it is just a hack (for example, using a uint64 distribution to generate int keys would in practice not be a good idea, excluding 0 in a loop is kind of stupid, etc...).
At the moment most comments explain that I can make the unordered_map faster by preallocating enough space for it. In our application this is just not possible: we are developing a database management system and need a hash map to store some data during a transaction (for example, locking information). So this map can hold anything from 1 entry (the user just makes one insert and commits) to billions of entries (if full table scans happen). It is just impossible to preallocate enough space here (and just allocating a lot at the beginning will consume too much memory).
Furthermore, I apologize that I did not state my question clearly enough: I am not really interested in making unordered_map fast (using Google's dense hash map works fine for us), I just don't really understand where this huge performance difference comes from. It cannot be just preallocation (even with enough preallocated memory, the dense map is an order of magnitude faster than unordered_map, and our hand-backed concurrent map starts with an array of size 64 - so a smaller one than unordered_map).
So what is the reason for this bad performance of std::unordered_map? Or asked differently: could one write an implementation of the std::unordered_map interface which is standard-conforming and (nearly) as fast as Google's dense hash map? Or is there something in the standard that forces the implementer to choose an inefficient way to implement it?
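To illustrate what such a standard-imposed constraint could look like: C++11 guarantees that rehashing invalidates iterators but not pointers or references to elements, which more or less forces a node-based chaining layout and rules out a flat, dense_hash_map-style table behind this interface. A minimal sketch of that guarantee:

#include <cstdint>
#include <unordered_map>

void reference_stability() {
    std::unordered_map<uint64_t, long double> m;
    m[1] = 1.0;
    long double& ref = m[1];  // reference into the container
    m.rehash(1 << 20);        // force a rehash
    ref = 2.0;                // still valid: rehashing must not move elements
}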
EDIT 2:
By profiling I see that a lot of time is spent on integer divisions. std::unordered_map uses prime numbers for the array size, while the other implementations use powers of two. Why does std::unordered_map use prime numbers? To perform better if the hash is bad? For good hashes it IMHO makes no difference.
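The cost difference is visible directly in the index computation; a sketch (a 64-bit hardware division costs on the order of tens of cycles, the mask a single cycle):

#include <cstddef>
#include <cstdint>

// Prime-sized table (the libstdc++ choice): reducing a hash to a bucket
// index needs a real integer division, since bucket_count is a runtime value.
std::size_t index_prime(uint64_t hash, std::size_t bucket_count) {
    return hash % bucket_count;  // hardware 64-bit division
}

// Power-of-two table (the dense_hash_map choice): the same reduction is a
// single AND instruction.
std::size_t index_pow2(uint64_t hash, std::size_t bucket_count) {
    return hash & (bucket_count - 1);  // bit mask, no division
}

The usual rationale for the prime is indeed bad hashes: a mask keeps only the low bits, so any regularity there turns into collisions, while reduction modulo a prime mixes in all bits of the hash.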
EDIT 3:
These are the numbers for std::map:
inserts: 16462
get: 16978
Sooooooo: why are inserts into a std::map faster than inserts into a std::unordered_map... I mean WAT? std::map has worse locality (tree vs. array), needs to make more allocations (one per insert vs. one per rehash plus ~1 for each collision) and, most important, has a different algorithmic complexity (O(log n) vs. O(1))!
I found the reason: it is a problem of gcc-4.7!!
With gcc-4.7:
inserts: 37728
get: 2985
With gcc-4.6:
inserts: 2531
get: 1565
So std::unordered_map in gcc-4.7 is broken (or my installation, which is an installation of gcc-4.7.0 on Ubuntu - and another installation which is gcc 4.7.1 on Debian testing).
I will submit a bug report... until then: DO NOT use std::unordered_map with gcc 4.7!
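Until then, a compile-time guard is cheap to add (a sketch using GCC's standard version macros):

#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ == 4 && __GNUC_MINOR__ == 7
#error "std::unordered_map is pathologically slow with gcc 4.7 - see above"
#endif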
I am guessing that you have not properly sized your unordered_map, as Ylisar suggested. When chains grow too long in unordered_map, the g++ implementation will automatically rehash to a larger hash table, and this would be a big drag on performance. If I remember correctly, unordered_map defaults to (smallest prime larger than) 100.
I didn't have chrono on my system, so I timed with times().
#include <sys/times.h>  // times(), struct tms
#include <unistd.h>     // sysconf(), _SC_CLK_TCK
#include <iostream>

template <typename TEST>
void time_test (TEST t, const char *m) {
    struct tms start;
    struct tms finish;
    long ticks_per_second;

    times(&start);
    t();
    times(&finish);
    ticks_per_second = sysconf(_SC_CLK_TCK);
    std::cout << "elapsed: "
              << ((finish.tms_utime - start.tms_utime
                   + finish.tms_stime - start.tms_stime)
                  / (1.0 * ticks_per_second))
              << " " << m << std::endl;
}
I used a SIZE of 10000000, and had to change things a bit for my version of boost. Also note, I pre-sized the hash table to match SIZE/DEPTH, where DEPTH is an estimate of the length of the bucket chain due to hash collisions.
Edit: Howard points out to me in comments that the max load factor for unordered_map is 1. So, DEPTH controls how many times the code will rehash.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <vector>
#include <boost/random.hpp>

#define SIZE 10000000
#define DEPTH 3

std::vector<uint64_t> vec(SIZE);
boost::mt19937 rng;
boost::uniform_int<uint64_t> dist(std::numeric_limits<uint64_t>::min(),
                                  std::numeric_limits<uint64_t>::max());
std::unordered_map<uint64_t, long double> map(SIZE / DEPTH);  // pre-sized bucket hint

void
test_insert () {
    for (int i = 0; i < SIZE; ++i) {
        map[vec[i]] = 0.0;
    }
}

void
test_get () {
    long double val;
    for (int i = 0; i < SIZE; ++i) {
        val = map[vec[i]];
    }
}

int main () {
    for (int i = 0; i < SIZE; ++i) {
        uint64_t val = 0;
        while (val == 0) {      // keep 0 out, as in the original benchmark
            val = dist(rng);
        }
        vec[i] = val;
    }
    time_test(test_insert, "inserts");
    std::random_shuffle(vec.begin(), vec.end());
    time_test(test_get, "get");
}
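As a side note, the rehashes themselves are easy to watch, since bucket_count() jumps every time the table grows. A small sketch (not part of the timed benchmark, reusing the SIZE and DEPTH macros from above):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Fill a pre-sized table and print every rehash. With the default
// max_load_factor() of 1.0, a table starting at SIZE/DEPTH buckets grows
// roughly log2(DEPTH) times while absorbing SIZE insertions.
int main() {
    std::unordered_map<uint64_t, long double> m(SIZE / DEPTH);
    std::size_t buckets = m.bucket_count();
    for (uint64_t i = 1; i <= SIZE; ++i) {
        m[i] = 0.0;
        if (m.bucket_count() != buckets) {
            buckets = m.bucket_count();
            std::cout << "rehash at size " << m.size()
                      << " -> " << buckets << " buckets\n";
        }
    }
}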
Edit: I modified the code so that I could change out DEPTH more easily.
#ifndef DEPTH
#define DEPTH 10000000
#endif
So, by default, the worst size for the hash table is chosen.
elapsed: 7.12 inserts, elapsed: 2.32 get, -DDEPTH=10000000
elapsed: 6.99 inserts, elapsed: 2.58 get, -DDEPTH=1000000
elapsed: 8.94 inserts, elapsed: 2.18 get, -DDEPTH=100000
elapsed: 5.23 inserts, elapsed: 2.41 get, -DDEPTH=10000
elapsed: 5.35 inserts, elapsed: 2.55 get, -DDEPTH=1000
elapsed: 6.29 inserts, elapsed: 2.05 get, -DDEPTH=100
elapsed: 6.76 inserts, elapsed: 2.03 get, -DDEPTH=10
elapsed: 2.86 inserts, elapsed: 2.29 get, -DDEPTH=1
My conclusion is that there is not much significant performance difference for any initial hash table size other than making it equal to the entire expected number of unique insertions. Also, I don't see the order of magnitude performance difference that you are observing.
I have run your code on a 64-bit / AMD / 4-core (2.1 GHz) machine and it gave me the following results:
MinGW-W64 4.9.2:
Using std::unordered_map:
inserts: 9280
get: 3302
Using std::map:
inserts: 23946
get: 24824
VC 2015 with all the optimization flags I know:
Using std::unordered_map:
inserts: 7289
get: 1908
Using std::map:
inserts: 19222
get: 19711
I have not tested the code using GCC, but I think it may be comparable to the performance of VC, so if that is true, then GCC 4.9's std::unordered_map is still broken.
[EDIT]
So yes, as someone said in the comments, there is no reason to think that the performance of GCC 4.9.x would be comparable to VC's performance. When I have the chance I will test the code on GCC.
My answer is just to establish some kind of knowledge base for the other answers.
Source: https://stackoverflow.com/questions/11614106/is-gcc-stdunordered-map-implementation-slow-if-so-why