
A dramatic improvement with GCC 5.4.0

crosscheck 2020. 5. 25. 20:59



I had a function that looked like this (only the important parts are shown):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) && (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

Written this way, the function took ~34 ms on my machine. After changing the condition to a boolean multiplication (the code shown below):

double CompareShifted(const std::vector<uint16_t>& l, const std::vector<uint16_t> &curr, int shift, int shiftY)  {
...
  for(std::size_t i=std::max(0,-shift);i<max;i++) {
     if ((curr[i] < 479) * (l[i + shift] < 479)) {
       nontopOverlap++;
     }
     ...
  }
...
}

The execution time dropped to ~19 ms.

The compiler used was GCC 5.4.0 with -O3, and after checking the generated asm code on godbolt.org I found that the first example produces a jump while the second one does not. I also tried GCC 6.2.0, which still generates a jump instruction for the first example, but GCC 7 seems to no longer generate one.

Finding this way of speeding up the code was rather tedious and time-consuming. Why does the compiler behave this way? Is it intentional, and something programmers should watch out for? Are there more things like this?

Edit: link to godbolt: https://godbolt.org/g/5lKPF3


The logical AND operator (&&) uses short-circuit evaluation, which means that the second test is only done if the first comparison evaluates to true. This is often exactly the semantics that you require. For example, consider the following code:

if ((p != nullptr) && (p->first > 0))

We have to make sure that the pointer is non-null before we dereference it. If this were not short-circuit evaluation, we would have undefined behavior, because we would be dereferencing a null pointer.

Short-circuit evaluation can also yield a performance gain when evaluating the conditions is an expensive process. For example:

if (DoLengthyCheck1(p) && DoLengthyCheck2(p))

If DoLengthyCheck1 fails, there is no point in calling DoLengthyCheck2.

In the resulting binary, however, short-circuit operations often result in two branches, since this is the easiest way for the compiler to preserve these semantics. (Which is why, on the other side of the coin, short-circuit evaluation can sometimes inhibit optimization.) You can see this by looking at the relevant portion of the object code generated for your if statement by GCC 5.4:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L5

    cmp     ax, 478           ; (l[i + shift] < 479)
    ja      .L5

    add     r8d, 1            ; nontopOverlap++

What you see here are two comparisons (cmp instructions), each followed by a separate conditional jump/branch (ja, or jump if above).

The general rule of thumb is that branches are slow and are therefore to be avoided in tight loops. This has been true on virtually all x86 processors, from the humble 8088 (whose slow fetch times and extremely small prefetch queue [comparable to an instruction cache], combined with an utter lack of branch prediction, meant that taken branches required the cache to be dumped) to modern implementations (whose long pipelines make mispredicted branches similarly expensive). Note the little caveat that I slipped in there. Modern processors since the Pentium Pro have advanced branch prediction engines that are designed to minimize the cost of branches. If the direction of the branch can be properly predicted, the cost is minimal. Most of the time, this works well, but if you get into pathological cases where the branch predictor is not on your side, your code can get extremely slow. This is presumably where you are here, since you say that your array is unsorted.

You say that you confirmed with benchmarks that replacing && with * makes the code noticeably faster. The reason is evident when we compare the relevant portion of the object code:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    xor     r15d, r15d        ; (curr[i] < 479)
    cmp     r13w, 478
    setbe   r15b

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     ax, 478
    setbe   r14b

    imul    r14d, r15d        ; meld results of the two comparisons

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

It is a bit counter-intuitive that this could be faster, since there are more instructions here, but that is how optimization works sometimes. You see the same comparisons (cmp) being done here, but now, each is preceded by an xor and followed by a setbe. The XOR is just a standard trick for clearing a register. The setbe is an x86 instruction that sets a bit based on the value of a flag, and is often used to implement branchless code. Here, setbe is the inverse of ja. It sets its destination register to 1 if the comparison was below-or-equal (since the register was pre-zeroed, it will be 0 otherwise), whereas ja branched if the comparison was above. Once these two values have been obtained in the r15b and r14b registers, they are multiplied together using imul. Multiplication was traditionally a relatively slow operation, but it is darn fast on modern processors, and this will be especially fast, because it's only multiplying two byte-sized values.

You could just as easily have replaced the multiplication with the bitwise AND operator (&), which does not do short-circuit evaluation. This makes the code much clearer, and is a pattern that compilers generally recognize. But when you do this with your code and compile it with GCC 5.4, it continues to emit the first branch:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13w, 478         ; (curr[i] < 479)
    ja      .L4

    cmp     ax, 478           ; (l[i + shift] < 479)
    setbe   r14b

    cmp     r14d, 1           ; nontopOverlap++
    sbb     r8d, -1

There is no technical reason it had to emit the code this way, but for some reason, its internal heuristics are telling it that this is faster. It would probably be faster if the branch predictor was on your side, but it will likely be slower if branch prediction fails more often than it succeeds.

Newer generations of the compiler (and other compilers, like Clang) know this rule, and will sometimes use it to generate the same code that you would have sought by hand-optimizing. I regularly see Clang translate && expressions to the same code that would have been emitted if I'd have used &. The following is the relevant output from GCC 6.2 with your code using the normal && operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L7

    xor     r14d, r14d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r14b

    add     esi, r14d         ; nontopOverlap++

Note how clever this is! It is using signed conditions (jg and setle) as opposed to unsigned conditions (ja and setbe), but this isn't important. You can see that it still does the compare-and-branch for the first condition like the older version, and uses the same setCC instruction to generate branchless code for the second condition, but it has gotten a lot more efficient in how it does the increment. Instead of doing a second, redundant comparison to set the flags for a sbb operation, it uses the knowledge that r14d will be either 1 or 0 to simply unconditionally add this value to nontopOverlap. If r14d is 0, then the addition is a no-op; otherwise, it adds 1, exactly like it is supposed to do.

GCC 6.2 actually produces more efficient code when you use the short-circuiting && operator than the bitwise & operator:

    movzx   r13d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r13d, 478         ; (curr[i] < 479)
    jg      .L6

    cmp     eax, 478          ; (l[i + shift] < 479)
    setle   r14b

    cmp     r14b, 1           ; nontopOverlap++
    sbb     esi, -1

The branch and the conditional set are still there, but now it reverts back to the less-clever way of incrementing nontopOverlap. This is an important lesson in why you should be careful when trying to out-clever your compiler!

But if you can prove with benchmarks that the branching code is actually slower, then it may pay to try and out-clever your compiler. You just have to do so with careful inspection of the disassembly—and be prepared to re-evaluate your decisions when you upgrade to a later version of the compiler. For example, the code you have could be rewritten as:

nontopOverlap += ((curr[i] < 479) & (l[i + shift] < 479));

There is no if statement here at all, and the vast majority of compilers will never think about emitting branching code for this. GCC is no exception; all versions generate something akin to the following:

    movzx   r14d, WORD PTR [rbp+rcx*2]
    movzx   eax,  WORD PTR [rbx+rcx*2]

    cmp     r14d, 478         ; (curr[i] < 479)
    setle   r15b

    xor     r13d, r13d        ; (l[i + shift] < 479)
    cmp     eax, 478
    setle   r13b

    and     r13d, r15d        ; meld results of the two comparisons
    add     esi, r13d         ; nontopOverlap++

If you've been following along with the previous examples, this should look very familiar to you. Both comparisons are done in a branchless way, the intermediate results are anded together, and then this result (which will be either 0 or 1) is added to nontopOverlap. If you want branchless code, this will virtually ensure that you get it.

GCC 7 has gotten even smarter. It now generates virtually identical code (excepting some slight rearrangement of instructions) for the above trick as the original code. So, the answer to your question, "Why does the compiler behave this way?", is probably because they're not perfect! They try to use heuristics to generate the most optimal code possible, but they don't always make the best decisions. But at least they can get smarter over time!

One way of looking at this situation is that the branching code has the better best-case performance. If branch prediction is successful, skipping unnecessary operations will result in a slightly faster running time. However, branchless code has the better worst-case performance. If branch prediction fails, executing a few additional instructions as necessary to avoid a branch will definitely be faster than a mispredicted branch. Even the smartest and most clever of compilers will have a hard time making this choice.

And for your question of whether this is something programmers need to watch out for, the answer is almost certainly no, except in certain hot loops that you are trying to speed up via micro-optimizations. Then, you sit down with the disassembly and find ways to tweak it. And, as I said before, be prepared to revisit those decisions when you update to a newer version of the compiler, because it may either do something stupid with your tricky code, or it may have changed its optimization heuristics enough that you can go back to using your original code. Comment thoroughly!


One important thing to note is that

(curr[i] < 479) && (l[i + shift] < 479)

and

(curr[i] < 479) * (l[i + shift] < 479)

are not semantically equivalent! In particular, if you ever have the situation where:

  • 0 <= i and i < curr.size() are both true
  • curr[i] < 479 is false
  • i + shift < 0 or i + shift >= l.size() is true

then the expression (curr[i] < 479) && (l[i + shift] < 479) is guaranteed to be a well-defined boolean value. For example, it does not cause a segmentation fault.

However, under these circumstances, the expression (curr[i] < 479) * (l[i + shift] < 479) is undefined behavior; it is allowed to cause a segmentation fault.

This means that for the original code snippet, for example, the compiler can't just write a loop that performs both comparisons and does an and operation, unless the compiler can also prove that l[i + shift] will never cause a segfault in a situation it's required not to.

In short, the original piece of code offers fewer opportunities for optimization than the modified one. (Of course, whether or not the compiler recognizes the opportunity is an entirely different question.)

You might fix the original version by instead doing

bool t1 = (curr[i] < 479);
bool t2 = (l[i + shift] < 479);
if (t1 && t2) {
    // ...
}

The && operator implements short-circuit evaluation. This means that the second operand is only evaluated if the first one evaluates to true. This certainly results in a jump in that case.

You can create a small example to show this:

#include <iostream>

bool f(int);
bool g(int);

void test(int x, int y)
{
  if ( f(x) && g(x)  )
  {
    std::cout << "ok";
  }
}

The assembler output can be found here.

You can see the generated code first calls f(x), then checks the output and jumps to the evaluation of g(x) when this was true. Otherwise it leaves the function.

Using "boolean" multiplication instead forces the evaluation of both operands every time and thus does not need a jump.

Depending on the data, the jump can cause a slow down because it disturbs the pipeline of the CPU and other things like speculative execution. Normally branch prediction helps, but if your data is random there is not much which can be predicted.


This might be because when you are using the logical operator &&, the compiler has to check two conditions for the if statement to succeed. However, in the second case, since you are implicitly converting an int value to a bool, the compiler makes some assumptions based on the types and values being passed in, along with (possibly) a single jump condition. It is also possible that the compiler completely optimizes away the jmps with bit shifts.

Source: https://stackoverflow.com/questions/40991778/an-expensive-jump-with-gcc-5-4-0
