C ++에서 명령문 순서 적용

Programing

C ++에서 명령문 순서 적용

crosscheck 2020. 8. 18. 07:12

C ++에서 명령문 순서 적용

고정 된 순서로 실행하려는 문이 여러 개 있다고 가정합니다. 최적화 수준 2에서 g ++를 사용하여 일부 명령문을 재정렬 할 수 있습니다. 특정 명령문 순서를 적용하려면 어떤 도구가 필요합니까?

다음 예를 고려하십시오.

using Clock = std::chrono::high_resolution_clock;

auto t1 = Clock::now(); // Statement 1
foo();                  // Statement 2
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;

이 예에서는 문 1-3이 주어진 순서로 실행되는 것이 중요합니다. 그러나 컴파일러는 문 2가 1과 3과 독립적이라고 생각하고 다음과 같이 코드를 실행할 수 없습니까?

using Clock=std::chrono::high_resolution_clock;

foo();                  // Statement 2
auto t1 = Clock::now(); // Statement 1
auto t2 = Clock::now(); // Statement 3

auto elapsedTime = t2 - t1;

C ++ 표준위원회와 논의한 후 좀 더 포괄적 인 답변을 제공하고자합니다. C ++위원회의 일원 일뿐만 아니라 LLVM 및 Clang 컴파일러의 개발자이기도합니다.

기본적으로 이러한 변환을 달성하기 위해 순서에서 장벽이나 일부 작업을 사용할 방법이 없습니다. 근본적인 문제는 정수 덧셈과 같은 동작 의미가 구현에 완전히 알려져 있다는 것입니다. 시뮬레이션 할 수 있고 올바른 프로그램으로 관찰 할 수 없다는 것을 알고 항상 자유롭게 이동할 수 있습니다.

이를 방지하려고 노력할 수는 있지만 매우 부정적인 결과를 낳고 결국 실패 할 것입니다.

첫째, 컴파일러에서이를 방지하는 유일한 방법은 이러한 모든 기본 작업을 관찰 할 수 있음을 알리는 것입니다. 문제는 이것이 컴파일러 최적화의 압도적 인 대부분을 배제한다는 것입니다. 컴파일러 내부에는 타이밍 이 관찰 가능 하다는 것을 모델링 할 수있는 좋은 메커니즘이 기본적으로 없습니다 . 어떤 작업에 시간이 걸리는지에 대한 좋은 모델도 없습니다 . 예를 들어 32 비트 부호없는 정수를 부호없는 64 비트 정수로 변환하는 데 시간이 걸리나요? x86-64에서는 시간이 걸리지 않지만 다른 아키텍처에서는 0이 아닌 시간이 걸립니다. 여기에는 일반적으로 정답이 없습니다.

그러나 컴파일러가 이러한 작업의 순서를 변경하지 못하도록 막는 데 성공하더라도 이것이 충분하다는 보장은 없습니다. x86 머신에서 C ++ 프로그램을 실행하는 유효하고 적합한 방법 인 DynamoRIO를 고려하십시오. 이것은 프로그램의 기계어 코드를 동적으로 평가하는 시스템입니다. 할 수있는 한 가지는 온라인 최적화이며, 타이밍 밖에서 전체 범위의 기본 산술 명령을 추측 적으로 실행할 수도 있습니다. 그리고이 동작은 동적 평가자에게 고유하지 않습니다. 실제 x86 CPU는 명령 (훨씬 적은 수)을 추측하고 동적으로 순서를 변경합니다.

본질적인 깨달음은 (타이밍 수준에서도) 산술을 관찰 할 수 없다는 사실이 컴퓨터의 레이어에 스며든다는 것입니다. 컴파일러, 런타임 및 종종 하드웨어에도 해당됩니다. 관찰 가능하도록 강제하는 것은 컴파일러를 극적으로 제한하지만 하드웨어도 극적으로 제한합니다.

그러나이 모든 것이 여러분이 희망을 잃게해서는 안됩니다. 기본적인 수학 연산의 실행 시간을 정하고 싶을 때 안정적으로 작동하는 기술을 잘 연구했습니다. 일반적으로 마이크로 벤치마킹을 할 때 사용됩니다 . CppCon2015에서 이에 대해 이야기했습니다 : https://youtu.be/nXaxk27zwlk

여기에 표시된 기술은 Google과 같은 다양한 마이크로 벤치 마크 라이브러리에서도 제공됩니다. https://github.com/google/benchmark#preventing-optimisation

이러한 기술의 핵심은 데이터에 집중하는 것입니다. 옵티 마이저에 대해 계산에 대한 입력을 불투명하게 만들고 옵티 마이저에 대해 계산 결과를 불투명하게 만듭니다. 이 작업을 마치면 안정적으로 시간을 측정 할 수 있습니다. 원래 질문에서 실제 버전의 예제를 살펴 보되 foo구현 에 완전히 표시 되는 정의 를 사용합니다. 또한 DoNotOptimize여기에서 찾을 수있는 Google Benchmark 라이브러리에서 (비 휴대용) 버전을 추출했습니다 . https://github.com/google/benchmark/blob/master/include/benchmark/benchmark_api.h#L208

#include <chrono>

template <class T>
__attribute__((always_inline)) inline void DoNotOptimize(const T &value) {
  asm volatile("" : "+m"(const_cast<T &>(value)));
}

// The compiler has full knowledge of the implementation.
static int foo(int x) { return x * 2; }

auto time_foo() {
  using Clock = std::chrono::high_resolution_clock;

  auto input = 42;

  auto t1 = Clock::now();         // Statement 1
  DoNotOptimize(input);
  auto output = foo(input);       // Statement 2
  DoNotOptimize(output);
  auto t2 = Clock::now();         // Statement 3

  return t2 - t1;
}

여기서는 입력 데이터와 출력 데이터가 계산 주변에서 최적화 불가능한 것으로 표시되고 foo해당 마커 주변에서만 계산 된 타이밍이 있는지 확인합니다 . 데이터를 사용하여 계산을 좁히기 때문에 두 타이밍 사이에 머무르는 것이 보장되지만 계산 자체는 최적화 될 수 있습니다. Clang / LLVM의 최근 빌드에서 생성 된 결과 x86-64 어셈블리는 다음과 같습니다.

% ./bin/clang++ -std=c++14 -c -S -o - so.cpp -O3
        .text
        .file   "so.cpp"
        .globl  _Z8time_foov
        .p2align        4, 0x90
        .type   _Z8time_foov,@function
_Z8time_foov:                           # @_Z8time_foov
        .cfi_startproc
# BB#0:                                 # %entry
        pushq   %rbx
.Ltmp0:
        .cfi_def_cfa_offset 16
        subq    $16, %rsp
.Ltmp1:
        .cfi_def_cfa_offset 32
.Ltmp2:
        .cfi_offset %rbx, -16
        movl    $42, 8(%rsp)
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, %rbx
        #APP
        #NO_APP
        movl    8(%rsp), %eax
        addl    %eax, %eax              # This is "foo"!
        movl    %eax, 12(%rsp)
        #APP
        #NO_APP
        callq   _ZNSt6chrono3_V212system_clock3nowEv
        subq    %rbx, %rax
        addq    $16, %rsp
        popq    %rbx
        retq
.Lfunc_end0:
        .size   _Z8time_foov, .Lfunc_end0-_Z8time_foov
        .cfi_endproc


        .ident  "clang version 3.9.0 (trunk 273389) (llvm/trunk 273380)"
        .section        ".note.GNU-stack","",@progbits

여기에서 컴파일러가 호출을 foo(input)단일 명령어로 최적화하는 것을 볼 수 addl %eax, %eax있지만, 타이밍 외부로 이동하거나 상수 입력에도 불구하고 완전히 제거하지 않습니다.

이것이 도움이되기를 바라며, C ++ 표준위원회는 DoNotOptimize여기 와 유사한 API 표준화 가능성을 검토하고 있습니다.

요약:

재정렬을 방지 할 수있는 확실한 방법은없는 것 같지만 링크 타임 / 전체 프로그램 최적화가 활성화되지 않은 한 별도의 컴파일 단위에서 호출 된 함수를 찾는 것이 상당히 좋은 방법 인 것 같습니다 . (적어도 GCC에서는 이것이 다른 컴파일러에서도 가능하다는 것을 논리에서 제안 할 수 있습니다.) 이것은 함수 호출 비용으로 발생합니다. 인라인 코드는 정의에 따라 동일한 컴파일 단위에 있으며 재정렬이 가능합니다.

원래 답변 :

GCC는 -O2 최적화에 따라 호출 순서를 변경합니다.

#include <chrono>
static int foo(int x)    // 'static' or not here doesn't affect ordering.
{
    return x*2;
}
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

GCC 5.3.0 :

g++ -S --std=c++11 -O0 fred.cpp :

_ZL3fooi:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %ecx, 16(%rbp)
        movl    16(%rbp), %eax
        addl    %eax, %eax
        popq    %rbp
        ret
_Z4fredi:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $64, %rsp
        movl    %ecx, 16(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -16(%rbp)
        movl    16(%rbp), %ecx
        call    _ZL3fooi
        movl    %eax, -4(%rbp)
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movq    %rax, -32(%rbp)
        movl    -4(%rbp), %eax
        addq    $64, %rsp
        popq    %rbp
        ret

그러나:

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        call    _ZNSt6chrono3_V212system_clock3nowEv
        leal    (%rbx,%rbx), %eax
        addq    $32, %rsp
        popq    %rbx
        ret

이제 foo ()를 extern 함수로 사용합니다.

#include <chrono>
int foo(int x);
int fred(int x)
{
    auto t1 = std::chrono::high_resolution_clock::now();
    int y = foo(x);
    auto t2 = std::chrono::high_resolution_clock::now();
    return y;
}

g++ -S --std=c++11 -O2 fred.cpp :

_Z4fredi:
        pushq   %rbx
        subq    $32, %rsp
        movl    %ecx, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %ecx
        call    _Z3fooi
        movl    %eax, %ebx
        call    _ZNSt6chrono3_V212system_clock3nowEv
        movl    %ebx, %eax
        addq    $32, %rsp
        popq    %rbx
        ret

그러나 이것이 -flto (링크 시간 최적화)와 연결되어있는 경우 :

0000000100401710 <main>:
   100401710:   53                      push   %rbx
   100401711:   48 83 ec 20             sub    $0x20,%rsp
   100401715:   89 cb                   mov    %ecx,%ebx
   100401717:   e8 e4 ff ff ff          callq  100401700 <__main>
   10040171c:   e8 bf f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401721:   e8 ba f9 ff ff          callq  1004010e0 <_ZNSt6chrono3_V212system_clock3nowEv>
   100401726:   8d 04 1b                lea    (%rbx,%rbx,1),%eax
   100401729:   48 83 c4 20             add    $0x20,%rsp
   10040172d:   5b                      pop    %rbx
   10040172e:   c3                      retq

재정렬은 컴파일러 또는 프로세서에서 수행 할 수 있습니다.

대부분의 컴파일러는 읽기-쓰기 명령어의 재정렬을 방지하기 위해 플랫폼 별 방법을 제공합니다. gcc에서 이것은

asm volatile("" ::: "memory");

( 자세한 내용은 여기 )

이것은 읽기 / 쓰기에 의존하는 한 재정렬 작업을 간접적으로 만 방지합니다.

실제로 나는 시스템 호출이 Clock::now()그러한 장벽과 동일한 효과를 갖는 시스템을 아직 보지 못했습니다 . 결과 어셈블리를 검사하여 확인할 수 있습니다.

그러나 테스트중인 함수가 컴파일 시간 동안 평가되는 것은 드문 일이 아닙니다. "현실적인"실행을 적용하려면 foo()I / O 또는 volatile읽기 에서 입력을 유도해야 할 수 있습니다 .

또 다른 옵션은 inlining을 비활성화하는 것입니다. foo()다시 말하지만 이것은 컴파일러에 따라 다르며 일반적으로 이식성이 없지만 동일한 효과를 갖습니다.

gcc에서 이것은 __attribute__ ((noinline))

@Ruslan은 근본적인 문제를 제기합니다.이 측정이 얼마나 현실적입니까?

Execution time is affected by many factors: one is the actual hardware we are running on, the other is concurrent access to shared resources like cache, memory, disk and CPU cores.

So what we usually do to get comparable timings: make sure they are reproducible with a low error margin. This makes them somewhat artificial.

"hot cache" vs. "cold cache" execution performance can easily differ by an order of magnitude - but in reality, it will be something inbetween ("lukewarm"?)

The C++ language defines what is observable in a number of ways.

If foo() does nothing observable, then it can be eliminated completely. If foo() only does a computation that stores values in "local" state (be it on the stack or in an object somewhere), and the compiler can prove that no safely-derived pointer can get into the Clock::now() code, then there are no observable consequences to moving the Clock::now() calls.

If foo() interacted with a file or the display, and the compiler cannot prove that Clock::now() does not interact with the file or the display, then reordering cannot be done, because interaction with a file or display is observable behavior.

While you can use compiler-specific hacks to force code not to move around (like inline assembly), another approach is to attempt to outsmart your compiler.

Create a dynamically loaded library. Load it prior to the code in question.

That library exposes one thing:

namespace details {
  void execute( void(*)(void*), void *);
}

and wraps it like this:

template<class F>
void execute( F f ) {
  struct bundle_t {
    F f;
  } bundle = {std::forward<F>(f)};

  auto tmp_f = [](void* ptr)->void {
    auto* pb = static_cast<bundle_t*>(ptr);
    (pb->f)();
  };
  details::execute( tmp_f, &bundle );
}

which packs up a nullary lambda and uses the dynamic library to run it in a context that the compiler cannot understand.

Inside the dynamic library, we do:

void details::execute( void(*f)(void*), void *p) {
  f(p);
}

which is pretty simple.

Now to reorder the calls to execute, it must understand the dynamic library, which it cannot while compiling your test code.

It can still eliminate foo()s with zero side effects, but you win some, you lose some.

No it can't. According to the C++ standard [intro.execution]:

14 Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated.

A full-expression is basically a statement terminated by a semicolon. As you can see the above rule stipulates statements must be executed in order. It is within statements that the compiler is allowed more free rein (i.e. it is under some circumstance allowed to evaluate expressions that make up a statement in orders other than left-to-right or anything else specific).

Note the conditions for the as-if rule to apply are not met here. It is unreasonable to think that any compiler would be able to prove that reordering calls to get the system time would not affect observable program behaviour. If there was a circumstance in which two calls to get the time could be reordered without changing observed behaviour, it would be extremely inefficient to actually produce a compiler that analyses a program with enough understanding to be able to infer this with certainty.

No.

Sometimes, by the "as-if" rule, statements may be re-ordered. This is not because they are logically independent of each other, but because that independence allows such a re-ordering to occur without changing the semantics of the program.

Moving a system call that obtains the current time obviously does not satisfy that condition. A compiler that knowingly or unknowingly does so is non-compliant and really silly.

In general, I wouldn't expect any expression that results in a system call to be "second-guessed" by even an aggressively optimizing compiler. It just doesn't know enough about what that system call does.

noinline function + inline assembly black box + full data dependencies

This is based on https://stackoverflow.com/a/38025837/895245 but because I didn't see any clear justification of why the ::now() cannot be reordered there, I would rather be paranoid and put it inside a noinline function together with the asm.

This way I'm pretty sure the reordering cannot happen, since the noinline "ties" the the ::now and the data dependency.

main.cpp

#include <chrono>
#include <iostream>
#include <string>

// noinline ensures that the ::now() cannot be split from the __asm__
template <class T>
__attribute__((noinline)) auto get_clock(T& value) {
    // Make the compiler think we actually use / modify the value.
    // It can't "see" what is going on inside the assembly string.
    __asm__ __volatile__("" : "+m"(value));
    return std::chrono::high_resolution_clock::now();
}

template <class T>
static T foo(T niters) {
    T result = 42;
    for (T i = 0; i < niters; ++i) {
        result = (result * result) - (3 * result) + 1;
    }
    return result;
}

int main(int argc, char **argv) {
    unsigned long long input;
    if (argc > 1) {
        input = std::stoull(argv[1], NULL, 0);
    } else {
        input = 10000;
    }

    // Must come before because it could modify input
    // which is passed as a reference.
    auto t1 = get_clock(input);
    auto output = foo(input);
    // Must come after as it could use the output.
    auto t2 = get_clock(output);
    std::cout << "output " << output << std::endl;
    std::cout << "time "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count()
              << std::endl;
}

Compile and run:

g++ -ggdb3 -O3 -std=c++14 -Wall -Wextra -pedantic -o main.out main.cpp
./main.out 1000
./main.out 10000
./main.out 100000

The only minor downside of this method is that we add one extra callq instruction over an inline method. objdump -CD shows that main contains:

    11b5:       e8 26 03 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>
    11ba:       48 8b 34 24             mov    (%rsp),%rsi
    11be:       48 89 c5                mov    %rax,%rbp
    11c1:       b8 2a 00 00 00          mov    $0x2a,%eax
    11c6:       48 85 f6                test   %rsi,%rsi
    11c9:       74 1a                   je     11e5 <main+0x65>
    11cb:       31 d2                   xor    %edx,%edx
    11cd:       0f 1f 00                nopl   (%rax)
    11d0:       48 8d 48 fd             lea    -0x3(%rax),%rcx
    11d4:       48 83 c2 01             add    $0x1,%rdx
    11d8:       48 0f af c1             imul   %rcx,%rax
    11dc:       48 83 c0 01             add    $0x1,%rax
    11e0:       48 39 d6                cmp    %rdx,%rsi
    11e3:       75 eb                   jne    11d0 <main+0x50>
    11e5:       48 89 df                mov    %rbx,%rdi
    11e8:       48 89 44 24 08          mov    %rax,0x8(%rsp)
    11ed:       e8 ee 02 00 00          callq  14e0 <auto get_clock<unsigned long long>(unsigned long long&)>

so we see that foo was inlined, but get_clock were not and surround it.

get_clock itself however is extremely efficient, consisting of a single leaf call optimized instruction that doesn't even touch the stack:

00000000000014e0 <auto get_clock<unsigned long long>(unsigned long long&)>:
    14e0:       e9 5b fb ff ff          jmpq   1040 <std::chrono::_V2::system_clock::now()@plt>

Since the clock precision is itself limited, I think that is unlikely that you will be able to notice the timing effects of one extra jmpq. Note that one call is required regardless since ::now() is in a shared library.

Call ::now() from inline assembly with a data dependency

This would be the most efficient solution possible, overcoming even the extra jmpq mentioned above.

This is unfortunately extremely hard to do correctly as shown at: Calling printf in extended inline ASM

If your time measurement can be done directly in inline assembly without a call however, then this technique can be used. This is the case for example for gem5 magic instrumentation instructions, x86 RDTSC (not sure if this is representative anymore) and possibly other performance counters.

Tested with GCC 8.3.0, Ubuntu 19.04.

참고URL : https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c

'Programing' 카테고리의 다른 글

Node.js, Express 및 Mongoose를 사용하여 이미지 업로드 (0)	2020.08.18
Go에서 들여 쓰기 : 탭 또는 공백? (0)	2020.08.18
오류 : "삽입 할 노드가 다른 문서 컨텍스트에 있습니다." (0)	2020.08.18
Django 1.7의 초기 마이그레이션에서 다시 마이그레이션하는 방법은 무엇입니까? (0)	2020.08.18
비 이스케이프 매개 변수의 클로저 사용으로 인해 이스케이프 될 수 있음 (0)	2020.08.18

현재글C ++에서 명령문 순서 적용

crosscheck

C ++에서 명령문 순서 적용

C ++에서 명령문 순서 적용

'Programing' 카테고리의 다른 글

'Programing'의 다른글

티스토리툴바

C ++에서 명령문 순서 적용

C ++에서 명령문 순서 적용

'Programing' 카테고리의 다른 글

'Programing'의 다른글

관련글

티스토리툴바