Advanced profiling topics. PEBS and LBR.

ycchae 2022. 3. 6. 19:01

Multiplexing and scaling events

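If there are more events than counters, the kernel uses time multiplexing (switch frequency = HZ, generally 100 or 1000) to give each event a chance to access the monitoring hardware. Multiplexing only applies to PMU events. With multiplexing, an event is not measured all the time. At the end of the run, the tool scales the raw count: final_count = raw_count * time_enabled / time_running, where time_enabled is how long the event was enabled and time_running is how long it was actually counting on the PMU. The result is an estimate, so depending on the workload there can be blind spots.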

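Generic counters can measure any event. Each fixed counter can measure only one event. Some counters may be reserved for special purposes, such as a watchdog timer. To avoid multiplexing and scaling, either reduce the number of events you request or use a mechanism that guarantees that events A and B are always measured together. The perf_events kernel interface supports event grouping for this purpose. Older versions of the perf tool did not expose it; current versions accept groups written as {A,B} in the event list.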

In short: when there are more events than counters, the kernel time-multiplexes them so that each event gets a turn on the monitoring hardware, and the final counts are estimated at the end of the run with the formula above.

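For example, suppose that during profiling an event was scheduled onto a counter five times, and each time it stayed there for 100 ms, so time_running = 500 ms. The program ran for 1 s in total (time_enabled), and the counter recorded 10000 events (raw_count). The scaled estimate is therefore final_count = 10000 * 1 s / 500 ms = 20000.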
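
The scaling step can be sketched in a few lines of Python. This is only an illustration of the formula, not code from perf itself; the function name scale_count and the nanosecond units are assumptions made for the example, while time_enabled and time_running follow the names used above.

```python
def scale_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate the full-run count of a multiplexed event."""
    if time_running_ns == 0:
        # The event never got onto a counter, so there is nothing to extrapolate from.
        return float("nan")
    return raw_count * time_enabled_ns / time_running_ns


# Five 100 ms slices on the PMU during a 1 s run, 10000 events counted.
estimate = scale_count(raw_count=10_000,
                       time_enabled_ns=1_000_000_000,
                       time_running_ns=5 * 100_000_000)
print(estimate)  # 20000.0
```

The estimate assumes the event rate during the unmeasured slices matches the measured ones, which is exactly where the blind spots come from.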

This explains how the "general-exploration" analysis in Intel VTune Amplifier can collect almost 100 different events in a single run of the program. For calibration purposes, profiling tools usually keep a threshold for each counter to decide whether the measured number of events can be trusted or is too low to rely on (see MUX reliability).

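The simplest algorithm for multiplexing events is to manage them round-robin, so that every event eventually gets a chance to run. If there are N counters, up to the first N events on the round-robin list are programmed into the PMU. In certain situations it may be fewer than N, because some events cannot be measured together or compete for the same counter.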
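
A toy model of this round-robin rotation, assuming that every event fits on any counter (so scheduling constraints are ignored) and that the whole active set is rotated to the back after each tick. The kernel's real scheduler differs in detail; the function name and the event names are only illustrative.

```python
from collections import deque


def round_robin_schedule(events, num_counters, num_ticks):
    """Return, for each tick, the list of events programmed on the PMU."""
    queue = deque(events)
    schedule = []
    for _ in range(num_ticks):
        # Program the first N events of the list for this time slice.
        active = list(queue)[:num_counters]
        schedule.append(active)
        # Rotate so that events that just ran move to the back of the list.
        queue.rotate(-len(active))
    return schedule


events = ["cycles", "instructions", "branches", "branch-misses", "cache-misses"]
schedule = round_robin_schedule(events, num_counters=2, num_ticks=5)
for tick, active in enumerate(schedule):
    print(tick, active)

# Fraction of ticks each event was actually counting (time_running / time_enabled).
share = {e: sum(e in active for active in schedule) / len(schedule) for e in events}
print(share)
```

With five events on two counters, each event ends up counting for two ticks out of five, so every raw count would be scaled by 5/2.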

To avoid scaling, you can reduce the number of events so that it does not exceed the number of available physical PMCs.

Runtime overhead of characterizing and profiling

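There is a very good paper by A. Nowak and G. Bitzes on the runtime overhead of counting and sampling modes. They measured profiling overhead on a Xeon-based machine with 48 logical cores in different configurations: Hyper-Threading disabled/enabled, the job running on all/several/one core(s), and 1/4/8/16 different metrics collected.
My reading is that counting mode has almost no runtime overhead (1-2%). Sampling mode is also cheap as long as you do not multiplex between different counters and do not set the sampling frequency too high. However, if you try to collect more counters than there are physical PMU counters available, you can see a performance hit of about 5-15%, depending on how many counters you want to collect. Finally, the higher the sampling frequency, the more interrupts have to be handled, so the profiling overhead grows.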

Interrupt- vs. event-based sampling

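Interrupt-based sampling introduces skid on modern processors. That is, the instruction pointer stored in each sample designates the place where the program was interrupted to handle the PMU interrupt, not the place where the counter actually overflowed, i.e. where the sampling period ended. In some cases the distance between those two points can be several dozen instructions or more, and it can span taken branches. Let’s assume that on retirement of instr1 we have an overflow of the counter that samples “instruction retired” events. Because of the microarchitectural latency between the occurrence of the event and the generation of the interrupt, it is sometimes difficult to generate the interrupt close to the instruction that caused the overflow. By the time the interrupt is generated, the IP has already moved on by a number of instructions. So when the register state is reconstructed in the interrupt service routine, the data is slightly inaccurate.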

Processor Event-Based Sampling (PEBS)

The skid problem can be mitigated by having the processor itself store the instruction pointer (along with other information) in a designated buffer in memory. No interrupt is issued for each sample, and the instruction pointer is off by a single instruction at most. This needs to be supported by the hardware and is typically available only for a subset of the supported events. On Intel processors this capability is called Processor Event-Based Sampling (PEBS). You can also see people call it Precise Event-Based Sampling, but according to the Intel manuals the first word is “Processor”, not “Precise”. It basically means the same thing.

When a counter is configured to capture machine state, the processor writes the machine state information to a memory buffer specified by software. When the counter overflows, the PEBS hardware is armed. On the next PEBS event, the PEBS hardware triggers an assist and causes a PEBS record to be written to the PEBS buffer. This record contains the architectural state of the processor (the state of the general purpose registers, the EIP register, and the EFLAGS register). With PEBS, the format of the samples is dictated by the processor, so to check it you have to look at the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B, Chapter 18.
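
The arm-then-record sequence can be illustrated with a toy simulation. This is a deliberately simplified sketch and not a description of the real hardware: the function name, the record layout (only an "ip" field), and the buffer_threshold parameter are assumptions made for the example.

```python
def pebs_toy_model(event_ips, period, buffer_threshold):
    """Toy model of PEBS: returns (records, interrupts).

    event_ips        -- instruction pointers at which the sampled event occurs
    period           -- sampling period (events between counter overflows)
    buffer_threshold -- records accumulated before an interrupt is raised
    """
    counter = 0
    armed = False
    buffer = []        # stands in for the PEBS buffer in memory
    records = []       # everything the (hypothetical) kernel has drained so far
    interrupts = 0

    for ip in event_ips:
        if armed:
            # The first event after the overflow triggers the assist that
            # writes a record; no interrupt is needed for this step.
            buffer.append({"ip": ip})
            armed = False
            if len(buffer) >= buffer_threshold:
                # Only now is software involved: drain the whole buffer at once.
                interrupts += 1
                records.extend(buffer)
                buffer.clear()
        counter += 1
        if counter == period:
            # Counter overflow: reset it and arm the PEBS hardware.
            counter = 0
            armed = True

    records.extend(buffer)  # collect whatever is left at the end of the run
    return records, interrupts


ips = list(range(0x1000, 0x1000 + 100))   # 100 events at made-up addresses
records, interrupts = pebs_toy_model(ips, period=10, buffer_threshold=4)
print(len(records), interrupts)
```

In this model the number of interrupts is much smaller than the number of records, because software only steps in when the buffer reaches its threshold.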

Not all events support PEBS. For example, on Sandy Bridge there are 7 PEBS events supported:

INST_RETIRED
UOPS_RETIRED
BR_INST_RETIRED
BR_MISP_RETIRED
MEM_UOPS_RETIRED
MEM_LOAD_UOPS_RETIRED
MEM_LOAD_UOPS_LLC_HIT_RETIRED

The PEBS buffer consists of records. Each sample contains the machine state of the processor at the time the counter overflowed. Here is an example of a PEBS record (the picture is taken from the Intel manual):

[Picture: PEBS record format]

You can use PEBS with perf by adding the :p or :pp suffix to the event specifier:

$ perf record -e event:pp

Benefits of using PEBS:

The skid is minimized compared to the instruction pointer captured by a regular interrupt.
The overhead is reduced because the Linux kernel is only involved when the PEBS buffer fills up, i.e. no interrupt is raised until many samples are available.

Last Branch Record (LBR)

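The last branch recording mechanism tracks not only branch instructions (such as JMP, Jcc, LOOP and CALL), but also other operations that change the instruction pointer (such as external interrupts, traps and faults). The branch recording mechanisms generally use a set of MSRs (model-specific registers), referred to as the last branch record (LBR) stack. The size and exact location of the LBR stack are generally model-specific.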

Last Branch Record (LBR) Stack: consists of N pairs of MSRs (N is, again, model-specific) that store the source and destination addresses of recent branches.
Last Branch Record Top-of-Stack (TOS) Pointer: contains a pointer to the MSR in the LBR stack that holds the most recent branch, interrupt, or exception recorded.
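
Conceptually, the LBR stack behaves like a small ring buffer indexed by the TOS pointer. The sketch below models that idea in Python. The class name, the depth of 4, and the addresses are made up for illustration, and the exact TOS update rule on real hardware is model-specific.

```python
class LbrStack:
    """Toy model of an LBR stack: N (source, destination) slots plus a TOS index."""

    def __init__(self, depth):
        self.entries = [None] * depth   # stands in for the N pairs of MSRs
        self.tos = depth - 1            # so the first record lands in slot 0

    def record(self, src, dst):
        # Advance TOS with wrap-around, overwriting the oldest entry.
        self.tos = (self.tos + 1) % len(self.entries)
        self.entries[self.tos] = (src, dst)

    def most_recent_first(self):
        # Walk backwards from TOS, skipping slots that were never written.
        n = len(self.entries)
        ordered = (self.entries[(self.tos - i) % n] for i in range(n))
        return [e for e in ordered if e is not None]


lbr = LbrStack(depth=4)
for i in range(6):                       # six branches into a four-entry stack
    lbr.record(src=0x400000 + 0x10 * i, dst=0x400800 + 0x10 * i)

for src, dst in lbr.most_recent_first():
    print(hex(src), "->", hex(dst))
```

After six branches only the four most recent remain, which is why the LBR gives a short window of recent control flow rather than a full trace.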

There are two important usages for LBR as mentioned in Intel® 64 and IA-32 Architectures Optimization Reference Manual, Chapter B.3.3.4:

Collecting Call Counts and Function Arguments. If the LBRs are captured for PMIs triggered by the BR_INST_RETIRED.NEAR_CALL event, then the call count per calling function can be determined by simply using the last entry in LBR. As the PEBS IP will equal the last target IP in the LBR, it is the entry point of the calling function. Similarly, the last source in the LBR buffer was the call site from within the calling function. If the full PEBS record is captured as well, then for functions with limited numbers of arguments on 64-bit OS’s, you can sample both the call counts and the function arguments.
Basic Block Execution Counts. This is rather complicated to explain, so I refer the reader to the manual for the details.
UPD: Precise timing of machine code.
UPD: Estimating branch probability.

From a user perspective LBR can be used for collecting call-graph information even if you compiled your app without frame pointers (controlled by compiler option ‘-fomit-frame-pointer’, ON by default):

$ perf record --call-graph lbr

Using LBR in perf also lets you see where most of the branches were taken:

$ perf record -b
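
The idea behind finding the hottest branches from such samples can be sketched as a simple tally of (source, destination) pairs. The samples below are hypothetical, hand-written data, not real perf output, and this is not how perf itself is implemented.

```python
from collections import Counter

# Hypothetical samples: each one is the list of (source, destination)
# pairs captured from the LBR stack at sampling time. Addresses are made up.
samples = [
    [(0x4005D0, 0x4005A0), (0x4005B8, 0x4005D0), (0x400610, 0x400580)],
    [(0x4005D0, 0x4005A0), (0x4005B8, 0x4005D0)],
    [(0x4005D0, 0x4005A0), (0x400610, 0x400580)],
]

# Count how often each branch appears across all samples.
branch_counts = Counter(branch for sample in samples for branch in sample)

for (src, dst), count in branch_counts.most_common():
    print(f"{src:#x} -> {dst:#x}: {count}")
```

Branches that show up in many samples are the ones executed most often, so the top of this list points at the hot control-flow edges.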