perf-amd-ibs(1)
===============

NAME
----
perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool

SYNOPSIS
--------
[verse]
'perf record' -e ibs_op//
'perf record' -e ibs_fetch//

DESCRIPTION
-----------

Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
profiling support on AMD platforms. IBS has two independent components: IBS
Op and IBS Fetch. IBS Op sampling provides information about instruction
execution (micro-op execution, to be precise) with details such as d-cache
hit/miss, d-TLB hit/miss, cache miss latency, load/store data source and
branch behavior. IBS Fetch sampling provides information about instruction
fetch with details such as i-cache hit/miss, i-TLB hit/miss and fetch
latency. IBS is per SMT thread, i.e. each SMT hardware thread contains
standalone IBS units.

Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used with
the Linux perf utility. The following directories are created at boot time
if IBS is supported by the hardware and kernel.

  /sys/bus/event_source/devices/ibs_op/
  /sys/bus/event_source/devices/ibs_fetch/
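
A script can test for these directories before invoking perf. The following
is a minimal sketch: the function name `ibs_pmu_available` and its optional
second argument (an alternate sysfs root, useful only for testing) are
illustrative and not part of perf.

```shell
#!/bin/sh
# Return 0 if the named IBS PMU (ibs_op or ibs_fetch) is registered.
# $1: PMU name. $2: hypothetical override of the sysfs root (default /sys).
ibs_pmu_available() {
    ibs_name="$1"
    ibs_root="${2:-/sys}"
    [ -d "$ibs_root/bus/event_source/devices/$ibs_name" ]
}

# Report the state of both IBS PMUs on the running system.
for p in ibs_op ibs_fetch; do
    if ibs_pmu_available "$p"; then
        echo "$p: available"
    else
        echo "$p: not available"
    fi
done
```

A missing directory means either the hardware lacks IBS or the running kernel
was built without support for it; the check cannot tell the two apart.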

IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
one event: fetch ops.

IBS PMUs do not have user/kernel filtering capability and thus require
CAP_SYS_ADMIN or CAP_PERFMON privilege.

IBS VS. REGULAR CORE PMU
------------------------

IBS gives samples with precise IP, i.e. the IP recorded with an IBS sample
has no skid. In contrast, the IP recorded by the regular core PMU has some
skid (the sample was generated at IP X but perf records it at IP X+n). Hence,
the regular core PMU might not help for profiling with instruction-level
precision. Further, IBS provides additional information about the sample in
question. On the other hand, the regular core PMU has its own advantages,
such as a plethora of events, counting mode (less interference), up to 6
parallel counters, event grouping support and filtering capabilities.

Three regular core PMU events are internally forwarded to IBS Op PMU when
the precise_ip attribute is set:

	-e cpu-cycles:p becomes -e ibs_op//
	-e r076:p       becomes -e ibs_op//
	-e r0C1:p       becomes -e ibs_op/cnt_ctl=1/
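
As a quick reference, the forwarding rules above can be written as a small
lookup. The function name `ibs_forward_target` is hypothetical; the sketch
only restates the three documented cases and does not query the kernel.

```shell
#!/bin/sh
# Print the IBS Op event that a precise core PMU event is forwarded to.
# Covers only the three cases listed in this document; returns 1 otherwise.
ibs_forward_target() {
    case "$1" in
        cpu-cycles:p|r076:p) echo "ibs_op//" ;;
        r0C1:p)              echo "ibs_op/cnt_ctl=1/" ;;
        *)                   return 1 ;;
    esac
}

ibs_forward_target r0C1:p    # prints ibs_op/cnt_ctl=1/
```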

EXAMPLES
--------

IBS Op PMU
~~~~~~~~~~

System-wide profile, cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -a

Per-cpu profile (cpu10), cycles event, sampling period: 100000

	# perf record -e ibs_op// -c 100000 -C 10

Per-cpu profile (cpu10), cycles event, sampling freq: 1000

	# perf record -e ibs_op// -F 1000 -C 10

System-wide profile, uOps event, sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a

Same command, but also capture IBS register raw dump along with perf sample:

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples

System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)

	# perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a

System-wide profile, cycles event, sampling period: 100000, LdLat filtering (Zen5
onward)

	# perf record -e ibs_op/ldlat=128/ -c 100000 -a

	Supported load latency threshold values are 128 to 2048 (both inclusive).
	A latency value that is a multiple of 128 incurs slightly less profiling
	overhead than other values.
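
The range and multiple-of-128 guidance above can be checked before building
the event string. This is a minimal sketch; `check_ldlat` is a hypothetical
helper, not a perf feature, and it encodes only the limits stated in this
document.

```shell
#!/bin/sh
# Validate a candidate ldlat threshold: 128..2048 inclusive, and report
# whether it is a multiple of 128 (the lower-overhead case).
check_ldlat() {
    ldlat_val="$1"
    # Accept plain decimal digits only (no sign, no leading zero).
    case "$ldlat_val" in
        ''|*[!0-9]*|0*) echo "invalid: not a plain decimal number"; return 1 ;;
    esac
    if [ "$ldlat_val" -lt 128 ] || [ "$ldlat_val" -gt 2048 ]; then
        echo "invalid: out of range"
        return 1
    fi
    if [ $((ldlat_val % 128)) -eq 0 ]; then
        echo "ok: multiple of 128"
    else
        echo "ok: not a multiple of 128"
    fi
}

check_ldlat 256    # prints ok: multiple of 128
```

A wrapper could call this and, on success, pass `ibs_op/ldlat=<value>/` to
`perf record` as in the example above.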

Per process (upstream v6.2 onward), attach to an existing PID, uOps event,
sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234

Per process (upstream v6.2 onward), launch a command, uOps event,
sampling period: 100000

	# perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls

To analyse recorded profile in aggregate mode

	# perf report
	/* Select a line and press 'a' to drill down at instruction level. */

To go over each sample

	# perf script

Raw dump of IBS registers when profiled with --raw-samples

	# perf report -D
	/* Look for PERF_RECORD_SAMPLE */

	Example register raw dump:

	ibs_op_ctl:	000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
		Val 1 CntCtl 0=cycles CurCnt 707
	IbsOpRip:	ffffffff8204aea7
	ibs_op_data:	0000010002550001 CompToRetCtr 1 TagToRetCtr 597
		BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
	ibs_op_data2:	0000000000000013 RmtNode 1 DataSrc 3=DRAM
	ibs_op_data3:	0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
		DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
		DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
		DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
		DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
		OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
	IbsDCLinAd:	ff110008a5398920
	IbsDCPhysAd:	00000008a5398920
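
To show how the decoded fields relate to the raw value, the sketch below
unpacks the ibs_op_ctl word from the dump above using shell arithmetic. The
bit positions are an assumption based on the kernel's `union ibs_op_ctl`
definition; they do reproduce the values perf printed in this example. The
function name `decode_ibs_op_ctl` is hypothetical.

```shell
#!/bin/sh
# Decode selected fields of an ibs_op_ctl value given as hex without "0x".
# Assumed layout: bits 0-15 MaxCnt[19:4], bit 16 L3MissOnly, bit 17 En,
# bit 18 Val, bit 19 CntCtl, bits 20-26 MaxCnt[26:20], bits 32-58 CurCnt.
decode_ibs_op_ctl() {
    reg=$((0x$1))
    max_lo=$(( reg & 0xffff ))
    max_ext=$(( (reg >> 20) & 0x7f ))
    # The stored count omits the low 4 bits, so shift left by 4.
    max_cnt=$(( ((max_ext << 16) | max_lo) << 4 ))
    l3_miss_only=$(( (reg >> 16) & 1 ))
    en=$(( (reg >> 17) & 1 ))
    val=$(( (reg >> 18) & 1 ))
    cnt_ctl=$(( (reg >> 19) & 1 ))
    cur_cnt=$(( (reg >> 32) & 0x7ffffff ))
    echo "MaxCnt $max_cnt L3MissOnly $l3_miss_only En $en Val $val CntCtl $cnt_ctl CurCnt $cur_cnt"
}

decode_ibs_op_ctl 000002c30006186a
# prints: MaxCnt 100000 L3MissOnly 0 En 1 Val 1 CntCtl 0 CurCnt 707
```

MaxCnt 100000 matches the `-c 100000` sampling period used when recording,
and CntCtl 0 corresponds to the cycles event.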

IBS applied in a real-world use case

	A ~90% regression was observed in tbench with a specific scheduler
	hint, which was counterintuitive. IBS profiles of the good and bad
	runs, captured using perf, helped identify the exact cause of the
	problem:

	https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com

IBS Fetch PMU
~~~~~~~~~~~~~

Similar commands can be used with Fetch PMU as well.

System-wide profile, fetch ops event, sampling period: 100000

	# perf record -e ibs_fetch// -c 100000 -a

System-wide profile, fetch ops event, sampling period: 100000, Random enable

	# perf record -e ibs_fetch/rand_en=1/ -c 100000 -a

	Random enable adds a small degree of variability to the sample period.
	This helps in cases like long-running loops, where the PMU would
	otherwise tag the same instruction over and over because of the fixed
	sample period.

etc.

PERF MEM AND PERF C2C
---------------------

perf mem is a memory access profiler tool and perf c2c is a shared data
cacheline analyser tool. Both internally use the IBS Op PMU on AMD.
Below is a simple example of the perf mem tool.

	# perf mem record -c 100000 -- make
	# perf mem report

A normal perf mem report output provides a detailed memory access profile.
New output fields show related access info together. For example:

  # perf mem report -F overhead,cache,snoop,comm
  ...
  # Samples: 92K of event 'ibs_op//'
  # Total weight : 531104
  #
  #           ---------- Cache -----------  --- Snoop ----
  # Overhead       L1     L2 L1-buf  Other     HitM  Other  Command
  # ........  ............................  ..............  ..........
  #
      76.07%     5.8%  35.7%   0.0%  34.6%    23.3%  52.8%  cc1
       5.79%     0.2%   0.0%   0.0%   5.6%     0.1%   5.7%  make
       5.78%     0.1%   4.4%   0.0%   1.2%     0.5%   5.3%  gcc
       5.33%     0.3%   3.9%   0.0%   1.1%     0.2%   5.2%  as
       5.00%     0.1%   3.8%   0.0%   1.0%     0.3%   4.7%  sh
       1.56%     0.1%   0.1%   0.0%   1.4%     0.6%   0.9%  ld
       0.28%     0.1%   0.0%   0.0%   0.2%     0.1%   0.2%  pkg-config
       0.09%     0.0%   0.0%   0.0%   0.1%     0.0%   0.1%  git
       0.03%     0.0%   0.0%   0.0%   0.0%     0.0%   0.0%  rm
  ...

The samples can also be aggregated by various memory access attributes using
sort keys. For example:

  # perf mem report -s mem,snoop
  ...
  # Samples: 92K of event 'ibs_op//'
  # Total weight : 531104
  # Sort order   : mem,snoop
  #
  # Overhead       Samples  Memory access                            Snoop
  # ........  ............  .......................................  ............
  #
      47.99%          1509  L2 hit                                   N/A
      25.08%           338  core, same node Any cache hit            HitM
      10.24%         54374  N/A                                      N/A
       6.77%         35938  L1 hit                                   N/A
       6.39%           101  core, same node Any cache hit            N/A
       3.50%            69  RAM hit                                  N/A
       0.03%           158  LFB/MAB hit                              N/A
       0.00%             2  Uncached hit                             N/A

Please refer to the perf mem and perf c2c man pages for more detail.

SEE ALSO
--------

linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
linkperf:perf-mem[1], linkperf:perf-c2c[1]