- perf-amd-ibs(1)
- ===============
- NAME
- ----
- perf-amd-ibs - Support for AMD Instruction-Based Sampling (IBS) with perf tool
- SYNOPSIS
- --------
- [verse]
- 'perf record' -e ibs_op//
- 'perf record' -e ibs_fetch//
- DESCRIPTION
- -----------
- Instruction-Based Sampling (IBS) provides precise Instruction Pointer (IP)
- profiling support on AMD platforms. IBS has two independent components: IBS
- Op and IBS Fetch. IBS Op sampling provides information about instruction
- execution (micro-op execution to be precise) with details like d-cache
- hit/miss, d-TLB hit/miss, cache miss latency, load/store data source, branch
- behavior etc. IBS Fetch sampling provides information about instruction fetch
- with details like i-cache hit/miss, i-TLB hit/miss, fetch latency etc. IBS is
- per SMT thread, i.e. each SMT hardware thread has its own standalone IBS units.
- Both IBS Op and IBS Fetch are exposed as PMUs by Linux and can be used with
- the Linux perf utility. The following directories are created at boot time
- if IBS is supported by both the hardware and the kernel:
- /sys/bus/event_source/devices/ibs_op/
- /sys/bus/event_source/devices/ibs_fetch/
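- Before recording, it can be useful to confirm that these sysfs entries exist.
- The following is a minimal sketch, not part of perf: the function name
- ibs_pmus_present and its optional directory argument are hypothetical and
- exist only for illustration and testing.

```shell
# Report which IBS PMUs the running kernel exposes in sysfs.
# $1 (optional, hypothetical): directory to inspect instead of the real
# sysfs location; this override is an assumption made for testability.
# Returns 0 if at least one IBS PMU directory exists, 1 otherwise.
ibs_pmus_present() {
    devices_dir="${1:-/sys/bus/event_source/devices}"
    status=1
    for pmu in ibs_op ibs_fetch; do
        if [ -d "$devices_dir/$pmu" ]; then
            echo "$pmu: present"
            status=0
        else
            echo "$pmu: missing"
        fi
    done
    return "$status"
}
```

- Calling ibs_pmus_present with no argument inspects
- /sys/bus/event_source/devices on the running system.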
- IBS Op PMU supports two events: cycles and micro ops. IBS Fetch PMU supports
- one event: fetch ops.
- IBS PMUs do not have user/kernel filtering capability and thus require the
- CAP_SYS_ADMIN or CAP_PERFMON privilege.
- IBS VS. REGULAR CORE PMU
- ------------------------
- IBS gives samples with a precise IP, i.e. the IP recorded with an IBS sample
- has no skid, whereas the IP recorded by the regular core PMU will have some
- skid (the sample was generated at IP X but perf records it at IP X+n). Hence,
- the regular core PMU might not help for profiling with instruction-level
- precision. Further, IBS provides additional information about the sample in
- question. On the other hand, the regular core PMU has its own advantages, such
- as a plethora of events, counting mode (less interference), up to 6 parallel
- counters, event grouping support, filtering capabilities etc.
- Three regular core PMU events are internally forwarded to the IBS Op PMU when
- the precise_ip attribute is set:
- -e cpu-cycles:p becomes -e ibs_op//
- -e r076:p becomes -e ibs_op//
- -e r0C1:p becomes -e ibs_op/cnt_ctl=1/
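- The forwarding rules listed above can be written down as a small lookup.
- This is an illustrative sketch only: ibs_forwarded_event is a hypothetical
- helper, and it matches the three event spellings exactly as they appear in
- the list, nothing more.

```shell
# Print the IBS Op event that a precise (":p") core PMU event is forwarded
# to, following the three rules listed in this section.
# Returns 1 for any event that is not in that list.
ibs_forwarded_event() {
    case "$1" in
        cpu-cycles:p|r076:p) echo "ibs_op//" ;;
        r0C1:p)              echo "ibs_op/cnt_ctl=1/" ;;
        *)                   return 1 ;;
    esac
}
```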
- EXAMPLES
- --------
- IBS Op PMU
- ~~~~~~~~~~
- System-wide profile, cycles event, sampling period: 100000
- # perf record -e ibs_op// -c 100000 -a
- Per-cpu profile (cpu10), cycles event, sampling period: 100000
- # perf record -e ibs_op// -c 100000 -C 10
- Per-cpu profile (cpu10), cycles event, sampling freq: 1000
- # perf record -e ibs_op// -F 1000 -C 10
- System-wide profile, uOps event, sampling period: 100000
- # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a
- Same command, but also capture IBS register raw dump along with perf sample:
- # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -a --raw-samples
- System-wide profile, uOps event, sampling period: 100000, L3MissOnly (Zen4 onward)
- # perf record -e ibs_op/cnt_ctl=1,l3missonly=1/ -c 100000 -a
- System-wide profile, cycles event, sampling period: 100000, LdLat filtering (Zen5
- onward)
- # perf record -e ibs_op/ldlat=128/ -c 100000 -a
- Supported load latency threshold values are 128 to 2048 (both inclusive).
- A latency value that is a multiple of 128 incurs slightly less profiling
- overhead than other values.
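- A wrapper script could check a threshold against these limits before passing
- it to perf. The sketch below assumes only the range and the multiple-of-128
- note stated above; ibs_check_ldlat is a hypothetical helper, not a perf
- command.

```shell
# Validate a candidate ldlat= threshold against the documented limits:
# 128 to 2048 inclusive. Also report whether it is a multiple of 128,
# which the text above says incurs slightly less profiling overhead.
ibs_check_ldlat() {
    ldlat="$1"
    case "$ldlat" in
        # Reject empty, non-numeric, and zero-prefixed input so the value
        # is never misread as octal in the arithmetic below.
        ''|0*|*[!0-9]*) echo "invalid"; return 1 ;;
    esac
    if [ "$ldlat" -lt 128 ] || [ "$ldlat" -gt 2048 ]; then
        echo "out of range"
        return 1
    fi
    if [ $((ldlat % 128)) -eq 0 ]; then
        echo "ok, multiple of 128"
    else
        echo "ok, not a multiple of 128"
    fi
}
```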
- Per process (upstream v6.2 onward), uOps event, sampling period: 100000, attaching to an existing PID
- # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -p 1234
- Per process (upstream v6.2 onward), uOps event, sampling period: 100000, launching a command
- # perf record -e ibs_op/cnt_ctl=1/ -c 100000 -- ls
- To analyse recorded profile in aggregate mode
- # perf report
- /* Select a line and press 'a' to drill down at instruction level. */
- To go over each sample
- # perf script
- Raw dump of IBS registers when profiled with --raw-samples
- # perf report -D
- /* Look for PERF_RECORD_SAMPLE */
- Example register raw dump:
- ibs_op_ctl: 000002c30006186a MaxCnt 100000 L3MissOnly 0 En 1
- Val 1 CntCtl 0=cycles CurCnt 707
- IbsOpRip: ffffffff8204aea7
- ibs_op_data: 0000010002550001 CompToRetCtr 1 TagToRetCtr 597
- BrnRet 0 RipInvalid 0 BrnFuse 0 Microcode 1
- ibs_op_data2: 0000000000000013 RmtNode 1 DataSrc 3=DRAM
- ibs_op_data3: 0000000031960092 LdOp 0 StOp 1 DcL1TlbMiss 0
- DcL2TlbMiss 0 DcL1TlbHit2M 1 DcL1TlbHit1G 0 DcL2TlbHit2M 0
- DcMiss 1 DcMisAcc 0 DcWcMemAcc 0 DcUcMemAcc 0 DcLockedOp 0
- DcMissNoMabAlloc 0 DcLinAddrValid 1 DcPhyAddrValid 1
- DcL2TlbHit1G 0 L2Miss 1 SwPf 0 OpMemWidth 32 bytes
- OpDcMissOpenMemReqs 12 DcMissLat 0 TlbRefillLat 0
- IbsDCLinAd: ff110008a5398920
- IbsDCPhysAd: 00000008a5398920
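- To see how the decoded fields relate to the raw ibs_op_ctl value in the dump
- above, the value can be unpacked by hand. The bit positions used here are an
- assumption (they are not stated in this page and should be checked against
- the AMD Processor Programming Reference), and decode_ibs_op_ctl is a
- hypothetical helper. With these positions, the raw value from the dump
- decodes to the same fields that perf printed.

```shell
# Decode a raw ibs_op_ctl value given as 16 hex digits without "0x".
# Assumed layout (not taken from this page): MaxCnt[15:0] in units of
# 16 ops, L3MissOnly bit 16, En bit 17, Val bit 18, CntCtl bit 19,
# MaxCnt extension bits 26:20, CurCnt bits 58:32.
decode_ibs_op_ctl() {
    val=$((0x$1))
    max_lo=$(( val & 0xffff ))
    max_ext=$(( (val >> 20) & 0x7f ))
    max_cnt=$(( ((max_ext << 16) | max_lo) << 4 ))
    l3_miss_only=$(( (val >> 16) & 1 ))
    en=$(( (val >> 17) & 1 ))
    valid=$(( (val >> 18) & 1 ))
    cnt_ctl=$(( (val >> 19) & 1 ))
    cur_cnt=$(( (val >> 32) & 0x7ffffff ))
    echo "MaxCnt $max_cnt L3MissOnly $l3_miss_only En $en Val $valid CntCtl $cnt_ctl CurCnt $cur_cnt"
}
```

- For example, decode_ibs_op_ctl 000002c30006186a reports MaxCnt 100000,
- En 1, Val 1, CntCtl 0 and CurCnt 707, matching the dump.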
- IBS applied in a real-world use case
- A ~90% regression was observed in tbench with a specific scheduler hint,
- which was counterintuitive. IBS profiles of the good and bad runs, captured
- using perf, helped identify the exact cause of the problem:
- https://lore.kernel.org/r/20220921063638.2489-1-kprateek.nayak@amd.com
- IBS Fetch PMU
- ~~~~~~~~~~~~~
- Similar commands can be used with Fetch PMU as well.
- System-wide profile, fetch ops event, sampling period: 100000
- # perf record -e ibs_fetch// -c 100000 -a
- System-wide profile, fetch ops event, sampling period: 100000, Random enable
- # perf record -e ibs_fetch/rand_en=1/ -c 100000 -a
- Random enable adds a small degree of variability to the sample period. This
- helps in cases like long-running loops, where the PMU keeps tagging the same
- instruction over and over because of the fixed sample period.
- etc.
- PERF MEM AND PERF C2C
- ---------------------
- perf mem is a memory access profiler tool and perf c2c is a shared data
- cacheline analyser tool. Both internally use the IBS Op PMU on AMD.
- Below is a simple example of the perf mem tool.
- # perf mem record -c 100000 -- make
- # perf mem report
- A normal perf mem report output provides a detailed memory access profile.
- New output fields show related access info together. For example:
- # perf mem report -F overhead,cache,snoop,comm
- ...
- # Samples: 92K of event 'ibs_op//'
- # Total weight : 531104
- #
- #           ---------- Cache -----------  --- Snoop ----
- # Overhead       L1     L2 L1-buf  Other     HitM  Other  Command
- # ........  ............................  ..............  ..........
- #
-     76.07%     5.8%  35.7%   0.0%  34.6%    23.3%  52.8%  cc1
-      5.79%     0.2%   0.0%   0.0%   5.6%     0.1%   5.7%  make
-      5.78%     0.1%   4.4%   0.0%   1.2%     0.5%   5.3%  gcc
-      5.33%     0.3%   3.9%   0.0%   1.1%     0.2%   5.2%  as
-      5.00%     0.1%   3.8%   0.0%   1.0%     0.3%   4.7%  sh
-      1.56%     0.1%   0.1%   0.0%   1.4%     0.6%   0.9%  ld
-      0.28%     0.1%   0.0%   0.0%   0.2%     0.1%   0.2%  pkg-config
-      0.09%     0.0%   0.0%   0.0%   0.1%     0.0%   0.1%  git
-      0.03%     0.0%   0.0%   0.0%   0.0%     0.0%   0.0%  rm
- ...
- Also, it can be aggregated based on various memory access info using the
- sort keys. For example:
- # perf mem report -s mem,snoop
- ...
- # Samples: 92K of event 'ibs_op//'
- # Total weight : 531104
- # Sort order : mem,snoop
- #
- # Overhead       Samples  Memory access                            Snoop
- # ........  ............  .......................................  ............
- #
-     47.99%          1509  L2 hit                                   N/A
-     25.08%           338  core, same node Any cache hit            HitM
-     10.24%         54374  N/A                                      N/A
-      6.77%         35938  L1 hit                                   N/A
-      6.39%           101  core, same node Any cache hit            N/A
-      3.50%            69  RAM hit                                  N/A
-      0.03%           158  LFB/MAB hit                              N/A
-      0.00%             2  Uncached hit                             N/A
- Please refer to their man pages for more details.
- SEE ALSO
- --------
- linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
- linkperf:perf-mem[1], linkperf:perf-c2c[1]