perf-sched.txt 27 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500
  1. perf-sched(1)
  2. =============
  3. NAME
  4. ----
  5. perf-sched - Tool to trace/measure scheduler properties (latencies)
  6. SYNOPSIS
  7. --------
  8. [verse]
  9. 'perf sched' {record|latency|map|replay|script|timehist|stats}
  10. DESCRIPTION
  11. -----------
  12. There are several variants of 'perf sched':
  13. 'perf sched record <command>' to record the scheduling events
  14. of an arbitrary workload.
  15. 'perf sched latency' to report the per task scheduling latencies
  16. and other scheduling properties of the workload.
  17. Example usage:
  18. perf sched record -- sleep 1
  19. perf sched latency
  20. -------------------------------------------------------------------------------------------------------------------------------------------
  21. Task | Runtime ms | Count | Avg delay ms | Max delay ms | Max delay start | Max delay end |
  22. -------------------------------------------------------------------------------------------------------------------------------------------
  23. perf:(2) | 2.804 ms | 66 | avg: 0.524 ms | max: 1.069 ms | max start: 254752.314960 s | max end: 254752.316029 s
  24. NetworkManager:1343 | 0.372 ms | 13 | avg: 0.008 ms | max: 0.013 ms | max start: 254751.551153 s | max end: 254751.551166 s
  25. kworker/1:2-xfs:4649 | 0.012 ms | 1 | avg: 0.008 ms | max: 0.008 ms | max start: 254751.519807 s | max end: 254751.519815 s
  26. kworker/3:1-xfs:388 | 0.011 ms | 1 | avg: 0.006 ms | max: 0.006 ms | max start: 254751.519809 s | max end: 254751.519815 s
  27. sleep:147736 | 0.938 ms | 3 | avg: 0.006 ms | max: 0.007 ms | max start: 254751.313817 s | max end: 254751.313824 s
  28. It shows Runtime(time that a task spent actually running on the CPU),
  29. Count(number of times a delay was calculated) and delay(time that a
  30. task was ready to run but was kept waiting).
  31. Tasks with the same command name are merged and the merge count is
  32. given within (), However if -p option is used, pid is mentioned.
  33. 'perf sched script' to see a detailed trace of the workload that
  34. was recorded (aliased to 'perf script' for now).
  35. 'perf sched replay' to simulate the workload that was recorded
  36. via perf sched record. (this is done by starting up mockup threads
  37. that mimic the workload based on the events in the trace. These
  38. threads can then replay the timings (CPU runtime and sleep patterns)
  39. of the workload as it occurred when it was recorded - and can repeat
  40. it a number of times, measuring its performance.)
  41. 'perf sched map' to print a textual context-switching outline of
  42. workload captured via perf sched record. Columns stand for
  43. individual CPUs, and the two-letter shortcuts stand for tasks that
  44. are running on a CPU. A '*' denotes the CPU that had the event, and
  45. a dot signals an idle CPU.
  46. 'perf sched timehist' provides an analysis of scheduling events.
  47. Example usage:
  48. perf sched record -- sleep 1
  49. perf sched timehist
  50. By default it shows the individual schedule events, including the wait
  51. time (time between sched-out and next sched-in events for the task), the
  52. task scheduling delay (time between runnable and actually running) and
  53. run time for the task:
  54. time cpu task name wait time sch delay run time
  55. [tid/pid] (msec) (msec) (msec)
  56. -------------- ------ -------------------- --------- --------- ---------
  57. 79371.874569 [0011] gcc[31949] 0.014 0.000 1.148
  58. 79371.874591 [0010] gcc[31951] 0.000 0.000 0.024
  59. 79371.874603 [0010] migration/10[59] 3.350 0.004 0.011
  60. 79371.874604 [0011] <idle> 1.148 0.000 0.035
  61. 79371.874723 [0005] <idle> 0.016 0.000 1.383
  62. 79371.874746 [0005] gcc[31949] 0.153 0.078 0.022
  63. ...
  64. Times are in msec.usec.
  65. 'perf sched stats {record | report | diff} <command>' to capture, report the diff
  66. in schedstat counters and show the difference between perf sched stats report
  67. respectively. schedstat counters which are present in the linux kernel and are
  68. exposed through the file ``/proc/schedstat``. These counters are enabled or disabled
  69. via the sysctl governed by the file ``/proc/sys/kernel/sched_schedstats``. These
  70. counters accounts for many scheduler events such as ``schedule()`` calls, load-balancing
  71. events, ``try_to_wakeup()`` call among others. This is useful in understanding the
  72. scheduler behavior for the workload.
  73. Note: The tool will not give correct results if there is topological reordering or
  74. online/offline of cpus in between capturing snapshots of `/proc/schedstat`.
  75. Example usage:
  76. perf sched stats record -- sleep 1
  77. perf sched stats report
  78. perf sched stats diff
  79. A detailed description of the schedstats can be found in the Kernel Documentation:
  80. https://www.kernel.org/doc/html/latest/scheduler/sched-stats.html
  81. The result can be interpreted as follows:
  82. The `perf sched stats report` starts with description of the columns present in
  83. the report. These column names are given before cpu and domain stats to improve
  84. the readability of the report.
  85. ----------------------------------------------------------------------------------------------------
  86. DESC -> Description of the field
  87. COUNT -> Value of the field
  88. PCT_CHANGE -> Percent change with corresponding base value
  89. AVG_JIFFIES -> Avg time in jiffies between two consecutive occurrence of event
  90. ----------------------------------------------------------------------------------------------------
  91. Next is the total profiling time in terms of jiffies:
  92. ----------------------------------------------------------------------------------------------------
  93. Time elapsed (in jiffies) : 2323
  94. ----------------------------------------------------------------------------------------------------
  95. Next is CPU scheduling statistics. These are simple diffs of /proc/schedstat CPU lines
  96. along with description. The report also prints % relative to base stat.
  97. In the example below, schedule() left the CPU0 idle 36.58% of the time. 0.45% of total
  98. try_to_wake_up() was to wakeup local CPU. And, the total waittime by tasks on CPU0 is
  99. 48.70% of the total runtime by tasks on the same CPU.
  100. ----------------------------------------------------------------------------------------------------
  101. CPU 0
  102. ----------------------------------------------------------------------------------------------------
  103. DESC COUNT PCT_CHANGE
  104. ----------------------------------------------------------------------------------------------------
  105. yld_count : 0
  106. array_exp : 0
  107. sched_count : 402267
  108. sched_goidle : 147161 ( 36.58% )
  109. ttwu_count : 236309
  110. ttwu_local : 1062 ( 0.45% )
  111. rq_cpu_time : 7083791148
  112. run_delay : 3449973971 ( 48.70% )
  113. pcount : 255035
  114. ----------------------------------------------------------------------------------------------------
  115. Next is load balancing statistics. For each of the sched domains
  116. (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
  117. the following three categories:
  118. 1) Idle Load Balance: Load balancing performed on behalf of a long
  119. idling CPU by some other CPU.
  120. 2) Busy Load Balance: Load balancing performed when the CPU was busy.
  121. 3) New Idle Balance : Load balancing performed when a CPU just became
  122. idle.
  123. Under each of these three categories, sched stats report provides
  124. different load balancing statistics. Along with direct stats, the
  125. report also contains derived metrics prefixed with *. Example:
  126. ----------------------------------------------------------------------------------------------------
  127. CPU 0, DOMAIN SMT CPUS 0,64
  128. ----------------------------------------------------------------------------------------------------
  129. DESC COUNT AVG_JIFFIES
  130. ----------------------------------------- <Category busy> ------------------------------------------
  131. busy_lb_count : 136 $ 17.08 $
  132. busy_lb_balanced : 131 $ 17.73 $
  133. busy_lb_failed : 0 $ 0.00 $
  134. busy_lb_imbalance_load : 58
  135. busy_lb_imbalance_util : 0
  136. busy_lb_imbalance_task : 0
  137. busy_lb_imbalance_misfit : 0
  138. busy_lb_gained : 7
  139. busy_lb_hot_gained : 0
  140. busy_lb_nobusyq : 2 $ 1161.50 $
  141. busy_lb_nobusyg : 129 $ 18.01 $
  142. *busy_lb_success_count : 5
  143. *busy_lb_avg_pulled : 1.40
  144. ----------------------------------------- <Category idle> ------------------------------------------
  145. idle_lb_count : 449 $ 5.17 $
  146. idle_lb_balanced : 382 $ 6.08 $
  147. idle_lb_failed : 3 $ 774.33 $
  148. idle_lb_imbalance_load : 0
  149. idle_lb_imbalance_util : 0
  150. idle_lb_imbalance_task : 71
  151. idle_lb_imbalance_misfit : 0
  152. idle_lb_gained : 67
  153. idle_lb_hot_gained : 0
  154. idle_lb_nobusyq : 0 $ 0.00 $
  155. idle_lb_nobusyg : 382 $ 6.08 $
  156. *idle_lb_success_count : 64
  157. *idle_lb_avg_pulled : 1.05
  158. ---------------------------------------- <Category newidle> ----------------------------------------
  159. newidle_lb_count : 30471 $ 0.08 $
  160. newidle_lb_balanced : 28490 $ 0.08 $
  161. newidle_lb_failed : 633 $ 3.67 $
  162. newidle_lb_imbalance_load : 0
  163. newidle_lb_imbalance_util : 0
  164. newidle_lb_imbalance_task : 2040
  165. newidle_lb_imbalance_misfit : 0
  166. newidle_lb_gained : 1348
  167. newidle_lb_hot_gained : 0
  168. newidle_lb_nobusyq : 6 $ 387.17 $
  169. newidle_lb_nobusyg : 26634 $ 0.09 $
  170. *newidle_lb_success_count : 1348
  171. *newidle_lb_avg_pulled : 1.00
  172. ----------------------------------------------------------------------------------------------------
  173. Consider following line:
  174. newidle_lb_balanced : 28490 $ 0.08 $
  175. While profiling was active, the load-balancer found 28490 times the load
  176. needs to be balanced on a newly idle CPU 0. Following value encapsulated
  177. inside $ is average jiffies between two events (2323 / 28490 = 0.08).
  178. Next are active_load_balance() stats. alb did not trigger while the
  179. profiling was active, hence it's all 0s.
  180. --------------------------------- <Category active_load_balance()> ---------------------------------
  181. alb_count : 0
  182. alb_failed : 0
  183. alb_pushed : 0
  184. ----------------------------------------------------------------------------------------------------
  185. Next are sched_balance_exec() and sched_balance_fork() stats. They are
  186. not used but we kept it in RFC just for legacy purpose. Unless opposed,
  187. we plan to remove them in next revision.
  188. Next are wakeup statistics. For every domain, the report also shows
  189. task-wakeup statistics. Example:
  190. ------------------------------------------ <Wakeup Info> -------------------------------------------
  191. ttwu_wake_remote : 1590
  192. ttwu_move_affine : 84
  193. ttwu_move_balance : 0
  194. ----------------------------------------------------------------------------------------------------
  195. Same set of stats are reported for each CPU and each domain level.
  196. How to interpret the diff
  197. ~~~~~~~~~~~~~~~~~~~~~~~~~
  198. The `perf sched stats diff` will also start with explaining the columns
  199. present in the diff. Then it will show the diff in time in terms of
  200. jiffies. The order of the values depends on the order of input data
  201. files. It will take `perf.data.old` and `perf.data` respectively as the
  202. defaults for comparison. Example:
  203. ----------------------------------------------------------------------------------------------------
  204. Time elapsed (in jiffies) : 2009, 2001
  205. ----------------------------------------------------------------------------------------------------
  206. Below is the sample representing the difference in cpu and domain stats of
  207. two runs. Here third column or the values enclosed in `|...|` shows the
  208. percent change between the two. Second and fourth columns shows the
  209. side-by-side representions of the corresponding fields from `perf sched
  210. stats report`.
  211. ----------------------------------------------------------------------------------------------------
  212. CPU <ALL CPUS SUMMARY>
  213. ----------------------------------------------------------------------------------------------------
  214. DESC COUNT1 COUNT2 PCT_CHANG>
  215. ----------------------------------------------------------------------------------------------------
  216. yld_count : 0, 0 | 0.00>
  217. array_exp : 0, 0 | 0.00>
  218. sched_count : 528533, 412573 | -21.94>
  219. sched_goidle : 193426, 146082 | -24.48>
  220. ttwu_count : 313134, 385975 | 23.26>
  221. ttwu_local : 1126, 1282 | 13.85>
  222. rq_cpu_time : 8257200244, 8301250047 | 0.53>
  223. run_delay : 4728347053, 3997100703 | -15.47>
  224. pcount : 335031, 266396 | -20.49>
  225. ----------------------------------------------------------------------------------------------------
  226. Below is the sample of domain stats diff:
  227. ----------------------------------------------------------------------------------------------------
  228. CPU <ALL CPUS SUMMARY>, DOMAIN SMT
  229. ----------------------------------------------------------------------------------------------------
  230. DESC COUNT1 COUNT2 PCT_CHANG>
  231. ----------------------------------------- <Category busy> ------------------------------------------
  232. busy_lb_count : 122, 80 | -34.43>
  233. busy_lb_balanced : 115, 76 | -33.91>
  234. busy_lb_failed : 1, 3 | 200.00>
  235. busy_lb_imbalance_load : 35, 49 | 40.00>
  236. busy_lb_imbalance_util : 0, 0 | 0.00>
  237. busy_lb_imbalance_task : 0, 0 | 0.00>
  238. busy_lb_imbalance_misfit : 0, 0 | 0.00>
  239. busy_lb_gained : 7, 2 | -71.43>
  240. busy_lb_hot_gained : 0, 0 | 0.00>
  241. busy_lb_nobusyq : 0, 0 | 0.00>
  242. busy_lb_nobusyg : 115, 76 | -33.91>
  243. *busy_lb_success_count : 6, 1 | -83.33>
  244. *busy_lb_avg_pulled : 1.17, 2.00 | 71.43>
  245. ----------------------------------------- <Category idle> ------------------------------------------
  246. idle_lb_count : 568, 620 | 9.15>
  247. idle_lb_balanced : 462, 449 | -2.81>
  248. idle_lb_failed : 11, 21 | 90.91>
  249. idle_lb_imbalance_load : 0, 0 | 0.00>
  250. idle_lb_imbalance_util : 0, 0 | 0.00>
  251. idle_lb_imbalance_task : 115, 189 | 64.35>
  252. idle_lb_imbalance_misfit : 0, 0 | 0.00>
  253. idle_lb_gained : 103, 169 | 64.08>
  254. idle_lb_hot_gained : 0, 0 | 0.00>
  255. idle_lb_nobusyq : 0, 0 | 0.00>
  256. idle_lb_nobusyg : 462, 449 | -2.81>
  257. *idle_lb_success_count : 95, 150 | 57.89>
  258. *idle_lb_avg_pulled : 1.08, 1.13 | 3.92>
  259. ---------------------------------------- <Category newidle> ----------------------------------------
  260. newidle_lb_count : 16961, 3155 | -81.40>
  261. newidle_lb_balanced : 15646, 2556 | -83.66>
  262. newidle_lb_failed : 397, 142 | -64.23>
  263. newidle_lb_imbalance_load : 0, 0 | 0.00>
  264. newidle_lb_imbalance_util : 0, 0 | 0.00>
  265. newidle_lb_imbalance_task : 1376, 655 | -52.40>
  266. newidle_lb_imbalance_misfit : 0, 0 | 0.00>
  267. newidle_lb_gained : 917, 457 | -50.16>
  268. newidle_lb_hot_gained : 0, 0 | 0.00>
  269. newidle_lb_nobusyq : 3, 1 | -66.67>
  270. newidle_lb_nobusyg : 14480, 2103 | -85.48>
  271. *newidle_lb_success_count : 918, 457 | -50.22>
  272. *newidle_lb_avg_pulled : 1.00, 1.00 | 0.11>
  273. --------------------------------- <Category active_load_balance()> ---------------------------------
  274. alb_count : 0, 1 | 0.00>
  275. alb_failed : 0, 0 | 0.00>
  276. alb_pushed : 0, 1 | 0.00>
  277. --------------------------------- <Category sched_balance_exec()> ----------------------------------
  278. sbe_count : 0, 0 | 0.00>
  279. sbe_balanced : 0, 0 | 0.00>
  280. sbe_pushed : 0, 0 | 0.00>
  281. --------------------------------- <Category sched_balance_fork()> ----------------------------------
  282. sbf_count : 0, 0 | 0.00>
  283. sbf_balanced : 0, 0 | 0.00>
  284. sbf_pushed : 0, 0 | 0.00>
  285. ------------------------------------------ <Wakeup Info> -------------------------------------------
  286. ttwu_wake_remote : 2031, 2914 | 43.48>
  287. ttwu_move_affine : 73, 124 | 69.86>
  288. ttwu_move_balance : 0, 0 | 0.00>
  289. ----------------------------------------------------------------------------------------------------
  290. OPTIONS
  291. -------
  292. Applicable to {record|latency|map|replay|script}
  293. -i::
  294. --input=<file>::
  295. Input file name. (default: perf.data unless stdin is a fifo)
  296. -v::
  297. --verbose::
  298. Be more verbose. (show symbol address, etc)
  299. -D::
  300. --dump-raw-trace=::
  301. Display verbose dump of the sched data.
  302. -f::
  303. --force::
  304. Don't complain, do it.
  305. OPTIONS for 'perf sched latency'
  306. -------------------------------
  307. -C::
  308. --CPU <n>::
  309. CPU to profile on.
  310. -p::
  311. --pids::
  312. latency stats per pid instead of per command name.
  313. -s::
  314. --sort <key[,key2...]>::
  315. sort by key(s): runtime, switch, avg, max
  316. by default it's sorted by "avg ,max ,switch ,runtime".
  317. OPTIONS for 'perf sched map'
  318. ----------------------------
  319. --compact::
  320. Show only CPUs with activity. Helps visualizing on high core
  321. count systems.
  322. --cpus::
  323. Show just entries with activities for the given CPUs.
  324. --color-cpus::
  325. Highlight the given cpus.
  326. --color-pids::
  327. Highlight the given pids.
  328. --task-name <task>::
  329. Map output only for the given task name(s). Separate the
  330. task names with a comma (without whitespace). The sched-out
  331. time is printed and is represented by '*-' for the given
  332. task name(s).
  333. ('-' indicates other tasks while '.' is idle).
  334. --fuzzy-name::
  335. Given task name(s) can be partially matched (fuzzy matching).
  336. OPTIONS for 'perf sched timehist'
  337. ---------------------------------
  338. -k::
  339. --vmlinux=<file>::
  340. vmlinux pathname
  341. --kallsyms=<file>::
  342. kallsyms pathname
  343. -g::
  344. --call-graph::
  345. Display call chains if present (default on).
  346. --max-stack::
  347. Maximum number of functions to display in backtrace, default 5.
  348. -C=::
  349. --cpu=::
  350. Only show events for the given CPU(s) (comma separated list).
  351. -p=::
  352. --pid=::
  353. Only show events for given process ID (comma separated list).
  354. -t=::
  355. --tid=::
  356. Only show events for given thread ID (comma separated list).
  357. -s::
  358. --summary::
  359. Show only a summary of scheduling by thread with min, max, and average
  360. run times (in sec) and relative stddev.
  361. -S::
  362. --with-summary::
  363. Show all scheduling events followed by a summary by thread with min,
  364. max, and average run times (in sec) and relative stddev.
  365. --symfs=<directory>::
  366. Look for files with symbols relative to this directory.
  367. -V::
  368. --cpu-visual::
  369. Show visual aid for sched switches by CPU: 'i' marks idle time,
  370. 's' are scheduler events.
  371. -w::
  372. --wakeups::
  373. Show wakeup events.
  374. -M::
  375. --migrations::
  376. Show migration events.
  377. -n::
  378. --next::
  379. Show next task.
  380. -I::
  381. --idle-hist::
  382. Show idle-related events only.
  383. --time::
  384. Only analyze samples within given time window: <start>,<stop>. Times
  385. have the format seconds.microseconds. If start is not given (i.e., time
  386. string is ',x.y') then analysis starts at the beginning of the file. If
  387. stop time is not given (i.e, time string is 'x.y,') then analysis goes
  388. to end of file.
  389. --state::
  390. Show task state when it switched out.
  391. --show-prio::
  392. Show task priority.
  393. --prio::
  394. Only show events for given task priority(ies). Multiple priorities can be
  395. provided as a comma-separated list with no spaces: 0,120. Ranges of
  396. priorities are specified with -: 120-129. A combination of both can also be
  397. provided: 0,120-129.
  398. -P::
  399. --pre-migrations::
  400. Show pre-migration wait time. pre-migration wait time is the time spent
  401. by a task waiting on a runqueue but not getting the chance to run there
  402. and is migrated to a different runqueue where it is finally run. This
  403. time between sched_wakeup and migrate_task is the pre-migration wait
  404. time.
  405. OPTIONS for 'perf sched replay'
  406. ------------------------------
  407. -r::
  408. --repeat <n>::
  409. repeat the workload n times (0: infinite). Default is 10.
  410. SEE ALSO
  411. --------
  412. linkperf:perf-record[1]