| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500 |
- perf-sched(1)
- =============
- NAME
- ----
- perf-sched - Tool to trace/measure scheduler properties (latencies)
- SYNOPSIS
- --------
- [verse]
- 'perf sched' {record|latency|map|replay|script|timehist|stats}
- DESCRIPTION
- -----------
- There are several variants of 'perf sched':
- 'perf sched record <command>' to record the scheduling events
- of an arbitrary workload.
- 'perf sched latency' to report the per task scheduling latencies
- and other scheduling properties of the workload.
- Example usage:
- perf sched record -- sleep 1
- perf sched latency
- -------------------------------------------------------------------------------------------------------------------------------------------
- Task | Runtime ms | Count | Avg delay ms | Max delay ms | Max delay start | Max delay end |
- -------------------------------------------------------------------------------------------------------------------------------------------
- perf:(2) | 2.804 ms | 66 | avg: 0.524 ms | max: 1.069 ms | max start: 254752.314960 s | max end: 254752.316029 s
- NetworkManager:1343 | 0.372 ms | 13 | avg: 0.008 ms | max: 0.013 ms | max start: 254751.551153 s | max end: 254751.551166 s
- kworker/1:2-xfs:4649 | 0.012 ms | 1 | avg: 0.008 ms | max: 0.008 ms | max start: 254751.519807 s | max end: 254751.519815 s
- kworker/3:1-xfs:388 | 0.011 ms | 1 | avg: 0.006 ms | max: 0.006 ms | max start: 254751.519809 s | max end: 254751.519815 s
- sleep:147736 | 0.938 ms | 3 | avg: 0.006 ms | max: 0.007 ms | max start: 254751.313817 s | max end: 254751.313824 s
- It shows Runtime(time that a task spent actually running on the CPU),
- Count(number of times a delay was calculated) and delay(time that a
- task was ready to run but was kept waiting).
- Tasks with the same command name are merged and the merge count is
- given within (), However if -p option is used, pid is mentioned.
- 'perf sched script' to see a detailed trace of the workload that
- was recorded (aliased to 'perf script' for now).
- 'perf sched replay' to simulate the workload that was recorded
- via perf sched record. (this is done by starting up mockup threads
- that mimic the workload based on the events in the trace. These
- threads can then replay the timings (CPU runtime and sleep patterns)
- of the workload as it occurred when it was recorded - and can repeat
- it a number of times, measuring its performance.)
- 'perf sched map' to print a textual context-switching outline of
- workload captured via perf sched record. Columns stand for
- individual CPUs, and the two-letter shortcuts stand for tasks that
- are running on a CPU. A '*' denotes the CPU that had the event, and
- a dot signals an idle CPU.
- 'perf sched timehist' provides an analysis of scheduling events.
-
- Example usage:
- perf sched record -- sleep 1
- perf sched timehist
-
- By default it shows the individual schedule events, including the wait
- time (time between sched-out and next sched-in events for the task), the
- task scheduling delay (time between runnable and actually running) and
- run time for the task:
-
- time cpu task name wait time sch delay run time
- [tid/pid] (msec) (msec) (msec)
- -------------- ------ -------------------- --------- --------- ---------
- 79371.874569 [0011] gcc[31949] 0.014 0.000 1.148
- 79371.874591 [0010] gcc[31951] 0.000 0.000 0.024
- 79371.874603 [0010] migration/10[59] 3.350 0.004 0.011
- 79371.874604 [0011] <idle> 1.148 0.000 0.035
- 79371.874723 [0005] <idle> 0.016 0.000 1.383
- 79371.874746 [0005] gcc[31949] 0.153 0.078 0.022
- ...
-
- Times are in msec.usec.
- 'perf sched stats {record | report | diff} <command>' to capture, report the diff
- in schedstat counters and show the difference between perf sched stats report
- respectively. schedstat counters which are present in the linux kernel and are
- exposed through the file ``/proc/schedstat``. These counters are enabled or disabled
- via the sysctl governed by the file ``/proc/sys/kernel/sched_schedstats``. These
- counters accounts for many scheduler events such as ``schedule()`` calls, load-balancing
- events, ``try_to_wakeup()`` call among others. This is useful in understanding the
- scheduler behavior for the workload.
- Note: The tool will not give correct results if there is topological reordering or
- online/offline of cpus in between capturing snapshots of `/proc/schedstat`.
- Example usage:
- perf sched stats record -- sleep 1
- perf sched stats report
- perf sched stats diff
- A detailed description of the schedstats can be found in the Kernel Documentation:
- https://www.kernel.org/doc/html/latest/scheduler/sched-stats.html
- The result can be interpreted as follows:
- The `perf sched stats report` starts with description of the columns present in
- the report. These column names are given before cpu and domain stats to improve
- the readability of the report.
- ----------------------------------------------------------------------------------------------------
- DESC -> Description of the field
- COUNT -> Value of the field
- PCT_CHANGE -> Percent change with corresponding base value
- AVG_JIFFIES -> Avg time in jiffies between two consecutive occurrence of event
- ----------------------------------------------------------------------------------------------------
- Next is the total profiling time in terms of jiffies:
- ----------------------------------------------------------------------------------------------------
- Time elapsed (in jiffies) : 2323
- ----------------------------------------------------------------------------------------------------
- Next is CPU scheduling statistics. These are simple diffs of /proc/schedstat CPU lines
- along with description. The report also prints % relative to base stat.
- In the example below, schedule() left the CPU0 idle 36.58% of the time. 0.45% of total
- try_to_wake_up() was to wakeup local CPU. And, the total waittime by tasks on CPU0 is
- 48.70% of the total runtime by tasks on the same CPU.
- ----------------------------------------------------------------------------------------------------
- CPU 0
- ----------------------------------------------------------------------------------------------------
- DESC COUNT PCT_CHANGE
- ----------------------------------------------------------------------------------------------------
- yld_count : 0
- array_exp : 0
- sched_count : 402267
- sched_goidle : 147161 ( 36.58% )
- ttwu_count : 236309
- ttwu_local : 1062 ( 0.45% )
- rq_cpu_time : 7083791148
- run_delay : 3449973971 ( 48.70% )
- pcount : 255035
- ----------------------------------------------------------------------------------------------------
- Next is load balancing statistics. For each of the sched domains
- (eg: `SMT`, `MC`, `DIE`...), the scheduler computes statistics under
- the following three categories:
- 1) Idle Load Balance: Load balancing performed on behalf of a long
- idling CPU by some other CPU.
- 2) Busy Load Balance: Load balancing performed when the CPU was busy.
- 3) New Idle Balance : Load balancing performed when a CPU just became
- idle.
- Under each of these three categories, sched stats report provides
- different load balancing statistics. Along with direct stats, the
- report also contains derived metrics prefixed with *. Example:
- ----------------------------------------------------------------------------------------------------
- CPU 0, DOMAIN SMT CPUS 0,64
- ----------------------------------------------------------------------------------------------------
- DESC COUNT AVG_JIFFIES
- ----------------------------------------- <Category busy> ------------------------------------------
- busy_lb_count : 136 $ 17.08 $
- busy_lb_balanced : 131 $ 17.73 $
- busy_lb_failed : 0 $ 0.00 $
- busy_lb_imbalance_load : 58
- busy_lb_imbalance_util : 0
- busy_lb_imbalance_task : 0
- busy_lb_imbalance_misfit : 0
- busy_lb_gained : 7
- busy_lb_hot_gained : 0
- busy_lb_nobusyq : 2 $ 1161.50 $
- busy_lb_nobusyg : 129 $ 18.01 $
- *busy_lb_success_count : 5
- *busy_lb_avg_pulled : 1.40
- ----------------------------------------- <Category idle> ------------------------------------------
- idle_lb_count : 449 $ 5.17 $
- idle_lb_balanced : 382 $ 6.08 $
- idle_lb_failed : 3 $ 774.33 $
- idle_lb_imbalance_load : 0
- idle_lb_imbalance_util : 0
- idle_lb_imbalance_task : 71
- idle_lb_imbalance_misfit : 0
- idle_lb_gained : 67
- idle_lb_hot_gained : 0
- idle_lb_nobusyq : 0 $ 0.00 $
- idle_lb_nobusyg : 382 $ 6.08 $
- *idle_lb_success_count : 64
- *idle_lb_avg_pulled : 1.05
- ---------------------------------------- <Category newidle> ----------------------------------------
- newidle_lb_count : 30471 $ 0.08 $
- newidle_lb_balanced : 28490 $ 0.08 $
- newidle_lb_failed : 633 $ 3.67 $
- newidle_lb_imbalance_load : 0
- newidle_lb_imbalance_util : 0
- newidle_lb_imbalance_task : 2040
- newidle_lb_imbalance_misfit : 0
- newidle_lb_gained : 1348
- newidle_lb_hot_gained : 0
- newidle_lb_nobusyq : 6 $ 387.17 $
- newidle_lb_nobusyg : 26634 $ 0.09 $
- *newidle_lb_success_count : 1348
- *newidle_lb_avg_pulled : 1.00
- ----------------------------------------------------------------------------------------------------
- Consider following line:
- newidle_lb_balanced : 28490 $ 0.08 $
- While profiling was active, the load-balancer found 28490 times the load
- needs to be balanced on a newly idle CPU 0. Following value encapsulated
- inside $ is average jiffies between two events (2323 / 28490 = 0.08).
- Next are active_load_balance() stats. alb did not trigger while the
- profiling was active, hence it's all 0s.
- --------------------------------- <Category active_load_balance()> ---------------------------------
- alb_count : 0
- alb_failed : 0
- alb_pushed : 0
- ----------------------------------------------------------------------------------------------------
- Next are sched_balance_exec() and sched_balance_fork() stats. They are
- not used but we kept it in RFC just for legacy purpose. Unless opposed,
- we plan to remove them in next revision.
- Next are wakeup statistics. For every domain, the report also shows
- task-wakeup statistics. Example:
- ------------------------------------------ <Wakeup Info> -------------------------------------------
- ttwu_wake_remote : 1590
- ttwu_move_affine : 84
- ttwu_move_balance : 0
- ----------------------------------------------------------------------------------------------------
- Same set of stats are reported for each CPU and each domain level.
- How to interpret the diff
- ~~~~~~~~~~~~~~~~~~~~~~~~~
- The `perf sched stats diff` will also start with explaining the columns
- present in the diff. Then it will show the diff in time in terms of
- jiffies. The order of the values depends on the order of input data
- files. It will take `perf.data.old` and `perf.data` respectively as the
- defaults for comparison. Example:
- ----------------------------------------------------------------------------------------------------
- Time elapsed (in jiffies) : 2009, 2001
- ----------------------------------------------------------------------------------------------------
- Below is the sample representing the difference in cpu and domain stats of
- two runs. Here third column or the values enclosed in `|...|` shows the
- percent change between the two. Second and fourth columns shows the
- side-by-side representions of the corresponding fields from `perf sched
- stats report`.
- ----------------------------------------------------------------------------------------------------
- CPU <ALL CPUS SUMMARY>
- ----------------------------------------------------------------------------------------------------
- DESC COUNT1 COUNT2 PCT_CHANG>
- ----------------------------------------------------------------------------------------------------
- yld_count : 0, 0 | 0.00>
- array_exp : 0, 0 | 0.00>
- sched_count : 528533, 412573 | -21.94>
- sched_goidle : 193426, 146082 | -24.48>
- ttwu_count : 313134, 385975 | 23.26>
- ttwu_local : 1126, 1282 | 13.85>
- rq_cpu_time : 8257200244, 8301250047 | 0.53>
- run_delay : 4728347053, 3997100703 | -15.47>
- pcount : 335031, 266396 | -20.49>
- ----------------------------------------------------------------------------------------------------
- Below is the sample of domain stats diff:
- ----------------------------------------------------------------------------------------------------
- CPU <ALL CPUS SUMMARY>, DOMAIN SMT
- ----------------------------------------------------------------------------------------------------
- DESC COUNT1 COUNT2 PCT_CHANG>
- ----------------------------------------- <Category busy> ------------------------------------------
- busy_lb_count : 122, 80 | -34.43>
- busy_lb_balanced : 115, 76 | -33.91>
- busy_lb_failed : 1, 3 | 200.00>
- busy_lb_imbalance_load : 35, 49 | 40.00>
- busy_lb_imbalance_util : 0, 0 | 0.00>
- busy_lb_imbalance_task : 0, 0 | 0.00>
- busy_lb_imbalance_misfit : 0, 0 | 0.00>
- busy_lb_gained : 7, 2 | -71.43>
- busy_lb_hot_gained : 0, 0 | 0.00>
- busy_lb_nobusyq : 0, 0 | 0.00>
- busy_lb_nobusyg : 115, 76 | -33.91>
- *busy_lb_success_count : 6, 1 | -83.33>
- *busy_lb_avg_pulled : 1.17, 2.00 | 71.43>
- ----------------------------------------- <Category idle> ------------------------------------------
- idle_lb_count : 568, 620 | 9.15>
- idle_lb_balanced : 462, 449 | -2.81>
- idle_lb_failed : 11, 21 | 90.91>
- idle_lb_imbalance_load : 0, 0 | 0.00>
- idle_lb_imbalance_util : 0, 0 | 0.00>
- idle_lb_imbalance_task : 115, 189 | 64.35>
- idle_lb_imbalance_misfit : 0, 0 | 0.00>
- idle_lb_gained : 103, 169 | 64.08>
- idle_lb_hot_gained : 0, 0 | 0.00>
- idle_lb_nobusyq : 0, 0 | 0.00>
- idle_lb_nobusyg : 462, 449 | -2.81>
- *idle_lb_success_count : 95, 150 | 57.89>
- *idle_lb_avg_pulled : 1.08, 1.13 | 3.92>
- ---------------------------------------- <Category newidle> ----------------------------------------
- newidle_lb_count : 16961, 3155 | -81.40>
- newidle_lb_balanced : 15646, 2556 | -83.66>
- newidle_lb_failed : 397, 142 | -64.23>
- newidle_lb_imbalance_load : 0, 0 | 0.00>
- newidle_lb_imbalance_util : 0, 0 | 0.00>
- newidle_lb_imbalance_task : 1376, 655 | -52.40>
- newidle_lb_imbalance_misfit : 0, 0 | 0.00>
- newidle_lb_gained : 917, 457 | -50.16>
- newidle_lb_hot_gained : 0, 0 | 0.00>
- newidle_lb_nobusyq : 3, 1 | -66.67>
- newidle_lb_nobusyg : 14480, 2103 | -85.48>
- *newidle_lb_success_count : 918, 457 | -50.22>
- *newidle_lb_avg_pulled : 1.00, 1.00 | 0.11>
- --------------------------------- <Category active_load_balance()> ---------------------------------
- alb_count : 0, 1 | 0.00>
- alb_failed : 0, 0 | 0.00>
- alb_pushed : 0, 1 | 0.00>
- --------------------------------- <Category sched_balance_exec()> ----------------------------------
- sbe_count : 0, 0 | 0.00>
- sbe_balanced : 0, 0 | 0.00>
- sbe_pushed : 0, 0 | 0.00>
- --------------------------------- <Category sched_balance_fork()> ----------------------------------
- sbf_count : 0, 0 | 0.00>
- sbf_balanced : 0, 0 | 0.00>
- sbf_pushed : 0, 0 | 0.00>
- ------------------------------------------ <Wakeup Info> -------------------------------------------
- ttwu_wake_remote : 2031, 2914 | 43.48>
- ttwu_move_affine : 73, 124 | 69.86>
- ttwu_move_balance : 0, 0 | 0.00>
- ----------------------------------------------------------------------------------------------------
- OPTIONS
- -------
- Applicable to {record|latency|map|replay|script}
- -i::
- --input=<file>::
- Input file name. (default: perf.data unless stdin is a fifo)
- -v::
- --verbose::
- Be more verbose. (show symbol address, etc)
- -D::
- --dump-raw-trace=::
- Display verbose dump of the sched data.
- -f::
- --force::
- Don't complain, do it.
- OPTIONS for 'perf sched latency'
- -------------------------------
- -C::
- --CPU <n>::
- CPU to profile on.
- -p::
- --pids::
- latency stats per pid instead of per command name.
- -s::
- --sort <key[,key2...]>::
- sort by key(s): runtime, switch, avg, max
- by default it's sorted by "avg ,max ,switch ,runtime".
- OPTIONS for 'perf sched map'
- ----------------------------
- --compact::
- Show only CPUs with activity. Helps visualizing on high core
- count systems.
- --cpus::
- Show just entries with activities for the given CPUs.
- --color-cpus::
- Highlight the given cpus.
- --color-pids::
- Highlight the given pids.
- --task-name <task>::
- Map output only for the given task name(s). Separate the
- task names with a comma (without whitespace). The sched-out
- time is printed and is represented by '*-' for the given
- task name(s).
- ('-' indicates other tasks while '.' is idle).
- --fuzzy-name::
- Given task name(s) can be partially matched (fuzzy matching).
- OPTIONS for 'perf sched timehist'
- ---------------------------------
- -k::
- --vmlinux=<file>::
- vmlinux pathname
- --kallsyms=<file>::
- kallsyms pathname
- -g::
- --call-graph::
- Display call chains if present (default on).
- --max-stack::
- Maximum number of functions to display in backtrace, default 5.
- -C=::
- --cpu=::
- Only show events for the given CPU(s) (comma separated list).
- -p=::
- --pid=::
- Only show events for given process ID (comma separated list).
- -t=::
- --tid=::
- Only show events for given thread ID (comma separated list).
- -s::
- --summary::
- Show only a summary of scheduling by thread with min, max, and average
- run times (in sec) and relative stddev.
- -S::
- --with-summary::
- Show all scheduling events followed by a summary by thread with min,
- max, and average run times (in sec) and relative stddev.
- --symfs=<directory>::
- Look for files with symbols relative to this directory.
- -V::
- --cpu-visual::
- Show visual aid for sched switches by CPU: 'i' marks idle time,
- 's' are scheduler events.
- -w::
- --wakeups::
- Show wakeup events.
- -M::
- --migrations::
- Show migration events.
- -n::
- --next::
- Show next task.
- -I::
- --idle-hist::
- Show idle-related events only.
- --time::
- Only analyze samples within given time window: <start>,<stop>. Times
- have the format seconds.microseconds. If start is not given (i.e., time
- string is ',x.y') then analysis starts at the beginning of the file. If
- stop time is not given (i.e, time string is 'x.y,') then analysis goes
- to end of file.
- --state::
- Show task state when it switched out.
- --show-prio::
- Show task priority.
- --prio::
- Only show events for given task priority(ies). Multiple priorities can be
- provided as a comma-separated list with no spaces: 0,120. Ranges of
- priorities are specified with -: 120-129. A combination of both can also be
- provided: 0,120-129.
- -P::
- --pre-migrations::
- Show pre-migration wait time. pre-migration wait time is the time spent
- by a task waiting on a runqueue but not getting the chance to run there
- and is migrated to a different runqueue where it is finally run. This
- time between sched_wakeup and migrate_task is the pre-migration wait
- time.
- OPTIONS for 'perf sched replay'
- ------------------------------
- -r::
- --repeat <n>::
- repeat the workload n times (0: infinite). Default is 10.
- SEE ALSO
- --------
- linkperf:perf-record[1]
|