=====================
Restartable Sequences
=====================

Restartable Sequences allow a thread to register a per-thread userspace
memory area that is used as an ABI between kernel and userspace for three
purposes:

 * userspace restartable sequences

 * quick access to read the current CPU number and node ID from userspace

 * scheduler time slice extensions

Restartable sequences (per-cpu atomics)
---------------------------------------

Restartable sequences allow userspace to perform update operations on
per-cpu data without requiring heavyweight atomic operations. The actual
ABI is unfortunately only available in the code and selftests.

Quick access to CPU number, node ID
-----------------------------------

This allows per-CPU data to be implemented efficiently. The documentation
is currently only available in the code and selftests.

Scheduler time slice extensions
-------------------------------

This allows a thread to request a time slice extension when it enters a
critical section to avoid contention on a resource when the thread is
scheduled out inside of the critical section.

The prerequisites for this functionality are:

    * Enabled in Kconfig
    * Enabled at boot time (default is enabled)
    * A rseq userspace pointer has been registered for the thread

The thread has to enable the functionality via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
          PR_RSEQ_SLICE_EXT_ENABLE, 0, 0);

prctl() returns 0 on success. Otherwise it fails with one of the following
error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg4 and arg5 must be zero
ENOTSUPP  Functionality was disabled on the kernel command line
ENXIO     Available, but no rseq user struct registered
========= ==============================================================

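The error handling for the enable request can be sketched as follows. This
is a minimal illustration, not the reference implementation: the helper
names ``slice_ext_enable()`` and ``slice_ext_enable_strerror()`` are
hypothetical, the fallback to ``ENOSYS`` for old build headers is a choice of
this sketch, and it assumes that the kernel-internal ``ENOTSUPP`` reaches
userspace as the raw errno value 524, because libc headers do not define it.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * ENOTSUPP is a kernel-internal error number (524) that libc headers do
 * not define. Assumption: it reaches userspace as this raw errno value.
 */
#ifndef ENOTSUPP
#define ENOTSUPP 524
#endif

/* Map an errno value from the enable request to its documented meaning. */
const char *slice_ext_enable_strerror(int err)
{
    switch (err) {
    case 0:
        return "enabled";
    case EINVAL:
        return "not available or invalid arguments";
    case ENOTSUPP:
        return "disabled on the kernel command line";
    case ENXIO:
        return "no rseq user struct registered";
    default:
        return "unexpected error";
    }
}

/*
 * Hypothetical helper: returns 0 on success or a positive errno value.
 * If the installed headers do not know the prctl constants, ENOSYS is
 * reported. That is a choice of this sketch, not a kernel return code.
 */
int slice_ext_enable(void)
{
#ifdef PR_RSEQ_SLICE_EXTENSION
    if (prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET,
              PR_RSEQ_SLICE_EXT_ENABLE, 0, 0) == 0)
        return 0;
    return errno;
#else
    return ENOSYS;
#endif
}
```

A caller would typically log ``slice_ext_enable_strerror(slice_ext_enable())``
once per thread and fall back to running without extensions on any non-zero
result.
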
The state can also be queried via prctl(2)::

    prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0);

prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if
disabled. Otherwise it fails with one of the following error codes:

========= ==============================================================
Errorcode Meaning
========= ==============================================================
EINVAL    Functionality not available or invalid function arguments.
          Note: arg3, arg4 and arg5 must be zero
========= ==============================================================

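Because the query has three possible outcomes (enabled, disabled, failure),
it can help to fold the return value into one state. The following is a
small sketch under stated assumptions: ``enum slice_ext_state`` and
``slice_ext_classify()`` are hypothetical names, and the value of
``PR_RSEQ_SLICE_EXT_ENABLE`` is passed in as a parameter so that the sketch
does not hard-code a constant.

```c
enum slice_ext_state {
    SLICE_EXT_DISABLED,
    SLICE_EXT_ENABLED,
    SLICE_EXT_UNAVAILABLE,
};

/*
 * ret:        return value of the PR_RSEQ_SLICE_EXTENSION_GET prctl()
 * enable_bit: value of PR_RSEQ_SLICE_EXT_ENABLE from the UAPI header
 */
enum slice_ext_state slice_ext_classify(long ret, long enable_bit)
{
    if (ret < 0)
        return SLICE_EXT_UNAVAILABLE;   /* prctl() failed, see errno */
    if (ret & enable_bit)
        return SLICE_EXT_ENABLED;
    return SLICE_EXT_DISABLED;
}
```
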
The availability and status are also exposed in the flags field of the rseq
ABI struct via the ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and the
``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user
space and only for informational purposes.

If the mechanism was enabled via prctl(), the thread can request a time
slice extension by setting rseq::slice_ctrl::request to 1. If the thread is
interrupted and the interrupt results in a reschedule request in the
kernel, then the kernel can grant a time slice extension and return to
userspace instead of scheduling out. The length of the extension is
determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usec,
which is also the minimum value. It can be increased up to 50 usec, but
doing so can affect the minimum scheduling latency.

Any proposed changes to this default will have to come with a selftest and
rseq-slice-hist.py output that shows the new value has merit.

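The documented limits can be expressed as a simple range check that a tuning
script or test might apply before writing a value to debugfs. This is only a
userspace pre-check; the macro and function names are hypothetical, and how
the kernel treats out-of-range writes is not specified here.

```c
#include <stdbool.h>

/* Limits from the text: 5 usec minimum (and default), 50 usec maximum. */
#define SLICE_EXT_NSEC_MIN   5000UL
#define SLICE_EXT_NSEC_MAX  50000UL

/* Return true if nsec lies within the documented range. */
bool slice_ext_nsec_valid(unsigned long nsec)
{
    return nsec >= SLICE_EXT_NSEC_MIN && nsec <= SLICE_EXT_NSEC_MAX;
}

/* Clamp a requested value into the documented range. */
unsigned long slice_ext_nsec_clamp(unsigned long nsec)
{
    if (nsec < SLICE_EXT_NSEC_MIN)
        return SLICE_EXT_NSEC_MIN;
    if (nsec > SLICE_EXT_NSEC_MAX)
        return SLICE_EXT_NSEC_MAX;
    return nsec;
}
```
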
The kernel indicates the grant by clearing rseq::slice_ctrl::request and
setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the
thread after granting the extension, the kernel clears the granted bit to
indicate that to userspace.

If the request bit is still set when leaving the critical section,
userspace can clear it and continue.

If the granted bit is set, then userspace invokes rseq_slice_yield(2) when
leaving the critical section to relinquish the CPU. The kernel enforces
this by arming a timer to prevent misbehaving userspace from abusing this
mechanism.

If both the request bit and the granted bit are false when leaving the
critical section, then this indicates that a grant was revoked and no
further action is required by userspace.

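The three cases at the end of the critical section can be summarized as a
small decision function. This is an illustrative model only: the enum and
function names are hypothetical, and the two booleans stand for the values
of the request and granted bits as observed by userspace.

```c
#include <stdbool.h>

enum slice_exit_action {
    SLICE_EXIT_CLEAR_REQUEST,   /* no grant happened: clear request, continue */
    SLICE_EXIT_YIELD,           /* grant active: call rseq_slice_yield(2) */
    SLICE_EXIT_NOTHING,         /* grant was revoked: nothing to do */
};

/*
 * Decide what to do when leaving the critical section. The granted bit
 * is checked first because an active grant obliges the thread to yield.
 */
enum slice_exit_action slice_exit_decide(bool request, bool granted)
{
    if (granted)
        return SLICE_EXIT_YIELD;
    if (request)
        return SLICE_EXIT_CLEAR_REQUEST;
    return SLICE_EXIT_NOTHING;
}
```
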
The required code flow is as follows::

    rseq->slice_ctrl.request = 1;
    barrier();  // Prevent compiler reordering
    critical_section();
    barrier();  // Prevent compiler reordering
    rseq->slice_ctrl.request = 0;
    if (rseq->slice_ctrl.granted)
            rseq_slice_yield();

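The same flow can be exercised without kernel support by substituting mocks
for the rseq area and the yield syscall. In the sketch below,
``struct mock_slice_ctrl`` is a hypothetical stand-in for rseq::slice_ctrl
(the real layout is defined by the UAPI header), ``mock_yield()`` replaces
rseq_slice_yield(2), and ``section_with_grant()`` imitates what the kernel
would do when it grants an extension during the critical section.

```c
#include <stdint.h>

/* Compiler-only barrier (GCC/Clang syntax). */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical stand-in for rseq::slice_ctrl, not the real ABI layout. */
struct mock_slice_ctrl {
    volatile uint8_t request;
    volatile uint8_t granted;
};

typedef void (*slice_fn)(struct mock_slice_ctrl *ctrl);

/* The documented flow, with the section and the yield passed in. */
void run_with_slice_ext(struct mock_slice_ctrl *ctrl,
                        slice_fn critical_section, slice_fn yield)
{
    ctrl->request = 1;
    barrier();              /* Prevent compiler reordering */
    critical_section(ctrl);
    barrier();              /* Prevent compiler reordering */
    ctrl->request = 0;
    if (ctrl->granted)
        yield(ctrl);
}

int yield_calls;            /* counts invocations of the mock yield */

/* Mock of rseq_slice_yield(2): count the call and drop the grant. */
void mock_yield(struct mock_slice_ctrl *ctrl)
{
    yield_calls++;
    ctrl->granted = 0;
}

/* Imitates a kernel grant: request is cleared, granted is set. */
void section_with_grant(struct mock_slice_ctrl *ctrl)
{
    ctrl->request = 0;
    ctrl->granted = 1;
}

/* Critical section that completes without being interrupted. */
void section_without_grant(struct mock_slice_ctrl *ctrl)
{
    (void)ctrl;
}
```

Running the flow with ``section_without_grant()`` never calls the yield
function, while running it with ``section_with_grant()`` calls it exactly
once.
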
As all of this is strictly CPU local, there are no atomicity requirements.
Checking the granted state is racy, but that cannot be avoided at all::

    if (rseq->slice_ctrl.granted)
      -> Interrupt results in schedule and grant revocation
            rseq_slice_yield();

So there is no point in pretending that this might be solved by an atomic
operation.

If the thread issues a syscall other than rseq_slice_yield(2) within the
granted timeslice extension, the grant is also revoked and the CPU is
relinquished immediately when entering the kernel. This is required as
syscalls might consume arbitrary CPU time until they reach a scheduling
point when the preemption model is either NONE or VOLUNTARY and therefore
might exceed the grant by far.

The preferred solution for user space is to use rseq_slice_yield(2), which
is side effect free. The support for arbitrary syscalls is required to
support applications with an onion-layer architecture, where the code
handling the critical section and requesting the time slice extension has
no control over the code within the critical section.

The kernel enforces flag consistency and terminates the thread with SIGSEGV
if it detects a violation.