.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support for efficiently
filtering syscalls based on memory regions, and it doesn't support
removing filters.  Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application controls a switch that
indicates the current personality of the process.  A multiple-personality
application can flip this switch, without invoking the kernel, whenever
it crosses a compatibility layer API boundary.  With redirection
disabled, syscalls are executed directly; with redirection enabled, they
are sent to userspace for emulation through a SIGSYS.

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to
change personality every time the compatibility layer executes.  Instead,
a userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.

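
The personality flip described above amounts to a single byte store.  A
minimal sketch, assuming the selector byte has already been registered
with the kernel (registration is covered in the Interface section).  The
names ``sud_selector``, ``enter_foreign_code()`` and
``enter_native_code()`` are hypothetical, and the fallback constant
values are assumed to mirror ``<linux/prctl.h>``:

```c
/* Selector values; fallbacks assumed to mirror <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical per-thread selector byte that the kernel reads on
 * syscall entry once it has been registered. */
static _Thread_local volatile unsigned char sud_selector =
	SYSCALL_DISPATCH_FILTER_ALLOW;

/* Crossing into the non-native part: syscalls get redirected. */
static inline void enter_foreign_code(void)
{
	sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
}

/* Crossing back into native code: syscalls execute directly. */
static inline void enter_native_code(void)
{
	sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
}
```

Neither helper enters the kernel, which is what keeps the boundary
crossing cheap.
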
There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any part of the syscall ABI to do the filtering.  It
uses only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is PR_SYS_DISPATCH_EXCLUSIVE_ON or PR_SYS_DISPATCH_INCLUSIVE_ON to
enable the mechanism for the whole thread, or PR_SYS_DISPATCH_OFF to
disable it.  When PR_SYS_DISPATCH_OFF is used, the other fields must be
zero.

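
The prctl above could be wrapped as in the following sketch.  The helper
names ``sud_enable()`` and ``sud_disable()`` are hypothetical, and the
fallback constant values are assumptions intended to mirror
``<linux/prctl.h>`` for toolchains whose headers predate the feature; the
<offset>, <length> and [selector] arguments are described in the text
that follows.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to mirror <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_EXCLUSIVE_ON
#define PR_SYS_DISPATCH_EXCLUSIVE_ON 1
#endif
#ifndef PR_SYS_DISPATCH_INCLUSIVE_ON
#define PR_SYS_DISPATCH_INCLUSIVE_ON 2
#endif

/*
 * Hypothetical helper: enable dispatch for the calling thread.
 * Returns 0 on success, or -1 with errno set (for example when the
 * running kernel does not support the requested <op>).
 * The selector byte must stay valid for as long as dispatch is enabled.
 */
int sud_enable(unsigned long op, unsigned long offset,
	       unsigned long length, volatile unsigned char *selector)
{
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, op, offset, length,
		     (unsigned long)selector);
}

/* Hypothetical helper: disable dispatch; all other fields are zero. */
int sud_disable(void)
{
	return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
		     0UL, 0UL, 0UL);
}
```

Callers should check the return value, since the prctl fails on kernels
built without the feature.
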
For PR_SYS_DISPATCH_EXCLUSIVE_ON, [<offset>, <offset>+<length>) delimits
a memory region from which syscalls are always executed directly,
regardless of the userspace selector.  This provides a fast path for the
C library, which includes the most common syscall dispatchers in native
code applications, and also provides a way for the signal handler to
return without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of
this interface should make sure that at least the signal trampoline code
is included in this region.  In addition, on architectures that implement
the trampoline code in the vDSO, that trampoline is never intercepted.

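
The handler-return path described above can be sketched as follows.
This is an illustration only, not the approach of any particular
compatibility layer: ``sud_handler_selector``, ``sud_last_syscall``,
``sud_sigsys_handler()`` and ``sud_install_handler()`` are hypothetical
names, ``SUD_SI_CODE`` is a local stand-in assumed to equal the kernel's
``SYS_USER_DISPATCH`` si_code, and the selector values are assumed to
mirror ``<linux/prctl.h>``.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <string.h>

#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif
/* Assumed value of SYS_USER_DISPATCH in the kernel's siginfo header. */
#define SUD_SI_CODE 2

/* Hypothetical state: the registered selector and the last trapped call. */
volatile unsigned char sud_handler_selector;
volatile sig_atomic_t sud_last_syscall = -1;

static void sud_sigsys_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code != SUD_SI_CODE)
		return;	/* SIGSYS from another source, e.g. seccomp */

	/* Let the handler itself issue native syscalls while emulating. */
	sud_handler_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

	/* A real layer would emulate the call here and store the result
	 * in the saved register state reachable through ucontext. */
	sud_last_syscall = info->si_syscall;

	/* Re-arm before returning.  This is only safe when the signal
	 * trampoline lies inside the always-allowed region, so that the
	 * (rt_)sigreturn it issues is not intercepted again. */
	sud_handler_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
}

/* Install the handler; returns 0 on success, -1 on error. */
int sud_install_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sud_sigsys_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSYS, &sa, NULL);
}
```

If the trampoline were outside the allowed region, restoring the selector
to the blocking value inside the handler would cause the return itself to
raise another SIGSYS.
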
For PR_SYS_DISPATCH_INCLUSIVE_ON, [<offset>, <offset>+<length>) delimits
a memory region from which syscalls are dispatched based on the
userspace selector.  Syscalls from outside of the range are always
executed directly.

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable and disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  The
selector can be set to SYSCALL_DISPATCH_FILTER_ALLOW or
SYSCALL_DISPATCH_FILTER_BLOCK.  Any other value terminates the program
with a SIGSYS.

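
Taken together, the range and selector rules above can be modelled in
userspace as in the sketch below.  It is not kernel code; the enum names,
the ``SUD_FILTER_*`` constants and ``sud_decide()`` are hypothetical and
exist only to make the decision order explicit.  The model assumes a
selector was supplied.

```c
#include <stdint.h>

/* Hypothetical names used only by this model. */
enum sud_mode { SUD_OFF, SUD_EXCLUSIVE, SUD_INCLUSIVE };
enum sud_action { SUD_EXECUTE, SUD_SIGSYS, SUD_KILL };

#define SUD_FILTER_ALLOW 0
#define SUD_FILTER_BLOCK 1

/*
 * Decide what happens to a syscall issued at instruction pointer ip,
 * given the configured mode, the region [offset, offset + length) and
 * the current value of the selector byte.
 */
enum sud_action sud_decide(enum sud_mode mode, uintptr_t offset,
			   uintptr_t length, uintptr_t ip,
			   unsigned char selector)
{
	/* Unsigned subtraction tests offset <= ip < offset + length
	 * with a single comparison and no overflow. */
	int in_range = (ip - offset) < length;

	if (mode == SUD_OFF)
		return SUD_EXECUTE;
	if (mode == SUD_EXCLUSIVE && in_range)
		return SUD_EXECUTE;	/* region bypasses the selector */
	if (mode == SUD_INCLUSIVE && !in_range)
		return SUD_EXECUTE;	/* only the region is filtered */

	switch (selector) {
	case SUD_FILTER_ALLOW:
		return SUD_EXECUTE;
	case SUD_FILTER_BLOCK:
		return SUD_SIGSYS;	/* hand over to the SIGSYS handler */
	default:
		return SUD_KILL;	/* invalid selector value */
	}
}
```

Because the interval is half-open, with an offset of 0x1000 and a length
of 0x100 the address 0x10ff is inside the region and 0x1100 is outside.
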
Additionally, a task's syscall user dispatch configuration can be peeked
and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
requests.  This is useful for checkpoint/restart software.

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.