.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back. Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.
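
The exported topology can also be read directly from sysfs. The following is a
minimal sketch, assuming the usual layout under
/sys/devices/system/cpu/cpuN/cache/indexM/ with the files level, type and
shared_cpu_list. The helper names parse_cpu_list() and cache_sharing() are
made up for this illustration and are not part of any kernel or hwloc API.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        first, _, last = part.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def cache_sharing(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Return {(level, type): CPUs sharing that cache} for one CPU.

    sysfs_root is a parameter only so the function can be tried against a
    copy of the directory tree; the default is the assumed kernel location.
    """
    result = {}
    cache_dir = Path(sysfs_root, f"cpu{cpu}", "cache")
    for index in sorted(cache_dir.glob("index*")):
        level = int((index / "level").read_text())
        kind = (index / "type").read_text().strip()
        shared = parse_cpu_list((index / "shared_cpu_list").read_text())
        result[(level, kind)] = shared
    return result
```

Calling cache_sharing(0) on a running system lists, per cache level, the CPUs
that share a cache with CPU 0. Two tasks placed on CPUs that do not appear in
each other's sets for a given level do not compete for that cache.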

Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).

These features can be configured through the Linux Resource Control interface.
For details, see Documentation/filesystems/resctrl.rst.
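
As an illustration of what such a configuration looks like, the sketch below
builds a schemata line and writes it to a resource group. It assumes that
resctrl is mounted at /sys/fs/resctrl, that the caller has root privileges, and
that the hardware expects contiguous capacity bitmasks (common on Intel, but
not universal). The function names are hypothetical; resctrl.rst remains the
authoritative description of the file formats.

```python
from pathlib import Path


def is_contiguous(mask):
    """True if the set bits of mask form a single unbroken run."""
    if mask <= 0:
        return False
    low = mask & -mask  # lowest set bit
    return ((mask + low) & mask) == 0


def schemata_line(resource, masks):
    """Build a line such as "L3:0=f;1=f0" from {cache_id: bitmask}."""
    for cache_id, mask in masks.items():
        if not is_contiguous(mask):
            raise ValueError(f"mask {mask:#x} for cache {cache_id} has gaps")
    fields = ";".join(f"{cid}={mask:x}" for cid, mask in sorted(masks.items()))
    return f"{resource}:{fields}"


def apply_group(name, schemata, pids, root="/sys/fs/resctrl"):
    """Create a resource group, set its schemata and move tasks into it.

    root defaults to the assumed mount point of the resctrl filesystem.
    """
    group = Path(root, name)
    group.mkdir(exist_ok=True)
    (group / "schemata").write_text(schemata + "\n")
    for pid in pids:
        # One PID per write, as with "echo PID > tasks".
        (group / "tasks").write_text(f"{pid}\n")
```

For example, apply_group("rt", schemata_line("L3", {0: 0xf}), [1234]) would
reserve the four lowest L3 ways of cache domain 0 for the (hypothetical) task
1234, provided the remaining groups are restricted to the other ways.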

The perf tool can be used to monitor cache behavior. It can analyze the
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. The more specialized perf c2c
tool can help identify cache-to-cache issues, where multiple CPU
cores repeatedly access and modify data on the same cache line.

Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers. A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.
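
Whether ASPM may be in effect can be checked from user space. The sketch below
assumes a kernel built with ASPM support, which exposes the global policy in
/sys/module/pcie_aspm/parameters/policy as a list of names with the active one
in square brackets. The helper names are invented for this example.

```python
from pathlib import Path


def parse_aspm_policy(text):
    """Return (active, available) from text like "[default] performance"."""
    active = None
    available = []
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            word = word[1:-1]
            active = word
        available.append(word)
    return active, available


def current_aspm_policy(path="/sys/module/pcie_aspm/parameters/policy"):
    """Read the active ASPM policy, or None if the file is not present."""
    policy_file = Path(path)
    if not policy_file.exists():
        return None
    return parse_aspm_policy(policy_file.read_text())[0]
```

A result of "performance" indicates that ASPM link states are disabled by
policy. Any other value means that link power saving, and with it additional
wake-up latency, may apply to individual devices.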

Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host. If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and is therefore unable to make progress. Even if a CPU has
been isolated, the system may still (accidentally) start a per-CPU thread on
that CPU.
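
To see which threads of a process run with a real-time policy, for example the
vCPU threads of a virtual machine monitor, the scheduling attributes can be
queried per thread. This is a rough sketch that assumes a Linux host with
/proc mounted; thread_policies() is a name chosen for this example only.

```python
import os

POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE: "SCHED_IDLE",
}


def thread_policies(pid):
    """Return {tid: (policy name, real-time priority)} for all threads."""
    result = {}
    for entry in os.listdir(f"/proc/{pid}/task"):
        tid = int(entry)
        try:
            policy = os.sched_getscheduler(tid)
            priority = os.sched_getparam(tid).sched_priority
        except (ProcessLookupError, PermissionError):
            continue  # thread exited meanwhile or is not accessible
        name = POLICY_NAMES.get(policy, f"policy {policy}")
        result[tid] = (name, priority)
    return result
```

Threads reported as SCHED_FIFO or SCHED_RR are the ones that can keep
SCHED_OTHER threads on the same CPU from running.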

Ensuring that a guest CPU goes idle is difficult, as it requires avoiding both
task scheduling and interrupt handling. Furthermore, if the guest CPU does go
idle but the guest system is booted with the option **idle=poll**, the guest
CPU will never enter an idle state and will instead spin until an event
arrives.

Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.
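
Whether a device offers virtual functions can be checked through sysfs. The
following sketch assumes the sriov_totalvfs and sriov_numvfs attributes in the
device directory under /sys/bus/pci/devices/. The function names and the
device address used in the note below are placeholders, and changing the number
of virtual functions requires root privileges and driver support.

```python
from pathlib import Path


def sriov_status(bdf, root="/sys/bus/pci/devices"):
    """Return (enabled VFs, maximum VFs), or None without SR-IOV support."""
    device = Path(root, bdf)
    total = device / "sriov_totalvfs"
    if not total.exists():
        return None
    enabled = int((device / "sriov_numvfs").read_text())
    return enabled, int(total.read_text())


def set_num_vfs(bdf, count, root="/sys/bus/pci/devices"):
    """Request count virtual functions for the device.

    The value is reset to 0 first because the kernel is assumed to reject
    a direct change from one non-zero value to another.
    """
    numvfs = Path(root, bdf, "sriov_numvfs")
    numvfs.write_text("0\n")
    if count:
        numvfs.write_text(f"{count}\n")
```

With a placeholder address, sriov_status("0000:03:00.0") returning (0, 8)
would mean that the device supports up to eight virtual functions and that
none are currently enabled.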

Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver supports XDP, preferably using
an "skb pool", and that the application uses an XDP socket. Additional
configuration may involve BPF filters, tuning networking queues, or configuring
qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
(https://github.com/Linutronix/RTC-Testbench).
The goal of this project is to validate real-time network communication. It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.
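
To give an idea of what a cyclic network measurement looks like, the sketch
below sends UDP packets over the loopback interface at a fixed interval and
records the round-trip time of each one. It is not taken from the RTC-Testbench
and does not use XDP or TSN features; it only illustrates the measurement loop,
and all names and default values are assumptions made for this example.

```python
import socket
import time


def udp_round_trips(count=100, interval_s=0.001, payload=b"x" * 64):
    """Echo packets between two local sockets; return round trips in ns."""
    echo = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    samples = []
    try:
        echo.bind(("127.0.0.1", 0))
        sender.bind(("127.0.0.1", 0))
        echo.settimeout(1.0)
        sender.settimeout(1.0)
        target = echo.getsockname()
        for _ in range(count):
            start = time.monotonic_ns()
            sender.sendto(payload, target)
            data, addr = echo.recvfrom(2048)  # receive on the echo side
            echo.sendto(data, addr)           # and send it straight back
            sender.recvfrom(2048)
            samples.append(time.monotonic_ns() - start)
            time.sleep(interval_s)
    finally:
        echo.close()
        sender.close()
    return samples


def summarize(samples):
    """Return (minimum, average, maximum) of the samples in nanoseconds."""
    return min(samples), sum(samples) // len(samples), max(samples)
```

As with cyclictest, the maximum is the value of interest: it shows the worst
delay observed while the system was under the chosen load. A real test would
use two machines, the actual network interface and a real-time scheduling
policy for the measuring thread.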