Aarav90-cpu
/
ARK-OS


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132
							.. SPDX-License-Identifier: GPL-2.0

====================
Considering hardware
====================

:Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

The way a workload is handled can be influenced by the hardware it runs on.
Key components include the CPU, memory, and the buses that connect them.
These resources are shared among all applications on the system.
As a result, heavy utilization of one resource by a single application
can affect the deterministic handling of workloads in other applications.

Below is a brief overview.

System memory and cache
-----------------------

Main memory and the associated caches are the most common shared resources among
tasks in a system. One task can dominate the available caches, forcing another
task to wait until a cache line is written back to main memory before it can
proceed. The impact of this contention varies based on write patterns and the
size of the caches available. Larger caches may reduce stalls because more lines
can be buffered before being written back. Conversely, certain write patterns
may trigger the cache controller to flush many lines at once, causing
applications to stall until the operation completes.

This issue can be partly mitigated if applications do not share the same CPU
cache. The kernel is aware of the cache topology and exports this information to
user space. Tools such as **lstopo** from the Portable Hardware Locality (hwloc)
project (https://www.open-mpi.org/projects/hwloc/) can visualize the hierarchy.

Avoiding shared L2 or L3 caches is not always possible. Even when cache sharing
is minimized, bottlenecks can still occur when accessing system memory. Memory
is used not only by the CPU but also by peripheral devices via DMA, such as
graphics cards or network adapters.

In some cases, cache and memory bottlenecks can be controlled if the hardware
provides the necessary support. On x86 systems, Intel offers Cache Allocation
Technology (CAT), which enables cache partitioning among applications and
provides control over the interconnect. AMD provides similar functionality under
Platform Quality of Service (PQoS). On Arm64, the equivalent is Memory
System Resource Partitioning and Monitoring (MPAM).

These features can be configured through the Linux Resource Control interface.
For details, see Documentation/filesystems/resctrl.rst.

The perf tool can be used to monitor cache behavior. It can analyze
cache misses of an application and compare how they change under
different workloads on a neighboring CPU. Even more powerful, the perf
c2c tool can help identify cache-to-cache issues, where multiple CPU
cores repeatedly access and modify data on the same cache line.

Hardware buses
--------------

Real-time systems often need to access hardware directly to perform their work.
Any latency in this process is undesirable, as it can affect the outcome of the
task. For example, on an I/O bus, a changed output may not become immediately
visible but instead appear with variable delay depending on the latency of the
bus used for communication.

A bus such as PCI is relatively simple because register accesses are routed
directly to the connected device. In the worst case, a read operation stalls the
CPU until the device responds.

A bus such as USB is more complex, involving multiple layers. A register read
or write is wrapped in a USB Request Block (URB), which is then sent by the
USB host controller to the device. Timing and latency are influenced by the
underlying USB bus. Requests cannot be sent immediately; they must align with
the next frame boundary according to the endpoint type and the host controller's
scheduling rules. This can introduce delays and additional latency. For example,
a network device connected via USB may still deliver sufficient throughput, but
the added latency when sending or receiving packets may fail to meet the
requirements of certain real-time use cases.

Additional restrictions on bus latency can arise from power management. For
instance, PCIe with Active State Power Management (ASPM) enabled can suspend
the link between the device and the host. While this behavior is beneficial for
power savings, it delays device access and adds latency to responses. This issue
is not limited to PCIe; internal buses within a System-on-Chip (SoC) can also be
affected by power management mechanisms.

Virtualization
--------------

In a virtualized environment such as KVM, each guest CPU is represented as a
thread on the host. If such a thread runs with real-time priority, the system
should be tested to confirm it can sustain this behavior over extended periods.
Because of its priority, the thread will not be preempted by lower-priority
threads (such as SCHED_OTHER), which may then receive no CPU time. This can
cause problems if a lower-priority thread is pinned to a CPU already occupied by
a real-time task and unable to make progress. Even if a CPU has been isolated,
the system may still (accidentally) start a per‑CPU thread on that CPU.
Ensuring that a guest CPU goes idle is difficult, as it requires avoiding both
task scheduling and interrupt handling. Furthermore, if the guest CPU does go
idle but the guest system is booted with the option **idle=poll**, the guest
CPU will never enter an idle state and will instead spin until an event
arrives.

Device handling introduces additional considerations. Emulated PCI devices or
VirtIO devices require a counterpart on the host to complete requests. This
adds latency because the host must intercept and either process the request
directly or schedule a thread for its completion. These delays can be avoided if
the required PCI device is passed directly through to the guest. Some devices,
such as networking or storage controllers, support the PCIe SR-IOV feature.
SR-IOV allows a single PCIe device to be divided into multiple virtual functions,
which can then be assigned to different guests.

Networking
----------

For low-latency networking, the full networking stack may be undesirable, as it
can introduce additional sources of delay. In this context, XDP can be used
as a shortcut to bypass much of the stack while still relying on the kernel's
network driver.

The requirements are that the network driver must support XDP- preferably using
an "skb pool" and that the application must use an XDP socket. Additional
configuration may involve BPF filters, tuning networking queues, or configuring
qdiscs for time-based transmission. These techniques are often
applied in Time-Sensitive Networking (TSN) environments.

Documenting all required steps exceeds the scope of this text. For detailed
guidance, see the TSN documentation at https://tsn.readthedocs.io.

Another useful resource is the Linux Real-Time Communication Testbench
https://github.com/Linutronix/RTC-Testbench.
The goal of this project is to validate real-time network communication. It can
be thought of as a "cyclictest" for networking and also serves as a starting
point for application development.