| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281 |
- .. SPDX-License-Identifier: GPL-2.0-only
- .. include:: <isonum.txt>
- =========
- AMD NPU
- =========
- :Copyright: |copy| 2024 Advanced Micro Devices, Inc.
- :Author: Sonal Santan <sonal.santan@amd.com>
- Overview
- ========
- AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
- integrated into AMD client APU. NPU enables efficient execution of Machine
- Learning applications like CNN, LLM, etc. NPU is based on
- `AMD XDNA Architecture`_. NPU is managed by **amdxdna** driver.
- Hardware Description
- ====================
- AMD NPU consists of the following hardware components:
- AMD XDNA Array
- --------------
- AMD XDNA Array comprises of 2D array of compute and memory tiles built with
- `AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
- row of memory tile. Each compute tile contains a VLIW processor with its own
- dedicated program and data memory. The memory tile acts as L2 memory. The 2D
- array can be partitioned at a column boundary creating a spatially isolated
- partition which can be bound to a workload context.
- Each column also has dedicated DMA engines to move data between host DDR and
- memory tile.
- AMD Phoenix and AMD Hawk Point client NPU have a 4x5 topology, i.e., 4 rows of
- compute tiles arranged into 5 columns. AMD Strix Point client APU have 4x8
- topology, i.e., 4 rows of compute tiles arranged into 8 columns.
- Shared L2 Memory
- ----------------
- The single row of memory tiles create a pool of software managed on chip L2
- memory. DMA engines are used to move data between host DDR and memory tiles.
- AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2 memory.
- AMD Strix Point NPU has a total of 4096 KB of L2 memory.
- Microcontroller
- ---------------
- A microcontroller runs NPU Firmware which is responsible for command processing,
- XDNA Array partition setup, XDNA Array configuration, workload context
- management and workload orchestration.
- NPU Firmware uses a dedicated instance of an isolated non-privileged context
- called ERT to service each workload context. ERT is also used to execute user
- provided ``ctrlcode`` associated with the workload context.
- NPU Firmware uses a single isolated privileged context called MERT to service
- management commands from the amdxdna driver.
- Mailboxes
- ---------
- The microcontroller and amdxdna driver use a privileged channel for management
- tasks like setting up of contexts, telemetry, query, error handling, setting up
- user channel, etc. As mentioned before, privileged channel requests are
- serviced by MERT. The privileged channel is bound to a single mailbox.
- The microcontroller and amdxdna driver use a dedicated user channel per
- workload context. The user channel is primarily used for submitting work to
- the NPU. As mentioned before, a user channel requests are serviced by an
- instance of ERT. Each user channel is bound to its own dedicated mailbox.
- PCIe EP
- -------
- NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and some
- MSI-X interrupt vectors. NPU uses a dedicated high bandwidth SoC level fabric
- for reading or writing into host memory. Each instance of ERT gets its own
- dedicated MSI-X interrupt. MERT gets a single instance of MSI-X interrupt.
- The number of PCIe BARs varies depending on the specific device. Based on their
- functions, PCIe BARs can generally be categorized into the following types.
- * PSP BAR: Expose the AMD PSP (Platform Security Processor) function
- * SMU BAR: Expose the AMD SMU (System Management Unit) function
- * SRAM BAR: Expose ring buffers for the mailbox
- * Mailbox BAR: Expose the mailbox control registers (head, tail and ISR
- registers etc.)
- * Public Register BAR: Expose public registers
- On specific devices, the above-mentioned BAR type might be combined into a
- single physical PCIe BAR. Or a module might require two physical PCIe BARs to
- be fully functional. For example,
- * On AMD Phoenix device, PSP, SMU, Public Register BARs are on PCIe BAR index 0.
- * On AMD Strix Point device, Mailbox and Public Register BARs are on PCIe BAR
- index 0. The PSP has some registers in PCIe BAR index 0 (Public Register BAR)
- and PCIe BAR index 4 (PSP BAR).
- Process Isolation Hardware
- --------------------------
- As explained before, XDNA Array can be dynamically divided into isolated
- spatial partitions, each of which may have one or more columns. The spatial
- partition is setup by programming the column isolation registers by the
- microcontroller. Each spatial partition is associated with a PASID which is
- also programmed by the microcontroller. Hence multiple spatial partitions in
- the NPU can make concurrent host access protected by PASID.
- The NPU FW itself uses microcontroller MMU enforced isolated contexts for
- servicing user and privileged channel requests.
- Mixed Spatial and Temporal Scheduling
- =====================================
- AMD XDNA architecture supports mixed spatial and temporal (time sharing)
- scheduling of 2D array. This means that spatial partitions may be setup and
- torn down dynamically to accommodate various workloads. A *spatial* partition
- may be *exclusively* bound to one workload context while another partition may
- be *temporarily* bound to more than one workload contexts. The microcontroller
- updates the PASID for a temporarily shared partition to match the context that
- has been bound to the partition at any moment.
- Resource Solver
- ---------------
- The Resource Solver component of the amdxdna driver manages the allocation
- of 2D array among various workloads. Every workload describes the number
- of columns required to run the NPU binary in its metadata. The Resource Solver
- component uses hints passed by the workload and its own heuristics to
- decide 2D array (re)partition strategy and mapping of workloads for spatial and
- temporal sharing of columns. The FW enforces the context-to-column(s) resource
- binding decisions made by the Resource Solver.
- AMD Phoenix and AMD Hawk Point client NPU can support 6 concurrent workload
- contexts. AMD Strix Point can support 16 concurrent workload contexts.
- Application Binaries
- ====================
- A NPU application workload is comprised of two separate binaries which are
- generated by the NPU compiler.
- 1. AMD XDNA Array overlay, which is used to configure a NPU spatial partition.
- The overlay contains instructions for setting up the stream switch
- configuration and ELF for the compute tiles. The overlay is loaded on the
- spatial partition bound to the workload by the associated ERT instance.
- Refer to the
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.
- 2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
- partition. ``ctrlcode`` is executed by the ERT running in protected mode on
- the microcontroller in the context of the workload. ``ctrlcode`` is made up
- of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
- `AI Engine Run Time`_ for more details.
- Special Host Buffers
- ====================
- Per-context Instruction Buffer
- ------------------------------
- Every workload context uses a host resident 64 MB buffer which is memory
- mapped into the ERT instance created to service the workload. The ``ctrlcode``
- used by the workload is copied into this special memory. This buffer is
- protected by PASID like all other input/output buffers used by that workload.
- Instruction buffer is also mapped into the user space of the workload.
- Global Privileged Buffer
- ------------------------
- In addition, the driver also allocates a single buffer for maintenance tasks
- like recording errors from MERT. This global buffer uses the global IOMMU
- domain and is only accessible by MERT.
- High-level Use Flow
- ===================
- Here are the steps to run a workload on AMD NPU:
- 1. Compile the workload into an overlay and a ``ctrlcode`` binary.
- 2. Userspace opens a context in the driver and provides the overlay.
- 3. The driver checks with the Resource Solver for provisioning a set of columns
- for the workload.
- 4. The driver then asks MERT to create a context on the device with the desired
- columns.
- 5. MERT then creates an instance of ERT. MERT also maps the Instruction Buffer
- into ERT memory.
- 6. The userspace then copies the ``ctrlcode`` to the Instruction Buffer.
- 7. Userspace then creates a command buffer with pointers to input, output, and
- instruction buffer; it then submits command buffer with the driver and goes
- to sleep waiting for completion.
- 8. The driver sends the command over the Mailbox to ERT.
- 9. ERT *executes* the ``ctrlcode`` in the instruction buffer.
- 10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while
- AMD XDNA Array is running.
- 11. When ERT reaches end of ``ctrlcode``, it raises an MSI-X to send completion
- signal to the driver which then wakes up the waiting workload.
- Boot Flow
- =========
- amdxdna driver uses PSP to securely load signed NPU FW and kick off the boot
- of the NPU microcontroller. amdxdna driver then waits for the alive signal in
- a special location on BAR 0. The NPU is switched off during SoC suspend and
- turned on after resume where the NPU FW is reloaded, and the handshake is
- performed again.
- Userspace components
- ====================
- Compiler
- --------
- Peano is an LLVM based open-source single core compiler for AMD XDNA Array
- compute tile. Peano is available at:
- https://github.com/Xilinx/llvm-aie
- IRON is an open-source array compiler for AMD XDNA Array based NPU which uses
- Peano underneath. IRON is available at:
- https://github.com/Xilinx/mlir-aie
- Usermode Driver (UMD)
- ---------------------
- The open-source XRT runtime stack interfaces with amdxdna kernel driver. XRT
- can be found at:
- https://github.com/Xilinx/XRT
- The open-source XRT shim for NPU is can be found at:
- https://github.com/amd/xdna-driver
- DMA Operation
- =============
- DMA operation instructions are encoded in the ``ctrlcode`` as
- ``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
- operations between host DDR and L2 memory are effected.
- Error Handling
- ==============
- When MERT detects an error in AMD XDNA Array, it pauses execution for that
- workload context and sends an asynchronous message to the driver over the
- privileged channel. The driver then sends a buffer pointer to MERT to capture
- the register states for the partition bound to faulting workload context. The
- driver then decodes the error by reading the contents of the buffer pointer.
- Telemetry
- =========
- MERT can report various kinds of telemetry information like the following:
- * L1 interrupt counter
- * DMA counter
- * Deep Sleep counter
- * etc.
- References
- ==========
- - `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- - `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- - `Peano <https://github.com/Xilinx/llvm-aie>`_
- - `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- - `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_
|