- .. SPDX-License-Identifier: GPL-2.0
- =============
- Devlink DPIPE
- =============
- Background
- ==========
- While performing the hardware offloading process, many of the hardware
- specifics cannot be presented. These details are useful for debugging, and
- ``devlink-dpipe`` provides a standardized way to gain visibility into the
- offloading process.
- For example, the routing longest prefix match (LPM) algorithm used by the
- Linux kernel may differ from the hardware implementation. The pipeline debug
- API (DPIPE) is aimed at providing the user visibility into the ASIC's
- pipeline in a generic way.
- The hardware offload process is expected to be done in such a way that the
- user cannot distinguish between the hardware and software implementations.
- In this process, hardware specifics are neglected. In reality, those details
- can carry a lot of meaning and should be exposed in some standard way.
- This problem is made even more complex when one wishes to offload the
- control path of the whole networking stack to a switch ASIC. Due to
- differences in the hardware and software models, some processes cannot be
- represented correctly.
- One example is the kernel's LPM algorithm, which in many cases differs
- greatly from the hardware implementation. The configuration API is the same,
- but one cannot rely on the Forwarding Information Base (FIB) in hardware to
- look like the kernel's Level Path Compression trie (LPC-trie).
- In many situations, analyzing a system failure based solely on the
- kernel's dump may not be enough. By combining this data with complementary
- information about the underlying hardware, debugging can be made
- easier; additionally, the information can be useful when debugging
- performance issues.
- Overview
- ========
- The ``devlink-dpipe`` interface closes this gap. The hardware's pipeline is
- modeled as a graph of match/action tables. Each table represents a specific
- hardware block. This model is not new, first being used by the P4 language.
- Traditionally it has been used as an alternative model for hardware
- configuration, but the ``devlink-dpipe`` interface uses it for visibility
- purposes as a standard complementary tool. The system's view from
- ``devlink-dpipe`` should change according to the changes done by the
- standard configuration tools.
- For example, it’s quite common to implement Access Control Lists (ACL)
- using Ternary Content Addressable Memory (TCAM). The TCAM can be
- divided into TCAM regions. Complex TC filters can have multiple rules with
- different priorities and different lookup keys. Hardware TCAM regions, on
- the other hand, have a predefined lookup key. Offloading the TC filter rules
- using the TCAM engine can result in multiple TCAM regions being
- interconnected in a chain (which may affect the data path latency). In
- response to a new TC filter, new tables should be created describing those
- regions.
- Model
- =====
- The ``DPIPE`` model introduces several objects:
- * headers
- * tables
- * entries
- A ``header`` describes packet formats and provides names for fields within
- the packet. A ``table`` describes hardware blocks. An ``entry`` describes
- the actual content of a specific table.
- The hardware pipeline is not port specific, but rather describes the whole
- ASIC. Thus it is tied to the top of the ``devlink`` infrastructure.
- Drivers can register and unregister tables at run time, in order to support
- dynamic behavior. This dynamic behavior is mandatory for describing hardware
- blocks like TCAM regions which can be allocated and freed dynamically.
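The relationship between the three objects, and the run-time registration of
tables, can be pictured with a small Python model. This is an illustrative
sketch only: the class names (``Header``, ``Entry``, ``Table``, ``Pipeline``),
their attributes and the table name ``tcam_region_0`` are hypothetical and do
not mirror the kernel's data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Header:
    """A packet format and the names of its fields."""
    name: str
    fields: tuple


@dataclass
class Entry:
    """One row of a table: match values, action values and a counter."""
    index: int
    match_values: dict
    action_values: dict
    counter: int = 0


@dataclass
class Table:
    """A hardware block, described by what it matches on and what it sets."""
    name: str
    size: int
    matches: dict
    actions: dict
    counters_enabled: bool = False
    entries: list = field(default_factory=list)


class Pipeline:
    """Whole-ASIC view: tables can appear and disappear at run time."""

    def __init__(self):
        self.tables = {}

    def register(self, table):
        self.tables[table.name] = table

    def unregister(self, name):
        del self.tables[name]


ipv4 = Header("ipv4", ("dst_addr",))
pipeline = Pipeline()
# A TCAM region allocated on demand shows up as a new table ...
pipeline.register(Table("tcam_region_0", 512, {"ipv4.dst_addr": "exact_mask"}, {}))
# ... and disappears again once the region is freed.
pipeline.unregister("tcam_region_0")
```

The registry is attached to the pipeline as a whole rather than to a port,
matching the statement above that the model describes the entire ASIC.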
- ``devlink-dpipe`` generally is not intended for configuration. The exception
- is hardware counting for a specific table.
- The following commands are used to obtain the ``dpipe`` objects from
- userspace:
- * ``table_get``: Receive a table's description.
- * ``headers_get``: Receive a device's supported headers.
- * ``entries_get``: Receive a table's current entries.
- * ``counters_set``: Enable or disable counters on a table.
- Table
- -----
- The driver should implement the following operations for each table:
- * ``matches_dump``: Dump the supported matches.
- * ``actions_dump``: Dump the supported actions.
- * ``entries_dump``: Dump the actual content of the table.
- * ``counters_set_update``: Synchronize hardware with counters enabled or
- disabled.
- Header/Field
- ------------
- Similarly to P4, headers and fields are used to describe a table's
- behavior. There is a slight difference between the standard protocol headers
- and specific ASIC metadata. The protocol headers should be declared in the
- ``devlink`` core API. ASIC metadata, on the other hand, is driver specific
- and should be defined in the driver. Additionally, each driver-specific
- devlink documentation file should document the driver-specific ``dpipe``
- headers it implements. The headers and fields are identified by enumeration.
- In order to provide further visibility, some ASIC metadata fields could be
- mapped to kernel objects. For example, internal router interface indexes can
- be directly mapped to the net device ifindex. FIB table indexes used by
- different Virtual Routing and Forwarding (VRF) tables can be mapped to
- internal routing table indexes.
- Match
- -----
- Matches are kept primitive and close to hardware operation. Match types like
- LPM are not supported because this is exactly the process we wish to
- describe in full detail. Examples of matches:
- * ``field_exact``: Exact match on a specific field.
- * ``field_exact_mask``: Exact match on a specific field after masking.
- * ``field_range``: Match on a specific range.
- The IDs of the header and the field should be specified in order to
- identify the specific field. Furthermore, the header index should be
- specified in order to distinguish multiple headers of the same type in a
- packet (tunneling).
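To illustrate how primitive these match types are, each one can be written as
a one-line predicate. The function signatures, the inclusive range bounds and
the ``(header, field, header index)`` tuple used to address a field are
assumptions made for this sketch; they are not part of the ``devlink`` API.

```python
def field_exact(value, key):
    """Exact match on a field value."""
    return value == key


def field_exact_mask(value, key, mask):
    """Exact match after masking both the field value and the key."""
    return (value & mask) == (key & mask)


def field_range(value, low, high):
    """Match when the value lies in [low, high] (inclusive by assumption)."""
    return low <= value <= high


# A field is addressed by (header id, field id, header index); the header
# index separates the outer and inner headers of a tunneled packet.
packet = {
    ("ipv4", "dst_addr", 0): 0xC0A80101,  # outer header: 192.168.1.1
    ("ipv4", "dst_addr", 1): 0x0A000001,  # inner header: 10.0.0.1
}

outer = packet[("ipv4", "dst_addr", 0)]
inner = packet[("ipv4", "dst_addr", 1)]

# Only the outer destination falls inside 192.168.0.0/16.
outer_hit = field_exact_mask(outer, 0xC0A80000, 0xFFFF0000)
inner_hit = field_exact_mask(inner, 0xC0A80000, 0xFFFF0000)
```

An LPM lookup is deliberately absent from this set: it is expressed as a
sequence of ``field_exact_mask`` lookups instead.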
- Action
- ------
- Similar to match, the actions are kept primitive and close to hardware
- operation. For example:
- * ``field_modify``: Modify the field value.
- * ``field_inc``: Increment the field value.
- * ``push_header``: Add a header.
- * ``pop_header``: Remove a header.
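The four actions can be sketched in the same spirit. The packet
representation (a list of header names plus a dictionary of field values),
the increment step of one and the choice to push and pop at the outermost
position are assumptions of this example, as is the field name ``meta.hops``.

```python
def field_modify(packet, name, value):
    """Overwrite a field with a new value."""
    packet["fields"][name] = value


def field_inc(packet, name):
    """Increment a field value by one."""
    packet["fields"][name] += 1


def push_header(packet, header):
    """Add a header in front of the existing ones."""
    packet["headers"].insert(0, header)


def pop_header(packet):
    """Remove and return the outermost header."""
    return packet["headers"].pop(0)


packet = {"headers": ["ipv4"], "fields": {"meta.hops": 0}}
push_header(packet, "ethernet")
field_inc(packet, "meta.hops")
field_modify(packet, "meta.erif", 7)
```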
- Entry
- -----
- Entries of a specific table can be dumped on demand. Each entry is
- identified with an index and its properties are described by a list of
- match/action values and a specific counter. By dumping the tables' content,
- the interactions between tables can be resolved.
- Abstraction Example
- ===================
- The following is an example of the abstraction model of the L3 part of the
- Mellanox Spectrum ASIC. The blocks are described in the order they appear in
- the pipeline. The table sizes in the following examples are not real
- hardware sizes and are provided for demonstration purposes.
- LPM
- ---
- The LPM algorithm can be implemented as a list of hash tables. Each hash
- table contains routes with the same prefix length. The root of the list is
- /32, and in case of a miss the hardware will continue to the next hash
- table. The depth of the search will affect the data path latency.
- In case of a hit, the entry contains information about the next stage of the
- pipeline, which resolves the MAC address. The next stage can be either the
- local host table for directly connected routes, or the adjacency table for
- next-hops. The ``meta.lpm_prefix`` field is used to connect two LPM tables.
- .. code::
- table lpm_prefix_16 {
- size: 4096,
- counters_enabled: true,
- match: { meta.vr_id: exact,
- ipv4.dst_addr: exact_mask,
- ipv6.dst_addr: exact_mask,
- meta.lpm_prefix: exact },
- action: { meta.adj_index: set,
- meta.adj_group_size: set,
- meta.rif_port: set,
- meta.lpm_prefix: set },
- }
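The chained lookup described above can be sketched in a few lines of Python.
This is a simplified model under stated assumptions, not the Spectrum
implementation: it handles IPv4 only, the class name ``HashLpm`` is
hypothetical, and the order of the chain stands in for the
``meta.lpm_prefix`` linkage between tables.

```python
import ipaddress


class HashLpm:
    """LPM as a chain of exact-match hash tables, longest prefix first."""

    def __init__(self, prefix_lengths):
        # One hash table per prefix length; the chain starts at the longest.
        self.chain = [(plen, {}) for plen in sorted(prefix_lengths, reverse=True)]

    def add_route(self, vr_id, prefix, result):
        net = ipaddress.ip_network(prefix)
        for plen, table in self.chain:
            if plen == net.prefixlen:
                table[(vr_id, int(net.network_address))] = result
                return
        raise ValueError(f"no hash table for /{net.prefixlen}")

    def lookup(self, vr_id, addr):
        """Return (result, tables_probed); result is None on a miss."""
        addr = int(ipaddress.ip_address(addr))
        for depth, (plen, table) in enumerate(self.chain, start=1):
            # exact_mask match: mask the address to this table's prefix length.
            mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
            hit = table.get((vr_id, addr & mask))
            if hit is not None:
                return hit, depth
        return None, len(self.chain)


lpm = HashLpm([32, 24, 16])
lpm.add_route(0, "10.1.0.0/16", {"adj_index": 8, "adj_group_size": 2})
lpm.add_route(0, "10.1.2.3/32", {"rif_port": 5})

host_result, host_depth = lpm.lookup(0, "10.1.2.3")  # hit in the /32 table
net_result, net_depth = lpm.lookup(0, "10.1.9.9")    # falls through to /16
```

The second value returned by ``lookup`` is the number of hash tables probed,
which is the search depth that affects data path latency.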
- Local Host
- ----------
- In the case of local routes, the LPM lookup already resolves the egress
- router interface (RIF), yet the exact MAC address is not known. The local
- host table is a hash table combining the output interface ID with the
- destination IP address as a key. The result is the MAC address.
- .. code::
- table local_host {
- size: 4096,
- counters_enabled: true,
- match: { meta.rif_port: exact,
- ipv4.dst_addr: exact },
- action: { ethernet.daddr: set }
- }
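In Python terms, the local host table behaves like a dictionary keyed by the
(RIF, destination address) pair. The entry and the helper name
``resolve_local`` below are made up for illustration.

```python
import ipaddress

# Key: (egress RIF, destination IP); value: destination MAC address.
local_host = {
    (5, ipaddress.ip_address("192.0.2.10")): "00:11:22:33:44:55",
}


def resolve_local(rif_port, dst_addr):
    """Return the neighbour's MAC address, or None when it is unknown."""
    return local_host.get((rif_port, ipaddress.ip_address(dst_addr)))
```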
- Adjacency
- ---------
- In case of remote routes, this table performs ECMP. The LPM lookup results
- in an ECMP group size and an index that serves as a global offset into this
- table. Concurrently, a hash of the packet is generated. Based on the ECMP
- group size and the packet's hash, a local offset is generated. Multiple LPM
- entries can point to the same adjacency group.
- .. code::
- table adjacency {
- size: 4096,
- counters_enabled: true,
- match: { meta.adj_index: exact,
- meta.adj_group_size: exact,
- meta.packet_hash_index: exact },
- action: { ethernet.daddr: set,
- meta.erif: set }
- }
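The index computation can be sketched as follows. Two points are assumptions
of this example and are not taken from the hardware: CRC32 over the flow tuple
stands in for the packet hash, and the local offset is derived by a modulo
reduction over the group size. The table contents are hypothetical.

```python
import zlib


def adjacency_index(adj_index, adj_group_size, flow):
    """Global offset of the ECMP group plus a per-flow local offset."""
    # Assumption: CRC32 of the flow tuple stands in for the hardware hash.
    packet_hash = zlib.crc32(repr(flow).encode())
    local_offset = packet_hash % adj_group_size
    return adj_index + local_offset


# Hypothetical adjacency table: entries 8 and 9 form one ECMP group of size 2.
adjacency = {
    8: {"daddr": "00:00:5e:00:53:01", "erif": 1},
    9: {"daddr": "00:00:5e:00:53:02", "erif": 2},
}

# (source IP, destination IP, protocol, source port, destination port)
flow = ("10.0.0.1", "10.1.9.9", 6, 12345, 80)
nexthop = adjacency[adjacency_index(8, 2, flow)]
```

Because the hash depends only on the flow tuple, all packets of one flow
select the same next hop, while different flows spread across the group.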
- ERIF
- ----
- Once the egress RIF and destination MAC have been resolved by the previous
- tables, this table performs multiple operations such as TTL decrement and
- MTU check. Then the forward/drop decision is taken and the port L3
- statistics are updated based on the packet's type (broadcast, unicast,
- multicast).
- .. code::
- table erif {
- size: 800,
- counters_enabled: true,
- match: { meta.rif_port: exact,
- meta.is_l3_unicast: exact,
- meta.is_l3_broadcast: exact,
- meta.is_l3_multicast: exact },
- action: { meta.l3_drop: set,
- meta.l3_forward: set }
- }
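A minimal sketch of the decision taken at this stage is shown below. The
packet representation, the rule that a packet with a TTL of 1 or less is
dropped, and the layout of the statistics are assumptions made for the
example; the text above only states that the TTL is decreased, the MTU is
checked and per-type L3 statistics are updated.

```python
from collections import Counter


def erif(packet, rif_mtu, stats):
    """Return True to forward, False to drop; update per-type L3 counters."""
    if packet["ttl"] <= 1 or packet["length"] > rif_mtu:
        stats[(packet["l3_type"], "drop")] += 1
        return False
    packet["ttl"] -= 1
    stats[(packet["l3_type"], "forward")] += 1
    return True


stats = Counter()
forwarded = erif({"ttl": 64, "length": 1500, "l3_type": "unicast"}, 1500, stats)
oversized = erif({"ttl": 64, "length": 9000, "l3_type": "unicast"}, 1500, stats)
```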