- ============
- Architecture
- ============
- This document describes the **Distributed Switch Architecture (DSA)** subsystem
- design principles, limitations, interactions with other subsystems, and how to
- develop drivers for this subsystem as well as a TODO for developers interested
- in joining the effort.
- Design principles
- =================
- The Distributed Switch Architecture subsystem was primarily designed to
- support Marvell Ethernet switches (MV88E6xxx, a.k.a. Link Street product
- line) using Linux, but has since evolved to support other vendors as well.
- The original philosophy behind this design was that unmodified Linux tools such
- as bridge, iproute2 and ifconfig should work transparently, whether they
- configure or query a switch port network device or a regular network
- device.
- An Ethernet switch typically comprises multiple front-panel ports and one
- or more CPU or management ports. The DSA subsystem currently relies on the
- presence of a management port connected to an Ethernet controller capable of
- receiving Ethernet frames from the switch. This is a very common setup for all
- kinds of Ethernet switches found in Small Home and Office products: routers,
- gateways, or even top-of-rack switches. This host Ethernet controller will
- be later referred to as "conduit" and "cpu" in DSA terminology and code.
- The D in DSA stands for Distributed, because the subsystem has been designed
- with the ability to configure and manage cascaded switches on top of each other
- using upstream and downstream Ethernet links between switches. These specific
- ports are referred to as "dsa" ports in DSA terminology and code. A collection
- of multiple switches connected to each other is called a "switch tree".
- For each front-panel port, DSA creates specialized network devices which are
- used as controlling and data-flowing endpoints for use by the Linux networking
- stack. These specialized network interfaces are referred to as "user" network
- interfaces in DSA terminology and code.
- The ideal case for using DSA is when an Ethernet switch supports a "switch tag",
- a hardware feature which makes the switch insert a specific tag into each
- Ethernet frame it receives from or sends to specific ports, to help the
- management interface figure out:
- - what port is this frame coming from
- - what was the reason why this frame got forwarded
- - how to send CPU originated traffic to specific ports
- The subsystem does support switches not capable of inserting/stripping tags, but
- the features might be slightly limited in that case (traffic separation relies
- on Port-based VLAN IDs).
- Note that DSA does not currently create network interfaces for the "cpu" and
- "dsa" ports because:
- - the "cpu" port is the Ethernet switch facing side of the management
- controller, and as such, would create a duplication of feature, since you
- would get two interfaces for the same conduit: conduit netdev, and "cpu" netdev
- - the "dsa" port(s) are just conduits between two or more switches, and as such
- cannot really be used as proper network interfaces either, only the
- downstream, or the top-most upstream interface makes sense with that model
- NB: for the past 15 years, the DSA subsystem had been making use of the terms
- "master" (rather than "conduit") and "slave" (rather than "user"). These terms
- have been removed from the DSA codebase and phased out of the uAPI.
- Switch tagging protocols
- ------------------------
- DSA supports many vendor-specific tagging protocols, one software-defined
- tagging protocol, and a tag-less mode as well (``DSA_TAG_PROTO_NONE``).
- The exact format of the tag protocol is vendor specific, but in general, they
- all contain something which:
- - identifies which port the Ethernet frame came from/should be sent to
- - provides a reason why this frame was forwarded to the management interface
- All tagging protocols are in ``net/dsa/tag_*.c`` files and implement the
- methods of the ``struct dsa_device_ops`` structure, which are detailed below.
- Tagging protocols generally fall in one of three categories:
- 1. The switch-specific frame header is located before the Ethernet header,
- shifting to the right (from the perspective of the DSA conduit's frame
- parser) the MAC DA, MAC SA, EtherType and the entire L2 payload.
- 2. The switch-specific frame header is located before the EtherType, keeping
- the MAC DA and MAC SA in place from the DSA conduit's perspective, but
- shifting the 'real' EtherType and L2 payload to the right.
- 3. The switch-specific frame header is located at the tail of the packet,
- keeping all frame headers in place and not altering the view of the packet
- that the DSA conduit's frame parser has.
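As a rough illustration of the three categories, the userspace sketch below computes where a switch tag would sit in a frame and where the conduit's frame parser would find the real EtherType. The enum and function names are hypothetical and are not kernel API; frame lengths exclude the FCS.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_ALEN 6   /* octets in one MAC address */
#define ETH_HLEN 14  /* MAC DA + MAC SA + EtherType */

/* Hypothetical names for the three categories described above. */
enum tag_category {
	TAG_BEFORE_ETH_HEADER = 1,  /* category 1 */
	TAG_BEFORE_ETHERTYPE = 2,   /* category 2 */
	TAG_AT_TAIL = 3,            /* category 3 */
};

/* Offset of the switch tag from the start of the tagged frame. */
size_t tag_offset(enum tag_category cat, size_t untagged_len)
{
	/* An untagged frame is at least one Ethernet header long. */
	assert(untagged_len >= ETH_HLEN);

	switch (cat) {
	case TAG_BEFORE_ETH_HEADER:
		return 0;                /* tag precedes the MAC DA */
	case TAG_BEFORE_ETHERTYPE:
		return 2 * ETH_ALEN;     /* tag follows the MAC SA */
	case TAG_AT_TAIL:
		return untagged_len;     /* tag follows the L2 payload */
	}
	return 0; /* not reached for valid categories */
}

/* Offset of the real EtherType from the start of the tagged frame. */
size_t ethertype_offset(enum tag_category cat, size_t tag_len)
{
	size_t off = 2 * ETH_ALEN;

	/* Header tags shift the EtherType to the right; tail tags do not. */
	if (cat != TAG_AT_TAIL)
		off += tag_len;
	return off;
}
```

For a hypothetical 4-octet category 2 tag, `ethertype_offset()` yields 16 instead of the usual 12, which is why the conduit's parser no longer sees the real EtherType where it expects it.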
- A tagging protocol may tag all packets with switch tags of the same length, or
- the tag length might vary (for example packets with PTP timestamps might
- require an extended switch tag, or there might be one tag length on TX and a
- different one on RX). Either way, the tagging protocol driver must populate the
- ``struct dsa_device_ops::needed_headroom`` and/or ``struct dsa_device_ops::needed_tailroom``
- with the length in octets of the longest switch frame header/trailer. The DSA
- framework will automatically adjust the MTU of the conduit interface to
- accommodate for this extra size in order for DSA user ports to support the
- standard MTU (L2 payload length) of 1500 octets. The ``needed_headroom`` and
- ``needed_tailroom`` properties are also used to request from the network stack,
- on a best-effort basis, the allocation of packets with enough extra space such
- that the act of pushing the switch tag on transmission of a packet does not
- cause it to reallocate due to lack of memory.
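The MTU adjustment described above amounts to simple arithmetic. The following userspace sketch shows it under the assumption that the conduit MTU must cover the user port MTU plus the longest header and trailer; `struct tagger_overhead` and `conduit_mtu_for()` are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_DATA_LEN 1500  /* standard L2 payload length, in octets */

/* Hypothetical, trimmed-down stand-in for struct dsa_device_ops. */
struct tagger_overhead {
	size_t needed_headroom;  /* longest switch frame header, in octets */
	size_t needed_tailroom;  /* longest switch frame trailer, in octets */
};

/* MTU the conduit needs so that user ports can carry user_mtu octets. */
size_t conduit_mtu_for(const struct tagger_overhead *ops, size_t user_mtu)
{
	assert(ops != NULL);
	return user_mtu + ops->needed_headroom + ops->needed_tailroom;
}
```

With an 8-octet header tag and no trailer, a user port MTU of 1500 would require a conduit MTU of 1508 in this model.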
- Even though applications are not expected to parse DSA-specific frame headers,
- the format on the wire of the tagging protocol represents an Application Binary
- Interface exposed by the kernel towards user space, for decoders such as
- ``libpcap``. The tagging protocol driver must populate the ``proto`` member of
- ``struct dsa_device_ops`` with a value that uniquely describes the
- characteristics of the interaction required between the switch hardware and the
- data path driver: the offset of each bit field within the frame header and any
- stateful processing required to deal with the frames (as may be required for
- PTP timestamping).
- From the perspective of the network stack, all switches within the same DSA
- switch tree use the same tagging protocol. In case of a packet transiting a
- fabric with more than one switch, the switch-specific frame header is inserted
- by the first switch in the fabric that the packet was received on. This header
- typically contains information regarding its type (whether it is a control
- frame that must be trapped to the CPU, or a data frame to be forwarded).
- Control frames should be decapsulated only by the software data path, whereas
- data frames might also be autonomously forwarded towards other user ports of
- other switches from the same fabric, and in this case, the outermost switch
- ports must decapsulate the packet.
- Note that in certain cases, the tagging format used by a leaf switch (not
- connected directly to the CPU) might not be the same as what the network stack
- sees. This can be seen with Marvell switch trees, where the
- CPU port can be configured to use either the DSA or the Ethertype DSA (EDSA)
- format, but the DSA links are configured to use the shorter (without Ethertype)
- DSA frame header, in order to reduce the autonomous packet forwarding overhead.
- It still remains the case that, if the DSA switch tree is configured for the
- EDSA tagging protocol, the operating system sees EDSA-tagged packets from the
- leaf switches that tagged them with the shorter DSA header. This can be done
- because the Marvell switch connected directly to the CPU is configured to
- perform tag translation between DSA and EDSA (which is simply the operation of
- adding or removing the ``ETH_P_EDSA`` EtherType and some padding octets).
- It is possible to construct cascaded setups of DSA switches even if their
- tagging protocols are not compatible with one another. In this case, there are
- no DSA links in this fabric, and each switch constitutes a disjoint DSA switch
- tree. The DSA links are viewed as simply a pair of a DSA conduit (the out-facing
- port of the upstream DSA switch) and a CPU port (the in-facing port of the
- downstream DSA switch).
- The tagging protocol of the attached DSA switch tree can be viewed through the
- ``dsa/tagging`` sysfs attribute of the DSA conduit::
- cat /sys/class/net/eth0/dsa/tagging
- If the hardware and driver are capable, the tagging protocol of the DSA switch
- tree can be changed at runtime. This is done by writing the new tagging
- protocol name to the same sysfs device attribute as above (the DSA conduit and
- all attached switch ports must be down while doing this).
- It is desirable that all tagging protocols are testable with the ``dsa_loop``
- mockup driver, which can be attached to any network interface. The goal is that
- any network interface should be capable of transmitting the same packet in the
- same way, and the tagger should decode the same received packet in the same way
- regardless of the driver used for the switch control path, and the driver used
- for the DSA conduit.
- The transmission of a packet goes through the tagger's ``xmit`` function.
- The passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
- ``skb_mac_header(skb)``, i.e. at the destination MAC address, and the passed
- ``struct net_device *dev`` represents the virtual DSA user network interface
- whose hardware counterpart the packet must be steered to (e.g. ``swp0``).
- The job of this method is to prepare the skb in a way that the switch will
- understand what egress port the packet is for (and not deliver it towards other
- ports). Typically this is fulfilled by pushing a frame header. Checking for
- insufficient size in the skb headroom or tailroom is unnecessary provided that
- the ``needed_headroom`` and ``needed_tailroom`` properties were filled out
- properly, because DSA ensures there is enough space before calling this method.
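To make the transmit step more concrete, here is a userspace sketch of what a category 2 tagger does to the packet. It does not use ``struct sk_buff``; the buffer type, the 4-octet tag layout and the ``0xAB`` marker octet are all invented for illustration and do not correspond to any real tagging protocol.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6
#define TOY_TAG_LEN 4  /* hypothetical tag length, in octets */

/* Toy packet buffer: 'data' points at the MAC DA, with spare room before it. */
struct toy_buf {
	uint8_t storage[256];
	uint8_t *data;
	size_t len;
};

/* Insert a made-up 4-octet tag between the MAC SA and the EtherType. */
void toy_tag_xmit(struct toy_buf *b, uint8_t egress_port)
{
	uint8_t *tag;

	/*
	 * A real tagger relies on DSA to guarantee the headroom; the toy
	 * buffer has no such guarantee, so state the contract explicitly.
	 */
	assert((size_t)(b->data - b->storage) >= TOY_TAG_LEN);

	/* Grow the packet at the front, similar in spirit to skb_push(). */
	b->data -= TOY_TAG_LEN;
	b->len += TOY_TAG_LEN;

	/* Move MAC DA and MAC SA to the new start of the frame. */
	memmove(b->data, b->data + TOY_TAG_LEN, 2 * ETH_ALEN);

	/* The gap that opened up before the EtherType holds the tag. */
	tag = b->data + 2 * ETH_ALEN;
	tag[0] = 0xAB;         /* made-up marker octet */
	tag[1] = egress_port;  /* port the switch should deliver to */
	tag[2] = 0;
	tag[3] = 0;
}
```

After the call, the MAC addresses are unchanged at the start of the frame and the original EtherType sits ``TOY_TAG_LEN`` octets further in, right behind the tag.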
- The reception of a packet goes through the tagger's ``rcv`` function. The
- passed ``struct sk_buff *skb`` has ``skb->data`` pointing at
- ``skb_mac_header(skb) + ETH_HLEN`` octets, i.e. to where the first octet after
- the EtherType would have been, were this frame not tagged. The role of this
- method is to consume the frame header, adjust ``skb->data`` to really point at
- the first octet after the EtherType, and to change ``skb->dev`` to point to the
- virtual DSA user network interface corresponding to the physical front-facing
- switch port that the packet was received on.
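The receive step can be sketched in userspace as follows, assuming an invented 4-octet tag placed between the MAC SA and the EtherType (category 2) whose first octet is a ``0xAB`` marker and whose second octet is the source port. The structure and function names are hypothetical; on entry, ``data`` points at the first octet after what the conduit took to be the EtherType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN    6
#define ETH_HLEN    14
#define TOY_TAG_LEN 4  /* hypothetical tag length, in octets */

struct toy_rx {
	uint8_t *mac_header;  /* start of the frame as received by the conduit */
	uint8_t *data;        /* on entry: mac_header + ETH_HLEN */
	size_t len;           /* octets from data to the end of the frame */
};

/* Returns the source port, or -1 if the frame does not carry the toy tag. */
int toy_tag_rcv(struct toy_rx *rx)
{
	uint8_t *tag;
	int port;

	assert(rx->data == rx->mac_header + ETH_HLEN);
	if (rx->len < TOY_TAG_LEN)
		return -1;

	/* The tag starts where the conduit believed the EtherType was. */
	tag = rx->data - 2;
	if (tag[0] != 0xAB)
		return -1;
	port = tag[1];

	/* Consume the tag: data now points past the real EtherType. */
	rx->data += TOY_TAG_LEN;
	rx->len -= TOY_TAG_LEN;

	/* Slide MAC DA and MAC SA forward so they abut the real EtherType. */
	memmove(rx->mac_header + TOY_TAG_LEN, rx->mac_header, 2 * ETH_ALEN);
	rx->mac_header += TOY_TAG_LEN;

	return port;
}
```

A real tagger would additionally map the returned port number to the matching user network device and assign it to ``skb->dev``; that lookup is omitted here.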
- Since tagging protocols in category 1 and 2 break software (and most often also
- hardware) packet dissection on the DSA conduit, features such as RPS (Receive
- Packet Steering) on the DSA conduit would be broken. The DSA framework deals
- with this by hooking into the flow dissector and shifting the offset at which
- the IP header is to be found in the tagged frame as seen by the DSA conduit.
- This behavior is automatic based on the ``needed_headroom`` value of the tagging
- protocol. If not all packets are of equal size, the tagger can implement the
- ``flow_dissect`` method of the ``struct dsa_device_ops`` and override this
- default behavior by specifying the correct offset incurred by each individual
- RX packet. Tail taggers do not cause issues to the flow dissector.
- Checksum offload should work with category 1 and 2 taggers when the DSA conduit
- driver declares NETIF_F_HW_CSUM in vlan_features and looks at csum_start and
- csum_offset. For those cases, DSA will shift the checksum start and offset by
- the tag size. If the DSA conduit driver still uses the legacy NETIF_F_IP_CSUM
- or NETIF_F_IPV6_CSUM in vlan_features, the offload might only work if the
- offload hardware already expects that specific tag (perhaps due to matching
- vendors). DSA user ports inherit those flags from the conduit, and it is up to
- the driver to correctly fall back to software checksum when the IP header is not
- where the hardware expects. If that check is ineffective, the packets might go
- to the network without a proper checksum (the checksum field will have the
- pseudo IP header sum). For category 3, when the offload hardware does not
- already expect the switch tag in use, the checksum must be calculated before any
- tag is inserted (i.e. inside the tagger). Otherwise, the DSA conduit would
- include the tail tag in the (software or hardware) checksum calculation. Then,
- when the tag gets stripped by the switch during transmission, it will leave an
- incorrect IP checksum in place.
- Due to various reasons (most common being category 1 taggers being associated
- with DSA-unaware conduits, mangling what the conduit perceives as MAC DA), the
- tagging protocol may require the DSA conduit to operate in promiscuous mode, to
- receive all frames regardless of the value of the MAC DA. This can be done by
- setting the ``promisc_on_conduit`` property of the ``struct dsa_device_ops``.
- Note that this assumes a DSA-unaware conduit driver, which is the norm.
- Conduit network devices
- -----------------------
- Conduit network devices are regular, unmodified Linux network device drivers for
- the CPU/management Ethernet interface. Such a driver might occasionally need to
- know whether DSA is enabled (e.g.: to enable/disable specific offload features),
- but the DSA subsystem has been proven to work with industry standard drivers:
- ``e1000e``, ``mv643xx_eth``, etc. without having to introduce modifications to these
- drivers. Such network devices are also often referred to as conduit network
- devices since they act as a pipe between the host processor and the hardware
- Ethernet switch.
- Networking stack hooks
- ----------------------
- When a conduit netdev is used with DSA, a small hook is placed in the
- networking stack in order to have the DSA subsystem process the Ethernet
- switch specific tagging protocol. DSA accomplishes this by registering a
- specific (and fake) Ethernet type (later becoming ``skb->protocol``) with the
- networking stack; this is also known as a ``ptype`` or ``packet_type``. A typical
- Ethernet frame receive sequence looks like this:
- Conduit network device (e.g.: e1000e):
- 1. Receive interrupt fires:
- - receive function is invoked
- - basic packet processing is done: getting length, status etc.
- - packet is prepared to be processed by the Ethernet layer by calling
- ``eth_type_trans``
- 2. net/ethernet/eth.c::
-     eth_type_trans(skb, dev)
-         if (dev->dsa_ptr != NULL)
-             -> skb->protocol = ETH_P_XDSA
- 3. drivers/net/ethernet/\*::
-     netif_receive_skb(skb)
-         -> iterate over registered packet_type
-             -> invoke handler for ETH_P_XDSA, calls dsa_switch_rcv()
- 4. net/dsa/dsa.c::
-     -> dsa_switch_rcv()
-         -> invoke switch tag specific protocol handler in 'net/dsa/tag_*.c'
- 5. net/dsa/tag_*.c:
- - inspect and strip switch tag protocol to determine originating port
- - locate per-port network device
- - invoke ``eth_type_trans()`` with the DSA user network device
- invoke ``netif_receive_skb()``
- Past this point, the DSA user network devices get delivered regular Ethernet
- frames that can be processed by the networking stack.
- User network devices
- --------------------
- User network devices created by DSA are stacked on top of their conduit network
- device. Each of these network interfaces is responsible for being a
- controlling and data-flowing end-point for each front-panel port of the switch.
- These interfaces are specialized in order to:
- - insert/remove the switch tag protocol (if it exists) when sending traffic
- to/from specific switch ports
- - query the switch for ethtool operations: statistics, link state,
- Wake-on-LAN, register dumps...
- - manage external/internal PHY: link, auto-negotiation, etc.
- These user network devices have custom net_device_ops and ethtool_ops function
- pointers which allow DSA to introduce a level of layering between the networking
- stack/ethtool and the switch driver implementation.
- Upon frame transmission from these user network devices, DSA will look up which
- switch tagging protocol is currently registered with these network devices and
- invoke a specific transmit routine which takes care of adding the relevant
- switch tag in the Ethernet frames.
- These frames are then queued for transmission using the conduit network device
- ``ndo_start_xmit()`` function. Since they contain the appropriate switch tag, the
- Ethernet switch will be able to process these incoming frames from the
- management interface and deliver them to the physical switch port.
- When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
- device between the DSA user devices and the physical DSA conduits. The LAG
- device is thus also a DSA conduit, but the LAG slave devices continue to be DSA
- conduits as well (just with no user port assigned to them; this is needed for
- recovery in case the LAG DSA conduit disappears). Thus, the data path of the LAG
- DSA conduit is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
- calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA conduit;
- LAG slave). Therefore, the RX data path of the LAG DSA conduit is not used.
- On the other hand, TX takes place linearly: ``dsa_user_xmit`` calls
- ``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA conduit.
- The latter calls ``dev_queue_xmit`` towards one physical DSA conduit or the
- other, and in both cases, the packet exits the system through a hardware path
- towards the switch.
- Graphical representation
- ------------------------
- Summarized, this is basically what DSA looks like from a network device
- perspective::
-              Unaware application
-            opens and binds socket
-                     |  ^
-                     |  |
-         +-----------v--|--------------------+
-         |+------+ +------+ +------+ +------+|
-         || swp0 | | swp1 | | swp2 | | swp3 ||
-         |+------+-+------+-+------+-+------+|
-         |          DSA switch driver        |
-         +-----------------------------------+
-                       |        ^
-          Tag added by |        | Tag consumed by
-         switch driver |        | switch driver
-                       v        |
-         +-----------------------------------+
-         | Unmodified host interface driver  | Software
- --------+-----------------------------------+------------
-         |       Host interface (eth0)       | Hardware
-         +-----------------------------------+
-                       |        ^
-       Tag consumed by |        | Tag added by
-       switch hardware |        | switch hardware
-                       v        |
-         +-----------------------------------+
-         |               Switch              |
-         |+------+ +------+ +------+ +------+|
-         || swp0 | | swp1 | | swp2 | | swp3 ||
-         ++------+-+------+-+------+-+------++
- User MDIO bus
- -------------
- In order to be able to read from and write to the PHYs built into a switch, DSA
- creates a user MDIO bus which allows a specific switch driver to divert and
- intercept MDIO reads/writes towards specific PHY addresses. In most
- MDIO-connected switches, these functions would utilize direct or indirect PHY
- addressing mode to return standard MII registers from the switch builtin PHYs,
- allowing the PHY library to return link status, link partner pages,
- auto-negotiation results, etc.
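The diversion of MDIO reads can be pictured with the userspace sketch below. It assumes a bitmask that marks which PHY addresses belong to built-in PHYs and a fake register file standing in for the hardware; every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_PHYS 32
#define TOY_NUM_REGS 32

struct toy_switch_mdio {
	uint32_t phys_mii_mask;                      /* bit N set: address N is a built-in PHY */
	uint16_t regs[TOY_NUM_PHYS][TOY_NUM_REGS];   /* fake PHY register file */
};

/* Stand-in for a driver's direct or indirect PHY register access. */
static uint16_t toy_switch_phy_read(const struct toy_switch_mdio *sw,
				    int addr, int reg)
{
	return sw->regs[addr][reg];
}

/* Read callback of the toy user MDIO bus. */
uint16_t toy_user_mii_read(const struct toy_switch_mdio *sw, int addr, int reg)
{
	assert(addr >= 0 && addr < TOY_NUM_PHYS);
	assert(reg >= 0 && reg < TOY_NUM_REGS);

	/* Only addresses claimed by the switch are diverted to its driver. */
	if (sw->phys_mii_mask & (UINT32_C(1) << addr))
		return toy_switch_phy_read(sw, addr, reg);

	return 0xffff;  /* all ones, as on a bus where nothing responds */
}
```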
- For Ethernet switches which have both external and internal MDIO buses, the
- user MII bus can be utilized to mux/demux MDIO reads and writes towards either
- internal or external MDIO devices this switch might be connected to: internal
- PHYs, external PHYs, or even external switches.
- Data structures
- ---------------
- DSA data structures are defined in ``include/net/dsa.h`` as well as
- ``net/dsa/dsa_priv.h``:
- - ``dsa_chip_data``: platform data configuration for a given switch device,
- this structure describes a switch device's parent device, its address, as
- well as various properties of its ports: names/labels, and finally a routing
- table indication (when cascading switches)
- - ``dsa_platform_data``: platform device configuration data which can reference
- a collection of dsa_chip_data structures if multiple switches are cascaded,
- the conduit network device this switch tree is attached to needs to be
- referenced
- - ``dsa_switch_tree``: structure assigned to the conduit network device under
- ``dsa_ptr``, this structure references a dsa_platform_data structure as well as
- the tagging protocol supported by the switch tree, and which receive/transmit
- function hooks should be invoked, information about the directly attached
- switch is also provided: CPU port. Finally, a collection of dsa_switch are
- referenced to address individual switches in the tree.
- - ``dsa_switch``: structure describing a switch device in the tree, referencing
- a ``dsa_switch_tree`` as a backpointer, user network devices, conduit network
- device, and a reference to the backing ``dsa_switch_ops``
- - ``dsa_switch_ops``: structure referencing function pointers, see below for a
- full description.
- Design limitations
- ==================
- Lack of CPU/DSA network devices
- -------------------------------
- DSA does not currently create user network devices for the CPU or DSA ports, as
- described before. This might be an issue in the following cases:
- inability to fetch switch CPU port statistics counters using ethtool, which
- can make it harder to debug MDIO switches connected using xMII interfaces
- - inability to configure the CPU port link parameters based on the Ethernet
- controller capabilities attached to it: http://patchwork.ozlabs.org/patch/509806/
- - inability to configure specific VLAN IDs / trunking VLANs between switches
- when using a cascaded setup
- Common pitfalls using DSA setups
- --------------------------------
- Once a conduit network device is configured to use DSA (dev->dsa_ptr becomes
- non-NULL), and the switch behind it expects a tagging protocol, this network
- interface can only be used as a conduit interface. Sending packets
- directly through this interface (e.g.: opening a socket using this interface)
- will bypass the switch tagging protocol transmit function, so
- the Ethernet switch on the other end, expecting a tag, will typically drop these
- frames.
- Interactions with other subsystems
- ==================================
- DSA currently leverages the following subsystems:
- - MDIO/PHY library: ``drivers/net/phy/phy.c``, ``mdio_bus.c``
- Switchdev: ``net/switchdev/*``
- - Device Tree for various of_* functions
- - Devlink: ``net/core/devlink.c``
- MDIO/PHY library
- ----------------
- User network devices exposed by DSA may or may not be interfacing with PHY
- devices (``struct phy_device`` as defined in ``include/linux/phy.h``), but the DSA
- subsystem deals with all possible combinations:
- - internal PHY devices, built into the Ethernet switch hardware
- - external PHY devices, connected via an internal or external MDIO bus
- - internal PHY devices, connected via an internal MDIO bus
- special, non-autonegotiated or non MDIO-managed PHY devices: SFPs, MoCA; a.k.a.
- fixed PHYs
- The PHY configuration is done by the ``dsa_user_phy_setup()`` function and the
- logic basically looks like this:
- if Device Tree is used, the PHY device is looked up using the standard
- "phy-handle" property; if found, this PHY device is created and registered
- using ``of_phy_connect()``
- - if Device Tree is used and the PHY device is "fixed", that is, conforms to
- the definition of a non-MDIO managed PHY as defined in
- ``Documentation/devicetree/bindings/net/fixed-link.txt``, the PHY is registered
- and connected transparently using the special fixed MDIO bus driver
- - finally, if the PHY is built into the switch, as is very common with
- standalone switch packages, the PHY is probed using the user MII bus created
- by DSA
- SWITCHDEV
- ---------
- DSA directly utilizes SWITCHDEV when interfacing with the bridge layer, and
- more specifically with its VLAN filtering portion when configuring VLANs on top
- of per-port user network devices. As of today, the only SWITCHDEV objects
- supported by DSA are the FDB and VLAN objects.
- Devlink
- -------
- DSA registers one devlink device per physical switch in the fabric.
- For each devlink device, every physical port (i.e. user ports, CPU ports, DSA
- links or unused ports) is exposed as a devlink port.
- DSA drivers can make use of the following devlink features:
- - Regions: debugging feature which allows user space to dump driver-defined
- areas of hardware information in a low-level, binary format. Both global
- regions as well as per-port regions are supported. It is possible to export
- devlink regions even for pieces of data that are already exposed in some way
- to the standard iproute2 user space programs (ip-link, bridge), like address
- tables and VLAN tables. For example, this might be useful if the tables
- contain additional hardware-specific details which are not visible through
- the iproute2 abstraction, or it might be useful to inspect these tables on
- the non-user ports too, which are invisible to iproute2 because no network
- interface is registered for them.
- Params: a feature which enables users to configure certain low-level tunable
- knobs pertaining to the device. Drivers may implement applicable generic
- devlink params, or may add new device-specific devlink params.
- - Resources: a monitoring feature which enables users to see the degree of
- utilization of certain hardware tables in the device, such as FDB, VLAN, etc.
- - Shared buffers: a QoS feature for adjusting and partitioning memory and frame
- reservations per port and per traffic class, in the ingress and egress
- directions, such that low-priority bulk traffic does not impede the
- processing of high-priority critical traffic.
- For more details, consult ``Documentation/networking/devlink/``.
- Device Tree
- -----------
- DSA features a standardized binding which is documented in
- ``Documentation/devicetree/bindings/net/dsa/dsa.txt``. PHY/MDIO library helper
- functions such as ``of_get_phy_mode()``, ``of_phy_connect()`` are also used to query
- per-port PHY specific details: interface connection, MDIO bus location, etc.
- Driver development
- ==================
- DSA switch drivers need to implement a ``dsa_switch_ops`` structure which will
- contain the various members described below.
- Probing, registration and device lifetime
- -----------------------------------------
- DSA switches are regular ``device`` structures on buses (be they platform, SPI,
- I2C, MDIO or otherwise). The DSA framework is not involved in their probing
- with the device core.
- Switch registration from the perspective of a driver means passing a valid
- ``struct dsa_switch`` pointer to ``dsa_register_switch()``, usually from the
- switch driver's probing function. The following members must be valid in the
- provided structure:
- - ``ds->dev``: will be used to parse the switch's OF node or platform data.
- - ``ds->num_ports``: will be used to create the port list for this switch, and
- to validate the port indices provided in the OF node.
- - ``ds->ops``: a pointer to the ``dsa_switch_ops`` structure holding the DSA
- method implementations.
- - ``ds->priv``: backpointer to a driver-private data structure which can be
- retrieved in all further DSA method callbacks.
- In addition, the following flags in the ``dsa_switch`` structure may optionally
- be configured to obtain driver-specific behavior from the DSA core. Their
- behavior when set is documented through comments in ``include/net/dsa.h``.
- - ``ds->vlan_filtering_is_global``
- - ``ds->needs_standalone_vlan_filtering``
- - ``ds->configure_vlan_while_not_filtering``
- - ``ds->untag_bridge_pvid``
- - ``ds->assisted_learning_on_cpu_port``
- - ``ds->mtu_enforcement_ingress``
- - ``ds->fdb_isolation``
- Internally, DSA keeps an array of switch trees (group of switches) global to
- the kernel, and attaches a ``dsa_switch`` structure to a tree on registration.
- The tree ID to which the switch is attached is determined by the first u32
- number of the ``dsa,member`` property of the switch's OF node (0 if missing).
- The switch ID within the tree is determined by the second u32 number of the
- same OF property (0 if missing). Registering multiple switches with the same
- switch ID and tree ID is illegal and will cause an error. Using platform data,
- only a single switch and a single switch tree are permitted.
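The ID assignment and duplicate check described above can be modelled in userspace roughly as follows. The sketch assumes the property cells have already been read into an array; the types, the fixed-size registry and both function names are hypothetical and do not mirror the kernel's internal bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SWITCHES 16  /* arbitrary limit for this sketch */

struct member {
	uint32_t tree;   /* first u32 of "dsa,member" */
	uint32_t index;  /* second u32 of "dsa,member" */
};

struct registry {
	struct member used[MAX_SWITCHES];
	size_t count;
};

/* 'prop' holds the u32 cells of "dsa,member", or NULL if it is missing. */
struct member parse_member(const uint32_t *prop, size_t ncells)
{
	struct member m = { 0, 0 };  /* defaults when the property is absent */

	if (prop && ncells >= 2) {
		m.tree = prop[0];
		m.index = prop[1];
	}
	return m;
}

/* Returns false if this (tree, index) pair is taken or the table is full. */
bool registry_add(struct registry *r, struct member m)
{
	size_t i;

	assert(r != NULL);
	for (i = 0; i < r->count; i++)
		if (r->used[i].tree == m.tree && r->used[i].index == m.index)
			return false;

	if (r->count == MAX_SWITCHES)
		return false;

	r->used[r->count++] = m;
	return true;
}
```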
- In case of a tree with multiple switches, probing takes place asymmetrically.
- The first N-1 callers of ``dsa_register_switch()`` only add their ports to the
- port list of the tree (``dst->ports``), each port having a backpointer to its
- associated switch (``dp->ds``). Then, these switches exit their
- ``dsa_register_switch()`` call early, because ``dsa_tree_setup_routing_table()``
- has determined that the tree is not yet complete (not all ports referenced by
- DSA links are present in the tree's port list). The tree becomes complete when
- the last switch calls ``dsa_register_switch()``, and this triggers the effective
- continuation of initialization (including the call to ``ds->ops->setup()``) for
- all switches within that tree, all as part of the calling context of the last
- switch's probe function.
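The asymmetric probing described above can be modeled in user space as follows. This is a simplified sketch under stated assumptions: ``model_register()`` and ``model_tree_complete()`` are hypothetical stand-ins for ``dsa_register_switch()`` and the completeness check done by ``dsa_tree_setup_routing_table()``, links are tracked per switch rather than per port, and callers are assumed to pass valid indices.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_SW 4

struct model_switch {
	bool present;			/* has called model_register() */
	bool setup_done;		/* stands in for ds->ops->setup() having run */
	int links[MODEL_MAX_SW];	/* switch indices referenced by DSA links */
	int num_links;
};

struct model_tree {
	struct model_switch sw[MODEL_MAX_SW];
};

/* The tree is complete once every switch referenced by a link is present. */
static bool model_tree_complete(const struct model_tree *t)
{
	int i, j;

	for (i = 0; i < MODEL_MAX_SW; i++) {
		if (!t->sw[i].present)
			continue;
		for (j = 0; j < t->sw[i].num_links; j++) {
			if (!t->sw[t->sw[i].links[j]].present)
				return false;
		}
	}
	return true;
}

/*
 * Returns true only for the call that completes the tree; that call runs
 * setup for all switches, in the context of the last caller.
 */
static bool model_register(struct model_tree *t, int idx,
			   const int *links, int num_links)
{
	int i;

	t->sw[idx].present = true;
	t->sw[idx].num_links = num_links;
	for (i = 0; i < num_links; i++)
		t->sw[idx].links[i] = links[i];

	if (!model_tree_complete(t))
		return false;	/* early exit, like the first N-1 callers */

	for (i = 0; i < MODEL_MAX_SW; i++) {
		if (t->sw[i].present)
			t->sw[i].setup_done = true;
	}
	return true;
}
```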
- The opposite of registration takes place when calling ``dsa_unregister_switch()``,
- which removes a switch's ports from the port list of the tree. The entire tree
- is torn down when the first switch unregisters.
- It is mandatory for DSA switch drivers to implement the ``shutdown()`` callback
- of their respective bus, and call ``dsa_switch_shutdown()`` from it (a minimal
- version of the full teardown performed by ``dsa_unregister_switch()``).
- The reason is that DSA keeps a reference on the conduit net device, and if the
- driver for the conduit device decides to unbind on shutdown, DSA's reference
- will block that operation from finalizing.
- Either ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` must be called,
- but not both, and the device driver model permits the bus' ``remove()`` method
- to be called even if ``shutdown()`` was already called. Therefore, drivers are
- expected to implement a mutual exclusion method between ``remove()`` and
- ``shutdown()`` by setting their drvdata to NULL after any of these has run, and
- checking whether the drvdata is NULL before proceeding to take any action.
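The drvdata-based mutual exclusion can be sketched in user space as shown here. All ``model_*`` names are hypothetical: ``struct model_device`` stands in for the bus device, and the two counting helpers stand in for ``dsa_unregister_switch()`` and ``dsa_switch_shutdown()``. A real driver would use ``dev_get_drvdata()`` and ``dev_set_drvdata()`` on its bus device instead.

```c
#include <assert.h>
#include <stddef.h>

struct model_priv {
	int id;
};

struct model_device {
	void *drvdata;	/* stands in for the bus device's drvdata */
};

static int unregister_calls;
static int shutdown_calls;

/* Stand-ins for dsa_unregister_switch() and dsa_switch_shutdown(). */
static void model_unregister_switch(struct model_priv *priv)
{
	(void)priv;
	unregister_calls++;
}

static void model_switch_shutdown(struct model_priv *priv)
{
	(void)priv;
	shutdown_calls++;
}

static void model_remove(struct model_device *dev)
{
	struct model_priv *priv = dev->drvdata;

	if (!priv)
		return;		/* shutdown() already ran */

	model_unregister_switch(priv);
	dev->drvdata = NULL;
}

static void model_shutdown(struct model_device *dev)
{
	struct model_priv *priv = dev->drvdata;

	if (!priv)
		return;		/* remove() already ran */

	model_switch_shutdown(priv);
	dev->drvdata = NULL;
}
```

Whichever of the two callbacks runs first clears the drvdata, so the other becomes a no-op.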
- After ``dsa_switch_shutdown()`` or ``dsa_unregister_switch()`` was called, no
- further callbacks via the provided ``dsa_switch_ops`` may take place, and the
- driver may free the data structures associated with the ``dsa_switch``.
- Switch configuration
- --------------------
- - ``get_tag_protocol``: indicates which tagging protocol is supported; it
- should return a valid value from the ``dsa_tag_protocol`` enum.
- The returned information does not have to be static; the driver is passed the
- CPU port number, as well as the tagging protocol of a possibly stacked
- upstream switch, in case there are hardware limitations in terms of supported
- tag formats.
- - ``change_tag_protocol``: when the default tagging protocol has compatibility
- problems with the conduit or other issues, the driver may support changing it
- at runtime, either through a device tree property or through sysfs. In that
- case, further calls to ``get_tag_protocol`` should report the protocol in
- current use.
- - ``setup``: setup function for the switch, this function is responsible for setting
- up the ``dsa_switch_ops`` private structure with all it needs: register maps,
- interrupts, mutexes, locks, etc. This function is also expected to properly
- configure the switch to separate all network interfaces from each other, that
- is, they should be isolated by the switch hardware itself, typically by creating
- a Port-based VLAN ID for each port and allowing only the CPU port and the
- specific port to be in the forwarding vector. Ports that are unused by the
- platform should be disabled. Past this function, the switch is expected to be
- fully configured and ready to serve any kind of request. It is recommended
- to issue a software reset of the switch during this setup function in order to
- avoid relying on what a previous software agent such as a bootloader/firmware
- may have previously configured. The method responsible for undoing any
- applicable allocations or operations done here is ``teardown``.
- - ``port_setup`` and ``port_teardown``: methods for initialization and
- destruction of per-port data structures. It is mandatory for some operations
- such as registering and unregistering devlink port regions to be done from
- these methods, otherwise they are optional. A port will be torn down only if
- it has been previously set up. It is possible for a port to be set up during
- probing only to be torn down immediately afterwards, for example in case its
- PHY cannot be found. In this case, probing of the DSA switch continues
- without that particular port.
- - ``port_change_conduit``: method through which the affinity (association used
- for traffic termination purposes) between a user port and a CPU port can be
- changed. By default all user ports from a tree are assigned to the first
- available CPU port that makes sense for them (most of the time this means
- the user ports of a tree are all assigned to the same CPU port, except for H
- topologies as described in commit 2c0b03258b8b). The ``port`` argument
- represents the index of the user port, and the ``conduit`` argument represents
- the new DSA conduit ``net_device``. The CPU port associated with the new
- conduit can be retrieved by looking at ``struct dsa_port *cpu_dp =
- conduit->dsa_ptr``. Additionally, the conduit can also be a LAG device where
- all the slave devices are physical DSA conduits. LAG DSA conduits also have a
- valid ``conduit->dsa_ptr`` pointer, however this is not unique, but rather a
- duplicate of the first physical DSA conduit's (LAG slave) ``dsa_ptr``. In case
- of a LAG DSA conduit, a further call to ``port_lag_join`` will be emitted
- separately for the physical CPU ports associated with the physical DSA
- conduits, requesting them to create a hardware LAG associated with the LAG
- interface.
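The port isolation that ``setup`` is expected to establish can be sketched as a per-port forwarding vector. The helpers below are hypothetical and assume a switch with at most 32 ports whose forwarding vector is a simple bitmask; real hardware may express this through port-based VLANs or other registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Bit N of the result is set if frames received on `port` may be forwarded
 * to port N. A standalone port may only reach itself and the CPU port.
 */
static uint32_t standalone_fwd_mask(unsigned int port, unsigned int cpu_port)
{
	return (UINT32_C(1) << port) | (UINT32_C(1) << cpu_port);
}

/* Two user ports are isolated if neither one's vector contains the other. */
static bool ports_isolated(unsigned int a, unsigned int b,
			   unsigned int cpu_port)
{
	uint32_t mask_a = standalone_fwd_mask(a, cpu_port);
	uint32_t mask_b = standalone_fwd_mask(b, cpu_port);

	return !(mask_a & (UINT32_C(1) << b)) &&
	       !(mask_b & (UINT32_C(1) << a));
}
```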
- PHY devices and link management
- -------------------------------
- - ``get_phy_flags``: Some switches are interfaced to various kinds of Ethernet PHYs.
- If the PHY library PHY driver needs information it cannot obtain
- on its own (e.g. coming from switch memory-mapped registers), this function
- should return a 32-bit bitmask of "flags" that is private between the switch
- driver and the Ethernet PHY driver in ``drivers/net/phy/\*``.
- - ``phy_read``: Function invoked by the DSA user MDIO bus when attempting to read
- the switch port MDIO registers. If unavailable, return 0xffff for each read.
- For builtin switch Ethernet PHYs, this function should allow reading the link
- status, auto-negotiation results, link partner pages, etc.
- - ``phy_write``: Function invoked by the DSA user MDIO bus when attempting to write
- to the switch port MDIO registers. If unavailable, return a negative error
- code.
- - ``adjust_link``: Function invoked by the PHY library when a user network device
- is attached to a PHY device. This function is responsible for appropriately
- configuring the switch port link parameters: speed, duplex, pause based on
- what the ``phy_device`` is providing.
- - ``fixed_link_update``: Function invoked by the PHY library, and specifically by
- the fixed PHY driver, asking the switch driver for link parameters that could
- not be auto-negotiated, or obtained by reading the PHY registers through MDIO.
- This is particularly useful for specific kinds of hardware such as QSGMII,
- MoCA or other kinds of non-MDIO managed PHYs where out-of-band link
- information is obtained.
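The return-value conventions of ``phy_read`` and ``phy_write`` can be sketched with a user-space register model. Everything here is hypothetical: the ``model_switch`` structure, the register array, and the choice of ``-ENODEV`` as the negative error code are assumptions made for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NUM_PORTS	4
#define MODEL_NUM_REGS	32

struct model_switch {
	bool has_phy[MODEL_NUM_PORTS];	/* whether the port has a readable PHY */
	uint16_t regs[MODEL_NUM_PORTS][MODEL_NUM_REGS];
};

static bool model_phy_valid(const struct model_switch *sw, int port, int regnum)
{
	return port >= 0 && port < MODEL_NUM_PORTS &&
	       regnum >= 0 && regnum < MODEL_NUM_REGS && sw->has_phy[port];
}

/* Reads return 0xffff when the register is unavailable. */
static int model_phy_read(const struct model_switch *sw, int port, int regnum)
{
	if (!model_phy_valid(sw, port, regnum))
		return 0xffff;
	return sw->regs[port][regnum];
}

/* Writes return a negative error code when the register is unavailable. */
static int model_phy_write(struct model_switch *sw, int port, int regnum,
			   uint16_t val)
{
	if (!model_phy_valid(sw, port, regnum))
		return -ENODEV;
	sw->regs[port][regnum] = val;
	return 0;
}
```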
- Ethtool operations
- ------------------
- - ``get_strings``: ethtool function used to query the driver's strings, will
- typically return statistics strings, private flags strings, etc.
- - ``get_ethtool_stats``: ethtool function used to query per-port statistics and
- return their values. DSA overlays the general statistics of user network
- devices (RX/TX counters from the network device) with switch driver specific
- statistics per port.
- - ``get_sset_count``: ethtool function used to query the number of statistics items
- - ``get_wol``: ethtool function used to obtain Wake-on-LAN settings per-port, this
- function may for certain implementations also query the conduit network device
- Wake-on-LAN settings if this interface needs to participate in Wake-on-LAN
- - ``set_wol``: ethtool function used to configure Wake-on-LAN settings per-port,
- direct counterpart to ``get_wol`` with similar restrictions
- - ``set_eee``: ethtool function which is used to configure a switch port EEE (Green
- Ethernet) settings, can optionally invoke the PHY library to enable EEE at the
- PHY level if relevant. This function should enable EEE at the switch port MAC
- controller and data-processing logic
- - ``get_eee``: ethtool function which is used to query a switch port EEE settings,
- this function should return the EEE state of the switch port MAC controller
- and data-processing logic as well as query the PHY for its currently configured
- EEE settings
- - ``get_eeprom_len``: ethtool function returning for a given switch the EEPROM
- length/size in bytes
- - ``get_eeprom``: ethtool function returning for a given switch the EEPROM contents
- - ``set_eeprom``: ethtool function writing specified data to a given switch EEPROM
- - ``get_regs_len``: ethtool function returning the register length for a given
- switch
- - ``get_regs``: ethtool function returning the Ethernet switch internal register
- contents. This function might require user-land code in ethtool to
- pretty-print register values and registers
- Power management
- ----------------
- - ``suspend``: function invoked by the DSA platform device when the system goes to
- suspend, should quiesce all Ethernet switch activities, but keep ports
- participating in Wake-on-LAN active as well as additional wake-up logic if
- supported
- - ``resume``: function invoked by the DSA platform device when the system resumes,
- should resume all Ethernet switch activities and re-configure the switch to be
- in a fully active state
- - ``port_enable``: function invoked by the DSA user network device ndo_open
- function when a port is administratively brought up, this function should
- fully enable a given switch port. DSA takes care of marking the port with
- ``BR_STATE_BLOCKING`` if the port is a bridge member, or ``BR_STATE_FORWARDING`` if it
- is not, and propagating these changes down to the hardware
- - ``port_disable``: function invoked by the DSA user network device ndo_close
- function when a port is administratively brought down, this function should
- fully disable a given switch port. DSA takes care of marking the port with
- ``BR_STATE_DISABLED`` and propagating changes to the hardware if this port is
- disabled while being a bridge member
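The STP state that DSA applies around ``port_enable`` and ``port_disable`` can be summarized with the short sketch below. The helper names are hypothetical; the ``BR_STATE_*`` values are redefined locally to keep the example self-contained and are meant to mirror the bridge UAPI definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the bridge STP states, assumed to match the UAPI values. */
enum {
	BR_STATE_DISABLED = 0,
	BR_STATE_LISTENING = 1,
	BR_STATE_LEARNING = 2,
	BR_STATE_FORWARDING = 3,
	BR_STATE_BLOCKING = 4,
};

/* State applied when the port is administratively brought up. */
static int model_state_on_enable(bool bridge_member)
{
	return bridge_member ? BR_STATE_BLOCKING : BR_STATE_FORWARDING;
}

/* State applied when the port is administratively brought down. */
static int model_state_on_disable(void)
{
	return BR_STATE_DISABLED;
}
```

A bridge member starts out blocking because the bridge layer, not DSA, decides when the port may begin learning and forwarding.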
- Address databases
- -----------------
- Switching hardware is expected to have a table for FDB entries, however not all
- of them are active at the same time. An address database is the subset (partition)
- of FDB entries that is active (can be matched by address learning on RX, or FDB
- lookup on TX) depending on the state of the port. An address database may
- occasionally be called "FID" (Filtering ID) in this document, although the
- underlying implementation may choose whatever is available to the hardware.
- For example, all ports that belong to a VLAN-unaware bridge (which is
- *currently* VLAN-unaware) are expected to learn source addresses in the
- database associated by the driver with that bridge (and not with other
- VLAN-unaware bridges). During forwarding and FDB lookup, a packet received on a
- VLAN-unaware bridge port should be able to find a VLAN-unaware FDB entry having
- the same MAC DA as the packet, which is present on another port member of the
- same bridge. At the same time, the FDB lookup process must be able to not find
- an FDB entry having the same MAC DA as the packet, if that entry points towards
- a port which is a member of a different VLAN-unaware bridge (and is therefore
- associated with a different address database).
- Similarly, each VLAN of each offloaded VLAN-aware bridge should have an
- associated address database, which is shared by all ports which are members of
- that VLAN, but not shared by ports belonging to different bridges that are
- members of the same VID.
- In this context, a VLAN-unaware database means that all packets are expected to
- match on it irrespective of VLAN ID (only MAC address lookup), whereas a
- VLAN-aware database means that packets are supposed to match based on the VLAN
- ID from the classified 802.1Q header (or the pvid if untagged).
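The partitioning of the FDB into address databases can be sketched with a small lookup routine. This is a user-space model under stated assumptions: ``model_fdb_entry`` and ``model_fdb_lookup()`` are hypothetical, each bridge is given one database ID, and the per-VID databases of a VLAN-aware bridge are modeled by matching on the VID inside that bridge's ID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_fdb_entry {
	unsigned int db;	/* hypothetical database ID, one per bridge */
	uint8_t mac[6];
	uint16_t vid;		/* 0 for VLAN-unaware entries */
	int port;		/* destination port */
};

/*
 * Returns the destination port, or -1 when no entry in the active database
 * matches the packet.
 */
static int model_fdb_lookup(const struct model_fdb_entry *tbl, size_t n,
			    unsigned int db, bool vlan_aware,
			    const uint8_t *mac, uint16_t vid)
{
	size_t i;

	/* A VLAN-unaware lookup ignores the packet's VID (MAC-only match). */
	if (!vlan_aware)
		vid = 0;

	for (i = 0; i < n; i++) {
		if (tbl[i].db == db && tbl[i].vid == vid &&
		    memcmp(tbl[i].mac, mac, 6) == 0)
			return tbl[i].port;
	}
	return -1;
}
```

The same MAC address can therefore resolve to different ports, or to none, depending on which database the ingress port is currently associated with.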
- At the bridge layer, VLAN-unaware FDB entries have the special VID value of 0,
- whereas VLAN-aware FDB entries have non-zero VID values. Note that a
- VLAN-unaware bridge may have VLAN-aware (non-zero VID) FDB entries, and a
- VLAN-aware bridge may have VLAN-unaware FDB entries. As in hardware, the
- software bridge keeps separate address databases, and offloads to hardware the
- FDB entries belonging to these databases, through switchdev, asynchronously
- relative to the moment when the databases become active or inactive.
- When a user port operates in standalone mode, its driver should configure it to
- use a separate database called a port private database. This is different from
- the databases described above, and should impede operation as standalone port
- (packet in, packet out to the CPU port) as little as possible. For example,
- on ingress, it should not attempt to learn the MAC SA of ingress traffic, since
- learning is a bridging layer service and this is a standalone port, therefore
- it would consume useless space. With no address learning, the port private
- database should be empty in a naive implementation, and in this case, all
- received packets should be trivially flooded to the CPU port.
- DSA (cascade) and CPU ports are also called "shared" ports because they service
- multiple address databases, and the database that a packet should be associated
- to is usually embedded in the DSA tag. This means that the CPU port may
- simultaneously transport packets coming from a standalone port (which were
- classified by hardware in one address database), and from a bridge port (which
- were classified to a different address database).
- Switch drivers which satisfy certain criteria are able to optimize the naive
- configuration by removing the CPU port from the flooding domain of the switch,
- and just program the hardware with FDB entries pointing towards the CPU port
- for which it is known that software is interested in those MAC addresses.
- Packets which do not match a known FDB entry will not be delivered to the CPU,
- which will save CPU cycles required for creating an skb just to drop it.
- DSA is able to perform host address filtering for the following kinds of
- addresses:
- - Primary unicast MAC addresses of ports (``dev->dev_addr``). These are
- associated with the port private database of the respective user port,
- and the driver is notified to install them through ``port_fdb_add`` towards
- the CPU port.
- - Secondary unicast and multicast MAC addresses of ports (addresses added
- through ``dev_uc_add()`` and ``dev_mc_add()``). These are also associated
- with the port private database of the respective user port.
- - Local/permanent bridge FDB entries (``BR_FDB_LOCAL``). These are the MAC
- addresses of the bridge ports, for which packets must be terminated locally
- and not forwarded. They are associated with the address database for that
- bridge.
- - Static bridge FDB entries installed towards foreign (non-DSA) interfaces
- present in the same bridge as some DSA switch ports. These are also
- associated with the address database for that bridge.
- - Dynamically learned FDB entries on foreign interfaces present in the same
- bridge as some DSA switch ports, only if ``ds->assisted_learning_on_cpu_port``
- is set to true by the driver. These are associated with the address database
- for that bridge.
- For various operations detailed below, DSA provides a ``dsa_db`` structure
- which can be of the following types:
- - ``DSA_DB_PORT``: the FDB (or MDB) entry to be installed or deleted belongs to
- the port private database of user port ``db->dp``.
- - ``DSA_DB_BRIDGE``: the entry belongs to one of the address databases of bridge
- ``db->bridge``. Separation between the VLAN-unaware database and the per-VID
- databases of this bridge is expected to be done by the driver.
- - ``DSA_DB_LAG``: the entry belongs to the address database of LAG ``db->lag``.
- Note: ``DSA_DB_LAG`` is currently unused and may be removed in the future.
- The drivers which act upon the ``dsa_db`` argument in ``port_fdb_add``,
- ``port_mdb_add`` etc. should declare ``ds->fdb_isolation`` as true.
- DSA associates each offloaded bridge and each offloaded LAG with a one-based ID
- (``struct dsa_bridge :: num``, ``struct dsa_lag :: id``) for the purposes of
- refcounting addresses on shared ports. Drivers may piggyback on DSA's numbering
- scheme (the ID is readable through ``db->bridge.num`` and ``db->lag.id``) or may
- implement their own.
- Only the drivers which declare support for FDB isolation are notified of FDB
- entries on the CPU port belonging to ``DSA_DB_PORT`` databases.
- For compatibility/legacy reasons, ``DSA_DB_BRIDGE`` addresses are notified to
- drivers even if they do not support FDB isolation. However, ``db->bridge.num``
- and ``db->lag.id`` are always set to 0 in that case (to denote the lack of
- isolation, for refcounting purposes).
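One way a driver might translate a ``dsa_db`` argument into a hardware FID is sketched below. This is a user-space model, not kernel code: ``struct model_db`` only imitates the shape of ``dsa_db``, and the FID layout (0 for the shared, non-isolated case, then one FID per port, then one per bridge) is an assumption chosen for illustration.

```c
#include <assert.h>

enum model_db_type { MODEL_DB_PORT, MODEL_DB_LAG, MODEL_DB_BRIDGE };

struct model_db {
	enum model_db_type type;
	union {
		unsigned int port;		/* stands in for db->dp */
		unsigned int lag_id;		/* stands in for db->lag.id */
		unsigned int bridge_num;	/* stands in for db->bridge.num */
	} u;
};

/*
 * Hypothetical FID layout for a switch with num_ports ports:
 *   0                    shared database (bridge number 0, no isolation)
 *   1 .. num_ports       port private databases
 *   num_ports + 1 ..     one database per offloaded bridge (one-based num)
 * Returns -1 for database types this model does not handle.
 */
static int model_db_to_fid(const struct model_db *db, unsigned int num_ports)
{
	switch (db->type) {
	case MODEL_DB_PORT:
		return 1 + (int)db->u.port;
	case MODEL_DB_BRIDGE:
		if (db->u.bridge_num == 0)
			return 0;
		return (int)(num_ports + db->u.bridge_num);
	case MODEL_DB_LAG:
	default:
		return -1;
	}
}
```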
- Note that it is not mandatory for a switch driver to implement physically
- separate address databases for each standalone user port. Since FDB entries in
- the port private databases will always point to the CPU port, there is no risk
- for incorrect forwarding decisions. In this case, all standalone ports may
- share the same database, but the reference counting of host-filtered addresses
- (not deleting the FDB entry for a port's MAC address if it's still in use by
- another port) becomes the responsibility of the driver, because DSA is unaware
- that the port databases are in fact shared. This can be achieved by calling
- ``dsa_fdb_present_in_other_db()`` and ``dsa_mdb_present_in_other_db()``.
- The downside is that the RX filtering lists of each user port are in fact
- shared, which means that user port A may accept a packet with a MAC DA it
- shouldn't have, only because that MAC address was in the RX filtering list of
- user port B. These packets will still be dropped in software, however.
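The driver-side reference counting for a shared port database can be sketched as follows. This user-space model is hypothetical throughout: ``model_present_on_other_port()`` plays the role that ``dsa_fdb_present_in_other_db()`` plays in a real driver, the counters stand in for hardware writes, and each (port, MAC) pair is assumed to be added at most once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_ADDRS 16

struct model_host_addr {
	bool used;
	int port;
	uint8_t mac[6];
};

static struct model_host_addr addrs[MODEL_MAX_ADDRS];
static int hw_installs;	/* counts writes to the shared hardware database */
static int hw_removals;	/* counts deletions from the shared hardware database */

static bool model_present_on_other_port(int port, const uint8_t *mac)
{
	int i;

	for (i = 0; i < MODEL_MAX_ADDRS; i++) {
		if (addrs[i].used && addrs[i].port != port &&
		    memcmp(addrs[i].mac, mac, 6) == 0)
			return true;
	}
	return false;
}

/* Install in hardware only if no other port already holds the address. */
static int model_host_fdb_add(int port, const uint8_t *mac)
{
	int i;

	for (i = 0; i < MODEL_MAX_ADDRS; i++) {
		if (addrs[i].used)
			continue;
		if (!model_present_on_other_port(port, mac))
			hw_installs++;
		addrs[i].used = true;
		addrs[i].port = port;
		memcpy(addrs[i].mac, mac, 6);
		return 0;
	}
	return -1;	/* table full */
}

/* Remove from hardware only when the last user of the address goes away. */
static int model_host_fdb_del(int port, const uint8_t *mac)
{
	int i;

	for (i = 0; i < MODEL_MAX_ADDRS; i++) {
		if (!addrs[i].used || addrs[i].port != port ||
		    memcmp(addrs[i].mac, mac, 6) != 0)
			continue;
		addrs[i].used = false;
		if (!model_present_on_other_port(port, mac))
			hw_removals++;
		return 0;
	}
	return -1;	/* not found */
}
```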
- Bridge layer
- ------------
- Offloading the bridge forwarding plane is optional and handled by the methods
- below. They may be absent, return -EOPNOTSUPP, or ``ds->max_num_bridges`` may
- be non-zero and exceeded, and in this case, joining a bridge port is still
- possible, but the packet forwarding will take place in software, and the ports
- under a software bridge must remain configured in the same way as for
- standalone operation, i.e. have all bridging service functions (address
- learning etc.) disabled, and send all received packets to the CPU port only.
- Concretely, a port starts offloading the forwarding plane of a bridge once it
- returns success to the ``port_bridge_join`` method, and stops doing so after
- ``port_bridge_leave`` has been called. Offloading the bridge means autonomously
- learning FDB entries in accordance with the software bridge port's state, and
- autonomously forwarding (or flooding) received packets without CPU intervention.
- This is optional even when offloading a bridge port. Tagging protocol drivers
- are expected to call ``dsa_default_offload_fwd_mark(skb)`` for packets which
- have already been autonomously forwarded in the forwarding domain of the
- ingress switch port. DSA, through ``dsa_port_devlink_setup()``, considers all
- switch ports part of the same tree ID to be part of the same bridge forwarding
- domain (capable of autonomous forwarding to each other).
- Offloading the TX forwarding process of a bridge is a distinct concept from
- simply offloading its forwarding plane, and refers to the ability of certain
- driver and tag protocol combinations to transmit a single skb coming from the
- bridge device's transmit function to potentially multiple egress ports (and
- thereby avoid its cloning in software).
- Packets for which the bridge requests this behavior are called data plane
- packets and have ``skb->offload_fwd_mark`` set to true in the tag protocol
- driver's ``xmit`` function. Data plane packets are subject to FDB lookup,
- hardware learning on the CPU port, and do not override the port STP state.
- Additionally, replication of data plane packets (multicast, flooding) is
- handled in hardware and the bridge driver will transmit a single skb for each
- packet that may or may not need replication.
- When the TX forwarding offload is enabled, the tag protocol driver is
- responsible for injecting packets into the data plane of the hardware towards the
- correct bridging domain (FID) that the port is a part of. The port may be
- VLAN-unaware, and in this case the FID must be equal to the FID used by the
- driver for its VLAN-unaware address database associated with that bridge.
- Alternatively, the bridge may be VLAN-aware, and in that case, it is guaranteed
- that the packet is also VLAN-tagged with the VLAN ID that the bridge processed
- this packet in. It is the responsibility of the hardware to untag the VID on
- the egress-untagged ports, or keep the tag on the egress-tagged ones.
- - ``port_bridge_join``: bridge layer function invoked when a given switch port is
- added to a bridge, this function should do what's necessary at the switch
- level to permit the joining port to be added to the relevant logical
- domain for it to ingress/egress traffic with other members of the bridge.
- By setting the ``tx_fwd_offload`` argument to true, the TX forwarding process
- of this bridge is also offloaded.
- - ``port_bridge_leave``: bridge layer function invoked when a given switch port is
- removed from a bridge, this function should do what's necessary at the
- switch level to deny the leaving port from ingress/egress traffic from the
- remaining bridge members.
- - ``port_stp_state_set``: bridge layer function invoked when a given switch port STP
- state is computed by the bridge layer and should be propagated to switch
- hardware to forward/block/learn traffic.
- - ``port_bridge_flags``: bridge layer function invoked when a port must
- configure its settings for e.g. flooding of unknown traffic or source address
- learning. The switch driver is responsible for initial setup of the
- standalone ports with address learning disabled and egress flooding of all
- types of traffic, then the DSA core notifies of any change to the bridge port
- flags when the port joins and leaves a bridge. DSA does not currently manage
- the bridge port flags for the CPU port. The assumption is that address
- learning should be statically enabled (if supported by the hardware) on the
- CPU port, and flooding towards the CPU port should also be enabled, due to a
- lack of an explicit address filtering mechanism in the DSA core.
- - ``port_fast_age``: bridge layer function invoked when flushing the
- dynamically learned FDB entries on the port is necessary. This is called when
- transitioning from an STP state where learning should take place to an STP
- state where it shouldn't, or when leaving a bridge, or when address learning
- is turned off via ``port_bridge_flags``.
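The STP-related trigger for ``port_fast_age`` can be expressed as a small predicate. The helper names are hypothetical, the ``BR_STATE_*`` values are local copies assumed to match the bridge UAPI, and the sketch covers only the state transition case, not bridge leave or learning being turned off.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the bridge STP states, assumed to match the UAPI values. */
enum {
	BR_STATE_DISABLED = 0,
	BR_STATE_LISTENING = 1,
	BR_STATE_LEARNING = 2,
	BR_STATE_FORWARDING = 3,
	BR_STATE_BLOCKING = 4,
};

/* Address learning takes place only in the learning and forwarding states. */
static bool model_state_learns(int state)
{
	return state == BR_STATE_LEARNING || state == BR_STATE_FORWARDING;
}

/* Flush dynamic entries when moving from a learning to a non-learning state. */
static bool model_needs_fast_age(int old_state, int new_state)
{
	return model_state_learns(old_state) && !model_state_learns(new_state);
}
```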
- Bridge VLAN filtering
- ---------------------
- - ``port_vlan_filtering``: bridge layer function invoked when the bridge gets
- configured for turning on or off VLAN filtering. If nothing specific needs to
- be done at the hardware level, this callback does not need to be implemented.
- When VLAN filtering is turned on, the hardware must be programmed with
- rejecting 802.1Q frames which have VLAN IDs outside of the programmed allowed
- VLAN ID map/rules. If there is no PVID programmed into the switch port,
- untagged frames must be rejected as well. When turned off the switch must
- accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are
- allowed.
- - ``port_vlan_add``: bridge layer function invoked when a VLAN is configured
- (tagged or untagged) for the given switch port. The CPU port becomes a member
- of a VLAN only if a foreign bridge port is also a member of it (and
- forwarding needs to take place in software), or the VLAN is installed to the
- VLAN group of the bridge device itself, for termination purposes
- (``bridge vlan add dev br0 vid 100 self``). VLANs on shared ports are
- reference counted and removed when there is no user left. Drivers do not need
- to manually install a VLAN on the CPU port.
- - ``port_vlan_del``: bridge layer function invoked when a VLAN is removed from the
- given switch port
- - ``port_fdb_add``: bridge layer function invoked when the bridge wants to install a
- Forwarding Database entry. The switch hardware should be programmed with the
- specified address in the specified VLAN ID in the forwarding database
- associated with this VLAN ID.
- - ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a
- Forwarding Database entry, the switch hardware should be programmed to delete
- the specified MAC address from the specified VLAN ID if it was mapped into
- this port forwarding database
- - ``port_fdb_dump``: bridge bypass function invoked by ``ndo_fdb_dump`` on the
- physical DSA port interfaces. Since DSA does not attempt to keep in sync its
- hardware FDB entries with the software bridge, this method is implemented as
- a means to view the entries visible on user ports in the hardware database.
- The entries reported by this function have the ``self`` flag in the output of
- the ``bridge fdb show`` command.
- - ``port_mdb_add``: bridge layer function invoked when the bridge wants to install
- a multicast database entry. The switch hardware should be programmed with the
- specified address in the specified VLAN ID in the forwarding database
- associated with this VLAN ID.
- - ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a
- multicast database entry, the switch hardware should be programmed to delete
- the specified MAC address from the specified VLAN ID if it was mapped into
- this port forwarding database.
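The ingress behavior that ``port_vlan_filtering`` is expected to produce can be sketched as an acceptance check. This is a user-space model with hypothetical names; it tracks only whether a PVID is programmed and which VLAN IDs are allowed, and it does not model priority-tagged frames or egress tagging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NUM_VIDS 4096

struct model_port_vlans {
	bool filtering;				/* VLAN filtering on or off */
	bool has_pvid;				/* whether a PVID is programmed */
	uint8_t allowed[MODEL_NUM_VIDS / 8];	/* bitmap of allowed VLAN IDs */
};

static void model_vlan_allow(struct model_port_vlans *p, uint16_t vid)
{
	if (vid < MODEL_NUM_VIDS)
		p->allowed[vid / 8] |= (uint8_t)(1u << (vid % 8));
}

static bool model_ingress_accept(const struct model_port_vlans *p,
				 bool tagged, uint16_t vid)
{
	/* Filtering off: any frame is accepted, tagged or not. */
	if (!p->filtering)
		return true;

	/* Filtering on: untagged frames need a PVID to be classified. */
	if (!tagged)
		return p->has_pvid;

	/* Filtering on: tagged frames must carry an allowed VLAN ID. */
	if (vid >= MODEL_NUM_VIDS)
		return false;
	return (p->allowed[vid / 8] >> (vid % 8)) & 1u;
}
```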
- Link aggregation
- ----------------
- Link aggregation is implemented in the Linux networking stack by the bonding
- and team drivers, which are modeled as virtual, stackable network interfaces.
- DSA is capable of offloading a link aggregation group (LAG) to hardware that
- supports the feature, and supports bridging between physical ports and LAGs,
- as well as between LAGs. A bonding/team interface which holds multiple physical
- ports constitutes a logical port, although DSA has no explicit concept of a
- logical port at the moment. Due to this, events where a LAG joins/leaves a
- bridge are treated as if all individual physical ports that are members of that
- LAG join/leave the bridge. Switchdev port attributes (VLAN filtering, STP
- state, etc.) and objects (VLANs, MDB entries) offloaded to a LAG as bridge port
- are treated similarly: DSA offloads the same switchdev object / port attribute
- on all members of the LAG. Static bridge FDB entries on a LAG are not yet
- supported, since the DSA driver API does not have the concept of a logical port
- ID.
- - ``port_lag_join``: function invoked when a given switch port is added to a
- LAG. The driver may return ``-EOPNOTSUPP``, and in this case, DSA will fall
- back to a software implementation where all traffic from this port is sent to
- the CPU.
- - ``port_lag_leave``: function invoked when a given switch port leaves a LAG
- and returns to operation as a standalone port.
- - ``port_lag_change``: function invoked when the link state of any member of
- the LAG changes, and the hashing function needs rebalancing to only make use
- of the subset of physical LAG member ports that are up.
- Drivers that benefit from having an ID associated with each offloaded LAG
- can optionally populate ``ds->num_lag_ids`` from the ``dsa_switch_ops::setup``
- method. The LAG ID associated with a bonding/team interface can then be
- retrieved by a DSA switch driver using the ``dsa_lag_id`` function.
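The rebalancing that ``port_lag_change`` asks for can be sketched as choosing an egress member among only the ports whose link is up. The structure and function below are hypothetical, and the modulo-based selection is an assumption; real hardware typically uses its own hash buckets or masks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_lag_member {
	int port;
	bool link_up;
};

/*
 * Pick the egress port for a flow hash, considering only members that are
 * up. Returns -1 if no member of the LAG has link.
 */
static int model_lag_pick(const struct model_lag_member *m, size_t n,
			  uint32_t hash)
{
	size_t i, active = 0, target;

	for (i = 0; i < n; i++) {
		if (m[i].link_up)
			active++;
	}
	if (!active)
		return -1;

	/* Index into the list of active members only. */
	target = hash % active;
	for (i = 0; i < n; i++) {
		if (!m[i].link_up)
			continue;
		if (target == 0)
			return m[i].port;
		target--;
	}
	return -1;
}
```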
- IEC 62439-2 (MRP)
- -----------------
- The Media Redundancy Protocol is a topology management protocol optimized for
- fast fault recovery time for ring networks, which has some components
- implemented as a function of the bridge driver. MRP uses management PDUs
- (Test, Topology, LinkDown/Up, Option) sent at a multicast destination MAC
- address range of 01:15:4e:00:00:0x and with an EtherType of 0x88e3.
- Depending on the node's role in the ring (MRM: Media Redundancy Manager,
- MRC: Media Redundancy Client, MRA: Media Redundancy Automanager), certain MRP
- PDUs might need to be terminated locally and others might need to be forwarded.
- An MRM might also benefit from offloading to hardware the creation and
- transmission of certain MRP PDUs (Test).
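A classifier for the MRP management PDUs described above can be sketched as follows. The function name is hypothetical, and the sketch assumes the caller has already extracted the destination MAC address and the EtherType in host byte order; the trailing ``0x`` of the address range is read as any last byte from 0x00 to 0x0f.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_ETH_P_MRP 0x88e3

/* True if the frame looks like an MRP PDU (01:15:4e:00:00:0x, 0x88e3). */
static bool model_is_mrp_pdu(const uint8_t dmac[6], uint16_t ethertype)
{
	static const uint8_t prefix[5] = { 0x01, 0x15, 0x4e, 0x00, 0x00 };

	return ethertype == MODEL_ETH_P_MRP &&
	       memcmp(dmac, prefix, sizeof(prefix)) == 0 &&
	       (dmac[5] & 0xf0) == 0;
}
```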
- Normally an MRP instance can be created on top of any network interface,
- however in the case of a device with an offloaded data path such as DSA, it is
- necessary for the hardware, even if it is not MRP-aware, to be able to extract
- the MRP PDUs from the fabric before the driver can proceed with the software
- implementation. DSA today has no driver which is MRP-aware, therefore it only
- listens for the bare minimum switchdev objects required for the software assist
- to work properly. The operations are detailed below.
- - ``port_mrp_add`` and ``port_mrp_del``: notifies driver when an MRP instance
- with a certain ring ID, priority, primary port and secondary port is
- created/deleted.
- - ``port_mrp_add_ring_role`` and ``port_mrp_del_ring_role``: function invoked
- when an MRP instance changes ring roles between MRM or MRC. This affects
- which MRP PDUs should be trapped to software and which should be autonomously
- forwarded.
- IEC 62439-3 (HSR/PRP)
- ---------------------
- The Parallel Redundancy Protocol (PRP) is a network redundancy protocol which
- works by duplicating and sequence numbering packets through two independent L2
- networks (which are unaware of the PRP tail tags carried in the packets), and
- eliminating the duplicates at the receiver. The High-availability Seamless
- Redundancy (HSR) protocol is similar in concept, except all nodes that carry
- the redundant traffic are aware of the fact that it is HSR-tagged (because HSR
- uses a header with an EtherType of 0x892f) and are physically connected in a
- ring topology. Both HSR and PRP use supervision frames for monitoring the
- health of the network and for discovery of other nodes.
- In Linux, both HSR and PRP are implemented in the hsr driver, which
- instantiates a virtual, stackable network interface with two member ports.
- The driver only implements the basic roles of DANH (Doubly Attached Node
- implementing HSR), DANP (Doubly Attached Node implementing PRP) and RedBox
- (allows non-HSR devices to connect to the ring via Interlink ports).
- A driver which is capable of offloading certain functions should declare the
- corresponding netdev features as indicated by the documentation at
- ``Documentation/networking/netdev-features.rst``. Additionally, the following
- methods must be implemented:
- - ``port_hsr_join``: function invoked when a given switch port is added to a
- DANP/DANH. The driver may return ``-EOPNOTSUPP`` and in this case, DSA will
- fall back to a software implementation where all traffic from this port is
- sent to the CPU.
- - ``port_hsr_leave``: function invoked when a given switch port leaves a
- DANP/DANH and returns to normal operation as a standalone port.
- Note that the ``NETIF_F_HW_HSR_DUP`` feature relies on transmission towards
- multiple ports, which is generally available whenever the tagging protocol uses
- the ``dsa_xmit_port_mask()`` helper function. If the helper is used, the HSR
- offload feature should also be set. The ``dsa_port_simple_hsr_join()`` and
- ``dsa_port_simple_hsr_leave()`` methods can be used as generic implementations
- of ``port_hsr_join`` and ``port_hsr_leave``, if this is the only supported
- offload feature.
- TODO
- ====
- Making SWITCHDEV and DSA converge towards a unified codebase
- ------------------------------------------------------------
- SWITCHDEV properly takes care of abstracting the networking stack with offload
- capable hardware, but does not enforce a strict switch device driver model. On
- the other hand, DSA enforces a fairly strict device driver model, and deals with
- most of the switch-specific details. At some point we should envision a merger
- between these two subsystems and get the best of both worlds.
|