.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the Netfilter flowtable infrastructure, which allows
you to define a fastpath through the flowtable datapath. This infrastructure
also provides hardware offload support. The flowtable supports the layer 3
IPv4 and IPv6 protocols and the layer 4 TCP and UDP protocols.

Overview
--------

Once the first packet of a flow successfully goes through the IP forwarding
path, from the second packet on, you might decide to offload the flow to the
flowtable through your ruleset. The flowtable infrastructure provides a rule
action that allows you to specify when to add a flow to the flowtable.

A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is
transmitted to the output netdevice via neigh_xmit(). Hence, such packets bypass
the classic IP forwarding path (the visible effect is that you do not see these
packets from any of the Netfilter hooks coming after ingress). If there is no
matching entry in the flowtable (i.e. a flowtable miss), the packet follows the
classic IP forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
source and destination addresses, layer 4 source and destination ports, and the
input interface (useful in case there are several conntrack zones in place).

The 'flow add' action allows you to populate the flowtable: the user selectively
specifies which flows are placed into the flowtable. Hence, packets follow the
classic IP forwarding path unless the user explicitly instructs flows to use
this new alternative forwarding path via policy.

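As a rough sketch of such a selective policy, the fragment below offloads only
TCP flows towards a single destination port and leaves all other traffic on the
classic IP forwarding path. The table, chain and flowtable names, the device
names and port 5201 are illustrative assumptions, not values required by the
flowtable infrastructure::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            # Assumption: only flows to TCP port 5201 are offloaded.
            tcp dport 5201 flow add @f
        }
    }

Any match expression that is valid in the forward chain can be placed in front
of 'flow add' to narrow down which flows take the fastpath.
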
The flowtable datapath is represented in Fig.1, which describes the classic IP
forwarding path including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that is specified from the classic IP
forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
traffic is passed up to follow the classic IP forwarding path: since the
transport header is missing, flowtable lookups are not possible.
TCP RST and FIN packets are also passed up to the classic IP forwarding path to
release the flow gracefully. Packets that exceed the MTU are also passed up to
the classic forwarding path to report packet-too-big ICMP errors to the sender.

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow add @f
            counter packets 0 bytes 0
        }
    }

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain (make sure the flowtable priority is smaller than that of
the nftables ingress chain, so that the flowtable runs first in the pipeline).

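To illustrate the priority ordering, the following sketch pairs a flowtable at
priority 0 with a netdev ingress chain at priority 10 on the same device, so
that the flowtable lookup is performed first. The second table 'z', the chain
name 'c' and both priority values are assumptions chosen for this example::

    table inet x {
        flowtable f {
            # Lower priority value: runs first at the ingress hook.
            hook ingress priority 0; devices = { eth0, eth1 };
        }
    }
    table netdev z {
        chain c {
            # Higher priority value: runs after the flowtable on eth0.
            type filter hook ingress device eth0 priority 10; policy accept;
        }
    }
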
The 'flow add' action (also spelled 'flow offload', as in Fig.1) from the
forward chain 'y' adds an entry to the flowtable for the TCP syn-ack packet
coming in the reply direction. Once the flow is offloaded, you will observe
that the counter rule in the example above does not get updated for the packets
that are being forwarded through the forwarding bypass.

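Note that, in an inet table, 'ip protocol tcp' only matches IPv4 TCP traffic.
As a sketch of a broader policy, assuming you want TCP and UDP flows offloaded
for both IPv4 and IPv6, the rule in chain 'y' could match on the layer 4
protocol instead (names and devices are reused from the example above)::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            # Matches TCP and UDP for both IPv4 and IPv6.
            meta l4proto { tcp, udp } flow add @f
        }
    }
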
You can identify offloaded flows through the [OFFLOAD] tag when listing your
connection tracking table.

::

    # conntrack -L
    tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2

Layer 2 encapsulation
---------------------

Since Linux kernel 5.13, the flowtable infrastructure discovers the real
netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
VLAN ID / PPPoE session ID, which are used for the flowtable lookups. The
flowtable datapath also deals with layer 2 decapsulation.

You do not need to add the PPPoE and the VLAN devices to your flowtable;
the real device is sufficient for the flowtable to track your flows.

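For instance, assuming a VLAN device eth0.100 stacked on top of eth0 (both
names are hypothetical), a flowtable definition along these lines lists only
the real devices::

    table inet x {
        flowtable f {
            # eth0.100 is not listed: its real device eth0 is sufficient.
            hook ingress priority 0; devices = { eth0, eth1 };
        }
    }
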
Bridge and IP forwarding
------------------------

Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
flowtable infrastructure discovers the topology behind the bridge device. This
allows the flowtable to define a fastpath bypass between the bridge ports
(represented as eth1 and eth2 in the example figure below) and the gateway
device (represented as eth0) in your switch/router.

::

                      fastpath bypass
               .-------------------------.
              /                           \
              |           IP forwarding   |
              |          /             \ \/
              |       br0               eth0 ..... eth0
              .       / \                          *host B*
               -> eth1  eth2
              .           *switch/router*
              .
              .
            eth0
          *host A*

The flowtable infrastructure also supports bridge VLAN filtering actions
such as PVID and untagged. You can also stack a classic VLAN device on top of
your bridge port.

If you want your flowtable to define a fastpath between your bridge
ports and your IP forwarding path, you have to add your bridge ports (as
represented by the real netdevice) to your flowtable definition.

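Using the device names from the figure above, a minimal sketch of such a
definition could look as follows. The table and flowtable names are
assumptions; eth1 and eth2 are the bridge ports and eth0 is the gateway
device::

    table inet x {
        flowtable f {
            # Bridge ports (eth1, eth2) plus the gateway device (eth0);
            # the bridge device br0 itself is not listed.
            hook ingress priority 0; devices = { eth0, eth1, eth2 };
        }
    }
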
Counters
--------

The flowtable can synchronize packet and byte counters with the existing
connection tracking entry by specifying the counter statement in your flowtable
definition, e.g.

::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
            counter
        }
    }

Counter support is available since Linux kernel 5.7.

Hardware offload
----------------

If your network device provides hardware offload support, you can turn it on by
means of the 'offload' flag in your flowtable definition, e.g.

::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
            flags offload;
        }
    }

There is a workqueue that adds the flows to the hardware. Note that a few
packets might still run over the flowtable software path until the workqueue has
a chance to offload the flow to the network device.

You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
listing your connection tracking table. Note that the [OFFLOAD] tag refers to
the software flowtable fastpath, whereas [HW_OFFLOAD] refers to the hardware
offload datapath being used by the flow.

The flowtable hardware offload infrastructure also supports the DSA
(Distributed Switch Architecture).

Limitations
-----------

The flowtable behaves like a cache. The flowtable entries might get stale if
either the destination MAC address or the egress netdevice that is used for
transmission changes.

This might be a problem if:

- You run the flowtable in software mode and you combine bridge and IP
  forwarding in your setup.
- Hardware offload is enabled.

More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_
and gives a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html