.. SPDX-License-Identifier: GPL-2.0

=============================================
Open vSwitch datapath developer documentation
=============================================

The Open vSwitch kernel module allows flexible userspace control over
flow-level packet processing on selected network devices. It can be
used to implement a plain Ethernet switch, network device bonding,
VLAN processing, network access control, flow-based network control,
and so on.

The kernel module implements multiple "datapaths" (analogous to
bridges), each of which can have multiple "vports" (analogous to ports
within a bridge). Each datapath also has associated with it a "flow
table" that userspace populates with "flows" that map from keys based
on packet headers and metadata to sets of actions. The most common
action forwards the packet to another vport; other actions are also
implemented.

When a packet arrives on a vport, the kernel module processes it by
extracting its flow key and looking it up in the flow table. If there
is a matching flow, it executes the associated actions. If there is
no match, it queues the packet to userspace for processing (as part of
its processing, userspace will likely set up a flow to handle further
packets of the same type entirely in-kernel).
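
The receive path described above can be sketched in a few lines of
Python. This is only a toy model, not the kernel implementation: the
``Datapath`` class, the ``extract_key`` helper, the dictionary packet
representation, and the ``"output:2"`` action string are all
hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


def extract_key(in_port, packet):
    """Hypothetical key extraction: metadata plus a few header fields."""
    return (in_port, packet["eth_src"], packet["eth_dst"], packet["eth_type"])


@dataclass
class Datapath:
    """Toy datapath: a flow table plus a queue of packets sent to userspace."""
    flow_table: dict = field(default_factory=dict)  # flow key -> list of actions
    upcalls: list = field(default_factory=list)     # (key, packet) pairs for userspace

    def receive(self, in_port, packet):
        key = extract_key(in_port, packet)
        actions = self.flow_table.get(key)
        if actions is None:
            # No matching flow: queue the packet, with its key, to userspace.
            self.upcalls.append((key, packet))
            return None
        # Matching flow: the caller would now execute these actions.
        return actions


dp = Datapath()
pkt = {"eth_src": "e0:91:f5:21:d0:b2", "eth_dst": "00:02:e3:0f:80:a4",
       "eth_type": 0x0800}

first = dp.receive(1, pkt)          # miss: packet is queued to userspace
key, _ = dp.upcalls[0]
dp.flow_table[key] = ["output:2"]   # userspace installs a flow for this key
second = dp.receive(1, pkt)         # hit: handled without another upcall
```

After the flow is installed, later packets with the same key no longer
reach the upcall queue, which is the "entirely in-kernel" fast path.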

Flow key compatibility
----------------------

Network protocols evolve over time. New protocols become important
and existing protocols lose their prominence. For the Open vSwitch
kernel module to remain relevant, it must be possible for newer
versions to parse additional protocols as part of the flow key. It
might even be desirable, someday, to drop support for parsing
protocols that have become obsolete. Therefore, the Netlink interface
to Open vSwitch is designed to allow carefully written userspace
applications to work with any version of the flow key, past or future.

To support this forward and backward compatibility, whenever the
kernel module passes a packet to userspace, it also passes along the
flow key that it parsed from the packet. Userspace then extracts its
own notion of a flow key from the packet and compares it against the
kernel-provided version:

- If userspace's notion of the flow key for the packet matches the
  kernel's, then nothing special is necessary.

- If the kernel's flow key includes more fields than the userspace
  version of the flow key, for example if the kernel decoded IPv6
  headers but userspace stopped at the Ethernet type (because it
  does not understand IPv6), then again nothing special is
  necessary. Userspace can still set up a flow in the usual way,
  as long as it uses the kernel-provided flow key to do it.

- If the userspace flow key includes more fields than the
  kernel's, for example if userspace decoded an IPv6 header but
  the kernel stopped at the Ethernet type, then userspace can
  forward the packet manually, without setting up a flow in the
  kernel. This case is bad for performance because every packet
  that the kernel considers part of the flow must go to userspace,
  but the forwarding behavior is correct. (If userspace can
  determine that the values of the extra fields would not affect
  forwarding behavior, then it could set up a flow anyway.)
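
The three cases above amount to a small decision procedure. The sketch
below is one way userspace might express it, under simplifying
assumptions: a flow key is modeled as a dictionary from attribute name
to value, only the attribute names are compared, and the function name
and return strings are hypothetical.

```python
def choose_handling(kernel_key, user_key, extra_fields_matter=True):
    """Decide how userspace handles a packet the kernel passed up.

    kernel_key / user_key: dicts mapping attribute name -> parsed value.
    extra_fields_matter: whether fields only userspace parsed could
    change the forwarding decision.
    """
    kernel_attrs = set(kernel_key)
    user_attrs = set(user_key)

    if user_attrs <= kernel_attrs:
        # Cases 1 and 2: the keys agree, or the kernel parsed more.
        # Install a flow using the kernel-provided key.
        return "install flow with kernel key"

    # Case 3: userspace parsed fields that the kernel did not.
    if extra_fields_matter:
        return "forward in userspace"
    return "install flow with kernel key"


kernel_v6 = {"in_port": 1, "eth": "...", "eth_type": 0x86DD, "ipv6": "..."}
user_l2 = {"in_port": 1, "eth": "...", "eth_type": 0x86DD}
```

Here ``choose_handling(kernel_v6, user_l2)`` installs a flow (the kernel
parsed more), while swapping the two arguments yields manual forwarding
unless the caller states that the extra fields do not matter.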

How flow keys evolve over time is important to making this work, so
the following sections go into detail.

Flow key format
---------------

A flow key is passed over a Netlink socket as a sequence of Netlink
attributes. Some attributes represent packet metadata, defined as any
information about a packet that cannot be extracted from the packet
itself, e.g. the vport on which the packet was received. Most
attributes, however, are extracted from headers within the packet,
e.g. source and destination addresses from Ethernet, IP, or TCP
headers.

The <linux/openvswitch.h> header file defines the exact format of the
flow key attributes. For informal explanatory purposes here, we write
them as comma-separated strings, with parentheses indicating arguments
and nesting. For example, the following could represent a flow key
corresponding to a TCP packet that arrived on vport 1::

    in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
    eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0,
    frag=no), tcp(src=49163, dst=80)

Often we elide arguments that are not important to the discussion, e.g.::

    in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
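
To make the informal notation concrete, the following sketch renders a
nested attribute structure as such a string. The in-memory
representation (a list of ``(name, value)`` pairs, with a ``dict`` for
named arguments and a ``list`` for nested attributes) and the
``format_key`` name are assumptions made for this example; the real
encoding is the Netlink attribute format defined in
<linux/openvswitch.h>.

```python
def format_key(attrs):
    """Render a list of (name, value) pairs in the informal notation."""
    parts = []
    for name, value in attrs:
        if isinstance(value, dict):
            # Named arguments, e.g. tcp(src=49163, dst=80).
            args = ", ".join(f"{k}={v}" for k, v in value.items())
        elif isinstance(value, list):
            # Nested attributes, rendered recursively.
            args = format_key(value)
        else:
            # A single unnamed argument, e.g. in_port(1).
            args = str(value)
        parts.append(f"{name}({args})")
    return ", ".join(parts)


example = [
    ("in_port", 1),
    ("eth_type", "0x0800"),
    ("tcp", {"src": 49163, "dst": 80}),
]
rendered = format_key(example)
```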

Wildcarded flow key format
--------------------------

A wildcarded flow is described with two sequences of Netlink attributes
passed over the Netlink socket: a flow key, exactly as described above,
and an optional corresponding flow mask.

A wildcarded flow can represent a group of exact-match flows. Each '1' bit
in the mask specifies an exact match with the corresponding bit in the flow
key. A '0' bit specifies a don't-care bit, which will match either a '1' or
a '0' bit of an incoming packet. Using wildcarded flows can improve the flow
setup rate by reducing the number of new flows that need to be processed by
the userspace program.
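
The mask semantics can be illustrated on a single field. The sketch
below treats an IPv4 destination address as a 32-bit integer; the
``matches`` and ``ip`` helpers are hypothetical names, and a real flow
mask covers every field of the key rather than one address.

```python
import ipaddress


def ip(text):
    """Convert dotted-quad text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def matches(packet_bits, key, mask):
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_bits & mask) == (key & mask)


KEY = ip("172.18.0.52")
MASK_16 = 0xFFFF0000      # exact match on the top 16 bits, don't care below
MASK_EXACT = 0xFFFFFFFF   # every bit must match

same_prefix = matches(ip("172.18.200.1"), KEY, MASK_16)     # True
other_prefix = matches(ip("172.19.0.1"), KEY, MASK_16)      # False
exact_only = matches(ip("172.18.200.1"), KEY, MASK_EXACT)   # False
```

With ``MASK_16``, one installed flow covers every destination in
172.18.0.0/16, where exact-match flows would require a separate setup
for each address seen.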

Support for the mask Netlink attribute is optional for both the kernel and
the userspace program. The kernel can ignore the mask attribute, installing
an exact-match flow, or reduce the number of don't-care bits to fewer than
the userspace program specified. In this case, variations in bits that the
kernel does not implement will simply result in additional flow setups.
The kernel module will also work with userspace programs that neither
support nor supply flow mask attributes.

Since the kernel may ignore or modify wildcard bits, it can be difficult for
the userspace program to know exactly what matches are installed. There are
two possible approaches: reactively install flows as they miss the kernel
flow table (and therefore not attempt to determine wildcard changes at all)
or use the kernel's response messages to determine the installed wildcards.

When interacting with userspace, the kernel should maintain the match portion
of the key exactly as originally installed. This provides a handle to
identify the flow for all future operations. However, when reporting the
mask of an installed flow, the mask should include any restrictions imposed
by the kernel.

The behavior when using overlapping wildcarded flows is undefined. It is the
responsibility of the userspace program to ensure that any incoming packet
can match at most one flow, wildcarded or not. The current implementation
performs best-effort detection of overlapping wildcarded flows and may reject
some but not all of them. However, this behavior may change in future versions.
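
One way a userspace program might check for overlap before installing a
flow is sketched below. It is not the kernel's detection logic, only an
illustration of the property userspace must guarantee: two masked flows
overlap exactly when their keys agree on every bit that both masks care
about, because a packet matching both can then be constructed. Keys and
masks are modeled as plain integers and ``overlaps`` is a hypothetical
name.

```python
def overlaps(key_a, mask_a, key_b, mask_b):
    """True if some packet could match both masked flows."""
    common = mask_a & mask_b   # bits on which both flows require an exact match
    return (key_a & common) == (key_b & common)


# Flow A matches on the two high bits, flow B on the two low bits.
# They constrain disjoint bits, so a packet such as 0b1001 matches both.
a_vs_b = overlaps(0b1000, 0b1100, 0b0001, 0b0011)

# Flows A and C both match on the two high bits but require different values.
a_vs_c = overlaps(0b1000, 0b1100, 0b0100, 0b1100)
```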

Unique flow identifiers
-----------------------

An alternative to using the original match portion of a key as the handle for
flow identification is a unique flow identifier, or "UFID". UFIDs are optional
for both the kernel and the userspace program.

Userspace programs that support UFIDs are expected to provide one during flow
setup in addition to the flow, then refer to the flow using the UFID for all
future operations. The kernel is not required to index flows by the original
flow key if a UFID is specified.
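
The following toy flow table shows the two kinds of handle side by
side. It models only management operations (set up, then look up the
same flow later), not packet matching, and everything in it is an
assumption for illustration: the ``FlowTable`` class, its method names,
and the use of a ``bytes`` value as the UFID.

```python
class FlowTable:
    """Toy table whose flows are identified by UFID or by original key."""

    def __init__(self):
        self._by_ufid = {}
        self._by_key = {}

    def add(self, key, actions, ufid=None):
        if ufid is not None:
            # With a UFID, this sketch does not index by the original key.
            self._by_ufid[ufid] = (key, actions)
        else:
            self._by_key[key] = (key, actions)

    def get(self, key=None, ufid=None):
        if ufid is not None:
            return self._by_ufid.get(ufid)
        return self._by_key.get(key)


table = FlowTable()
table.add(("in_port", 1), ["output:2"], ufid=b"\x00\x01")
table.add(("in_port", 2), ["output:1"])

by_ufid = table.get(ufid=b"\x00\x01")
by_key_miss = table.get(key=("in_port", 1))   # installed with a UFID
by_key_hit = table.get(key=("in_port", 2))    # installed without one
```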

Basic rule for evolving flow keys
---------------------------------

Some care is needed to really maintain forward and backward
compatibility for applications that follow the rules listed under
"Flow key compatibility" above.

The basic rule is obvious::

    ==================================================================
    New network protocol support must only supplement existing flow
    key attributes. It must not change the meaning of already defined
    flow key attributes.
    ==================================================================

This rule does have less-obvious consequences, so it is worth working
through a few examples. Suppose, for example, that the kernel module
did not already implement VLAN parsing. Instead, it just interpreted
the 802.1Q TPID (0x8100) as the Ethertype, then stopped parsing the
packet. The flow key for any packet with an 802.1Q header would look
essentially like this, ignoring metadata::

    eth(...), eth_type(0x8100)

Naively, to add VLAN support, it makes sense to add a new "vlan" flow
key attribute to contain the VLAN tag, then continue to decode the
encapsulated headers beyond the VLAN tag using the existing field
definitions. With this change, a TCP packet in VLAN 10 would have a
flow key much like this::

    eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)

But this change would negatively affect a userspace application that
has not been updated to understand the new "vlan" flow key attribute.
The application could, following the flow compatibility rules above,
ignore the "vlan" attribute that it does not understand and therefore
assume that the flow contained IP packets. This is a bad assumption
(the flow only contains IP packets if one parses and skips over the
802.1Q header) and it could cause the application's behavior to change
across kernel versions even though it follows the compatibility rules.

The solution is to use a set of nested attributes. This is, for
example, why 802.1Q support uses nested attributes. A TCP packet in
VLAN 10 is actually expressed as::

    eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
    ip(proto=6, ...), tcp(...))

Notice how the "eth_type", "ip", and "tcp" flow key attributes are
nested inside the "encap" attribute. Thus, an application that does
not understand the "vlan" key will not see any of those attributes
and therefore will not misinterpret them. (Also, the outer eth_type
is still 0x8100, not changed to 0x0800.)
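
The difference between the naive and the nested layout can be seen by
modeling an application that predates VLAN support and skips every
attribute it does not recognize. In this sketch the attribute lists,
the ``KNOWN`` set, and the ``legacy_view`` helper are all hypothetical,
and attribute values are reduced to placeholder strings.

```python
# Attributes understood by a hypothetical application written before
# "vlan" and "encap" existed.
KNOWN = {"eth", "eth_type", "ip", "tcp"}


def legacy_view(attrs):
    """Top-level attributes that the old application recognizes."""
    return [(name, value) for name, value in attrs if name in KNOWN]


naive = [
    ("eth", "..."), ("vlan", "vid=10, pcp=0"), ("eth_type", "0x0800"),
    ("ip", "proto=6, ..."), ("tcp", "..."),
]

nested = [
    ("eth", "..."), ("eth_type", "0x8100"), ("vlan", "vid=10, pcp=0"),
    ("encap", [("eth_type", "0x0800"), ("ip", "proto=6, ..."), ("tcp", "...")]),
]

seen_naive = legacy_view(naive)     # looks like an untagged TCP/IP flow
seen_nested = legacy_view(nested)   # just eth plus eth_type(0x8100)
```

With the naive layout the old application sees what appears to be a
plain IP flow. With the nested layout it sees the same key it would have
received before VLAN support was added.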

Handling malformed packets
--------------------------

Don't drop packets in the kernel for malformed protocol headers, bad
checksums, etc. This would prevent userspace from implementing a
simple Ethernet switch that forwards every packet.

Instead, in such a case, include an attribute with "empty" content.
It doesn't matter if the empty content could be valid protocol values,
as long as those values are rarely seen in practice, because userspace
can always have all packets with those values sent to userspace and
handle them individually.

For example, consider a packet that contains an IP header that
indicates protocol 6 for TCP, but which is truncated just after the IP
header, so that the TCP header is missing. The flow key for this
packet would include a tcp attribute with all-zero src and dst, like
this::

    eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)

As another example, consider a packet with an Ethernet type of 0x8100,
indicating that a VLAN TCI should follow, but which is truncated just
after the Ethernet type. The flow key for this packet would include
an all-zero-bits vlan and an empty encap attribute, like this::

    eth(...), eth_type(0x8100), vlan(0), encap()

Unlike a TCP packet with source and destination ports 0, an
all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
attribute expressly to allow this situation to be distinguished.
Thus, the flow key in this second example unambiguously indicates a
missing or malformed VLAN TCI.
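
The TCP example can be sketched as a small parsing helper that never
rejects a packet. This is not the kernel's parser: ``tcp_attr`` is a
hypothetical name, and treating anything shorter than the 20-byte
minimum TCP header as missing is an assumption made for this
illustration.

```python
import struct

TCP_MIN_HEADER_LEN = 20   # minimum TCP header size in bytes (assumed threshold)


def tcp_attr(l4_bytes):
    """Build a tcp attribute from the bytes that follow the IP header."""
    if len(l4_bytes) < TCP_MIN_HEADER_LEN:
        # Truncated or missing header: emit "empty" content, do not drop.
        return ("tcp", {"src": 0, "dst": 0})
    # Source and destination ports are the first two 16-bit big-endian fields.
    src, dst = struct.unpack("!HH", l4_bytes[:4])
    return ("tcp", {"src": src, "dst": dst})


full_header = struct.pack("!HH", 49163, 80) + bytes(16)
parsed = tcp_attr(full_header)
truncated = tcp_attr(b"")
```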

Other rules
-----------

The other rules for flow keys are much less subtle:

- Duplicate attributes are not allowed at a given nesting level.

- Ordering of attributes is not significant.

- When the kernel sends a given flow key to userspace, it always
  composes it the same way. This allows userspace to hash and
  compare entire flow keys that it may not be able to fully
  interpret.
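
The first rule applies per nesting level, so the same attribute may
appear once at the top level and once inside a nested attribute. A
minimal check might look like the sketch below, which assumes a flow key
modeled as a list of ``(name, value)`` pairs with a ``list`` value for
nested attributes; ``check_no_duplicates`` is a hypothetical helper, not
a kernel or library function.

```python
def check_no_duplicates(attrs):
    """Raise ValueError if a name repeats within one nesting level."""
    seen = set()
    for name, value in attrs:
        if name in seen:
            raise ValueError(f"duplicate attribute: {name}")
        seen.add(name)
        if isinstance(value, list):
            # Nested attributes form their own level with a fresh namespace.
            check_no_duplicates(value)


# eth_type appears at two different levels, which is allowed.
valid = [("eth", "..."), ("eth_type", "0x8100"), ("vlan", "vid=10, pcp=0"),
         ("encap", [("eth_type", "0x0800"), ("ip", "...")])]
check_no_duplicates(valid)

# eth_type appears twice at the same level, which is not.
invalid = [("eth", "..."), ("eth_type", "0x8100"), ("eth_type", "0x0800")]
try:
    check_no_duplicates(invalid)
    rejected = False
except ValueError:
    rejected = True
```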