.. SPDX-License-Identifier: GPL-2.0

=========================
Resilient Next-hop Groups
=========================

Resilient groups are a type of next-hop group that is aimed at minimizing
disruption in flow routing across changes to the group composition and
weights of constituent next hops.

The idea behind resilient hashing groups is best explained in contrast to
the legacy multipath next-hop group, which uses the hash-threshold
algorithm, described in RFC 2992.

To select a next hop, the hash-threshold algorithm first assigns a range of
hashes to each next hop in the group, and then selects the next hop by
comparing the SKB hash with the individual ranges. When a next hop is
removed from the group, the ranges are recomputed, which leads to
reassignment of parts of hash space from one next hop to another. RFC 2992
illustrates it thus::

    +-------+-------+-------+-------+-------+
    |   1   |   2   |   3   |   4   |   5   |
    +-------+-+-----+---+---+-----+-+-------+
    |    1    |    2    |    4    |    5    |
    +---------+---------+---------+---------+

     Before and after deletion of next hop 3
     under the hash-threshold algorithm.

Note how next hop 2 gave up part of the hash space in favor of next hop 1,
and 4 in favor of 5. While there will usually be some overlap between the
previous and the new distribution, some traffic flows change the next hop
that they resolve to.

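
The reassignment can be reproduced with a small model. The sketch below is
illustrative only: it assumes a hash space of 1000 values split into equal
contiguous ranges, and ``threshold_select`` is a hypothetical helper, not a
kernel function.

.. code-block:: python

    def threshold_select(hops, skb_hash, space=1000):
        """Map a hash in [0, space) to a next hop using equal contiguous ranges."""
        return hops[skb_hash * len(hops) // space]


    before = [1, 2, 3, 4, 5]
    after = [1, 2, 4, 5]          # next hop 3 removed

    # Hashes whose next hop survived the removal but changed anyway.
    moved = [
        h for h in range(1000)
        if threshold_select(before, h) != 3
        and threshold_select(before, h) != threshold_select(after, h)
    ]
    print(len(moved), "of 800 surviving hashes changed next hop")

In this model, 100 of the 800 hashes that never pointed at next hop 3 still
end up on a different next hop: half of them move from 2 to 1, and the other
half from 4 to 5.
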
If a multipath group is used for load-balancing between multiple servers,
this hash space reassignment means that packets from a single flow can
suddenly arrive at a server that does not expect them. This can result in
TCP connections being reset.

If a multipath group is used for load-balancing among available paths to
the same server, the issue is that different latencies and reordering along
the way cause the packets to arrive in the wrong order, resulting in
degraded application performance.

To mitigate the above-mentioned flow redirection, resilient next-hop groups
insert another layer of indirection between the hash space and its
constituent next hops: a hash table. The selection algorithm uses the SKB
hash to choose a hash table bucket, then reads the next hop that this bucket
contains, and forwards traffic there.

This indirection brings an important feature. In the hash-threshold
algorithm, the range of hashes associated with a next hop must be
continuous. With a hash table, mapping between the hash table buckets and
the individual next hops is arbitrary. Therefore, when a next hop is
deleted, the buckets that held it are simply reassigned to other next hops::

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                     v v v v
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Before and after deletion of next hop 3
    under the resilient hashing algorithm.

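
The same removal can be modelled with a bucket table. This is a minimal
sketch: ``remove_next_hop`` is a hypothetical helper, and handing the freed
buckets to the remaining next hops in round-robin order is an assumption made
to match the diagram, not a statement about the kernel's policy.

.. code-block:: python

    from itertools import cycle


    def remove_next_hop(table, removed, remaining):
        """Return a new bucket table in which only buckets of `removed` change."""
        replacement = cycle(remaining)
        return [next(replacement) if nh == removed else nh for nh in table]


    table = [nh for nh in (1, 2, 3, 4, 5) for _ in range(4)]   # 20 buckets
    new_table = remove_next_hop(table, removed=3, remaining=[1, 2, 4, 5])

    changed = [i for i, (old, new) in enumerate(zip(table, new_table)) if old != new]
    print(new_table)
    print("changed buckets:", changed)

Only buckets 8 through 11, which used to hold next hop 3, are rewritten, so
flows hashed to any other bucket keep their next hop.
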
When weights of next hops in a group are altered, it may be possible to
choose a subset of buckets that are currently not used for forwarding
traffic, and use those to satisfy the new next-hop distribution demands,
keeping the "busy" buckets intact. This way, established flows ideally
keep being forwarded to the same endpoints through the same paths as before
the next-hop group change.

Algorithm
---------

In a nutshell, the algorithm works as follows. Each next hop deserves a
certain number of buckets, according to its weight and the number of
buckets in the hash table. In accordance with the source code, we will call
this number a "wants count" of a next hop. In case of an event that might
cause bucket allocation change, the wants counts for individual next hops
are updated.

Next hops that have fewer buckets than their wants count are called
"underweight". Those that have more are "overweight". If there are no
overweight (and therefore no underweight) next hops in the group, it is
said to be "balanced".

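
As an illustration, wants counts can be derived from weights as sketched
below. The function names are hypothetical, and rounding the cumulative share
to the nearest bucket boundary is an assumption of this sketch; the kernel's
exact rounding may differ.

.. code-block:: python

    def wants_counts(weights, num_buckets):
        """Split num_buckets among next hops in proportion to their weights."""
        total = sum(weights.values())
        counts, cumulative, prev_bound = {}, 0, 0
        for nh, weight in weights.items():
            cumulative += weight
            # Round the cumulative share to the nearest bucket boundary.
            bound = (num_buckets * cumulative + total // 2) // total
            counts[nh] = bound - prev_bound
            prev_bound = bound
        return counts


    def classify(have, wants):
        """Compare each next hop's current bucket count with its wants count."""
        labels = {}
        for nh, want in wants.items():
            held = have.get(nh, 0)
            if held < want:
                labels[nh] = "underweight"
            elif held > want:
                labels[nh] = "overweight"
            else:
                labels[nh] = "at target"
        return labels


    wants = wants_counts({1: 3, 2: 1}, num_buckets=8)
    print(wants)                               # wants count per next hop
    print(classify({1: 4, 2: 4}, wants))       # state before any migration

With weights 3 and 1 and a table of 8 buckets, next hop 1 wants 6 buckets and
next hop 2 wants 2. A table that still holds 4 buckets for each is therefore
out of balance.
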
Each bucket maintains a last-used timer. Every time a packet is forwarded
through a bucket, this timer is updated to the current jiffies value. One
attribute of a resilient group is then the "idle timer", which is the
amount of time that a bucket must not be hit by traffic in order for it to
be considered "idle". Buckets that are not idle are busy.

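
The idle check amounts to a timestamp comparison. The sketch below uses plain
seconds instead of jiffies, and the ``Bucket`` class and its methods are
hypothetical names chosen for illustration.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Bucket:
        nh_id: int
        used_time: float = 0.0      # timestamp of the last forwarded packet

        def hit(self, now):
            """Record that a packet was forwarded through this bucket."""
            self.used_time = now

        def is_idle(self, now, idle_timer):
            """A bucket is idle once it has seen no traffic for idle_timer."""
            return now - self.used_time >= idle_timer


    bucket = Bucket(nh_id=1)
    bucket.hit(now=100.0)
    print(bucket.is_idle(now=130.0, idle_timer=60.0))   # 30 s since the last hit
    print(bucket.is_idle(now=160.0, idle_timer=60.0))   # 60 s since the last hit
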
After assigning wants counts to next hops, an "upkeep" algorithm runs. For
buckets:

1) that have no assigned next hop, or
2) whose next hop has been removed, or
3) that are idle and their next hop is overweight,

upkeep changes the next hop that the bucket references to one of the
underweight next hops. If, after considering all buckets in this manner,
there are still underweight next hops, another upkeep run is scheduled for a
future time.

There may not be enough "idle" buckets to satisfy the updated wants counts
of all next hops. Another attribute of a resilient group is the "unbalanced
timer". This timer can be set to 0, in which case the table will stay out
of balance until idle buckets do appear, possibly never. If set to a
non-zero value, the value represents the period of time that the table is
permitted to stay out of balance.

With this in mind, we update the above list of conditions with one more
item. Thus buckets:

4) whose next hop is overweight, and the amount of time that the table has
   been out of balance exceeds the unbalanced timer, if that is non-zero,

\... are migrated as well.

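
The four conditions can be combined into a single pass over the table. This
is a simplified model, not the kernel implementation: the data layout, the
``upkeep`` signature and the choice of the first underweight next hop as the
migration target are all assumptions of the sketch.

.. code-block:: python

    def upkeep(buckets, wants, now, idle_timer, unbalanced_timer, unbalanced_since):
        """Run one upkeep pass; return True if underweight next hops remain.

        buckets: list of {"nh": next-hop ID or None, "used": last-hit timestamp}
        wants:   {next-hop ID: wants count} for the current group members
        unbalanced_since: time the table went out of balance, or None
        """
        have = {nh: 0 for nh in wants}
        for bucket in buckets:
            if bucket["nh"] in have:
                have[bucket["nh"]] += 1

        # Condition 4 applies only when the unbalanced timer is non-zero.
        timed_out = (unbalanced_timer > 0 and unbalanced_since is not None
                     and now - unbalanced_since > unbalanced_timer)

        for bucket in buckets:
            underweight = [nh for nh in wants if have[nh] < wants[nh]]
            if not underweight:
                break
            nh = bucket["nh"]
            unassigned = nh is None                          # condition 1
            removed = nh is not None and nh not in wants     # condition 2
            overweight = nh in wants and have[nh] > wants[nh]
            idle = now - bucket["used"] >= idle_timer
            # Conditions 3 and 4 both require an overweight next hop.
            if unassigned or removed or (overweight and (idle or timed_out)):
                if nh in have:
                    have[nh] -= 1
                bucket["nh"] = underweight[0]
                have[underweight[0]] += 1

        return any(have[nh] < wants[nh] for nh in wants)


    # Next hops 1 and 2 hold four buckets each; only bucket 7 is idle.
    buckets = [{"nh": 1, "used": 999} for _ in range(4)]
    buckets += [{"nh": 2, "used": 999} for _ in range(3)] + [{"nh": 2, "used": 900}]
    wants = {1: 6, 2: 2}

    first = upkeep(buckets, wants, now=1000, idle_timer=60,
                   unbalanced_timer=0, unbalanced_since=None)
    after_first = [b["nh"] for b in buckets]

    second = upkeep(buckets, wants, now=1000, idle_timer=60,
                    unbalanced_timer=300, unbalanced_since=600)
    after_second = [b["nh"] for b in buckets]
    print(first, after_first)
    print(second, after_second)

The first pass migrates only the idle bucket and reports that next hop 1 is
still underweight. In the second pass the table has been out of balance for
longer than the unbalanced timer, so a busy bucket of the overweight next hop
is migrated as well and the table becomes balanced.
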
Offloading & Driver Feedback
----------------------------

When offloading resilient groups, the algorithm that distributes buckets
among next hops is still the one in SW. Drivers are notified of updates to
next hop groups in the following three ways:

- Full group notification with the type
  ``NH_NOTIFIER_INFO_TYPE_RES_TABLE``. This is used just after the group is
  created and buckets populated for the first time.

- Single-bucket notifications of the type
  ``NH_NOTIFIER_INFO_TYPE_RES_BUCKET``, which is used for notifications of
  individual migrations within an already-established group.

- Pre-replace notification, ``NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE``. This
  is sent before the group is replaced, and is a way for the driver to veto
  the group before committing anything to the HW.

Some single-bucket notifications are forced, as indicated by the "force"
flag in the notification. Those are used for the cases where e.g. the next
hop associated with the bucket was removed, and the bucket really must be
migrated.

Non-forced notifications can be overridden by the driver by returning an
error code. The use case for this is that the driver notifies the HW that a
bucket should be migrated, but the HW discovers that the bucket has in fact
been hit by traffic.

A second way for the HW to report that a bucket is busy is through the
``nexthop_res_grp_activity_update()`` API. The buckets identified this way
as busy are treated as if traffic hit them.

Offloaded buckets should be flagged as either "offload" or "trap". This is
done through the ``nexthop_bucket_set_hw_flags()`` API.

Netlink UAPI
------------

Resilient Group Replacement
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Resilient groups are configured using the ``RTM_NEWNEXTHOP`` message in the
same manner as other multipath groups. The following changes apply to the
attributes passed in the netlink message:

=================== =========================================================
``NHA_GROUP_TYPE``  Should be ``NEXTHOP_GRP_TYPE_RES`` for resilient group.
``NHA_RES_GROUP``   A nest that contains attributes specific to resilient
                    groups.
=================== =========================================================

``NHA_RES_GROUP`` payload:

=================================== =========================================
``NHA_RES_GROUP_BUCKETS``           Number of buckets in the hash table.
``NHA_RES_GROUP_IDLE_TIMER``        Idle timer in units of clock_t.
``NHA_RES_GROUP_UNBALANCED_TIMER``  Unbalanced timer in units of clock_t.
=================================== =========================================

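
Because the timers are carried in units of clock_t, a value given in seconds
has to be scaled before it is placed in the attribute. The helper below is a
hypothetical sketch that assumes a ``USER_HZ`` of 100 ticks per second, a
common value on Linux that should be checked on the target system.

.. code-block:: python

    def seconds_to_clock_t(seconds, user_hz=100):
        """Convert seconds to clock_t ticks; user_hz=100 is an assumption."""
        return int(round(seconds * user_hz))


    idle_timer = seconds_to_clock_t(60)           # 60 s idle timer
    unbalanced_timer = seconds_to_clock_t(300)    # 300 s unbalanced timer
    print(idle_timer, unbalanced_timer)
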
Next Hop Get
^^^^^^^^^^^^

Requests to get resilient next-hop groups use the ``RTM_GETNEXTHOP``
message in exactly the same way as other next hop get requests. The
response attributes match the replacement attributes cited above, except
that the ``NHA_RES_GROUP`` payload will also include the following attribute:

=================================== =========================================
``NHA_RES_GROUP_UNBALANCED_TIME``   How long the resilient group has been out
                                    of balance, in units of clock_t.
=================================== =========================================

Bucket Get
^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` without the ``NLM_F_DUMP`` flag is
used to request a single bucket. The attributes recognized at get requests
are:

=================== =========================================================
``NHA_ID``          ID of the next-hop group that the bucket belongs to.
``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
=================== =========================================================

``NHA_RES_BUCKET`` payload:

======================== ====================================================
``NHA_RES_BUCKET_INDEX`` Index of bucket in the resilient table.
======================== ====================================================

Bucket Dumps
^^^^^^^^^^^^

The message ``RTM_GETNEXTHOPBUCKET`` with the ``NLM_F_DUMP`` flag is used
to request a dump of matching buckets. The attributes recognized at dump
requests are:

=================== =========================================================
``NHA_ID``          If specified, limits the dump to just the next-hop group
                    with this ID.
``NHA_OIF``         If specified, limits the dump to buckets that contain
                    next hops that use the device with this ifindex.
``NHA_MASTER``      If specified, limits the dump to buckets that contain
                    next hops that use a device in the VRF with this ifindex.
``NHA_RES_BUCKET``  A nest that contains attributes specific to bucket.
=================== =========================================================

``NHA_RES_BUCKET`` payload:

======================== ====================================================
``NHA_RES_BUCKET_NH_ID`` If specified, limits the dump to just the buckets
                         that contain the next hop with this ID.
======================== ====================================================

Usage
-----

To illustrate the usage, consider the following commands::

    # ip nexthop add id 1 via 192.0.2.2 dev eth0
    # ip nexthop add id 2 via 192.0.2.3 dev eth0
    # ip nexthop add id 10 group 1/2 type resilient \
        buckets 8 idle_timer 60 unbalanced_timer 300

The last command creates a resilient next-hop group. It will have 8 buckets
(which is an unusually low number, used here for demonstration purposes
only), each bucket will be considered idle when no traffic hits it for at
least 60 seconds, and if the table remains out of balance for 300 seconds,
it will be forcefully brought into balance.

Changing next-hop weights leads to a change in bucket allocation::

    # ip nexthop replace id 10 group 1,3/2 type resilient

This can be confirmed by looking at individual buckets::

    # ip nexthop bucket show id 10
    id 10 index 0 idle_time 5.59 nhid 1
    id 10 index 1 idle_time 5.59 nhid 1
    id 10 index 2 idle_time 8.74 nhid 2
    id 10 index 3 idle_time 8.74 nhid 2
    id 10 index 4 idle_time 8.74 nhid 1
    id 10 index 5 idle_time 8.74 nhid 1
    id 10 index 6 idle_time 8.74 nhid 1
    id 10 index 7 idle_time 8.74 nhid 1

Note the two buckets that have a shorter idle time. Those are the ones that
were migrated after the next-hop replace command to satisfy the new demand
that next hop 1 be given 6 buckets instead of 4.

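
The listing can also be checked mechanically. The sketch below assumes that
every output line consists of ``key value`` pairs exactly as shown above;
``parse_buckets`` is a hypothetical helper, and real output may carry
additional fields.

.. code-block:: python

    from collections import Counter

    OUTPUT = """\
    id 10 index 0 idle_time 5.59 nhid 1
    id 10 index 1 idle_time 5.59 nhid 1
    id 10 index 2 idle_time 8.74 nhid 2
    id 10 index 3 idle_time 8.74 nhid 2
    id 10 index 4 idle_time 8.74 nhid 1
    id 10 index 5 idle_time 8.74 nhid 1
    id 10 index 6 idle_time 8.74 nhid 1
    id 10 index 7 idle_time 8.74 nhid 1
    """


    def parse_buckets(text):
        """Turn each `key value` line of the bucket listing into a dict."""
        rows = []
        for line in text.splitlines():
            tokens = line.split()
            if tokens:
                rows.append(dict(zip(tokens[::2], tokens[1::2])))
        return rows


    rows = parse_buckets(OUTPUT)
    per_nh = Counter(row["nhid"] for row in rows)
    shortest = min(float(row["idle_time"]) for row in rows)
    migrated = [int(row["index"]) for row in rows
                if float(row["idle_time"]) == shortest]
    print(dict(per_nh))    # buckets held by each next hop
    print(migrated)        # buckets with the shortest idle time

For the listing above this counts 6 buckets for next hop 1 and 2 for next
hop 2, and singles out buckets 0 and 1 as the ones with the shortest idle
time.
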
Netdevsim
---------

The netdevsim driver implements a mock offload of resilient groups, and
exposes a debugfs interface that allows marking individual buckets as busy.
For example, the following will mark bucket 23 in next-hop group 10 as
active::

    # echo 10 23 > /sys/kernel/debug/netdevsim/netdevsim10/fib/nexthop_bucket_activity

In addition, another debugfs interface can be used to configure that the
next attempt to migrate a bucket should fail::

    # echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace

Besides serving as an example, the interfaces that netdevsim exposes are
useful in automated testing, and
``tools/testing/selftests/drivers/net/netdevsim/nexthop.sh`` makes use of
them to test the algorithm.