.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)

==================
Kernel TLS offload
==================

Kernel TLS operation
====================

The Linux kernel provides TLS connection offload infrastructure. Once a TCP
connection is in ``ESTABLISHED`` state user space can enable the TLS Upper
Layer Protocol (ULP) and install the cryptographic connection state.
For details regarding the user-facing interface refer to the TLS
documentation in :ref:`Documentation/networking/tls.rst <kernel_tls>`.

``ktls`` can operate in three modes:

 * Software crypto mode (``TLS_SW``) - CPU handles the cryptography.
   In most basic cases only crypto operations synchronous with the CPU
   can be used, but depending on calling context CPU may utilize
   asynchronous crypto accelerators. The use of accelerators introduces extra
   latency on socket reads (decryption only starts when a read syscall
   is made) and additional I/O load on the system.
 * Packet-based NIC offload mode (``TLS_HW``) - the NIC handles crypto
   on a packet by packet basis, provided the packets arrive in order.
   This mode integrates best with the kernel stack and is described in detail
   in the remaining part of this document
   (``ethtool`` flags ``tls-hw-tx-offload`` and ``tls-hw-rx-offload``).
 * Full TCP NIC offload mode (``TLS_HW_RECORD``) - mode of operation where
   NIC driver and firmware replace the kernel networking stack
   with their own TCP handling. It is not usable in production environments
   that rely on the Linux networking stack, for example for firewalling,
   QoS or packet scheduling (``ethtool`` flag ``tls-hw-record``).

The operation mode is selected automatically based on device configuration;
offload opt-in or opt-out on per-connection basis is not currently supported.

TX
--

At a high level user write requests are turned into a scatter list, the TLS ULP
intercepts them, inserts record framing, performs encryption (in ``TLS_SW``
mode) and then hands the modified scatter list to the TCP layer. From this
point on the TCP stack proceeds as normal.

In ``TLS_HW`` mode the encryption is not performed in the TLS ULP.
Instead packets reach a device driver, the driver will mark the packets
for crypto offload based on the socket the packet is attached to,
and send them to the device for encryption and transmission.

RX
--

On the receive side, if the device handled decryption and authentication
successfully, the driver will set the decrypted bit in the associated
:c:type:`struct sk_buff <sk_buff>`. The packets reach the TCP stack and
are handled normally. ``ktls`` is informed when data is queued to the socket
and the ``strparser`` mechanism is used to delineate the records. Upon read
request, records are retrieved from the socket and passed to decryption routine.
If device decrypted all the segments of the record the decryption is skipped,
otherwise software path handles decryption.

.. kernel-figure::  tls-offload-layers.svg
   :alt: TLS offload layers
   :align: center
   :figwidth: 28em

   Layers of Kernel TLS stack

Device configuration
====================

During driver initialization the device sets the ``NETIF_F_HW_TLS_RX`` and
``NETIF_F_HW_TLS_TX`` features and installs its
:c:type:`struct tlsdev_ops <tlsdev_ops>`
pointer in the :c:member:`tlsdev_ops` member of the
:c:type:`struct net_device <net_device>`.

When TLS cryptographic connection state is installed on a ``ktls`` socket
(note that it is done twice, once for RX and once for TX direction,
and the two are completely independent), the kernel checks if the underlying
network device is offload-capable and attempts the offload. In case offload
fails the connection is handled entirely in software using the same mechanism
as if the offload was never tried.

Offload request is performed via the :c:member:`tls_dev_add` callback of
:c:type:`struct tlsdev_ops <tlsdev_ops>`:

.. code-block:: c

  int (*tls_dev_add)(struct net_device *netdev, struct sock *sk,
                     enum tls_offload_ctx_dir direction,
                     struct tls_crypto_info *crypto_info,
                     u32 start_offload_tcp_sn);

``direction`` indicates whether the cryptographic information is for
the received or transmitted packets. Driver uses the ``sk`` parameter
to retrieve the connection 5-tuple and socket family (IPv4 vs IPv6).
Cryptographic information in ``crypto_info`` includes the key, IV and salt
as well as the TLS record sequence number. ``start_offload_tcp_sn`` indicates
which TCP sequence number corresponds to the beginning of the record with
sequence number from ``crypto_info``. The driver can add its state
at the end of kernel structures (see :c:member:`driver_state` members
in ``include/net/tls.h``) to avoid additional allocations and pointer
dereferences.

TX
--

After TX state is installed, the stack guarantees that the first segment
of the stream will start exactly at the ``start_offload_tcp_sn`` sequence
number, simplifying TCP sequence number matching.

TX offload being fully initialized does not imply that all segments passing
through the driver and which belong to the offloaded socket will be after
the expected sequence number and will have kernel record information.
In particular, already encrypted data may have been queued to the socket
before installing the connection state in the kernel.

RX
--

In the RX direction, the local networking stack has little control over
segmentation, so the initial record's TCP sequence number may be anywhere
inside the segment.

Normal operation
================

At the minimum the device maintains the following state for each connection, in
each direction:

 * crypto secrets (key, iv, salt)
 * crypto processing state (partial blocks, partial authentication tag, etc.)
 * record metadata (sequence number, processing offset and length)
 * expected TCP sequence number

There are no guarantees on record length or record segmentation. In particular
segments may start at any point of a record and contain any number of records.
Assuming segments are received in order, the device should be able to perform
crypto operations and authentication regardless of segmentation. For this
to be possible, the device has to keep a small amount of segment-to-segment
state. This includes at least:

 * partial headers (if a segment carried only a part of the TLS header)
 * partial data block
 * partial authentication tag (all data had been seen but part of the
   authentication tag has to be written or read from the subsequent segment)

Record reassembly is not necessary for TLS offload. If the packets arrive
in order the device should be able to handle them separately and make
forward progress.

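
The segment-to-segment state described above can be illustrated with a minimal
user-space sketch. It is not driver or device code: ``struct rec_state`` and
``rec_feed()`` are hypothetical names, and crypto processing is left out so
that only record delineation across arbitrary segment boundaries is shown.
It assumes the usual 5-byte TLS record header (type, version, length):

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN 5 /* type (1) + version (2) + length (2) */

  /* Hypothetical per-direction state carried from segment to segment. */
  struct rec_state {
      uint8_t  hdr[TLS_HDR_LEN]; /* partially received record header */
      size_t   hdr_have;         /* header bytes collected so far */
      size_t   body_left;        /* body bytes left in the current record */
      uint64_t records_done;     /* records fully traversed so far */
  };

  /* Consume the payload of one in-order TCP segment. */
  static void rec_feed(struct rec_state *st, const uint8_t *data, size_t len)
  {
      size_t off = 0, n;

      assert(st != NULL && (data != NULL || len == 0));

      while (off < len) {
          if (st->body_left == 0) {
              /* Still collecting the header, possibly across segments. */
              st->hdr[st->hdr_have++] = data[off++];
              if (st->hdr_have < TLS_HDR_LEN)
                  continue;
              st->body_left = ((size_t)st->hdr[3] << 8) | st->hdr[4];
              if (st->body_left == 0) {
                  /* Empty record: nothing to skip. */
                  st->hdr_have = 0;
                  st->records_done++;
              }
              continue;
          }
          /* Skip (in a real device: process) the record body. */
          n = len - off < st->body_left ? len - off : st->body_left;
          off += n;
          st->body_left -= n;
          if (st->body_left == 0) {
              st->hdr_have = 0;
              st->records_done++;
          }
      }
  }

Feeding the same byte stream split at different points leaves the state
identical at the end, which is the property that makes record reassembly
unnecessary.
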
TX
--

The kernel stack performs record framing reserving space for the authentication
tag and populating all other TLS header and trailer fields.

Both the device and the driver maintain expected TCP sequence numbers
due to the possibility of retransmissions and the lack of software fallback
once the packet reaches the device.
For segments passed in order, the driver marks the packets with
a connection identifier (note that a 5-tuple lookup is insufficient to identify
packets requiring HW offload, see the :ref:`5tuple_problems` section)
and hands them to the device. The device identifies the packet as requiring
TLS handling and confirms the sequence number matches its expectation.
The device performs encryption and authentication of the record data.
It replaces the authentication tag and TCP checksum with correct values.

RX
--

Before a packet is DMAed to the host (but after NIC's embedded switching
and packet transformation functions) the device validates the Layer 4
checksum and performs a 5-tuple lookup to find any TLS connection the packet
may belong to (technically a 4-tuple
lookup is sufficient - IP addresses and TCP port numbers, as the protocol
is always TCP). If the packet is matched to a connection, the device confirms
if the TCP sequence number is the expected one and proceeds to TLS handling
(record delineation, decryption, authentication for each record in the packet).
The device leaves the record framing unmodified, the stack takes care of record
decapsulation. Device indicates successful handling of TLS offload in the
per-packet context (descriptor) passed to the host.

Upon reception of a TLS offloaded packet, the driver sets
the :c:member:`decrypted` mark in :c:type:`struct sk_buff <sk_buff>`
corresponding to the segment. Networking stack makes sure decrypted
and non-decrypted segments do not get coalesced (e.g. by GRO or socket layer)
and takes care of partial decryption.

Resync handling
===============

In presence of packet drops or network packet reordering, the device may lose
synchronization with the TLS stream, and require a resync with the kernel's
TCP stack.

Note that resync is only attempted for connections which were successfully
added to the device table and are in ``TLS_HW`` mode. For example,
if the table was full when cryptographic state was installed in the kernel,
such connection will never get offloaded. Therefore the resync request
does not carry any cryptographic connection state.

TX
--

Segments transmitted from an offloaded socket can get out of sync
in similar ways to the receive side: retransmissions and local drops
are possible, though network reorders are not. There are currently
two mechanisms for dealing with out of order segments.

Crypto state rebuilding
~~~~~~~~~~~~~~~~~~~~~~~

Whenever an out of order segment is transmitted the driver provides
the device with enough information to perform cryptographic operations.
This means most likely that the part of the record preceding the current
segment has to be passed to the device as part of the packet context,
together with its TCP sequence number and TLS record number. The device
can then initialize its crypto state, process and discard the preceding
data (to be able to insert the authentication tag) and move onto handling
the actual packet.

In this mode depending on the implementation the driver can either ask
for a continuation with the crypto state and the new sequence number
(next expected segment is the one after the out of order one), or continue
with the previous stream state - assuming that the out of order segment
was just a retransmission. The former is simpler, and does not require
retransmission detection therefore it is the recommended method until
such time it is proven inefficient.

Next record sync
~~~~~~~~~~~~~~~~

Whenever an out of order segment is detected the driver requests
that the ``ktls`` software fallback code encrypt it. If the segment's
sequence number is lower than expected the driver assumes retransmission
and doesn't change device state. If the segment is in the future, it
may imply a local drop, the driver asks the stack to sync the device
to the next record state and falls back to software.

Resync request is indicated with:

.. code-block:: c

  void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq)

Until resync is complete driver should not access its expected TCP
sequence number (as it will be updated from a different context).
Following helper should be used to test if resync is complete:

.. code-block:: c

  bool tls_offload_tx_resync_pending(struct sock *sk)

Next time ``ktls`` pushes a record it will first send its TCP sequence number
and TLS record number to the driver. Stack will also make sure that
the new record will start on a segment boundary (like it does when
the connection is initially added).

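
The decision a driver makes for each TX segment in the next record sync
scheme can be sketched in plain C as follows. This is only an illustration
under stated assumptions: ``struct tx_track``, ``enum tx_action`` and
``tx_segment()`` are hypothetical names that model the driver's bookkeeping,
and the ``resync_pending`` field stands in for the result of
``tls_offload_tx_resync_pending()``. Sequence numbers are compared with
wraparound-safe signed arithmetic:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical per-connection TX tracking state kept by a driver. */
  struct tx_track {
      uint32_t expected_seq;   /* next TCP sequence number the device expects */
      bool     resync_pending; /* a resync was requested and not yet completed */
  };

  enum tx_action {
      TX_OFFLOAD,             /* in order: hand to the device for encryption */
      TX_FALLBACK,            /* encrypt in software, leave device state alone */
      TX_FALLBACK_AND_RESYNC, /* encrypt in software and request a resync */
  };

  static enum tx_action tx_segment(struct tx_track *t, uint32_t seq, uint32_t len)
  {
      int32_t delta;

      assert(t != NULL);

      /* While a resync is pending the expected sequence must not be used. */
      if (t->resync_pending)
          return TX_FALLBACK;

      /* Signed difference handles 32-bit sequence number wraparound. */
      delta = (int32_t)(seq - t->expected_seq);
      if (delta == 0) {
          t->expected_seq += len;
          return TX_OFFLOAD;
      }
      if (delta > 0) {
          /* Segment from the future: likely a local drop. */
          t->resync_pending = true;
          return TX_FALLBACK_AND_RESYNC;
      }
      /* Lower than expected: assume a retransmission. */
      return TX_FALLBACK;
  }

In a real driver the ``TX_FALLBACK_AND_RESYNC`` case would correspond to
calling ``tls_offload_tx_resync_request()`` before handing the segment to
the software fallback.
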
RX
--

A small amount of RX reorder events may not require a full resynchronization.
In particular the device should not lose synchronization
when record boundary can be recovered:

.. kernel-figure::  tls-offload-reorder-good.svg
   :alt: reorder of non-header segment
   :align: center

   Reorder of non-header segment

Green segments are successfully decrypted, blue ones are passed
as received on wire, red stripes mark start of new records.

In above case segment 1 is received and decrypted successfully.
Segment 2 was dropped so 3 arrives out of order. The device knows
the next record starts inside 3, based on record length in segment 1.
Segment 3 is passed untouched, because due to lack of data from segment 2
the remainder of the previous record inside segment 3 cannot be handled.
The device can, however, collect the authentication algorithm's state
and partial block from the new record in segment 3 and when 4 and 5
arrive continue decryption. Finally when 2 arrives it's completely outside
of expected window of the device so it's passed as is without special
handling. ``ktls`` software fallback handles the decryption of record
spanning segments 1, 2 and 3. The device did not get out of sync,
even though two segments did not get decrypted.

Kernel synchronization may be necessary if the lost segment contained
a record header and arrived after the next record header has already passed:

.. kernel-figure::  tls-offload-reorder-bad.svg
   :alt: reorder of header segment
   :align: center

   Reorder of segment with a TLS header

In this example segment 2 gets dropped, and it contains a record header.
Device can only detect that segment 4 also contains a TLS header
if it knows the length of the previous record from segment 2. In this case
the device will lose synchronization with the stream.

Stream scan resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When the device gets out of sync and the stream reaches TCP sequence
numbers more than a max size record past the expected TCP sequence number,
the device starts scanning for a known header pattern. For example
for TLS 1.2 and TLS 1.3 subsequent bytes of value ``0x03 0x03`` occur
in the SSL/TLS version field of the header. Once pattern is matched
the device continues attempting parsing headers at expected locations
(based on the length fields at guessed locations).
Whenever the expected location does not contain a valid header the scan
is restarted.

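
The scan can be sketched as below. This is a simplified illustration that
works on a contiguous buffer rather than on a live stream of segments, and
``tls_hdr_plausible()`` and ``tls_scan()`` are hypothetical helpers, not
kernel or device interfaces. The plausibility check assumes record content
types 20 to 23 and uses the TLS 1.2 ciphertext limit of 2^14 + 2048 bytes
as the upper bound on the length field:

.. code-block:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TLS_HDR_LEN    5              /* type, version (2), length (2) */
  #define TLS_MAX_CT_LEN (16384 + 2048) /* TLS 1.2 ciphertext upper bound */

  /* Length field of the 5-byte record header at p. */
  static size_t tls_hdr_len(const uint8_t *p)
  {
      return ((size_t)p[3] << 8) | p[4];
  }

  /* Does the 5-byte sequence at p look like a TLS 1.2/1.3 record header? */
  static int tls_hdr_plausible(const uint8_t *p)
  {
      size_t len = tls_hdr_len(p);

      return p[0] >= 20 && p[0] <= 23 &&   /* assumed content type range */
             p[1] == 0x03 && p[2] == 0x03 && /* version pattern from the text */
             len > 0 && len <= TLS_MAX_CT_LEN;
  }

  /*
   * Return the first offset at which a plausible header is followed by
   * another plausible header at the location implied by its length field,
   * or -1 if no such offset exists in the buffer.
   */
  static long tls_scan(const uint8_t *buf, size_t len)
  {
      size_t off, next;

      assert(buf != NULL || len == 0);

      for (off = 0; off + TLS_HDR_LEN <= len; off++) {
          if (!tls_hdr_plausible(buf + off))
              continue;
          next = off + TLS_HDR_LEN + tls_hdr_len(buf + off);
          if (next + TLS_HDR_LEN <= len && tls_hdr_plausible(buf + next))
              return (long)off;
          /* Expected location did not hold a header: keep scanning. */
      }
      return -1;
  }

Requiring a second header at the predicted offset is what makes a false
match on random ciphertext bytes unlikely; a device may chain more than two
headers before asking the kernel for confirmation.
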
When the header is matched the device sends a confirmation request
to the kernel, asking if the guessed location is correct (if a TLS record
really starts there), and which record sequence number the given header had.

The asynchronous resync process is coordinated on the kernel side using
``struct tls_offload_resync_async``, which tracks and manages the resync
request.

Helper functions to manage ``struct tls_offload_resync_async``:

``tls_offload_rx_resync_async_request_start()``
  Initializes an asynchronous resync attempt by specifying the sequence range to
  monitor and resetting internal state in the struct.

``tls_offload_rx_resync_async_request_end()``
  Retains the device's guessed TCP sequence number for comparison with current or
  future logged ones. It also clears the ``RESYNC_REQ_ASYNC`` flag from the
  resync request, indicating that the device has submitted its guessed sequence
  number.

``tls_offload_rx_resync_async_request_cancel()``
  Cancels any in-progress resync attempt, clearing the request state.

When the kernel processes an RX segment that begins a new TLS record, it
examines the current status of the asynchronous resynchronization request.

If the device is still waiting to provide its guessed TCP sequence number
(the async state), the kernel records the sequence number of this segment so
that it can later be compared once the device's guess becomes available.

If the device has already submitted its guessed sequence number (the non-async
state), the kernel now tries to match that guess against the sequence numbers of
all TLS record headers that have been logged since the resync request
started.

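
The two states can be modelled with a small stand-alone sketch. It does not
reproduce the layout of ``struct tls_offload_resync_async`` or the kernel
helpers; ``struct resync_model``, ``resync_start()``, ``resync_end()``,
``resync_on_record()`` and the log depth are all hypothetical and only
mirror the behaviour described above:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define RESYNC_LOG_MAX 8 /* hypothetical log depth */

  struct resync_model {
      bool     waiting_for_guess;   /* "async" state: no guess from device yet */
      uint32_t guess_seq;           /* device's guessed header TCP sequence */
      uint32_t log[RESYNC_LOG_MAX]; /* header sequences seen while waiting */
      size_t   log_len;
  };

  /* Device started scanning: begin logging record header sequences. */
  static void resync_start(struct resync_model *r)
  {
      r->waiting_for_guess = true;
      r->log_len = 0;
  }

  /* Device submitted its guess: leave the async state. */
  static void resync_end(struct resync_model *r, uint32_t guess_seq)
  {
      r->guess_seq = guess_seq;
      r->waiting_for_guess = false;
  }

  /*
   * Called for every record header the stack sees at TCP sequence hdr_seq.
   * Returns true when the device's guess is confirmed.
   */
  static bool resync_on_record(struct resync_model *r, uint32_t hdr_seq)
  {
      size_t i;

      assert(r != NULL);

      if (r->waiting_for_guess) {
          /* Remember this header so a later guess can be matched to it. */
          if (r->log_len < RESYNC_LOG_MAX)
              r->log[r->log_len++] = hdr_seq;
          return false;
      }
      if (hdr_seq == r->guess_seq)
          return true;
      for (i = 0; i < r->log_len; i++)
          if (r->log[i] == r->guess_seq)
              return true;
      return false;
  }

The log exists because the device's guess may refer to a record the stack
has already processed by the time the guess arrives.
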
The kernel confirms the guessed location was correct and tells the device
the record sequence number. Meanwhile, the device has been parsing
and counting all records since the just-confirmed one; it adds the number
of records it has seen to the record number provided by the kernel.
At this point the device is in sync and can resume decryption at next
segment boundary.

In a pathological case the device may latch onto a sequence of matching
headers and never hear back from the kernel (there is no negative
confirmation from the kernel). The implementation may choose to periodically
restart scan. Given how unlikely a falsely-matching stream is, however,
periodic restart is not deemed necessary.

Special care has to be taken if the confirmation request is passed
asynchronously to the packet stream, because the record may get processed
by the kernel before the confirmation request.

Stack-driven resynchronization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The driver may also request the stack to perform resynchronization
whenever it sees the records are no longer getting decrypted.
If the connection is configured in this mode the stack automatically
schedules resynchronization after it has received two completely encrypted
records.

The stack waits for the socket to drain and informs the device about
the next expected record number and its TCP sequence number. If the
records continue to be received fully encrypted stack retries the
synchronization with an exponential back off (first after 2 encrypted
records, then after 4 records, after 8, after 16... up until every
128 records).

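
The retry schedule amounts to doubling an interval until it saturates.
A minimal sketch, with the hypothetical names ``RESYNC_IVAL_START``,
``RESYNC_IVAL_MAX`` and ``resync_next_interval()`` chosen for illustration
only:

.. code-block:: c

  #include <assert.h>

  #define RESYNC_IVAL_START 2u   /* first attempt after 2 encrypted records */
  #define RESYNC_IVAL_MAX   128u /* never wait longer than 128 records */

  /*
   * Given the interval (in fully encrypted records) used for the previous
   * attempt, return the interval to wait before the next one.
   */
  static unsigned int resync_next_interval(unsigned int cur)
  {
      unsigned int next;

      assert(cur >= RESYNC_IVAL_START && cur <= RESYNC_IVAL_MAX);

      next = cur * 2;
      return next > RESYNC_IVAL_MAX ? RESYNC_IVAL_MAX : next;
  }

Starting from ``RESYNC_IVAL_START`` this yields 4, 8, 16, 32, 64 and then
128 for every subsequent attempt.
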
Error handling
==============

TX
--

Packets may be redirected or rerouted by the stack to a different
device than the selected TLS offload device. The stack will handle
such condition using the :c:func:`sk_validate_xmit_skb` helper
(TLS offload code installs :c:func:`tls_validate_xmit_skb` at this hook).
Offload maintains information about all records until the data is
fully acknowledged, so if skbs reach the wrong device they can be handled
by software fallback.

Any device TLS offload handling error on the transmission side must result
in the packet being dropped. For example if a packet got out of order
due to a bug in the stack or the device, reached the device and can't
be encrypted such packet must be dropped.

RX
--

If the device encounters any problems with TLS offload on the receive
side it should pass the packet to the host's networking stack as it was
received on the wire.

For example authentication failure for any record in the segment should
result in passing the unmodified packet to the software fallback. This means
packets should not be modified "in place". Splitting segments to handle partial
decryption is not advised. In other words either all records in the packet
have been handled successfully and authenticated or the packet has to be passed
to the host's stack as it was on the wire (it is sufficient for the driver to
recover the original packet if the device reports a precise error).
The Linux networking stack does not provide a way of reporting per-packet
decryption and authentication errors, packets with errors must simply not
have the :c:member:`decrypted` mark set.

A packet should also not be handled by the TLS offload if it contains
incorrect checksums.

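
The all-or-nothing rule for the RX path can be expressed as a simple
predicate. The sketch below is illustrative only; ``rx_may_mark_decrypted()``
and its parameters are hypothetical and stand for whatever per-packet status
a device reports to its driver:

.. code-block:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /*
   * Decide whether a received packet may carry the decrypted mark.
   * csum_ok  - the packet's checksums were validated by the device
   * rec_ok   - per-record status, true if the record (or the part of it
   *            carried by this packet) was handled without error
   * n_rec    - number of entries in rec_ok
   */
  static bool rx_may_mark_decrypted(bool csum_ok, const bool *rec_ok,
                                    size_t n_rec)
  {
      size_t i;

      assert(rec_ok != NULL || n_rec == 0);

      /* Bad checksum or no TLS data handled: pass the packet as received. */
      if (!csum_ok || n_rec == 0)
          return false;

      /* A single failure means the whole packet goes to software. */
      for (i = 0; i < n_rec; i++)
          if (!rec_ok[i])
              return false;

      return true;
  }

When the predicate is false the driver has to deliver the packet exactly as
it appeared on the wire and leave the mark unset.
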
Performance metrics
===================

TLS offload can be characterized by the following basic metrics:

 * max connection count
 * connection installation rate
 * connection installation latency
 * total cryptographic performance

Note that each TCP connection requires a TLS session in both directions;
the performance may be reported treating each direction separately.

Max connection count
--------------------

The number of connections device can support can be exposed via
``devlink resource`` API.

Total cryptographic performance
-------------------------------

Offload performance may depend on segment and record size.

Overload of the cryptographic subsystem of the device should not have
significant performance impact on non-offloaded streams.

Statistics
==========

The following minimum set of TLS-related statistics should be reported
by the driver:

 * ``rx_tls_decrypted_packets`` - number of successfully decrypted RX packets
   which were part of a TLS stream.
 * ``rx_tls_decrypted_bytes`` - number of TLS payload bytes in RX packets
   which were successfully decrypted.
 * ``rx_tls_ctx`` - number of TLS RX HW offload contexts added to device for
   decryption.
 * ``rx_tls_del`` - number of TLS RX HW offload contexts deleted from device
   (connection has finished).
 * ``rx_tls_resync_req_pkt`` - number of received TLS packets with a resync
   request.
 * ``rx_tls_resync_req_start`` - number of times the TLS async resync request
   was started.
 * ``rx_tls_resync_req_end`` - number of times the TLS async resync request
   properly ended with providing the HW tracked TCP sequence number.
 * ``rx_tls_resync_req_skip`` - number of times the TLS async resync request
   procedure was started but not properly ended.
 * ``rx_tls_resync_res_ok`` - number of times the TLS resync response call to
   the driver was successfully handled.
 * ``rx_tls_resync_res_skip`` - number of times the TLS resync response call to
   the driver was terminated unsuccessfully.
 * ``rx_tls_err`` - number of RX packets which were part of a TLS stream
   but were not decrypted due to unexpected error in the state machine.
 * ``tx_tls_encrypted_packets`` - number of TX packets passed to the device
   for encryption of their TLS payload.
 * ``tx_tls_encrypted_bytes`` - number of TLS payload bytes in TX packets
   passed to the device for encryption.
 * ``tx_tls_ctx`` - number of TLS TX HW offload contexts added to device for
   encryption.
 * ``tx_tls_ooo`` - number of TX packets which were part of a TLS stream
   but did not arrive in the expected order.
 * ``tx_tls_skip_no_sync_data`` - number of TX packets which were part of
   a TLS stream and arrived out-of-order, but skipped the HW offload routine
   and went to the regular transmit flow as they were retransmissions of the
   connection handshake.
 * ``tx_tls_drop_no_sync_data`` - number of TX packets which were part of
   a TLS stream dropped, because they arrived out of order and associated
   record could not be found.
 * ``tx_tls_drop_bypass_req`` - number of TX packets which were part of a TLS
   stream dropped, because they contain both data that has been encrypted by
   software and data that expects hardware crypto offload.

Notable corner cases, exceptions and additional requirements
============================================================

.. _5tuple_problems:

5-tuple matching limitations
----------------------------

The device can only recognize received packets based on the 5-tuple
of the socket. Current ``ktls`` implementation will not offload sockets
routed through software interfaces such as those used for tunneling
or virtual networking. However, many packet transformations performed
by the networking stack (most notably any BPF logic) do not require
any intermediate software device, therefore a 5-tuple match may
consistently miss at the device level. In such cases the device
should still be able to perform TX offload (encryption) and should
fall back cleanly to software decryption (RX).

Out of order
------------

Introducing extra processing in NICs should not cause packets to be
transmitted or received out of order, for example pure ACK packets
should not be reordered with respect to data segments.

Ingress reorder
---------------

A device is permitted to perform packet reordering for consecutive
TCP segments (i.e. placing packets in the correct order) but any form
of additional buffering is disallowed.

Coexistence with standard networking offload features
-----------------------------------------------------

Offloaded ``ktls`` sockets should support standard TCP stack features
transparently. Enabling device TLS offload should not cause any difference
in packets as seen on the wire.

Transport layer transparency
----------------------------

For the purpose of simplifying TLS offload, the device should not modify any
packet headers.

The device should not depend on any packet headers beyond what is strictly
necessary for TLS offload.

Segment drops
-------------

Dropping packets is acceptable only in the event of catastrophic
system errors and should never be used as an error handling mechanism
in cases arising from normal operation. In other words, reliance
on TCP retransmissions to handle corner cases is not acceptable.

TLS device features
-------------------

Drivers should ignore the changes to the TLS device feature flags.
These flags will be acted upon accordingly by the core ``ktls`` code.
TLS device feature flags only control adding of new TLS connection
offloads, old connections will remain active after flags are cleared.

TLS encryption cannot be offloaded to devices without checksum calculation
offload. Hence, TLS TX device feature flag requires TX csum offload being set.
Disabling the latter implies clearing the former. Disabling TX checksum offload
should not affect old connections, and drivers should make sure checksum
calculation does not break for them.
Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user
does not want to enable RX csum offload, TLS RX device feature is disabled
as well.

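
The dependency between the TLS and checksum feature flags can be summarized
in a few lines. The bit definitions and ``tls_fix_features()`` below are
hypothetical stand-ins used for illustration; they are not the kernel's
``NETIF_F_*`` definitions nor its feature fixup code:

.. code-block:: c

  #include <assert.h>
  #include <stdint.h>

  /* Hypothetical feature bits, for illustration only. */
  #define FEAT_TX_CSUM (1u << 0)
  #define FEAT_RX_CSUM (1u << 1)
  #define FEAT_TLS_TX  (1u << 2)
  #define FEAT_TLS_RX  (1u << 3)

  static_assert((FEAT_TLS_TX & FEAT_TX_CSUM) == 0,
                "feature bits must be distinct");

  /* Clear TLS offload features whose checksum prerequisite is not enabled. */
  static uint32_t tls_fix_features(uint32_t features)
  {
      if (!(features & FEAT_TX_CSUM))
          features &= ~FEAT_TLS_TX; /* TLS TX requires TX checksum offload */
      if (!(features & FEAT_RX_CSUM))
          features &= ~FEAT_TLS_RX; /* TLS RX requires RX checksum offload */
      return features;
  }

Only the set of features offered to new connections changes; as noted above,
connections that are already offloaded remain active.
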