| 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564565566567568569570571572573574575576577578579580581582583584585586587588589 |
- =============
- BPF Iterators
- =============
- --------
- Overview
- --------
- BPF supports two separate entities collectively known as "BPF iterators": BPF
- iterator *program type* and *open-coded* BPF iterators. The former is
- a stand-alone BPF program type which, when attached and activated by user,
- will be called once for each entity (task_struct, cgroup, etc) that is being
- iterated. The latter is a set of BPF-side APIs implementing iterator
- functionality and available across multiple BPF program types. Open-coded
- iterators provide similar functionality to BPF iterator programs, but gives
- more flexibility and control to all other BPF program types. BPF iterator
- programs, on the other hand, can be used to implement anonymous or BPF
- FS-mounted special files, whose contents are generated by attached BPF iterator
- program, backed by seq_file functionality. Both are useful depending on
- specific needs.
- When adding a new BPF iterator program, it is expected that similar
- functionality will be added as open-coded iterator for maximum flexibility.
- It's also expected that iteration logic and code will be maximally shared and
- reused between two iterator API surfaces.
- ------------------------
- Open-coded BPF Iterators
- ------------------------
- Open-coded BPF iterators are implemented as tightly-coupled trios of kfuncs
- (constructor, next element fetch, destructor) and iterator-specific type
- describing on-the-stack iterator state, which is guaranteed by the BPF
- verifier to not be tampered with outside of the corresponding
- constructor/destructor/next APIs.
- Each kind of open-coded BPF iterator has its own associated
- struct bpf_iter_<type>, where <type> denotes a specific type of iterator.
- bpf_iter_<type> state needs to live on BPF program stack, so make sure it's
- small enough to fit on BPF stack. For performance reasons its best to avoid
- dynamic memory allocation for iterator state and size the state struct big
- enough to fit everything necessary. But if necessary, dynamic memory
- allocation is a way to bypass BPF stack limitations. Note, state struct size
- is part of iterator's user-visible API, so changing it will break backwards
- compatibility, so be deliberate about designing it.
- All kfuncs (constructor, next, destructor) have to be named consistently as
- bpf_iter_<type>_{new,next,destroy}(), respectively. <type> represents iterator
- type, and iterator state should be represented as a matching
- `struct bpf_iter_<type>` state type. Also, all iter kfuncs should have
- a pointer to this `struct bpf_iter_<type>` as the very first argument.
- Additionally:
- - Constructor, i.e., `bpf_iter_<type>_new()`, can have arbitrary extra
- number of arguments. Return type is not enforced either.
- - Next method, i.e., `bpf_iter_<type>_next()`, has to return a pointer
- type and should have exactly one argument: `struct bpf_iter_<type> *`
- (const/volatile/restrict and typedefs are ignored).
- - Destructor, i.e., `bpf_iter_<type>_destroy()`, should return void and
- should have exactly one argument, similar to the next method.
- - `struct bpf_iter_<type>` size is enforced to be positive and
- a multiple of 8 bytes (to fit stack slots correctly).
- Such strictness and consistency allows to build generic helpers abstracting
- important, but boilerplate, details to be able to use open-coded iterators
- effectively and ergonomically (see libbpf's bpf_for_each() macro). This is
- enforced at kfunc registration point by the kernel.
- Constructor/next/destructor implementation contract is as follows:
- - constructor, `bpf_iter_<type>_new()`, always initializes iterator state on
- the stack. If any of the input arguments are invalid, constructor should
- make sure to still initialize it such that subsequent next() calls will
- return NULL. I.e., on error, *return error and construct empty iterator*.
- Constructor kfunc is marked with KF_ITER_NEW flag.
- - next method, `bpf_iter_<type>_next()`, accepts pointer to iterator state
- and produces an element. Next method should always return a pointer. The
- contract between BPF verifier is that next method *guarantees* that it
- will eventually return NULL when elements are exhausted. Once NULL is
- returned, subsequent next calls *should keep returning NULL*. Next method
- is marked with KF_ITER_NEXT (and should also have KF_RET_NULL as
- NULL-returning kfunc, of course).
- - destructor, `bpf_iter_<type>_destroy()`, is always called once. Even if
- constructor failed or next returned nothing. Destructor frees up any
- resources and marks stack space used by `struct bpf_iter_<type>` as usable
- for something else. Destructor is marked with KF_ITER_DESTROY flag.
- Any open-coded BPF iterator implementation has to implement at least these
- three methods. It is enforced that for any given type of iterator only
- applicable constructor/destructor/next are callable. I.e., verifier ensures
- you can't pass number iterator state into, say, cgroup iterator's next method.
- From a 10,000-feet BPF verification point of view, next methods are the points
- of forking a verification state, which are conceptually similar to what
- verifier is doing when validating conditional jumps. Verifier is branching out
- `call bpf_iter_<type>_next` instruction and simulates two outcomes: NULL
- (iteration is done) and non-NULL (new element is returned). NULL is simulated
- first and is supposed to reach exit without looping. After that non-NULL case
- is validated and it either reaches exit (for trivial examples with no real
- loop), or reaches another `call bpf_iter_<type>_next` instruction with the
- state equivalent to already (partially) validated one. State equivalency at
- that point means we technically are going to be looping forever without
- "breaking out" out of established "state envelope" (i.e., subsequent
- iterations don't add any new knowledge or constraints to the verifier state,
- so running 1, 2, 10, or a million of them doesn't matter). But taking into
- account the contract stating that iterator next method *has to* return NULL
- eventually, we can conclude that loop body is safe and will eventually
- terminate. Given we validated logic outside of the loop (NULL case), and
- concluded that loop body is safe (though potentially looping many times),
- verifier can claim safety of the overall program logic.
- ------------------------
- BPF Iterators Motivation
- ------------------------
- There are a few existing ways to dump kernel data into user space. The most
- popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
- all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
- sockets in the system. However, their output format tends to be fixed, and if
- users want more information about these sockets, they have to patch the kernel,
- which often takes time to publish upstream and release. The same is true for popular
- tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
- additional information needs a kernel patch.
- To solve this problem, the `drgn
- <https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
- dig out the kernel data with no kernel change. However, the main drawback for
- drgn is performance, as it cannot do pointer tracing inside the kernel. In
- addition, drgn cannot validate a pointer value and may read invalid data if the
- pointer becomes invalid inside the kernel.
- The BPF iterator solves the above problem by providing flexibility on what data
- (e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
- data object.
- ----------------------
- How BPF Iterators Work
- ----------------------
- A BPF iterator is a type of BPF program that allows users to iterate over
- specific types of kernel objects. Unlike traditional BPF tracing programs that
- allow users to define callbacks that are invoked at particular points of
- execution in the kernel, BPF iterators allow users to define callbacks that
- should be executed for every entry in a variety of kernel data structures.
- For example, users can define a BPF iterator that iterates over every task on
- the system and dumps the total amount of CPU runtime currently used by each of
- them. Another BPF task iterator may instead dump the cgroup information for each
- task. Such flexibility is the core value of BPF iterators.
- A BPF program is always loaded into the kernel at the behest of a user space
- process. A user space process loads a BPF program by opening and initializing
- the program skeleton as required and then invoking a syscall to have the BPF
- program verified and loaded by the kernel.
- In traditional tracing programs, a program is activated by having user space
- obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
- activated, the program callback will be invoked whenever the tracepoint is
- triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
- program is obtained using ``bpf_link_create()``, and the program callback is
- invoked by issuing system calls from user space.
- Next, let us see how you can use the iterators to iterate on kernel objects and
- read data.
- ------------------------
- How to Use BPF iterators
- ------------------------
- BPF selftests are a great resource to illustrate how to use the iterators. In
- this section, we’ll walk through a BPF selftest which shows how to load and use
- a BPF iterator program. To begin, we’ll look at `bpf_iter.c
- <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
- which illustrates how to load and trigger BPF iterators on the user space side.
- Later, we’ll look at a BPF program that runs in kernel space.
- Loading a BPF iterator in the kernel from user space typically involves the
- following steps:
- * The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
- has verified and loaded the program, it returns a file descriptor (fd) to user
- space.
- * Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
- specified with the BPF program file descriptor received from the kernel.
- * Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
- ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
- * Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
- available.
- * Close the iterator fd using ``close(bpf_iter_fd)``.
- * If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
- The following are a few examples of selftest BPF iterator programs:
- * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
- * `bpf_iter_task_vmas.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vmas.c>`_
- * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
- Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
- Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
- <https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
- Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
- represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
- iterator.
- ::
- struct bpf_iter__task_file {
- union {
- struct bpf_iter_meta *meta;
- };
- union {
- struct task_struct *task;
- };
- u32 fd;
- union {
- struct file *file;
- };
- };
- In the above code, the field 'meta' contains the metadata, which is the same for
- all BPF iterator programs. The rest of the fields are specific to different
- iterators. For example, for task_file iterators, the kernel layer provides the
- 'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
- counted
- <https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
- so they won't go away when the BPF program runs.
- Here is a snippet from the ``bpf_iter_task_file.c`` file:
- ::
- SEC("iter/task_file")
- int dump_task_file(struct bpf_iter__task_file *ctx)
- {
- struct seq_file *seq = ctx->meta->seq;
- struct task_struct *task = ctx->task;
- struct file *file = ctx->file;
- __u32 fd = ctx->fd;
- if (task == NULL || file == NULL)
- return 0;
- if (ctx->meta->seq_num == 0) {
- count = 0;
- BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
- }
- if (tgid == task->tgid && task->tgid != task->pid)
- count++;
- if (last_tgid != task->tgid) {
- last_tgid = task->tgid;
- unique_tgid_count++;
- }
- BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
- (long)file->f_op);
- return 0;
- }
- In the above example, the section name ``SEC(iter/task_file)``, indicates that
- the program is a BPF iterator program to iterate all files from all tasks. The
- context of the program is ``bpf_iter__task_file`` struct.
- The user space program invokes the BPF iterator program running in the kernel
- by issuing a ``read()`` syscall. Once invoked, the BPF
- program can export data to user space using a variety of BPF helper functions.
- You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
- ``bpf_seq_write()`` function based on whether you need formatted output or just
- binary data, respectively. For binary-encoded data, the user space applications
- can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
- you can use ``cat <path>`` to print the results similar to ``cat
- /proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
- use ``rm -f <path>`` to remove the pinned iterator.
- For example, you can use the following command to create a BPF iterator from the
- ``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
- path:
- ::
- $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
- And then print out the results using the following command:
- ::
- $ cat /sys/fs/bpf/my_route
- -------------------------------------------------------
- Implement Kernel Support for BPF Iterator Program Types
- -------------------------------------------------------
- To implement a BPF iterator in the kernel, the developer must make a one-time
- change to the following key data structure defined in the `bpf.h
- <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
- file.
- ::
- struct bpf_iter_reg {
- const char *target;
- bpf_iter_attach_target_t attach_target;
- bpf_iter_detach_target_t detach_target;
- bpf_iter_show_fdinfo_t show_fdinfo;
- bpf_iter_fill_link_info_t fill_link_info;
- bpf_iter_get_func_proto_t get_func_proto;
- u32 ctx_arg_info_size;
- u32 feature;
- struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
- const struct bpf_iter_seq_info *seq_info;
- };
- After filling the data structure fields, call ``bpf_iter_reg_target()`` to
- register the iterator to the main BPF iterator subsystem.
- The following is the breakdown for each field in struct ``bpf_iter_reg``.
- .. list-table::
- :widths: 25 50
- :header-rows: 1
- * - Fields
- - Description
- * - target
- - Specifies the name of the BPF iterator. For example: ``bpf_map``,
- ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
- * - attach_target and detach_target
- - Allows for target specific ``link_create`` action since some targets
- may need special processing. Called during the user space link_create stage.
- * - show_fdinfo and fill_link_info
- - Called to fill target specific information when user tries to get link
- info associated with the iterator.
- * - get_func_proto
- - Permits a BPF iterator to access BPF helpers specific to the iterator.
- * - ctx_arg_info_size and ctx_arg_info
- - Specifies the verifier states for BPF program arguments associated with
- the bpf iterator.
- * - feature
- - Specifies certain action requests in the kernel BPF iterator
- infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
- that the kernel function cond_resched() is called to avoid other kernel
- subsystem (e.g., rcu) misbehaving.
- * - seq_info
- - Specifies the set of seq operations for the BPF iterator and helpers to
- initialize/free the private data for the corresponding ``seq_file``.
- `Click here
- <https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
- to see an implementation of the ``task_vma`` BPF iterator in the kernel.
- ---------------------------------
- Parameterizing BPF Task Iterators
- ---------------------------------
- By default, BPF iterators walk through all the objects of the specified types
- (processes, cgroups, maps, etc.) across the entire system to read relevant
- kernel data. But often, there are cases where we only care about a much smaller
- subset of iterable kernel objects, such as only iterating tasks within a
- specific process. Therefore, BPF iterator programs support filtering out objects
- from iteration by allowing user space to configure the iterator program when it
- is attached.
- --------------------------
- BPF Task Iterator Program
- --------------------------
- The following code is a BPF iterator program to print files and task information
- through the ``seq_file`` of the iterator. It is a standard BPF iterator program
- that visits every file of an iterator. We will use this BPF program in our
- example later.
- ::
- #include <vmlinux.h>
- #include <bpf/bpf_helpers.h>
- char _license[] SEC("license") = "GPL";
- SEC("iter/task_file")
- int dump_task_file(struct bpf_iter__task_file *ctx)
- {
- struct seq_file *seq = ctx->meta->seq;
- struct task_struct *task = ctx->task;
- struct file *file = ctx->file;
- __u32 fd = ctx->fd;
- if (task == NULL || file == NULL)
- return 0;
- if (ctx->meta->seq_num == 0) {
- BPF_SEQ_PRINTF(seq, " tgid pid fd file\n");
- }
- BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
- (long)file->f_op);
- return 0;
- }
- ----------------------------------------
- Creating a File Iterator with Parameters
- ----------------------------------------
- Now, let us look at how to create an iterator that includes only files of a
- process.
- First, fill the ``bpf_iter_attach_opts`` struct as shown below:
- ::
- LIBBPF_OPTS(bpf_iter_attach_opts, opts);
- union bpf_iter_link_info linfo;
- memset(&linfo, 0, sizeof(linfo));
- linfo.task.pid = getpid();
- opts.link_info = &linfo;
- opts.link_info_len = sizeof(linfo);
- ``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
- that only includes opened files for the process with the specified ``pid``. In
- this example, we will only be iterating files for our process. If
- ``linfo.task.pid`` is zero, the iterator will visit every opened file of every
- process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
- that visits opened files of a specific thread, not a process. In this example,
- ``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
- separate file descriptor table. In most circumstances, all process threads share
- a single file descriptor table.
- Now, in the userspace program, pass the pointer of struct to the
- ``bpf_program__attach_iter()``.
- ::
- link = bpf_program__attach_iter(prog, &opts);
- iter_fd = bpf_iter_create(bpf_link__fd(link));
- If both *tid* and *pid* are zero, an iterator created from this struct
- ``bpf_iter_attach_opts`` will include every opened file of every task in the
- system (in the namespace, actually.) It is the same as passing a NULL as the
- second argument to ``bpf_program__attach_iter()``.
- The whole program looks like the following code:
- ::
- #include <stdio.h>
- #include <unistd.h>
- #include <bpf/bpf.h>
- #include <bpf/libbpf.h>
- #include "bpf_iter_task_ex.skel.h"
- static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
- {
- struct bpf_link *link;
- char buf[16] = {};
- int iter_fd = -1, len;
- int ret = 0;
- link = bpf_program__attach_iter(prog, opts);
- if (!link) {
- fprintf(stderr, "bpf_program__attach_iter() fails\n");
- return -1;
- }
- iter_fd = bpf_iter_create(bpf_link__fd(link));
- if (iter_fd < 0) {
- fprintf(stderr, "bpf_iter_create() fails\n");
- ret = -1;
- goto free_link;
- }
- /* not check contents, but ensure read() ends without error */
- while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
- buf[len] = 0;
- printf("%s", buf);
- }
- printf("\n");
- free_link:
- if (iter_fd >= 0)
- close(iter_fd);
- bpf_link__destroy(link);
- return 0;
- }
- static void test_task_file(void)
- {
- LIBBPF_OPTS(bpf_iter_attach_opts, opts);
- struct bpf_iter_task_ex *skel;
- union bpf_iter_link_info linfo;
- skel = bpf_iter_task_ex__open_and_load();
- if (skel == NULL)
- return;
- memset(&linfo, 0, sizeof(linfo));
- linfo.task.pid = getpid();
- opts.link_info = &linfo;
- opts.link_info_len = sizeof(linfo);
- printf("PID %d\n", getpid());
- do_read_opts(skel->progs.dump_task_file, &opts);
- bpf_iter_task_ex__destroy(skel);
- }
- int main(int argc, const char * const * argv)
- {
- test_task_file();
- return 0;
- }
- The following lines are the output of the program.
- ::
- PID 1859
- tgid pid fd file
- 1859 1859 0 ffffffff82270aa0
- 1859 1859 1 ffffffff82270aa0
- 1859 1859 2 ffffffff82270aa0
- 1859 1859 3 ffffffff82272980
- 1859 1859 4 ffffffff8225e120
- 1859 1859 5 ffffffff82255120
- 1859 1859 6 ffffffff82254f00
- 1859 1859 7 ffffffff82254d80
- 1859 1859 8 ffffffff8225abe0
- ------------------
- Without Parameters
- ------------------
- Let us look at how a BPF iterator without parameters skips files of other
- processes in the system. In this case, the BPF program has to check the pid or
- the tid of tasks, or it will receive every opened file in the system (in the
- current *pid* namespace, actually). So, we usually add a global variable in the
- BPF program to pass a *pid* to the BPF program.
- The BPF program would look like the following block.
- ::
- ......
- int target_pid = 0;
- SEC("iter/task_file")
- int dump_task_file(struct bpf_iter__task_file *ctx)
- {
- ......
- if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
- return 0;
- BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
- (long)file->f_op);
- return 0;
- }
- The user space program would look like the following block:
- ::
- ......
- static void test_task_file(void)
- {
- ......
- skel = bpf_iter_task_ex__open_and_load();
- if (skel == NULL)
- return;
- skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
- memset(&linfo, 0, sizeof(linfo));
- linfo.task.pid = getpid();
- opts.link_info = &linfo;
- opts.link_info_len = sizeof(linfo);
- ......
- }
- ``target_pid`` is a global variable in the BPF program. The user space program
- should initialize the variable with a process ID to skip opened files of other
- processes in the BPF program. When you parametrize a BPF iterator, the iterator
- calls the BPF program fewer times which can save significant resources.
- ---------------------------
- Parametrizing VMA Iterators
- ---------------------------
- By default, a BPF VMA iterator includes every VMA in every process. However,
- you can still specify a process or a thread to include only its VMAs. Unlike
- files, a thread can not have a separate address space (since Linux 2.6.0-test6).
- Here, using *tid* makes no difference from using *pid*.
- ----------------------------
- Parametrizing Task Iterators
- ----------------------------
- A BPF task iterator with *pid* includes all tasks (threads) of a process. The
- BPF program receives these tasks one after another. You can specify a BPF task
- iterator with *tid* parameter to include only the tasks that match the given
- *tid*.
|