# Introduction

I have recently been writing something that uses Linux’s firewall framework to perform some non-standard operations on packets. My task requires extending the kernel, but unfortunately the documentation I found online on this topic is quite dated. These old documents mainly cover kernel 2.4 and early 2.6.x versions, in which new matches or targets are registered by calling ipt_register_match and ipt_register_target. The related kernel subsystem has changed a lot since then, and nftables has been introduced as the successor of iptables. Although we can use xt_register_match and xt_register_target instead, I prefer to move to the new nftables framework.

Due to the lack of documentation, I had to dig into the Linux kernel source code to figure out how things work, and this post is my set of notes on that. As Linus Torvalds said in 2008, “Linux is evolution, not intelligent design”, so the design and API of nftables may change quickly. This post is therefore not only a brief review of the design and API of nftables; it also serves as a guide on how to find the correct way of doing things by reading the kernel source code. The eager reader can go directly to the summary section. This post is based on kernel version 4.13, the most recent version when I started writing it.

In this post, we will solve a toy problem: monitor all outgoing TCP traffic from port 80 and log any packet that contains a string given by the user. I don’t assume any knowledge of the design or kernel API of nftables, but I do assume the reader has read and understood the official documentation on how to use nftables.
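Before touching the kernel, it may help to pin down what the toy problem asks for. The following is a minimal user-space sketch of the decision only; `struct toy_segment`, `toy_contains` and `toy_should_log` are hypothetical names invented for this illustration, and a real expression would work on the kernel's packet buffers rather than a plain byte array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of an outgoing TCP segment. */
struct toy_segment {
    uint16_t src_port;          /* source port in host byte order */
    const unsigned char *data;  /* TCP payload */
    size_t len;                 /* payload length in bytes */
};

/* Naive byte-wise substring search; true if needle occurs in hay. */
static bool toy_contains(const unsigned char *hay, size_t hay_len,
                         const char *needle)
{
    size_t n = strlen(needle);

    if (n == 0)
        return true;
    if (n > hay_len)
        return false;
    for (size_t i = 0; i + n <= hay_len; i++)
        if (memcmp(hay + i, needle, n) == 0)
            return true;
    return false;
}

/* True when the segment leaves port 80 and carries the user's string. */
static bool toy_should_log(const struct toy_segment *seg, const char *needle)
{
    assert(seg != NULL && needle != NULL);
    return seg->src_port == 80 &&
           toy_contains(seg->data, seg->len, needle);
}
```

Calling `toy_should_log` with a segment from port 80 whose payload contains the needle returns true; any other port or a missing needle returns false.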

# Starting point of kernel code

The first step is to find which source files to read. The following command gives a nice overview of nftables in the Linux kernel:

The output should look something like:

The files listed in the output are the files to look at. A good tool for reading Linux kernel source code is FreeElectrons. We can start by looking at the file names in the directory include/net/netfilter on its website. We can see nft_masq.h, nft_redir.h and nft_reject.h in that directory. These are all actions in nftables. Following the references to the symbols defined in these files leads us to sample code showing how to create new actions. Let’s take reject as an example. From its header, we can find an interesting symbol, nft_reject_init. Looking through all the definitions and references of that symbol, we are able to find the core code at net/ipv4/netfilter/nft_reject_ipv4.c. In its core code, L61-L72 reads:

We can immediately see from the above code that we should call nft_register_expr to register an expression and nft_unregister_expr to unregister it.
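The register/unregister pairing can be pictured with a small user-space model. This is only a sketch of the pattern, not the kernel implementation: `struct model_expr_type`, `model_register_expr`, `model_unregister_expr` and `model_lookup` are hypothetical stand-ins, and refusing duplicate names is a choice made for this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct nft_expr_type: a name and a list link. */
struct model_expr_type {
    const char *name;
    struct model_expr_type *next;
};

static struct model_expr_type *model_registry; /* head of registered types */

/* Look up a registered type by name; NULL if absent. */
static struct model_expr_type *model_lookup(const char *name)
{
    for (struct model_expr_type *t = model_registry; t; t = t->next)
        if (strcmp(t->name, name) == 0)
            return t;
    return NULL;
}

/* Add a type to the registry; this model refuses duplicate names. */
static int model_register_expr(struct model_expr_type *type)
{
    assert(type != NULL && type->name != NULL);
    if (model_lookup(type->name))
        return -EEXIST;
    type->next = model_registry;
    model_registry = type;
    return 0;
}

/* Remove a type from the registry; a no-op if it was never registered. */
static void model_unregister_expr(struct model_expr_type *type)
{
    for (struct model_expr_type **p = &model_registry; *p; p = &(*p)->next) {
        if (*p == type) {
            *p = type->next;
            type->next = NULL;
            return;
        }
    }
}
```

In a module, the register call would sit in the module's init function and the unregister call in its exit function, so that the expression is known to the framework exactly while the module is loaded.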

Now let’s take a look at the prototype of nft_register_expr in nf_tables.h:

It takes one parameter of type struct nft_expr_type *. This struct is also defined in nf_tables.h, at L681. The usage of nft_register_expr and nft_expr_type can be guessed by reading all the examples. The list of all examples can be found by following its references here. In that list, there is one file whose name looks very interesting:

Opening this link, we can see all these basic expressions:

Now we know what to look at. The next step is to read these examples to get a feeling for how to write our own expression.

# The usage of kernel API

To learn how the related API is used, we read the reject operation for the inet family and the compare operation as sample code. The source code of reject is located at net/netfilter/nft_reject_inet.c. The source code of compare is located at net/netfilter/nft_cmp.c. When one expression corresponds to only one operation, the usage is shown by the source code of the reject operation at L120:

From the above code, we know that we should create an instance of both struct nft_expr_ops and struct nft_expr_type and make them point to each other through nft_expr_ops.type and nft_expr_type.ops. When one expression corresponds to many operations, the usage is shown by the source code of the compare operation:

From this we can see that we should create an instance of struct nft_expr_ops for each operation, and use select_ops to choose dynamically which operation to use. The select_ops function should return a pointer to the chosen operation, or an ERR_PTR in case of error. Now let’s discuss struct nft_expr_ops and struct nft_expr_type in detail, one at a time.
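The two ideas above, ops and type pointing at each other and select_ops returning either an ops pointer or an error pointer, can be sketched in user space as follows. Everything prefixed `model_` is hypothetical, the error-pointer helpers are a re-implementation of the idiom rather than the kernel's own macros, and the length limits are made up for the demonstration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space imitation of the error-pointer idiom. Assumption: errno
 * values are below 4096, so the top 4095 addresses are never valid. */
#define MODEL_MAX_ERRNO 4095
static inline void *model_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long model_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_expr_type; /* forward declaration: ops and type refer to each other */

struct model_expr_ops {
    const struct model_expr_type *type;
    const char *label;  /* hypothetical field, only for the demo */
};

struct model_expr_type {
    const char *name;
    const struct model_expr_ops *ops;                            /* one operation */
    const struct model_expr_ops *(*select_ops)(size_t data_len); /* many operations */
};

static struct model_expr_type model_cmp_type; /* tentative definition */

static const struct model_expr_ops model_cmp_fast_ops = {
    .type = &model_cmp_type, .label = "fast",
};
static const struct model_expr_ops model_cmp_ops = {
    .type = &model_cmp_type, .label = "generic",
};

/* Fast variant for operands of at most 4 bytes, generic up to 16 bytes,
 * error pointer otherwise (limits invented for this model). */
static const struct model_expr_ops *model_cmp_select_ops(size_t data_len)
{
    if (data_len == 0 || data_len > 16)
        return model_err_ptr(-EINVAL);
    return data_len <= 4 ? &model_cmp_fast_ops : &model_cmp_ops;
}

static struct model_expr_type model_cmp_type = {
    .name = "cmp",
    .select_ops = model_cmp_select_ops,
};
```

A caller checks the result with `model_is_err` before dereferencing it, and recovers the negative error code with `model_ptr_err`.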

## struct nft_expr_ops

Let’s take a look at struct nft_expr_ops first. Its definition is at include/net/netfilter/nf_tables.h#L722:

From the names and comments of these fields, we can see that init, destroy and clone play the roles of constructor, destructor and copy constructor. What to do in these functions is shown in the code above. In that code, init is defined as nft_reject_inet_init, while clone and destroy are not defined. The source code for nft_reject_inet_init is located at L64:

By reading this function and all the functions it calls, we can see that the following happens: the kernel allocates memory at expr->data for an instance of struct nft_reject, which is the struct that stores operation-specific data. So that the kernel knows how much memory to allocate for struct nft_reject, its size is passed in nft_expr_ops.size, as shown in the 4th line of the code snippet above:

The init function is responsible for initializing the fields of this instance by reading attributes from netlink with functions like nla_get_<type>. The data from netlink is stored in the argument tb. In case of error, the init function should return a negative number; otherwise it should return 0. So far, we are not sure how to let netlink know which attributes are expected and how long they are, but don’t worry, things will become clear as we keep reading. For now, let’s just set this problem aside.
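The division of labour described above (the framework allocates size bytes, then init fills them from the attribute table) can be modelled in user space. This is a simplified sketch: `struct model_attr`, `model_reject_init` and `model_expr_create` are hypothetical, attribute lengths are assumed to have been validated already, and which attribute is mandatory is a choice of this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute numbering; slot 0 is reserved as "unspecified". */
enum { MODEL_ATTR_UNSPEC, MODEL_ATTR_TYPE, MODEL_ATTR_CODE,
       MODEL_ATTR_MAX = MODEL_ATTR_CODE };

/* Simplified attribute: just a payload pointer and its length. */
struct model_attr {
    const void *data;
    size_t len;
};

/* Operation-specific data; the framework allocates ops->size bytes for it. */
struct model_reject {
    uint32_t type;
    uint8_t code;
};

struct model_expr_ops {
    size_t size; /* bytes of private data to allocate */
    int (*init)(void *priv, const struct model_attr *const tb[]);
};

static uint32_t model_get_u32(const struct model_attr *a)
{
    uint32_t v;

    memcpy(&v, a->data, sizeof(v)); /* avoids unaligned access */
    return v;
}

static uint8_t model_get_u8(const struct model_attr *a)
{
    return *(const uint8_t *)a->data;
}

/* Fill the private data from tb; 0 on success, negative errno on error. */
static int model_reject_init(void *priv, const struct model_attr *const tb[])
{
    struct model_reject *r = priv;

    if (tb[MODEL_ATTR_TYPE] == NULL)
        return -EINVAL;              /* mandatory attribute missing */
    r->type = model_get_u32(tb[MODEL_ATTR_TYPE]);
    if (tb[MODEL_ATTR_CODE] != NULL) /* optional attribute */
        r->code = model_get_u8(tb[MODEL_ATTR_CODE]);
    return 0;
}

static const struct model_expr_ops model_reject_ops = {
    .size = sizeof(struct model_reject),
    .init = model_reject_init,
};

/* Framework side of the model: allocate ops->size bytes, then call init. */
static void *model_expr_create(const struct model_expr_ops *ops,
                               const struct model_attr *const tb[], int *err)
{
    void *priv = calloc(1, ops->size);

    if (priv == NULL) {
        *err = -ENOMEM;
        return NULL;
    }
    *err = ops->init(priv, tb);
    if (*err < 0) {
        free(priv);
        return NULL;
    }
    return priv;
}
```

The point of the size field is visible in `model_expr_create`: the framework never needs to know the layout of `struct model_reject`, only how many bytes to reserve for it.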

Now let’s take a look at the dump field. For the reject operation it is implemented by nft_reject_inet_dump. The code is located at L96:

We can see that this operation sends the parameters back to netlink using functions like nla_put_<type>. In case of success, it should return 0; otherwise it should return a negative number.
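Conceptually, dump is the mirror image of init: it writes each configured parameter back into an outgoing message and fails if the message has no room left. The sketch below models that with a tiny type/length/value buffer. `struct model_msg`, `model_put` and `model_reject_dump` are hypothetical, and the record layout is invented for this illustration; it is not the netlink wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical output message: a fixed buffer of type/length/value records. */
struct model_msg {
    unsigned char buf[16];
    size_t used;
};

/* Append one attribute; returns 0, or -1 when the buffer has no room left. */
static int model_put(struct model_msg *m, uint8_t type,
                     const void *val, uint8_t len)
{
    if (sizeof(m->buf) - m->used < (size_t)len + 2)
        return -1;
    m->buf[m->used++] = type;
    m->buf[m->used++] = len;
    memcpy(m->buf + m->used, val, len);
    m->used += len;
    return 0;
}

struct model_reject {
    uint32_t type;
    uint8_t code;
};

enum { MODEL_ATTR_TYPE = 1, MODEL_ATTR_CODE = 2 };

/* Write every configured parameter back out; 0 on success, negative on error. */
static int model_reject_dump(struct model_msg *m, const struct model_reject *r)
{
    if (model_put(m, MODEL_ATTR_TYPE, &r->type, sizeof(r->type)) < 0)
        return -1;
    if (model_put(m, MODEL_ATTR_CODE, &r->code, sizeof(r->code)) < 0)
        return -1;
    return 0;
}
```

A user-space tool listing the ruleset would parse these records back into the same parameters that init originally received.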

The function that evaluates the expression corresponds to the field eval. We can think of two types of expressions: those that match some conditions, and those that do something to the packets, such as drop, reject, accept or dnat. For these two types, eval should be slightly different. Here we use the source code of both compare and reject as examples. In reject, it is implemented as nft_reject_inet_eval. The source code is located at L20:

In the compare operation, it is implemented as nft_cmp_eval. The source code is located at L27:

From these two functions, we can see that eval tells the kernel to do something by setting regs->verdict.code, or to continue with the next expression by leaving regs->verdict.code unchanged. For actions, the value of regs->verdict.code should be set to one of the following, as shown in include/uapi/linux/netfilter.h#L9:

For matches, it should be a value in enum nft_verdicts, which is listed at include/uapi/linux/netfilter/nf_tables.h#L49:
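To make the two kinds of eval concrete, here is a user-space sketch of how a rule made of a match followed by an action might be evaluated. The `MODEL_` constants, `struct model_regs` and `model_run_rule` are hypothetical and local to this sketch; the real values and the real evaluation loop live in the kernel headers and sources cited in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local verdict constants for this model only. */
enum { MODEL_CONTINUE = -1, MODEL_BREAK = -2, MODEL_DROP = 0, MODEL_ACCEPT = 1 };

struct model_regs {
    int verdict;   /* plays the role of regs->verdict.code */
    uint32_t reg0; /* one data register, e.g. a loaded port number */
};

struct model_expr {
    void (*eval)(const struct model_expr *e, struct model_regs *regs);
    uint32_t arg;
};

/* Match: leave the verdict alone on success, break out of the rule on mismatch. */
static void model_cmp_eval(const struct model_expr *e, struct model_regs *regs)
{
    if (regs->reg0 != e->arg)
        regs->verdict = MODEL_BREAK;
}

/* Action: always decide the packet's fate. */
static void model_drop_eval(const struct model_expr *e, struct model_regs *regs)
{
    (void)e;
    regs->verdict = MODEL_DROP;
}

/* Run the expressions of one rule in order until one of them sets a verdict. */
static int model_run_rule(const struct model_expr *exprs, size_t n,
                          uint32_t reg0)
{
    struct model_regs regs = { .verdict = MODEL_CONTINUE, .reg0 = reg0 };

    for (size_t i = 0; i < n && regs.verdict == MODEL_CONTINUE; i++)
        exprs[i].eval(&exprs[i], &regs);
    return regs.verdict;
}
```

With a rule consisting of "compare with 80" followed by "drop", a packet whose register holds 80 reaches the action and is dropped, while any other packet stops at the match and never reaches the action.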

The field validate is used to check the validity of an operation. For example, masquerade is only available at the POSTROUTING hook point, and reject is only available at the LOCAL_INPUT, LOCAL_OUTPUT and FORWARD hook points. This can be seen in the source code at net/netfilter/nft_reject.c#L29:

The function nft_chain_validate_hooks is used to validate the hook point. There are other helper functions that validate different things; the list of these functions can be obtained by searching for the string “validate” in include/net/netfilter/nf_tables.h.
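The hook check boils down to a bitmask test: each hook point gets one bit, and the operation lists the bits it accepts. A minimal user-space sketch, assuming hypothetical names (`enum model_hook`, `model_validate_hooks`, `model_reject_validate`) and treating the returned error code as a choice of this model:

```c
#include <assert.h>
#include <errno.h>

/* Hook points of this model (hypothetical names). */
enum model_hook {
    MODEL_PRE_ROUTING,
    MODEL_LOCAL_IN,
    MODEL_FORWARD,
    MODEL_LOCAL_OUT,
    MODEL_POST_ROUTING
};

struct model_chain {
    enum model_hook hooknum; /* hook point the chain is attached to */
};

/* Accept the chain only if its hook is among the allowed bits. */
static int model_validate_hooks(const struct model_chain *chain,
                                unsigned int allowed)
{
    if (allowed & (1u << chain->hooknum))
        return 0;
    return -EOPNOTSUPP;
}

/* A reject-like operation: valid for local input, forward and local output. */
static int model_reject_validate(const struct model_chain *chain)
{
    return model_validate_hooks(chain,
                                (1u << MODEL_LOCAL_IN) |
                                (1u << MODEL_FORWARD) |
                                (1u << MODEL_LOCAL_OUT));
}
```

Adding the rule to a chain at an unsupported hook point then fails at configuration time, before any packet is evaluated.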

## struct nft_expr_type

The definition of nft_expr_type is at include/net/netfilter/nf_tables.h#L681:

The fields ops and select_ops have already been discussed; the field list is used internally, so we need not worry about it here; the field name is the name of the expression; the field owner should be set to a pointer to the current module. These are all trivial fields. Now let’s take a look at the policy and maxattr fields. The related code in the definition of nft_reject_inet_type is:

The array nft_reject_policy is defined at L23:

The two array indices above, NFTA_REJECT_TYPE and NFTA_REJECT_ICMP_CODE, belong to an enum named nft_reject_attributes. The definitions of NFTA_REJECT_MAX and nft_reject_attributes are located at include/uapi/linux/netfilter/nf_tables.h#L1089:

Recall that we raised a question earlier about how the kernel knows which attributes an expression expects. The policy field is exactly the answer to this question. Let’s now dig deeper and read the netlink source code starting at include/net/netlink.h#L9:

The comments explain themselves very well. The attributes of the different expressions are all defined at include/uapi/linux/netfilter/nf_tables.h. To get a feeling for how to write an array like this, just search for the string “attributes” in this file. Every attribute definition should begin with an UNSPEC entry to leave space for internal usage.
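The shape of such an attribute enum and its policy table can be sketched in user space. All names here are hypothetical (including the two attributes chosen for “abcde”), and `model_validate_attr` is a deliberately simplified check whose rules are choices of this model; the kernel's own validation differs in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Attribute numbering: UNSPEC first, then the attributes, then a sentinel. */
enum model_abcde_attributes {
    MODEL_ABCDE_UNSPEC,
    MODEL_ABCDE_TEXT,
    MODEL_ABCDE_COUNT,
    MODEL_ABCDE_END_ /* sentinel, one past the last attribute */
};
#define MODEL_ABCDE_MAX (MODEL_ABCDE_END_ - 1)

enum model_type { MODEL_T_UNSPEC, MODEL_T_U32, MODEL_T_STRING };

struct model_policy {
    enum model_type type;
    size_t len; /* maximum payload length, used for strings */
};

/* One policy entry per attribute, indexed by the attribute number. */
static const struct model_policy model_abcde_policy[MODEL_ABCDE_MAX + 1] = {
    [MODEL_ABCDE_TEXT]  = { .type = MODEL_T_STRING, .len = 15 },
    [MODEL_ABCDE_COUNT] = { .type = MODEL_T_U32 },
};

/* Check one incoming attribute against the policy table. */
static int model_validate_attr(const struct model_policy *policy, int maxattr,
                               int attr, size_t payload_len)
{
    if (attr <= 0 || attr > maxattr)
        return -EINVAL; /* model choice: reject unknown attributes */
    switch (policy[attr].type) {
    case MODEL_T_U32:
        return payload_len == sizeof(uint32_t) ? 0 : -ERANGE;
    case MODEL_T_STRING:
        return payload_len >= 1 && payload_len <= policy[attr].len
                   ? 0 : -ERANGE;
    default:
        return 0;       /* no constraint recorded */
    }
}
```

Because the table is indexed by attribute number and sized by the maximum attribute, passing the table as policy and the maximum as maxattr is all the framework needs in order to check incoming attributes before init ever sees them.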

The field family is the address family of your expression. Possible values can be found at include/uapi/linux/netfilter.h#L59:

The field flags is used to denote expression types. Currently only one flag is available, which indicates whether an expression is stateful. See include/net/netfilter/nf_tables.h#L707:

# Summary on kernel codes

Create an instance of struct nft_expr_ops for each operation of this expression and implement its fields as described in its definition. Use init, clone and destroy to initialize, clone and destroy the object. In init, read the attributes from netlink and set up the operation’s struct. Implement the core function of the operation in eval, and tell the kernel what to do by setting regs->verdict.code. In dump, send the attributes through netlink. Apply constraints to the operation in validate.

Create an instance of struct nft_expr_type for your expression and implement its fields as described in its definition. If you have multiple operations that should be selected dynamically, implement select_ops; otherwise set ops. Set name and owner according to your expression. If applicable, set the address family in family. If applicable, use flags to indicate that your expression is stateful. Create an array of struct nla_policy, set up the attribute information in that array, and set this array as policy. Set maxattr to the maximum attribute number.

Call nft_register_expr to register your expression. Call nft_unregister_expr to unregister your expression.

# Writing our own kernel code

With the knowledge of how to write the kernel code, we are ready to write our own module that adds our expression. Here we call our expression “abcde”.

abcde.h:

abcde.c:

Makefile:

The complete source code for this example can be found at GitHub:
https://github.com/zasdfgbnm/nftables-abcde

# Modify user space tool

To use our new expression “abcde” conveniently, it would be good to modify the source code of the user space tool, i.e. the nft command, to make it aware of our new expression. Extending the user space tool is easier. We first check it out from its git repository and switch to tag v0.7 (the newest release when this article was written):

To figure out what to modify, let’s run grep to see how the expression reject is implemented:

The above command will output something like:

This tells us which files we may want to modify. A good starting point is scanner.l and parser_bison.y. We can copy and paste the code for reject and replace it with our own.

After some trial and error, we end up with the following patch generated by git diff v0.7:

The same applies to libnftnl: we clone the repository and check out the tag libnftnl-1.0.7:

After some trial and error, we end up with the following patch generated by git diff libnftnl-1.0.7:

The abcde branch of nftables and libnftnl can be found at GitHub:
https://github.com/zasdfgbnm/nftables/tree/abcde
https://github.com/zasdfgbnm/libnftnl/tree/abcde

# Test

Our new module can be tested by loading it and then using our self-compiled nft tool to add a rule that looks like:

Start a darkhttpd server and access it; the output of dmesg will look like: