Nftables, Docker, and a default drop policy


One of the absolute basics of securing a server is properly configuring the firewall so that internal services are not exposed. The framework for packet filtering on Linux systems is Netfilter, with two main frontends: the legacy iptables, and the newer nftables.

They allow you to define rules for accepting or dropping packets based on conditions such as destination IP address and port. Things start to get a little more complicated when you get into connection tracking, NAT, forwarding, masquerading, and so on.

This article starts with a drop-by-default ruleset, explains the issue with applications like Docker trying to take control over the entire firewall, and provides a solution to make them work while keeping full control over the firewall.

§
The basics

Nftables rules belong to chains attached to hooks in the Netfilter packet processing flow. Chains are organized within tables of a given network family (nowadays you can use inet to handle both IPv4 and IPv6 within a single table).

The input hook handles packets addressed to the host, while the forward hook handles packets routed through the host but not destined to it. The default policy determines what happens to a packet that wasn't accepted, dropped, or rejected during the evaluation of a chain.

I always start my ruleset by dropping everything in the input and forward hooks, which blocks any connection to the host:

table inet firewall
delete table inet firewall

table inet firewall {
    chain input {
        type filter hook input priority filter; policy drop
    }

    chain forward {
        type filter hook forward priority filter; policy drop
    }
}

The first two instructions ensure the firewall table exists and then delete it, so that re-applying this file starts from a clean slate. The important point here is that you shouldn't use the nat and filter tables, or statements like flush ruleset, because you would conflict with the rules managed by other applications through the iptables-nft compatibility layer.
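Assuming the file is saved at /etc/nftables.conf (the path is illustrative), applying and inspecting it looks like this:

# apply the file; the declare + delete prologue makes this idempotent
nft -f /etc/nftables.conf

# list only our table, leaving other applications' tables alone
nft list table inet firewall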

To get back some basic functionality, like access to local services, ping, DHCP, IPv6, and connection tracking, you can add a few basic rules:

table inet firewall
delete table inet firewall

table inet firewall {
    chain input {
        type filter hook input priority filter; policy drop

        # accept packets from established connections, drop invalid ones
        ct state established,related accept
        ct state invalid drop

        # accept anything on the loopback interface
        iifname lo accept

        # accept ping and the ICMPv6 messages needed for neighbor discovery
        ip protocol icmp icmp type echo-request accept
        ip6 nexthdr icmpv6 icmpv6 type { echo-request, nd-router-advert, nd-router-solicit, nd-neighbor-advert, nd-neighbor-solicit } accept

        # accept local name resolution (LLMNR, port 5355)
        meta l4proto { tcp, udp } th dport llmnr accept
    }

    chain forward {
        type filter hook forward priority filter; policy drop

        ct state established,related accept
        ct state invalid drop
    }
}

I won't go into more detail here on the basic ruleset and the fundamentals of Nftables rules; resources such as the Nftables wiki cover them in depth.

§
The problem with accept

Consider an application like Docker, which emits Nftables rules through the iptables-nft compatibility layer in the filter and nat tables. When you expose a container on port 8080/tcp, Docker adds a few rules that can be summarized as follows:

table inet nat {
    chain prerouting {
        tcp dport 8080 dnat to CONTAINER_IP:CONTAINER_PORT
    }
}

table inet filter {
    chain forward {
        ip daddr CONTAINER_IP tcp dport CONTAINER_PORT accept
    }
}

Assuming the ruleset also contains the firewall table from the previous section, do you think you can connect to the service bound to port 8080?

The nice Netfilter packet flow diagram makes it obvious that if you accept a packet at the prerouting stage, it will flow to the next stages until it is eventually dropped, for instance in the input or forward stage. A common misconception about Netfilter is what happens when multiple chains attached to the same hook give a contradictory verdict.

Since Nftables supports priorities, you may think that you can choose the order in which distinct chains execute within the same hook (you can: lower numeric priority values run first), and you may further assume that if you accept a packet in the chain that runs first, it goes straight to the next stage. This isn't how it works.

If a packet is accepted in some chain, it still traverses all the other chains attached to the same hook. For it to be considered truly accepted, it must traverse all of them without ever being dropped. That means no chain, in any table, can really "accept" a packet on behalf of the whole ruleset; the only thing it can do is decline to drop or reject it within the current chain.
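To make this concrete, here is a minimal sketch (both table names are made up) where an accept in one chain does not save a packet from the drop policy of another chain attached to the same hook:

table inet app {
    chain input {
        type filter hook input priority -10; policy accept

        # "accept" only ends the evaluation of this chain...
        tcp dport 8080 accept
    }
}

table inet strict {
    chain input {
        type filter hook input priority 0; policy drop

        # ...the packet still traverses this chain, matches nothing,
        # and hits the drop policy: the connection to port 8080 fails
    }
}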

The unfortunate consequence is that an application that emits rules to drop packets it considers invalid must take care not to interfere with other applications. That is exactly what our firewall table does, with its default drop policy, to applications like Docker that rely on their own rules to forward packets.

What can we do to keep the default drop policy, but delegate the handling of Docker rules to the docker table?

  1. Not using a separate table and putting the rules in the DOCKER-USER chain.

    If you use iptables, Docker sets the default forward policy to drop, so it provides an escape hatch with the DOCKER-USER chain, where you can add your own filtering rules that run before Docker's (see the sketch after this list). The main drawback of this solution is that Docker has effectively taken over your entire firewall, and other applications need to be aware of that.

  2. Replicating the rules in our own ruleset. This isn't practical without fixed container IP addresses, unless you add a layer of automation.

  3. Running the container in the host network namespace, so the port binds directly to the host.

    Then you can add an accept rule in the input chain, like iifname enp1s0 tcp dport 8080 accept. Of course, it is better from a security standpoint when containers do not have access to the host network.
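To illustrate option 1, here is a sketch of a DOCKER-USER rule in iptables syntax (the interface name and subnet are made up); rules inserted in this chain run before Docker's own forwarding rules:

# only allow the local network to reach published container ports
iptables -I DOCKER-USER -i enp1s0 ! -s 192.168.1.0/24 -j DROP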

In the following sections, let's try to keep our default drop policy and open the firewall just enough for Docker, without having to replicate its rules and while keeping full control over what we accept.

§
Allow forwarding to all Docker networks

The simplest solution is to allow forwarding for any network managed by Docker. It happens that Docker allocates container IPs from a configurable address pool, mostly within 172.17.0.0/12 by default (see: moby:libnetwork/ipamutils/utils.go).
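For reference, this pool is configurable through the default-address-pools key in /etc/docker/daemon.json. The values below are illustrative rather than an exact copy of the built-in defaults:

{
  "default-address-pools": [
    { "base": "172.17.0.0/12", "size": 16 }
  ]
}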

So you can just add the following rules:

table inet firewall {
    chain forward {
        ip saddr 172.17.0.0/12 accept
        ip daddr 172.17.0.0/12 accept
    }
}

While Docker is extremely happy, you have no control over which containers are accessible from the outside. Imagine you want to test a local service: it will automatically be made publicly accessible.

Note that it can be made to work to a certain extent. Docker doesn't have the level of control of something like kube-router, which generates the appropriate rules to restrict k8s services if you specify the allowed networks, but there are a few things that do work:

  • Packets to 172.17.0.0/12 are not routable over the internet, but other machines on the local network do not necessarily have this limitation.

  • When you start your container, you can explicitly bind to 127.0.0.1:8080 (or, optionally, to other IP addresses) instead of just 8080, which is implicitly expanded to 0.0.0.0:8080, as shown after this list. From a security standpoint, Linux filters martian packets by default, but binding to specific IP addresses has its limitations (starting with DHCP).

  • You can leverage Docker networks to create well defined zones and associated firewall rules, but the granularity may be an issue if you want to restrict access to some app while exposing another.
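As an example for the second point, a container port can be published on the loopback interface only (the image name and ports are illustrative):

# reachable via 127.0.0.1:8080 on the host, but not from other machines
docker run --rm -p 127.0.0.1:8080:80 nginx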

§
Allow forwarding to a specific container port

If the service is bound to a well-defined port on the host, the simplest way to filter packets is to match by protocol, input interface or IP address, and port. That's it.

Let's define the set of interfaces that are part of the public zone, and allow forwarded packets from these interfaces to the bound port:

table inet firewall {
    chain forward {
        define public_ifs = { "enp1s0" }
        iifname $public_ifs tcp dport HOST_PORT accept
    }
}

The main issue here is that the forward hook is too late to reject this packet, as we no longer have the original port.

The rule Docker adds to the prerouting hook DNATs packets from HOST_PORT to CONTAINER_IP:CONTAINER_PORT. When the packet later arrives in the forward hook, its destination port is CONTAINER_PORT, not HOST_PORT. You could always make sure to bind the same port on the host and inside the container, but that may not be practical depending on the ports set in the base images and/or bound by the services already running on the host.

With Nftables, you can attach filter chains to the prerouting hook that run before DNAT. While you would have the original port there, you do not know at this early stage whether the packet is destined to the host or whether it will be forwarded. You could manually check if the destination IP matches any of the local IPs and do the filtering that way, but this isn't really practical in the context of DHCP, for example.

Additionally, you do not know what will happen later in the prerouting stage at the time your filtering rule is evaluated:

  • If your host functions as a router, a packet addressed to the host IP could be redirected to another host during the prerouting stage.
  • If you drop packets to port 8080 because you do not run a local webserver, you could also block legitimate traffic forwarded through the host to another machine's port 8080.

This is where we see the real benefit of the split between the input and forward hooks, which works based on the knowledge of the local interface IPs. But this is also the core of the issue, because we do not know the original port when a packet arrives in the forward hook after being DNATed by Docker.

The astute reader may try to attach a chain to the forward hook with a priority before DNAT. Unfortunately, this doesn't work either, because priorities are local to hooks, not global: DNAT always happens before input and forward, despite the priorities having names that suggest otherwise.

The solution is to rely on a conntrack feature that provides the original port before DNAT. The following rule accepts traffic originally directed to port 8080 and later forwarded to the Docker network:

table inet firewall {
    chain forward {
        define public_ifs = { "enp1s0" }
        iifname $public_ifs ip daddr 172.17.0.0/12 ct original proto-dst 8080 accept
    }
}

A similar rule is provided by the Docker documentation for iptables.
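For completeness, that iptables variant looks roughly like the following (the interface name is illustrative); the conntrack match exposes the same pre-DNAT information through --ctorigdstport:

iptables -I DOCKER-USER -i enp1s0 -p tcp -m conntrack --ctorigdstport 8080 --ctdir ORIGINAL -j ACCEPT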

§
Going further

The Docker packet filtering and firewall documentation provides alternative solutions. You could also try the following:

  • Running Docker itself inside a network namespace, and adding explicit forwarding rules from the host to the Docker namespace.

  • Setting marks on a flow in the prerouting stage before DNAT (this is where priorities are important), then accepting packets carrying this mark in the forward chain; a sketch follows below. Unfortunately, this is a feature that may cause conflicts when multiple applications rely on it.
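Here is a minimal sketch of that mark-based approach (the mark value 0x1 is arbitrary and could collide with other users of conntrack marks):

table inet firewall {
    chain mark_prerouting {
        # dstnat priority is -100, so -150 runs before Docker's DNAT
        type filter hook prerouting priority -150; policy accept

        iifname "enp1s0" tcp dport 8080 ct mark set 0x1
    }

    chain forward {
        # the connection keeps its mark after being DNATed
        ct mark 0x1 accept
    }
}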