Description
Not sure if this should be considered a netavark firewall issue, a kernel issue, or just a performance limitation.
I have a server (Ryzen 5600X, 1 gigabit ethernet port with 4 VLANs on top of it) that is acting as my home router as well as running a bunch of containers. I have 47 containers running across 11 networks (resulting in 11 network interfaces getting created). Each of those networks has a private IPv4 subnet, a public IPv6 subnet, and a ULA (private) IPv6 subnet assigned. I'm using firewalld as my backend (which itself is using nftables as its backend), with other non-container-related rules configured for general connectivity and firewalling. I also have CAKE SQM set up in both the ingress and egress directions, with the bandwidth set to 1Gb/s. SQM for the ingress direction is done by redirecting incoming packets to an IFB interface with a tc filter rule.
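For reference, the SQM setup is roughly the following (a sketch; eth0 and ifb0 are placeholder names for the actual interfaces, and the cake options are trimmed down):

# egress: shape outgoing traffic directly on the physical interface
$ tc qdisc replace dev eth0 root cake bandwidth 1gbit
# ingress: redirect incoming packets to an IFB device and shape there
$ ip link add ifb0 type ifb
$ ip link set ifb0 up
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: matchall action mirred egress redirect dev ifb0
$ tc qdisc replace dev ifb0 root cake bandwidth 1gbit ingress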
Recently, I did a test with iperf3 between this server and two devices connected via ethernet. On both devices, TCP traffic from the server to the device is sent at around 920-970 Mb/s (basically the full line rate), but TCP traffic from the device to the server tops out at ~600 Mb/s, with occasional drops to 200 Mb/s. During this time, I can see that ksoftirqd is running at 100% on the server, suggesting a CPU bottleneck. (When traffic is sent from the server to the device, ksoftirqd is not at 100% on either side.)
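(For anyone reproducing this: the per-CPU softirq load is easy to watch while iperf3 is running, e.g. with mpstat from the sysstat package; the %soft column is where the ksoftirqd saturation shows up.)

$ mpstat -P ALL 1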
Here, 192.168.3.1 is the server which is running as a router and is running all of the containers.
$ iperf3 -c 192.168.3.1 -t 60
Connecting to host 192.168.3.1, port 5201
[ 5] local 192.168.3.21 port 35770 connected to 192.168.3.1 port 5201
[ ID] Interval Transfer Bitrate Retr Cwnd
[ 5] 0.00-1.00 sec 73.8 MBytes 618 Mbits/sec 0 296 KBytes
[ 5] 1.00-2.00 sec 73.5 MBytes 617 Mbits/sec 0 250 KBytes
[ 5] 2.00-3.00 sec 73.0 MBytes 612 Mbits/sec 0 227 KBytes
[ 5] 3.00-4.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 4.00-5.00 sec 69.5 MBytes 583 Mbits/sec 0 285 KBytes
[ 5] 5.00-6.00 sec 72.6 MBytes 609 Mbits/sec 0 250 KBytes
[ 5] 6.00-7.00 sec 72.1 MBytes 605 Mbits/sec 0 227 KBytes
[ 5] 7.00-8.00 sec 70.4 MBytes 590 Mbits/sec 0 168 KBytes
[ 5] 8.00-9.00 sec 72.1 MBytes 605 Mbits/sec 0 174 KBytes
[ 5] 9.00-10.00 sec 71.2 MBytes 598 Mbits/sec 0 238 KBytes
[ 5] 10.00-11.00 sec 72.6 MBytes 609 Mbits/sec 0 122 KBytes
[ 5] 11.00-12.00 sec 72.1 MBytes 605 Mbits/sec 0 221 KBytes
[ 5] 12.00-13.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 13.00-14.00 sec 71.8 MBytes 602 Mbits/sec 0 180 KBytes
[ 5] 14.00-15.00 sec 68.4 MBytes 574 Mbits/sec 0 378 KBytes
[ 5] 15.00-16.00 sec 71.2 MBytes 598 Mbits/sec 0 116 KBytes
[ 5] 16.00-17.00 sec 71.2 MBytes 598 Mbits/sec 0 250 KBytes
[ 5] 17.00-18.00 sec 72.6 MBytes 609 Mbits/sec 0 168 KBytes
[ 5] 18.00-19.00 sec 72.1 MBytes 605 Mbits/sec 0 331 KBytes
[ 5] 19.00-20.00 sec 73.0 MBytes 612 Mbits/sec 0 261 KBytes
[ 5] 20.00-21.00 sec 72.6 MBytes 609 Mbits/sec 0 180 KBytes
[ 5] 21.00-22.00 sec 70.4 MBytes 590 Mbits/sec 0 267 KBytes
[ 5] 22.00-23.00 sec 72.1 MBytes 605 Mbits/sec 0 174 KBytes
[ 5] 23.00-24.00 sec 72.2 MBytes 606 Mbits/sec 0 221 KBytes
[ 5] 24.00-25.00 sec 69.6 MBytes 584 Mbits/sec 0 349 KBytes
[ 5] 25.00-26.00 sec 71.2 MBytes 598 Mbits/sec 0 180 KBytes
[ 5] 26.00-27.00 sec 71.2 MBytes 598 Mbits/sec 0 238 KBytes
...
$ iperf3 -c 192.168.3.1 --bidir
Connecting to host 192.168.3.1, port 5201
[ 5] local 192.168.3.21 port 37654 connected to 192.168.3.1 port 5201
[ 7] local 192.168.3.21 port 37670 connected to 192.168.3.1 port 5201
[ ID][Role] Interval Transfer Bitrate Retr Cwnd
[ 5][TX-C] 0.00-1.00 sec 75.5 MBytes 633 Mbits/sec 0 459 KBytes
[ 7][RX-C] 0.00-1.00 sec 90.5 MBytes 759 Mbits/sec
[ 5][TX-C] 1.00-2.00 sec 76.5 MBytes 642 Mbits/sec 0 279 KBytes
[ 7][RX-C] 1.00-2.00 sec 89.9 MBytes 754 Mbits/sec
[ 5][TX-C] 2.00-3.00 sec 78.1 MBytes 655 Mbits/sec 0 378 KBytes
[ 7][RX-C] 2.00-3.00 sec 106 MBytes 891 Mbits/sec
[ 5][TX-C] 3.00-4.00 sec 78.4 MBytes 657 Mbits/sec 0 349 KBytes
[ 7][RX-C] 3.00-4.00 sec 89.0 MBytes 747 Mbits/sec
[ 5][TX-C] 4.00-5.00 sec 77.2 MBytes 648 Mbits/sec 0 192 KBytes
[ 7][RX-C] 4.00-5.00 sec 108 MBytes 909 Mbits/sec
[ 5][TX-C] 5.00-6.00 sec 79.4 MBytes 666 Mbits/sec 0 279 KBytes
[ 7][RX-C] 5.00-6.00 sec 80.2 MBytes 673 Mbits/sec
[ 5][TX-C] 6.00-7.00 sec 74.9 MBytes 628 Mbits/sec 0 325 KBytes
[ 7][RX-C] 6.00-7.00 sec 97.5 MBytes 818 Mbits/sec
[ 5][TX-C] 7.00-8.00 sec 65.1 MBytes 546 Mbits/sec 0 296 KBytes
[ 7][RX-C] 7.00-8.00 sec 94.4 MBytes 792 Mbits/sec
[ 5][TX-C] 8.00-9.00 sec 72.2 MBytes 606 Mbits/sec 0 267 KBytes
[ 7][RX-C] 8.00-9.00 sec 99.2 MBytes 833 Mbits/sec
[ 5][TX-C] 9.00-10.00 sec 78.1 MBytes 655 Mbits/sec 0 465 KBytes
[ 7][RX-C] 9.00-10.00 sec 101 MBytes 846 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID][Role] Interval Transfer Bitrate Retr
[ 5][TX-C] 0.00-10.00 sec 756 MBytes 634 Mbits/sec 0 sender
[ 5][TX-C] 0.00-10.00 sec 754 MBytes 633 Mbits/sec receiver
[ 7][RX-C] 0.00-10.00 sec 957 MBytes 803 Mbits/sec 0 sender
[ 7][RX-C] 0.00-10.00 sec 956 MBytes 802 Mbits/sec receiver
$ iperf3 -c 192.168.3.1 --reverse -t 60
Connecting to host 192.168.3.1, port 5201
Reverse mode, remote host 192.168.3.1 is sending
[ 5] local 192.168.3.21 port 38648 connected to 192.168.3.1 port 5201
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 117 MBytes 982 Mbits/sec
[ 5] 1.00-2.00 sec 117 MBytes 984 Mbits/sec
[ 5] 2.00-3.00 sec 117 MBytes 985 Mbits/sec
[ 5] 3.00-4.00 sec 116 MBytes 976 Mbits/sec
[ 5] 4.00-5.00 sec 117 MBytes 985 Mbits/sec
[ 5] 5.00-6.00 sec 117 MBytes 984 Mbits/sec
[ 5] 6.00-7.00 sec 117 MBytes 983 Mbits/sec
^C[ 5] 7.00-7.33 sec 38.2 MBytes 983 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-7.33 sec 0.00 Bytes 0.00 bits/sec sender
[ 5] 0.00-7.33 sec 858 MBytes 982 Mbits/sec receiver
iperf3: interrupt - the client has terminated
I started by disabling CAKE SQM, and that restored the full line rate, with a barely noticeable CPU usage increase (not counting the CPU usage of iperf3 itself). That initially led me to think something in the SQM setup was causing a slowdown in the kernel (maybe the redirection of packets to the ifb interface that gets created?). However, I couldn't find any reports of this online.
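(Disabling it amounted to tearing down the qdiscs and the ingress redirect, roughly the inverse of the setup sketched above; deleting the ingress qdisc also removes the filter attached to it:)

$ tc qdisc del dev eth0 root
$ tc qdisc del dev eth0 ingress
$ tc qdisc del dev ifb0 root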
I then set up the same CAKE SQM on the device (192.168.3.21) to see if it was reproducible there, but iperf3 was able to send and receive at full line rate, so whatever it was, it was something specific to the server.
I used perf top to see if anything stood out, and the top consumer was nftables. Comparing that to perf top on the device, I didn't see nftables near the top of the list, which suggested the slowdown was related to the firewall. On a hunch, I stopped all of the pods and containers, and iperf3 was then able to achieve line rate in both directions.
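(The comparison was just a matter of running perf top on both machines while the test was active, e.g.:

$ sudo perf top -g

The nftables evaluation shows up in the kernel as symbols like nft_do_chain.)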
I looked at the generated rules, and the netavark_zone zone lists each subnet of each network as a source, rather than adding each network's interface to the zone. Since every network has three subnets (IPv4, public IPv6, and ULA IPv6), this results in 3x as many zone entries. Additionally, because I have some firewall policies between netavark_zone and the host (and other zones), extra rules are generated for each of those entries.
I modified the zone definition to use the interfaces instead of the subnets, and after making this change, iperf3 can achieve line rate with no noticeable CPU increase.
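The manual change amounts to something like the following (the subnet and interface names here are made up for illustration; netavark typically names its bridges podman1, podman2, ...):

# inspect the current per-subnet sources
$ firewall-cmd --zone=netavark_zone --list-sources
# swap a network's subnet sources for its single bridge interface
$ firewall-cmd --permanent --zone=netavark_zone --remove-source=10.89.0.0/24
$ firewall-cmd --permanent --zone=netavark_zone --add-interface=podman1
$ firewall-cmd --reload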
This suggests one or more of the following:
- Nftables is inefficient for some reason when parsing/matching IPv4 or IPv6 addresses.
- Firewalld is generating rules for nftables inefficiently and is not taking advantage of built-in features such as sets (see the sketch after this list).
- The number of rules that must be generated when matching on IPv4/IPv6 subnets instead of interfaces is so much greater that it slows down the kernel's per-packet rule evaluation.
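To illustrate point 2: nftables can match a whole collection of subnets with a single set lookup instead of one rule per subnet. A minimal standalone sketch (the table, chain, and subnets are made up for illustration):

$ nft add table inet example
$ nft add chain inet example fwd '{ type filter hook forward priority 0; }'
$ nft add set inet example container_nets '{ type ipv4_addr; flags interval; }'
$ nft add element inet example container_nets '{ 10.89.0.0/24, 10.89.1.0/24, 10.89.2.0/24 }'
# one set lookup instead of a linear walk over per-subnet rules
$ nft add rule inet example fwd ip saddr @container_nets accept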
I'm not sure about 1, but I'd like to think that address matching is fairly well optimized in the kernel. For 2, while writing this up, I checked firewalld's issue tracker and found firewalld/firewalld#1399, so this is somewhat of a known issue there. For 3, if there's no specific reason for using subnets instead of interfaces in the firewall rules, netavark could filter by interface instead, which would prevent the rule explosion caused by each network having both IPv4 and IPv6 subnets.
Can suggestion 3 be looked into?