Bug 1677323

Summary: iptables -X returns "iptables: No buffer space available" when a huge number of chains is used.
Product: Red Hat Enterprise Linux 8
Reporter: Robin Hack <rhack>
Component: iptables
Assignee: Phil Sutter <psutter>
Status: CLOSED ERRATA
QA Contact: Jiri Peska <jpeska>
Severity: medium
Priority: medium
Version: 8.0
CC: iptables-maint-list, jpeska, qe-baseos-daemons, rkhan, todoleza
Target Milestone: rc
Target Release: 8.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: iptables-1.8.2-16.el8
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-11-05 22:17:43 UTC

Description Robin Hack 2019-02-14 14:41:09 UTC
Description of problem:
With kernel 4.18.0-64,
time iptables -X
takes 47 minutes and then dies with:
iptables: No buffer space available.

With kernel 4.18.0-68,
time iptables -X
takes 1m5s (an improvement!) but still dies with:
iptables: No buffer space available.

Version-Release number of selected component (if applicable):
iptables-1.8.2-9.el8.x86_64
kernel-4.18.0-68.el8.x86_64
libnftnl-1.1.1-4.el8.x86_64

How reproducible:
always

Steps to Reproduce:
1. Follow the reproducer from:
https://bugzilla.redhat.com/show_bug.cgi?id=1647306
(it creates 200000 chains)
2. Try to remove the 200000 chains by invoking: iptables -X

Actual results:
iptables -X dies with:
iptables: No buffer space available.

Expected results:
The chains are removed.


Additional info:

Comment 1 Phil Sutter 2019-02-15 09:34:57 UTC
Hi Robin,

I can't reproduce this on a beaker machine (admittedly a pretty fat one):

[root@wsfd-netdev13 ~]# ./iptables-restore-reproducer.sh 
calling iptables-nft-restore

real	0m13.036s
user	0m12.262s
sys	0m4.245s
calling iptables-nft -F

real	1m2.740s
user	0m0.885s
sys	1m1.427s
calling iptables-nft -X

real	0m8.028s
user	0m0.266s
sys	0m7.734s
[root@wsfd-netdev13 ~]# uname -a
Linux wsfd-netdev13.ntdv.lab.eng.bos.redhat.com 4.18.0-68.el8.x86_64 #1 SMP Wed Feb 13 14:25:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
[root@wsfd-netdev13 ~]# rpm -q iptables
iptables-1.8.2-9.el8.x86_64
[root@wsfd-netdev13 ~]# rpm -q libnftnl
libnftnl-1.1.1-4.el8.x86_64
[root@wsfd-netdev13 ~]# cat iptables-restore-reproducer.sh 
#! /bin/bash
echo "calling iptables-nft-restore"
time iptables-restore <(

echo "*filter"
for i in $(seq 0 200000);do
        printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 200000);do
        printf -- "-A INPUT -j chain_%06x\n" $i
        printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT

)
echo "calling iptables-nft -F"
time iptables -F
echo "calling iptables-nft -X"
time iptables -X
[root@wsfd-netdev13 ~]#

Comment 2 Robin Hack 2019-02-15 10:22:33 UTC
Hi Phil.

# free -m
              total        used        free      shared  buff/cache   available
Mem:           1829         141        1435           8         251        1534
Swap:             0           0           0

strace output of iptables -X:
sendto(3, {{len=20, type=NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_GETCHAIN, flags=NLM_F_REQUEST|NLM_F_DUMP, seq=0, pid=0}, {nfgen_family=AF_INET, version=NFNETLINK_V0, res_id=htons(0)}, 20, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, 12) = 20
recvmsg(3, ... tons of structs ) = 16488
brk(NULL)                               = 0x56131b143000
brk(0x56131b164000)                     = 0x56131b164000

... skipped ...

sendmsg(3, tons of data ... = 12000040
select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=0}) = 1 (in [3], left {tv_sec=0, tv_usec=0})
recvmsg(3, {msg_namelen=12}, 0)         = -1 ENOBUFS (No buffer space available)

I can attach the whole strace output :).

Comment 3 Phil Sutter 2019-02-15 15:00:16 UTC
Hi Robin,

(In reply to Robin Hack from comment #2)
> Hi Phil.
> 
> # free -m
>               total        used        free      shared  buff/cache  
> available
> Mem:           1829         141        1435           8         251       
> 1534
> Swap:             0           0           0
> 
> strace output of iptables -X:
> sendto(3, {{len=20, type=NFNL_SUBSYS_NFTABLES<<8|NFT_MSG_GETCHAIN,
> flags=NLM_F_REQUEST|NLM_F_DUMP, seq=0, pid=0}, {nfgen_family=AF_INET,
> version=NFNETLINK_V0, res_id=htons(0)}, 20, 0, {sa_family=AF_NETLINK,
> nl_pid=0, nl_groups=00000000}, 12) = 20
> recvmsg(3, ... tons of structs ) = 16488
> brk(NULL)                               = 0x56131b143000
> brk(0x56131b164000)                     = 0x56131b164000
> 
> ... skipped ...
> 
> sendmsg(3, tons of data ... = 12000040
> select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=0}) = 1 (in [3], left
> {tv_sec=0, tv_usec=0})
> recvmsg(3, {msg_namelen=12}, 0)         = -1 ENOBUFS (No buffer space
> available)

Receiving ENOBUFS when calling recvmsg() typically happens if a single netlink
message exceeds the 32k max buffer size supported by the kernel. I wonder why
this doesn't happen for me, though. Could you perhaps try with a smaller number
of chains and rules? The difference should still be noticeable, and for use in
a test script a delay of over a minute is probably too large anyway.

Thanks, Phil
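
For illustration only, a minimal C sketch (not the iptables source) of how this
error surfaces: when a netlink socket's receive buffer fills up before user
space drains it, the kernel drops messages and the next recvmsg() reports
ENOBUFS. Enlarging the receive buffer with SO_RCVBUF (capped by
net.core.rmem_max) is the usual mitigation; the sizes below are assumptions.

/* sketch.c - minimal ENOBUFS illustration, not the iptables implementation */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/netlink.h>

int main(void)
{
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
        if (fd < 0) {
                perror("socket");
                return 1;
        }

        /* Illustrative size: enlarge the kernel-side receive buffer so a
         * burst of netlink (error) messages is not dropped. */
        int bufsiz = 1024 * 1024;
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsiz, sizeof(bufsiz)) < 0)
                perror("setsockopt(SO_RCVBUF)");

        /* ... send a batch of requests here, then read the responses ... */

        char buf[32768];        /* user-space read buffer */
        struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

        if (recvmsg(fd, &msg, MSG_DONTWAIT) < 0 && errno == ENOBUFS)
                fprintf(stderr, "kernel dropped replies: %s\n", strerror(errno));

        return 0;
}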

Comment 4 Robin Hack 2019-02-18 13:23:54 UTC
Hello.

Ok. It looks like
    for ((i = 0; i < 300000; ++i)); do
        printf ":chain_%06x - [0:0]\n" $i
    done
is not an issue by itself, even with that big number.

However, in combination with:
    for ((i = 0; i < 1000; ++i)); do
        printf -- "-A INPUT -j chain_%06x\n" $i
        printf -- "-A INPUT -j chain_%06x\n" $i
    done
it returns "iptables: No buffer space available",
but with smaller numbers it starts to return:
100 chains - iptables v1.8.2 (nf_tables):  CHAIN_USER_DEL failed (Device or resource busy): chain chain_000000
10 chains - iptables v1.8.2 (nf_tables):  CHAIN_USER_DEL failed (Device or resource busy): chain chain_000000

kernel-4.18.0-69.el8.x86_64
iptables-1.8.2-9.el8.x86_64
libnftnl-1.1.1-4.el8.x86_64

Comment 5 Phil Sutter 2019-07-02 15:15:59 UTC
I managed to reproduce the issue. It turned out I had missed the fact that it
happens only if one doesn't call 'iptables -F' before calling 'iptables -X'.

Sent a patch upstream, mostly to ask for advice on how to properly fix it:
https://marc.info/?l=netfilter-devel&m=156208033321053&w=2

Comment 6 Phil Sutter 2019-07-02 18:10:08 UTC
Fix sent upstream: https://marc.info/?l=netfilter-devel&m=156209061024148&w=2

Comment 7 Phil Sutter 2019-07-03 07:37:50 UTC
Upstream commit to backport:

commit d3e39e9c457f452540359e42fb58d64a28fe3e18 (origin/master, origin/HEAD)
Author: Phil Sutter <phil>
Date:   Tue Jul 2 20:30:49 2019 +0200

    nft: Set socket receive buffer
    
    When trying to delete user-defined chains in a large ruleset,
    iptables-nft aborts with "No buffer space available". This can be
    reproduced using the following script:
    
    | #! /bin/bash
    | iptables-nft-restore <(
    |
    | echo "*filter"
    | for i in $(seq 0 200000);do
    |         printf ":chain_%06x - [0:0]\n" $i
    | done
    | for i in $(seq 0 200000);do
    |         printf -- "-A INPUT -j chain_%06x\n" $i
    |         printf -- "-A INPUT -j chain_%06x\n" $i
    | done
    | echo COMMIT
    |
    | )
    | iptables-nft -X
    
    The problem seems to be the sheer amount of netlink error messages sent
    back to user space (one EBUSY for each chain). To solve this, set
    receive buffer size depending on number of commands sent to kernel.
    
    Suggested-by: Pablo Neira Ayuso <pablo>
    Signed-off-by: Phil Sutter <phil>
    Signed-off-by: Pablo Neira Ayuso <pablo>
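
As a hypothetical sketch of the approach the commit message describes (not the
actual patch): grow the netlink socket's receive buffer in proportion to the
number of commands in the batch, since each command may come back with its own
error message (one EBUSY per chain here). The helper name, per-command estimate
and floor below are illustrative assumptions.

#include <sys/socket.h>

/* Scale the receive buffer of the netlink socket nl_fd with the number of
 * commands about to be sent; the numbers are made up for illustration. */
void scale_rcvbuf(int nl_fd, unsigned int ncmds)
{
        unsigned int want = 65536 + ncmds * 1024;
        int bufsiz = (int)want;

        /* SO_RCVBUFFORCE ignores the net.core.rmem_max cap but requires
         * CAP_NET_ADMIN; fall back to plain SO_RCVBUF otherwise. */
        if (setsockopt(nl_fd, SOL_SOCKET, SO_RCVBUFFORCE,
                       &bufsiz, sizeof(bufsiz)) < 0)
                setsockopt(nl_fd, SOL_SOCKET, SO_RCVBUF,
                           &bufsiz, sizeof(bufsiz));
}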

Comment 8 Phil Sutter 2019-08-07 23:18:31 UTC
Tomas,

Please consider providing qa_ack+ here. We're a bit late with RHEL8.1 but since it is a bug fix I think it's worth trying.

Cheers, Phil

Comment 13 errata-xmlrpc 2019-11-05 22:17:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3573