Bug 1908127
| Field | Value |
|---|---|
| Summary | nftables segmentation fault with big IP set |
| Product | Red Hat Enterprise Linux 8 |
| Component | nftables |
| Version | 8.3 |
| Status | CLOSED ERRATA |
| Reporter | rubus <rubus_spam> |
| Assignee | Phil Sutter <psutter> |
| QA Contact | qe-baseos-daemons |
| Severity | unspecified |
| Priority | unspecified |
| CC | egarver, fabio.pedretti, martin.bozic, pgnet.dev, psutter, todoleza |
| Keywords | Triaged, Upstream |
| Target Milestone | rc |
| Target Release | 8.6 |
| Flags | pm-rhel: mirror+ |
| Hardware | Unspecified |
| OS | Linux |
| Fixed In Version | nftables-0.9.3-24.el8 |
| Doc Type | No Doc Update |
| Clones | 2020668, 2040754 (view as bug list) |
| Bug Blocks | 2040754, 2047821 |
| Type | Bug |
| Last Closed | 2022-05-10 15:17:29 UTC |
Yes, loading a huge ruleset requires a large amount of memory. If nft segfaults, did you check 'dmesg' output? Maybe this is an OOM situation and the kernel killed the process?

A fix preventing the segfault has been submitted upstream: https://lore.kernel.org/netfilter-devel/20210609140233.8085-1-phil@nwl.cc/

The memory overhead pretty much looks like a design issue; each element eats quite a lot of memory in user space. Splitting the set of elements to add into chunks would help, although that's obviously not an ideal solution.

Rubus, are you confined on memory? If so, how much RAM does your system have?

Cheers, Phil

See also this firewalld issue report: https://github.com/firewalld/firewalld/issues/738

I made it on a VM with 8 or 16 GB of RAM. That VM is now gone, so I can't check it on that system.
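Phil's suggestion above of splitting the elements over multiple transactions could look roughly like the following sketch. This is purely illustrative; the table/set names and the chunk size are assumptions, not part of this bug report, and each generated script would still have to be fed to `nft -f` one at a time.

```python
# Illustrative sketch: batch a huge element list into several
# small `add element` transactions, so a single nft invocation
# never has to hold the whole set in user-space memory.
# Table name, set name and CHUNK size are made-up examples.

CHUNK = 100_000  # elements per transaction (assumed value)

def make_batches(elements, table="inet filter", set_name="bad_actors", chunk=CHUNK):
    """Yield one nft script per chunk of elements."""
    for i in range(0, len(elements), chunk):
        batch = elements[i:i + chunk]
        yield "add element %s %s { %s }\n" % (table, set_name, ", ".join(batch))

if __name__ == "__main__":
    elems = ["0.0.0.0/8", "1.0.1.0/24", "2.0.0.0/16"]
    for script in make_batches(elems, chunk=2):
        print(script, end="")
        # each script would be piped to `nft -f -` in turn
```

As Phil notes, this is a workaround for the user-space memory overhead, not a fix for the segfault itself.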
I did not check dmesg, sorry.
Today I checked it on a fresh install of RHEL 8.4 (16 GB RAM):
[rubus@playground Downloads]$ nft -f main.nft
Segmentation fault (core dumped)
[rubus@playground Downloads]$ free -h
total used free shared buff/cache available
Mem: 15Gi 1.1Gi 13Gi 16Mi 1.0Gi 14Gi
Swap: 7.9Gi 0B 7.9Gi
dmesg:
[ 35.593088] nft[3218]: segfault at 7fff859005b8 ip 00007fa4795d2a27 sp 00007fff858ff5c0 error 6 in libnftables.so.1.0.0[7fa479597000+95000]
[ 35.593095] Code: 00 00 00 48 89 d0 48 c1 e8 04 48 c1 e0 04 48 89 c1 48 81 e1 00 f0 ff ff 48 29 ce 48 89 f1 48 39 cc 74 15 48 81 ec 00 10 00 00 <48> 83 8c 24 f8 0f 00 00 00 48 39 cc 75 eb 25 ff 0f 00 00 0f 85 10
As you wrote in the fix message, 3 GB of free RAM should be enough. Back to your question, I think it wasn't OOM, but I didn't check on that old VM.
Phil, does RHEL 8.4 contain your patch?
Is the current dmesg message enough to be sure that it's not an OOM problem? If not, how do I check?
#dnf list installed | grep nftables
nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
python3-nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
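An editorial aside on the dmesg trap above: the fault address sits exactly 0xff8 bytes above the reported stack pointer, which matches the probe sequence visible in the Code: dump (`sub rsp, 0x1000` followed by a write at `rsp+0xff8`) and looks like classic stack exhaustion rather than an OOM kill. A small sketch of that arithmetic, using the two addresses copied from the dmesg line:

```python
# Relate the fault address from the dmesg line to the saved
# stack pointer. Both hex values are copied from the report.
FAULT_ADDR = 0x7fff859005b8  # "segfault at 7fff859005b8"
STACK_PTR  = 0x7fff858ff5c0  # "sp 00007fff858ff5c0"

offset = FAULT_ADDR - STACK_PTR
print(hex(offset))  # prints 0xff8
```

The faulting write landing at `rsp+0xff8` right after the stack pointer was dropped by a page is consistent with the stack-overflow diagnosis in the upstream fix.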
Hi!

(In reply to rubus from comment #5)
[...]
> As You write in fix message 3G of free RAM should be enough. Back to your
> question, I thing it wasn't OOM but I didn't check on that old VM.
>
> Phil do RHEL 8.4 contain your patch?

No, it does not. So nftables segfaulting in 8.4 likely has the same cause as described in the fix.

> Is current message dmesg message enough to be sure that it's not OOM
> problem? If not, how to do it?

Yes, it should be enough. If the kernel had killed nftables, it would brag about it in dmesg.

> #dnf list installed | grep nftables
> nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
> python3-nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms

Since you have a new machine to test with, would you mind giving this scratch build a test? http://people.redhat.com/~psutter/nftables-0.9.3-18.el8_4.huge_set_segfault/

You may end up with ENOBUFS; I haven't found a solution for that yet. It should be possible to avoid that error by splitting the elements to add over multiple transactions.

Thanks, Phil

(In reply to Phil Sutter from comment #6)
> http://people.redhat.com/~psutter/nftables-0.9.3-18.el8_4.huge_set_segfault/

I was too quick, these RPMs won't install in RHEL 8.4. I'll update them in a bunch, please don't install until I get back.

(In reply to Phil Sutter from comment #7)
> I was too quick, these RPMs won't install in RHEL8.4. I'll update them in a
> bunch, please don't install until I get back.

Now it's fine.

Hi Phil,
I installed your version of nftables, but I think it solves only part of the problem.
I tried to load the whole set:
# /usr/bin/time -v nft -f main.nft
and started the top command:
top - 16:11:53 up 23 min, 1 user, load average: 2.18, 1.55, 0.74
Tasks: 292 total, 2 running, 290 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.8 us, 0.4 sy, 0.0 ni, 86.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15831.4 total, 10222.5 free, 3937.2 used, 1671.7 buff/cache
MiB Swap: 8076.0 total, 8076.0 free, 0.0 used. 11394.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
945 root 16 -4 261072 215628 2880 R 100.0 1.3 3:16.82 sedispatch
62 root 20 0 0 0 0 I 1.7 0.0 0:18.05 kworker/u16:5-xfs-cil/dm-0
943 root 16 -4 147496 4596 2132 S 1.3 0.0 0:04.14 auditd
2449 mm 20 0 5495628 513108 148748 S 1.3 3.2 1:21.09 gnome-shell
33904 root 20 0 1286752 1.1g 3908 D 0.7 7.4 0:07.13 nft
and after an hour:
top - 17:07:27 up 1:18, 1 user, load average: 2.05, 2.13, 2.13
Tasks: 294 total, 4 running, 290 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.6 us, 0.4 sy, 0.0 ni, 86.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15831.4 total, 9813.6 free, 4376.2 used, 1641.5 buff/cache
MiB Swap: 8076.0 total, 8076.0 free, 0.0 used. 10983.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
945 root 16 -4 858900 813436 2880 R 100.0 5.0 58:49.21 sedispatch
35321 root 20 0 0 0 0 I 1.3 0.0 0:00.16 kworker/u16:0-xfs-cil/dm-0
943 root 16 -4 147556 5296 2132 R 1.0 0.0 0:44.18 auditd
2449 mm 20 0 5540584 551212 158752 S 0.7 3.4 2:10.26 gnome-shell
33904 root 20 0 1286752 1.2g 3908 D 0.7 7.7 0:09.59 nft
1 root 20 0 252876 15352 9804 S 0.0 0.1 0:01.65 systemd
The sedispatch process eats CPU and slowly takes more RAM, from 1.3% to 5% after an hour.
I'll wait to see whether 16 GB of RAM is enough to load this set and how long it takes.
If you have any ideas for measurements of the running processes, let me know.
Cheers, Rubus
So it's possible, but it takes more than 2 h and about 2 GB (observation, not measurement) of RAM.

/usr/bin/time -v nft -f main.nft
Command being timed: "nft -f main.nft"
User time (seconds): 5.19
System time (seconds): 6.66
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:14:22
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1859840
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 465057
Voluntary context switches: 293
Involuntary context switches: 249
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

dmesg is clear. Any idea why sedispatch takes so much time? If someone is as crazy as me, they can make a 2-hour DoS for themselves. :)

Hi,

(In reply to rubus from comment #10)
> Any idea why sedispatch take so much time?

I guess you're suffering from overly verbose auditd logging. This is a current problem in RHEL 8.4 when applying large nftables rulesets. Auditd logs every single element IIRC, and sedispatch tries to make sense out of the logs. This is fixed in unreleased RHEL 8.5 already, and we're discussing a backport of the relevant bits to RHEL 8.4.

A workaround should be to:

| auditctl -A exclude,never -F msgtype=NETFILTER_CFG

FWIW, on a (pretty beefy) RHEL 8.5 machine your ruleset applied within 8 s. :)

Please let me know whether the above auditctl command restores ruleset apply time to sane values. If so, we're good after backporting the fix from comment 3.

Thanks, Phil

(In reply to Phil Sutter from comment #11)
> Please let me know whether the above auditctl command restores ruleset apply
> time to sane values. If so, we're good after backporting the fix from comment
> 3.

It helped, thank you. Time to load the ruleset is now sane, ~8 s on my VM.

Thanks, Rubus

Great, thanks for validating!
Upstream commit to backport:
commit baecd1cf26851a4c5b7d469206a488f14fe5b147
Author: Phil Sutter <phil>
Date: Wed Jun 9 15:49:52 2021 +0200
segtree: Fix segfault when restoring a huge interval set
Restoring a set of IPv4 prefixes with about 1.1M elements crashes nft as
set_to_segtree() exhausts the stack. Prevent this by allocating the
pointer array on heap and make sure it is freed before returning to
caller.
With this patch in place, restoring said set succeeds with allocation of
about 3GB of memory, according to valgrind.
Signed-off-by: Phil Sutter <phil>
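The set_to_segtree() path named in the commit turns the element list into non-overlapping intervals. As a rough illustration of that interval-merging idea only (this is not nft's actual algorithm or data layout), the step can be written iteratively, so the work grows with the element count while stack usage stays flat:

```python
# Illustrative sketch: collapse IPv4 CIDR prefixes into sorted,
# non-overlapping (first, last) address intervals, iteratively.
# A loose analogue of the interval handling in nft's segtree
# build, not a reimplementation of it.
import ipaddress

def merge_prefixes(prefixes):
    """Return sorted, merged (first, last) integer address ranges."""
    ranges = sorted(
        (int(net.network_address), int(net.broadcast_address))
        for net in map(ipaddress.ip_network, prefixes)
    )
    merged = []
    for first, last in ranges:
        # merge overlapping or directly adjacent ranges
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged
```

The point of the upstream fix is exactly this kind of shape: with ~1.1M elements, any per-element storage has to live on the heap, because a pointer array of that size blows through the default stack.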
For testing/verification, we will also need to backport upstream commit
d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when restoring a huge interval set")")
Phil, please confirm the added backport and the time frame (ITM-11 still seems manageable); I will then give qa_ack so that we can get this in. Thanks.
(In reply to Štěpán Němec from comment #14)
> For testing/verification, we will also need to backport upstream commit
> d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when
> restoring a huge interval set")")

Oh, yes of course. Thanks for the reminder!

> Phil, please confirm the added backport and the time frame (ITM-11 still
> seems manageable), I will then give qa_ack so that we can get this in.

Yes, please.

I have verified that the new build (nftables-0.9.3-23.el8) fixes the issue on all architectures we routinely test on (x86_64, aarch64, ppc64le, s390x): the computation now only takes time proportional to the set size, but the stack usage does not increase. Nevertheless, the test (comment 14) needs fixing to become more reliable across different architectures, so I'm extending the ITM deadline to be able to come up with a better automated test.

nftables-0.9.3-24.el8 contains two more backports improving
the upstream test:
7b81d9cb094f ("tests: shell: better parameters for the interval stack overflow test")
dad3338f1f76 ("tests: shell: $NFT needs to be invoked unquoted")
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (nftables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:2004
Created attachment 1739468 [details]
configuration files to reproduce the bug

Description of problem:
Trying to make a rule to drop packets from a big set of IP ranges. The total number of elements in the set is 1103505. The set of IPs is kept as a variable in a separate file (bad_actors_set):

define bad_actors_set={
0.0.0.0/8,
1.0.1.0/24,
and a lot more... total 1103505 CIDRs
}

The set definition and the rest of the configuration are in the main.nft file:

set bad_actors{
type ipv4_addr
flags interval
elements={$bad_actors_set}
}

I get a segmentation fault when I try to reload nftables. The files are in the attachment.

Version-Release number of selected component (if applicable):
nftables v0.9.3 (Topsy)

How reproducible:
Follow the instructions on a clean install.

Steps to Reproduce:
1. systemctl disable --now firewalld
2. tar -x nft_conf.xz
3. nft -f main.nft

Actual results:
# nft -f main.nft
Segmentation fault

Expected results:
# nft -f main.nft
#

Additional info:
For a slightly smaller set the reload sometimes succeeds. Way to create the smaller set:
# cp bad_actors_set bad_actors_set.back
# head -n 1047500 bad_actors_set.back > bad_actors_set ; echo "}" >> bad_actors_set
If nft -f main.nft succeeds, memory usage is huge.
# echo "bad_actors_set size: $( wc -l bad_actors_set)" ; /usr/bin/time -v nft -f main.nft
bad_actors_set size: 1047501 bad_actors_set
Command being timed: "nft -f main.nft"
User time (seconds): 4.74
System time (seconds): 1.93
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
---->Maximum resident set size (kbytes): 1773088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 443932
Voluntary context switches: 8
Involuntary context switches: 21
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

A smaller set doesn't guarantee success:

# echo "bad_actors_set size: $( wc -l bad_actors_set)" ; for i in {0..10}; do echo $i; nft -f main.nft; done
bad_actors_set size: 1047501 bad_actors_set
0
Segmentation fault (core dumped)
1
Segmentation fault (core dumped)
2
Segmentation fault (core dumped)
3
Segmentation fault (core dumped)
4
5
Segmentation fault (core dumped)
6
Segmentation fault (core dumped)
7
8
Segmentation fault (core dumped)
9
10
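For anyone without access to attachment 1739468, a synthetic stand-in for the bad_actors_set file can be generated along these lines. This is an illustrative sketch only: the prefixes are made-up /24s, not the attachment's real-world CIDRs, so memory figures will differ, but the `define bad_actors_set={ ... }` shape matches the description above.

```python
# Illustrative sketch: write a synthetic element file in the
# `define bad_actors_set={ ... }` format from the bug report.
# The generated /24 prefixes are fabricated test data.

def write_set(path, count):
    """Write `count` synthetic /24 prefixes to `path`."""
    with open(path, "w") as f:
        f.write("define bad_actors_set={\n")
        for i in range(count):
            # spread prefixes over distinct octet combinations
            a, b, c = 11 + i // 65536, (i // 256) % 256, i % 256
            sep = "," if i < count - 1 else ""
            f.write("%d.%d.%d.0/24%s\n" % (a, b, c, sep))
        f.write("}\n")

if __name__ == "__main__":
    # the report used ~1.1M elements; use a small count for a smoke test
    write_set("bad_actors_set", 1_103_505)
```

Loading the result with `nft -f main.nft` (as in the Steps to Reproduce) on an unpatched 0.9.3 build should then exercise the same segtree stack-exhaustion path.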