Bug 1908127
| Field | Value |
|---|---|
| Summary | nftables segmentation fault with big IP set |
| Product | Red Hat Enterprise Linux 8 |
| Component | nftables |
| Version | 8.3 |
| Status | CLOSED ERRATA |
| Reporter | rubus <rubus_spam> |
| Assignee | Phil Sutter <psutter> |
| QA Contact | qe-baseos-daemons |
| Severity | unspecified |
| Priority | unspecified |
| CC | egarver, fabio.pedretti, martin.bozic, pgnet.dev, psutter, todoleza |
| Keywords | Triaged, Upstream |
| Target Milestone | rc |
| Target Release | 8.6 |
| Flags | pm-rhel: mirror+ |
| Hardware | Unspecified |
| OS | Linux |
| Fixed In Version | nftables-0.9.3-24.el8 |
| Doc Type | No Doc Update |
| Clones | 2020668, 2040754 (view as bug list) |
| Bug Blocks | 2040754, 2047821 |
| Type | Bug |
| Last Closed | 2022-05-10 15:17:29 UTC |
Yes, loading a huge ruleset requires a large amount of memory. If nft segfaults, did you check 'dmesg' output? Maybe this is an OOM situation and the kernel killed the process?

A fix preventing the segfault has been submitted upstream: https://lore.kernel.org/netfilter-devel/20210609140233.8085-1-phil@nwl.cc/

The memory overhead pretty much looks like a design issue; each element eats quite a lot of memory in user space. Splitting the set of elements to add into chunks would help, although that's obviously not an ideal solution.

Rubus, are you confined on memory? If so, how much RAM does your system have?

Cheers, Phil

See also this firewalld issue report: https://github.com/firewalld/firewalld/issues/738

I made it on a VM with 8 or 16 GB of RAM. That VM is now gone, so I can't check it on that system.
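Phil's suggestion above of splitting the elements over multiple transactions could look roughly like the following sketch. This is purely illustrative; the table/set names and the chunk size are assumptions, not part of this bug report, and each generated script would still have to be fed to `nft -f` one at a time.

```python
# Illustrative sketch: batch a huge element list into several
# small `add element` transactions, so a single nft invocation
# never has to hold the whole set in user-space memory.
# Table name, set name and CHUNK size are made-up examples.

CHUNK = 100_000  # elements per transaction (assumed value)

def make_batches(elements, table="inet filter", set_name="bad_actors", chunk=CHUNK):
    """Yield one nft script per chunk of elements."""
    for i in range(0, len(elements), chunk):
        batch = elements[i:i + chunk]
        yield "add element %s %s { %s }\n" % (table, set_name, ", ".join(batch))

if __name__ == "__main__":
    elems = ["0.0.0.0/8", "1.0.1.0/24", "2.0.0.0/16"]
    for script in make_batches(elems, chunk=2):
        print(script, end="")
        # each script would be piped to `nft -f -` in turn
```

As Phil notes, this is a workaround for the user-space memory overhead, not a fix for the segfault itself.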
I did not check dmesg, sorry.
Today I checked it on a fresh install of RHEL 8.4 (16 GB RAM):
[rubus@playground Downloads]$ nft -f main.nft
Segmentation fault (core dumped)
[rubus@playground Downloads]$ free -h
total used free shared buff/cache available
Mem: 15Gi 1.1Gi 13Gi 16Mi 1.0Gi 14Gi
Swap: 7.9Gi 0B 7.9Gi
dmesg:
[ 35.593088] nft[3218]: segfault at 7fff859005b8 ip 00007fa4795d2a27 sp 00007fff858ff5c0 error 6 in libnftables.so.1.0.0[7fa479597000+95000]
[ 35.593095] Code: 00 00 00 48 89 d0 48 c1 e8 04 48 c1 e0 04 48 89 c1 48 81 e1 00 f0 ff ff 48 29 ce 48 89 f1 48 39 cc 74 15 48 81 ec 00 10 00 00 <48> 83 8c 24 f8 0f 00 00 00 48 39 cc 75 eb 25 ff 0f 00 00 0f 85 10
As you wrote in the fix message, 3 GB of free RAM should be enough. Back to your question, I think it wasn't OOM, but I didn't check on that old VM.
Phil, does RHEL 8.4 contain your patch?
Is the current dmesg message enough to be sure that it's not an OOM problem? If not, how do I check?
#dnf list installed | grep nftables
nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
python3-nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
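An editorial aside on the dmesg trap above: the fault address sits exactly 0xff8 bytes above the reported stack pointer, which matches the probe sequence visible in the Code: dump (`sub rsp, 0x1000` followed by a write at `rsp+0xff8`) and looks like classic stack exhaustion rather than an OOM kill. A small sketch of that arithmetic, using the two addresses copied from the dmesg line:

```python
# Relate the fault address from the dmesg line to the saved
# stack pointer. Both hex values are copied from the report.
FAULT_ADDR = 0x7fff859005b8  # "segfault at 7fff859005b8"
STACK_PTR  = 0x7fff858ff5c0  # "sp 00007fff858ff5c0"

offset = FAULT_ADDR - STACK_PTR
print(hex(offset))  # prints 0xff8
```

The faulting write landing at `rsp+0xff8` right after the stack pointer was dropped by a page is consistent with the stack-overflow diagnosis in the upstream fix.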
Hi!

(In reply to rubus from comment #5)
[...]
> As You write in fix message 3G of free RAM should be enough. Back to your
> question, I thing it wasn't OOM but I didn't check on that old VM.
>
> Phil do RHEL 8.4 contain your patch?

No, it does not. So nftables segfaulting in 8.4 likely has the same cause as described in the fix.

> Is current message dmesg message enough to be sure that it's not OOM
> problem? If not, how to do it?

Yes, it should be enough. If the kernel had killed nftables, it would brag about it in dmesg.

> #dnf list installed | grep nftables
> nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms
> python3-nftables.x86_64 1:0.9.3-18.el8 @rhel-8-for-x86_64-baseos-rpms

Since you have a new machine to test with, would you mind giving this scratch build a test? http://people.redhat.com/~psutter/nftables-0.9.3-18.el8_4.huge_set_segfault/

You may end up with ENOBUFS; I haven't found a solution for that yet. It should be possible to avoid that error by splitting the elements to add over multiple transactions.

Thanks, Phil

(In reply to Phil Sutter from comment #6)
> http://people.redhat.com/~psutter/nftables-0.9.3-18.el8_4.huge_set_segfault/

I was too quick, these RPMs won't install in RHEL 8.4. I'll update them in a bunch, please don't install until I get back.

(In reply to Phil Sutter from comment #7)
> I was too quick, these RPMs won't install in RHEL8.4. I'll update them in a
> bunch, please don't install until I get back.

Now it's fine.

Hi Phil,
I installed your version of nftables, but I think it solves only part of the problem.
I tried to load the whole set:
# /usr/bin/time -v nft -f main.nft
and started the top command:
top - 16:11:53 up 23 min, 1 user, load average: 2.18, 1.55, 0.74
Tasks: 292 total, 2 running, 290 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.8 us, 0.4 sy, 0.0 ni, 86.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15831.4 total, 10222.5 free, 3937.2 used, 1671.7 buff/cache
MiB Swap: 8076.0 total, 8076.0 free, 0.0 used. 11394.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
945 root 16 -4 261072 215628 2880 R 100.0 1.3 3:16.82 sedispatch
62 root 20 0 0 0 0 I 1.7 0.0 0:18.05 kworker/u16:5-xfs-cil/dm-0
943 root 16 -4 147496 4596 2132 S 1.3 0.0 0:04.14 auditd
2449 mm 20 0 5495628 513108 148748 S 1.3 3.2 1:21.09 gnome-shell
33904 root 20 0 1286752 1.1g 3908 D 0.7 7.4 0:07.13 nft
and after an hour:
top - 17:07:27 up 1:18, 1 user, load average: 2.05, 2.13, 2.13
Tasks: 294 total, 4 running, 290 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.6 us, 0.4 sy, 0.0 ni, 86.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 15831.4 total, 9813.6 free, 4376.2 used, 1641.5 buff/cache
MiB Swap: 8076.0 total, 8076.0 free, 0.0 used. 10983.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
945 root 16 -4 858900 813436 2880 R 100.0 5.0 58:49.21 sedispatch
35321 root 20 0 0 0 0 I 1.3 0.0 0:00.16 kworker/u16:0-xfs-cil/dm-0
943 root 16 -4 147556 5296 2132 R 1.0 0.0 0:44.18 auditd
2449 mm 20 0 5540584 551212 158752 S 0.7 3.4 2:10.26 gnome-shell
33904 root 20 0 1286752 1.2g 3908 D 0.7 7.7 0:09.59 nft
1 root 20 0 252876 15352 9804 S 0.0 0.1 0:01.65 systemd
The sedispatch process eats CPU and slowly takes more RAM, from 1.3% to 5% after an hour.
I'll wait to see whether 16 GB of RAM is enough to load this set and how long it takes.
If you have any ideas for measurements of the running processes, let me know.
Cheers, Rubus
So it's possible, but it takes more than 2 h and about 2 GB (observation, not measurement) of RAM.

/usr/bin/time -v nft -f main.nft
Command being timed: "nft -f main.nft"
User time (seconds): 5.19
System time (seconds): 6.66
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:14:22
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1859840
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 465057
Voluntary context switches: 293
Involuntary context switches: 249
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

dmesg is clear. Any idea why sedispatch takes so much time? If someone is as crazy as me, they can make a 2-hour DoS for themselves. :)

Hi,

(In reply to rubus from comment #10)
> Any idea why sedispatch take so much time?

I guess you're suffering from overly verbose auditd logging. This is a current problem in RHEL 8.4 when applying large nftables rulesets. Auditd logs every single element IIRC, and sedispatch tries to make sense out of the logs. This is fixed in unreleased RHEL 8.5 already, and we're discussing a backport of the relevant bits to RHEL 8.4.

A workaround should be to:

| auditctl -A exclude,never -F msgtype=NETFILTER_CFG

FWIW, on a (pretty beefy) RHEL 8.5 machine your ruleset applied within 8 s. :)

Please let me know whether the above auditctl command restores ruleset apply time to sane values. If so, we're good after backporting the fix from comment 3.

Thanks, Phil

(In reply to Phil Sutter from comment #11)
> Please let me know whether the above auditctl command restores ruleset apply
> time to sane values. If so, we're good after backporting the fix from comment
> 3.

It helped, thank you. Time to load the ruleset is now sane, ~8 s on my VM.

Thanks, Rubus

Great, thanks for validating!
Upstream commit to backport:
commit baecd1cf26851a4c5b7d469206a488f14fe5b147
Author: Phil Sutter <phil>
Date: Wed Jun 9 15:49:52 2021 +0200
segtree: Fix segfault when restoring a huge interval set
Restoring a set of IPv4 prefixes with about 1.1M elements crashes nft as
set_to_segtree() exhausts the stack. Prevent this by allocating the
pointer array on heap and make sure it is freed before returning to
caller.
With this patch in place, restoring said set succeeds with allocation of
about 3GB of memory, according to valgrind.
Signed-off-by: Phil Sutter <phil>
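The set_to_segtree() path named in the commit turns the element list into non-overlapping intervals. As a rough illustration of that interval-merging idea only (this is not nft's actual algorithm or data layout), the step can be written iteratively, so the work grows with the element count while stack usage stays flat:

```python
# Illustrative sketch: collapse IPv4 CIDR prefixes into sorted,
# non-overlapping (first, last) address intervals, iteratively.
# A loose analogue of the interval handling in nft's segtree
# build, not a reimplementation of it.
import ipaddress

def merge_prefixes(prefixes):
    """Return sorted, merged (first, last) integer address ranges."""
    ranges = sorted(
        (int(net.network_address), int(net.broadcast_address))
        for net in map(ipaddress.ip_network, prefixes)
    )
    merged = []
    for first, last in ranges:
        # merge overlapping or directly adjacent ranges
        if merged and first <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], last))
        else:
            merged.append((first, last))
    return merged
```

The point of the upstream fix is exactly this kind of shape: with ~1.1M elements, any per-element storage has to live on the heap, because a pointer array of that size blows through the default stack.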
For testing/verification, we will also need to backport upstream commit
d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when restoring a huge interval set")")
Phil, please confirm the added backport and the time frame (ITM-11 still seems manageable); I will then give qa_ack so that we can get this in. Thanks.
(In reply to Štěpán Němec from comment #14)
> For testing/verification, we will also need to backport upstream commit
> d8ccad2a2b73 ("tests: cover baecd1cf2685 ("segtree: Fix segfault when
> restoring a huge interval set")")

Oh, yes of course. Thanks for the reminder!

> Phil, please confirm the added backport and the time frame (ITM-11 still
> seems manageable), I will then give qa_ack so that we can get this in.

Yes, please.

I have verified that the new build (nftables-0.9.3-23.el8) fixes the issue on all architectures we routinely test on (x86_64, aarch64, ppc64le, s390x): the computation now only takes time proportional to the set size, but the stack usage does not increase. Nevertheless, the test (comment 14) needs fixing to become more reliable across different architectures, so I'm extending the ITM deadline to be able to come up with a better automated test.

nftables-0.9.3-24.el8 contains two more backports improving
the upstream test:
7b81d9cb094f ("tests: shell: better parameters for the interval stack overflow test")
dad3338f1f76 ("tests: shell: $NFT needs to be invoked unquoted")
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (nftables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:2004
Created attachment 1739468 [details]
configuration files to reproduce the bug

Description of problem:
Trying to make a rule to drop packets from a big set of IP ranges. The total number of elements in the set is 1103505. The set of IPs is kept as a variable in a separate file (bad_actors_set):

define bad_actors_set={
0.0.0.0/8,
1.0.1.0/24,
and a lot more... total 1103505 CIDRs
}

The set definition and the rest of the configuration are in the main.nft file:

set bad_actors{
type ipv4_addr
flags interval
elements={$bad_actors_set}
}

I get a segmentation fault when I try to reload nftables. The files are in the attachment.

Version-Release number of selected component (if applicable):
nftables v0.9.3 (Topsy)

How reproducible:
Follow the instructions on a clean install.

Steps to Reproduce:
1. systemctl disable --now firewalld
2. tar -x nft_conf.xz
3. nft -f main.nft

Actual results:
# nft -f main.nft
Segmentation fault

Expected results:
# nft -f main.nft
#

Additional info:
For a slightly smaller set the reload sometimes succeeds. Way to create the smaller set:
# cp bad_actors_set bad_actors_set.back
# head -n 1047500 bad_actors_set.back > bad_actors_set ; echo "}" >> bad_actors_set
If nft -f main.nft succeeds, memory usage is huge.
# echo "bad_actors_set size: $( wc -l bad_actors_set)" ; /usr/bin/time -v nft -f main.nft
bad_actors_set size: 1047501 bad_actors_set
Command being timed: "nft -f main.nft"
User time (seconds): 4.74
System time (seconds): 1.93
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.71
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
---->Maximum resident set size (kbytes): 1773088
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 443932
Voluntary context switches: 8
Involuntary context switches: 21
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

A smaller set doesn't guarantee success:

# echo "bad_actors_set size: $( wc -l bad_actors_set)" ; for i in {0..10}; do echo $i; nft -f main.nft; done
bad_actors_set size: 1047501 bad_actors_set
0
Segmentation fault (core dumped)
1
Segmentation fault (core dumped)
2
Segmentation fault (core dumped)
3
Segmentation fault (core dumped)
4
5
Segmentation fault (core dumped)
6
Segmentation fault (core dumped)
7
8
Segmentation fault (core dumped)
9
10
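For anyone without access to attachment 1739468, a synthetic stand-in for the bad_actors_set file can be generated along these lines. This is an illustrative sketch only: the prefixes are made-up /24s, not the attachment's real-world CIDRs, so memory figures will differ, but the `define bad_actors_set={ ... }` shape matches the description above.

```python
# Illustrative sketch: write a synthetic element file in the
# `define bad_actors_set={ ... }` format from the bug report.
# The generated /24 prefixes are fabricated test data.

def write_set(path, count):
    """Write `count` synthetic /24 prefixes to `path`."""
    with open(path, "w") as f:
        f.write("define bad_actors_set={\n")
        for i in range(count):
            # spread prefixes over distinct octet combinations
            a, b, c = 11 + i // 65536, (i // 256) % 256, i % 256
            sep = "," if i < count - 1 else ""
            f.write("%d.%d.%d.0/24%s\n" % (a, b, c, sep))
        f.write("}\n")

if __name__ == "__main__":
    # the report used ~1.1M elements; use a small count for a smoke test
    write_set("bad_actors_set", 1_103_505)
```

Loading the result with `nft -f main.nft` (as in the Steps to Reproduce) on an unpatched 0.9.3 build should then exercise the same segtree stack-exhaustion path.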