Bug 2127774
| Summary: | nft: netlink_delinearize.c:2695: netlink_delinearize_rule: Assertion `pctx->table != NULL' failed. | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Jonathan Maxwell <jmaxwell> | |
| Component: | nftables | Assignee: | Phil Sutter <psutter> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | qe-baseos-daemons | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 8.5 | CC: | egarver, jpeska, mleitner, psutter, qe-baseos-daemons, sukulkar, todoleza | |
| Target Milestone: | rc | Keywords: | TestOnly, Triaged | |
| Target Release: | 8.8 | Flags: | pm-rhel: mirror+ | |
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | nftables-1.0.4-2.el8 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2130721 | Environment: | ||
| Last Closed: | 2024-06-05 12:10:22 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 2211076 | |||
| Bug Blocks: | 2130721 | |||
|
Comment 3
Phil Sutter
2022-09-20 17:10:13 UTC
I'm currently looking at the C01S01F1 core dump. It seems lookup is for IPv6 family table "filter", while cache contains only IPv4 family table "filter":

(gdb) print h->family
$12 = 10
(gdb) print h->table.name
$13 = 0x55b359887980 "filter"
(gdb) print ((struct table *)(ctx->nft->cache->list))->handle.family
$14 = 2
(gdb) print ((struct table *)(ctx->nft->cache->list))->handle.table.name
$15 = 0x55b359887d70 "filter"
(gdb) print ctx->nft->cache->list->next == ctx->nft->cache->list->prev
$16 = 1

I did not find a related fix yet.

According to the dump, the command was 'nft monitor rules'.

When monitoring, nft has to consistently keep the cache up to date. It reused the events for this: A new table event will add said table to the cache.

If a new rule event is received, nft assumes it has either seen the rule's table and chain at startup (where initially a full cache is fetched) or there must have been a new table/chain event prior to the new rule event.

The dump indicates nft either missed this new table event or it is possible somehow to make it drop the cache when it should not. Or the kernel did not send the new table event.

Either way, I'm a bit out of ideas when it comes to reproducing the problem, also I didn't find a potential fix in nftables git history at least.

How frequent does it happen in the customer's env? Are they able to reproduce the crash?

(In reply to Phil Sutter from comment #3)
> Hi Jon,
>
> The assert there triggers if the rule received from kernel references a
> table that doesn't exist in cache. This is a bit odd because in cache_init()
> (src/rule.c), table cache is unconditionally populated before fetching rule
> cache.
>
> Do you know how nft was called and what the ruleset was? Maybe there's a bug
> in RHEL8.5 nftables. This is nftables-0.9.3-21.el8, right?
>
> Cheers, Phil

Hi Phil,

Yes, it's:

nftables-0.9.3-21.el8.x86_64

Regards

Jon

(In reply to Phil Sutter from comment #6)
> According to the dump, the command was 'nft monitor rules'.
>
> When monitoring, nft has to consistently keep the cache up to date. It reused
> the events for this: A new table event will add said table to the cache.
>
> If a new rule event is received, nft assumes it has either seen the rule's
> table and chain at startup (where initially a full cache is fetched) or there
> must have been a new table/chain event prior to the new rule event.
>
> The dump indicates nft either missed this new table event or it is possible
> somehow to make it drop the cache when it should not. Or the kernel did not
> send the new table event.
>
> Either way, I'm a bit out of ideas when it comes to reproducing the problem,
> also I didn't find a potential fix in nftables git history at least.

Thanks for the hypothesis, Phil, that makes sense.

> How frequent does it happen in the customer's env? Are they able to reproduce
> the crash?

They are not able to reproduce per se. They said "It happens only on some system and not always." I have asked if they can provide us with the application code and a reproducer if possible.

Regards

Jon

Turns out my attempts at reproducing the issue were just not pressing enough: I
had tried to run 'iptables -A' and 'nft flush ruleset' in a loop and called
'nft monitor' a few times. But since this is a race condition between 'nft add
table' and 'nft monitor' startup, it doesn't happen as often. Starting and killing 'nft monitor' in a loop as well did the trick:
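| #!/bin/bash
|
| # Writer loop: keep flushing and re-creating the ruleset in the background.
| while true; do
|     ./install/sbin/nft flush ruleset
|     nft -f - <<-EOF
| table t {
|     chain c {
|         counter
|     }
| }
| EOF
| done &
| maniploop=$!
|
| # On exit, stop the writer loop and the most recent monitor, then reap them.
| trap "kill $maniploop; kill \$!; wait" EXIT
|
| # Monitor loop: start 'nft monitor' and kill it shortly after, over and over,
| # so that its startup keeps racing with the ruleset changes above.
| while true; do
|     ./install/sbin/nft monitor rules >/dev/null &
|     sleep 0.2
|     kill $!
| done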
I tried to make 'nft monitor' refresh cache once after receiving the first
event, but it made the abort more likely. I'll try to eliminate the assert()
calls next week - they're bad practice within a library anyway.
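For illustration only, here is a minimal sketch of what replacing such an assert() with an error return could look like. It is not the nftables code: `struct table_entry`, `cache_find_table()` and `handle_rule_event()` are hypothetical names, and the cache is reduced to a plain array. The lookup matches on family and name together (as in the gdb output above, where family 10 is IPv6 and family 2 is IPv4) and reports a miss via NULL and errno instead of aborting the process.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h> /* AF_INET, AF_INET6 */

/* Hypothetical, simplified stand-in for a cached table; not the real
 * nftables structure. */
struct table_entry {
    int family;       /* address family, e.g. AF_INET or AF_INET6 */
    const char *name; /* table name, e.g. "filter" */
};

/* Look up a table by family AND name. On a miss, set errno to ENOENT and
 * return NULL so the caller can decide what to do, rather than asserting. */
static const struct table_entry *
cache_find_table(const struct table_entry *cache, size_t len,
                 int family, const char *name)
{
    for (size_t i = 0; i < len; i++) {
        if (cache[i].family == family && strcmp(cache[i].name, name) == 0)
            return &cache[i];
    }
    errno = ENOENT;
    return NULL;
}

/* Hypothetical event handler: returns 0 if the event could be processed,
 * or -1 if its table is unknown and the event is skipped. */
static int
handle_rule_event(const struct table_entry *cache, size_t len,
                  int family, const char *table)
{
    if (cache_find_table(cache, len, family, table) == NULL)
        return -1; /* unknown table: skip this event, keep running */
    return 0;
}
```

With a cache that only holds the IPv4 "filter" table, an event for the IPv6 "filter" table makes `handle_rule_event()` return -1 with errno set to ENOENT, and the caller can carry on with the next event instead of crashing.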
Fix submitted upstream: https://lore.kernel.org/netfilter-devel/20220928223248.25933-1-phil@nwl.cc/

I'll clone this ticket for RHEL9 since the problem exists there as well.

Phil,

Seeing that it will now return NULL and errno ENOENT. Will any changes be required to the calling program?

Regards

Jon

Hi Jon,

(In reply to Jonathan Maxwell from comment #17)
> Seeing that it will now return NULL and errno ENOENT. Will any changes be
> required to the calling program?

No, it's fine as it is: The calling function in 'nft monitor', netlink_events_rule_cb(), will print 'W: Received event for an unknown table.' on stderr and otherwise ignore the event if netlink_delinearize_rule() returns NULL. The program then returns to listening for the next event.

Cheers, Phil

Dropping from the 8.8 RPL because of lack of Votes (devel whiteboard). -Sushil

Inherited the fix mentioned in comment 16 by package rebase, marking as TestOnly.

Hi QE, please adjust the stale date if you still want to have a specific test case for this. Otherwise, this bz will close in a week from now. Thanks.

The program is considering auto-closing old/stale bugzilla tickets, like this one. I'll go ahead and close it manually already.