Bug 1986588
| Summary: | iptables-nft returns incorrect result for -C when concurrently running | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Casey Callendrello <cdc> |
| Component: | iptables | Assignee: | Phil Sutter <psutter> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Dolezal <todoleza> |
| Severity: | medium | Priority: | unspecified |
| Version: | 8.4 | CC: | egarver, jbainbri, psutter, snemec, todoleza |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Unspecified | Fixed In Version: | iptables-1.8.4-20.el8 |
| Doc Type: | No Doc Update | Last Closed: | 2021-11-09 19:54:29 UTC |
| Type: | Bug | Clones: | 1990016 1990017 (view as bug list) |
| Bug Blocks: | 1990016, 1990017 | | |
Description
Casey Callendrello
2021-07-27 20:21:09 UTC
I couldn't create a reliable reproducer, but this also happens when concurrently running an "iptables -C XXX -t nat (...)" along with an "iptables -S YYY -t mangle". This is the particular sequence of commands executed on OpenShift that triggers this bug.

Eric Garver

(In reply to Casey Callendrello from comment #0)
> Additional info:
>
> This definitely depends on the size of the chain. If the chain only has one
> rule, I can't seem to reproduce it.
>
> Note that this node is an OpenShift / Kubernetes node, and thus has a lot of
> iptables activity. But not on this chain.

Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

Is this reproducible on RHEL?

Casey Callendrello

(In reply to Eric Garver from comment #2)
> Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

k8s will do an iptables-save, iptables-restore (though not for all chains). However, there are other bits of code that still use direct iptables commands.

> Is this reproducible on RHEL?

I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

Eric Garver

(In reply to Casey Callendrello from comment #3)
> k8s will do an iptables-save, iptables-restore (though not for all chains).
> However, there are other bits of code that still use direct iptables
> commands.
>
> I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

I was more pointing out that OCP has a lot of rule set changes and noise that you wouldn't get with a vanilla RHEL install.

@psutter, IIRC iptables-nft is atomic at the chain level. The "-w" option is ignored. Any ideas what could be going on here? Are we hitting a batch restart or cache issue?

Phil Sutter

(In reply to Eric Garver from comment #4)
> @psutter, IIRC iptables-nft is atomic at the chain level.

At table level even.

> The "-w" option is ignored. Any ideas what could be going on here? Are we
> hitting a batch restart or cache issue?

You never know. A working reproducer on RHEL would be good.

If OCP is really flushing the ruleset and reapplying its own rules, the chain IPTABLES-BUG should vanish and remain gone after the loop failed. Is that the case? If not, does OCP do something stupid like 'iptables-save >/tmp/foo && edit /tmp/foo && iptables-restore /tmp/foo'? That might leave a gap in which the given rule is not yet back in place.

Phil Sutter

FYI: It is actually quick and easy to reproduce given the instructions in comment 0. This proves it's not an OCP issue. I'll investigate further, sorry for the noise!

Phil Sutter

Here's a pretty simple reproducer:

```bash
#!/bin/bash

iptables -t nat -F IPTABLES-BUG >/dev/null 2>&1
iptables -t nat -X IPTABLES-BUG >/dev/null 2>&1

mt='-m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE'

iptables -t nat -N IPTABLES-BUG
for i in $(seq 1 250); do
	iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i $mt
done

while true; do
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt &
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt
done
```

The problem exists only in RHEL8. Using either the kernel or iptables from upstream avoids it. Here's why:

1) In the RHEL8 kernel, listing the ruleset bumps its generation ID. Upstream avoids that, probably with commit b8b27498659c6 ("netfilter: nf_tables: return immediately on empty commit").
2) While fetching rules, the recvfrom() call may return EINTR. Upstream iptables notices the changed generation ID and retries; RHEL8 iptables can't do that due to how cache population is integrated into the calling code.

Backporting the patch from 1) would avoid the problem with the given reproducer, but a background job performing actual ruleset changes may cause the same situation nevertheless. Reorganizing cache handling in iptables to match upstream requires a larger backport series, probably not suitable for RHEL8. This RHEL-only fix seems to do the trick:

```diff
--- a/iptables/nft-cache.c
+++ b/iptables/nft-cache.c
@@ -404,9 +404,12 @@ static int nft_rule_list_update(struct nftnl_chain *c, void *data)
 					      NLM_F_DUMP, h->seq);
 	nftnl_rule_nlmsg_build_payload(nlh, rule);
 
+retry:
 	ret = mnl_talk(h, nlh, nftnl_rule_list_cb, c);
-	if (ret < 0 && errno == EINTR)
+	if (ret < 0 && errno == EINTR) {
 		assert(nft_restart(h) >= 0);
+		goto retry;
+	}
 
 	nftnl_rule_free(rule);
```

Even with multiple other loops spinning on chain create and delete calls, this doesn't loop more than 3 or 4 times until it succeeds, and given that it retries only in the EINTR case, I consider it safe.

I decided to apply the same workaround to fetch_table_cache(), fetch_set_cache() and fetch_chain_cache() as well. For some reason failures from fetch_table_cache() weren't problematic and fetch_chain_cache() never failed, but large enough rulesets might change that.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (iptables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4468