Description of problem:

When two iptables -C commands run concurrently, one of them sometimes incorrectly returns an error even though the rule is present.

Version-Release number of selected component (if applicable):

# iptables --version
iptables v1.8.4 (nf_tables)
# rpm -qa | grep tables
iptables-libs-1.8.4-15.el8_3.3.x86_64
iptables-1.8.4-15.el8_3.3.x86_64
nftables-0.9.3-16.el8.x86_64
# uname -a
Linux ci-ln-5svkrn2-f76d1-ccstw-worker-a-sb62g 4.18.0-240.23.2.el8_3.x86_64 #1 SMP Thu Jul 8 03:12:56 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Easy, on a node with enough iptables load (e.g. an OpenShift node).

Steps to Reproduce:
1. # iptables -t nat -N IPTABLES-BUG; for i in $(seq 1 250); do iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i -m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE; done
2. # while iptables -w 5 -W 100000 -t nat -C IPTABLES-BUG -d 1.0.0.250/32 -p tcp -m tcp --dport 42 -m comment --comment testing -j MASQUERADE; do echo -n; done
3. In another terminal, run the same loop:
   # while iptables -w 5 -W 100000 -t nat -C IPTABLES-BUG -d 1.0.0.250/32 -p tcp -m tcp --dport 42 -m comment --comment testing -j MASQUERADE; do echo -n; done

One of the two should quickly fail.

Actual results:

One of the while-loops eventually fails with "iptables: Bad rule (does a matching rule exist in that chain?)".

Expected results:

iptables -C should not fail while the matching rule is present in the chain.

Additional info:

This definitely depends on the size of the chain. If the chain only has one rule, I can't seem to reproduce it.

Note that this node is an OpenShift / Kubernetes node, and thus has a lot of iptables activity, though not on this chain.
I couldn't create a reliable reproducer, but this also happens when concurrently running an "iptables -C XXX -t nat (...)" along with an "iptables -S YYY -t mangle". This is the particular sequence of commands executed on OpenShift that triggers this bug.
(In reply to Casey Callendrello from comment #0)
> Additional info:
>
> This definitely depends on the size of the chain. If the chain only has one
> rule, I can't seem to reproduce it.
>
> Note that this node is an OpenShift / Kubernetes node, and thus has a lot of
> iptables activity. But not on this chain.

Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

Is this reproducible on RHEL?
(In reply to Eric Garver from comment #2)
> Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

k8s will do an iptables-save, then an iptables-restore (though not for all chains). However, there are other bits of code that still use direct iptables commands.

> Is this reproducible on RHEL?

I'm not sure about RHEL, but this is RHCOS, which is quite closely related.
(In reply to Casey Callendrello from comment #3)
> (In reply to Eric Garver from comment #2)
> > Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?
>
> k8s will do an iptables-save, then an iptables-restore (though not for all
> chains). However, there are other bits of code that still use direct
> iptables commands.
>
> > Is this reproducible on RHEL?
>
> I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

I was more pointing out that OCP has a lot of rule set changes and noise that you wouldn't get with a vanilla RHEL install.

@psutter, IIRC iptables-nft is atomic at the chain level. The "-w" option is ignored. Any ideas what could be going on here? Are we hitting a batch restart or cache issue?
(In reply to Eric Garver from comment #4)
> (In reply to Casey Callendrello from comment #3)
> > (In reply to Eric Garver from comment #2)
> > > Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?
> >
> > k8s will do an iptables-save, then an iptables-restore (though not for all
> > chains). However, there are other bits of code that still use direct
> > iptables commands.
> >
> > > Is this reproducible on RHEL?
> >
> > I'm not sure about RHEL, but this is RHCOS, which is quite closely related.
>
> I was more pointing out that OCP has a lot of rule set changes and noise
> that you wouldn't get with a vanilla RHEL install.
>
> @psutter, IIRC iptables-nft is atomic at the chain level.

At table level, even.

> The "-w" option is ignored. Any ideas what could be going on here? Are we
> hitting a batch restart or cache issue?

You never know. A working reproducer on RHEL would be good.

If OCP is really flushing the ruleset and reapplying its own rules, the chain IPTABLES-BUG should vanish and remain gone after the loop fails. Is that the case? If not, does OCP do something stupid like 'iptables-save >/tmp/foo && edit /tmp/foo && iptables-restore /tmp/foo'? That might leave a gap during which the given rule is not yet back in place.
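A quick way to check that would be something like the following (a minimal sketch; the polling loop and timestamp format are just one way to do it):

---
# Run alongside the failing 'iptables -C' loop and note whether the
# chain itself ever actually disappears:
while true; do
	iptables -t nat -S IPTABLES-BUG >/dev/null 2>&1 || \
		echo "chain missing at $(date +%T.%N)"
done
---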
FYI: It is actually quick and easy to reproduce given the instructions in comment 0. This proves it's not an OCP issue. I'll investigate further, sorry for the noise!
Here's a pretty simple reproducer:

---
#!/bin/bash

iptables -t nat -F IPTABLES-BUG >/dev/null 2>&1
iptables -t nat -X IPTABLES-BUG >/dev/null 2>&1

mt='-m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE'

iptables -t nat -N IPTABLES-BUG
for i in $(seq 1 250); do
	iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i $mt
done

while true; do
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt &
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt
done
---

The problem exists only in RHEL8. Using either kernel or iptables from upstream avoids it. Here's why:

1) In the RHEL8 kernel, listing the ruleset bumps its generation ID. Upstream avoids that, probably with commit b8b27498659c6 ("netfilter: nf_tables: return immediately on empty commit").

2) While fetching rules, the recvfrom() call may return EINTR. Upstream iptables notices the changed generation ID and retries; RHEL8 iptables can't do that due to how cache population is integrated into the calling code.

Backporting the patch from 1) would avoid the problem with the given reproducer, but a background job performing actual ruleset changes may cause the same situation nevertheless. Reorganizing cache handling in iptables to match upstream requires a larger backport series, probably not suitable for RHEL8.

This RHEL-only fix seems to do the trick:

--- a/iptables/nft-cache.c
+++ b/iptables/nft-cache.c
@@ -404,9 +404,12 @@ static int nft_rule_list_update(struct nftnl_chain *c, void *data)
 				    NLM_F_DUMP, h->seq);
 	nftnl_rule_nlmsg_build_payload(nlh, rule);
 
+retry:
 	ret = mnl_talk(h, nlh, nftnl_rule_list_cb, c);
-	if (ret < 0 && errno == EINTR)
+	if (ret < 0 && errno == EINTR) {
 		assert(nft_restart(h) >= 0);
+		goto retry;
+	}
 
 	nftnl_rule_free(rule);

Even with multiple other loops spinning on chain create and delete calls, this doesn't loop more than 3 or 4 times until it succeeds, and given that it retries only in the EINTR case, I consider it safe.
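If it helps to confirm point 1) on a given kernel, one can watch for generation bumps while only listing rules. This is a rough check, not part of the fix, and it assumes nft monitor reports the kernel's "new generation" notifications (the exact message wording may differ between nftables versions):

---
# Terminal 1: watch for ruleset generation notifications
nft monitor | grep --line-buffered 'generation'

# Terminal 2: list only, never modify
while :; do iptables -t nat -S IPTABLES-BUG >/dev/null; done

# On an affected (RHEL8) kernel the listing loop alone should trigger
# generation notifications; with the upstream kernel fix it should not.
---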
I decided to apply the same workaround to fetch_table_cache(), fetch_set_cache() and fetch_chain_cache() also. For some reason failures from fetch_table_cache() weren't problematic and fetch_chain_cache() never failed, but large enough rulesets might change that.
https://src.osci.redhat.com/rpms/iptables/pull-request/14
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (iptables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4468