Bug 1986588
| Summary: | iptables-nft returns incorrect result for -C when concurrently running | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 8 | Reporter: | Casey Callendrello <cdc> |
| Component: | iptables | Assignee: | Phil Sutter <psutter> |
| Status: | CLOSED ERRATA | QA Contact: | Tomas Dolezal <todoleza> |
| Severity: | medium | Priority: | unspecified |
| Version: | 8.4 | CC: | egarver, jbainbri, psutter, snemec, todoleza |
| Target Milestone: | rc | Keywords: | Triaged, ZStream |
| Target Release: | --- | Hardware: | x86_64 |
| OS: | Unspecified | Fixed In Version: | iptables-1.8.4-20.el8 |
| Doc Type: | No Doc Update | Last Closed: | 2021-11-09 19:54:29 UTC |
| Type: | Bug | Clones: | 1990016 1990017 (view as bug list) |
| Bug Blocks: | 1990016, 1990017 | | |
Description
Casey Callendrello
2021-07-27 20:21:09 UTC
I couldn't create a reliable reproducer, but this also happens when concurrently running an "iptables -C XXX -t nat (...)" along with an "iptables -S YYY -t mangle". This is the particular sequence of commands executed on OpenShift that triggers this bug.

Eric Garver

(In reply to Casey Callendrello from comment #0)
> Additional info:
>
> This definitely depends on the size of the chain. If the chain only has one
> rule, I can't seem to reproduce it.
>
> Note that this node is an OpenShift / Kubernetes node, and thus has a lot of
> iptables activity. But not on this chain.

Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

Is this reproducible on RHEL?

Casey Callendrello

(In reply to Eric Garver from comment #2)
> Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

k8s will do an iptables-save, iptables-restore (though not for all chains). However, there are other bits of code that still use direct iptables commands.

> Is this reproducible on RHEL?

I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

Eric Garver

(In reply to Casey Callendrello from comment #3)
> k8s will do an iptables-save, iptables-restore (though not for all chains).
> However, there are other bits of code that still use direct iptables
> commands.
>
> I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

I was more pointing out that OCP has a lot of rule set changes and noise that you wouldn't get with a vanilla RHEL install.

@psutter, IIRC iptables-nft is atomic at the chain level. The "-w" option is ignored. Any ideas what could be going on here? Are we hitting a batch restart or cache issue?

Phil Sutter

(In reply to Eric Garver from comment #4)
> @psutter, IIRC iptables-nft is atomic at the chain level.

At table level even.

> The "-w" option is ignored. Any ideas what could be going on here? Are we
> hitting a batch restart or cache issue?

You never know. A working reproducer on RHEL would be good.

If OCP is really flushing the ruleset and reapplying its own rules, the chain IPTABLES-BUG should vanish and remain gone after the loop failed. Is that the case? If not, does OCP do something stupid like 'iptables-save >/tmp/foo && edit /tmp/foo && iptables-restore /tmp/foo'? That might leave a gap in which the given rule is not yet back in place.

Phil Sutter

FYI: It is actually quick and easy to reproduce given the instructions in comment 0. This proves it's not an OCP issue. I'll investigate further, sorry for the noise!

Phil Sutter

Here's a pretty simple reproducer:

```bash
#!/bin/bash

iptables -t nat -F IPTABLES-BUG >/dev/null 2>&1
iptables -t nat -X IPTABLES-BUG >/dev/null 2>&1

mt='-m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE'

iptables -t nat -N IPTABLES-BUG
for i in $(seq 1 250); do
	iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i $mt
done

while true; do
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt &
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt
done
```

The problem exists only in RHEL8. Using either the kernel or iptables from upstream avoids it. Here's why:

1) In the RHEL8 kernel, listing the ruleset bumps its generation ID. Upstream avoids that, probably with commit b8b27498659c6 ("netfilter: nf_tables: return immediately on empty commit").
2) While fetching rules, the recvfrom() call may return EINTR. Upstream iptables notices the changed generation ID and retries; RHEL8 iptables can't do that due to how cache population is integrated into the calling code.

Backporting the patch from 1) would avoid the problem with the given reproducer, but a background job performing actual ruleset changes may cause the same situation nevertheless. Reorganizing cache handling in iptables to match upstream requires a larger backport series, probably not suitable for RHEL8. This RHEL-only fix seems to do the trick:

```diff
--- a/iptables/nft-cache.c
+++ b/iptables/nft-cache.c
@@ -404,9 +404,12 @@ static int nft_rule_list_update(struct nftnl_chain *c, void *data)
 					      NLM_F_DUMP, h->seq);
 	nftnl_rule_nlmsg_build_payload(nlh, rule);
 
+retry:
 	ret = mnl_talk(h, nlh, nftnl_rule_list_cb, c);
-	if (ret < 0 && errno == EINTR)
+	if (ret < 0 && errno == EINTR) {
 		assert(nft_restart(h) >= 0);
+		goto retry;
+	}
 
 	nftnl_rule_free(rule);
```

Even with multiple other loops spinning on chain create and delete calls, this doesn't loop more than 3 or 4 times until it succeeds, and given that it retries only in the EINTR case, I consider it safe.

I decided to apply the same workaround to fetch_table_cache(), fetch_set_cache() and fetch_chain_cache() as well. For some reason failures from fetch_table_cache() weren't problematic and fetch_chain_cache() never failed, but large enough rulesets might change that.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (iptables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4468