Description of problem:

When two iptables -C commands run concurrently, one of them sometimes incorrectly returns an error even though the rule is present.

Version-Release number of selected component (if applicable):

# iptables --version
iptables v1.8.4 (nf_tables)
# rpm -qa | grep tables
iptables-libs-1.8.4-15.el8_3.3.x86_64
iptables-1.8.4-15.el8_3.3.x86_64
nftables-0.9.3-16.el8.x86_64
# uname -a
Linux ci-ln-5svkrn2-f76d1-ccstw-worker-a-sb62g 4.18.0-240.23.2.el8_3.x86_64 #1 SMP Thu Jul 8 03:12:56 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

How reproducible:

Easy, on a node with enough iptables load (e.g. an OpenShift node).

Steps to Reproduce:
1. # iptables -t nat -N IPTABLES-BUG; for i in $(seq 1 250); do iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i -m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE; done
2. # while iptables -w 5 -W 100000 -t nat -C IPTABLES-BUG -d 1.0.0.250/32 -p tcp -m tcp --dport 42 -m comment --comment testing -j MASQUERADE; do echo -n; done
3. In another terminal, run the same loop:
   # while iptables -w 5 -W 100000 -t nat -C IPTABLES-BUG -d 1.0.0.250/32 -p tcp -m tcp --dport 42 -m comment --comment testing -j MASQUERADE; do echo -n; done

One of the two should quickly fail.

Actual results:

One of the while-loops eventually fails with "iptables: Bad rule (does a matching rule exist in that chain?)".

Expected results:

iptables -C should not fail while the matching rule is present in the chain.

Additional info:

This definitely depends on the size of the chain. If the chain only has one rule, I can't seem to reproduce it.

Note that this node is an OpenShift / Kubernetes node, and thus has a lot of iptables activity, though not on this chain.
I couldn't create a reliable reproducer, but this also happens when concurrently running an "iptables -C XXX -t nat (...)" along with an "iptables -S YYY -t mangle". This is the particular sequence of commands executed on OpenShift that triggers this bug.
(In reply to Casey Callendrello from comment #0)
> Additional info:
>
> This definitely depends on the size of the chain. If the chain only has one
> rule, I can't seem to reproduce it.
>
> Note that this node is an OpenShift / Kubernetes node, and thus has a lot of
> iptables activity. But not on this chain.

Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

Is this reproducible on RHEL?
(In reply to Eric Garver from comment #2)
> Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?

k8s will do an iptables-save, then an iptables-restore (though not for all chains). However, there are other bits of code that still use direct iptables commands.

> Is this reproducible on RHEL?

I'm not sure about RHEL, but this is RHCOS, which is quite closely related.
(In reply to Casey Callendrello from comment #3)
> (In reply to Eric Garver from comment #2)
> > Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?
>
> k8s will do an iptables-save, then an iptables-restore (though not for all
> chains). However, there are other bits of code that still use direct
> iptables commands.
>
> > Is this reproducible on RHEL?
>
> I'm not sure about RHEL, but this is RHCOS, which is quite closely related.

I was more pointing out that OCP has a lot of rule set changes and noise that you wouldn't get with a vanilla RHEL install.

@psutter, IIRC iptables-nft is atomic at the chain level. The "-w" option is ignored. Any ideas what could be going on here? Are we hitting a batch restart or cache issue?
(In reply to Eric Garver from comment #4)
> (In reply to Casey Callendrello from comment #3)
> > (In reply to Eric Garver from comment #2)
> > > Doesn't k8s regularly wipe and re-apply the entire iptables ruleset?
> >
> > k8s will do an iptables-save, then an iptables-restore (though not for all
> > chains). However, there are other bits of code that still use direct
> > iptables commands.
> >
> > > Is this reproducible on RHEL?
> >
> > I'm not sure about RHEL, but this is RHCOS, which is quite closely related.
>
> I was more pointing out that OCP has a lot of rule set changes and noise
> that you wouldn't get with a vanilla RHEL install.
>
> @psutter, IIRC iptables-nft is atomic at the chain level.

At table level, even.

> The "-w" option is ignored. Any ideas what could be going on here? Are we
> hitting a batch restart or cache issue?

You never know. A working reproducer on RHEL would be good.

If OCP is really flushing the ruleset and reapplying its own rules, the chain IPTABLES-BUG should vanish and remain gone after the loop fails. Is that the case? If not, does OCP do something stupid like 'iptables-save >/tmp/foo && edit /tmp/foo && iptables-restore /tmp/foo'? That might leave a gap during which the given rule is not yet back in place.
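A quick way to check that would be something like the following (a minimal sketch; the polling loop and timestamp format are just one way to do it):

---
# Run alongside the failing 'iptables -C' loop and note whether the
# chain itself ever actually disappears:
while true; do
	iptables -t nat -S IPTABLES-BUG >/dev/null 2>&1 || \
		echo "chain missing at $(date +%T.%N)"
done
---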
FYI: It is actually quick and easy to reproduce given the instructions in comment 0. This proves it's not an OCP issue. I'll investigate further, sorry for the noise!
Here's a pretty simple reproducer:

---
#!/bin/bash

iptables -t nat -F IPTABLES-BUG >/dev/null 2>&1
iptables -t nat -X IPTABLES-BUG >/dev/null 2>&1

mt='-m tcp -p tcp --dport 42 -m comment --comment testing -j MASQUERADE'

iptables -t nat -N IPTABLES-BUG
for i in $(seq 1 250); do
	iptables -t nat -w -A IPTABLES-BUG -d 1.0.0.$i $mt
done

while true; do
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt &
	iptables -t nat -C IPTABLES-BUG -d 1.0.0.250/32 $mt
done
---

The problem exists only in RHEL8. Using either kernel or iptables from upstream avoids it. Here's why:

1) In the RHEL8 kernel, listing the ruleset bumps its generation ID. Upstream avoids that, probably with commit b8b27498659c6 ("netfilter: nf_tables: return immediately on empty commit").

2) While fetching rules, the recvfrom() call may return EINTR. Upstream iptables notices the changed generation ID and retries; RHEL8 iptables can't do that due to how cache population is integrated into the calling code.

Backporting the patch from 1) would avoid the problem with the given reproducer, but a background job performing actual ruleset changes may cause the same situation nevertheless. Reorganizing cache handling in iptables to match upstream requires a larger backport series, probably not suitable for RHEL8.

This RHEL-only fix seems to do the trick:

--- a/iptables/nft-cache.c
+++ b/iptables/nft-cache.c
@@ -404,9 +404,12 @@ static int nft_rule_list_update(struct nftnl_chain *c, void *data)
 				    NLM_F_DUMP, h->seq);
 	nftnl_rule_nlmsg_build_payload(nlh, rule);
 
+retry:
 	ret = mnl_talk(h, nlh, nftnl_rule_list_cb, c);
-	if (ret < 0 && errno == EINTR)
+	if (ret < 0 && errno == EINTR) {
 		assert(nft_restart(h) >= 0);
+		goto retry;
+	}
 
 	nftnl_rule_free(rule);

Even with multiple other loops spinning on chain create and delete calls, this doesn't loop more than 3 or 4 times until it succeeds, and given that it retries only in the EINTR case, I consider it safe.
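If it helps to confirm point 1) on a given kernel, one can watch for generation bumps while only listing rules. This is a rough check, not part of the fix, and it assumes nft monitor reports the kernel's "new generation" notifications (the exact message wording may differ between nftables versions):

---
# Terminal 1: watch for ruleset generation notifications
nft monitor | grep --line-buffered 'generation'

# Terminal 2: list only, never modify
while :; do iptables -t nat -S IPTABLES-BUG >/dev/null; done

# On an affected (RHEL8) kernel the listing loop alone should trigger
# generation notifications; with the upstream kernel fix it should not.
---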
I decided to apply the same workaround to fetch_table_cache(), fetch_set_cache() and fetch_chain_cache() also. For some reason failures from fetch_table_cache() weren't problematic and fetch_chain_cache() never failed, but large enough rulesets might change that.
https://src.osci.redhat.com/rpms/iptables/pull-request/14
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (iptables bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:4468