Description of problem:

openshift-kube-proxy fails on some nodes with the error:

2020-09-17T10:52:33.795271544Z E0917 10:52:33.795244       1 proxier.go:1449] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables):
2020-09-17T12:05:35.593340687Z line 27030: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-XFVE3LTTFRHEFCIQ
2020-09-17T12:05:35.593340687Z line 27031: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QZNCVF4BC5U2KMFO
2020-09-17T12:05:35.593340687Z line 27032: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-KW7TSBXHYYDGU7P6
2020-09-17T12:05:35.593340687Z line 27033: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-4J5AY7EL4YBQGJSK
2020-09-17T12:05:35.593340687Z line 27034: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-ZRA4FC2ZSWYIJTXI
2020-09-17T12:05:35.593340687Z line 27035: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-GTPIKDBCCRVQTJP2

It appears to be an issue with locking in iptables. Providing the must-gather shortly.

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.3.29

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Setting the target release to the latest dev branch so we can identify the problem. Once that work is done, we will consider where to backport.
Duplicate of 1880365?
Looking into this; no timeline for a fix at the moment. Based on the outcome of BZ 1843646, the RHEL team basically ruled out a problem on the OS side. Checking the possibility of parallel iptables-restore commands causing things to fail. Have you been able to reproduce this on a non-production cluster?
Analysis of the must-gather has been done. This is an iptables problem that was also seen here -> https://bugzilla.redhat.com/show_bug.cgi?id=1812261 and fixed here: https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c34. Working with the team to figure out the correct way forward.
(In reply to Andrew Stoycos from comment #15)
> Analysis of the must-gather has been done, This is an IPtables problem that
> was also seen here -> https://bugzilla.redhat.com/show_bug.cgi?id=1812261
> and fixed here https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c34

The bugs bz1812261 and bz1807811 show segfaults in iptables when executing
`nftnl_expr_build_payload`, while this shows a lock being taken out and
possibly not released by some other actor. I see no segfaults from the
iptables-restore in the must-gather, we just get an exit code of `4`:

proxier.go:1449] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables):

This isn't segfaulting; something appears to be holding a resource when we
try to do the restore:

2020-09-17T12:05:35.593340687Z line 27030: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-XFVE3LTTFRHEFCIQ
2020-09-17T12:05:35.593340687Z line 27031: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QZNCVF4BC5U2KMFO
2020-09-17T12:05:35.593340687Z line 27032: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-KW7TSBXHYYDGU7P6
2020-09-17T12:05:35.593340687Z line 27033: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-4J5AY7EL4YBQGJSK
2020-09-17T12:05:35.593340687Z line 27034: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-ZRA4FC2ZSWYIJTXI
2020-09-17T12:05:35.593340687Z line 27035: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-GTPIKDBCCRVQTJP2

Seeing how/where this is produced:

$ git remote -v
origin	git://git.netfilter.org/iptables.git (fetch)
origin	git://git.netfilter.org/iptables.git (push)
$ git checkout v1.8.4

This function performs the nftables chain deletes and produces the message we
see above when we return -1:

static int __nft_chain_user_del(struct nftnl_chain *c, void *data)
{
	struct chain_user_del_data *d = data;
	struct nft_handle *h = d->handle;
	int ret;

	/* don't delete built-in chain */
	if (nft_chain_builtin(c))
		return d->builtin_err;

	if (d->verbose)
		fprintf(stdout, "Deleting chain `%s'\n",
			nftnl_chain_get_str(c, NFTNL_CHAIN_NAME));

	/* This triggers required policy rule deletion. */
	if (h->family == NFPROTO_BRIDGE)
		nft_build_cache(h, c);

	/* XXX This triggers a fast lookup from the kernel. */
	nftnl_chain_unset(c, NFTNL_CHAIN_HANDLE);
	ret = batch_chain_add(h, NFT_COMPAT_CHAIN_USER_DEL, c);
	if (ret)
		return -1;

	nftnl_chain_list_del(c);
	return 0;
}

This calls nftnl_chain_list_del, which is exported by the netfilter libraries
(include/linux/netfilter/nftables.h) and exposed in a library, which is
libnftnl-1.1.5-4 on the problematic node:

EXPORT_SYMBOL(nftnl_chain_list_del);
void nftnl_chain_list_del(struct nftnl_chain *r)
{
	list_del(&r->head);
	hlist_del(&r->hnode);
}

This is returned from the kernel as EBUSY, likely because something else is
already holding the resource. Perhaps we can identify what that is first and
foremost? It might be another "stuck" iptables-restore or something else
running with nftables/netfilter that is long-processing.

Let me know if I am off base with the above. I'll try to narrow down what
else could be holding the lock/resource.
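Not from the must-gather, just a rough sketch of how I'd try to catch a
concurrent caller on the affected node while the failure reproduces (assumes
only ps and coreutils on the host; the log path is arbitrary):

  # crude sampler: log any iptables/ip6tables/nft processes several times a
  # second so we can see who else is active when kube-proxy's restore fails
  while true; do
      ts=$(date -u +%FT%T.%3NZ)
      ps -eo pid,ppid,stat,comm,args --no-headers \
          | grep -E '[i]ptables|[i]p6tables|[n]ft ' \
          | sed "s|^|$ts |"
      sleep 0.2
  done >> /tmp/xtables-callers.log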
Note the comment regarding an xtables lock mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1843646#c28
No, you have a good point @Robb. I was told iptables should be able to handle
parallel calls, but that comment specifically states otherwise. I am looking
into what could be locking the tables. There will never be two `restore`
commands called in parallel; however, another `iptables` command could be
locking something else.

The reason I pointed to 1807811 was this comment
https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c5:

"in previous release there was no segfault and when base chains existed, no
errors at all. only on first attempt with no nft tables, error code 4 was
returned if iptables-restore was started in parallel for all but one process."
I have an sosreport from a problematic system. It appears that kube-proxy is
running the iptables-restore, as we suspected, but it is possibly spending a
lot of time executing it:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     2476413 91.1  0.0 140656 90792 ?        R    13:00   0:30 /usr/sbin/iptables-restore -w 5 --noflush --counters

    PID    PPID USER     GROUP         LWP NLWP START COMMAND         CGROUP
2476121       1 root     root      2476121    2 13:00 conmon          11:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...
2476175 2476121 root     root      2476175   43 13:00 kube-proxy      12:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...
2476413 2476175 root     root      2476413    1 13:00 iptables-restor 12:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...

Note the 91% CPU time for the process. This could be just a fluke of when we
captured `ps`, or it could be an indication there is an immense amount of
rules or chains that we are trying to walk. In any case, I think it is likely
one of the following:

- There is an immense amount of chains/rules and it is taking an excessive
  amount of time for iptables-restore to process them
- There is another process vying for the lock iptables-restore is
  getting/releasing as it walks chains/tables
- The iptables-restore itself is blocked behind a spinlock or some other lock
  in an ioctl or other syscall (unlikely, as the state is R and not D)

Perhaps we need to look at this in two parts: one for this specific instance,
to determine why this restore takes so long, and another to ensure concurrency
is being used properly and safely when performing iptables-restore.

As far as why this only happened when Tigera was updated, perhaps Calico
performs some other iptables/nftables operations simultaneously and we are
locking each other out. Just a guess on that last one.
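Not something captured in the sosreport, but a sketch of what I'd run the
next time we catch a restore spinning like this, to see whether it is burning
CPU in userspace or blocked in the kernel (the PID below is the one from the
ps output above, purely illustrative):

  # sample the state of the long-running iptables-restore for ~10 seconds
  pid=2476413   # substitute the PID actually seen on the node
  for i in $(seq 1 10); do
      echo "--- $(date -u +%T) ---"
      grep -E '^(State|voluntary_ctxt|nonvoluntary_ctxt)' /proc/$pid/status
      cat /proc/$pid/wchan; echo          # 0 means not sleeping in the kernel
      cat /proc/$pid/stack 2>/dev/null    # needs root; empty while running in userspace
      sleep 1
  done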
On a problematic node, there are almost 34k lines from iptables:

$ wc -l bad-node-iptables-dump.txt
34872 bad-node-iptables-dump.txt

The number of chains:

$ awk '/^:/' bad-node-iptables-dump.txt | wc -l
8598

And the number of rules:

$ awk '/^\-/' bad-node-iptables-dump.txt | wc -l
26258

Majority of the chains are KUBE named:

$ awk '/^:KUBE/' bad-node-iptables-dump.txt | wc -l
7712

And some are from calico:

$ awk '/^:cali\-/' bad-node-iptables-dump.txt | wc -l
872

There are quite a few empty chains as well:

$ grep -c '[0:0]' bad-node-iptables-dump.txt
28879
Ignore this last part:

(In reply to Robb Manes from comment #20)
> There are quite a few empty chains as well:
>
> $ grep -c '[0:0]' bad-node-iptables-dump.txt
> 28879

That is an invalid count.
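For completeness, here is a quick awk sketch against the same dump to check
how many of the declared chains still have a -j/-g reference from some rule
(if I understand the EBUSY correctly, it can mean a chain is still referenced
elsewhere). Untested against this exact dump, so treat the numbers as
approximate:

  # count declared chains that still have at least one jump/goto reference
  awk '
    /^:/  { chains[substr($1, 2)] = 1 }
    /^-A/ { for (i = 1; i < NF; i++)
              if ($i == "-j" || $i == "-g") refs[$(i + 1)]++ }
    END   { for (c in chains) if (c in refs) n++
            printf "%d of %d declared chains are referenced\n", n, length(chains) }
  ' bad-node-iptables-dump.txt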
Adding note from Tigera: Calico Enterprise and Calico both issue iptables-restore calls.
This looks like a duplicate of bug 1843646.

Legacy iptables (in RHEL7, for instance) used a lockfile in user space to
avoid parallel execution. This is what the '-w' flag enables. With
iptables-nft (RHEL8), no lockfile in user space is needed anymore. Each
ruleset commit to the kernel (either via a 'COMMIT' line in dump files or via
individual commands on the command line) happens in a transaction in kernel
space.

My guess for what's happening here is kube-proxy starts two iptables processes
in background in short succession and the first one takes a while to prepare
the commit (e.g. takes a while to parse the input or to fetch a ruleset cache
from kernel) so that in fact the second call gets to commit first.

The EBUSY return for CHAIN_USER_DEL is emitted by the kernel. Typically that
happens if the to-be-deleted chain is still referenced (jump/goto action) from
another rule.

If you have a system where this is happening on we could try to wrap iptables
binaries with a shell script that does a lockfile-based lock. This should
eliminate any accidental command execution reordering.

Either way, in order to identify the problem it must be reproducible, ideally
outside of OpenShift.
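For reference, a minimal sketch of the kind of wrapper being suggested (the
lock file path is arbitrary, and it assumes the real binary has been moved
aside to a *-real name first):

  #!/bin/sh
  # lockfile-based wrapper sketch: serialize every invocation through a
  # single flock before handing off to the real iptables-restore binary
  exec flock /run/xtables-wrapper.lock /usr/sbin/iptables-restore-real "$@"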
(In reply to Phil Sutter from comment #30)
> My guess for what's happening here is kube-proxy starts two iptables
> processes in background in short succession and the first one takes a while
> to prepare the commit (e.g. takes a while to parse the input or to fetch a
> ruleset cache from kernel) so that in fact the second call gets to commit
> first.

Understood, this is likely a race of concurrent routines within just
kube-proxy, or of concurrently running kube-proxy and calico on the same
system, as they both reference iptables-restore and iptables commands.

> If you have a system where this is happening on we could try to wrap iptables
> binaries with a shell script that does a lockfile-based lock. This should
> eliminate any accidental command execution reordering.

The customer's environment still sees this happening, but as it's being
executed in the context of OpenShift, kube-proxy is running in a container and
the container image is not easily changed out; I see it chroots to the host
and executes the commands directly
(https://github.com/openshift/sdn/blob/master/images/iptables-scripts/iptables-restore).

I see this opens a netlink socket when I strace
`/usr/sbin/iptables-restore -w 5 --noflush --counters` on
iptables-1.8.4-10.el8_2.1. Is there a good place I might use systemtap to
watch for locking/unlocking of the chains so we can track this independent of
userspace, where I can print the pid and comm with args? Or is it a lock
per-chain so it's harder to track?
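In the meantime, one userspace-only angle (it won't identify the committing
PID the way a kernel probe would, but it timestamps every ruleset change while
we reproduce) could be something along these lines, assuming the nft binary is
present on the host:

  # sketch: timestamp every nftables ruleset event seen while reproducing
  nft monitor 2>&1 | while IFS= read -r line; do
      printf '%s %s\n' "$(date -u +%FT%T.%3NZ)" "$line"
  done > /tmp/nft-monitor.log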
(In reply to Robb Manes from comment #26)
> see the comment here:
> https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.

Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
(Though I hadn't noticed Phil was already involved in this bug too so maybe
we should just reassign this to RHEL/iptables? Dunno.)
(In reply to Dan Winship from comment #32)
> (In reply to Robb Manes from comment #26)
> > see the comment here:
> > https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.
>
> Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
> (Though I hadn't noticed Phil was already involved in this bug too so maybe
> we should just reassign this to RHEL/iptables? Dunno.)

I think it's clear bug 1843646 is the root cause. But don't you want this bug
open to track the fix propagation into OCP?
(In reply to Eric Garver from comment #33)
> (In reply to Dan Winship from comment #32)
> > (In reply to Robb Manes from comment #26)
> > > see the comment here:
> > > https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.
> >
> > Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
> > (Though I hadn't noticed Phil was already involved in this bug too so maybe
> > we should just reassign this to RHEL/iptables? Dunno.)
>
> I think it's clear bug 1843646 is the root cause. But don't you want this
> bug open to track the fix propagation into OCP?

Yes, let's keep this bug open for the OCP side.
*** Bug 1880365 has been marked as a duplicate of this bug. ***
The fixed RPM for RHEL 8.2-based releases is iptables-1.8.4-10.el8_2.4. That fixed RPM should be in 4.4.31, 4.5.21, and 4.6.6.
Per https://bugzilla.redhat.com/show_bug.cgi?id=1885203#c2#c3#c4, this bug can be verified.
I still see this error on a cluster in version 4.6.4, even though the iptables
package is updated to 1.8.4-10 according to bugzilla BZ #1884946

----
% omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.4     True        False         7m18s

% grep iptables installed-rpms
iptables-1.8.4-10.el8_2.1.x86_64                  Fri Nov  6 16:23:18 2020
iptables-libs-1.8.4-10.el8_2.1.x86_64             Fri Nov  6 16:23:18 2020

% grep 'Failed to execute iptables-restore: exit status 4' * -Rn -A 4
2021-01-26T08:37:45.785723453Z E0126 08:37:45.785609       1 proxier.go:1555] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-S274SUCD3F25PVEH
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-47c2f751ab0d5ee88e2826749f1372e6a24db3d0c0c942136ae84db17cb7f086/namespaces/openshift-kube-proxy/pods/openshift-kube-proxy-x2tcv/kube-proxy/kube-proxy/logs/current.log-118-2021-01-26T08:37:45.785723453Z )
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256- I0126 08:37:45.785640       1 proxier.go:825] Sync failed; retrying in 30s
1 reflector.go:405] k8s.io/client-go.6/tools/cache/reflector.go:125: watch of *v1.Endpoints ended with: very short watch: k8s.io/client-go.6/tools/cache/reflector.go:125: Unexpected watch close - watch lasted less than a second and no items received
(In reply to tmicheli from comment #66)
> I still see this error on a cluster in version 4.6.4, even though the
> iptables package is updated to 1.8.4-10 according to bugzilla BZ #1884946
>
> ----
> % omg get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.6.4     True        False         7m18s
>
> % grep iptables installed-rpms
> iptables-1.8.4-10.el8_2.1.x86_64                  Fri Nov  6 16:23:18 2020
> iptables-libs-1.8.4-10.el8_2.1.x86_64             Fri Nov  6 16:23:18 2020

That's because that's the wrong version. This bug is addressed by zstream bug
1891881, fixed in version iptables-1.8.4-10.el8_2.4. The bug you're looking
at, bug 1884946, is for a different issue.
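For anyone checking their own cluster, one quick way to confirm which iptables
build a node actually carries (the node name below is just a placeholder,
substitute one from `oc get nodes`):

  oc debug node/worker-0 -- chroot /host rpm -q iptables iptables-libs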
Thanks for your answer; I mixed them up. Is there a rough timeline for when the updated package will be available in RHCOS and fixes the issue with Calico/Tigera?
Per comment 6, this was fixed in 4.6.6. However, also note that the Calico image includes its own iptables binary, which also needs to be updated: https://bugzilla.redhat.com/show_bug.cgi?id=1909199#c15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633