Description of problem:

openshift-kube-proxy fails on some nodes with the error:

2020-09-17T10:52:33.795271544Z E0917 10:52:33.795244       1 proxier.go:1449] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables):
2020-09-17T12:05:35.593340687Z line 27030: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-XFVE3LTTFRHEFCIQ
2020-09-17T12:05:35.593340687Z line 27031: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QZNCVF4BC5U2KMFO
2020-09-17T12:05:35.593340687Z line 27032: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-KW7TSBXHYYDGU7P6
2020-09-17T12:05:35.593340687Z line 27033: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-4J5AY7EL4YBQGJSK
2020-09-17T12:05:35.593340687Z line 27034: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-ZRA4FC2ZSWYIJTXI
2020-09-17T12:05:35.593340687Z line 27035: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-GTPIKDBCCRVQTJP2

It appears to be an issue with locking in iptables. Providing the must-gather shortly.

Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.3.29

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Setting the target release to the latest dev branch so we can identify the problem. Once that work is done, we will consider where to backport.
Duplicate of 1880365?
Looking into this; no timeline for a fix at the moment. Based on the outcome of BZ 1843646, the RHEL team basically ruled out a problem on the OS side. Checking the possibility of parallel iptables-restore commands causing things to fail. Have you been able to reproduce this on a non-production cluster?
Analysis of the must-gather has been done. This is an iptables problem that was also seen here -> https://bugzilla.redhat.com/show_bug.cgi?id=1812261 and fixed here: https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c34. Working with the team to figure out the correct way forward.
(In reply to Andrew Stoycos from comment #15)
> Analysis of the must-gather has been done, This is an IPtables problem that
> was also seen here -> https://bugzilla.redhat.com/show_bug.cgi?id=1812261
> and fixed here https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c34

The bugs bz1812261 and bz1807811 show segfaults in iptables when executing
`nftnl_expr_build_payload`, while this shows a lock being taken out and
possibly not released by some other actor. I see no segfaults from the
iptables-restore in the must-gather, we just get an exit code of `4`:

proxier.go:1449] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables):

This isn't segfaulting; something appears to be holding a resource when we
try to do the restore:

2020-09-17T12:05:35.593340687Z line 27030: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-XFVE3LTTFRHEFCIQ
2020-09-17T12:05:35.593340687Z line 27031: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-QZNCVF4BC5U2KMFO
2020-09-17T12:05:35.593340687Z line 27032: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-KW7TSBXHYYDGU7P6
2020-09-17T12:05:35.593340687Z line 27033: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-4J5AY7EL4YBQGJSK
2020-09-17T12:05:35.593340687Z line 27034: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SVC-ZRA4FC2ZSWYIJTXI
2020-09-17T12:05:35.593340687Z line 27035: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-GTPIKDBCCRVQTJP2

Seeing how/where this is produced:

$ git remote -v
origin	git://git.netfilter.org/iptables.git (fetch)
origin	git://git.netfilter.org/iptables.git (push)
$ git checkout v1.8.4

This function performs the nftables chain deletes and produces the message we
see above when we return -1:

static int __nft_chain_user_del(struct nftnl_chain *c, void *data)
{
	struct chain_user_del_data *d = data;
	struct nft_handle *h = d->handle;
	int ret;

	/* don't delete built-in chain */
	if (nft_chain_builtin(c))
		return d->builtin_err;

	if (d->verbose)
		fprintf(stdout, "Deleting chain `%s'\n",
			nftnl_chain_get_str(c, NFTNL_CHAIN_NAME));

	/* This triggers required policy rule deletion. */
	if (h->family == NFPROTO_BRIDGE)
		nft_build_cache(h, c);

	/* XXX This triggers a fast lookup from the kernel. */
	nftnl_chain_unset(c, NFTNL_CHAIN_HANDLE);
	ret = batch_chain_add(h, NFT_COMPAT_CHAIN_USER_DEL, c);
	if (ret)
		return -1;

	nftnl_chain_list_del(c);
	return 0;
}

This calls nftnl_chain_list_del, which is exported by the netfilter libraries
(include/linux/netfilter/nftables.h) and exposed in a library, which is
libnftnl-1.1.5-4 on the problematic node:

EXPORT_SYMBOL(nftnl_chain_list_del);
void nftnl_chain_list_del(struct nftnl_chain *r)
{
	list_del(&r->head);
	hlist_del(&r->hnode);
}

This is returned from the kernel as EBUSY, likely because something else is
already holding the resource. Perhaps we can identify what that is first and
foremost? It might be another "stuck" iptables-restore or something else
running with nftables/netfilter that is long-processing.

Let me know if I am off base with the above. I'll try to narrow down what
else could be holding the lock/resource.
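Not from the must-gather, just a rough sketch of how I'd try to catch a
concurrent caller on the affected node while the failure reproduces (assumes
only ps and coreutils on the host; the log path is arbitrary):

  # crude sampler: log any iptables/ip6tables/nft processes several times a
  # second so we can see who else is active when kube-proxy's restore fails
  while true; do
      ts=$(date -u +%FT%T.%3NZ)
      ps -eo pid,ppid,stat,comm,args --no-headers \
          | grep -E '[i]ptables|[i]p6tables|[n]ft ' \
          | sed "s|^|$ts |"
      sleep 0.2
  done >> /tmp/xtables-callers.log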
Note the comment regarding an xtables lock mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=1843646#c28
No, you have a good point @Robb. I was told iptables should be able to handle
parallel calls, but that comment specifically states otherwise. I am looking
into what could be locking the tables. There will never be two `restore`
commands called in parallel; however, another `iptables` command could be
locking something else.

The reason I pointed to 1807811 was this comment
https://bugzilla.redhat.com/show_bug.cgi?id=1807811#c5:

"in previous release there was no segfault and when base chains existed, no
errors at all. only on first attempt with no nft tables, error code 4 was
returned if iptables-restore was started in parallel for all but one process."
I have an sosreport from a problematic system. It appears that kube-proxy is
running the iptables-restore, as we suspected, but it is possibly spending a
lot of time executing it:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root     2476413 91.1  0.0 140656 90792 ?        R    13:00   0:30 /usr/sbin/iptables-restore -w 5 --noflush --counters

    PID    PPID USER     GROUP         LWP NLWP START COMMAND         CGROUP
2476121       1 root     root      2476121    2 13:00 conmon          11:blkio:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...
2476175 2476121 root     root      2476175   43 13:00 kube-proxy      12:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...
2476413 2476175 root     root      2476413    1 13:00 iptables-restor 12:cpuset:/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod117c527e_13...

Note the 91% CPU time for the process. This could be just a fluke of when we
captured `ps`, or it could be an indication there is an immense amount of
rules or chains that we are trying to walk. In any case, I think it is likely
one of the following:

- There is an immense amount of chains/rules and it is taking an excessive
  amount of time for iptables-restore to process them
- There is another process vying for the lock iptables-restore is
  getting/releasing as it walks chains/tables
- The iptables-restore itself is blocked behind a spinlock or some other lock
  in an ioctl or other syscall (unlikely, as the state is R and not D)

Perhaps we need to look at this in two parts: one for this specific instance,
to determine why this restore takes so long, and another to ensure concurrency
is being used properly and safely when performing iptables-restore.

As far as why this only happened when Tigera was updated, perhaps Calico
performs some other iptables/nftables operations simultaneously and we are
locking each other out. Just a guess on that last one.
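Not something captured in the sosreport, but a sketch of what I'd run the
next time we catch a restore spinning like this, to see whether it is burning
CPU in userspace or blocked in the kernel (the PID below is the one from the
ps output above, purely illustrative):

  # sample the state of the long-running iptables-restore for ~10 seconds
  pid=2476413   # substitute the PID actually seen on the node
  for i in $(seq 1 10); do
      echo "--- $(date -u +%T) ---"
      grep -E '^(State|voluntary_ctxt|nonvoluntary_ctxt)' /proc/$pid/status
      cat /proc/$pid/wchan; echo          # 0 means not sleeping in the kernel
      cat /proc/$pid/stack 2>/dev/null    # needs root; empty while running in userspace
      sleep 1
  done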
On a problematic node, there are almost 34k lines from iptables:

$ wc -l bad-node-iptables-dump.txt
34872 bad-node-iptables-dump.txt

The number of chains:

$ awk '/^:/' bad-node-iptables-dump.txt | wc -l
8598

And the number of rules:

$ awk '/^\-/' bad-node-iptables-dump.txt | wc -l
26258

Majority of the chains are KUBE named:

$ awk '/^:KUBE/' bad-node-iptables-dump.txt | wc -l
7712

And some are from calico:

$ awk '/^:cali\-/' bad-node-iptables-dump.txt | wc -l
872

There are quite a few empty chains as well:

$ grep -c '[0:0]' bad-node-iptables-dump.txt
28879
Ignore this last part:

(In reply to Robb Manes from comment #20)
> There are quite a few empty chains as well:
>
> $ grep -c '[0:0]' bad-node-iptables-dump.txt
> 28879

That is an invalid count.
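For completeness, here is a quick awk sketch against the same dump to check
how many of the declared chains still have a -j/-g reference from some rule
(if I understand the EBUSY correctly, it can mean a chain is still referenced
elsewhere). Untested against this exact dump, so treat the numbers as
approximate:

  # count declared chains that still have at least one jump/goto reference
  awk '
    /^:/  { chains[substr($1, 2)] = 1 }
    /^-A/ { for (i = 1; i < NF; i++)
              if ($i == "-j" || $i == "-g") refs[$(i + 1)]++ }
    END   { for (c in chains) if (c in refs) n++
            printf "%d of %d declared chains are referenced\n", n, length(chains) }
  ' bad-node-iptables-dump.txt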
Adding note from Tigera: Calico Enterprise and Calico both issue iptables-restore calls.
This looks like a duplicate of bug 1843646.

Legacy iptables (in RHEL7, for instance) used a lockfile in user space to
avoid parallel execution. This is what the '-w' flag enables. With
iptables-nft (RHEL8), no lockfile in user space is needed anymore. Each
ruleset commit to the kernel (either via a 'COMMIT' line in dump files or via
individual commands on the command line) happens in a transaction in kernel
space.

My guess for what's happening here is kube-proxy starts two iptables processes
in background in short succession and the first one takes a while to prepare
the commit (e.g. takes a while to parse the input or to fetch a ruleset cache
from kernel) so that in fact the second call gets to commit first.

The EBUSY return for CHAIN_USER_DEL is emitted by the kernel. Typically that
happens if the to-be-deleted chain is still referenced (jump/goto action) from
another rule.

If you have a system where this is happening on we could try to wrap iptables
binaries with a shell script that does a lockfile-based lock. This should
eliminate any accidental command execution reordering.

Either way, in order to identify the problem it must be reproducible, ideally
outside of OpenShift.
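For reference, a minimal sketch of the kind of wrapper being suggested (the
lock file path is arbitrary, and it assumes the real binary has been moved
aside to a *-real name first):

  #!/bin/sh
  # lockfile-based wrapper sketch: serialize every invocation through a
  # single flock before handing off to the real iptables-restore binary
  exec flock /run/xtables-wrapper.lock /usr/sbin/iptables-restore-real "$@"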
(In reply to Phil Sutter from comment #30)
> My guess for what's happening here is kube-proxy starts two iptables
> processes in background in short succession and the first one takes a while
> to prepare the commit (e.g. takes a while to parse the input or to fetch a
> ruleset cache from kernel) so that in fact the second call gets to commit
> first.

Understood, this is likely a race of concurrent routines within just
kube-proxy, or of concurrently running kube-proxy and calico on the same
system, as they both reference iptables-restore and iptables commands.

> If you have a system where this is happening on we could try to wrap iptables
> binaries with a shell script that does a lockfile-based lock. This should
> eliminate any accidental command execution reordering.

The customer's environment still sees this happening, but as it's being
executed in the context of OpenShift, kube-proxy is running in a container and
the container image is not easily changed out; I see it chroots to the host
and executes the commands directly
(https://github.com/openshift/sdn/blob/master/images/iptables-scripts/iptables-restore).

I see this opens a netlink socket when I strace
`/usr/sbin/iptables-restore -w 5 --noflush --counters` on
iptables-1.8.4-10.el8_2.1. Is there a good place I might use systemtap to
watch for locking/unlocking of the chains so we can track this independent of
userspace, where I can print the pid and comm with args? Or is it a lock
per-chain so it's harder to track?
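In the meantime, one userspace-only angle (it won't identify the committing
PID the way a kernel probe would, but it timestamps every ruleset change while
we reproduce) could be something along these lines, assuming the nft binary is
present on the host:

  # sketch: timestamp every nftables ruleset event seen while reproducing
  nft monitor 2>&1 | while IFS= read -r line; do
      printf '%s %s\n' "$(date -u +%FT%T.%3NZ)" "$line"
  done > /tmp/nft-monitor.log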
(In reply to Robb Manes from comment #26)
> see the comment here:
> https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.

Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
(Though I hadn't noticed Phil was already involved in this bug too so maybe
we should just reassign this to RHEL/iptables? Dunno.)
(In reply to Dan Winship from comment #32)
> (In reply to Robb Manes from comment #26)
> > see the comment here:
> > https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.
>
> Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
> (Though I hadn't noticed Phil was already involved in this bug too so maybe
> we should just reassign this to RHEL/iptables? Dunno.)

I think it's clear bug 1843646 is the root cause. But don't you want this bug
open to track the fix propagation into OCP?
(In reply to Eric Garver from comment #33)
> (In reply to Dan Winship from comment #32)
> > (In reply to Robb Manes from comment #26)
> > > see the comment here:
> > > https://github.com/projectcalico/felix/blob/master/iptables/table.go#L1209-L1212.
> >
> > Yup, that's the problem all right. I reopened bug 1843646 with a reproducer.
> > (Though I hadn't noticed Phil was already involved in this bug too so maybe
> > we should just reassign this to RHEL/iptables? Dunno.)
>
> I think it's clear bug 1843646 is the root cause. But don't you want this
> bug open to track the fix propagation into OCP?

Yes, let's keep this bug open for the OCP side.
*** Bug 1880365 has been marked as a duplicate of this bug. ***
The fixed RPM for RHEL 8.2-based releases is iptables-1.8.4-10.el8_2.4. That fixed RPM should be in 4.4.31, 4.5.21, and 4.6.6.
Per https://bugzilla.redhat.com/show_bug.cgi?id=1885203#c2#c3#c4, this bug can be verified.
I still see this error on a cluster in version 4.6.4, even though the iptables
package is updated to 1.8.4-10 according to bugzilla BZ #1884946

----
% omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.4     True        False         7m18s

% grep iptables installed-rpms
iptables-1.8.4-10.el8_2.1.x86_64                  Fri Nov  6 16:23:18 2020
iptables-libs-1.8.4-10.el8_2.1.x86_64             Fri Nov  6 16:23:18 2020

% grep 'Failed to execute iptables-restore: exit status 4' * -Rn -A 4
2021-01-26T08:37:45.785723453Z E0126 08:37:45.785609       1 proxier.go:1555] Failed to execute iptables-restore: exit status 4 (iptables-restore v1.8.4 (nf_tables): CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-S274SUCD3F25PVEH
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-47c2f751ab0d5ee88e2826749f1372e6a24db3d0c0c942136ae84db17cb7f086/namespaces/openshift-kube-proxy/pods/openshift-kube-proxy-x2tcv/kube-proxy/kube-proxy/logs/current.log-118-2021-01-26T08:37:45.785723453Z )
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256- I0126 08:37:45.785640       1 proxier.go:825] Sync failed; retrying in 30s
1 reflector.go:405] k8s.io/client-go.6/tools/cache/reflector.go:125: watch of *v1.Endpoints ended with: very short watch: k8s.io/client-go.6/tools/cache/reflector.go:125: Unexpected watch close - watch lasted less than a second and no items received
(In reply to tmicheli from comment #66)
> I still see this error on a cluster in version 4.6.4, even though the
> iptables package is updated to 1.8.4-10 according to bugzilla BZ #1884946
>
> ----
> % omg get clusterversion
> NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
> version   4.6.4     True        False         7m18s
>
> % grep iptables installed-rpms
> iptables-1.8.4-10.el8_2.1.x86_64                  Fri Nov  6 16:23:18 2020
> iptables-libs-1.8.4-10.el8_2.1.x86_64             Fri Nov  6 16:23:18 2020

That's because that's the wrong version. This bug is addressed by zstream bug
1891881, fixed in version iptables-1.8.4-10.el8_2.4. The bug you're looking
at, bug 1884946, is for a different issue.
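For anyone checking their own cluster, one quick way to confirm which iptables
build a node actually carries (the node name below is just a placeholder,
substitute one from `oc get nodes`):

  oc debug node/worker-0 -- chroot /host rpm -q iptables iptables-libs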
Thanks for your answer; I mixed them up. Is there a rough timeline for when the updated package will be available in RHCOS and fixes the issue with Calico/Tigera?
Per comment 6, this was fixed in 4.6.6. However, also note that the Calico image includes its own iptables binary, which also needs to be updated: https://bugzilla.redhat.com/show_bug.cgi?id=1909199#c15
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633