Bug 1823374 - RHEL 7.8: cleaning up mounts of Kubernetes service account secrets of deleted pods failing
Summary: RHEL 7.8: cleaning up mounts of Kubernetes service account secrets of deleted...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.3.z
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Kir Kolyshkin
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1820717 1821347
Depends On:
Blocks: 1823885 1828341
 
Reported: 2020-04-13 13:42 UTC by Cesar Wong
Modified: 2024-03-25 15:49 UTC
CC List: 18 users

Fixed In Version: runc-1.0.0-67.rc10.rhaos4.3.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1823885 1828341
Environment:
Last Closed: 2020-11-17 19:51:09 UTC
Target Upstream Version:
Embargoed:
kir: needinfo-




Links:
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:49:51 UTC)

Description Cesar Wong 2020-04-13 13:42:45 UTC
Description of problem:
On a RHEL 7.8 worker, pods are failing to delete.
From the kubelet log: unlinkat kubernetes.io~secret/ibm-keepalived-watcher-token-lfp5h: device or resource busy

Version-Release number of selected component (if applicable):
RHEL 7.8
OCP 4.3.11

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster with RHEL 7.8 workers
2. Try to delete the pods of the `tuned` daemonset:
```
 kubectl delete pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned
pod "tuned-5schs" deleted
pod "tuned-qcdgx" deleted
pod "tuned-t4kc2" deleted
```

Actual results:

Pods will stay stuck in Terminating after being deleted. If you look at the kubelet logs, there will be errors about the tmpfs filesystem containing the service account secrets being busy. The transient Kubernetes mount entry will disappear, along with the mount listing in /proc/self/mountinfo; however, if you run fuser -m on the directory, you will see it is still being referenced, which is what prevents it from being deleted.
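
A way to confirm the symptom on an affected node is sketched below. The paths assume the non-default kubelet --root-dir=/var/data/kubelet seen in the logs later in this report (the default is /var/lib/kubelet), and <pod-uid>/<token-volume> are placeholders:
```
# List secret-volume directories left behind under the pod directories:
find /var/data/kubelet/pods -type d -path '*kubernetes.io~secret*'

# The mount no longer appears in mountinfo...
grep kubernetes.io~secret /proc/self/mountinfo

# ...yet fuser -m still reports processes referencing the filesystem,
# which is what blocks the unlinkat/rmdir during pod cleanup:
fuser -m /var/data/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<token-volume>
```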

Expected results:
Pods are deleted successfully

Additional info:

Comment 2 Ryan Phillips 2020-04-13 16:54:04 UTC
You will need to disable the Node Tuning Operator and its daemonset to be able to delete the tuned pods, since the operator (and the daemonset) will restore the deleted pods.

I tested this on 4.3 on RHCOS and it's working as expected. Perhaps the tuned team can outline the procedures to disable the operator.

Comment 3 Tyler Lisowski 2020-04-13 17:23:30 UTC
This is specifically in regard to RHEL 7 machines, not Red Hat CoreOS (RHCOS).

Red Hat recently released RHEL 7.8, as noted in these release notes:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.8_release_notes/index

We are using these OCP 4 RHEL 7 packages:
https://access.redhat.com/downloads/content/290/ver=4.3/rhel---7/4.3.10/x86_64/packages

I believe this new kernel introduced a regression in OCP 4 for RHEL 7 workers:

https://github.com/kubernetes/kubernetes/blob/v1.16.2/pkg/volume/secret/secret.go#L326
pkg/volume/secret/secret.go:326
    return volumeutil.UnmountViaEmptyDir(dir, c.plugin.host, c.volName, wrappedVolumeSpec(), c.podUID)

https://github.com/kubernetes/kubernetes/blob/c97fe5036ef3df2967d086711e6c0c405941e14b/pkg/volume/emptydir/empty_dir.go#L391
pkg/volume/emptydir/empty_dir.go:391
    func (ed *emptyDir) TearDownAt(dir string) error {

https://github.com/kubernetes/kubernetes/blob/c97fe5036ef3df2967d086711e6c0c405941e14b/pkg/util/mount/mount_linux.go#L203
pkg/util/mount/mount_linux.go:203


The above is the specific code path that is failing. It's something around tmpfs directories and how they are mounted into the container namespaces.

Comment 4 Tyler Lisowski 2020-04-13 17:26:41 UTC
Note regarding my fuser command above:

That mount is for one pod's service account token. However, by tracing the processes that hold references to that mount, you can see that unrelated pods also have mounts of that tmpfs filesystem. That should not be the case: only one pod should be actively mounting that directory, so that when it is stopped the directory can be cleaned up. Instead, since multiple pods mount it, the cleanup throws device-busy errors.

Comment 5 Tyler Lisowski 2020-04-13 17:27:18 UTC
You will likely only be able to replicate this on RHEL 7 machines, not Red Hat CoreOS machines.

Comment 6 Peter Hunt 2020-04-13 18:10:14 UTC
I believe Ryan meant to set this to the Node component, though looking at it, it seems more like a RHEL 7.x problem than a CRI-O or kubelet one.

Comment 7 Peter Hunt 2020-04-13 18:55:34 UTC
I just read the comments and realized Ryan actually set it correctly. Sorry for the noise

Comment 8 Jiří Mencák 2020-04-14 07:49:34 UTC
If this is about disabling the Node Tuning Operator, cannot clusterversion's "overrides:" be used?
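
(For reference, a rough sketch of what that could look like; the resource names and namespace below are assumptions for illustration, not verified against this cluster:)
```
# Tell the CVO to stop managing the tuning operator's Deployment:
oc patch clusterversion version --type=merge -p '{"spec":{"overrides":[{"kind":"Deployment","group":"apps","name":"cluster-node-tuning-operator","namespace":"openshift-cluster-node-tuning-operator","unmanaged":true}]}}'

# Then scale the operator down so it stops recreating the tuned daemonset:
oc scale deployment/cluster-node-tuning-operator -n openshift-cluster-node-tuning-operator --replicas=0
```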

Comment 9 Tyler Lisowski 2020-04-14 15:56:45 UTC
Hi:

This is not about disabling the tuning operator. That is just an example of a pod deployed by default that hits this error. It's a regression in the pod-deletion workflow of OpenShift 4 on RHEL 7.8 workers.



The following is the case:

After the initial deletion, various host-networking pods stay in the Terminating state, and the kubelet fails to remove them because it cannot successfully remove the tmpfs directory used to store the service account token. When the deletion is attempted, it logs the following errors:
```
Apr 14 10:49:03 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: E0414 10:49:03.249073    7893 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\" (\"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\")" failed. No retries permitted until 2020-04-14 10:51:05.249040815 -0500 CDT m=+6260.792520516 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"tuned-token-cfvt5\" (UniqueName: \"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\") pod \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\" (UID: \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\") : unlinkat /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: device or resource busy"
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: I0414 10:51:05.252640    7893 reconciler.go:182] operationExecutor.UnmountVolume started for volume "tuned-token-cfvt5" (UniqueName: "kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5") pod "71125e08-6fb2-43d2-93d7-2bdb5992b1d8" (UID: "71125e08-6fb2-43d2-93d7-2bdb5992b1d8")
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: W0414 10:51:05.252877    7893 empty_dir.go:421] Warning: Failed to clear quota on /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: ClearQuota called, but quotas disabled
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: E0414 10:51:05.253029    7893 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\" (\"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\")" failed. No retries permitted until 2020-04-14 10:53:07.252994574 -0500 CDT m=+6382.796474294 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"tuned-token-cfvt5\" (UniqueName: \"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\") pod \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\" (UID: \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\") : unlinkat /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: device or resource busy"
```

These errors come from the kubelet in both configurations. If you use fuser -m to look at all the processes using the mount after the pod has been deleted, it will show something like this:
```
[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# fuser -m /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5
/var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5:  7523  7893  7997c  8010m  8031c  8043m  8067m  8074c  8087m  8176c  8184c  8223c  8231c  8245c  8264c  8276m  8286c  8295m  8329m  8330m  8332m  8383m  8396m  8491c  8502m  8773c  8786cm  9160c  9172m  9195m  9290c  9302m  9585c  9597m  9619m  9641c  9654m  9922c  9935m 10130c 10144m 10207m 10208m 10209m 10210m 10212m 10213m 10321m 10322m 10386c 10398m 10580c 10594m 10645c 10665c 10670c 10700m 10703m 10709m 11425c 11439m 11520c 11533m 11643c 11668m 11795c 11810m 42502c 42516m 42688c 42701m 42846c 42859m 42897c 42910m 42933c 42946m 43017m 43075c 43088m 43180c 43192m 43230c 43242m 43286m 43344c 43357m 43392c 43417cm 123495m 127810m 127980m 128161m 128443m 128558m 129126m

[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# cat /proc/7893/cmdline 
/usr/bin/hyperkubekubelet--enable-controller-attach-detach=false--root-dir=/var/data/kubelet--cgroup-driver=systemd--provider-id=ibm://048412671ddeb8ad98e6a614bc99fc8f///bolmgs1104j06md39q20/pres-bolmgs1104j06md39q20-openshift4b-reg-00000827--cloud-provider=external--cluster-dns=172.21.0.10--cluster-domain=cluster.local--feature-gates=ExpandInUsePersistentVolumes=true,LegacyNodeRoleBehavior=false,NodeDisruptionExclusion=false,ServiceNodeExclusion=false,SCTPSupport=false--pod-manifest-path=/etc/kubernetes/manifests--kubeconfig=/etc/kubernetes/kubelet-kubeconfig--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305--v=2--file-check-frequency=5s--hostname-override=10.94.16.227--anonymous-auth=false--client-ca-file=/etc/kubernetes/cert/ca.pem--read-only-port=0--network-plugin=cni--cni-conf-dir=/etc/kubernetes/cni/net.d/--cni-bin-dir=/var/lib/cni/bin--tls-cert-file=/etc/kubernetes/cert/kubelet.pem--tls-private-key-file=/etc/kubernetes/cert/kubelet-key.pem--authorization-mode=Webhook--authentication-token-webhook--container-runtime=remote--runtime-request-timeout=15m--container-runtime-endpoint=/var/run/crio/crio.sock--feature-gates=CRIContainerLogRotation=true--container-log-max-size=100Mi--container-log-max-files=3--streaming-connection-idle-timeout=30m--event-qps=0--kube-reserved-cgroup=/podruntime.slice--system-reserved-cgroup=/system.slice--pod-max-pids=44406--kube-reserved=memory=1044Mi,cpu=36m,pid=12688--system-reserved=memory=1567Mi,cpu=54m,pid=12688--kubelet-cgroups=/podruntime.slice--runtime-cgroups=/podruntime.slice--enforce-node-allocatable=pods--kube-api-qps=20--kube-api-burst=40--eviction-soft=memory.available<100Mi,nodefs.available<10%,imagefs.available<10%,nodefs.inodesFree<10%,imagefs.inodesFree<10--eviction-soft-grace-period=memory.available=10m,nodefs.available=10m,imagefs.available=10m,nodefs.inodesFree=10m,imagefs.inodesFree=10m--eviction-hard=memory.available<100Mi,nodefs.available<5%,imagefs.available<5%,nodefs.inodesFree<5%,imagefs.inodesFree<5


[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# cat /proc/10665/cmdline 
/usr/libexec/crio/conmon-s-ce795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618-nk8s_POD_dns-default-9mm8w_openshift-dns_dd3d6479-e22d-4e4b-b31f-da27e4091ec1_0-ue795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618-r/usr/bin/runc-b/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata--persist-dir/var/data/criorootstorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata-p/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata/pidfile-P/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata/conmon-pidfile-l/var/log/pods/openshift-dns_dns-default-9mm8w_dd3d6479-e22d-4e4b-b31f-da27e4091ec1/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618.log--exit-dir/var/run/crio/exits--socket-dir-path/var/run/crio--log-levelerror--runtime-arg--root=/run/runc[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# 
```

However, nothing should be holding on to the mount at this point, since the pod is deleted. The processes holding onto the mount are blocking the deletion of the pod.

Comment 10 Jiří Mencák 2020-04-14 16:10:36 UTC
In that case I'm not sure why this is in the Node Tuning Operator component.

Comment 11 Peter Hunt 2020-04-14 17:52:05 UTC
I duplicated this to the Kernel component to have them start looking at it. In the meantime, I'm trying to get access to the node (or a 7.8 environment) to try to rule out CRI-O as the culprit.

Comment 12 Peter Hunt 2020-04-14 19:10:15 UTC
I have done some digging on a node, and I'm fairly certain this is not CRI-O (it may be kubelet, but it seems more likely a kernel problem):
https://bugzilla.redhat.com/show_bug.cgi?id=1823885#c4

crictl, runc, and ps all report that the container is gone. The kubelet reports the volume cannot be unmounted, and fuser reports that every container process has a reference to that mount. I'm fairly certain the tuned mount is not actually mounted into every container, nor held open by CRI-O and the kubelet. So I'm pretty sure there's a bug in the way the kernel reports processes accessing a mount.

Comment 13 Peter Hunt 2020-04-14 21:01:25 UTC
Duplicating this bug is making it unclear which one to provide updates to...

https://bugzilla.redhat.com/show_bug.cgi?id=1823885#c7

We haven't ruled out this being the Node team's problem, but there are also mitigations possible in the pod specs (specifying mountPropagation for each mount of `/`).
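
A minimal sketch of that pod-spec mitigation (a hypothetical pod for illustration; mountPropagation: HostToContainer corresponds to rslave propagation):
```
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: propagation-demo            # hypothetical name, for illustration only
spec:
  containers:
  - name: demo
    image: registry.access.redhat.com/ubi7/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: host-root
      mountPath: /host
      readOnly: true
      # HostToContainer (rslave): unmount events on the host propagate into
      # the container, so host-side volume teardown does not hit EBUSY.
      mountPropagation: HostToContainer
  volumes:
  - name: host-root
    hostPath:
      path: /
EOF
```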

Ultimately, we need to know if this also works in 7.7, and if so, we need to figure out why the behaviour differs.

Comment 14 Tyler Lisowski 2020-04-14 21:28:05 UTC
We are currently trying to revert libraries to find the issue.

We originally tried reverting
```
  shell: 'yum downgrade kernel-3.10.0-1062.18.1.el7 kernel-tools-3.10.0-1062.18.1.el7 kernel-tools-libs-3.10.0-1062.18.1.el7 kernel-headers-3.10.0-1062.18.1.el7 kernel-devel-3.10.0-1062.18.1.el7   systemd-219-67.el7_7.4 systemd-libs-219-67.el7_7.4 systemd-sysv-219-67.el7_7.4 libgudev1-219-67.el7_7.4  glibc-2.17-292.el7 glibc-common-2.17-292.el7  -y'
```

We reverted glibc, systemd, and the kernel but can still replicate the issue; we are now looking at reverting device-mapper.

Comment 15 Tyler Lisowski 2020-04-15 00:50:31 UTC
I believe we found the issue:


```
2020-02-13 Jindrich Novy <jnovy> - 1.0.0-67.rc10
- address CVE-2019-19921 by updating to rc10
- Resolves: #1802599

```

The new runc released in RHEL 7.8 causes this regression, specifically this patch for CVE-2019-19921.



Reverting back to 
```
2020-03-02 Jindrich Novy <jnovy> - 1.0.0-66.rc8
- fix "CVE-2019-19921 runc: volume mount race condition with shared mounts leads to information leak/integrity manipulation"
- Resolves: #1803100
```

fixes the error, and it is no longer reproducible. It was not related to any of the other packages.

Comment 16 Daniel Walsh 2020-04-15 13:10:43 UTC
This does feel like the classic leaked mount points problem we used to have with docker.

Comment 17 Kir Kolyshkin 2020-04-15 20:19:59 UTC
OK, what we have here looks like a mount that is leaked into another mount namespace which is private. When such a namespace is created (say, by having a container add a local volume (i.e. a bind mount) of the host / with private propagation), all the mounts that exist on the host will appear inside the new mount namespace.

Now, when someone tries to unmount this on the host, they will get EBUSY, because this mount also exists in that other mount namespace and, since the propagation is set to private, the unmount event is not propagated to that other mount namespace.

There are two solutions and one workaround I am aware of.

Solution #1: DO NOT USE PRIVATE mount propagation. Using RSLAVE will result in the same behavior, except that unmount will be possible.

Solution #2: add an "unmount all the mounts we don't need" step to the container that is doing the host mount with private propagation.

Workaround: use MNT_DETACH on the host, followed by the mount point removal.

Now, this workaround was not originally part of the RHEL 7 kernel. It was backported some time later, and with a twist: to enable it, you need to write 1 to the fs.may_detach_mounts sysctl.

It looks like the runc rpm was amended to set that sysctl; see commit http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=1ea6227e0863ae79b8a7ec9c4543996ce426f6e5

The bug we see might be caused by a reversal of the above addition; it was reverted in http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=4250b1f539bfde82ac7e33d9a1385286975d9915
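
A manual, admin-side sketch of the MNT_DETACH workaround described above (the paths are placeholders; the lazy unmount only succeeds once the sysctl is enabled):
```
# Enable detaching mounts that are still referenced elsewhere (RHEL 7 backport):
sysctl -w fs.may_detach_mounts=1

# umount --lazy performs MNT_DETACH, after which the mountpoint directory can
# be removed even though other mount namespaces still hold references to it:
umount --lazy /var/data/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<token-volume>
rmdir /var/data/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<token-volume>
```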

Comment 18 Kir Kolyshkin 2020-04-15 20:22:36 UTC
To test whether the above hypothesis holds water:

1. # cat /proc/sys/fs/may_detach_mounts

(should show 0 with recent runc)

2. # echo 1 > /proc/sys/fs/may_detach_mounts

(to temporarily enable the MNT_DETACH + mountpoint removal feature)

3. Try to reproduce the issue.

Comment 19 Kir Kolyshkin 2020-04-15 21:40:11 UTC
Detailed explanation and how to check if my hypothesis is correct here: https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17

Comment 20 Kir Kolyshkin 2020-04-15 21:42:16 UTC
(never mind the previous comment -- wrong window)

Comment 21 Jindrich Novy 2020-04-16 16:07:27 UTC
Kir, can you please check whether it is fixed by:

https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27988483

Comment 22 Mrunal Patel 2020-04-16 21:29:43 UTC
Tyler, are you able to test with the version Jindrich posted?

Comment 23 Mrunal Patel 2020-04-16 21:30:05 UTC
Actually, I am not sure if you are able to access it.

Comment 24 Tyler Lisowski 2020-04-17 15:57:42 UTC
Just to be explicit: I don't have access to the link

Comment 26 Mrunal Patel 2020-04-17 17:56:02 UTC
Tyler, could you test by setting sysctl fs.may_detach_mounts=1 with runc 1.0.0-67.rc10 on your RHEL 7 nodes?

Comment 30 Tyler Lisowski 2020-04-17 21:15:42 UTC
Testing setting `fs.may_detach_mounts=1` and runc-1.0.0-67.rc10.rhaos4.3.el7_8.x86_64.rpm

Comment 31 Tyler Lisowski 2020-04-18 21:52:28 UTC
Tests look good when I update the automation to ensure fs.may_detach_mounts equals 1 with sysctl.conf and install the runc package.
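
For example, roughly (whether you use /etc/sysctl.conf or a drop-in under /etc/sysctl.d/ depends on the automation):
```
# Persist the setting across reboots and apply it immediately:
echo 'fs.may_detach_mounts = 1' >> /etc/sysctl.conf
sysctl -p
cat /proc/sys/fs/may_detach_mounts   # should print 1
```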

Comment 32 Tyler Lisowski 2020-04-18 21:52:44 UTC
Tests look good when I update the automation to ensure fs.may_detach_mounts equals 1 with sysctl.conf and install the runc package.

Comment 33 Kir Kolyshkin 2020-04-22 20:08:49 UTC
*** Bug 1820717 has been marked as a duplicate of this bug. ***

Comment 34 MinLi 2020-04-23 07:01:21 UTC
Verified on version 4.4.0-0.nightly-2020-04-20-133922:

sh-4.2# rpm -qa | grep runc
runc-1.0.0-68.rc10.rhaos4.4.el7_8.x86_64

sh-4.2# cat /proc/sys/fs/may_detach_mounts
1

Comment 36 errata-xmlrpc 2020-05-04 11:49:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 37 Ryan Phillips 2020-05-11 17:27:27 UTC
*** Bug 1821347 has been marked as a duplicate of this bug. ***

Comment 44 Kir Kolyshkin 2020-10-28 01:11:31 UTC
@mirollin Sorry it took me so long to reply; I was on PTO / busy with other high-priority tasks.

I believe the bug is fixed by re-adding the sysctl setting. You can verify it on a running system by merely doing cat /proc/sys/fs/may_detach_mounts. If it shows 1, the bug is fixed. If it shows 0, you might have an older version of runc package.

All in all, I think you should file a new bug and specify the versions of all components, what is going on, etc.; otherwise we have all the eggs in one basket and it's hard to make heads or tails of it.

Comment 45 Mitchell Rollinson 2020-11-04 02:23:06 UTC
Thanks Kir,

I have conveyed this to the customer and am currently awaiting feedback.
Checking one of their nodes, they could see that /proc/sys/fs/may_detach_mounts was set to 0, so this could be the issue/fix.

Will update you as soon as I hear back.

Rgds

