Description of problem:
On a RHEL 7.8 worker, pods are failing to delete. From the log:
unlinkat kubernetes.io~secret/ibm-keepalived-watcher-token-lfp5h: device or resource busy

Version-Release number of selected component (if applicable):
RHEL 7.8
OCP 4.3.11

How reproducible:
Always

Steps to Reproduce:
1. Create a cluster with RHEL 7.8 workers
2. Try to delete the pods of the `tuned` daemonset:
```
kubectl delete pods -n openshift-cluster-node-tuning-operator -l openshift-app=tuned
pod "tuned-5schs" deleted
pod "tuned-qcdgx" deleted
pod "tuned-t4kc2" deleted
```

Actual results:
Pods stay stuck in Terminating. The kubelet logs show errors about the tmpfs filesystem containing the service account secrets being busy. The transient Kubernetes mount file disappears, along with the mount entry in /proc/self/mountinfo, but if you run fuser -m on the directory you will see it is still being referenced, which prevents it from being deleted.

Expected results:
Pods are deleted successfully.

Additional info:
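For reference, a hedged sketch of commands to observe the symptom on an affected worker. The kubelet root directory, pod UID, and volume name below are placeholders (this cluster runs the kubelet with a non-default --root-dir=/var/data/kubelet), so adjust them for your node:
```
# Illustrative only -- substitute your kubelet root dir, pod UID, and token volume name.
journalctl -u kubelet | grep 'device or resource busy'    # unlinkat ... kubernetes.io~secret ... busy
grep 'kubernetes.io~secret' /proc/self/mountinfo          # the mount entry is already gone
fuser -m /var/data/kubelet/pods/<pod-uid>/volumes/kubernetes.io~secret/<token-volume>   # ...yet processes still reference it
```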
You will need to disable the Node Tuning Operator and its daemonset to be able to delete the tuned pods, since the operator (and the daemonset) will restore the deleted pods. I tested this on 4.3 on RHCOS and it works as expected. Perhaps the tuned team can outline the procedure to disable the operator.
This is specifically in regard to RHEL 7 machines, not Red Hat CoreOS. RHEL 7.8 was recently released, as noted in these release notes: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.8_release_notes/index We are using these OCP 4 RHEL 7 packages: https://access.redhat.com/downloads/content/290/ver=4.3/rhel---7/4.3.10/x86_64/packages I believe this new kernel caused a regression in OCP 4 for RHEL 7 workers. The failing code path is:

https://github.com/kubernetes/kubernetes/blob/v1.16.2/pkg/volume/secret/secret.go#L326
pkg/volume/secret/secret.go:326
return volumeutil.UnmountViaEmptyDir(dir, c.plugin.host, c.volName, wrappedVolumeSpec(), c.podUID)

https://github.com/kubernetes/kubernetes/blob/c97fe5036ef3df2967d086711e6c0c405941e14b/pkg/volume/emptydir/empty_dir.go#L391
pkg/volume/emptydir/empty_dir.go:391
func (ed *emptyDir) TearDownAt(dir string) error {

https://github.com/kubernetes/kubernetes/blob/c97fe5036ef3df2967d086711e6c0c405941e14b/pkg/util/mount/mount_linux.go#L203

The above is the specific code path that is failing. It is something about tmpfs directories and how they are mounted into the container namespaces.
Note that in my fuser command above, that mount is for one pod's service account token. However, tracing the processes that have references to that mount shows that unrelated pods also hold mounts of that tmpfs filesystem. That should not be the case: only one pod should be actively mounting that directory, so that when it is stopped the directory can be cleaned up. Instead, since multiple pods mount it, it throws device-busy errors.
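To see which processes and mount namespaces still pin the volume, something like the following can help. The path is the one from this report; the per-PID loop is my own suggestion for digging further, not an official procedure:
```
# Illustrative sketch: for each PID reported by fuser, show its mount namespace and whether
# the secret tmpfs still appears in that namespace's mount table.
VOL=/var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5
for pid in $(fuser -m "$VOL" 2>/dev/null); do
    ns=$(readlink "/proc/$pid/ns/mnt" 2>/dev/null)
    hits=$(grep -c 'kubernetes.io~secret/tuned-token-cfvt5' "/proc/$pid/mountinfo" 2>/dev/null)
    echo "pid=$pid mnt_ns=$ns mounts_of_volume=$hits"
done
```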
You will likely only be able to replicate this on RHEL 7 machines, not Red Hat CoreOS machines.
I believe Ryan meant to set this to the Node component, though looking at it, it seems like more of a RHEL 7.x problem than a CRI-O or kubelet one.
I just read the comments and realized Ryan actually set it correctly. Sorry for the noise
If this is about disabling the Node Tuning Operator, cannot clusterversion's "overrides:" be used?
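For anyone who does need to take the operator out of the picture while debugging, a hedged sketch of the clusterversion "overrides:" approach is below. The deployment name and namespace are assumptions and should be confirmed on the cluster (e.g. with `oc get deploy -n openshift-cluster-node-tuning-operator`):
```
# Sketch only: mark the Node Tuning Operator deployment as unmanaged by the CVO,
# then scale it down so it stops recreating the tuned daemonset. Names are assumptions.
oc patch clusterversion version --type merge -p '{
  "spec": {
    "overrides": [
      {
        "kind": "Deployment",
        "group": "apps",
        "namespace": "openshift-cluster-node-tuning-operator",
        "name": "cluster-node-tuning-operator",
        "unmanaged": true
      }
    ]
  }
}'
oc scale deploy/cluster-node-tuning-operator -n openshift-cluster-node-tuning-operator --replicas=0
# The tuned daemonset can then be deleted without being restored by the operator.
```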
Hi: This is not about disabling the tuning operator; that is just an example of a pod deployed by default that hits this error. It's a regression in the pod deletion workflow of OpenShift 4 on RHEL 7.8 workers. What happens is the following: after the initial deletion, various host-networking pods stay in Terminating state and the kubelet fails to remove them because it cannot remove the tmpfs directory that is used to store the service account token. When the deletion is attempted, the kubelet logs errors like these:
```
Apr 14 10:49:03 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: E0414 10:49:03.249073 7893 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\" (\"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\")" failed. No retries permitted until 2020-04-14 10:51:05.249040815 -0500 CDT m=+6260.792520516 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"tuned-token-cfvt5\" (UniqueName: \"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\") pod \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\" (UID: \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\") : unlinkat /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: device or resource busy"
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: I0414 10:51:05.252640 7893 reconciler.go:182] operationExecutor.UnmountVolume started for volume "tuned-token-cfvt5" (UniqueName: "kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5") pod "71125e08-6fb2-43d2-93d7-2bdb5992b1d8" (UID: "71125e08-6fb2-43d2-93d7-2bdb5992b1d8")
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: W0414 10:51:05.252877 7893 empty_dir.go:421] Warning: Failed to clear quota on /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: ClearQuota called, but quotas disabled
Apr 14 10:51:05 pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 kubelet.service: E0414 10:51:05.253029 7893 nestedpendingoperations.go:270] Operation for "\"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\" (\"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\")" failed. No retries permitted until 2020-04-14 10:53:07.252994574 -0500 CDT m=+6382.796474294 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"tuned-token-cfvt5\" (UniqueName: \"kubernetes.io/secret/71125e08-6fb2-43d2-93d7-2bdb5992b1d8-tuned-token-cfvt5\") pod \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\" (UID: \"71125e08-6fb2-43d2-93d7-2bdb5992b1d8\") : unlinkat /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: device or resource busy"
```
These errors come from the kubelet in both configurations.
If you use fuser -m to look at all the processes using the mount after it has been deleted, it shows something like this:
```
[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# fuser -m /var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5
/var/data/kubelet/pods/71125e08-6fb2-43d2-93d7-2bdb5992b1d8/volumes/kubernetes.io~secret/tuned-token-cfvt5: 7523 7893 7997c 8010m 8031c 8043m 8067m 8074c 8087m 8176c 8184c 8223c 8231c 8245c 8264c 8276m 8286c 8295m 8329m 8330m 8332m 8383m 8396m 8491c 8502m 8773c 8786cm 9160c 9172m 9195m 9290c 9302m 9585c 9597m 9619m 9641c 9654m 9922c 9935m 10130c 10144m 10207m 10208m 10209m 10210m 10212m 10213m 10321m 10322m 10386c 10398m 10580c 10594m 10645c 10665c 10670c 10700m 10703m 10709m 11425c 11439m 11520c 11533m 11643c 11668m 11795c 11810m 42502c 42516m 42688c 42701m 42846c 42859m 42897c 42910m 42933c 42946m 43017m 43075c 43088m 43180c 43192m 43230c 43242m 43286m 43344c 43357m 43392c 43417cm 123495m 127810m 127980m 128161m 128443m 128558m 129126m
[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# cat /proc/7893/cmdline
/usr/bin/hyperkubekubelet--enable-controller-attach-detach=false--root-dir=/var/data/kubelet--cgroup-driver=systemd--provider-id=ibm://048412671ddeb8ad98e6a614bc99fc8f///bolmgs1104j06md39q20/pres-bolmgs1104j06md39q20-openshift4b-reg-00000827--cloud-provider=external--cluster-dns=172.21.0.10--cluster-domain=cluster.local--feature-gates=ExpandInUsePersistentVolumes=true,LegacyNodeRoleBehavior=false,NodeDisruptionExclusion=false,ServiceNodeExclusion=false,SCTPSupport=false--pod-manifest-path=/etc/kubernetes/manifests--kubeconfig=/etc/kubernetes/kubelet-kubeconfig--tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305--v=2--file-check-frequency=5s--hostname-override=10.94.16.227--anonymous-auth=false--client-ca-file=/etc/kubernetes/cert/ca.pem--read-only-port=0--network-plugin=cni--cni-conf-dir=/etc/kubernetes/cni/net.d/--cni-bin-dir=/var/lib/cni/bin--tls-cert-file=/etc/kubernetes/cert/kubelet.pem--tls-private-key-file=/etc/kubernetes/cert/kubelet-key.pem--authorization-mode=Webhook--authentication-token-webhook--container-runtime=remote--runtime-request-timeout=15m--container-runtime-endpoint=/var/run/crio/crio.sock--feature-gates=CRIContainerLogRotation=true--container-log-max-size=100Mi--container-log-max-files=3--streaming-connection-idle-timeout=30m--event-qps=0--kube-reserved-cgroup=/podruntime.slice--system-reserved-cgroup=/system.slice--pod-max-pids=44406--kube-reserved=memory=1044Mi,cpu=36m,pid=12688--system-reserved=memory=1567Mi,cpu=54m,pid=12688--kubelet-cgroups=/podruntime.slice--runtime-cgroups=/podruntime.slice--enforce-node-allocatable=pods--kube-api-qps=20--kube-api-burst=40--eviction-soft=memory.available<100Mi,nodefs.available<10%,imagefs.available<10%,nodefs.inodesFree<10%,imagefs.inodesFree<10--eviction-soft-grace-period=memory.available=10m,nodefs.available=10m,imagefs.available=10m,nodefs.inodesFree=10m,imagefs.inodesFree=10m--eviction-hard=memory.available<100Mi,nodefs.available<5%,imagefs.available<5%,nodefs.inodesFree<5%,imagefs.inodesFree<5
[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]# cat /proc/10665/cmdline
/usr/libexec/crio/conmon-s-ce795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618-nk8s_POD_dns-default-9mm8w_openshift-dns_dd3d6479-e22d-4e4b-b31f-da27e4091ec1_0-ue795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618-r/usr/bin/runc-b/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata--persist-dir/var/data/criorootstorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata-p/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata/pidfile-P/var/data/crioruntimestorage/overlay-containers/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618/userdata/conmon-pidfile-l/var/log/pods/openshift-dns_dns-default-9mm8w_dd3d6479-e22d-4e4b-b31f-da27e4091ec1/e795f5110d56b6e91e94ebe53328ee5460fdf9b1b0ece24fdd73987100326618.log--exit-dir/var/run/crio/exits--socket-dir-path/var/run/crio--log-levelerror--runtime-arg--root=/run/runc
[root@pres-bolmgs1104j06md39q20-openshift4b-reg-00000827 ~]#
```
However, nothing should be holding on to the mount at this point, since the pod is deleted. These processes holding onto the mount are blocking the deletion of the pod.
In that case I'm not sure why this is in the Node Tuning Operator component.
I duplicated this bug to the Kernel component to have them start looking at it. In the meantime, I'm trying to get access to the node (or a 7.8 env) to try to rule out CRI-O as the culprit.
I have done some digging on a node, and I'm fairly certain this is not CRI-O (it may be the kubelet, but it seems more likely a kernel problem): https://bugzilla.redhat.com/show_bug.cgi?id=1823885#c4 crictl, runc and ps all report the container is gone. The kubelet reports the volume cannot be unmounted, and fuser reports that every container process has access to that mount. I'm fairly certain the tuned mount is not actually mounted into every container, nor opened by CRI-O and the kubelet. So I'm pretty sure there's a bug in the way the kernel is reporting processes accessing a mount.
Duping this bug is making it unclear which one to provide updates to... https://bugzilla.redhat.com/show_bug.cgi?id=1823885#c7 We haven't ruled out it being the Node team's problem, but there are also mitigations possible in the pod specs (specifying mountPropagation for each mount of `/`; see the sketch below). Ultimately, we need to know if this also works in 7.7, and if so we need to figure out why the behaviour differs.
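As an illustration of the mountPropagation mitigation mentioned above (not taken from this report): a pod that bind-mounts the host root can request HostToContainer (rslave) propagation instead of the default private propagation, so host-side unmount events reach the container's mount namespace. The pod name and image below are made up for the example:
```
oc apply -f - <<'EOF'
# Hypothetical example pod: mounts the host root with HostToContainer propagation so that
# unmount events on the host propagate into the container mount namespace.
apiVersion: v1
kind: Pod
metadata:
  name: hostroot-example
spec:
  containers:
  - name: shell
    image: registry.access.redhat.com/ubi7/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: host-root
      mountPath: /host
      mountPropagation: HostToContainer
  volumes:
  - name: host-root
    hostPath:
      path: /
EOF
```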
We are currently trying to revert libraries to find the issue. We originally tried reverting:
```
shell: 'yum downgrade kernel-3.10.0-1062.18.1.el7 kernel-tools-3.10.0-1062.18.1.el7 kernel-tools-libs-3.10.0-1062.18.1.el7 kernel-headers-3.10.0-1062.18.1.el7 kernel-devel-3.10.0-1062.18.1.el7 systemd-219-67.el7_7.4 systemd-libs-219-67.el7_7.4 systemd-sysv-219-67.el7_7.4 libgudev1-219-67.el7_7.4 glibc-2.17-292.el7 glibc-common-2.17-292.el7 -y'
```
i.e. glibc, systemd, and the kernel, but we can still replicate the issue. We are now looking at reverting device-mapper.
I believe we found the issue:
```
2020-02-13 Jindrich Novy <jnovy> - 1.0.0-67.rc10
- address CVE-2019-19921 by updating to rc10
- Resolves: #1802599
```
The new runc released in RHEL 7.8 causes this regression, specifically the patch for CVE-2019-19921. Reverting back to
```
2020-03-02 Jindrich Novy <jnovy> - 1.0.0-66.rc8
- fix "CVE-2019-19921 runc: volume mount race condition with shared mounts leads to information leak/integrity manipulation"
- Resolves: #1803100
```
fixes the error and it is no longer reproducible. It was not related to any of the other packages.
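For reference, a hedged sketch of the check and temporary downgrade used while debugging; the exact NVRs available to you may differ by repo/channel, so treat these as assumptions:
```
# Illustrative only -- verify the full runc NVRs available in your repos before downgrading.
rpm -q runc                          # affected build: runc-1.0.0-67.rc10.* (RHEL 7.8)
yum downgrade -y runc-1.0.0-66.rc8   # temporary workaround until a fixed build is available
```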
This does feel like the classic leaked mount points problem we used to have with docker.
OK, what we have here looks like a mount that is leaked into another mount namespace which is private. When such a namespace is created (say, by having a container add a local volume, i.e. a bind mount, of the host / with private propagation), all the mounts that exist on the host will appear inside the new mount namespace. Now, when someone tries to unmount one of these on the host, they will get EBUSY because the mount is also present in that other mount namespace and, since the propagation is set to private, the unmount event is not propagated there. There are two solutions and one workaround I am aware of (a small reproduction sketch follows below).

Solution #1: DO NOT USE PRIVATE mount propagation. Using RSLAVE will result in the same behavior, except that unmounting will be possible.

Solution #2: add an "unmount all the mounts we don't need" step to the container that is doing the host mount with private propagation.

Workaround: use MNT_DETACH on the host, followed by removal of the mount point. Now, this workaround was not originally part of the RHEL 7 kernel. It was backported some time later, and with a twist -- to enable it, you need to write 1 to the fs.may_detach_mounts sysctl. It looks like the runc rpm was amended to ship that setting, see commit: http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=1ea6227e0863ae79b8a7ec9c4543996ce426f6e5 The bug we see might be caused by the reversal of the above addition; it was reverted in http://pkgs.devel.redhat.com/cgit/rpms/runc/commit/?h=extras-rhel-7.8&id=4250b1f539bfde82ac7e33d9a1385286975d9915
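A minimal sketch of the mechanism described above, assuming a root shell on the host and util-linux unshare; the paths are made up and this is not from the original report:
```
# Sketch: a tmpfs mountpoint that has been copied into a private mount namespace cannot be
# rmdir'ed on the host while fs.may_detach_mounts=0.
mkdir -p /tmp/secret-vol
mount -t tmpfs tmpfs /tmp/secret-vol                  # stand-in for the pod's secret tmpfs

# New mount namespace with private propagation; it inherits a copy of /tmp/secret-vol,
# roughly what a container bind-mounting / with private propagation does.
unshare --mount sh -c 'mount --make-rprivate / && sleep 600' &
sleep 1

umount /tmp/secret-vol                                # host-side unmount succeeds
rmdir /tmp/secret-vol                                 # EBUSY: still a mountpoint in the other ns

echo 1 > /proc/sys/fs/may_detach_mounts               # enable lazy detach on removal
rmdir /tmp/secret-vol                                 # now succeeds
```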
To test if the above hypothesis holds water:

1. # cat /proc/sys/fs/may_detach_mounts (should show 0 with recent runc)
2. # echo 1 > /proc/sys/fs/may_detach_mounts (to temporarily enable the MNT_DETACH + mountpoint removal feature)
3. Try to reproduce the issue.
Detailed explanation and how to check if my hypothesis is correct here: https://bugzilla.redhat.com/show_bug.cgi?id=1823374#c17
(never mind the previous comment -- wrong window)
Kir, can you please check whether it is fixed by: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=27988483
Tyler, are you able to test with the version Jindrich posted?
Actually, I am not sure if you are able to access it.
Just to be explicit: I don't have access to the link
Tyler, could you test by setting sysctl fs.may_detach_mounts=1 with runc 1.0.0-67.rc10 on your RHEL 7 nodes?
Testing setting `fs.may_detach_mounts=1` and runc-1.0.0-67.rc10.rhaos4.3.el7_8.x86_64.rpm
Tests look good when I update the automation to ensure fs.may_detach_mounts is set to 1 via sysctl.conf and to install the runc package.
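For completeness, a hedged sketch of what that automation change boils down to on a RHEL 7 worker; the rpm filename is the build mentioned above, and persisting via /etc/sysctl.conf is only one option (a drop-in under /etc/sysctl.d works as well):
```
# Sketch only -- adjust the rpm path/NVR to what you actually have.
echo 'fs.may_detach_mounts = 1' >> /etc/sysctl.conf
sysctl -p                                             # apply now; expect: fs.may_detach_mounts = 1
yum install -y ./runc-1.0.0-67.rc10.rhaos4.3.el7_8.x86_64.rpm
cat /proc/sys/fs/may_detach_mounts                    # should print 1
```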
*** Bug 1820717 has been marked as a duplicate of this bug. ***
Verified on version: 4.4.0-0.nightly-2020-04-20-133922
```
sh-4.2# rpm -qa | grep runc
runc-1.0.0-68.rc10.rhaos4.4.el7_8.x86_64
sh-4.2# cat /proc/sys/fs/may_detach_mounts
1
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
*** Bug 1821347 has been marked as a duplicate of this bug. ***
@mirollin Sorry it took me so long to reply; I was on PTO / busy with other high-priority tasks. I believe the bug is fixed by re-adding the sysctl setting. You can verify it on a running system by simply doing cat /proc/sys/fs/may_detach_mounts. If it shows 1, the bug is fixed. If it shows 0, you might have an older version of the runc package. All in all, I think you should file a new bug and specify the versions of all components, what is going on, etc.; otherwise we have all the eggs in one basket and it's hard to make heads or tails of it.
Thanks Kir, I have conveyed this to the customer and am currently awaiting feedback. Checking one of their nodes, they could see /proc/sys/fs/may_detach_mounts was set to 0, so this could be the issue/fix. Will update you as soon as I hear back. Rgds