Description of problem (please be as detailed as possible and provide log snippets):

In recent tier 1 runs on IBM Cloud, failures were observed while cleaning up the app pods created during the test. The deletion of app pods that were using either RBD block volume mode or RWX access mode PVCs was stuck and the command timed out, as seen below:

E subprocess.TimeoutExpired: Command '['oc', '-n', 'namespace-test-6fddd1c88c594a50a543bd17f308b496', '--kubeconfig', '/home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig', 'delete', 'Pod', 'pod-test-rbd-3a985b1b7bb741f68028eba3766f20de']' timed out after 600 seconds

Test cases for which this issue was seen are listed below:
- tests/manage/pv_services/test_pvc_assign_pod_node.py::TestPvcAssignPodNode::test_rwx_pvc_assign_pod_node[CephBlockPool]
- tests/manage/pv_services/test_pvc_assign_pod_node.py::TestPvcAssignPodNode::test_rwx_pvc_assign_pod_node[CephFileSystem]
- tests/manage/pv_services/test_raw_block_pv.py::TestRawBlockPV::test_raw_block_pv[Delete]
- tests/manage/pv_services/test_raw_block_pv.py::TestRawBlockPV::test_raw_block_pv[Retain]
- tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expansion
- tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expand_expanded_pvc

Version of all relevant components (if applicable):
OCP: 4.5.15
OCS: v4.6.0-131.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what the user impact is)?
No

Is there any workaround available to the best of your knowledge?
Not that I am aware of

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
Run the tests mentioned above from ocs-ci, or (a minimal manual sketch is included at the end of this description):
1. Create an RBD-based PVC with RWX access mode.
2. Create two app pods using the same PVC and run IO.
3. Delete one of the app pods.

Actual results:
Pod deletion times out.

Expected results:
Pod deletion should be successful.

Additional info:
A similar error was observed in this bug: https://bugzilla.redhat.com/show_bug.cgi?id=1871314
However, in this run the kubelet dir issue was fixed and the error seen in https://bugzilla.redhat.com/show_bug.cgi?id=1871314#c5 was not observed.
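Minimal manual reproduction sketch for the steps above, assuming an RBD storage class named ocs-storagecluster-ceph-rbd; the namespace, object names, image and IO size are placeholders, and depending on the cluster SCCs the pods may need to run as root to write to the raw block device:

oc new-project rbd-rwx-repro

# 1. RBD-backed PVC in raw block mode with RWX access
cat <<'EOF' | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-block-rwx-pvc
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-ceph-rbd
EOF

# 2. Two app pods consuming the same PVC as a raw block device
for i in 1 2; do
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: rbd-block-pod-$i
spec:
  containers:
    - name: app
      image: quay.io/centos/centos:stream8
      command: ["sleep", "infinity"]
      volumeDevices:
        - name: data
          devicePath: /dev/rbdblock
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: rbd-block-rwx-pvc
EOF
done

# run some IO against the raw device from one of the pods
oc exec rbd-block-pod-1 -- dd if=/dev/zero of=/dev/rbdblock bs=1M count=100 oflag=direct

# 3. Delete one pod; in the failing runs this hangs and eventually times out
oc delete pod rbd-block-pod-1 --timeout=600s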
Ok, I think I see what is going on. This is the most important message in the log:

unmount failed: exit status 32 ... umount: /var/data/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-90eb1a1e-9787-48c7-9335-7b595a4244bb/dev/64da0a1c-3c63-46c4-9279-8722acfe1442: target is busy

The other warnings are fine; kubelet is cleaning up what was already cleaned by previous attempts:
* Warning: Unmap skipped because symlink does not exist on the path: ...
* couldn't find loopback device which takes file descriptor lock. Skip detaching device. ...

So something is still using /var/data/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-90eb1a1e-9787-48c7-9335-7b595a4244bb/dev/64da0a1c-3c63-46c4-9279-8722acfe1442, but from the logs I can't see what. Could there be some leftovers of the pod that used the volume? Is it (bind-)mounted anywhere?
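To see what is holding the staging target, something like the following could be run from a debug shell on the affected node (just a sketch; the path is copied from the log above, and availability of fuser/lsof in the node image is an assumption):

oc debug node/<node-name>
chroot /host

STAGING=/var/data/kubelet/plugins/kubernetes.io/csi/volumeDevices/pvc-90eb1a1e-9787-48c7-9335-7b595a4244bb/dev/64da0a1c-3c63-46c4-9279-8722acfe1442

# any mount (including bind mounts) that still references this volume
findmnt -o TARGET,SOURCE,FSTYPE | grep 64da0a1c-3c63-46c4-9279-8722acfe1442

# mount namespaces (e.g. leftover containers) that still have the target mounted
grep -l 64da0a1c-3c63-46c4-9279-8722acfe1442 /proc/*/mountinfo

# processes that still hold the target open
fuser -vm "$STAGING"
lsof "$STAGING"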
A similar issue with unmount and "device busy" messages was also seen in Bug 1877812 (again on the IBM ROKS platform).
Sahina, can we request the IBM team to try to reproduce this once more on a fresh setup where root-dir is set properly to either /var/lib/kubelet or /var/data/kubelet from the beginning? We have seen issues with this configuration in the past as well.
Akash, can you please respond to Comment 11?
Hey Akash, I just checked one of the clusters where I did the last execution and see that the kubelet root dir is set to /var/lib/kubelet, as I can see from --root-dir=/var/lib/kubelet.

This is ps aux from one of the nodes:

sh-4.2# ps aux | grep kubelet
root 13847 9.0 0.4 10675928 520204 ? Ssl Oct14 2971:21 kubelet --root-dir=/var/lib/kubelet --cgroup-driver=systemd --provider-id=ibm://51ce5574da53f188c6ba403639d180d0///bu3ghtrf0do58senbv10/kube-bu3ghtrf0do58senbv10-pbaloghvpc4-default-000001d0 --cloud-provider=external --cluster-dns=172.21.0.10 --cluster-domain=cluster.local --feature-gates=ExpandInUsePersistentVolumes=true,LegacyNodeRoleBehavior=false,NodeDisruptionExclusion=false,ServiceNodeExclusion=false,SCTPSupport=false --pod-manifest-path=/etc/kubernetes/manifests --kubeconfig=/etc/kubernetes/kubelet-kubeconfig --max-pods=160 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305 --v=2 --file-check-frequency=5s --hostname-override=10.243.128.11 --anonymous-auth=false --client-ca-file=/etc/kubernetes/cert/ca.pem --read-only-port=0 --network-plugin=cni --cni-conf-dir=/etc/kubernetes/cni/net.d/ --cni-bin-dir=/var/lib/cni/bin --tls-cert-file=/etc/kubernetes/cert/kubelet.pem --tls-private-key-file=/etc/kubernetes/cert/kubelet-key.pem --authorization-mode=Webhook --authentication-token-webhook --container-runtime=remote --runtime-request-timeout=15m --container-runtime-endpoint=/var/run/crio/crio.sock --feature-gates=CRIContainerLogRotation=true --container-log-max-size=100Mi --container-log-max-files=3 --streaming-connection-idle-timeout=30m --event-qps=0 --kube-reserved-cgroup=/podruntime.slice --system-reserved-cgroup=/system.slice --pod-max-pids=431610 --kube-reserved=memory=3549Mi,cpu=48m,pid=23978 --system-reserved=memory=5324Mi,cpu=72m,pid=23978 --kubelet-cgroups=/podruntime.slice --runtime-cgroups=/podruntime.slice --enforce-node-allocatable=pods --kube-api-qps=20 --kube-api-burst=40 --eviction-soft=memory.available<100Mi,nodefs.available<10%,imagefs.available<10%,nodefs.inodesFree<10%,imagefs.inodesFree<10 --eviction-soft-grace-period=memory.available=10m,nodefs.available=10m,imagefs.available=10m,nodefs.inodesFree=10m,imagefs.inodesFree=10m --eviction-hard=memory.available<100Mi,nodefs.available<5%,imagefs.available<5%,nodefs.inodesFree<5%,imagefs.inodesFree<5
root 17325 0.0 0.0 113912 8792 ? Ssl Oct14 0:21 /csi-node-driver-registrar --v=5 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/data/kubelet/csi-plugins/vpc.block.csi.ibm.io/csi.sock
root 190624 0.0 0.0 9100 680 ? S+ 03:15 0:00 grep kubelet
root 958230 0.0 0.0 713616 7952 ? Ssl Oct31 0:25 /usr/bin/csi-node-driver-registrar --v=5 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/csi.sock
root 958610 0.0 0.0 714768 7932 ? Ssl Oct31 0:26 /usr/bin/csi-node-driver-registrar --v=5 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/lib/kubelet/plugins/openshift-storage.cephfs.csi.ceph.com/csi.sock

I see in this BZ that since 4.4 (https://bugzilla.redhat.com/show_bug.cgi?id=1801365#c33) the regular default path should be used, which seems to be true on this cluster. But I see this csi-node-driver-registrar process:

root 17325 0.0 0.0 113912 8792 ? Ssl Oct14 0:21 /csi-node-driver-registrar --v=5 --csi-address=/csi/csi.sock --kubelet-registration-path=/var/data/kubelet/csi-plugins/vpc.block.csi.ibm.io/csi.sock

This for some reason has /var/data/kubelet in the path - smells fishy to me. Can someone from CSI take a look?

I still have the cluster up here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-roks/pbalogh-roks_20201015T084053/openshift-cluster-dir/auth/kubeconfig
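For reference, one way to compare the two registration paths directly on the node could look like the sketch below (the directory layout is inferred from the ps output above, so treat it as an assumption):

oc debug node/<node-name> -- chroot /host sh -c '
  # registration sockets kubelet actually picked up (root-dir is /var/lib/kubelet)
  ls -l /var/lib/kubelet/plugins_registry/
  ls -l /var/lib/kubelet/plugins/openshift-storage.rbd.csi.ceph.com/
  # path the VPC registrar advertises
  ls -l /var/data/kubelet/csi-plugins/vpc.block.csi.ibm.io/
  # if /var/data/kubelet is only a symlink to /var/lib/kubelet, both paths end up on the same filesystem
  readlink -f /var/data/kubelet
'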
@Petr, I see the kubelet-registration-path looks incorrect for the VPC plugin above. We use this CSI plugin from IBM to create the PVCs used by OSD pods in OCS, so it is not yet clear whether the path in the VPC plugin is causing issues. We will try to change the VPC path and see if the OSD issue is resolved, so that we know the root cause.
@Petr, we checked by testing a sample pod using the same VPC-based volume and were able to successfully create and delete the pod, and the volume was unmounted. So the kubelet path is not an issue here, as it is a soft link used in the VPC plugin. But in the case of the OSD pods, I see they are stuck in Terminating state. The volume used by the OSD pod is actually detached when we check in the backend, yet the OSD pod appears to still be holding onto the already detached volume. I also see a device like loop0. Please check why the OSD pod is still holding onto the volume and whether loop0 is related to this issue.
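A hypothetical check on the node hosting the Terminating OSD pod, to see what loop0 is backed by and whether it still serves a mount (a sketch only, not output taken from the cluster above):

oc debug node/<node-name> -- chroot /host sh -c '
  # loop devices and their backing files
  losetup -a
  # mounts still served by a loop device
  findmnt | grep loop
  # cross-check against the CSI raw block staging area
  ls -lR /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/ | head -50
'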
From https://bugzilla.redhat.com/show_bug.cgi?id=1877812#c26 it is clear that the loop device is still holding the mount, and that should be the reason why the pods are also not getting deleted. If we RCA https://bugzilla.redhat.com/show_bug.cgi?id=1877812 we can close this one. But IMHO, there is nothing for the CSI team to look at here.
*** This bug has been marked as a duplicate of bug 1877812 ***