Description of problem: kube-controller force-detaching a volume after maxWaitForUnmountDuration leaves block devices on the node in an inconsistent state

Version-Release number of selected component (if applicable): 4.7

How reproducible: Each time a node is unable to unmount a volume.

Steps to Reproduce:
0. On a cluster with dynamic volume provisioning, use a CSI driver which uses multipath, like the IBM Block Storage CSI driver.
1. Create a PV exposed over multiple paths.
2. Bind the PV to a PVC, i.e. mount the PV on a node with a CSI driver which supports multipath, and use the PV in a Pod.
3. Create a runtime error condition where unmounting the PV will fail, e.g. start a new shell command using the PV mount point as its current working directory. This prevents the mount point from being unmounted.
4. Delete the Pod. The kubelet on the node where the Pod is running will then try to unmount the PV, either directly or by asking the node CSI driver. Unmounting the PV fails because of step 3.
5. After waiting maxWaitForUnmountDuration minutes, the kube-controller kicks in and force-detaches the PV on the storage provider:

    2021-10-01T06:45:49.352705556Z W1001 06:45:49.352024 1 reconciler.go:206] attacherDetacher.DetachVolume started for volume "pvc-i-am-an-uuid-you-know" (UniqueName: "kubernetes.io/csi/block.csi.ibm.com^SVC:123") on node "node.example.com" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching

6. At that point the node where the pod was running no longer has access to the device/PV, but the multipath device map is still present on the node and points to a nonexistent block device / PV.
7. Create a new Pod on that node, using a *different* PVC / PV.
When the Pod is started, the kernel sees the old multipath device still present, although the backing device on the storage provider is now something completely different. (This behaviour can be partially mitigated if the option introduced by https://access.redhat.com/errata/RHBA-2021:4367 is applied, but it is not yet available on OpenShift.) This is the moment where strange things happen: for instance, the kernel reports a wrong size for the block device, which could even lead to data corruption.

Actual results: Force-detaching a volume after maxWaitForUnmountDuration leaves block devices on the node in an inconsistent state.

Expected results: kube-controller asks the node to flush (delete) the remaining multipath maps on the node before force-detaching on the storage side.

Additional info: If you think this problem should be fixed in another layer of the stack and not in Kubernetes / OpenShift, please advise.
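To illustrate the leftover state described in step 6, here is a small sketch (my own, not part of the report) that lists device-mapper devices whose name does not appear in the mounts table — candidate leftover multipath maps. The function name and the SYSFS/MOUNTS override variables are mine; they default to the real kernel files but can be pointed at a fake tree for testing.

```shell
# Sketch: list device-mapper devices that are not mounted anywhere,
# i.e. candidate leftover multipath maps after a force detach.
list_unmounted_dm() {
  local sysfs="${SYSFS:-/sys}" mounts="${MOUNTS:-/proc/mounts}" dm name
  for dm in "$sysfs"/block/dm-*; do
    [ -e "$dm/dm/name" ] || continue
    name=$(cat "$dm/dm/name")
    # a mounted map shows up as /dev/mapper/<name> in the mounts table
    grep -q "/dev/mapper/$name " "$mounts" || echo "$(basename "$dm") $name"
  done
}
```

Note that a map listed here may still be legitimately unmounted; correlating with `multipath -ll` path states is still needed before acting on it.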
After thinking about this, I think the best way to handle this would be to fix the behaviour of the CSI driver when NodeUnstageVolume is called. The CSI spec does not have an "emergency unmount" way to call the node driver, and since the multipath devices are created in NodeStageVolume by the CSI driver, it should also be the responsibility of the node driver to delete them. Currently we have a loop where NodeUnstageVolume() is called a number of times but never completes, because flushing the multipath device keeps returning an error, i.e. https://github.com/IBM/ibm-block-csi-driver/blob/9ce398e8ef8a20ffdfcdfa16bce35a422b31da40/node/pkg/driver/node.go#L359

    err = d.OsDeviceConnectivityHelper.FlushMultipathDevice(baseDevice)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "Multipath -f command failed with error: %v", err)
    }

is always failing:

    2021-10-01 14:37:21,1012 DEBUG [1177780] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:21,1012 ERROR [1177780] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:23,1012 DEBUG [1177789] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:23,1012 ERROR [1177789] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:27,1012 DEBUG [1177749] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:27,1012 ERROR [1177749] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:35,1012 DEBUG [1177724] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:35,1012 ERROR [1177724] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1

IMHO, instead of returning immediately with an error and letting the
kubelet call it again with exponential backoff, NodeUnstageVolume() should implement the following sequence of events:

1. Try multipath -f dm-X. If it succeeds there is no emergency any more; just remove the paths and we are good.
2. If it fails, run multipathd -k"disablequeueing map dm-X", then retry multipath -f dm-X. Again, if it succeeds, just remove the paths and we are good.
3. Even if it still fails, remove the paths with echo 1 > /sys/block/sdX/device/delete (for each path).
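The sequence above can be sketched in shell (my own illustration, not driver code; it assumes the dm device's slave sdX devices appear under /sys/block/dm-X/slaves/, and the SYSFS variable is an override I added so the destructive sysfs writes can be exercised against a fake tree):

```shell
# Hedged sketch of the proposed NodeUnstageVolume fallback; destructive
# on a real system, only meant to illustrate the ordering of the steps.
emergency_flush() {
  local dm="$1" sysfs="${SYSFS:-/sys}" slave
  # Step 1: plain flush; if it works there is no emergency any more.
  multipath -f "$dm" && return 0
  # Step 2: disable queueing so outstanding I/O errors out instead of
  # blocking the flush, then retry.
  multipathd -k"disablequeueing map $dm"
  multipath -f "$dm" && return 0
  # Step 3: last resort - drop every underlying SCSI path via sysfs.
  for slave in "$sysfs"/block/"$dm"/slaves/*; do
    [ -e "$slave" ] || continue
    echo 1 > "$sysfs/block/$(basename "$slave")/device/delete"
  done
}
```

Dropping the paths in step 3 loses any queued I/O, but at that point the backing LUN is about to be force-detached anyway, so stale maps are the worse outcome.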
Can the CSI driver check whether a volume is mounted / mapped to a node in the ControllerUnpublish call? Some (most?) clouds don't allow detaching a volume that is still mounted on a node; they return either a timeout (waiting for the volume to get unmounted) or some EBUSY. This will not help when the node is really unreachable (kernel oops, kubelet crash); a volume would then always stay attached to such a node. (We really need fencing in Kubernetes.)
Another question: do you know why a volume can't be unmounted in 6 minutes? The step "Create a runtime error condition where the unmount of the PV would fail" should not be part of a generic flow; it should be a very rare corner case.
Hello Jan, I can (partially) answer the 2nd question (because we are still in progress trying to figure it out):

(In reply to Jan Safranek from comment #12)
> Another question: do you know why a volume can't be unmounted in 6 minutes?
> The step "Create a runtime error condition where the unmount of the PV would
> fail" should not be part of a generic flow, it should be very rare corner
> case.

We agree that identifying the triggering conditions is important for, at least, a workaround. So far what we see is:

[1] The multipath map persistently fails to get flushed (reporting busy).
[2] There are xfs-related processes on top of this map (the usual kernel threads that xfs uses), but the filesystem does not appear in /proc/mounts - an indication that a lazy unmount of the filesystem happened sometime earlier, while some process was still using the filesystem.

We have tried a proof of concept for this thought today and, at first glance, we have reproduced the problem. More specifically:

- We started a test pod which mounted a LUN containing an XFS filesystem.
- On the node where the test pod was running we opened a terminal (outside the container, on the main host) and just did "cd /mount/point/the/pod/is/using". We left that terminal open so that the mount point would remain busy.
- We scaled down the pod so that everything stops and cleans up.

After some minutes we could see the corresponding multipath map existing with all its paths failed (the expected result of just unmapping the LUN on the storage side). We are planning a "cleaner" similar reproducer attempt in order to gather logs from it and compare with what we had seen when the failure happened in a production pod.
This test gives us some confidence about the triggering conditions, but we still need to identify what is keeping the filesystem busy and why, and also why the lazy unmount happens in the case of the production pods (it didn't seem to happen during our test, at least not in the first few minutes while we were still checking for it). Trapping what is keeping a filesystem busy is a significantly more complicated exercise, as we don't really know which filesystem to target (since this happens randomly) and monitoring all files on all filesystems is too big an overhead. Thanks, Alexandros
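For the "what is keeping it busy" part, a crude sketch (my own, not from the investigation) that scans /proc for processes whose current working directory sits at or under a given mount point — exactly the condition used in the reproducer above:

```shell
# List "PID cwd" pairs for processes whose working directory is at or
# under the target path (a common reason a filesystem stays busy and a
# lazy unmount never completes).
find_busy() {
  local target="$1" pid cwd
  for pid in /proc/[0-9]*; do
    # processes may exit between the glob and the readlink; skip those
    cwd=$(readlink "$pid/cwd" 2>/dev/null) || continue
    case "$cwd" in
      "$target"|"$target"/*) echo "${pid#/proc/} $cwd" ;;
    esac
  done
}
```

This only catches the cwd case; open file descriptors and memory maps would need a similar scan of /proc/PID/fd and /proc/PID/maps, which is where the overhead concern above comes in.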
Hi Jan,

To answer your first question:

> Can the CSI driver check if a volume is mounted / mapped to a node in ControllerUnpublish call? Some (most?) clouds don't allow detaching a volume that is still mounted on a node and they return either timeout (waiting for the volume to get unmounted) or some EBUSY.

1. According to the CSI spec, ControllerUnpublishVolume "MUST be called after all NodeUnstageVolume and NodeUnpublishVolume on the volume are called and succeed".
2. In this particular case (Fibre Channel SAN array), the storage system currently doesn't provide any flag related to the volume still being mounted, and "The Plugin MUST NOT assume that this RPC [ControllerUnpublishVolume] will be executed on the node where the volume was previously used", so I don't think our CSI controller can mitigate this issue for now.

The question here, since the CSI controller cannot know from the storage side whether a volume is still mounted on a node: can the CSI controller get this information via the Kubernetes API server? Would it be enough, for instance, to assume that as long as the PersistentVolume which "owns" the mount point is still associated with a PersistentVolumeClaim, the PersistentVolume is mounted, and hence it is unsafe to call ControllerUnpublishVolume? Or do you have any other idea?
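On querying the API server: the VolumeAttachment objects record which node a PV is attached to, which is at least a strong hint (though, as discussed later in this bug, an attachment record can be stale after a force detach). A sketch (the helper name and the plain-shell parsing are mine) that pairs with `kubectl get volumeattachments -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName --no-headers`:

```shell
# Read "PV NODE" pairs on stdin (e.g. from the kubectl command above)
# and print the node(s) the given PV is currently attached to.
attached_node_for_pv() {
  local pv="$1" p n
  while read -r p n; do
    if [ "$p" = "$pv" ]; then echo "$n"; fi
  done
}
```

An empty result would suggest the volume is not attached anywhere, but the PVC-bound heuristic asked about above is weaker: a PV can stay bound to its PVC long after it was unmounted.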
I've filed an issue upstream: https://github.com/kubernetes/kubernetes/issues/106710

I can't promise any easy or fast fix; this 6-minute timeout is our basis for detaching volumes from unavailable nodes where the kernel / kubelet crashed and force detach is the only way out. Untangling it from the case where the CSI driver temporarily can't unmount a volume is hard.
Discussion with upstream continues. Right now it looks like Kubernetes should force-detach volumes without a completed NodeUnstage after 6 minutes only when the corresponding node is "unhealthy". We're discussing what "unhealthy" actually means; in any case it won't be worse than today, when it detaches after 6 minutes from all nodes.
In multiple OCP 4.9 clusters deployed on IBM P9 Power Systems, using IBM CSI driver v1.8.0 (and v1.9.0), the issue described in this Bugzilla happened multiple times. In detail, multiple pods that require a PVC and are scheduled on a specific node (nodeA) hang in ContainerCreating state, due to the impossibility of mapping new LUNs on that node. On nodeA there is at least one LUN whose paths are in a failed state, and that LUN should not be on nodeA because it is already mapped (with active paths) to another node (nodeB), as can also be seen in the related VolumeAttachment describe. So we have this situation:

- nodeA has LUN-J mapped with paths in failed state
- nodeB has LUN-J mapped with paths in active state
- the VolumeAttachment (related to LUN-J's PVC) has the correct correlation LUN-J --> nodeB

On nodeA, removing the failed paths (/dev/sdX and /dev/sdY) and the related mpathZ device (often stopping and restarting the multipathd daemon due to its locks on the mpathZ device) solved the problem, and the pods which were in ContainerCreating state on this node went to Running state. After a reboot of nodeA (due to the application of a MachineConfig), the failed paths appeared again and the node was stuck again, unable to map new PVCs. Further investigation and tests revealed that in a situation like this it is necessary to remove all ORPHANED PODS present on nodeA. Evidence of orphaned pods is found in nodeA's journal, with logs like this:

May 13 18:19:41 worker3.enxctmiapp.ccg.local hyperkube[2873]: E0513 18:19:41.108536 2873 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"01f40a73-6881-4524-9485-c37a31a27813\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=42

With the following commands (which remove the volume .json file for every orphaned pod in the /var/lib/kubelet/pods/...
tree), it was possible to clean the situation on nodeA, and after a reboot of nodeA the failed paths didn't appear again:

----- orphaned pod deletion by deleting the relative volume json file: -----

    while true; do
      ORPHANEDPODID=$(journalctl --since "10min ago" | grep "but error not a directory occurred when trying to remove the volumes dir" | tail -1 | grep -o " err=.*" | cut -d= -f2 | cut -d\" -f3 | tr -d "\\" 2>/dev/null)
      cat /var/lib/kubelet/pods/$ORPHANEDPODID/volumes/*/*/*.json
      sleep 4
      rm /var/lib/kubelet/pods/$ORPHANEDPODID/volumes/*/*/*.json
      sleep 3
    done

(Note: the original one-liner used $PODID in the cat command, which was undefined; it should be $ORPHANEDPODID throughout, as above.)

Final consideration: in our environment, in the majority of cases, on the node with LUNs mapped with paths in a failed state (like nodeA) there were orphaned pods. (In all these cases, removing the failed paths was a solution only if the node didn't reboot. Removing the failed paths + clearing the orphaned pods was the final solution.) In the remaining cases of this issue, there were no orphaned pods on the affected node.
In the case description, when saying:

> 3. Create a runtime error condition where the unmount of the PV would fail, like starting a new shell command using the PV mount point as a current working directory. This will prevent unmounting of the mount point.

this can happen when doing a recursive filesystem scan from all mount points; antivirus and file-based compliance scanning could be culprits here. This has been reported for ACS at https://issues.redhat.com/browse/ROX-9350
@apilati There would be orphaned pods only when the detachment and unmapping did not happen successfully.
The upstream issue has been fixed and it will be part of the Kubernetes rebase to 1.25, i.e. in 4.12.
Hello Jan, First of all thanks for moving this forward. I am trying to understand what to expect from this change, once backported. My understanding is that detaching the LUN won't happen even after the timeout on a healthy node. I can see that "healthy" is determined by "IsNodeReady", but I am not sure what this translates to in real world terms. Also, when something (e.g. https://issues.redhat.com/browse/ROX-9350) keeps the device busy and the force detach does not happen, what happens to the stopping pod? Will it be hanging in a stopping state? (I obviously understand that there is no magic to clear everything - I'm just trying to figure out what we expect to happen). Thanks, Alexandros
> My understanding is that detaching the LUN won't happen even after the timeout on a healthy node. I can see that "healthy" is determined by "IsNodeReady"

Yes, that's it.

> what happens to the stopping pod? Will it be hanging in a stopping state?

The Pod may be Terminating forever, or the volume will stay attached to the node, unable to be attached to a different one. It's the fault of the CSI driver if it can't clean up a volume in NodeUnpublish / NodeUnstage.
@jsafrane I am confused about when we can say the pending actions are on the CSI driver, or what the CSI driver needs to fix, in the case of an unhealthy node. If a node becomes NotReady, force detachment will happen after the 6-minute timeout. After this, can we expect that the volumes will only get marked for NodeUnpublish / NodeUnstage, or will the detachment happen fully from the node? And at that point, what is the job of the CSI driver?
If the node is still unhealthy, ControllerUnpublish will happen without NodeUnpublish / NodeUnstage. This allows the volume to be used on another node, because otherwise the volume would be stuck on the faulty node forever and an application could not access its data. When the node comes back, NodeUnpublish / NodeUnstage may be called on it, but the volume may already be ControllerUnpublished at that point.
The Kubernetes rebase has landed, and the attach/detach controller will no longer force-detach (ControllerUnpublish) volumes mounted to *healthy* nodes.
That's good to know, and indeed the issue could still be reproduced in 4.12 with Kubernetes 1.24. So, updating to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399