Description of problem: kube-controller force-detaching a volume after maxWaitForUnmountDuration leaves block devices on the node in an inconsistent state

Version-Release number of selected component (if applicable): 4.7

How reproducible: Each time a node is unable to unmount a volume.

Steps to Reproduce:
0. On a cluster with dynamic volume provisioning, use a CSI driver which uses multipath, like the IBM Block Storage CSI driver.
1. Create a PV exposed over multiple paths.
2. Bind the PV to a PVC, i.e. mount the PV on a node with a CSI driver which supports multipath, and use the PV in a Pod.
3. Create a runtime error condition where unmounting the PV will fail, e.g. start a new shell command using the PV mount point as its current working directory. This prevents the mount point from being unmounted.
4. Delete the Pod. The kubelet on the node where the Pod is running will then try to unmount the PV, either directly or by asking the node CSI driver. Unmounting the PV fails because of step 3.
5. After waiting maxWaitForUnmountDuration minutes, the kube-controller kicks in and force-detaches the PV on the storage provider:

    2021-10-01T06:45:49.352705556Z W1001 06:45:49.352024 1 reconciler.go:206] attacherDetacher.DetachVolume started for volume "pvc-i-am-an-uuid-you-know" (UniqueName: "kubernetes.io/csi/block.csi.ibm.com^SVC:123") on node "node.example.com" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching

6. At that point the node where the pod was running no longer has access to the device/PV, but the multipath device map is still present on the node and points to a nonexistent block device / PV.
7. Create a new Pod on that node, using a *different* PVC / PV.
When the Pod is started, the kernel sees the old multipath device still present, although the backing device on the storage provider is now something completely different. (This behaviour can be partially mitigated if the option introduced by https://access.redhat.com/errata/RHBA-2021:4367 is applied, but it is not yet available on OpenShift.) This is the moment where strange things happen: for instance, the kernel reports a wrong size for the block device, which could even lead to data corruption.

Actual results: Force-detaching a volume after maxWaitForUnmountDuration leaves block devices on the node in an inconsistent state.

Expected results: kube-controller asks the node to flush (delete) the remaining multipath maps on the node before force-detaching on the storage side.

Additional info: If you think this problem should be fixed in another layer of the stack and not in Kubernetes / OpenShift, please advise.
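To illustrate the leftover state described in step 6, here is a small sketch (my own, not part of the report) that lists device-mapper devices whose name does not appear in the mounts table — candidate leftover multipath maps. The function name and the SYSFS/MOUNTS override variables are mine; they default to the real kernel files but can be pointed at a fake tree for testing.

```shell
# Sketch: list device-mapper devices that are not mounted anywhere,
# i.e. candidate leftover multipath maps after a force detach.
list_unmounted_dm() {
  local sysfs="${SYSFS:-/sys}" mounts="${MOUNTS:-/proc/mounts}" dm name
  for dm in "$sysfs"/block/dm-*; do
    [ -e "$dm/dm/name" ] || continue
    name=$(cat "$dm/dm/name")
    # a mounted map shows up as /dev/mapper/<name> in the mounts table
    grep -q "/dev/mapper/$name " "$mounts" || echo "$(basename "$dm") $name"
  done
}
```

Note that a map listed here may still be legitimately unmounted; correlating with `multipath -ll` path states is still needed before acting on it.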
After thinking about this, I think the best way to handle this would be to fix the behaviour of the CSI driver when NodeUnstageVolume is called. The CSI spec does not have an "emergency unmount" way to call the node driver, and since the multipath devices are created in NodeStageVolume by the CSI driver, it should also be the responsibility of the node driver to delete them. Currently we have a loop where NodeUnstageVolume() is called a number of times but never completes, because flushing the multipath device keeps returning an error, i.e. https://github.com/IBM/ibm-block-csi-driver/blob/9ce398e8ef8a20ffdfcdfa16bce35a422b31da40/node/pkg/driver/node.go#L359

    err = d.OsDeviceConnectivityHelper.FlushMultipathDevice(baseDevice)
    if err != nil {
        return nil, status.Errorf(codes.Internal, "Multipath -f command failed with error: %v", err)
    }

is always failing:

    2021-10-01 14:37:21,1012 DEBUG [1177780] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:21,1012 ERROR [1177780] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:23,1012 DEBUG [1177789] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:23,1012 ERROR [1177789] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:27,1012 DEBUG [1177749] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:27,1012 ERROR [1177749] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1
    2021-10-01 14:37:35,1012 DEBUG [1177724] [SVC:i-am-a-lun-wwid] (node.go:360) - <<<< NodeUnstageVolume
    2021-10-01 14:37:35,1012 ERROR [1177724] [-] (driver.go:85) - GRPC error: rpc error: code = Internal desc = Multipath -f command failed with error: exit status 1

IMHO, instead of returning immediately with an error and letting the
kubelet call it again with exponential backoff, NodeUnstageVolume() should implement the following sequence of events:

1. Try multipath -f dm-X. If it succeeds there is no emergency any more; just remove the paths and we are good.
2. If it fails, run multipathd -k"disablequeueing map dm-X", then retry multipath -f dm-X. Again, if it succeeds, just remove the paths and we are good.
3. Even if it still fails, remove the paths with echo 1 > /sys/block/sdX/device/delete (for each path).
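The sequence above can be sketched in shell (my own illustration, not driver code; it assumes the dm device's slave sdX devices appear under /sys/block/dm-X/slaves/, and the SYSFS variable is an override I added so the destructive sysfs writes can be exercised against a fake tree):

```shell
# Hedged sketch of the proposed NodeUnstageVolume fallback; destructive
# on a real system, only meant to illustrate the ordering of the steps.
emergency_flush() {
  local dm="$1" sysfs="${SYSFS:-/sys}" slave
  # Step 1: plain flush; if it works there is no emergency any more.
  multipath -f "$dm" && return 0
  # Step 2: disable queueing so outstanding I/O errors out instead of
  # blocking the flush, then retry.
  multipathd -k"disablequeueing map $dm"
  multipath -f "$dm" && return 0
  # Step 3: last resort - drop every underlying SCSI path via sysfs.
  for slave in "$sysfs"/block/"$dm"/slaves/*; do
    [ -e "$slave" ] || continue
    echo 1 > "$sysfs/block/$(basename "$slave")/device/delete"
  done
}
```

Dropping the paths in step 3 loses any queued I/O, but at that point the backing LUN is about to be force-detached anyway, so stale maps are the worse outcome.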
Can the CSI driver check whether a volume is mounted / mapped to a node in the ControllerUnpublish call? Some (most?) clouds don't allow detaching a volume that is still mounted on a node; they return either a timeout (waiting for the volume to get unmounted) or some EBUSY. This will not help when the node is really unreachable (kernel oops, kubelet crash); a volume would then always stay attached to such a node. (We really need fencing in Kubernetes.)
Another question: do you know why a volume can't be unmounted in 6 minutes? The step "Create a runtime error condition where the unmount of the PV would fail" should not be part of a generic flow; it should be a very rare corner case.
Hello Jan, I can (partially) answer the 2nd question (because we are still in progress trying to figure it out):

(In reply to Jan Safranek from comment #12)
> Another question: do you know why a volume can't be unmounted in 6 minutes?
> The step "Create a runtime error condition where the unmount of the PV would
> fail" should not be part of a generic flow, it should be very rare corner
> case.

We agree that identifying the triggering conditions is important for, at least, a workaround. So far what we see is:

[1] The multipath map persistently fails to get flushed (reporting busy).
[2] There are xfs-related processes on top of this map (the usual kernel threads that xfs uses), but the filesystem does not appear in /proc/mounts - an indication that a lazy unmount of the filesystem happened sometime earlier, while some process was still using the filesystem.

We have tried a proof of concept for this thought today and, at first glance, we have reproduced the problem. More specifically:

- We started a test pod which mounted a LUN containing an XFS filesystem.
- On the node where the test pod was running we opened a terminal (outside the container, on the main host) and just did "cd /mount/point/the/pod/is/using". We left that terminal open so that the mount point would remain busy.
- We scaled down the pod so that everything stops and cleans up.

After some minutes we could see the corresponding multipath map existing with all its paths failed (the expected result of just unmapping the LUN on the storage side). We are planning a "cleaner" similar reproducer attempt in order to gather logs from it and compare with what we had seen when the failure happened in a production pod.
This test gives us some confidence about the triggering conditions, but we still need to identify what is keeping the filesystem busy and why, and also why the lazy unmount happens in the case of the production pods (it didn't seem to happen during our test, at least not in the first few minutes while we were still checking for it). Trapping what is keeping a filesystem busy is a significantly more complicated exercise, as we don't really know which filesystem to target (since this happens randomly) and monitoring all files on all filesystems is too big an overhead. Thanks, Alexandros
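For the "what is keeping it busy" part, a crude sketch (my own, not from the investigation) that scans /proc for processes whose current working directory sits at or under a given mount point — exactly the condition used in the reproducer above:

```shell
# List "PID cwd" pairs for processes whose working directory is at or
# under the target path (a common reason a filesystem stays busy and a
# lazy unmount never completes).
find_busy() {
  local target="$1" pid cwd
  for pid in /proc/[0-9]*; do
    # processes may exit between the glob and the readlink; skip those
    cwd=$(readlink "$pid/cwd" 2>/dev/null) || continue
    case "$cwd" in
      "$target"|"$target"/*) echo "${pid#/proc/} $cwd" ;;
    esac
  done
}
```

This only catches the cwd case; open file descriptors and memory maps would need a similar scan of /proc/PID/fd and /proc/PID/maps, which is where the overhead concern above comes in.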
Hi Jan,

To answer your first question:

> Can the CSI driver check if a volume is mounted / mapped to a node in ControllerUnpublish call? Some (most?) clouds don't allow detaching a volume that is still mounted on a node and they return either timeout (waiting for the volume to get unmounted) or some EBUSY.

1. According to the CSI spec, ControllerUnpublishVolume "MUST be called after all NodeUnstageVolume and NodeUnpublishVolume on the volume are called and succeed".
2. In this particular case (Fibre Channel SAN array), the storage system currently doesn't provide any flag related to the volume still being mounted, and "The Plugin MUST NOT assume that this RPC [ControllerUnpublishVolume] will be executed on the node where the volume was previously used", so I don't think our CSI controller can mitigate this issue for now.

The question here, since the CSI controller cannot know from the storage side whether a volume is still mounted on a node: can the CSI controller get this information via the Kubernetes API server? Would it be enough, for instance, to assume that as long as the PersistentVolume which "owns" the mount point is still associated with a PersistentVolumeClaim, the PersistentVolume is mounted, and hence it is unsafe to call ControllerUnpublishVolume? Or do you have any other idea?
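On querying the API server: the VolumeAttachment objects record which node a PV is attached to, which is at least a strong hint (though, as discussed later in this bug, an attachment record can be stale after a force detach). A sketch (the helper name and the plain-shell parsing are mine) that pairs with `kubectl get volumeattachments -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName --no-headers`:

```shell
# Read "PV NODE" pairs on stdin (e.g. from the kubectl command above)
# and print the node(s) the given PV is currently attached to.
attached_node_for_pv() {
  local pv="$1" p n
  while read -r p n; do
    if [ "$p" = "$pv" ]; then echo "$n"; fi
  done
}
```

An empty result would suggest the volume is not attached anywhere, but the PVC-bound heuristic asked about above is weaker: a PV can stay bound to its PVC long after it was unmounted.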
I've filed an issue upstream: https://github.com/kubernetes/kubernetes/issues/106710

I can't promise any easy or fast fix; this 6-minute timeout is our basis for detaching volumes from unavailable nodes where the kernel / kubelet crashed and force detach is the only way out. Untangling it from the case where the CSI driver temporarily can't unmount a volume is hard.
Discussion with upstream continues. Right now it looks like Kubernetes should force-detach volumes without a completed NodeUnstage after 6 minutes only when the corresponding node is "unhealthy". We're discussing what "unhealthy" actually means; in any case it won't be worse than today, when it detaches after 6 minutes from all nodes.
In multiple OCP 4.9 clusters deployed on IBM P9 Power Systems, using IBM CSI driver v1.8.0 (and v1.9.0), the issue described in this Bugzilla happened multiple times. In detail, multiple pods that require a PVC and are scheduled on a specific node (nodeA) hang in ContainerCreating state, due to the impossibility of mapping new LUNs on that node. On nodeA there is at least one LUN whose paths are in a failed state, and that LUN should not be on nodeA because it is already mapped (with active paths) to another node (nodeB), as can also be seen in the related VolumeAttachment describe. So we have this situation:

- nodeA has LUN-J mapped with paths in failed state
- nodeB has LUN-J mapped with paths in active state
- the VolumeAttachment (related to LUN-J's PVC) has the correct correlation LUN-J --> nodeB

On nodeA, removing the failed paths (/dev/sdX and /dev/sdY) and the related mpathZ device (often stopping and restarting the multipathd daemon due to its locks on the mpathZ device) solved the problem, and the pods which were in ContainerCreating state on this node went to Running state. After a reboot of nodeA (due to the application of a MachineConfig), the failed paths appeared again and the node was stuck again, unable to map new PVCs. Further investigation and tests revealed that in a situation like this it is necessary to remove all ORPHANED PODS present on nodeA. Evidence of orphaned pods is found in nodeA's journal, with logs like this:

May 13 18:19:41 worker3.enxctmiapp.ccg.local hyperkube[2873]: E0513 18:19:41.108536 2873 kubelet_volumes.go:245] "There were many similar errors. Turn up verbosity to see them." err="orphaned pod \"01f40a73-6881-4524-9485-c37a31a27813\" found, but error not a directory occurred when trying to remove the volumes dir" numErrs=42

With the following commands (which remove the volume .json file for every orphaned pod in the /var/lib/kubelet/pods/...
tree), it was possible to clean the situation on nodeA, and after a reboot of nodeA the failed paths didn't appear again:

----- orphaned pod deletion by deleting the relative volume json file: -----

    while true; do
      ORPHANEDPODID=$(journalctl --since "10min ago" | grep "but error not a directory occurred when trying to remove the volumes dir" | tail -1 | grep -o " err=.*" | cut -d= -f2 | cut -d\" -f3 | tr -d "\\" 2>/dev/null)
      cat /var/lib/kubelet/pods/$ORPHANEDPODID/volumes/*/*/*.json
      sleep 4
      rm /var/lib/kubelet/pods/$ORPHANEDPODID/volumes/*/*/*.json
      sleep 3
    done

(Note: the original one-liner used $PODID in the cat command, which was undefined; it should be $ORPHANEDPODID throughout, as above.)

Final consideration: in our environment, in the majority of cases, on the node with LUNs mapped with paths in a failed state (like nodeA) there were orphaned pods. (In all these cases, removing the failed paths was a solution only if the node didn't reboot. Removing the failed paths + clearing the orphaned pods was the final solution.) In the remaining cases of this issue, there were no orphaned pods on the affected node.
In the case description, when saying:

> 3. Create a runtime error condition where the unmount of the PV would fail, like starting a new shell command using the PV mount point as a current working directory. This will prevent unmounting of the mount point.

this can happen when doing a recursive filesystem scan from all mount points; antivirus and file-based compliance scanning could be culprits here. This has been reported for ACS at https://issues.redhat.com/browse/ROX-9350
@apilati There would be orphaned pods only when the detachment and unmapping did not happen successfully.
The upstream issue has been fixed and it will be part of the Kubernetes rebase to 1.25, i.e. in 4.12.
Hello Jan, First of all thanks for moving this forward. I am trying to understand what to expect from this change, once backported. My understanding is that detaching the LUN won't happen even after the timeout on a healthy node. I can see that "healthy" is determined by "IsNodeReady", but I am not sure what this translates to in real world terms. Also, when something (e.g. https://issues.redhat.com/browse/ROX-9350) keeps the device busy and the force detach does not happen, what happens to the stopping pod? Will it be hanging in a stopping state? (I obviously understand that there is no magic to clear everything - I'm just trying to figure out what we expect to happen). Thanks, Alexandros
> My understanding is that detaching the LUN won't happen even after the timeout on a healthy node. I can see that "healthy" is determined by "IsNodeReady"

Yes, that's it.

> what happens to the stopping pod? Will it be hanging in a stopping state?

The Pod may be Terminating forever, or the volume will stay attached to the node, unable to be attached to a different one. It's the fault of the CSI driver if it can't clean up a volume in NodeUnpublish / NodeUnstage.
@jsafrane I am confused about when we can say the pending actions are on the CSI driver, or what the CSI driver needs to fix, in the case of an unhealthy node. If a node becomes NotReady, force detachment will happen after the 6-minute timeout. After this, can we expect that the volumes will only get marked for NodeUnpublish / NodeUnstage, or will the detachment happen fully from the node? And at that point, what is the job of the CSI driver?
If the node is still unhealthy, ControllerUnpublish will happen without NodeUnpublish / NodeUnstage. This allows the volume to be used on another node, because otherwise the volume would be stuck on the faulty node forever and an application could not access its data. When the node comes back, NodeUnpublish / NodeUnstage may be called on it, but the volume may already be ControllerUnpublished at that point.
The Kubernetes rebase has landed, and the attach/detach controller will no longer force-detach (ControllerUnpublish) volumes mounted to *healthy* nodes.
That's good to know, and indeed the issue could still be reproduced in 4.12 with Kubernetes 1.24. So, updating to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.12.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:7399