Bug 2103982 - [4.9] AWS EBS CSI driver stuck removing EBS volumes - GetDeviceMountRefs check failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.z
Assignee: Jonathan Dobson
QA Contact: Penghao Wang
URL:
Whiteboard: UpdateRecommendationsBlocked
Duplicates: 2103980 2108616
Depends On: 2050190
Blocks: 2106910
 
Reported: 2022-07-05 13:15 UTC by akretzsc
Modified: 2022-08-09 14:01 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2106910
Environment:
Last Closed: 2022-08-09 14:00:58 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift aws-ebs-csi-driver pull 203 0 None Merged Bug 2103982: Rebase to v1.4.0 2022-07-26 07:07:57 UTC
Github openshift aws-ebs-csi-driver pull 204 0 None Merged [release-4.9] Bug 2103982: Fix build on ARM after rebase 2022-07-26 07:07:58 UTC
Red Hat Product Errata RHSA-2022:5879 0 None None None 2022-08-09 14:01:30 UTC

Description akretzsc 2022-07-05 13:15:42 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?
Our OpenShift 4.8.39 cluster, installed on AWS via IPI, uses the gp2-csi storageclass. For unknown reasons, Pods sometimes get stuck trying to Terminate. We've been able to determine that the issue is caused by the aws-ebs-csi driver removing the EBS volume from the node while leaving its mountpoints behind in the kernel. Attached is a must-gather taken _while_ three Pods were stuck trying to Terminate.

```
Jun 15 22:34:10 ip-100-64-174-34 hyperkube[92613]: E0615 22:34:10.671813   92613 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c1929520fef19990 podName: nodeName:}" failed. No retries permitted until 2022-06-15 22:34:11.671787683 +0000 UTC m=+1007.896465463 (durationBeforeRetry 1s). Error: "GetDeviceMountRefs check failed for volume \"pvc-d65166a3-2719-478e-95da-5f06a922026e\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0c1929520fef19990\") on node \"ip-100-64-174-34.ec2.internal\" : the device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-d65166a3-2719-478e-95da-5f06a922026e/globalmount\" is still mounted by other references [/var/lib/kubelet/pods/811cfb65-b6fc-4284-baec-60c61efc8a08/volumes/kubernetes.io~csi/pvc-d65166a3-2719-478e-95da-5f06a922026e/mount]"
```

By manually unmounting the detached EBS mountpoints on the node, the kubelet is able to process the Pod termination fully:

```
[root@ip-100-64-174-34 ~]# mount | grep 29d8cb63-ab29-47f3
/dev/nvme17n1 on /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-29d8cb63-ab29-47f3-915e-a6facd7d37da/globalmount type ext4 (rw,relatime,seclabel)
/dev/nvme17n1 on /var/lib/kubelet/pods/ff74466d-c84b-4776-a87c-fe38b58becd2/volumes/kubernetes.io~csi/pvc-29d8cb63-ab29-47f3-915e-a6facd7d37da/mount type ext4 (rw,relatime,seclabel)
[root@ip-100-64-174-34 ~]# umount /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-29d8cb63-ab29-47f3-915e-a6facd7d37da/globalmount /var/lib/kubelet/pods/ff74466d-c84b-4776-a87c-fe38b58becd2/volumes/kubernetes.io~csi/pvc-29d8cb63-ab29-47f3-915e-a6facd7d37da/mount
[root@ip-100-64-174-34 ~]#
[root@ip-100-64-174-34 ~]# mount | grep 29d8cb63-ab29-47f3
[root@ip-100-64-174-34 ~]#
```
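
A hedged generalization of that workaround (the PVC name below is just the one from this transcript; confirm the volume is no longer needed by any running pod before unmounting):

```
# unmount every remaining mount point that still references the stuck PVC
PVC=pvc-29d8cb63-ab29-47f3-915e-a6facd7d37da   # substitute the PVC from the error message
mount | grep "${PVC}" | awk '{print $3}' | xargs -r umount
```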

We believe this to be due to a bug in the aws-ebs-csi driver per these issues and PRs:
- https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/679
- https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1076
- https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1027
- https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1082

I was able to reproduce this problem by restarting the kubelet on a node after a Pod with a gp2-csi PV is running. When you then delete the Pod, the kubelet will hit the above error. https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1076#issuecomment-937336310
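
A minimal sketch of that reproduction (the manifest, node, and pod names are placeholders; the exact procedure is in the pastebin linked under "Steps to reproduce" below):

```
# 1. run a pod that mounts a PVC from the gp2-csi StorageClass
oc apply -f pod-with-gp2-csi-pvc.yaml                 # placeholder manifest
# 2. once the pod is Running, restart the kubelet on its node
oc debug node/<node-name> -- chroot /host systemctl restart kubelet
# 3. delete the pod and watch the kubelet journal for the error
oc delete pod <pod-name>
oc debug node/<node-name> -- chroot /host journalctl -u kubelet | grep GetDeviceMountRefs
```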

What is the business impact? Please also provide timeframe information.
Pods stuck terminating prevent node maintenance and block services from updating.

Where are you experiencing the behavior? What environment?
OpenShift 4.8.39 IPI install on AWS and using gp2-csi storageclass.

When does the behavior occur? Frequency? Repeatedly? At certain times?
Frequently

Steps to reproduce:
http://pastebin.test.redhat.com/1061401

Comment 1 Swati Mulje 2022-07-05 13:30:51 UTC
*** Bug 2103980 has been marked as a duplicate of this bug. ***

Comment 4 Jonathan Dobson 2022-07-12 23:53:47 UTC
I tested this on 4.8 and 4.10, following the procedure in the description:
http://pastebin.test.redhat.com/1061401

I was able to reproduce on 4.8:

Jul 12 23:01:29 ip-10-0-198-6 hyperkube[1325]: I0712 23:01:29.789998    1325 operation_generator.go:698] MountVolume.SetUp succeeded for volume "pvc-a5d41617-c9ca-4156-b6ec-12371c643afb" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f") pod "test-gp2-csi" (UID: "07062617-f26a-4282-bdfe-05958dd2218a")

Jul 12 23:09:46 ip-10-0-198-6 hyperkube[146543]: I0712 23:09:46.473458  146543 operation_generator.go:698] MountVolume.SetUp succeeded for volume "pvc-a5d41617-c9ca-4156-b6ec-12371c643afb" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f") pod "test-gp2-csi" (UID: "07062617-f26a-4282-bdfe-05958dd2218a")

Jul 12 23:10:52 ip-10-0-198-6 hyperkube[146543]: I0712 23:10:52.457818  146543 operation_generator.go:829] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f" (OuterVolumeSpecName: "data") pod "07062617-f26a-4282-bdfe-05958dd2218a" (UID: "07062617-f26a-4282-bdfe-05958dd2218a"). InnerVolumeSpecName "pvc-a5d41617-c9ca-4156-b6ec-12371c643afb". PluginName "kubernetes.io/csi", VolumeGidValue ""
Jul 12 23:10:52 ip-10-0-198-6 hyperkube[146543]: I0712 23:10:52.547765  146543 reconciler.go:312] "operationExecutor.UnmountDevice started for volume \"pvc-a5d41617-c9ca-4156-b6ec-12371c643afb\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f\") on node \"ip-10-0-198-6.us-east-2.compute.internal\" "
Jul 12 23:10:52 ip-10-0-198-6 hyperkube[146543]: E0712 23:10:52.551729  146543 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f podName: nodeName:}" failed. No retries permitted until 2022-07-12 23:10:53.051691933 +0000 UTC m=+73.983813348 (durationBeforeRetry 500ms). Error: "GetDeviceMountRefs check failed for volume \"pvc-a5d41617-c9ca-4156-b6ec-12371c643afb\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-0c8cd336e6a7fba5f\") on node \"ip-10-0-198-6.us-east-2.compute.internal\" : the device mount path \"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-a5d41617-c9ca-4156-b6ec-12371c643afb/globalmount\" is still mounted by other references [/var/lib/kubelet/pods/07062617-f26a-4282-bdfe-05958dd2218a/volumes/kubernetes.io~csi/pvc-a5d41617-c9ca-4156-b6ec-12371c643afb/mount]"

But it is _not_ reproducible on 4.10:

Jul 12 21:22:51 ip-10-0-162-195 hyperkube[1375]: I0712 21:22:51.522506    1375 operation_generator.go:756] "MountVolume.SetUp succeeded for volume \"pvc-63919ef5-ada9-44dd-accd-cfd381edfd8d\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-020c66b643d5f9fff\") pod \"test-gp2-csi\" (UID: \"b326847f-45cc-4c53-bab4-103bc2ff8c12\") " pod="testing/test-gp2-csi"

Jul 12 21:32:31 ip-10-0-162-195 hyperkube[199677]: I0712 21:32:31.456769  199677 operation_generator.go:910] UnmountVolume.TearDown succeeded for volume "kubernetes.io/csi/ebs.csi.aws.com^vol-020c66b643d5f9fff" (OuterVolumeSpecName: "data") pod "b326847f-45cc-4c53-bab4-103bc2ff8c12" (UID: "b326847f-45cc-4c53-bab4-103bc2ff8c12"). InnerVolumeSpecName "pvc-63919ef5-ada9-44dd-accd-cfd381edfd8d". PluginName "kubernetes.io/csi", VolumeGidValue ""
Jul 12 21:32:31 ip-10-0-162-195 hyperkube[199677]: I0712 21:32:31.543711  199677 reconciler.go:293] "operationExecutor.UnmountDevice started for volume \"pvc-63919ef5-ada9-44dd-accd-cfd381edfd8d\" (UniqueName: \"kubernetes.io/csi/ebs.csi.aws.com^vol-020c66b643d5f9fff\") on node \"ip-10-0-162-195.us-east-2.compute.internal\" "
Jul 12 21:32:31 ip-10-0-162-195 hyperkube[199677]: I0712 21:32:31.559187  199677 operation_generator.go:1018] UnmountDevice succeeded for volume "pvc-63919ef5-ada9-44dd-accd-cfd381edfd8d" (UniqueName: "kubernetes.io/csi/ebs.csi.aws.com^vol-020c66b643d5f9fff") on node "ip-10-0-162-195.us-east-2.compute.internal"

The problem is that MountVolume.SetUp gets called twice on 4.8, which leaves an extra mount reference that triggers the error during unmount when the pod is destroyed.

This comment has a good explanation of the reason:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/issues/1076#issuecomment-937336310

This PR is the one that removed the mount check in kubelet, starting in k8s 1.20 / OCP 4.7:
https://github.com/kubernetes/kubernetes/pull/88759

But unfortunately, the idempotency of all the drivers was not fixed in the same release. This AWS issue was fixed in OCP 4.10:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1082
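
For anyone hitting this, a hedged way to see the leftover reference on an affected 4.8 node (the PVC name below is the one from the 4.8 logs above; substitute your own):

```
PVC=pvc-a5d41617-c9ca-4156-b6ec-12371c643afb
# the staging (global) mount for the device
findmnt -rn -o TARGET,SOURCE | grep "csi/pv/${PVC}/globalmount"
# per-pod bind mounts; an entry that survives UnmountVolume.TearDown is what
# the GetDeviceMountRefs error reports as "still mounted by other references"
grep "${PVC}" /proc/mounts
```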

Comment 9 Jonathan Dobson 2022-07-13 20:41:05 UTC
There is at least one more relevant PR that we need in 4.9 / 4.8 to fix this issue:
https://github.com/kubernetes-sigs/aws-ebs-csi-driver/pull/1019

And there are a couple more PRs between #1019 and #1082 that we would need in order to avoid manually reworking the patches.

  git cherry-pick -m 1 edb4fbee7e5f078e9214750e579abcd576f3f3d2
  git cherry-pick -m 1 aefaa900ae057acda57173977370a61e6b9b04bb
  git cherry-pick -m 1 869e7f5d0999e0703d445e7909a910ef729cb25e
  git cherry-pick -m 1 ad8df8f2d220a261f6ae1f4870afd937592648a2

The other option would be to rebase on driver version v1.4.0, which might be simpler and less risky than cherry-picking a series of patches.
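
A minimal sketch of what that rebase could look like in a clone of openshift/aws-ebs-csi-driver (remote and branch handling here are assumptions, not the actual workflow used for the linked PRs):

```
git checkout release-4.9
git remote add upstream https://github.com/kubernetes-sigs/aws-ebs-csi-driver.git
git fetch upstream --tags
git merge v1.4.0   # resolve conflicts with downstream carry patches, then open the PR
```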

Comment 12 Jonathan Dobson 2022-07-19 13:57:38 UTC
*** Bug 2108616 has been marked as a duplicate of this bug. ***

Comment 15 W. Trevor King 2022-07-26 07:51:07 UTC
4.9.44 was built with 203, but came too soon to include 204's ARM fixes.

We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
* reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
* example: AWS ARM clusters updating from any release to 4.9.44, or installing 4.9.44.

What is the impact? Is it serious enough to warrant removing update recommendations?
* reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
* example: 4.9.44 fails to install (all the time?  Only when some other component requests storage?)
* example: Any AWS ARM cluster attempting to update to 4.9.44 will hang part way through as the storage ClusterOperator goes Degraded=True on the broken/missing CSI image.

How involved is remediation?
* reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
* example: Admin must update out to a new OCP-core release with a fixed ARM AWS CSI driver image.

Is this a regression?
* reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
* example: Yes, from all 4.y.z to 4.9.44.

Comment 16 Jonathan Dobson 2022-07-26 15:09:30 UTC
(In reply to W. Trevor King from comment #15)
> 4.9.44 was built with 203, but came too soon to include 204's ARM fixes. 
> We're asking the following questions to evaluate whether or not this bug
> warrants changing update recommendations from either the previous X.Y or
> X.Y.Z. The ultimate goal is to avoid delivering an update which introduces
> new risk or reduces cluster functionality in any way. Sample answers are
> provided to give more context and the ImpactStatementRequested label has
> been added to this bug. When responding, please remove
> ImpactStatementRequested and set the ImpactStatementProposed label. The
> expectation is that the assignee answers these questions.
> 
> Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of
> clusters?
> * reasoning: This allows us to populate from, to, and matchingRules in
> conditional update recommendations for "the $SOURCE_RELEASE to
> $TARGET_RELEASE update is not recommended for clusters like $THIS".
> * example: AWS ARM clusters updating from any release to 4.9.44, or
> installing 4.9.44.

New installations of 4.9.44 are not recommended for AWS ARM clusters. Upgrading from any release to 4.9.44 is also not recommended for AWS ARM clusters.

> What is the impact? Is it serious enough to warrant removing update
> recommendations?
> * reasoning: This allows us to populate name and message in conditional
> update recommendations for "...because if you update, $THESE_CONDITIONS may
> cause $THESE_UNFORTUNATE_SYMPTOMS".
> * example: 4.9.44 fails to install (all the time?  Only when some other
> component requests storage?)
> * example: Any AWS ARM cluster attempting to update to 4.9.44 will hang part
> way through as the storage ClusterOperator goes Degraded=True on the
> broken/missing CSI image.

New installations of 4.9.44 or upgrades to 4.9.44 for AWS ARM clusters will fail to complete due to a degraded storage ClusterOperator, caused by broken CSI driver images. Persistent volumes provided by the CSI driver will not be usable while the cluster is in this state.
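
A hedged sketch of how an administrator could confirm an AWS ARM cluster is hitting this (standard oc commands; the namespace is the default one for the CSI driver pods):

```
# the storage ClusterOperator reports Degraded=True while the CSI driver image is broken
oc get clusteroperator storage
# image pull or crash errors on the aws-ebs-csi-driver pods point at the broken ARM image
oc get pods -n openshift-cluster-csi-drivers
```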

> How involved is remediation?
> * reasoning: This allows administrators who are already vulnerable, or who
> chose to waive conditional-update risks, to recover their cluster. And even
> moderately serious impacts might be acceptable if they are easy to mitigate.
> * example: Admin must update out to a new OCP-core release with a fixed ARM
> AWS CSI driver image.

The cluster administrator must update to a new OCP release with a fixed AWS EBS CSI driver image for ARM clusters.

> Is this a regression?
> * reasoning: Updating between two vulnerable releases may not increase
> exposure (unless rebooting during the update increases vulnerability, etc.).
> We only qualify update recommendations if the update increases exposure.
> * example: Yes, from all 4.y.z to 4.9.44.

Yes, this is a regression affecting upgrades to 4.9.44.

Comment 17 Lalatendu Mohanty 2022-07-26 16:27:43 UTC
https://github.com/openshift/cincinnati-graph-data/pull/2255 is the PR to tombstone the version.

Comment 20 errata-xmlrpc 2022-08-09 14:00:58 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.9.45 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5879

