Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1977807

Summary: Prometheus PV is corrupted during CSI migration tests
Product: OpenShift Container Platform
Component: Storage
Sub component: Kubernetes
Version: 4.8
Target Release: 4.9.0
Reporter: Jan Safranek <jsafrane>
Assignee: Jan Safranek <jsafrane>
QA Contact: Wei Duan <wduan>
CC: aos-bugs, fbertina, rh-container
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2021-10-18 17:37:18 UTC

Description Jan Safranek 2021-06-30 14:09:52 UTC
Periodic jobs for CSI migration on AWS have been failing consistently since Jul 15: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-broken#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-csi-migration

This is a CI job run with `-v=5` in the AWS CSI driver to get more useful logs; all other CI jobs fail similarly:

https://github.com/openshift/aws-ebs-csi-driver-operator/pull/133
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_aws-ebs-csi-driver-operator/133/pull-ci-openshift-aws-ebs-csi-driver-operator-master-e2e-aws-csi-migration/1409843278685147136

I can see that ext4 gets corrupted on /dev/nvme1n1 on node ip-10-0-251-70.us-east-2.compute.internal

> Jun 29 12:23:25.215128 ip-10-0-251-70.us-east-2.compute.internal hyperkube[1356]: I0629 12:23:25.214951    1356 attacher.go:189] Successfully found attached AWS Volume "aws://us-east-2c/vol-0ba07ec7cd97f3ae9" at path "/dev/nvme1n1".
> Jun 29 12:53:25.766786 ip-10-0-251-70.us-east-2.compute.internal kernel: EXT4-fs error (device nvme1n1) in ext4_reserve_inode_write:5953: IO failure


Later on, Prometheus starts complaining and does not provide any metrics:
> 2021-06-29T12:53:30.849716550Z level=error ts=2021-06-29T12:53:30.849Z caller=scrape.go:1098 component="scrape manager" scrape_pool=serviceMonitor/openshift-monitoring/node-exporter/0 target=https://10.0.183.184:9100/metrics msg="Scrape commit failed" err="write to WAL: log samples: write /prometheus/wal/00000001: read-only file system"

(there are no earlier Prometheus logs, not sure why)

PV is pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45, pod is prometheus-k8s-1

In the kubelet logs from ip-10-0-251-70.us-east-2.compute.internal, I can see that /dev/nvme1n1 was first used by the in-tree volume plugin for the same PV and was correctly unmounted:

> Jun 29 12:23:25.215128 ip-10-0-251-70.us-east-2.compute.internal hyperkube[1356]: I0629 12:23:25.214951    1356 attacher.go:189] Successfully found attached AWS Volume "aws://us-east-2c/vol-0ba07ec7cd97f3ae9" at path "/dev/nvme1n1".
> ...
> Jun 29 12:46:10.855735 ip-10-0-251-70.us-east-2.compute.internal hyperkube[1356]: W0629 12:46:10.855709    1356 mount_helper_common.go:133] Warning: "/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/mounts/aws/us-east-2c/vol-0ba07ec7cd97f3ae9" is not a mountpoint, deleting
> Jun 29 12:46:10.855819 ip-10-0-251-70.us-east-2.compute.internal hyperkube[1356]: I0629 12:46:10.855803    1356 operation_generator.go:930] UnmountDevice succeeded for volume "pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45" %!(EXTRA string=UnmountDevice succeeded for volume "pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9") on node "ip-10-0-251-70.us-east-2.compute.internal" )
> Jun 29 12:46:10.928728 ip-10-0-251-70.us-east-2.compute.internal hyperkube[1356]: I0629 12:46:10.928701    1356 reconciler.go:319] "Volume detached for volume \"pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45\" (UniqueName: \"kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9\") on node \"ip-10-0-251-70.us-east-2.compute.internal\" DevicePath \"/dev/nvme1n1\""

(note: "Warning: xxx is not a mountpoint, deleting" looks suspicious).

Later on, CSI migration is enabled and the same PV is mounted via the CSI driver. From the driver logs:

> 2021-06-29T12:48:29.991419421Z I0629 12:48:29.991348       1 node.go:105] NodeStageVolume: called with args {VolumeId:vol-0ba07ec7cd97f3ae9 PublishContext:map[devicePath:/dev/xvdba] StagingTargetPath:/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45/globalmount VolumeCapability:mount:<fs_type:"ext4" > access_mode:<mode:SINGLE_NODE_WRITER >  
> Secrets:map[] VolumeContext:map[partition:0] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
> 2021-06-29T12:48:29.991448603Z W0629 12:48:29.991413       1 node.go:171] NodeStageVolume: invalid partition config, will ignore. partition = 0
> 2021-06-29T12:48:29.991563419Z I0629 12:48:29.991549       1 node.go:179] NodeStageVolume: find device path /dev/xvdba -> /dev/nvme1n1
> 2021-06-29T12:48:29.992858616Z I0629 12:48:29.992226       1 node.go:213] NodeStageVolume: formatting /dev/nvme1n1 and mounting at /var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45/globalmount with fstype ext4
> 2021-06-29T12:48:30.056341675Z I0629 12:48:30.056314       1 mount.go:121] ResizeFs.needResize - checking mounted volume /dev/nvme1n1
> ...

(note that there is no 'Disk "/dev/nvme1n1" appears to be unformatted, attempting to format as type: "ext4"' message, so the device was already formatted by the previous pod and the CSI driver just mounted it)

This is 5 minutes before the corruption happened.
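The mount-without-format behavior noted above matches the usual SafeFormatAndMount pattern: the driver probes the device for an existing filesystem and runs mkfs only when nothing is detected, which is why the "appears to be unformatted" message is absent for an already-formatted volume. A minimal sketch of that decision (the `decide` helper and `Action` names are mine for illustration, not the driver's actual code):

```go
package main

import "fmt"

// Action is what the node plugin should do with the staging device.
type Action int

const (
	FormatAndMount Action = iota // no filesystem detected: mkfs, then mount
	MountOnly                    // filesystem already present and matches the request
	Fail                         // filesystem present but of a different type
)

// decide mimics the SafeFormatAndMount-style check: format only when the
// device carries no detectable filesystem.
func decide(detectedFs, requestedFs string) Action {
	switch {
	case detectedFs == "":
		return FormatAndMount
	case detectedFs == requestedFs:
		return MountOnly
	default:
		return Fail
	}
}

func main() {
	fmt.Println(decide("", "ext4"))     // fresh volume gets formatted
	fmt.Println(decide("ext4", "ext4")) // migrated volume is only mounted
	fmt.Println(decide("xfs", "ext4"))  // type mismatch is an error
}
```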

I checked:

* The volume is not used on any other node.
* Neither the kubelet logs nor the CSI driver logs show nvme1n1 being used by another volume.

Suspicious events:
* The CSI driver on the node logs "NodeStageVolume: find device path /dev/xvdba -> /dev/nvmeXnY" for several volumes, each with a different X and Y. I checked the CSI driver code; /dev/xvdba does not seem to be used anywhere when finding the device, but this still suggests that the attacher / CSI driver may be using a single device name for multiple volumes. I have not checked the attacher logs yet.

* The A/D controller logs these two suspicious lines on startup:
> 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446849       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9" ("ip-10-0-251-70.us-east-2.compute.internal") is not attached (Detached)
> 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446883       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2b/vol-0452ff618cee8c71e" ("ip-10-0-153-60.us-east-2.compute.internal") is not attached (Detached)

KCM is restarted, perhaps several times, when enabling CSI migration. I don't have earlier KCM logs.  The volume is detached:
> 2021-06-29T13:03:24.905093374Z I0629 13:03:24.905054       1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9") on node "ip-10-0-251-70.us-east-2.compute.internal"

In-tree volume detached at 13:03? CSI migration should have been enabled by then already...
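On the "find device path /dev/xvdba -> /dev/nvmeXnY" observation above: on Nitro instances the attach-time device name is advisory only; the volume is identified by its EBS volume ID, which udev exposes as a stable symlink under /dev/disk/by-id. A hedged sketch of that lookup (the `byIDPath` helper is mine; the real driver walks several candidate paths):

```go
package main

import (
	"fmt"
	"strings"
)

// byIDPath builds the udev symlink under which an EBS volume surfaces on a
// Nitro instance, regardless of the /dev/xvdX name requested at attach time.
func byIDPath(volumeID string) string {
	// udev drops the dash from the volume ID when it builds the device serial.
	serial := strings.Replace(volumeID, "-", "", 1)
	return "/dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_" + serial
}

func main() {
	fmt.Println(byIDPath("vol-0ba07ec7cd97f3ae9"))
	// -> /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0ba07ec7cd97f3ae9
}
```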

Comment 1 Jan Safranek 2021-07-01 10:13:11 UTC
> * The A/D controller logs these two suspicious lines on startup:
>> 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446849       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9" ("ip-10-0-251-70.us-east-2.compute.internal") is not attached (Detached)
>> 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446883       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2b/vol-0452ff618cee8c71e" ("ip-10-0-153-60.us-east-2.compute.internal") is not attached (Detached)
>
> KCM is restarted, perhaps several times, when enabling CSI migration. I don't have earlier KCM logs.  The volume is detached:
>> 2021-06-29T13:03:24.905093374Z I0629 13:03:24.905054       1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9") on node "ip-10-0-251-70.us-east-2.compute.internal"

> In-tree volume detached at 13:03? CSI migration should have been enabled by then already...

I checked; the A/D controller really treats these as in-tree volumes, not as migrated volumes with a misleading unique name in the logs.

Comment 2 Jan Safranek 2021-07-01 11:01:07 UTC
Tried to reproduce it manually. I installed an AWS cluster, enabled PVCs for Prometheus, and enabled migration. After several KCM restarts due to the new feature flags, plus nodes being drained, I can see in the startup logs of the last KCM:

> I0701 09:48:09.274454       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675" ("ip-10-0-130-134.ec2.internal") is not attached (Detached)
> I0701 09:48:09.274498       1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-1b/vol-05ae35e408f95b1f6" ("ip-10-0-158-70.ec2.internal") is not attached (Detached)

At this time, CSI migration is enabled on all nodes already. Following vol-09a070fded8cd4675:

> I0701 09:51:56.308855       1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-3b0d6a2f-5045-44fb-bfd4-0c04a609b91e" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675") on node "ip-10-0-130-134.ec2.internal"
> I0701 09:51:56.315381       1 operation_generator.go:1483] Verified volume is safe to detach for volume "pvc-3b0d6a2f-5045-44fb-bfd4-0c04a609b91e" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675") on node "ip-10-0-130-134.ec2.internal"
...
> I0701 09:52:01.910531       1 aws.go:2291] Waiting for volume "vol-09a070fded8cd4675" state: actual=busy, desired=detached
> I0701 09:52:04.007945       1 aws.go:2291] Waiting for volume "vol-09a070fded8cd4675" state: actual=busy, desired=detached

(and it continues forever)

The volume is being detached by the in-tree volume plugin, while it is attached and mounted by the CSI driver. From ip-10-0-130-134.ec2.internal node status:

  volumesAttached:
  - devicePath: ""
    name: kubernetes.io/csi/ebs.csi.aws.com^vol-09a070fded8cd4675
  volumesInUse:
  - kubernetes.io/csi/ebs.csi.aws.com^vol-09a070fded8cd4675

It looks like VolumeAttachment processing on A/D controller startup does not take CSI migration into account:
https://github.com/kubernetes/kubernetes/blob/3f4c39bbd7b8d1dec2bc88c6f4c8e7ba6ba83169/pkg/controller/volume/attachdetach/attach_detach_controller.go#L720
- it should migrate the PV to CSI, if CSI migration is enabled on the corresponding node.

Still, I am not sure if it can cause volume corruption.
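The missing step can be sketched as a name translation during startup sync: when CSI migration is enabled for the plugin on the volume's node, the actual state of world must track the CSI unique name, so the in-tree plugin never tries to detach a volume the CSI driver has mounted. The function names below are illustrative; in real Kubernetes the translation is done by k8s.io/csi-translation-lib:

```go
package main

import (
	"fmt"
	"strings"
)

// inTreeToCSI rewrites an in-tree AWS EBS unique volume name into the CSI
// form: "kubernetes.io/aws-ebs/aws://us-east-2c/vol-..." becomes
// "kubernetes.io/csi/ebs.csi.aws.com^vol-..." (matching the node status above).
func inTreeToCSI(uniqueName string) string {
	// Keep only the EBS volume ID; the zone prefix is not part of the CSI handle.
	i := strings.LastIndex(uniqueName, "/")
	return "kubernetes.io/csi/ebs.csi.aws.com^" + uniqueName[i+1:]
}

// syncName is the step the pre-fix controller skipped on startup: translate
// in-tree names to CSI names when migration is enabled on the volume's node.
func syncName(uniqueName string, migrationEnabledOnNode bool) string {
	if migrationEnabledOnNode && strings.HasPrefix(uniqueName, "kubernetes.io/aws-ebs/") {
		return inTreeToCSI(uniqueName)
	}
	return uniqueName
}

func main() {
	fmt.Println(syncName("kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9", true))
	// -> kubernetes.io/csi/ebs.csi.aws.com^vol-0ba07ec7cd97f3ae9
}
```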

Comment 3 Jan Safranek 2021-07-01 12:48:26 UTC
> It looks like VolumeAttachment processing on A/D controller startup does not
> take CSI migration into account:
> https://github.com/kubernetes/kubernetes/blob/
> 3f4c39bbd7b8d1dec2bc88c6f4c8e7ba6ba83169/pkg/controller/volume/attachdetach/
> attach_detach_controller.go#L720
> - it should migrate the PV to CSI, if CSI migration is enabled on the
> corresponding node.

This was fixed in https://github.com/kubernetes/kubernetes/pull/101737 and backported to all supported releases; however, the backports have not been merged into OCP yet.

Comment 4 Jan Safranek 2021-07-02 12:32:53 UTC
Tested CSI migration with the aforementioned PR in https://github.com/openshift/kubernetes/pull/844 and it passed once (apart from the usual flakes).

Comment 5 Jan Safranek 2021-08-19 09:20:17 UTC
The Kubernetes rebase has landed, and the CSI migration tests are much greener now: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-broken#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration

Comment 7 Fabio Bertinatto 2021-08-19 13:32:57 UTC
*** Bug 1974906 has been marked as a duplicate of this bug. ***

Comment 8 Wei Duan 2021-08-30 14:09:29 UTC
Verified pass.

Comment 11 errata-xmlrpc 2021-10-18 17:37:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759