Bug 1977807
| Summary: | Prometheus PV is corrupted during CSI migration tests | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Safranek <jsafrane> |
| Component: | Storage | Assignee: | Jan Safranek <jsafrane> |
| Storage sub component: | Kubernetes | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, fbertina, rh-container |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-18 17:37:18 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jan Safranek 2021-06-30 14:09:52 UTC
> * The A/D controller logs these two suspicious lines on startup:
>
> > 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446849 1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9" ("ip-10-0-251-70.us-east-2.compute.internal") is not attached (Detached)
> > 2021-06-29T12:58:48.446917233Z I0629 12:58:48.446883 1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-2b/vol-0452ff618cee8c71e" ("ip-10-0-153-60.us-east-2.compute.internal") is not attached (Detached)
>
> KCM is restarted, perhaps several times, when enabling CSI migration. I don't have earlier KCM logs. The volume is detached:
>
> > 2021-06-29T13:03:24.905093374Z I0629 13:03:24.905054 1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-7dd40eb9-9b4f-4e85-a15b-23f4161e0e45" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-2c/vol-0ba07ec7cd97f3ae9") on node "ip-10-0-251-70.us-east-2.compute.internal"
>
> In-tree volume detached at 13:03? CSI migration should have been enabled by then already...

I checked: the A/D controller really treats these as in-tree volumes, not as migrated volumes with a misleading unique name in the logs.

I tried to reproduce it manually: I installed an AWS cluster, enabled PVCs for Prometheus, and enabled migration.
After several KCM restarts due to new feature flags plus drained nodes, I can see this at the startup of the last KCM:

> I0701 09:48:09.274454 1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675" ("ip-10-0-130-134.ec2.internal") is not attached (Detached)
> I0701 09:48:09.274498 1 attach_detach_controller.go:740] Marking volume attachment as uncertain as volume:"kubernetes.io/aws-ebs/aws://us-east-1b/vol-05ae35e408f95b1f6" ("ip-10-0-158-70.ec2.internal") is not attached (Detached)

At this time, CSI migration is already enabled on all nodes. Following vol-09a070fded8cd4675:

> I0701 09:51:56.308855 1 reconciler.go:219] attacherDetacher.DetachVolume started for volume "pvc-3b0d6a2f-5045-44fb-bfd4-0c04a609b91e" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675") on node "ip-10-0-130-134.ec2.internal"
> I0701 09:51:56.315381 1 operation_generator.go:1483] Verified volume is safe to detach for volume "pvc-3b0d6a2f-5045-44fb-bfd4-0c04a609b91e" (UniqueName: "kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675") on node "ip-10-0-130-134.ec2.internal"
> ...
> I0701 09:52:01.910531 1 aws.go:2291] Waiting for volume "vol-09a070fded8cd4675" state: actual=busy, desired=detached
> I0701 09:52:04.007945 1 aws.go:2291] Waiting for volume "vol-09a070fded8cd4675" state: actual=busy, desired=detached

(and it continues forever)

The volume is being detached by the in-tree volume plugin while it is attached and mounted by the CSI driver.
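The conflict in the logs above comes down to unique volume names: the in-tree AWS EBS plugin and the EBS CSI driver identify the same EBS volume by different strings, so an attachment tracked under one form is invisible to logic keyed on the other. A minimal sketch, assuming only the two name formats visible in the logs and node status; `csiUniqueNameFromInTree` is a hypothetical helper for illustration (the real translation lives in k8s.io/csi-translation-lib), not Kubernetes code:

```go
package main

import (
	"fmt"
	"strings"
)

// csiUniqueNameFromInTree derives the CSI-style unique name (the form
// reported by kubelet in node status, e.g.
// "kubernetes.io/csi/ebs.csi.aws.com^vol-...") from the in-tree AWS EBS
// unique name seen in the KCM logs, e.g.
// "kubernetes.io/aws-ebs/aws://us-east-1a/vol-...".
func csiUniqueNameFromInTree(inTree string) (string, error) {
	const prefix = "kubernetes.io/aws-ebs/aws://"
	rest, ok := strings.CutPrefix(inTree, prefix)
	if !ok {
		return "", fmt.Errorf("not an in-tree AWS EBS unique name: %q", inTree)
	}
	// Drop the availability-zone segment; the CSI name carries only the
	// EBS volume ID after the driver name and "^" separator.
	parts := strings.SplitN(rest, "/", 2)
	volID := parts[len(parts)-1]
	return "kubernetes.io/csi/ebs.csi.aws.com^" + volID, nil
}

func main() {
	name, err := csiUniqueNameFromInTree("kubernetes.io/aws-ebs/aws://us-east-1a/vol-09a070fded8cd4675")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

Because the two strings never compare equal, the A/D controller's in-tree bookkeeping does not see the CSI attachment and happily issues the detach.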
From the ip-10-0-130-134.ec2.internal node status:

    volumesAttached:
    - devicePath: ""
      name: kubernetes.io/csi/ebs.csi.aws.com^vol-09a070fded8cd4675
    volumesInUse:
    - kubernetes.io/csi/ebs.csi.aws.com^vol-09a070fded8cd4675

It looks like VolumeAttachment processing on A/D controller startup does not take CSI migration into account: https://github.com/kubernetes/kubernetes/blob/3f4c39bbd7b8d1dec2bc88c6f4c8e7ba6ba83169/pkg/controller/volume/attachdetach/attach_detach_controller.go#L720 - it should migrate the PV to CSI if CSI migration is enabled on the corresponding node. Still, I am not sure whether it can cause volume corruption.

> It looks like VolumeAttachment processing on A/D controller startup does not take CSI migration into account: https://github.com/kubernetes/kubernetes/blob/3f4c39bbd7b8d1dec2bc88c6f4c8e7ba6ba83169/pkg/controller/volume/attachdetach/attach_detach_controller.go#L720 - it should migrate the PV to CSI if CSI migration is enabled on the corresponding node.

This was fixed in https://github.com/kubernetes/kubernetes/pull/101737 and backported to all supported releases; however, the backports have not been merged into OCP yet. I tested CSI migration with the aforementioned PR in https://github.com/openshift/kubernetes/pull/844 and it passed once (apart from the usual flakes).

The Kubernetes rebase has landed, and the CSI migration tests are greener now: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-broken#periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-csi-migration

*** Bug 1974906 has been marked as a duplicate of this bug. ***

Verified: pass.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759
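For context on the fix direction described above: on startup, when the A/D controller rebuilds its state from existing VolumeAttachments, a PV served by a migrated in-tree plugin must first be translated to its CSI form, so the attachment is keyed the same way the kubelet reports it. A hedged sketch of that decision, with all types and helpers as simplified stand-ins for illustration, not the real attach_detach_controller or csi-translation-lib code:

```go
package main

import "fmt"

// persistentVolume is a simplified stand-in for a PV's volume source.
type persistentVolume struct {
	inTreePlugin string // e.g. "kubernetes.io/aws-ebs"; empty if already CSI
	csiDriver    string // e.g. "ebs.csi.aws.com" after translation
	volumeID     string
}

// migrationEnabled stands in for the per-plugin CSI migration
// feature-gate/node check the controller performs.
func migrationEnabled(plugin string) bool {
	return plugin == "kubernetes.io/aws-ebs"
}

// translateToCSI stands in for csi-translation-lib's in-tree-to-CSI
// PV translation.
func translateToCSI(pv persistentVolume) persistentVolume {
	pv.csiDriver = "ebs.csi.aws.com"
	pv.inTreePlugin = ""
	return pv
}

// uniqueName returns the key under which the controller tracks the
// attachment. Translating before keying is the step the pre-fix
// startup path skipped, which is why the volume ended up tracked (and
// detached) under its in-tree name.
func uniqueName(pv persistentVolume) string {
	if migrationEnabled(pv.inTreePlugin) {
		pv = translateToCSI(pv)
	}
	if pv.csiDriver != "" {
		return "kubernetes.io/csi/" + pv.csiDriver + "^" + pv.volumeID
	}
	return pv.inTreePlugin + "/" + pv.volumeID
}

func main() {
	pv := persistentVolume{inTreePlugin: "kubernetes.io/aws-ebs", volumeID: "vol-09a070fded8cd4675"}
	fmt.Println(uniqueName(pv))
}
```

With translation in place, the attachment rebuilt on startup carries the same CSI unique name the node status reports, and no spurious in-tree detach is scheduled.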