Bug 1954509
Summary: FC volume is marked as unmounted after failed reconstruction
Product: OpenShift Container Platform
Component: Storage
Storage sub component: Kubernetes
Reporter: Juan Luis de Sousa-Valadas <jdesousa>
Assignee: Jan Safranek <jsafrane>
QA Contact: Chao Yang <chaoyang>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: aos-bugs, chaoyang, jsafrane, piqin, rbdiri, rpalco
Version: 3.11.0
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause:
FibreChannel volumes were not correctly unmounted from a node when a pod that used the volume was deleted in the API server while kubelet on the node was not running, e.g. because of a crash.
Consequence:
While the volume is still mounted on the original node, application pods may be scheduled to another node and the volume may be mounted there. Having a single volume mounted on two nodes corrupts the filesystem on the volume.
Fix:
FibreChannel volumes are unmounted when a new kubelet starts. At the same time, the volume cannot be mounted on another node until the new kubelet fully starts and confirms the volume is unmounted.
Result:
FibreChannel volumes are not corrupted.
Story Points: ---
Clones: 1970977 (view as bug list)
Last Closed: 2021-07-27 23:04:13 UTC
Type: Bug
Bug Blocks: 1970977
Description
Juan Luis de Sousa-Valadas
2021-04-28 10:04:43 UTC
The volumes are indeed unmounted (UnmountDevice) *after* a Pod is deleted; however, Kubernetes has code that prevents the volume from being attached to another node before UnmountDevice really succeeds. Check node.status.volumesAttached and node.status.volumesInUse: volumesInUse should be cleared only after a volume is really unmounted, and only after that can a RWO volume be attached somewhere else. I know FC itself does not have a 3rd-party attach service, but that is the reason we have a fake one in Kubernetes that does nothing except prevent one RWO volume from being "attached" to multiple nodes in parallel, with the error "Volume is already exclusively attached to one node and can't be attached to another".

In https://bugzilla.redhat.com/show_bug.cgi?id=1903524 we even added an extra "blockdev --flushbufs" and "echo '- - -' to /sys/block/<device>/device/delete" to flush everything after a volume is unmounted.

Is it possible that this does not work in the customer environment?

Can you please post kubelet + kube-controller-manager logs showing the volume was mounted on node A *before* it was unmounted from node B? That should not happen, and if it does, it is indeed a bug. "oc get node -o yaml" would help too, to see how the volume is added to / removed from volumesAttached / volumesInUse.

Jan,

> Is it possible that this does not work in the customer environment?

I think that's the case, and I managed to force a reproducer in a Red Hat environment. I'll ask QA for another cluster to reproduce this.

> Can you please post kubelet + kube-controller-manager logs showing the volume was mounted on node A *before* it was unmounted from node B? That should not happen, and if it does, it is indeed a bug.

Unfortunately I don't have both kubelet logs; the kube-controller-manager logs are in the sosreport, but we only have the kubelet logs from node B.

Qin, would it be possible for QA to reproduce it, or to give me a 3.11 cluster with Fibre Channel?
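The volumesAttached / volumesInUse check above can be sketched as a couple of shell helpers. This is a hedged illustration, not part of the fix: the function names are made up, and it assumes `oc` access to the cluster; the node name is a placeholder.

```shell
# List volumes a node still reports as in use. A RWO volume should only be
# attachable elsewhere once it disappears from this list.
volumes_in_use() {
  oc get node "$1" -o jsonpath='{.status.volumesInUse[*]}'
}

# List volumes the (fake) attach/detach controller considers attached.
volumes_attached() {
  oc get node "$1" -o jsonpath='{.status.volumesAttached[*].name}'
}
```

For example, `volumes_in_use hp-dl388g9-06.rhts.eng.pek2.redhat.com` should stop listing the FC volume once UnmountDevice has really succeeded on that node.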
Chao Yang got me one about 3 weeks ago when I filed the bug.

Filed issue upstream: https://github.com/kubernetes/kubernetes/issues/101791
Upstream PR: https://github.com/kubernetes/kubernetes/pull/101792

https://github.com/kubernetes/kubernetes/pull/94781/ fixes "could not get consistent content of /proc/mounts after 3 attempts" (there will be 999 attempts now) and recovery from "multipath -f" errors. Still, I don't know how to reproduce it reliably:

1. Create a pod that uses a FC volume.
2. Stop kubelet.
3. Delete the pod with force (oc delete pod XYZ --force --grace-period=0).
4. Start kubelet.
5. Hope that the FC volume plugin hits "could not get consistent content of /proc/mounts after 3 attempts".

If you're lucky, then before the patch you'll see "could not get consistent content of /proc/mounts after 3 attempts" and the volume still mounted in /var/lib/origin/openshift.local.volumes/plugins/kubernetes.io/fc/... even after ~5 minutes. After the patch it's impossible to get that unlucky (or you'll see "could not get consistent content of /proc/mounts after 999 attempts"); the volume should be unmounted, the multipath device flushed, and the device deleted.

(Instead of step 3, I can see in the logs that a pod probe failed and caused the pod to be deleted. Not sure how reliably you can reproduce that.)

This needs some serious discussion upstream. As you can see, several bad things need to happen at the same time, and when they do, I don't see a good workaround. Perhaps always drain a node before restarting kubelet, but that's hard to ensure with ansible.

This PR could be helpful as well: https://github.com/kubernetes/kubernetes/pull/101862. I noticed that "multipath -f /dev/dm-4" (= the multipath device) sometimes fails with "device is busy", even after umount. Therefore kubelet should try it several times.

I created a cherry-pick to Kubernetes 1.21: https://github.com/kubernetes/kubernetes/pull/102656

Could not install a cluster because PSI is not stable today.
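The five repro steps above can be sketched as one script. This is only a hedged sketch: `fc-pod.yaml` and the pod name are placeholders, the systemctl commands must run on the node where the pod is scheduled, and since the race is timing-dependent it may not trigger on every run.

```shell
repro_fc_unmount_race() {
  oc apply -f fc-pod.yaml                          # 1. create a pod using a FC volume
  systemctl stop kubelet                           # 2. stop kubelet on the pod's node
  oc delete pod fc-pod --force --grace-period=0    # 3. force-delete while kubelet is down
  systemctl start kubelet                          # 4. start kubelet again
  # 5. look for the reconstruction failure in the kubelet logs
  journalctl -u kubelet --no-pager \
    | grep 'could not get consistent content of /proc/mounts' || true
}
```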
1. Create pv/pvc/pod:

    /dev/sdd on /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
    /dev/sdd on /var/lib/kubelet/pods/e402f318-d871-47d8-86a0-a0c98ee89c9b/volumes/kubernetes.io~fc/pv0001 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

2. Stop the kubelet service:

    hp-dl388g9-06.rhts.eng.pek2.redhat.com   NotReady   worker   3h14m   v1.21.0-rc.0+120883f

3. oc delete pod deployment-1-6db9b5d75d-zrl7s --force --grace-period=0

4. oc describe pods:

    Events:
      Type     Reason              Age    From                     Message
      ----     ------              ----   ----                     -------
      Normal   Scheduled           3m12s  default-scheduler        Successfully assigned valadas/deployment-1-6db9b5d75d-g8shf to hp-dl388g9-03.rhts.eng.pek2.redhat.com
      Warning  FailedAttachVolume  3m13s  attachdetach-controller  Multi-Attach error for volume "pv0001" Volume is already exclusively attached to one node and can't be attached to another
      Warning  FailedMount         69s    kubelet                  Unable to attach or mount volumes: unmounted volumes=[local], unattached volumes=[kube-api-access-mbqcc local]: timed out waiting for the condition

5. Start the kubelet service.

6. The pod is running:

    oc get pods -o wide
    NAME                            READY   STATUS    RESTARTS   AGE   IP            NODE                                     NOMINATED NODE   READINESS GATES
    deployment-1-6db9b5d75d-g8shf   1/1     Running   0          30m   10.131.2.85   hp-dl388g9-03.rhts.eng.pek2.redhat.com   <none>           <none>

Kubelet logs from the original node:

    Jun 16 12:27:40 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[2280]: I0616 12:27:40.867577 2280 operation_generator.go:616] MountVolume.MountDevice succeeded for volume "pv0001" (UniqueName: "kubernetes.io/fc/[50050768030539b6]:0") pod "deployment-1-6db9b5d75d-zrl7s" (UID: "e402f318-d871-47d8-86a0-a0c98ee89c9b") device mount path "/var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0"
    Jun 16 12:27:40 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[2280]: I0616 12:27:40.874155 2280 operation_generator.go:698] MountVolume.SetUp succeeded for volume "pv0001" (UniqueName: "kubernetes.io/fc/[50050768030539b6]:0") pod "deployment-1-6db9b5d75d-zrl7s" (UID: "e402f318-d871-47d8-86a0-a0c98ee89c9b")
    Jun 16 15:13:37 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:37.362108 219234 reconciler.go:429] "Reconciler sync states: could not find pod information in desired state, update it in actual state" reconstructedVolume=&{volumeName:kubernetes.io/fc/[50050768030539b6]:0 podName:e402f318-d871-47d8-86a0-a0c98ee89c9b volumeSpec:0xc000011b90 outerVolumeSpecName:pv0001 pod:0xc00148a400 volumeGidValue: devicePath: mounter:0xc0013eb260 deviceMounter:0xc000f7cd80 blockVolumeMapper:<nil>}
    Jun 16 15:13:38 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:38.997411 219234 reconciler.go:196] "operationExecutor.UnmountVolume started for volume \"pv0001\" (UniqueName: \"kubernetes.io/fc/[50050768030539b6]:0\") pod \"e402f318-d871-47d8-86a0-a0c98ee89c9b\" (UID: \"e402f318-d871-47d8-86a0-a0c98ee89c9b\") "
    Jun 16 15:13:39 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: W0616 15:13:39.015696 219234 mount_helper_common.go:133] Warning: "/var/lib/kubelet/pods/e402f318-d871-47d8-86a0-a0c98ee89c9b/volumes/kubernetes.io~fc/pv0001" is not a mountpoint, deleting
    Jun 16 15:13:39 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:39.026348 219234 operation_generator.go:829] UnmountVolume.TearDown succeeded for volume "kubernetes.io/fc/[50050768030539b6]:0" (OuterVolumeSpecName: "pv0001") pod "e402f318-d871-47d8-86a0-a0c98ee89c9b" (UID: "e402f318-d871-47d8-86a0-a0c98ee89c9b"). InnerVolumeSpecName "pv0001". PluginName "kubernetes.io/fc", VolumeGidValue ""
    Jun 16 15:13:39 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:39.104454 219234 reconciler.go:312] "operationExecutor.UnmountDevice started for volume \"pv0001\" (UniqueName: \"kubernetes.io/fc/[50050768030539b6]:0\") on node \"hp-dl388g9-06.rhts.eng.pek2.redhat.com\" "
    Jun 16 15:13:39 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:39.288127 219234 operation_generator.go:930] UnmountDevice succeeded for volume "pv0001" %!(EXTRA string=UnmountDevice succeeded for volume "pv0001" (UniqueName: "kubernetes.io/fc/[50050768030539b6]:0") on node "hp-dl388g9-06.rhts.eng.pek2.redhat.com" )
    Jun 16 15:13:39 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[219234]: I0616 15:13:39.352556 219234 reconciler.go:319] "Volume detached for volume \"pv0001\" (UniqueName: \"kubernetes.io/fc/[50050768030539b6]:0\") on node \"hp-dl388g9-06.rhts.eng.pek2.redhat.com\" DevicePath \"\""

Second scenario:

1. Create pv/pvc/pod.
2. touch /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0/my-file
3. Make the device busy.
4. Stop the kubelet service.
5. oc delete pod --force --grace-period=0
6. Start the kubelet service.
7. After a while, the pod is running on another node.
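One way to sketch step 3 above ("make the device busy"): park a process whose working directory is inside the FC mount, so umount fails with "target is busy" until the process is killed. The mount path is the one from this comment; the helper function name is made up for illustration.

```shell
hold_mount_busy() {
  # Background a process with its cwd inside the given mount point; while it
  # lives, umount of that filesystem fails with "target is busy".
  ( cd "$1" && exec sleep infinity ) >/dev/null 2>&1 &
  echo $!   # PID of the holder; kill it to release the mount
}
```

Usage: `pid=$(hold_mount_busy /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0)`, then `kill "$pid"` once the busy condition is no longer needed.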
But we can see the message below on the first node:

    mount | grep sdd
    /dev/sdd on /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0 type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

    Jun 17 14:41:02 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[233004]: I0617 14:41:02.395210 233004 reconciler.go:312] "operationExecutor.UnmountDevice started for volume \"pv0001\" (UniqueName: \"kubernetes.io/fc/[50050768030539b6]:0\") on node \"hp-dl388g9-06.rhts.eng.pek2.redhat.com\" "
    Jun 17 14:41:02 hp-dl388g9-06.rhts.eng.pek2.redhat.com hyperkube[233004]: E0617 14:41:02.406178 233004 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/fc/[50050768030539b6]:0 podName: nodeName:}" failed. No retries permitted until 2021-06-17 14:43:04.406085866 +0000 UTC m=+3799.915289059 (durationBeforeRetry 2m2s). Error: "UnmountDevice failed for volume \"pv0001\" (UniqueName: \"kubernetes.io/fc/[50050768030539b6]:0\") on node \"hp-dl388g9-06.rhts.eng.pek2.redhat.com\" : fc: failed to unmount: /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0\nError: unmount failed: exit status 32\nUnmounting arguments: /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0\nOutput: umount: /var/lib/kubelet/plugins/kubernetes.io/fc/50050768030539b6-lun-0: target is busy.\n"

    [root@hp-dl388g9-06 50050768030539b6-lun-0]# date
    Thu Jun 17 14:42:05 UTC 2021

@jsafrane, I can see that kubelet keeps trying to unmount the device on the first node. Is this correct?
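The retry discussed earlier in this thread (kubelet should try "multipath -f" several times because the device can stay busy briefly even after umount) can be sketched as a shell loop. This is a hedged illustration, not the shipped kubelet code: the function name, attempt count, and delay are all made up here.

```shell
flush_multipath_with_retry() {
  local dev="$1" attempts="${2:-5}" i
  for i in $(seq 1 "$attempts"); do
    # Flush buffered writes first, then try to remove the multipath map.
    blockdev --flushbufs "$dev" 2>/dev/null
    if multipath -f "$dev" 2>/dev/null; then
      echo "flushed $dev on attempt $i"
      return 0
    fi
    sleep 1   # the device may stop being busy shortly after umount
  done
  echo "$dev still busy after $attempts attempts" >&2
  return 1
}
```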
When kubelet can't unmount a volume for 6 minutes, kube-controller-manager force-detaches the volume:

    W0617 19:42:33.596490 1 reconciler.go:222] attacherDetacher.DetachVolume started for volume "pv0001" (UniqueName: "kubernetes.io/fc/[50050768030539b6]:0") on node "hp-dl388g9-03.rhts.eng.pek2.redhat.com" This volume is not safe to detach, but maxWaitForUnmountDuration 6m0s expired, force detaching
    I0617 19:42:33.596547 1 operation_generator.go:483] DetachVolume.Detach succeeded for volume "pv0001" (UniqueName: "kubernetes.io/fc/[50050768030539b6]:0") on node "hp-dl388g9-03.rhts.eng.pek2.redhat.com"

Since there is no 3rd-party attach/detach for FC (and iSCSI) volumes, the volume is marked as detached and then becomes usable on the other node. This is not related to this BZ and will happen even without a kubelet restart; just block the unmount for 6 minutes, with any volume.

What I fixed is "could not get consistent content of /proc/mounts after 3 attempts": kubelet now retries reading /proc/mounts for 1 minute. But I don't know how to reproduce it; the kernel must be seriously busy and/or constantly mounting/unmounting volumes when kubelet starts.

Verified on 4.8.0-0.nightly-2021-06-16-190035

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438