Bug 1749563
| Summary: | Migration failed for VM with Ceph and VolumeMode=Filesystem | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | Denys Shchedrivyi <dshchedr> |
| Component: | Virtualization | Assignee: | Fabian Deutsch <fdeutsch> |
| Status: | CLOSED WORKSFORME | QA Contact: | Denys Shchedrivyi <dshchedr> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 2.1.0 | CC: | cnv-qe-bugs, fbertina, fdeutsch, hchiramm, hekumar, jsafrane |
| Target Milestone: | --- | | |
| Target Release: | 2.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-09-16 06:56:29 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Denys, please provide the PV and PVC yamls as well. This is FS mode, thus I would expect that the PV/PVC are RWX; this needs to be confirmed. I also think that this is a dupe of bug #1745776.

From the error in describe, I am pretty sure this PVC is "RWO", or IOW, the error is a multi-attach error.

(In reply to Humble Chirammal from comment #2)
> From the error in describe, I am pretty sure this PVC is "RWO", or IOW, the
> error is a multi-attach error.

Ah.. seeing the output of the PV mentioned in the description, I am confused:

pvc-eb52992f-d00c-11e9-be37-5254001e809d 2Gi RWX Retain Released default/dv-cirros rook-ceph-block 8m

It says RWX. Also, paying more attention, it looks to me like both of these pods are running on the same system? If that's the case, iic, RWO does not make any difference.

Humble, what else is needed to further debug this issue?

(In reply to Fabian Deutsch from comment #4)
> Humble, what else is needed to further debug this issue?

Kubelet logs from the node where these pods are running or scheduled could help. From the attached template I see this is a RWX PVC; if that's the case, I wouldn't expect the multi-attach error to pop up (c#2). If this happens, it looks like a bug to me. I have to look into this in more detail, however at a glance it looks like waitForAttachandMount does not validate the access mode of the PVC, which causes this error. But the puzzle here is that, if that's the case, this could have been a general/common error for any upstream user. I am also requesting thoughts from the OCP team while we progress; meanwhile please attach the kubelet logs. Can we also get oc describe pv,pvc output from this setup?

Thanks Humble, let's check with Denys. Denys, can you provide the requested info?

And can we please get the real YAML files for both PV and PVC? "RWX" only means that Kubernetes thinks that the volume is shareable; it does not mean that it really is. Typically, Ceph RBD volumes are not shareable, CephFS is.

(In reply to Jan Safranek from comment #8)
> And can we please get the real YAML files for both PV and PVC? "RWX" only
> means that Kubernetes thinks that the volume is shareable; it does not mean
> that it really is. Typically, Ceph RBD volumes are not shareable, CephFS is.

Yeah, this is where the confusion comes from (c#3). Ideally, if we use CSI, PV creation for the RBD + VolumeMode=FS combo wouldn't have happened at all: https://github.com/ceph/ceph-csi/blob/master/pkg/rbd/controllerserver.go#L82. But it looks like the volume is provisioned or available, so I assume the PV is not provisioned by CSI. Is the PVC/PV provisioned through the in-tree RBD driver (I am not sure it supports RWX for FS mode), or in some other way? The reason I am asking is that the provisioner used here is not mentioned in the data provided in the bz description, so I am confirming it.

Jan, one question: if there is a PV spec marked with "RWX" and the PVC is getting used in more than one pod as in this bugzilla, does any possibility exist for a "multi attach error", or IOW, 'Volume is already attached by pod default/virt-launcher-vm-dv-cirros-v72rs. Status Running'?

Humble, this error comes from CSI driver, not from Kubernetes:
> Warning FailedMount 27s (x8 over 1m) kubelet, cluster1-4pt9b-worker-0-rwq2g MountVolume.SetUp failed for volume "pvc-eb52992f-d00c-11e9-be37-5254001e809d" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-eb52992f-d00c-11e9-be37-5254001e809d for pod default/virt-launcher-vm-dv-cirros-mk5vg. Volume is already attached by pod default/virt-launcher-vm-dv-cirros-v72rs. Status Running
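For context on the flex-vs-CSI question raised next: a dynamically provisioned PV records its provisioner in the `pv.kubernetes.io/provisioned-by` annotation, so the PV YAML requested above would settle which driver created this volume. Below is a minimal sketch of what to look for, reconstructed from the `oc get pv` listing in the description; the flex provisioner name `ceph.rook.io/block` is an assumption (it varies by Rook release), while `rook-ceph.rbd.csi.ceph.com` is the CSI provisioner that appears later in this thread.

```yaml
# Sketch only: relevant fields of the PV, not the actual object.
# Retrieve the real one with: oc get pv pvc-eb52992f-d00c-11e9-be37-5254001e809d -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-eb52992f-d00c-11e9-be37-5254001e809d
  annotations:
    # Rook flex-provisioned volume (assumed provisioner name):
    pv.kubernetes.io/provisioned-by: ceph.rook.io/block
    # A ceph-csi provisioned volume would instead carry something like:
    # pv.kubernetes.io/provisioned-by: rook-ceph.rbd.csi.ceph.com
spec:
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-ceph-block
```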
(In reply to Jan Safranek from comment #10)
> Humble, this error comes from CSI driver, not from Kubernetes:
>
> > Warning FailedMount 27s (x8 over 1m) kubelet, cluster1-4pt9b-worker-0-rwq2g MountVolume.SetUp failed for volume "pvc-eb52992f-d00c-11e9-be37-5254001e809d" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-eb52992f-d00c-11e9-be37-5254001e809d for pod default/virt-launcher-vm-dv-cirros-mk5vg. Volume is already attached by pod default/virt-launcher-vm-dv-cirros-v72rs. Status Running

I think I got what's happening here. Is this a Rook-provisioned volume using the "FLEX DRIVER"? If yes, that clears up a lot of the confusion! The reason I think this is the flex driver is that, first of all, the ceph-csi project does not provision volumes with the RBD + VolumeMode=FS combo, and it doesn't have any check in the ceph-csi code about multi-attach that would return an error to the kubelet as above.

I can't reproduce this issue again. We had some changes in the environment installation process, so unfortunately I can't verify if it was a "FLEX DRIVER" issue.

Now on our clusters it works as expected:

VM with Block mode and RWX - successfully migrated.

VM with Filesystem mode can't be RWX (expected behavior):

Warning ProvisioningFailed 11s (x6 over 27s) rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5dd69bb477-zhrql_5a261e9b-d3e1-11e9-a378-0a580a810018 failed to provision volume with StorageClass "rook-ceph-block": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes

(In reply to Denys Shchedrivyi from comment #12)
> I can't reproduce this issue again. We had some changes in the environment
> installation process, so unfortunately I can't verify if it was a "FLEX
> DRIVER" issue.
>
> Now on our clusters it works as expected:
>
> VM with Block mode and RWX - successfully migrated.
>
> VM with Filesystem mode can't be RWX (expected behavior):
> Warning ProvisioningFailed 11s (x6 over 27s)
> rook-ceph.rbd.csi.ceph.com_csi-rbdplugin-provisioner-5dd69bb477-zhrql_5a261e9b-d3e1-11e9-a378-0a580a810018
> failed to provision volume with StorageClass "rook-ceph-block": rpc error:
> code = InvalidArgument desc = multi node access modes are only supported on
> rbd `block` type volumes

Yeah, if you are using Ceph CSI, the provisioning itself would have failed with the above message. As this bug was reported with volumeMode=FS and RWX, I think it was with the flex driver. I agree we can close this bugzilla, as it is working as expected.

Folks, thanks for all the debugging effort. Closing this according to the last two comments; please reopen if necessary.
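For reference, the working configuration described in the closing comments corresponds to a shared Block-mode claim roughly like the following sketch (the claim name and size are illustrative; the storage class is the one from this report):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dv-cirros-block        # illustrative name
  namespace: default
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteMany            # RWX is accepted for RBD in Block mode
  volumeMode: Block            # per comment 12, a VM backed by this migrates successfully
  resources:
    requests:
      storage: 2Gi
```

The same claim with volumeMode: Filesystem is now rejected at provisioning time with the InvalidArgument error quoted above.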
Created attachment 1612082 [details]
vm yaml file

Description of problem:

A VM with a shared PVC on Ceph and VolumeMode=Filesystem can't be migrated. Migration starts, but the target pod is stuck in the "ContainerCreating" state with an error:

# oc get pod
NAME                               READY   STATUS              RESTARTS   AGE
virt-launcher-vm-dv-cirros-mk5vg   0/1     ContainerCreating   0          25s
virt-launcher-vm-dv-cirros-v72rs   1/1     Running             0          2m5s

# oc describe pod virt-launcher-vm-dv-cirros-mk5vg
.
Warning FailedMount 27s (x8 over 1m) kubelet, cluster1-4pt9b-worker-0-rwq2g MountVolume.SetUp failed for volume "pvc-eb52992f-d00c-11e9-be37-5254001e809d" : mount command failed, status: Failure, reason: Rook: Mount volume failed: failed to attach volume pvc-eb52992f-d00c-11e9-be37-5254001e809d for pod default/virt-launcher-vm-dv-cirros-mk5vg. Volume is already attached by pod default/virt-launcher-vm-dv-cirros-v72rs. Status Running

# oc get pv
pvc-eb52992f-d00c-11e9-be37-5254001e809d   2Gi   RWX   Retain   Released   default/dv-cirros   rook-ceph-block   8m

Actual results:
Migration failed.

Expected results:
Migration should complete successfully. If it is impossible to migrate a VM with Ceph and VolumeMode=Filesystem, we should prevent the migration from running.
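Based on the `oc get pv` output above, the claim backing this VM would look roughly like the following sketch (reconstructed from the PV listing; the real manifests are in the attached vm yaml and the PV/PVC dumps requested in the comments):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dv-cirros               # claim name taken from the PV's default/dv-cirros reference
  namespace: default
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteMany             # RWX, as listed by `oc get pv`
  volumeMode: Filesystem        # the combination that fails to migrate
  resources:
    requests:
      storage: 2Gi
```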