Created attachment 1656447 [details]
vm-dv-fedora.yaml

Description of problem:
We found a situation where the VM OS fails to start: on Linux it is stuck in GRUB (see attached), and on Windows the OS goes into repair mode. The issue reproduces with rook-ceph on PSI. We create a VM with a DV and set it to running: true (see the attached Fedora VM yaml). We suspect that the image extract has not finished and the VM is trying to start from a disk that is not fully loaded. Restarting the VM solves the issue. We don't see the issue on HPP.

I think this is a race condition: if the image is large the extract time is long, and if you start the VM during the extract it gets stuck at the OS. I think the CDI importer pod needs to report that it is done only after it has extracted the image; in our case we use qcow2 images.

Note: With Windows 2019 we see that the extract takes ~5 min (on PSI).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create the VM via the attached yaml (oc create -f vm-dv-fedora.yaml)
2. Set the VM to start right after the DV & PVC are created and finished
3. Open a console to the VM and check that it didn't boot and went to GRUB

Actual results:
VM failed to boot

Expected results:
VM boots normally

Additional info:
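The steps above can be sketched as a shell session. The VM and DataVolume names come from the attached yaml; the commands are echoed rather than executed here since they need a live cluster, and the `oc get dv -w` watch is an assumed way to observe import completion:

```shell
# Reproduction sketch; commands are echoed, not executed, because they
# require a live cluster. Names are taken from the attached yaml.
run() { echo "$@"; }
run oc create -f vm-dv-fedora.yaml
run oc get dv -w                           # watch the DataVolume import progress
run virtctl console vm-fedora-datavolume-new
```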
Maya, Adam, do we have an OCS bug to make a dependency with this one?
Hi Niels. It seems that when using Ceph RBD in ReadWriteMany mode, the volume may not be fully replicated on all nodes of the cluster by the time we are ready to use it, and in such cases we may read unexpected data from the device, causing undefined behavior. Have you seen anything like this so far in your testing of Ceph?
I believe I have solved this bug with the linked PR. For context, here is the commit message:

When converting images, the qemu-img command uses a writeback cache mode by default. This means that writes to a block device go through the host page cache and are lazily flushed to the storage device. When using this cache mode, the process can exit before all writes have reached storage, and our DataVolume could appear ready to use before all I/O has completed. This becomes a problem with shared storage because a second host does not have the benefit of the first host's page cache when reading.

To prevent this problem we must perform I/O directly to the storage device. This behavior is selected by passing "-t" "none" to qemu-img. This problem will not arise with the "create", "info", or "resize" qemu-img commands because those are metadata operations.

The following document provides some good background on qemu cache modes:
https://documentation.suse.com/sles/11-SP4/html/SLES-kvm4zseries/cha-qemu-cachemodes.html
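The fix can be illustrated with a conversion invocation. The source image name and target device path below are hypothetical, not taken from the PR, and the command is echoed rather than run since it needs a real block device:

```shell
# Illustrative sketch of a conversion step with direct I/O.
# "-t none" sets the destination cache mode to "none" (O_DIRECT),
# so qemu-img does not exit until writes have reached the device.
# Source image and target path here are hypothetical placeholders.
src="source.qcow2"
dst="/dev/mapper/target-lv"
echo qemu-img convert -f qcow2 -O raw -t none "$src" "$dst"
```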
Still blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=1805627
Verified with the following code:
-----------------------------------------------
oc version
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.4.0-rc.1
Kubernetes Version: v1.17.1

virtctl version
Client Version: version.Info{GitVersion:"v0.26.1", GitCommit:"e40ff7965e2aadbf21131626dfa3be85524e3a2c", GitTreeState:"clean", BuildDate:"2020-02-19T16:16:36Z", GoVersion:"go1.12.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.26.3", GitCommit:"e053fc2fe81102215a5e0cac2fbb705348f52ff1", GitTreeState:"clean", BuildDate:"2020-03-11T13:14:00Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Verified with the following scenario on NFS, as we have no block storage class:
----------------------------------------------
Created a VM using the attached yaml, with the 'running' flag set to 'true'.
Once the DV was created and the import completed, virtctl console to the VM - successfully accessed the VM.

Yaml used:
----------------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-fedora
  name: vm-fedora-datavolume-new
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: fedora-dv-new
    spec:
      pvc:
        volumeMode:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: nfs
      source:
        http:
          url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/fedora-images/Fedora-Cloud-Base-29-1.2.x86_64.qcow2
    status: {}
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-fedora-datavolume-new
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        machine:
          type: ""
        resources:
          requests:
            memory: 1Gi
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: fedora-dv-new
        name: datavolumedisk1

Moving to VERIFIED
Verified with the following code:
----------------------------------------
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd

virtctl version
Client Version: version.Info{GitVersion:"v0.26.1", GitCommit:"e40ff7965e2aadbf21131626dfa3be85524e3a2c", GitTreeState:"clean", BuildDate:"2020-02-19T16:13:09Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.30.0", GitCommit:"736b4aec6b2e92f2ab09ccfa6c5eb79366e88e5a", GitTreeState:"clean", BuildDate:"2020-06-09T10:37:48Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}

openshift-cnv   kubevirt-hyperconverged-operator.v2.4.0   OpenShift virtualization   2.4.0   Succeeded

Verified with the following scenario:
----------------------------------------
Created a VM using the attached yaml, with the 'running' flag set to 'true'.
Once the DV was created and the import completed, virtctl console to the VM - successfully accessed the VM.

Yaml used to verify this bz:
----------------------------------------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-fedora
  name: vm-fedora-datavolume-new
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: fedora-dv-new
    spec:
      pvc:
        volumeMode: Block
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: ocs-storagecluster-ceph-rbd
      source:
        http:
          url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/fedora-images/Fedora-Cloud-Base-29-1.2.x86_64.qcow2
    status: {}
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-fedora-datavolume-new
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        machine:
          type: ""
        resources:
          requests:
            memory: 1Gi
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: fedora-dv-new
        name: datavolumedisk1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3194