Bug 1796342
| Summary: | VM Failing to start since hard disk not ready | | |
|---|---|---|---|
| Product: | Container Native Virtualization (CNV) | Reporter: | vsibirsk |
| Component: | Storage | Assignee: | Adam Litke <alitke> |
| Status: | CLOSED ERRATA | QA Contact: | Kevin Alon Goldblatt <kgoldbla> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 2.2.0 | CC: | alitke, cnv-qe-bugs, mrashish, ndevos, ngavrilo, qixuan.wang |
| Target Milestone: | --- | | |
| Target Release: | 2.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | virt-cdi-operator-container-v2.3.0-32 hco-bundle-registry-container-v2.2.0-353 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-28 19:09:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1794050 | | |
Maya, Adam, do we have an OCS bug to make a dependency with this one?

Hi Niels. It seems that when using ceph rbd in ReadWriteMany mode, the volume may not be fully replicated on all nodes of the cluster by the time we are ready to use it, and in such cases we may read unexpected data from the device, causing undefined behavior. Have you seen anything like this so far in your testing of ceph?

I believe I have solved this bug with the linked PR. For context, here is the commit message:

When converting images the qemu-img command uses a writeback cache mode by default. This means that writes to a block device go through the host page cache and are lazily flushed to the storage device. When using this cache mode, the process can exit before all writes have reached storage and our DataVolume could appear ready to use before all I/O has completed. This becomes a problem with shared storage because a second host does not have the benefit of the first host's page cache when reading. To prevent this problem we must perform I/O directly to the storage device. This behavior is selected by passing "-t" "none" to qemu-img. This problem will not arise with the "create", "info", or "resize" qemu-img commands because those are metadata-only operations.

The following document provides some good background on qemu cache modes: https://documentation.suse.com/sles/11-SP4/html/SLES-kvm4zseries/cha-qemu-cachemodes.html

Still blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=1805627
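For illustration, a minimal sketch of the cache-mode change described in the commit message above, assuming the standard qemu-img CLI; the source image and target device paths are placeholders, and the exact arguments CDI passes may differ:
----------------------------------------
# Default (writeback) cache: writes can still be sitting in the host page
# cache when qemu-img exits, so a different node may read stale data.
qemu-img convert -O raw Fedora-Cloud-Base-29-1.2.x86_64.qcow2 /dev/example-block-volume

# With "-t none" the destination is opened with O_DIRECT, bypassing the host
# page cache, so the converted data has reached the device when the command returns.
qemu-img convert -t none -O raw Fedora-Cloud-Base-29-1.2.x86_64.qcow2 /dev/example-block-volume
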
Verified with the following code:
-----------------------------------------------
oc version
Client Version: 4.4.0-0.nightly-2020-02-17-022408
Server Version: 4.4.0-rc.1
Kubernetes Version: v1.17.1
virtctl version
Client Version: version.Info{GitVersion:"v0.26.1", GitCommit:"e40ff7965e2aadbf21131626dfa3be85524e3a2c", GitTreeState:"clean", BuildDate:"2020-02-19T16:16:36Z", GoVersion:"go1.12.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.26.3", GitCommit:"e053fc2fe81102215a5e0cac2fbb705348f52ff1", GitTreeState:"clean", BuildDate:"2020-03-11T13:14:00Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Verified with the following scenario on NFS, as we have no Block storage class:
----------------------------------------------
Created a VM using the YAML below, with the 'running' flag set to 'true'.
Waited until the DV was created and the import completed.
Opened a virtctl console to the VM and successfully accessed the VM.
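A minimal sketch of the commands behind these steps, assuming the manifest below was saved as vm-fedora-nfs.yaml (the file name is an assumption; the DV and VM names come from the YAML that follows):
----------------------------------------
# Create the VM; with running: true it boots as soon as the DV import finishes.
oc create -f vm-fedora-nfs.yaml

# Watch the DataVolume until its phase reaches Succeeded.
oc get dv fedora-dv-new -w

# Open a console and confirm the guest boots past GRUB.
virtctl console vm-fedora-datavolume-new
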
Yaml used:
----------------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-fedora
  name: vm-fedora-datavolume-new
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: fedora-dv-new
    spec:
      pvc:
        volumeMode:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: nfs
      source:
        http:
          url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/fedora-images/Fedora-Cloud-Base-29-1.2.x86_64.qcow2
    status: {}
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-fedora-datavolume-new
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        machine:
          type: ""
        resources:
          requests:
            memory: 1Gi
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: fedora-dv-new
        name: datavolumedisk1
Moving to VERIFIED
Verified with the following code:
----------------------------------------
oc version
Client Version: 4.5.0-rc.2
Server Version: 4.5.0-rc.2
Kubernetes Version: v1.18.3+91d0edd
virtctl version
Client Version: version.Info{GitVersion:"v0.26.1", GitCommit:"e40ff7965e2aadbf21131626dfa3be85524e3a2c", GitTreeState:"clean", BuildDate:"2020-02-19T16:13:09Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{GitVersion:"v0.30.0", GitCommit:"736b4aec6b2e92f2ab09ccfa6c5eb79366e88e5a", GitTreeState:"clean", BuildDate:"2020-06-09T10:37:48Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
openshift-cnv kubevirt-hyperconverged-operator.v2.4.0 OpenShift virtualization 2.4.0 Succeeded
Verified with the following scenario:
----------------------------------------
Created a VM using the YAML below, with the 'running' flag set to 'true'.
Waited until the DV was created and the import completed.
Opened a virtctl console to the VM and successfully accessed the VM.
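Since this run used the ceph-rbd Block path where the race was originally seen, a quick check that the import really completed before boot (a sketch; it assumes the standard DataVolume status.phase field, which should read Succeeded):
----------------------------------------
oc get dv fedora-dv-new -o jsonpath='{.status.phase}'
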
Yaml used to verify this bz:
----------------------------------------
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  labels:
    kubevirt.io/vm: vm-fedora
  name: vm-fedora-datavolume-new
spec:
  dataVolumeTemplates:
  - metadata:
      creationTimestamp: null
      name: fedora-dv-new
    spec:
      pvc:
        volumeMode: Block
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: ocs-storagecluster-ceph-rbd
      source:
        http:
          url: http://cnv-qe-server.rhevdev.lab.eng.rdu2.redhat.com/files/cnv-tests/fedora-images/Fedora-Cloud-Base-29-1.2.x86_64.qcow2
    status: {}
  running: true
  template:
    metadata:
      labels:
        kubevirt.io/vm: vm-fedora-datavolume-new
    spec:
      domain:
        devices:
          disks:
          - disk:
              bus: virtio
            name: datavolumedisk1
        machine:
          type: ""
        resources:
          requests:
            memory: 1Gi
      terminationGracePeriodSeconds: 0
      volumes:
      - dataVolume:
          name: fedora-dv-new
        name: datavolumedisk1
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194
Created attachment 1656447 [details]
vm-dv-fedora.yaml

Description of problem:
We found a situation where the VM OS fails to start: on Linux it gets stuck in GRUB (see attached), and on Windows the OS goes into repair mode. The issue reproduces with rook-ceph on PSI. We create a VM with a DV and set it to running: true (see the attached Fedora VM YAML). We suspect that the image extraction has not finished and the VM is trying to start from a hard disk that is not yet fully written. Restarting the VM resolves the issue. We do not see the issue on HPP.

I think this is a race condition for the VM: if the image is large, the extraction takes a long time, and if you start the VM during that window it gets stuck at OS boot. I think we need the CDI importer pod to report that it is done only after it has extracted the image; in our case we use qcow2 images.

Note: on Windows 2019 we see that the extraction takes ~5 min (on PSI).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Create the VM via the attached YAML (oc create -f vm-dv-fedora.yaml)
2. Set the VM to start right after the DV & PVC are created and finished
3. Open a console to the VM and check that it did not boot and dropped to GRUB

Actual results:
The VM failed to boot.

Expected results:
The VM boots normally.

Additional info:
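As noted above, restarting the VM works around the problem; a minimal sketch of that workaround, assuming the VM name from the verification YAML above (the name in the attached manifest may differ):
----------------------------------------
# Stop and start the VMI again; on the second boot the imported data is
# already on the device, so the guest boots normally.
virtctl restart vm-fedora-datavolume-new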