Bug 1869162
| Summary: | HE env - HA vm fails to start on HSM after the SPM went down and all SDs deactivated |
|---|---|
| Product: | [oVirt] ovirt-engine |
| Reporter: | Ilan Zuckerman <izuckerm> |
| Component: | BLL.Virt |
| Assignee: | Arik <ahadas> |
| Status: | CLOSED NOTABUG |
| QA Contact: | meital avital <mavital> |
| Severity: | high |
| Priority: | unspecified |
| Version: | 4.4.1.10 |
| CC: | ahadas, bugs, eshenitz, nsoffer, pkrempa |
| Target Milestone: | ovirt-4.4.3 |
| Flags: | pm-rhel: ovirt-4.4+ |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Doc Type: | If docs needed, set a value |
| Story Points: | --- |
| Last Closed: | 2020-11-10 18:24:30 UTC |
| Type: | Bug |
| Regression: | --- |
| oVirt Team: | Virt |
Description (Ilan Zuckerman, 2020-08-17 06:21:04 UTC)
We try to restart the VM on host_mixed_3:

2020-08-05 10:58:18,022+03 INFO [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-162671) [6ca36120] Running command: RunVmCommand internal: true. Entities affected : ID: 3d6f05b6-3f91-4979-b006-a045891cf6f8 Type: VMAction group RUN_VM with role type USER

But it fails, probably due to the inability to access the lease (Ilan, please attach the vdsm log from host_mixed_3):

2020-08-05 10:58:18,013+03 WARN [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-162671) [6ca36120] The VM lease storage domain 'iscsi_0' status is not active, VM 'test' will fail to run in case the storage domain isn't reachable

It looks like the engine is not prepared for a scenario in which a VM that was set to Unknown status on host X is detected as Down on host Y [1].

So we need to do the following: if a VM that is set to Unknown on host X is detected as Down on host Y and the exit reason indicates it was not a proper shutdown, we should keep its run_on_vds pointing to host X and keep its Unknown status.

[1] The assumption was that if the VM ran on host X it would at least get to the PoweringUp state on host Y, but this is not necessarily true, especially with https://gerrit.ovirt.org/#/c/94904

Sorry, the env which was used for this issue was already re-built. I don't have any additional logs. I believe you can fairly easily reproduce this on any other HE env.

Please reproduce it and attach the vdsm log on the host that the VM fails to run on.

(In reply to Arik from comment #3)
> Please reproduce it and attach the vdsm log on the host that the VM fails to run on.

Reproduced the issue again with the same setup:

- hosted-engine-08: Engine
- caracal01 (10.46.30.1): set as SPM, no VMs
- caracal02 (10.46.30.2): SPM priority set to 'never', no VMs
- caracal03 (10.46.30.3): SPM priority set to 'never', running the HE VM

Attaching vdsm logs for all of the hosts, plus the engine log. I began the test at 09:06 Israeli time.
Gathered logs at 09:20.

Created attachment 1711802 [details]
Second reproduction logs
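The handling proposed in the description (keep a VM's Unknown status and its run_on_vds when it is reported Down on a different host with an abnormal exit reason) can be sketched roughly as follows. This is an illustrative sketch, not the actual engine code; the class, field names, and exit-reason code are assumptions made for the example:

```python
# Hypothetical exit-reason code for a clean shutdown (not taken from the engine).
NORMAL_SHUTDOWN = 0

class Vm:
    def __init__(self, status, run_on_vds):
        self.status = status          # e.g. "Unknown", "Down"
        self.run_on_vds = run_on_vds  # host the engine believes runs the VM

def on_vm_down(vm, reporting_host, exit_reason):
    """Handle a 'VM went down' report coming from a host."""
    if (vm.status == "Unknown"
            and reporting_host != vm.run_on_vds
            and exit_reason != NORMAL_SHUTDOWN):
        # A failed restart attempt on another host: keep the Unknown status
        # and leave run_on_vds pointing at the original, non-responsive host.
        return vm
    # Otherwise treat the report as a real shutdown.
    vm.status = "Down"
    vm.run_on_vds = None
    return vm

vm = Vm("Unknown", "host_mixed_1")
on_vm_down(vm, "host_mixed_3", exit_reason=1)  # abnormal exit on another host
assert vm.status == "Unknown" and vm.run_on_vds == "host_mixed_1"
```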
Thanks Ilan.
So we fail to run the VM because of:
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-08-19T06:10:19.734831Z qemu-kvm: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":
false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}: Failed to get "write" lock
Is another process using the image [/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__8__nfs__0/263bb298-10aa-4cef-877f-fa23aa303848/images/ee14d011-127f-4819-b368-a6b1d9136129/b3535dbf-fd95-4e6a-b17c-d597624f5d16]?
The rule you've set in the firewall: does it block the connection to the NFS storage domain 263bb298-10aa-4cef-877f-fa23aa303848 as well, or only to the iSCSI storage domain that the lease resides on?
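The "Failed to get write lock" error above comes from qemu's advisory file locking on the image. The failure mode can be demonstrated in miniature on a throwaway file; note that qemu actually uses OFD fcntl byte-range locks, so `flock()` here is an analogous mechanism for illustration only:

```python
import fcntl
import tempfile

# A throwaway temp file stands in for the qcow2 image on NFS (path illustrative).
img = tempfile.NamedTemporaryFile(delete=False)
img.close()

holder = open(img.name, "r+b")       # plays the role of the still-alive qemu process
fcntl.flock(holder, fcntl.LOCK_EX)   # exclusive lock taken and never dropped

starter = open(img.name, "r+b")      # plays the role of qemu starting the VM elsewhere
try:
    fcntl.flock(starter, fcntl.LOCK_EX | fcntl.LOCK_NB)
    result = "lock acquired"
except BlockingIOError:
    result = "Failed to get write lock"  # same failure mode as in the log above

print(result)
```

As long as the first open file description holds the exclusive lock, the second non-blocking attempt fails, just as a second qemu cannot open the image for writing while the first process is alive.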
(In reply to Arik from comment #6)
> Is another process using the image?
> The rule you've set in the firewall - does it block the connection also to the NFS storage domain 263bb298-10aa-4cef-877f-fa23aa303848 or only to the scsi storage domain that the lease resides in?

The NFS storage resides on another domain: mantis-nfs-lif2.lab.eng.XXX.com, while I only blocked the iSCSI domain: 3par-iscsi-XXX.com. The lease was made on iSCSI.

Nir, the qemu process lost connection to the lease, so the lease was released, but apparently the lock on a disk that resides in a different storage domain was not released. Shouldn't the VM switch to the Paused state in this scenario, leading to the disk's lock being released as well?

(In reply to Arik from comment #8)
I'm not sure I understand what happened on the host, since there is not enough data here. But based on your question, I assume that:

- the VM was running with a lease on iSCSI storage and a disk on NFS storage
- the VM held a lease on the iSCSI storage
- the VM held a lock on the NFS storage
- access to the iSCSI storage was blocked
- the VM paused because of an I/O error
- libvirt released the lease on the iSCSI storage

I assume the VM was paused on an I/O error, because in the other case (the VM killed by sanlock after the lease expired) the qemu process would have terminated and the lock on NFS would have been released. So the VM still holds a lock on the disk in NFS storage.
This lock is not managed by libvirt, so libvirt cannot release it. Peter, can you confirm this?

Yes, the qemu locks behave slightly differently in such situations. Specifically, even a paused VM still holds qemu's write lock on the disk image (stored on NFS in the above case), since qemu may still have unwritten metadata in memory. The image will be unlocked only after the qemu process terminates.

Nir, Peter - thanks! So per comment 10, the expectation needs to change here: we shouldn't stop trying to restart the VM elsewhere, but it won't start until the VM goes down on the disconnected host.

I assume we are using "Resume behavior: Kill", right? In this case the VM will be killed if it was paused for too long, and after that, starting the VM on another host should succeed.

Yep, the VM is configured with a VM lease, so its resume behavior is set to 'kill'.

OK, so the engine actually handles the scenario I described in comment 1 properly, by keeping the VM in status Unknown and with run_on_vds pointing to the non-responsive host. The attached engine.log ends ~9 minutes after the failed restart attempt (the failure is explained in comment 10); in my environment it takes longer for the next restart attempt to happen, so I believe there were more restart attempts later on. If that's not the case (i.e., there is no additional restart attempt of that VM), please reopen.

Considered not a bug in last release (4.4.3)