Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1869162

Summary: HE env - HA vm fails to start on HSM after the SPM went down and all SDs deactivated
Product: [oVirt] ovirt-engine
Reporter: Ilan Zuckerman <izuckerm>
Component: BLL.Virt
Assignee: Arik <ahadas>
Status: CLOSED NOTABUG
QA Contact: meital avital <mavital>
Severity: high
Priority: unspecified
Version: 4.4.1.10
CC: ahadas, bugs, eshenitz, nsoffer, pkrempa
Target Milestone: ovirt-4.4.3
Flags: pm-rhel: ovirt-4.4+
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2020-11-10 18:24:30 UTC
Type: Bug
oVirt Team: Virt
Attachments:
- engine and vdsm logs
- Second reproduction logs

Description Ilan Zuckerman 2020-08-17 06:21:04 UTC
Created attachment 1711553 [details]
engine and vdsm logs

Description of problem:

This BZ is related to BZ 1527249, which was verified on a regular env but failed verification on a HOSTED ENGINE env; hence this bug was opened.
The HA VM fails to restart on a non-SPM host after all of the SDs become deactivated.

------------
Here is the verification scenario from original BZ 1527249:

In an environment with 2 hosts 'h1' SPM and 'h2' HSM:
 
1. Set the HSM host ('h2') SPM priority to 'never'
2. Create a VM with a disk and a lease
3. Run the VM on the SPM
4. Block the connection from the SPM to the storage.
5. Block the connection from the engine to the SPM. ==> simulating crashed SPM
 
Actual:
The VM failed to start on the HSM ('h2') after the SPM ('h1') went down and all the storage domains were deactivated.

Expected:
The VM manages to run on the HSM ('h2') even if all the storage domains are down.
__________________________________________________

Here are detailed steps to reproduce the issue on a HOSTED ENGINE env:

Have the following setup on your env:

host1: set as SPM. no vms
host2: SPM priority set to 'never'. no vms
host3: SPM priority set to 'never'. running the HE vm

Steps:
1. Create template vm with lease on iscsi
2. Start it on host1
3. Block the connection from the SPM to the storage:
[root@caracal04 ~]# iptables -A OUTPUT -d 3par-iscsi-XXX.com -j DROP

4. Block the connection from the engine to the SPM. ==> simulating crashed SPM
[root@hosted-engine-09 ~]# iptables -A OUTPUT -d 10.46.XX.X -j DROP
(the IP is caracal04's)

Now all of the SDs went down, and the VM went to the 'unknown' status but was still shown on host1.

Actual results:
The HA VM does NOT migrate to host2 / host3 as expected, but remains in the 'unknown' state.

Expected results:
The VM manages to run on one of the HSM hosts (host2 / host3) even if all the storage domains are down.


Version-Release number of selected component (if applicable):
HE env rhv-release-4.4.1-12-001.noarch

How reproducible:
100%

Attaching vdsm and engine logs

Comment 1 Arik 2020-08-18 08:16:45 UTC
We try to restart the VM on host_mixed_3:
2020-08-05 10:58:18,022+03 INFO  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-162671) [6ca36120] Running command: RunVmCommand internal: true. Entities affected :  ID: 3d6f05b6-3f91-4979-b006-a045891cf6f8 Type: VMAction group RUN_VM with role type USER

But it fails, probably due to inability to access the lease (Ilan please attach the vdsm log from host_mixed_3):
2020-08-05 10:58:18,013+03 WARN  [org.ovirt.engine.core.bll.RunVmCommand] (EE-ManagedThreadFactory-engine-Thread-162671) [6ca36120] The VM lease storage domain 'iscsi_0' status is not active, VM 'test' will fail to run in case the storage domain isn't reachable

It looks like the engine is not prepared for a scenario in which a VM that was set to Unknown status on host X is detected as Down on host Y [1].
So we need to do the following:
If a VM that is set to Unknown on host X is detected as Down on host Y and the exit reason indicates it was not a proper shutdown, we should keep its run_on_vds pointing to host X and keep its Unknown status.

[1] The assumption was that if the VM ran on host X it will at least get to PoweringUp state on host Y but this is not necessarily true, especially with https://gerrit.ovirt.org/#/c/94904
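The rule proposed above can be sketched as follows. This is an illustrative Python model only, not the actual ovirt-engine code (which is Java); the attribute names, status strings, and exit-reason values are assumptions for the sketch:

```python
# Hypothetical sketch of the state-handling rule from comment 1.
# 'status', 'run_on_vds', and the exit-reason values are illustrative
# names, not the engine's real API.

PROPER_SHUTDOWN = "Success"  # assumed exit-reason for a clean shutdown


def handle_vm_down_report(vm, reporting_host, exit_reason):
    """Decide the VM's recorded state when a Down report arrives."""
    if (vm.status == "Unknown"
            and reporting_host != vm.run_on_vds
            and exit_reason != PROPER_SHUTDOWN):
        # Down reported by a host the VM was never confirmed to run on,
        # and not a proper shutdown: keep run_on_vds pointing at the
        # original (non-responsive) host and keep the Unknown status.
        return vm

    # Otherwise accept the report: the VM is genuinely down.
    vm.status = "Down"
    vm.run_on_vds = None
    return vm
```

Under this rule, the failed restart attempt on host_mixed_3 would not clear the VM's association with the non-responsive SPM, which matches the behavior Arik confirms in comment 14.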

Comment 2 Ilan Zuckerman 2020-08-18 08:59:51 UTC
Sorry, the env which was used for this issue has already been re-built.
I don't have any additional logs.
I believe you can fairly easily reproduce this on any other HE env.

Comment 3 Arik 2020-08-18 09:58:52 UTC
Please reproduce it and attach the vdsm log on the host that the VM fails to run on.

Comment 4 Ilan Zuckerman 2020-08-19 06:23:29 UTC
(In reply to Arik from comment #3)
> Please reproduce it and attach the vdsm log on the host that the VM fails to
> run on.

Reproduced the issue again with the same setup:

hosted-engine-08 : Engine

caracal01  10.46.30.1: set as SPM. no vms
caracal02  10.46.30.2: SPM priority set to 'never'. no vms
caracal03  10.46.30.3: SPM priority set to 'never'. running the HE vm

Attaching vdsm logs for all of the hosts + engine log

I began the test at 09:06 Israeli time. Gathered logs at 09:20

Comment 5 Ilan Zuckerman 2020-08-19 06:23:59 UTC
Created attachment 1711802 [details]
Second reproduction logs

Comment 6 Arik 2020-08-19 07:44:21 UTC
Thanks Ilan.
So we fail to run the VM because of:
libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2020-08-19T06:10:19.734831Z qemu-kvm: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":
false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}: Failed to get "write" lock
Is another process using the image [/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:_nas01_ge__8__nfs__0/263bb298-10aa-4cef-877f-fa23aa303848/images/ee14d011-127f-4819-b368-a6b1d9136129/b3535dbf-fd95-4e6a-b17c-d597624f5d16]?

The rule you've set in the firewall - does it block the connection also to the NFS storage domain 263bb298-10aa-4cef-877f-fa23aa303848, or only to the iSCSI storage domain that the lease resides on?

Comment 7 Ilan Zuckerman 2020-08-19 07:54:24 UTC
(In reply to Arik from comment #6)
> Thanks Ilan.
> So we fail to run the VM because of:
> libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor:
> 2020-08-19T06:10:19.734831Z qemu-kvm: -blockdev
> {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-
> flush":
> false},"driver":"qcow2","file":"libvirt-1-storage","backing":null}: Failed
> to get "write" lock
> Is another process using the image
> [/rhev/data-center/mnt/mantis-nfs-lif2.lab.eng.tlv2.redhat.com:
> _nas01_ge__8__nfs__0/263bb298-10aa-4cef-877f-fa23aa303848/images/ee14d011-
> 127f-4819-b368-a6b1d9136129/b3535dbf-fd95-4e6a-b17c-d597624f5d16]?
> 
> The rule you've set in the firewall - does it block the connection also to
> the NFS storage domain 263bb298-10aa-4cef-877f-fa23aa303848 or only to the
> scsi storage domain that the lease resides in?

The NFS storage resides on another domain: mantis-nfs-lif2.lab.eng.XXX.com,
while I only blocked the iSCSI domain: 3par-iscsi-XXX.com.
The lease was created on iSCSI.

Comment 8 Arik 2020-08-19 08:10:09 UTC
Nir, the qemu process lost its connection to the lease, so the lease was released, but apparently the lock on a disk that resides in a different storage domain was not released.
Shouldn't the VM switch to the paused state in this scenario, which would also lead to releasing the disk's lock?

Comment 9 Nir Soffer 2020-08-19 09:09:18 UTC
(In reply to Arik from comment #8)
I'm not sure I understand what happened on the host since there is not enough
data here.

But according to your question, I assume that:
- VM was running with a lease on iSCSI storage, and disk on NFS storage
- VM holds a lease on iSCSI storage
- VM holds a lock on NFS storage
- Access to iSCSI storage was blocked
- VM paused because of I/O error
- Libvirt released the lease on iSCSI storage

I assume that the VM was paused on I/O error, because in the other case the VM would be killed by sanlock after the lease expired; in that case the qemu process would terminate and the lock on NFS would be released.

In this case the VM still holds a lock on the disk on NFS storage. That lock is not managed by libvirt, so libvirt cannot release it.

Peter, can you confirm this?

Comment 10 Peter Krempa 2020-08-19 11:07:07 UTC
Yes, the qemu locks behave slightly differently in such situations. Specifically, even a paused VM still holds qemu's write lock on the disk image (stored on NFS in the above case), since qemu may still have unwritten metadata in memory. The image will be unlocked only after the qemu process terminates.
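Comments 9 and 10 can be condensed into a toy model of the two locks involved: the sanlock lease on the lease storage domain versus qemu's write lock on the disk image. This is a minimal sketch with illustrative names, not vdsm or libvirt code:

```python
# Toy model of the lock lifecycle described in comments 9-10.
# All class and attribute names are illustrative.

class QemuProcess:
    def __init__(self):
        self.state = "running"
        self.holds_lease = True       # sanlock lease on the iSCSI lease SD
        self.holds_image_lock = True  # qemu "write" lock on the NFS image

    def lose_lease_storage(self):
        # I/O error on the lease SD: the VM pauses and the sanlock lease
        # is released, but qemu keeps the image lock because it may still
        # hold unwritten metadata in memory (comment 10).
        self.state = "paused"
        self.holds_lease = False

    def terminate(self):
        # Only termination of the qemu process releases the image lock,
        # after which a restart on another host can acquire it.
        self.state = "down"
        self.holds_image_lock = False
```

In this model, while the process is merely paused the image lock is still held, which reproduces the "Failed to get 'write' lock" error from comment 6 when another host tries to start the VM.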

Comment 11 Arik 2020-08-19 14:19:36 UTC
Nir, Peter - thanks!
So per comment 10, the expectation needs to change here: we shouldn't stop trying to restart the VM elsewhere, but it won't start until the VM goes down on the disconnected host.

Comment 12 Nir Soffer 2020-08-19 15:07:32 UTC
I assume we are using "Resume behavior: Kill", right?

In this case the VM will be killed if it is paused for too long, and after that, starting the VM on another host should succeed.

Comment 13 Arik 2020-08-19 16:40:01 UTC
Yep, the VM is configured with a VM lease so its resume behavior is set to 'kill'.
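The 'kill' resume behavior discussed in comments 12 and 13 reduces to a simple decision; the sketch below uses assumed parameter names, not real oVirt configuration keys, and no actual timeout value from the product:

```python
# Sketch of the 'kill' resume policy from comments 12-13: a VM paused
# longer than the configured timeout is killed, which releases qemu's
# image lock (comment 10) and allows a restart on another host.
# Parameter names are illustrative.

def restart_elsewhere_can_succeed(paused_seconds, kill_timeout_seconds):
    """Return True once the paused VM has been killed and the lock freed."""
    vm_killed = paused_seconds >= kill_timeout_seconds
    image_lock_held = not vm_killed
    return not image_lock_held
```

This matches the resolution in comments 11 and 14: the restart attempts themselves are correct, they simply cannot succeed until the paused qemu process on the disconnected host is killed.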

Comment 14 Arik 2020-11-10 18:24:30 UTC
OK, so the engine actually handles the scenario I described in comment 1 properly - by keeping the VM in status Unknown and with run_on_vds pointing to the non-responsive host.
The attached engine.log ends ~9 minutes after the failed restart attempt (the failure is explained in comment 10) - in my environment it takes longer for the next restart attempt to happen so I believe there were more restart attempts later on.
If that's not the case (i.e., there is no additional restart attempt of that VM), please reopen.

Comment 15 Sandro Bonazzola 2020-11-30 10:36:54 UTC
Considered not a bug in the latest release (4.4.3).