Bug 1419967

Summary: HA VM is reported as restarted even if restart fails
Product: [oVirt] ovirt-engine
Reporter: Evgheni Dereveanchin <ederevea>
Component: BLL.Virt
Assignee: Shahar Havivi <shavivi>
Status: CLOSED CURRENTRELEASE
QA Contact: Nisim Simsolo <nsimsolo>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0.4
CC: bugs, ederevea, lveyde, michal.skrivanek, nsimsolo, shavivi, tjelinek
Target Milestone: ovirt-4.1.2
Target Release: 4.1.2
Flags: tjelinek: ovirt-4.1?
       tjelinek: planning_ack?
       rule-engine: devel_ack+
       rule-engine: testing_ack+
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-05-23 08:18:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: engine log snippet (flags: none)

Description Evgheni Dereveanchin 2017-02-07 14:15:45 UTC
Description of problem:
With the new HA VM leases, if the engine attempts to restart an HA VM and the restart fails because the sanlock lease is still held, no error is shown in the UI.

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.0.4-1.el7

How reproducible:
always

Steps to Reproduce:
1. install oVirt 4.1 (no fencing configured)
2. create a VM and press Edit
3. go to the "High Availability" tab, set it as "Highly Available" and select a "Target Storage Domain for VM lease"
4. start the VM
5. go to the hypervisor where it is running (host1 in this example) and block the VDSM and SSH ports (this imitates loss of the management network while the storage network stays up):
# iptables -I INPUT -p tcp -m tcp --dport 54321 -j DROP
# iptables -I INPUT -p tcp -m tcp --dport 22 -j DROP
6. monitor the Admin Portal (a sketch for confirming the state on host1 follows below)
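
A minimal sketch for confirming the state on host1 during the test; these are standard libvirt, sanlock and iptables commands rather than part of the original report, so adjust them to your environment:
# virsh -r list           # the guest should still be listed as running
# sanlock client status   # host1 should still hold the lease resource for the VM
Once the test is finished, the firewall rules from step 5 can be removed again:
# iptables -D INPUT -p tcp -m tcp --dport 54321 -j DROP
# iptables -D INPUT -p tcp -m tcp --dport 22 -j DROP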

Actual results:
 VM status changes to "unknown"
 Host status changes to "non responsive"
 There is a false report of successful VM restart:
  "VM vm1 was restarted on Host host2"

 In fact, the VM is still running on host1 and holding the storage lease, so the start attempt on host2 fails, but this is not reflected in the UI

Expected results:
 VM status changes to "unknown"
 Host status changes to "non responsive"
 There is a report of the failed VM restart:
  "Failed to restart VM vm1 on Host host2: VM running on different host"

Additional info:

This approach protects against possible split brain, yet it needs to become more transparent to end users so that they are aware that manual intervention is required to get their important HA VMs back up and running.

Such a scenario is typical when the storage network is separate from the management network (e.g. Fibre Channel), where loss of IP connectivity does not imply loss of storage.

Also, it would be great to be able to clear the locks from the UI somehow, to ensure the VM is killed on the problematic host so that it can be safely restarted on a healthy one.
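
For reference, a possible manual workaround today (my assumption, not something described in this report) is to forcibly stop the guest on the problematic host once that host is reachable again, which releases the sanlock lease so the VM can be started safely elsewhere; on a VDSM-managed host the read-write libvirt connection may prompt for SASL credentials:
# virsh -r list         # find the domain name on host1
# virsh destroy vm1     # forcibly stop the guest; its lease is released when the qemu process exits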

Comment 1 Tomas Jelinek 2017-02-08 08:43:20 UTC
(In reply to Evgheni Dereveanchin from comment #0)
> Description of problem:
> With new HA VM leases, if an HA VM is restarted but fails to do so due to
> sanlock lease there is no error shown in the UI
> 
> Version-Release number of selected component (if applicable):
> ovirt-engine-4.1.0.4-1.el7
> 
> How reproducible:
> always
> 
> Steps to Reproduce:
> 1. install oVirt 4.1 (no fencing)
> 2. create VM and press Edit
> 3. go to "High Availability" tab, set it as "Highly Available" and select a
> "Target Storage Domain for VM lease"
> 3. start VM
> 4. go to hypervisor where it is running (host1 in this example) and cut VDSM
> and SSH port (this imitates loss of the management network while the storage
> network is OK)
> # iptables -I INPUT  -p tcp -m tcp --dport 54321 -j DROP
> # iptables -I INPUT  -p tcp -m tcp --dport 22 -j DROP
> 5. monitor the Admin portal
> 
> Actual results:
>  VM status changes to "unknown"
>  Host status changes to "non responsive"
>  There is a false report of successful VM restart:
>   "VM vm1 was restarted on Host host2"

But after this message a follow-up message should be shown saying that running the VM failed. Is that not happening? If the follow-up does appear, then only rephrasing the first message to something like
"VM vm1 restart on Host host2 was initiated" should be enough to make it clear.

Would this be ok?

> 
>  In fact VM is still running on host1 and holding the storage lease, so
> start 
>  attempt on host2 fails but this is not reflected in the UI
> 
> Expected results:
>  VM status changes to "unknown"
>  Host status changes to "non responsive"
>  There is report of failed VM restart:
>   "Failed to restart VM vm1 on Host host2: VM running on different host"
> 
> Additional info:
> 
> This approach protects against possible split brain, yet it needs to become
> more transparent to end users so that they are aware that manual
> intervention is required to get their important HA VMs back up and running.
> 
> Such a scenario is typical when the storage network is separate from
> management (i.e. FibreChannel) and loss of IP connectivity does not assume
> loss of storage.
> 
> Also, it would be great to be able to somehow clear locks from the UI to
> ensure the VM is killed on problematic hosts so that it can be restarted on
> healthy ones safely.

This should not be needed. If the host is really problematic, then fencing it should solve the issue. Or did you have something else in mind?

Comment 2 Evgheni Dereveanchin 2017-02-14 12:33:45 UTC
No, there is no message shown about failure to start. I waited 15 minutes and restored VDSM connectivity to get a status update.

Full audit log of the outage test:

 2017-02-07 08:33:48.463-05 | VM vm1 started on Host ovirt-host2
 2017-02-07 08:34:56.808-05 | VDSM ovirt-host2 command GetStatsVDS failed: Heartbeat exceeded
 2017-02-07 08:34:56.828-05 | Host ovirt-host2 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
 2017-02-07 08:35:19.891-05 | VM vm1 was set to the Unknown status.
 2017-02-07 08:35:19.905-05 | Host ovirt-host2 is non responsive.
 2017-02-07 08:35:37.159-05 | Available memory of host ovirt-host1 [933 MB] is under defined threshold [1024 MB].
 2017-02-07 08:37:15.135-05 | VM vm1 was restarted on Host ovirt-host1
 2017-02-07 08:52:27.22-05  | VM vm1 status was restored to Up.
 2017-02-07 08:52:30.68-05  | Status of host ovirt-host2 was set to Up.

As you can see, there is just the "VM was restarted" message in the log and no follow-up message about the failed start.
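
For reference, an audit log like the one above can be pulled from the engine database; this is only a sketch (the "engine" database name and the audit_log table are the defaults, the exact query is mine):
# su - postgres
$ psql engine -c "select log_time, message from audit_log order by log_time;"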

Comment 3 Evgheni Dereveanchin 2017-02-14 12:36:54 UTC
Created attachment 1250216 [details]
engine log snippet

Attached an engine log snippet from around the outage test in case it helps.

Comment 4 Tomas Jelinek 2017-02-22 08:33:11 UTC
The fix will be to rephrase the message "VM vm1 was restarted on Host host2" to "The attempt to restart the VM vm1 on Host host2 was initiated".

Comment 5 Evgheni Dereveanchin 2017-02-23 15:56:22 UTC
Please also ensure the failure message is sent to the audit log. As noted above, in my case no such message was shown within 15 minutes, and that was the initial motivation for filing this bug.

Comment 6 Shahar Havivi 2017-02-27 09:00:33 UTC
(In reply to Evgheni Dereveanchin from comment #5)
> Please also ensure the failure message is sent to the audit log. As noted
> above, in my case no such message was shown within 15 minutes and this fact
> was the initial motivation to log this bugzilla.

The audit log message is wrong; we will change it to "Trying to restart ${VmName} on Host ${VdsName}", which is more accurate.

As for the failure message, that will require an infrastructure change, which is not suitable for a z-stream release.

Comment 7 Yaniv Kaul 2017-03-15 09:20:52 UTC
Any reason it's targeted to 4.1.3 and not 4.1.2?

Comment 8 Shahar Havivi 2017-03-15 09:25:23 UTC
(In reply to Yaniv Kaul from comment #7)
> Any reason it's targeted to 4.1.3 and not 4.1.2?

No reason...

Comment 9 Nisim Simsolo 2017-05-03 13:01:21 UTC
Verification build:
ovirt-engine-4.1.2-0.1.el7
libvirt-client-2.0.0-10.el7_3.5.x86_64
sanlock-3.4.0-1.el7.x86_64
qemu-kvm-rhev-2.6.0-28.el7_3.9.x86_64
vdsm-4.19.11-1.el7ev.x86_64

Verification scenario:
1. create a VM and press Edit
2. go to the "High Availability" tab, set it as "Highly Available" and select a "Target Storage Domain for VM lease"
3. start the VM
4. go to the hypervisor where it is running (host1 in this example) and block the VDSM and SSH ports (this imitates loss of the management network while the storage network stays up):
# iptables -I INPUT -p tcp -m tcp --dport 54321 -j DROP
# iptables -I INPUT -p tcp -m tcp --dport 22 -j DROP
5. monitor the Admin Portal, wait for the VM status to change to "unknown", and verify that the following audit log entry appears (an extra check is sketched below the quoted entry):
"May 3, 2017 3:45:52 PM
Trying to restart VM golden_env_mixed_virtio_0 on Host host_mixed_2
42151a9f
oVirt"