Bug 1647388

Summary: [downstream clone - 4.2.8] Power on on already powered on host sets VMs as down and results in split-brain
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine
Assignee: Eli Mesika <emesika>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: high
Priority: urgent
Docs Contact:
Version: 4.2.6
CC: audgiri, gveitmic, lleistne, mgoldboi, mperina, ratamir, Rhev-m-bugs
Target Milestone: ovirt-4.2.8
Target Release: ---
Keywords: ZStream
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.2.8.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1641882
Environment:
Last Closed: 2019-01-22 12:44:51 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1641882
Bug Blocks:

Description RHV bug bot 2018-11-07 11:44:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1641882 +++
======================================================================

Description of problem:

If the user runs "Power Management -> Start" on a host that is Not Responding but still powered on, the engine sets all of its VMs as Down, allowing them to be started on another host, even though the VMs are still running on the Not Responding host.

1. Host set as not responsive

2018-10-23 15:29:19,141+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-167) [] EVENT_ID: VDS_FAILURE(12), Host host1 is non responsive.

2. The user goes to the GUI and clicks Power Management -> Start

2018-10-23 15:29:31,798+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host host1 using Proxy Host host2 and Fence Agent virt:192.168.100.254.

3. The engine runs a power status check and sees the host is ON

2018-10-23 15:29:31,860+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 450ef72e

4. It is already ON, but the engine sends a START anyway (??)

2018-10-23 15:29:31,981+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] START, FenceVdsVDSCommand(HostName = host2, FenceVdsVDSCommandParameters:{hostId='beaebbcc-a113-4531-971f-0c8616f73596', targetVdsId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', action='START', agent='FenceAgent:{id='a6e5ddaa-231b-4277-9669-f66dd3a938f5', hostId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', order='1', type='virt', ip='192.168.100.254', port='null', user='germano', password='***', encryptOptions='false', options='port=host1'}', policy='null'}), log id: 299a63c0

5. Yes, it really is ON ;)

2018-10-23 15:29:37,156+10 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] Host 'host1.rhvlab' status is 'ON'

6. And sets the VM running on that host as down

2018-10-23 15:29:37,450+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm test was shut down due to host1 host reboot or manual fence

The host was never manually fenced; it was up the whole time and the VMs are still running on it. The engine must not set the VMs as down as a result of a power ON command succeeding; it should only do so when the host is confirmed to be OFF.
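
A minimal Python sketch of that rule, using hypothetical names (should_mark_vms_down and the two enums are illustrative only, not ovirt-engine code): VMs may be marked Down only after a STOP whose result confirms the host is OFF, never as a side effect of a successful START or of a status check that returns ON.

  from enum import Enum

  class FenceAction(Enum):
      START = "start"
      STOP = "stop"
      STATUS = "status"

  class PowerStatus(Enum):
      ON = "on"
      OFF = "off"
      UNKNOWN = "unknown"

  def should_mark_vms_down(action: FenceAction, confirmed_status: PowerStatus) -> bool:
      # Only clear the VMs when the fence flow has positively confirmed the
      # host is powered OFF; a successful START (or a status of ON) proves
      # the host, and therefore its VMs, are still running.
      return action == FenceAction.STOP and confirmed_status == PowerStatus.OFF

  # The buggy flow above marked VMs Down after a START that reported ON;
  # the guard rejects that case:
  assert should_mark_vms_down(FenceAction.START, PowerStatus.ON) is False
  assert should_mark_vms_down(FenceAction.STOP, PowerStatus.OFF) is True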

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.6.4-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a host with Power Management enabled
2. Disable automatic fencing for the cluster
3. Run a VM on the host
4. Enable firewalld panic on the host
   # firewall-cmd --panic-on
5. Host goes Not Responding
6. Go to the GUI and click Management -> Power Management -> Start (the same action can also be scripted; see the sketch below)
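
For step 6, a hedged sketch of triggering the same fence action through the Python SDK (ovirtsdk4) instead of the GUI; the engine URL, credentials, and host name are placeholders, and the SDK availability on the tested setup is an assumption, not taken from this bug:

  import ovirtsdk4 as sdk

  # Placeholder connection details; insecure=True is for a lab only,
  # use ca_file in production.
  connection = sdk.Connection(
      url='https://engine.example.com/ovirt-engine/api',
      username='admin@internal',
      password='password',
      insecure=True,
  )
  try:
      hosts_service = connection.system_service().hosts_service()
      host = hosts_service.list(search='name=host1')[0]   # the Not Responding host
      host_service = hosts_service.host_service(host.id)
      # Equivalent of clicking Power Management -> Start in the GUI:
      host_service.fence(fence_type='start')
  finally:
      connection.close()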

Actual results:
Split brain: the VMs run twice and their disks become corrupted

Expected results:
Do not set the VMs as down unless the host has actually powered off.

(Originally by Germano Veit Michel)

Comment 1 RHV bug bot 2018-11-07 11:44:19 UTC
Customer hit this on 4.1.6.

On newer RHEL versions, and depending on the storage, qemu-kvm's improved image locking saves the VM from running twice ("Failed to get "write" lock. Is another process using the image?"). A VM lease would do the same.

However, the engine bug is still there and should be fixed.

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-11-07 11:44:31 UTC
Please attach the full log; it is mandatory for fully understanding the flow.

(Originally by Eli Mesika)

Comment 5 RHV bug bot 2018-11-07 11:44:35 UTC
(In reply to Eli Mesika from comment #2)
> Please attach the full log , it is mandatory for fully understanding the flow

I'm attaching the customer's logs (4.1.6) because mine (4.2.6, from comment #0) are gone since I redeployed.

Look for this correlation ID - 20bb556f-05f5-43ad-80e0-514b14042b59

You will see a PM START on an already ON host, which triggered VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE on some VMs, which the customer lost due to corruption.

This is easily reproducible on 4.2.6 too.

(Originally by Germano Veit Michel)

Comment 9 Petr Matyáš 2018-12-14 11:46:21 UTC
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch

Comment 11 errata-xmlrpc 2019-01-22 12:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121