Bug 1647388 - [downstream clone - 4.2.8] Power on on already powered on host sets VMs as down and results in split-brain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ovirt-4.2.8
Target Release: ---
Assignee: Eli Mesika
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1641882
Blocks:
 
Reported: 2018-11-07 11:44 UTC by RHV bug bot
Modified: 2021-12-10 18:23 UTC
CC List: 7 users

Fixed In Version: ovirt-engine-4.2.8.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1641882
Environment:
Last Closed: 2019-01-22 12:44:51 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44334 0 None None None 2021-12-10 18:23:10 UTC
Red Hat Product Errata RHBA-2019:0121 0 None None None 2019-01-22 12:44:59 UTC
oVirt gerrit 95411 0 None MERGED core: fix a bug in manual fencing 2020-07-17 22:39:06 UTC

Description RHV bug bot 2018-11-07 11:44:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1641882 +++
======================================================================

Description of problem:

If the user attempts "Power Management -> Start" on a host that is Not Responding but still powered on, the engine sets all of that host's VMs as Down, allowing them to be started on another host, even though the VMs are still running on the Not Responding host.

1. Host set as not responsive

2018-10-23 15:29:19,141+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-167) [] EVENT_ID: VDS_FAILURE(12), Host host1 is non responsive.

2. User goes to the GUI and clicks Power Management -> Start

2018-10-23 15:29:31,798+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host host1 using Proxy Host host2 and Fence Agent virt:192.168.100.254.

3. Engine does status, sees the Host is ON

2018-10-23 15:29:31,860+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 450ef72e

4. It's already ON, but the engine sends a START anyway (??)

2018-10-23 15:29:31,981+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] START, FenceVdsVDSCommand(HostName = host2, FenceVdsVDSCommandParameters:{hostId='beaebbcc-a113-4531-971f-0c8616f73596', targetVdsId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', action='START', agent='FenceAgent:{id='a6e5ddaa-231b-4277-9669-f66dd3a938f5', hostId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', order='1', type='virt', ip='192.168.100.254', port='null', user='germano', password='***', encryptOptions='false', options='port=host1'}', policy='null'}), log id: 299a63c0

5. Yes, it really is ON ;)

2018-10-23 15:29:37,156+10 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] Host 'host1.rhvlab' status is 'ON'

6. And sets the VM running on that host as Down

2018-10-23 15:29:37,450+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm test was shut down due to host1 host reboot or manual fence

The host was never manually fenced; it was UP the whole time and the VMs are still running on it. The engine must not set the VMs as Down just because a power ON command succeeded; it should do that only when the host is confirmed OFF.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.6.4-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a host with Power Management enabled
2. Disable automatic fencing for the cluster
3. Run a VM on the host
4. Enable firewalld panic on the host
   # firewall-cmd --panic-on
5. Host goes Not Responding
6. Go to the GUI and click on Management -> Power Management -> Start
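
To put the host back into a usable state after reproducing, switch panic mode off again. A minimal cleanup sketch using standard firewalld commands (not part of the original report):
   # firewall-cmd --query-panic    # prints "yes" while panic mode is dropping all traffic
   # firewall-cmd --panic-off      # restore connectivity so the host can reconnect to the engine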

Actual results:
Split brain, VMs running twice, corrupted

Expected results:
Do not set the VMs as Down unless the host has actually powered off.

(Originally by Germano Veit Michel)

Comment 1 RHV bug bot 2018-11-07 11:44:19 UTC
Customer hit this on 4.1.6.

On newer RHEL versions, and depending on the storage, qemu-kvm's improved image locking saves the VM from running twice ("Failed to get "write" lock. Is another process using the image?"). A VM lease would do the same.
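
For reference, the same lock can be observed from the shell while the VM is running; a small sketch (the image path below is hypothetical):
   # qemu-img info /var/lib/libvirt/images/test.qcow2      # fails with the "write" lock error above while QEMU holds the image
   # qemu-img info -U /var/lib/libvirt/images/test.qcow2   # -U (--force-share) allows read-only inspection despite the lock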

However, the engine bug is still there and should be fixed.

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-11-07 11:44:31 UTC
Please attach the full log, it is mandatory for fully understanding the flow.

(Originally by Eli Mesika)

Comment 5 RHV bug bot 2018-11-07 11:44:35 UTC
(In reply to Eli Mesika from comment #2)
> Please attach the full log , it is mandatory for fully understanding the flow

I'm attaching the customer's logs (4.1.6) because mine (4.2.6 from comment #0) are gone as I redeployed. 

Look for this correlation ID - 20bb556f-05f5-43ad-80e0-514b14042b59

You will see a PM START on an already ON host, which triggered VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE on some VMs, which the customer lost due to corruption.
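
For anyone digging through the attached logs, the whole flow can be pulled out by grepping the engine log for that correlation ID, e.g.:
   # grep 20bb556f-05f5-43ad-80e0-514b14042b59 engine.log*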

This is easily reproducible on 4.2.6 too.

(Originally by Germano Veit Michel)

Comment 9 Petr Matyáš 2018-12-14 11:46:21 UTC
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch

Comment 11 errata-xmlrpc 2019-01-22 12:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121

