Bug 1647388 - [downstream clone - 4.2.8] Power on on already powered on host sets VMs as down and results in split-brain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.2.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ovirt-4.2.8
Target Release: ---
Assignee: Eli Mesika
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1641882
Blocks:
 
Reported: 2018-11-07 11:44 UTC by RHV bug bot
Modified: 2021-12-10 18:23 UTC
CC List: 7 users

Fixed In Version: ovirt-engine-4.2.8.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1641882
Environment:
Last Closed: 2019-01-22 12:44:51 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-44334 0 None None None 2021-12-10 18:23:10 UTC
Red Hat Product Errata RHBA-2019:0121 0 None None None 2019-01-22 12:44:59 UTC
oVirt gerrit 95411 0 None MERGED core: fix a bug in manual fencing 2020-07-17 22:39:06 UTC

Description RHV bug bot 2018-11-07 11:44:13 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1641882 +++
======================================================================

Description of problem:

If the user attempts "Power Management -> Start" on a host that is Not Responding but still powered on, the engine sets all of that host's VMs as Down, allowing them to be started on another host, even though the VMs are still running on the Not Responding host.

1. Host set as not responsive

2018-10-23 15:29:19,141+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-167) [] EVENT_ID: VDS_FAILURE(12), Host host1 is non responsive.

2. User goes to the GUI and clicks Power Management -> Start

2018-10-23 15:29:31,798+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host host1 using Proxy Host host2 and Fence Agent virt:192.168.100.254.

3. Engine does status, sees the Host is ON

2018-10-23 15:29:31,860+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 450ef72e

4. It's already ON, but the engine sends a START anyway (??)

2018-10-23 15:29:31,981+10 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] START, FenceVdsVDSCommand(HostName = host2, FenceVdsVDSCommandParameters:{hostId='beaebbcc-a113-4531-971f-0c8616f73596', targetVdsId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', action='START', agent='FenceAgent:{id='a6e5ddaa-231b-4277-9669-f66dd3a938f5', hostId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', order='1', type='virt', ip='192.168.100.254', port='null', user='germano', password='***', encryptOptions='false', options='port=host1'}', policy='null'}), log id: 299a63c0

5. Yes, it really is ON ;)

2018-10-23 15:29:37,156+10 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] Host 'host1.rhvlab' status is 'ON'

6. And sets the VM running on that host as Down

2018-10-23 15:29:37,450+10 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm test was shut down due to host1 host reboot or manual fence

The host was never manually fenced; it was UP the whole time and the VMs are still running on it. The engine must not set the VMs as Down just because a power ON command succeeded; it should do that only when the host is confirmed OFF.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.6.4-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a host with Power Management enabled
2. Disable automatic fencing for the cluster
3. Run a VM on the host
4. Enable firewalld panic on the host
   # firewall-cmd --panic-on
5. Host goes Not Responding
6. Go to the GUI and click on Management -> Power Management -> Start
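
To put the host back into a usable state after reproducing, switch panic mode off again. A minimal cleanup sketch using standard firewalld commands (not part of the original report):
   # firewall-cmd --query-panic    # prints "yes" while panic mode is dropping all traffic
   # firewall-cmd --panic-off      # restore connectivity so the host can reconnect to the engine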

Actual results:
Split brain, VMs running twice, corrupted

Expected results:
Do not set the VMs as Down unless the host has actually powered off.

(Originally by Germano Veit Michel)

Comment 1 RHV bug bot 2018-11-07 11:44:19 UTC
Customer hit this on 4.1.6.

On newer RHEL versions, and depending on the storage, qemu-kvm's improved image locking saves the VM from running twice ("Failed to get "write" lock. Is another process using the image?"). A VM lease would do the same.
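
For reference, the same lock can be observed from the shell while the VM is running; a small sketch (the image path below is hypothetical):
   # qemu-img info /var/lib/libvirt/images/test.qcow2      # fails with the "write" lock error above while QEMU holds the image
   # qemu-img info -U /var/lib/libvirt/images/test.qcow2   # -U (--force-share) allows read-only inspection despite the lock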

However, the engine bug is still there and should be fixed.

(Originally by Germano Veit Michel)

Comment 4 RHV bug bot 2018-11-07 11:44:31 UTC
Please attach the full log, it is mandatory for fully understanding the flow.

(Originally by Eli Mesika)

Comment 5 RHV bug bot 2018-11-07 11:44:35 UTC
(In reply to Eli Mesika from comment #2)
> Please attach the full log , it is mandatory for fully understanding the flow

I'm attaching the customer's logs (4.1.6) because mine (4.2.6 from comment #0) are gone as I redeployed. 

Look for this correlation ID - 20bb556f-05f5-43ad-80e0-514b14042b59

You will see a PM START on an already ON host, which triggered VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE on some VMs, which the customer lost due to corruption.
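
For anyone digging through the attached logs, the whole flow can be pulled out by grepping the engine log for that correlation ID, e.g.:
   # grep 20bb556f-05f5-43ad-80e0-514b14042b59 engine.log*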

This is easily reproducible on 4.2.6 too.

(Originally by Germano Veit Michel)

Comment 9 Petr Matyáš 2018-12-14 11:46:21 UTC
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch

Comment 11 errata-xmlrpc 2019-01-22 12:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0121

