+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1641882 +++
======================================================================

Description of problem:

If the user attempts "Power Management -> Start" on a host that is Not Responding but still powered on, the engine sets all of its VMs as Down, allowing them to be started on another host. But the VMs are all still running on the Not Responding host.

1. Host is set as not responsive

2018-10-23 15:29:19,141+10 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-167) [] EVENT_ID: VDS_FAILURE(12), Host host1 is non responsive.

2. User goes to the GUI and clicks Power Management -> Start

2018-10-23 15:29:31,798+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host host1 using Proxy Host host2 and Fence Agent virt:192.168.100.254.

3. Engine runs a status check and sees the host is ON

2018-10-23 15:29:31,860+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 450ef72e

4. The host is already ON, but the engine sends a START anyway (??)

2018-10-23 15:29:31,981+10 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] START, FenceVdsVDSCommand(HostName = host2, FenceVdsVDSCommandParameters:{hostId='beaebbcc-a113-4531-971f-0c8616f73596', targetVdsId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', action='START', agent='FenceAgent:{id='a6e5ddaa-231b-4277-9669-f66dd3a938f5', hostId='b4cdc809-af98-4542-91ad-c9f6b8a2fedf', order='1', type='virt', ip='192.168.100.254', port='null', user='germano', password='***', encryptOptions='false', options='port=host1'}', policy='null'}), log id: 299a63c0

5. Yes, it really is ON ;)

2018-10-23 15:29:37,156+10 INFO [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] Host 'host1.rhvlab' status is 'ON'

6. And the engine sets the VM running on that host as Down

2018-10-23 15:29:37,450+10 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-173) [3fa1c8c4-1be2-46f6-a475-6b219baa15c2] EVENT_ID: VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE(143), Vm test was shut down due to host1 host reboot or manual fence

The host was never manually fenced; it was up the whole time, and the VMs are still running on it. The engine must not set the VMs as Down as a result of a power ON command succeeding; doing so is only valid for a power OFF.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.6.4-1.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Configure a host with Power Management enabled
2. Disable automatic fencing for the cluster
3. Run a VM on the host
4. Enable firewalld panic mode on the host: # firewall-cmd --panic-on
5. Host goes Not Responding
6. Go to the GUI and click Management -> Power Management -> Start

Actual results:
Split brain: the VMs run twice and their disks get corrupted.

Expected results:
Do not set the VMs as Down unless the host actually powered off.
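The expected behavior boils down to a guard condition: VMs may only be released when the fence operation was a power-off and the agent confirms the host is OFF. A minimal sketch of that guard, with hypothetical names (the real ovirt-engine fencing code in FenceVdsVDSCommand and the fence action executors is far more involved):

```java
// Hypothetical illustration only; these enums and the class name are
// NOT the actual ovirt-engine types.
enum FenceAction { START, STOP, STATUS }
enum PowerStatus { ON, OFF, UNKNOWN }

public class FenceGuard {
    /**
     * VMs on a fenced host may be marked Down only when the fence
     * operation was meant to power the host off AND the agent reports
     * the host as OFF. A successful START, or a host still reported
     * ON, must never release the VMs.
     */
    public static boolean mayMarkVmsDown(FenceAction action, PowerStatus status) {
        return action == FenceAction.STOP && status == PowerStatus.OFF;
    }

    public static void main(String[] args) {
        // The buggy flow from this report: START on an already-ON host.
        System.out.println(mayMarkVmsDown(FenceAction.START, PowerStatus.ON));
        // The only case where clearing the VMs is safe.
        System.out.println(mayMarkVmsDown(FenceAction.STOP, PowerStatus.OFF));
    }
}
```

Under this guard, the flow in steps 4-6 above (action START, reported status ON) would leave the VM state untouched instead of emitting VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE.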
(Originally by Germano Veit Michel)
The customer hit this on 4.1.6. On newer RHEL versions, and depending on the storage, qemu-kvm's improved image locking saves the VM from running twice ("Failed to get "write" lock. Is another process using the image?"). A VM lease would do the same. However, the engine bug is still there and should be fixed. (Originally by Germano Veit Michel)
Please attach the full log; it is mandatory for fully understanding the flow. (Originally by Eli Mesika)
(In reply to Eli Mesika from comment #2)
> Please attach the full log , it is mandatory for fully understanding the flow

I'm attaching the customer's logs (4.1.6), because mine (4.2.6, from comment #0) are gone since I redeployed. Look for this correlation ID: 20bb556f-05f5-43ad-80e0-514b14042b59

You will see a PM START on an already-ON host, which triggered VM_WAS_SET_DOWN_DUE_TO_HOST_REBOOT_OR_MANUAL_FENCE on some VMs, which the customer then lost to corruption. This is easily reproducible on 4.2.6 too. (Originally by Germano Veit Michel)
Verified on ovirt-engine-4.2.8.1-0.1.el7ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0121