Description of problem: To mimic a hardware failure, we stopped the ovirtmgmt nic and the engine detect the failure and marked it as "not responding". ~~~ 2021-09-01 13:25:14,037+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163288) [] EVENT_ID: VDS_FAILURE(12), Host dell-r530-3 is non responsive. 2021-09-01 13:25:14,037+05 INFO [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163289) [] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3' ~~~ ssh soft fencing failed. ~~~ 2021-09-01 13:25:14,151+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host dell-r530-3 using Proxy Host dell-r530-4. and Fence Agent ipmilan:10.65.177.172. 2021-09-01 13:25:14,152+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='null'}), log id: 3b9bdeeb 2021-09-01 13:25:14,432+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 3b9bdeeb 2021-09-01 13:25:14,436+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] Opening SSH Soft Fencing session on host 'dell-r530-3.' 2021-09-01 13:25:17,060+05 ERROR [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] SSH Soft Fencing command failed on host 'dell-': Failed to get the session. Stdout: Stderr: ~~~ Power management stop. ~~~ 2021-09-01 13:25:17,134+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated. 2021-09-01 13:25:23,569+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5f31a9d3 2021-09-01 13:25:40,658+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='UNKNOWN', message=''}, log id: 5f31a9d3 ~~~ Confirmed that the server is off. ~~~ 2021-09-01 13:25:45,679+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 660cdf34 2021-09-01 13:25:45,870+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='OFF', message=''}, log id: 660cdf34 2021-09-01 13:25:45,870+05 INFO [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Host 'dell-r530-3' status is 'OFF' 2021-09-01 13:25:45,904+05 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Power-Management: STOP host 'dell-r530-3' succeeded. ~~~~ Power management start. ~~~ 2021-09-01 13:25:46,365+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='START', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 74d6da49 ~~~ However, in between at **13:25:23**, the status of the host was changed to `Reboot`. ~~~ 2021-09-01 13:25:23,273+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SetVdsStatusVDSCommand(HostName = dell-r530-3, SetVdsStatusVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', status='Reboot', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2cc8642f ~~~ And it looks like this initiated `spmstop` and `VdsNotRespondingTreatment` again. ~~~ 2021-09-01 13:25:23,289+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SpmStopVDSCommand(HostName = dell-r530-3, SpmStopVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c'}), log id: a28bc28 2021-09-01 13:25:26,275+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] SpmStopVDSCommand::Stopping SPM on vds 'dell-r530-3', pool id '71cf243c-edec-11eb-aa8c-002b6a01557c' 2021-09-01 13:25:26,296+05 INFO [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163296) [32289749] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3' ~~~ Then it started another thread for fencing. ~~~ 2021-09-01 13:25:29,381+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b310d2e 2021-09-01 13:25:29,389+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated. 2021-09-01 13:25:29,583+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 6b310d2e 2021-09-01 13:25:29,604+05 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management stop on Host dell-r530 using Proxy Host dell-r530-4 and Fence Agent ipmilan:10.65.177.172. 2021-09-01 13:25:29,604+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 594b7647 2021-09-01 13:25:40,945+05 INFO [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] Waiting for host 'dell-r530-3' to reach status 'OFF' ~~~ And then multiple threads were created to fence the host. ~~~ 2021-09-01 13:25:35,866+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 54a60ebc 2021-09-01 13:25:42,062+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163300) [7f716505] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b01d4b2 2021-09-01 13:25:45,365+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163304) [2cef138c] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 31522f1b 2021-09-01 13:25:47,988+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163302) [5355bf98] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5830ba91 ~~~ And hence there were multiple "stop" and "start" attempts to recover the host. Version-Release number of selected component (if applicable): rhvm-4.4.7.6-0.11.el8ev.noarch How reproducible: 100% Steps to Reproduce: 1. Disconnect the network of a current SPM host. 2. Check the logs or watch the console of the host. The host will be stopped and started multiple times. This only happens when the hosts is SPM host and works fine with a non SPM host. Actual results: SPM host is rebooted multiple times when engine recovers the host. Expected results: Additional info:
Not sure if this bz should move to Infra or Storage - seems to me more related to host lifecycle than to serving as SPM
Steps to Reproduce: 1. Disconnect the network of a current SPM host. 2. Check the logs or watch the console of the host. Result: The host was stopped and restarted only once. After switching the network back on, the host wasn't restarted. Verified in: ovirt-engine-4.5.0.6-0.7.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:4711
Due to QE capacity, we are not going to cover this issue in our automation