Bug 2000031

Summary: SPM host is rebooted multiple times when engine recovers the host
Product: Red Hat Enterprise Virtualization Manager Reporter: nijin ashok <nashok>
Component: ovirt-engineAssignee: Eli Mesika <emesika>
Status: CLOSED ERRATA QA Contact: Ivana Saranova <isaranov>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.4.7CC: ahadas, emarcus, gdeolive, lsurette, mavital, mperina, nsurati, srevivo, ycui
Target Milestone: ovirt-4.5.0Keywords: ZStream
Target Release: 4.5.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.5.0 Doc Type: Bug Fix
Doc Text:
Previously, host non-reponding treatment coould be called multiple times simultaneously. In this release, multiple calls to non-reponding treatment are prevented, and the host comes up much faster.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-05-26 16:23:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description nijin ashok 2021-09-01 08:51:26 UTC
Description of problem:

To mimic a hardware failure, we stopped the ovirtmgmt nic and the engine detect the failure and marked it as "not responding".

~~~
2021-09-01 13:25:14,037+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163288) [] EVENT_ID: VDS_FAILURE(12), Host dell-r530-3 is non responsive.
2021-09-01 13:25:14,037+05 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163289) [] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3'
~~~

ssh soft fencing failed.

~~~
2021-09-01 13:25:14,151+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host dell-r530-3 using Proxy Host dell-r530-4. and Fence Agent ipmilan:10.65.177.172.
2021-09-01 13:25:14,152+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='null'}), log id: 3b9bdeeb
2021-09-01 13:25:14,432+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 3b9bdeeb
2021-09-01 13:25:14,436+05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] Opening SSH Soft Fencing session on host 'dell-r530-3.'

2021-09-01 13:25:17,060+05 ERROR [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] SSH Soft Fencing command failed on host 'dell-': Failed to get the session.
Stdout:
Stderr:
~~~

Power management stop.

~~~
2021-09-01 13:25:17,134+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated.

2021-09-01 13:25:23,569+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5f31a9d3
2021-09-01 13:25:40,658+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='UNKNOWN', message=''}, log id: 5f31a9d3
~~~

Confirmed that the server is off.

~~~
2021-09-01 13:25:45,679+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 660cdf34
2021-09-01 13:25:45,870+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='OFF', message=''}, log id: 660cdf34
2021-09-01 13:25:45,870+05 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Host 'dell-r530-3' status is 'OFF'
2021-09-01 13:25:45,904+05 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Power-Management: STOP host 'dell-r530-3' succeeded.
~~~~

Power management start.

~~~
2021-09-01 13:25:46,365+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='START', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 74d6da49
~~~

However, in between at **13:25:23**, the status of the host was changed to `Reboot`.

~~~
2021-09-01 13:25:23,273+05 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SetVdsStatusVDSCommand(HostName = dell-r530-3, SetVdsStatusVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', status='Reboot', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2cc8642f
~~~

And it looks like this initiated `spmstop` and `VdsNotRespondingTreatment` again.

~~~
2021-09-01 13:25:23,289+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SpmStopVDSCommand(HostName = dell-r530-3, SpmStopVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c'}), log id: a28bc28

2021-09-01 13:25:26,275+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] SpmStopVDSCommand::Stopping SPM on vds 'dell-r530-3', pool id '71cf243c-edec-11eb-aa8c-002b6a01557c'

2021-09-01 13:25:26,296+05 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163296) [32289749] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3'
~~~

Then it started another thread for fencing.

~~~
2021-09-01 13:25:29,381+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b310d2e

2021-09-01 13:25:29,389+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated.

2021-09-01 13:25:29,583+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 6b310d2e

2021-09-01 13:25:29,604+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management stop on Host dell-r530 using Proxy Host dell-r530-4 and Fence Agent ipmilan:10.65.177.172.

2021-09-01 13:25:29,604+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 594b7647

2021-09-01 13:25:40,945+05 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] Waiting for host 'dell-r530-3' to reach status 'OFF'
~~~

And then multiple threads were created to fence the host.

~~~
2021-09-01 13:25:35,866+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 54a60ebc

2021-09-01 13:25:42,062+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163300) [7f716505] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b01d4b2

2021-09-01 13:25:45,365+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163304) [2cef138c] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 31522f1b


2021-09-01 13:25:47,988+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163302) [5355bf98] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5830ba91
~~~

And hence there were multiple "stop" and "start" attempts to recover the host.


Version-Release number of selected component (if applicable):

rhvm-4.4.7.6-0.11.el8ev.noarch

How reproducible:

100%

Steps to Reproduce:
1. Disconnect the network of a current SPM host.
2. Check the logs or watch the console of the host. The host will be stopped and started multiple times.

This only happens when the hosts is SPM host and works fine with a non SPM host.

Actual results:

SPM host is rebooted multiple times when engine recovers the host.

Expected results:


Additional info:

Comment 2 Arik 2021-09-01 09:10:59 UTC
Not sure if this bz should move to Infra or Storage - seems to me more related to host lifecycle than to serving as SPM

Comment 9 Ivana Saranova 2022-05-12 13:26:18 UTC
Steps to Reproduce:
1. Disconnect the network of a current SPM host.
2. Check the logs or watch the console of the host.

Result:
The host was stopped and restarted only once. After switching the network back on, the host wasn't restarted.

Verified in:
ovirt-engine-4.5.0.6-0.7.el8ev.noarch

Comment 14 errata-xmlrpc 2022-05-26 16:23:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711

Comment 15 meital avital 2022-08-07 09:59:31 UTC
Due to QE capacity, we are not going to cover this issue in our automation