Bug 2000031 - SPM host is rebooted multiple times when engine recovers the host
Summary: SPM host is rebooted multiple times when engine recovers the host
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.7
Hardware: All
OS: Linux
medium
medium
Target Milestone: ovirt-4.5.0
: 4.5.0
Assignee: Eli Mesika
QA Contact: Ivana Saranova
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-01 08:51 UTC by nijin ashok
Modified: 2022-08-07 09:59 UTC (History)
9 users (show)

Fixed In Version: ovirt-engine-4.5.0
Doc Type: Bug Fix
Doc Text:
Previously, host non-reponding treatment coould be called multiple times simultaneously. In this release, multiple calls to non-reponding treatment are prevented, and the host comes up much faster.
Clone Of:
Environment:
Last Closed: 2022-05-26 16:23:07 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43344 0 None None None 2021-09-01 08:52:45 UTC
Red Hat Knowledge Base (Solution) 6863411 0 None None None 2022-03-30 13:48:18 UTC
Red Hat Product Errata RHSA-2022:4711 0 None None None 2022-05-26 16:23:16 UTC
oVirt gerrit 117089 0 master MERGED prevent nested calls to non-responding treatment 2022-01-25 15:59:36 UTC

Description nijin ashok 2021-09-01 08:51:26 UTC
Description of problem:

To mimic a hardware failure, we stopped the ovirtmgmt nic and the engine detect the failure and marked it as "not responding".

~~~
2021-09-01 13:25:14,037+05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163288) [] EVENT_ID: VDS_FAILURE(12), Host dell-r530-3 is non responsive.
2021-09-01 13:25:14,037+05 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163289) [] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3'
~~~

ssh soft fencing failed.

~~~
2021-09-01 13:25:14,151+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host dell-r530-3 using Proxy Host dell-r530-4. and Fence Agent ipmilan:10.65.177.172.
2021-09-01 13:25:14,152+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='null'}), log id: 3b9bdeeb
2021-09-01 13:25:14,432+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 3b9bdeeb
2021-09-01 13:25:14,436+05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] Opening SSH Soft Fencing session on host 'dell-r530-3.'

2021-09-01 13:25:17,060+05 ERROR [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [2e038587] SSH Soft Fencing command failed on host 'dell-': Failed to get the session.
Stdout:
Stderr:
~~~

Power management stop.

~~~
2021-09-01 13:25:17,134+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated.

2021-09-01 13:25:23,569+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5f31a9d3
2021-09-01 13:25:40,658+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='UNKNOWN', message=''}, log id: 5f31a9d3
~~~

Confirmed that the server is off.

~~~
2021-09-01 13:25:45,679+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 660cdf34
2021-09-01 13:25:45,870+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='OFF', message=''}, log id: 660cdf34
2021-09-01 13:25:45,870+05 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Host 'dell-r530-3' status is 'OFF'
2021-09-01 13:25:45,904+05 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [6f0f6524] Power-Management: STOP host 'dell-r530-3' succeeded.
~~~~

Power management start.

~~~
2021-09-01 13:25:46,365+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163289) [] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='START', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 74d6da49
~~~

However, in between at **13:25:23**, the status of the host was changed to `Reboot`.

~~~
2021-09-01 13:25:23,273+05 INFO  [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SetVdsStatusVDSCommand(HostName = dell-r530-3, SetVdsStatusVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', status='Reboot', nonOperationalReason='NONE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 2cc8642f
~~~

And it looks like this initiated `spmstop` and `VdsNotRespondingTreatment` again.

~~~
2021-09-01 13:25:23,289+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, SpmStopVDSCommand(HostName = dell-r530-3, SpmStopVDSCommandParameters:{hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', storagePoolId='71cf243c-edec-11eb-aa8c-002b6a01557c'}), log id: a28bc28

2021-09-01 13:25:26,275+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] SpmStopVDSCommand::Stopping SPM on vds 'dell-r530-3', pool id '71cf243c-edec-11eb-aa8c-002b6a01557c'

2021-09-01 13:25:26,296+05 INFO  [org.ovirt.engine.core.bll.VdsEventListener] (EE-ManagedThreadFactory-engine-Thread-163296) [32289749] ResourceManager::vdsNotResponding entered for Host '52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', 'dell-r530-3'
~~~

Then it started another thread for fencing.

~~~
2021-09-01 13:25:29,381+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STATUS', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b310d2e

2021-09-01 13:25:29,389+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] EVENT_ID: FENCE_OPERATION_STARTED(9,017), Power management stop of Host dell-r530-3 initiated.

2021-09-01 13:25:29,583+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: 6b310d2e

2021-09-01 13:25:29,604+05 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management stop on Host dell-r530 using Proxy Host dell-r530-4 and Fence Agent ipmilan:10.65.177.172.

2021-09-01 13:25:29,604+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 594b7647

2021-09-01 13:25:40,945+05 INFO  [org.ovirt.engine.core.bll.pm.SingleAgentFenceActionExecutor] (EE-ManagedThreadFactory-engine-Thread-163294) [32289749] Waiting for host 'dell-r530-3' to reach status 'OFF'
~~~

And then multiple threads were created to fence the host.

~~~
2021-09-01 13:25:35,866+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163296) [5a82ec0a] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 54a60ebc

2021-09-01 13:25:42,062+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163300) [7f716505] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 6b01d4b2

2021-09-01 13:25:45,365+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163304) [2cef138c] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 31522f1b


2021-09-01 13:25:47,988+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-163302) [5355bf98] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', action='STOP', agent='FenceAgent:{id='1002e99a-5b12-4ad0-a586-ed46af7b27b6', hostId='52dbdd4a-5a1f-4a8f-af0f-393d75ef7baf', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='root', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 5830ba91
~~~

And hence there were multiple "stop" and "start" attempts to recover the host.


Version-Release number of selected component (if applicable):

rhvm-4.4.7.6-0.11.el8ev.noarch

How reproducible:

100%

Steps to Reproduce:
1. Disconnect the network of a current SPM host.
2. Check the logs or watch the console of the host. The host will be stopped and started multiple times.

This only happens when the hosts is SPM host and works fine with a non SPM host.

Actual results:

SPM host is rebooted multiple times when engine recovers the host.

Expected results:


Additional info:

Comment 2 Arik 2021-09-01 09:10:59 UTC
Not sure if this bz should move to Infra or Storage - seems to me more related to host lifecycle than to serving as SPM

Comment 9 Ivana Saranova 2022-05-12 13:26:18 UTC
Steps to Reproduce:
1. Disconnect the network of a current SPM host.
2. Check the logs or watch the console of the host.

Result:
The host was stopped and restarted only once. After switching the network back on, the host wasn't restarted.

Verified in:
ovirt-engine-4.5.0.6-0.7.el8ev.noarch

Comment 14 errata-xmlrpc 2022-05-26 16:23:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711

Comment 15 meital avital 2022-08-07 09:59:31 UTC
Due to QE capacity, we are not going to cover this issue in our automation


Note You need to log in before you can comment on or make changes to this bug.