Bug 2007286

Summary: Host is never fenced after a soft fence attempt
Product: Red Hat Enterprise Virtualization Manager Reporter: Juan Orti <jortialc>
Component: ovirt-engineAssignee: Eli Mesika <emesika>
Status: CLOSED ERRATA QA Contact: Pavol Brilla <pbrilla>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.4.6CC: apinnick, dfodor, emarcus, gdeolive, imomin, mavital, mperina
Target Milestone: ovirt-4.4.10Keywords: ZStream
Target Release: 4.4.10   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: ovirt-engine-4.4.10.1 Doc Type: Release Note
Doc Text:
Previously, a non-responding host was first soft-fenced by the Engine, but this did not fix the connectivity issue. The engine did not initiate a hard fence and the host was left in an non-responding state. In this release, soft fencing has been fixed so that if the soft fencing does not make the host responsive again, then the non-responding host treatment process continues correctly with the additional steps.
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-08 10:04:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Infra RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Juan Orti 2021-09-23 14:02:18 UTC
Description of problem:
A non responsive host is first soft-fenced by the engine, but it doesn't fix the connectivity issue. The engine doesn't initiate a hard fence and the host is left in non responsive status

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.6.8-0.1.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure two hosts in a cluster with power management enabled
2. Make one host non responsive, for example blocking vdsm in the firewall:

# iptables -I INPUT 1 -p tcp -m tcp --dport 54321 -j DROP

Actual results:
The host is soft-fenced and vdsmd restarts in the host, but this doesn't fix the connectivity problem and it remains in non responsive status.

The host is never hard fenced and according to the documentation, it should be:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/administration_guide/sect-host_resilience#Soft-Fencing_Hosts

Expected results:
The fence agent should reboot the host after a grace timeout after the soft fence.

Additional info:

engine.log:

## Soft fence initiated:

2021-09-23 12:08:29,960+02 DEBUG [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action VdsNotRespondingTreatment.
2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Before acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Success acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,963+02 INFO  [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,965+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms
2021-09-23 12:08:29,998+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [b5c49008-cb42-4d3d-a6c9-e17ee845ef11], timeElapsed: 31ms
2021-09-23 12:08:30,055+02 INFO  [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: VdsNotRespondingTreatmentCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected :  ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS
2021-09-23 12:08:30,057+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms
2021-09-23 12:08:30,061+02 DEBUG [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action SshSoftFencing.
2021-09-23 12:08:30,062+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: SshSoftFencingCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected :  ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS
2021-09-23 12:08:30,063+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 1ms
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getAll, params: [], timeElapsed: 28ms
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-02.laptop.lab'
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' was excluded intentionally: false
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' matches proxy source 'CLUSTER': true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' has compatible fence agents: true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' supports fencing policy 'null': true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' with status 'Up' is unreachable: false
2021-09-23 12:08:30,112+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing fence action 'STATUS', proxy='rhvh44-02.laptop.lab', target='rhvh44-01.laptop.lab', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'
2021-09-23 12:08:30,139+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host rhvh44-01.laptop.lab using Proxy Host rhvh44-02.laptop.lab and Fence Agent ipmilan:192.168.100.1.
2021-09-23 12:08:30,139+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] START, FenceVdsVDSCommand(HostName = rhvh44-02.laptop.lab, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}), log id: d795f56
2021-09-23 12:08:30,247+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: d795f56
2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: runVdsCommand, params: [FenceVds, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}], timeElapsed: 108ms
2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Result of 'FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}' fence action: {}
2021-09-23 12:08:30,248+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Opening SSH Soft Fencing session on host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,258+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connecting 'root.lab'
2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connected: 'root.lab'
2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticating: 'root.lab'
2021-09-23 12:08:30,461+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticated: 'root.lab'
2021-09-23 12:08:30,461+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing SSH Soft Fencing command on host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,462+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing: '/usr/bin/vdsm-tool service-restart vdsmd'
2021-09-23 12:08:50,108+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executed: '/usr/bin/vdsm-tool service-restart vdsmd'
2021-09-23 12:08:50,112+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getVdsManager, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 0ms

## Hours later I restart ovirt-engine, and we can see that the fence task was still running:

2021-09-23 14:50:22,037+02 WARN  [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Interrupted: sleep interrupted
2021-09-23 14:50:22,037+02 DEBUG [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Exception: java.lang.InterruptedException: sleep interrupted

Comment 5 Pavol Brilla 2022-01-25 08:16:54 UTC
After unsuccessful soft-fence, server was successfully hard-fenced.

Verified on 4.4.10.4-0.1.el8ev

Comment 9 errata-xmlrpc 2022-02-08 10:04:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0461

Comment 10 meital avital 2022-08-07 11:31:45 UTC
Due to QE capacity, we are not going to cover this issue in our automation