Description of problem: A non responsive host is first soft-fenced by the engine, but it doesn't fix the connectivity issue. The engine doesn't initiate a hard fence and the host is left in non responsive status Version-Release number of selected component (if applicable): ovirt-engine-4.4.6.8-0.1.el8ev.noarch How reproducible: Always Steps to Reproduce: 1. Configure two hosts in a cluster with power management enabled 2. Make one host non responsive, for example blocking vdsm in the firewall: # iptables -I INPUT 1 -p tcp -m tcp --dport 54321 -j DROP Actual results: The host is soft-fenced and vdsmd restarts in the host, but this doesn't fix the connectivity problem and it remains in non responsive status. The host is never hard fenced and according to the documentation, it should be: https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/administration_guide/sect-host_resilience#Soft-Fencing_Hosts Expected results: The fence agent should reboot the host after a grace timeout after the soft fence. Additional info: engine.log: ## Soft fence initiated: 2021-09-23 12:08:29,960+02 DEBUG [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action VdsNotRespondingTreatment. 2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Before acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}' 2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Success acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}' 2021-09-23 12:08:29,963+02 INFO [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}' 2021-09-23 12:08:29,965+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms 2021-09-23 12:08:29,998+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [b5c49008-cb42-4d3d-a6c9-e17ee845ef11], timeElapsed: 31ms 2021-09-23 12:08:30,055+02 INFO [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: VdsNotRespondingTreatmentCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected : ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS 2021-09-23 12:08:30,057+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms 2021-09-23 12:08:30,061+02 DEBUG [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action SshSoftFencing. 2021-09-23 12:08:30,062+02 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: SshSoftFencingCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected : ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS 2021-09-23 12:08:30,063+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 1ms 2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getAll, params: [], timeElapsed: 28ms 2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-01.laptop.lab' 2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-02.laptop.lab' 2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' was excluded intentionally: false 2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' matches proxy source 'CLUSTER': true 2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' has compatible fence agents: true 2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' supports fencing policy 'null': true 2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' with status 'Up' is unreachable: false 2021-09-23 12:08:30,112+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing fence action 'STATUS', proxy='rhvh44-02.laptop.lab', target='rhvh44-01.laptop.lab', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null' 2021-09-23 12:08:30,139+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host rhvh44-01.laptop.lab using Proxy Host rhvh44-02.laptop.lab and Fence Agent ipmilan:192.168.100.1. 2021-09-23 12:08:30,139+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] START, FenceVdsVDSCommand(HostName = rhvh44-02.laptop.lab, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}), log id: d795f56 2021-09-23 12:08:30,247+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: d795f56 2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: runVdsCommand, params: [FenceVds, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}], timeElapsed: 108ms 2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Result of 'FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}' fence action: {} 2021-09-23 12:08:30,248+02 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Opening SSH Soft Fencing session on host 'rhvh44-01.laptop.lab' 2021-09-23 12:08:30,258+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connecting 'root.lab' 2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connected: 'root.lab' 2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticating: 'root.lab' 2021-09-23 12:08:30,461+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticated: 'root.lab' 2021-09-23 12:08:30,461+02 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing SSH Soft Fencing command on host 'rhvh44-01.laptop.lab' 2021-09-23 12:08:30,462+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing: '/usr/bin/vdsm-tool service-restart vdsmd' 2021-09-23 12:08:50,108+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executed: '/usr/bin/vdsm-tool service-restart vdsmd' 2021-09-23 12:08:50,112+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getVdsManager, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 0ms ## Hours later I restart ovirt-engine, and we can see that the fence task was still running: 2021-09-23 14:50:22,037+02 WARN [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Interrupted: sleep interrupted 2021-09-23 14:50:22,037+02 DEBUG [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Exception: java.lang.InterruptedException: sleep interrupted
After unsuccessful soft-fence, server was successfully hard-fenced. Verified on 4.4.10.4-0.1.el8ev
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0461
Due to QE capacity, we are not going to cover this issue in our automation