Bug 2007286 - Host is never fenced after a soft fence attempt
Summary: Host is never fenced after a soft fence attempt
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.4.6
Hardware: x86_64
OS: Linux
medium
medium
Target Milestone: ovirt-4.4.10
: 4.4.10
Assignee: Eli Mesika
QA Contact: Pavol Brilla
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-09-23 14:02 UTC by Juan Orti
Modified: 2022-08-07 11:31 UTC (History)
7 users (show)

Fixed In Version: ovirt-engine-4.4.10.1
Doc Type: Release Note
Doc Text:
Previously, a non-responding host was first soft-fenced by the Engine, but this did not fix the connectivity issue. The engine did not initiate a hard fence and the host was left in an non-responding state. In this release, soft fencing has been fixed so that if the soft fencing does not make the host responsive again, then the non-responding host treatment process continues correctly with the additional steps.
Clone Of:
Environment:
Last Closed: 2022-02-08 10:04:44 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43699 0 None None None 2021-09-23 14:05:51 UTC
Red Hat Knowledge Base (Solution) 6356542 0 None None None 2021-09-23 14:16:36 UTC
Red Hat Product Errata RHBA-2022:0461 0 None None None 2022-02-08 10:05:02 UTC
oVirt gerrit 117048 0 master MERGED fix endless loop in soft fencing 2021-11-08 14:53:18 UTC
oVirt gerrit 117594 0 ovirt-engine-4.4 ABANDONED fix endless loop in soft fencing 2021-11-12 07:13:52 UTC
oVirt gerrit 117604 0 ovirt-engine-4.4 MERGED fix endless loop in soft fencing 2021-11-15 16:38:47 UTC

Description Juan Orti 2021-09-23 14:02:18 UTC
Description of problem:
A non responsive host is first soft-fenced by the engine, but it doesn't fix the connectivity issue. The engine doesn't initiate a hard fence and the host is left in non responsive status

Version-Release number of selected component (if applicable):
ovirt-engine-4.4.6.8-0.1.el8ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Configure two hosts in a cluster with power management enabled
2. Make one host non responsive, for example blocking vdsm in the firewall:

# iptables -I INPUT 1 -p tcp -m tcp --dport 54321 -j DROP

Actual results:
The host is soft-fenced and vdsmd restarts in the host, but this doesn't fix the connectivity problem and it remains in non responsive status.

The host is never hard fenced and according to the documentation, it should be:

https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.4/html/administration_guide/sect-host_resilience#Soft-Fencing_Hosts

Expected results:
The fence agent should reboot the host after a grace timeout after the soft fence.

Additional info:

engine.log:

## Soft fence initiated:

2021-09-23 12:08:29,960+02 DEBUG [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action VdsNotRespondingTreatment.
2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Before acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,963+02 DEBUG [org.ovirt.engine.core.bll.lock.InMemoryLockManager] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Success acquiring lock 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,963+02 INFO  [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Lock Acquired to object 'EngineLock:{exclusiveLocks='[9f089443-550a-4974-a341-9f71dd6667f9=VDS_FENCE]', sharedLocks=''}'
2021-09-23 12:08:29,965+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms
2021-09-23 12:08:29,998+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [b5c49008-cb42-4d3d-a6c9-e17ee845ef11], timeElapsed: 31ms
2021-09-23 12:08:30,055+02 INFO  [org.ovirt.engine.core.bll.pm.VdsNotRespondingTreatmentCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: VdsNotRespondingTreatmentCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected :  ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS
2021-09-23 12:08:30,057+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 2ms
2021-09-23 12:08:30,061+02 DEBUG [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Permission check skipped for internal action SshSoftFencing.
2021-09-23 12:08:30,062+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Running command: SshSoftFencingCommand(KeepPolicyPMEnabled = false, ChangeHostToMaintenanceOnStart = false, FencingPolicy = null, VdsId = 9f089443-550a-4974-a341-9f71dd6667f9, RunSilent = false) internal: true. Entities affected :  ID: 9f089443-550a-4974-a341-9f71dd6667f9 Type: VDS
2021-09-23 12:08:30,063+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: get, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 1ms
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getAll, params: [], timeElapsed: 28ms
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Evaluating host 'rhvh44-02.laptop.lab'
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' was excluded intentionally: false
2021-09-23 12:08:30,101+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' matches proxy source 'CLUSTER': true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' has compatible fence agents: true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' supports fencing policy 'null': true
2021-09-23 12:08:30,109+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceProxyLocator] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Proxy candidate 'rhvh44-02.laptop.lab' with status 'Up' is unreachable: false
2021-09-23 12:08:30,112+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing fence action 'STATUS', proxy='rhvh44-02.laptop.lab', target='rhvh44-01.laptop.lab', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'
2021-09-23 12:08:30,139+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] EVENT_ID: FENCE_OPERATION_USING_AGENT_AND_PROXY_STARTED(9,020), Executing power management status on Host rhvh44-01.laptop.lab using Proxy Host rhvh44-02.laptop.lab and Fence Agent ipmilan:192.168.100.1.
2021-09-23 12:08:30,139+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] START, FenceVdsVDSCommand(HostName = rhvh44-02.laptop.lab, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}), log id: d795f56
2021-09-23 12:08:30,247+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] FINISH, FenceVdsVDSCommand, return: FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}, log id: d795f56
2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: runVdsCommand, params: [FenceVds, FenceVdsVDSCommandParameters:{hostId='e549b2c5-b86d-4588-89b0-f419fa1246d6', targetVdsId='9f089443-550a-4974-a341-9f71dd6667f9', action='STATUS', agent='FenceAgent:{id='0766b87b-a3ba-4a05-864a-85aac5429117', hostId='9f089443-550a-4974-a341-9f71dd6667f9', order='1', type='ipmilan', ip='192.168.100.1', port='null', user='admin', password='***', encryptOptions='false', options='ipport=6230, lanplus=1'}', policy='null'}], timeElapsed: 108ms
2021-09-23 12:08:30,248+02 DEBUG [org.ovirt.engine.core.bll.pm.FenceAgentExecutor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Result of 'FenceOperationResult:{status='SUCCESS', powerStatus='ON', message=''}' fence action: {}
2021-09-23 12:08:30,248+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Opening SSH Soft Fencing session on host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,258+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connecting 'root.lab'
2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Connected: 'root.lab'
2021-09-23 12:08:30,332+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticating: 'root.lab'
2021-09-23 12:08:30,461+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Authenticated: 'root.lab'
2021-09-23 12:08:30,461+02 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing SSH Soft Fencing command on host 'rhvh44-01.laptop.lab'
2021-09-23 12:08:30,462+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executing: '/usr/bin/vdsm-tool service-restart vdsmd'
2021-09-23 12:08:50,108+02 DEBUG [org.ovirt.engine.core.uutils.ssh.SSHClient] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Executed: '/usr/bin/vdsm-tool service-restart vdsmd'
2021-09-23 12:08:50,112+02 DEBUG [org.ovirt.engine.core.common.di.interceptor.DebugLoggingInterceptor] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] method: getVdsManager, params: [9f089443-550a-4974-a341-9f71dd6667f9], timeElapsed: 0ms

## Hours later I restart ovirt-engine, and we can see that the fence task was still running:

2021-09-23 14:50:22,037+02 WARN  [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Interrupted: sleep interrupted
2021-09-23 14:50:22,037+02 DEBUG [org.ovirt.engine.core.utils.ThreadUtils] (EE-ManagedThreadFactory-engine-Thread-2977) [2766b4e] Exception: java.lang.InterruptedException: sleep interrupted

Comment 5 Pavol Brilla 2022-01-25 08:16:54 UTC
After unsuccessful soft-fence, server was successfully hard-fenced.

Verified on 4.4.10.4-0.1.el8ev

Comment 9 errata-xmlrpc 2022-02-08 10:04:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV Manager (ovirt-engine) [ovirt-4.4.10]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0461

Comment 10 meital avital 2022-08-07 11:31:45 UTC
Due to QE capacity, we are not going to cover this issue in our automation


Note You need to log in before you can comment on or make changes to this bug.