Description of problem:

The engine lost connection to a host, then reconnected to it fine, but fenced it anyway while the host was already in Up status. Sequence of events:

1. libvirt on the hypervisor crashed (known problem), vdsm restarts.

2. Due to the above, GetAllVmStats fails and the host is set to non-responsive:

2022-04-02 14:35:33,362-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1162) [] EVENT_ID: VDS_FAILURE(12), Host xxx is non responsive.

3. Engine starts soft fencing (SSH) and checks Power Management status.

4. VDSM comes back up and the connection succeeds.

5. Engine runs the usual VDS_INIT steps, connects the storage server to this host, and the host is in 'Up' state about 22s after [2]:

2022-04-02 14:35:55,548-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [783b13bb] EVENT_ID: VDS_DETECTED(13), Status of host xxx was set to Up.

6. VMs on the host are set to Up as well.

7. Engine powers off the host 18s after it was set to Up:

2022-04-02 14:36:13,298-04 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [39c35cb0] Power-Management: STOP of host 'xxx' initiated.

Version-Release number of selected component (if applicable):
rhvm-4.4.10.7-0.4.el8ev.noarch
vdsm-4.40.40-1.el8ev.x86_64

How reproducible:
Unknown
There are some locking errors in the logs related to VDS_INIT, but those seem to involve another host that failed to install; it is unclear whether they are related. Note that the host was set to Up while kdump detection was still running, and it went straight to STOP once SSH soft fencing and kdump detection finished:

2022-04-02 14:35:43,209-04 INFO [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] Running command: VdsKdumpDetectionCommand internal: true. Entities affected : ID: d6a9e3d8-ab11-4cd9-bcf9-4ef492fd1157 Type: VDS

2022-04-02 14:36:13,229-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] EVENT_ID: KDUMP_FLOW_NOT_DETECTED_ON_VDS(615), Kdump flow is not in progress on host xxx.
BZ2007286 modified isHostInGracePeriod:

    // return when either attempts reached or timeout passed, the sooner takes
    if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
        // too many unresponded attempts
        return false;
    }

That unrespondedAttempts counter seems to be incremented by 1 on each NetworkException, and the limit is 2 until SSH soft fencing succeeds. Maybe this is reaching > 2 quickly?
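To make the suspicion above concrete, here is a minimal self-contained sketch (not the actual ovirt-engine code; all names and values are assumptions) of how an attempts barrier like the one quoted can end the grace period well before any time-based limit is reached:

~~~
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration only: each NetworkException bumps a counter, and the
// grace period ends as soon as either the counter passes a barrier or a timeout
// elapses, whichever comes first.
public class GracePeriodSketch {
    private final AtomicInteger unrespondedAttempts = new AtomicInteger();
    private final int unrespondedAttemptsBarrier = 2;   // assumed: 2 until soft fencing succeeds
    private final long graceTimeoutMillis = 60_000L;    // assumed time-based limit
    private volatile long firstFailureMillis = -1;

    /** Called whenever monitoring of the host fails with a NetworkException. */
    public void onNetworkException() {
        unrespondedAttempts.incrementAndGet();
        if (firstFailureMillis < 0) {
            firstFailureMillis = System.currentTimeMillis();
        }
    }

    /** True while the host should still be given a chance before fencing. */
    public boolean isHostInGracePeriod() {
        // return when either attempts reached or timeout passed, the sooner takes
        if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
            return false; // too many unresponded attempts
        }
        return firstFailureMillis < 0
                || System.currentTimeMillis() - firstFailureMillis < graceTimeoutMillis;
    }
}
~~~

With a barrier of 2, three quick NetworkExceptions end the grace period no matter how much of the time window is left.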
Adding another case. Can reproduce this with bug 2070045.

OVF update failed:

~~~
2022-04-04 11:18:07,032+05 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true. Entities affected : ID: b30addc2-3067-42a2-98e6-765e4893d866 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

2022-04-04 11:18:27,326+05 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

SSH soft fencing:

~~~
2022-04-04 11:18:27,467+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: 9184ea18-bade-4a82-8f36-cede09de0175 Type: VDS

2022-04-04 11:18:27,676+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Opening SSH Soft Fencing session on host 'dell-r530-3.'

2022-04-04 11:18:28,022+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Executing SSH Soft Fencing command on host 'dell-r530-3.'
~~~

GetAllVmStatsVDSCommand failed because of the SSH soft fencing:

~~~
2022-04-04 11:18:45,515+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'GetAllVmStatsVDSCommand(HostName = dell-r530-3., VdsIdVDSCommandParametersBase:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
~~~

Power-Management: STOP:

~~~
2022-04-04 11:18:48,351+05 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] Power-Management: STOP of host 'dell-r530-3.' initiated.

2022-04-04 11:18:48,722+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='aa954daa-9ac5-46bd-8311-e46e2e82920b', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 65284f25
~~~

So the host was stopped within 20 seconds of the first network failure, even though it was the SPM.

How reproducible:
100%
(In reply to Germano Veit Michel from comment #5)

> Maybe this is reaching > 2 quickly?

This can be configured from engine-config.
Can you try setting VDSAttemptsToResetCount to a higher value, restart the engine, and test the scenario again?
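For reference, on a standard RHV Manager host this change would typically be applied like the following (the value of 5 is just an example):

~~~
# engine-config -s VDSAttemptsToResetCount=5
# systemctl restart ovirt-engine
~~~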
(In reply to Eli Mesika from comment #8)

> (In reply to Germano Veit Michel from comment #5)
>
> > Maybe this is reaching > 2 quickly?
>
> This can be configured from engine-config.
> Can you try setting VDSAttemptsToResetCount to a higher value, restart the
> engine, and test the scenario again?

Checked with VDSAttemptsToResetCount=5. The host was fenced within 30 seconds.

2022-04-06 12:27:54,351+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-53) [5df9e8e3] Command 'UploadStreamVDSCommand(HostName = dell-r530-3, UploadStreamVDSCommandParameters:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: java.net.SocketTimeoutException: Read timed out

2022-04-06 12:28:17,524+05 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-104) [5df9e8e3] Server failed to respond, vds_id='9184ea18-bade-4a82-8f36-cede09de0175', vds_name='dell-r530-3', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: org.apache.http.NoHttpResponseException: dell-r530-3:54321 failed to respond

2022-04-06 12:28:22,249+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-96) [34af6e23] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='f25e834e-774e-485c-946e-8f8d2d15b903', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 48a31986
While we have to honor a load-based grace period for network exceptions when the host is switched to the 'connecting' state (the load being the number of running VMs and the SPM status), in the soft-fencing flow the host is already in non-responding status, another host has already taken the SPM role, and all of its running VMs have been set to 'unknown' status. So we should not consider the host load at all there; a fixed grace period (configurable, 1 min) is enough to restart the vdsmd service on the host and get it up and running.

The solution was tested with the host as SPM running VMs (some of them HA), with a non-SPM host running VMs, and with a regular host.

Results: both the initial grace period between 'connecting' and non-responding and the grace period between soft fencing and hard fencing are honored, and the code is more readable and straightforward.
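A minimal sketch of the selection described above, assuming hypothetical names and an additive load-based formula (see the grace period description in the suggested tests below); this is illustrative only, not the actual ovirt-engine implementation:

~~~
// Illustrative sketch: once the host is already non-responding and being
// soft-fenced, its load no longer matters and a fixed, configurable grace
// period is used; the load-based grace only applies while still 'connecting'.
public class GraceSelectionSketch {

    enum HostPhase { CONNECTING, NON_RESPONDING_SOFT_FENCING }

    // Assumed value standing in for the engine's configurable (1 min).
    static final long SOFT_FENCE_GRACE_SECONDS = 60;

    static long gracePeriodSeconds(HostPhase phase, boolean isSpm, int runningVms) {
        if (phase == HostPhase.NON_RESPONDING_SOFT_FENCING) {
            // SPM has already moved and the VMs are 'unknown' -> ignore host load.
            return SOFT_FENCE_GRACE_SECONDS;
        }
        // 'connecting': assumed additive form, 60s base + 20s if SPM + 0.5s per running VM.
        return 60 + (isSpm ? 20 : 0) + Math.round(0.5 * runningVms);
    }

    public static void main(String[] args) {
        System.out.println(gracePeriodSeconds(HostPhase.CONNECTING, true, 2));                  // 81
        System.out.println(gracePeriodSeconds(HostPhase.NON_RESPONDING_SOFT_FENCING, true, 2)); // 60
    }
}
~~~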
Suggested Tests:

Configuration: two running hosts A and B, A with PM configured, an active SD, several running VMs, some of them HA.

1) Host A is SPM and running several VMs, some of them HA
   - stop the vdsmd service on host A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - soft fencing is executed on A and succeeds
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

2) Host A is SPM and running several VMs, some of them HA
   - block the vdsm port on A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - soft fencing is executed and fails
   - host is rebooted after 1 min
   - VMs restored on host B, HA VMs are up
   - SPM transferred to host B

3) Host A is SPM and running several VMs, some of them HA
   - stop the vdsmd service on A
   - status set to 'connecting' for the grace period
   - start the vdsmd service on A within 1 min
   - host becomes Up
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

4) Host A is SPM and running several VMs, some of them HA
   - block the vdsm port on A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - unblock the vdsm port on A within 1 min
   - host A is set to Up
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

There are two grace period times:

1. When we get a network error (in 'connecting' state), the grace period is >= 60 sec according to the host load (20 sec for SPM + 0.5 sec * <number of running VMs>).
2. When the host is soft-fenced (in non-responding state, before hard fencing is executed), the grace period is exactly 1 min.
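As a concrete check of grace period [1], reading the formula as additive: an SPM host with vm_count=2 gets 60 + 20 + 0.5 * 2 = 81 seconds, which matches the "non-responsive_timeout (seconds)=81" printed by VdsManager in the log excerpt earlier in this report. Grace period [2] stays a flat 60 seconds regardless of how loaded the host is.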
Tested suggested flows in ovirt-engine-4.5.0.7-0.9.el8ev.noarch

1) After 1 minute fence agent triggered, SPM moved to other host, VMs restored
2) After 1 minute soft fencing triggered and failed, trying to restart via PM, SPM moved to other host, VMs restored
3) SPM moved to other host, back Up, VMs restored, no additional PM actions
4) SPM moved to other host, back Up, VMs restored, no additional PM actions
Sorry, the previous comment was not updated properly. Corrected result for scenario 2:

2) After 1 minute soft fencing triggered and failed, SPM moved to other host, restart via PM attempted, host restarted, HA VM restored on the other host
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711