Bug 2071468

Summary: Engine fenced host that was already reconnected and set to Up status.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-engine
Assignee: Eli Mesika <emesika>
Status: CLOSED ERRATA
QA Contact: Lucie Leistnerova <lleistne>
Severity: high
Docs Contact:
Priority: high
Version: 4.4.10
CC: amashah, emarcus, lsvaty, mavital, michal.skrivanek, mperina, nashok, schandle
Target Milestone: ovirt-4.5.0-1
Keywords: Regression
Target Release: 4.5.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ovirt-engine-4.5.0.5
Doc Type: Release Note
Doc Text:
If SSH soft fencing needs to be executed on a problematic host, the Red Hat Virtualization Manager now waits the expected time interval before it continues with fencing. As a result, VDSM has enough time to start and respond to the Red Hat Virtualization Manager.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-26 16:23:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Germano Veit Michel 2022-04-04 00:23:25 UTC
Description of problem:

The engine lost connection to a host, then reconnected to it fine, but fenced it anyway while the host was already in Up status. The sequence:

1. libvirt on the hypervisor crashed (known problem), vdsm restarts

2. Due to the above, GetAllVmStats fails and the host is set to non-responsive
2022-04-02 14:35:33,362-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1162) [] EVENT_ID: VDS_FAILURE(12), Host xxx is non responsive.

3. The engine starts soft fencing (SSH) and checks Power Management status

4. VDSM comes up and the connection succeeds

5. The engine runs the usual vds_init flow, connects the storage server to this host, and the host is in 'Up' state about 22s after step 2; VMs on the host were set to Up too.
2022-04-02 14:35:55,548-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [783b13bb] EVENT_ID: VDS_DETECTED(13), Status of host xxx was set to Up.

6. VMs are set to Up

7. The engine powers off the host 18s after it was set to Up.
2022-04-02 14:36:13,298-04 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [39c35cb0] Power-Management: STOP of host 'xxx' initiated.

Version-Release number of selected component (if applicable):
rhvm-4.4.10.7-0.4.el8ev.noarch
vdsm-4.40.40-1.el8ev.x86_64

How reproducible:
Unknown

Comment 4 Germano Veit Michel 2022-04-04 00:32:42 UTC
There are some locking errors in the logs related to VDS_INIT, but those seem to involve another host that failed to install; unsure whether they are related.

Note the host was set to Up while KdumpDetection was running, and it went straight to STOP after SSH soft fencing and kdump detection finished.

2022-04-02 14:35:43,209-04 INFO  [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] Running command: VdsKdumpDetectionCommand internal: true. Entities affected :  ID: d6a9e3d8-ab11-4cd9-bcf9-4ef492fd1157 Type: VDS

2022-04-02 14:36:13,229-04 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] EVENT_ID: KDUMP_FLOW_NOT_DETECTED_ON_VDS(615), Kdump flow is not in progress on host xxx.

Comment 5 Germano Veit Michel 2022-04-04 01:03:04 UTC
BZ2007286 modified isHostInGracePeriod:

        // return when either attempts reached or timeout passed, the sooner takes
        if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
            // too many unresponded attempts
            return false;
        }
That unrespondedAttempts counter seems to be incremented by 1 on each NetworkException, and the limit is 2 until SSH soft fencing succeeds.

Maybe this is reaching > 2 quickly?
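A minimal model of that hypothesis (illustrative names, not the actual engine code): if each NetworkException increments the counter and the barrier is 2, a third failed call ends the grace period regardless of how little time has passed, which would explain fencing within ~20 seconds.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative model of the quoted check: grace ends as soon as the
// unresponded-attempts counter passes the barrier, regardless of elapsed time.
public class GraceCheckSketch {
    static final int BARRIER = 2; // limit until SSH soft fencing succeeds

    static boolean isHostInGracePeriod(AtomicInteger unrespondedAttempts) {
        // too many unresponded attempts -> no longer in grace
        return unrespondedAttempts.get() <= BARRIER;
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        attempts.incrementAndGet(); // 1st NetworkException
        attempts.incrementAndGet(); // 2nd NetworkException
        System.out.println(isHostInGracePeriod(attempts)); // prints true
        attempts.incrementAndGet(); // 3rd NetworkException
        System.out.println(isHostInGracePeriod(attempts)); // prints false
    }
}
```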

Comment 6 nijin ashok 2022-04-04 06:18:07 UTC
Adding another case.

Can reproduce this with bug 2070045.

OVF update failed.

~~~
2022-04-04 11:18:07,032+05 INFO  [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true. Entities affected :  ID: b30addc2-3067-42a2-98e6-765e4893d866 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

2022-04-04 11:18:27,326+05 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

SSH soft fencing.

~~~
2022-04-04 11:18:27,467+05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Running command: SshSoftFencingCommand internal: true. Entities affected :  ID: 9184ea18-bade-4a82-8f36-cede09de0175 Type: VDS
2022-04-04 11:18:27,676+05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Opening SSH Soft Fencing session on host 'dell-r530-3.'
2022-04-04 11:18:28,022+05 INFO  [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Executing SSH Soft Fencing command on host 'dell-r530-3.'
~~~

GetAllVmStatsVDSCommand failed because of SSH soft fencing.

~~~
2022-04-04 11:18:45,515+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'GetAllVmStatsVDSCommand(HostName = dell-r530-3., VdsIdVDSCommandParametersBase:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
~~~

Power-Management: STOP

~~~
2022-04-04 11:18:48,351+05 INFO  [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] Power-Management: STOP of host 'dell-r530-3.' initiated.

2022-04-04 11:18:48,722+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='aa954daa-9ac5-46bd-8311-e46e2e82920b', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 65284f25
~~~

So the host was stopped within 20 seconds of the first network failure, even though it was the SPM.

How reproducible:
100%

Comment 8 Eli Mesika 2022-04-04 08:57:49 UTC
(In reply to Germano Veit Michel from comment #5)

> Maybe this is reaching > 2 quickly?

This can be configured from engine-config.
Can you try setting VDSAttemptsToResetCount to a higher value, restarting the engine, and testing the scenario again?
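On the Manager machine, that suggestion would look like the following (a sketch; engine-config changes require an engine restart to take effect):

```shell
# Check the current value, then raise the barrier and restart the engine.
engine-config -g VDSAttemptsToResetCount
engine-config -s VDSAttemptsToResetCount=5
systemctl restart ovirt-engine
```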

Comment 12 nijin ashok 2022-04-06 07:07:27 UTC
(In reply to Eli Mesika from comment #8)
> (In reply to Germano Veit Michel from comment #5)
> 
> > Maybe this is reaching > 2 quickly?
> 
> This can be configured from engine-config 
> Can you try setting VDSAttemptsToResetCount to a higher value , restart
> engine and test the scenario again ?

Checked with VDSAttemptsToResetCount=5. The host was fenced within 30 seconds.

2022-04-06 12:27:54,351+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-53) [5df9e8e3] Command 'UploadStreamVDSCommand(HostName = dell-r530-3, UploadStreamVDSCommandParameters:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: java.net.SocketTimeoutException: Read timed out


2022-04-06 12:28:17,524+05 INFO  [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-104) [5df9e8e3] Server failed to respond, vds_id='9184ea18-bade-4a82-8f36-cede09de0175', vds_name='dell-r530-3', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: org.apache.http.NoHttpResponseException: dell-r530-3:54321 failed to respond

2022-04-06 12:28:22,249+05 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-96) [34af6e23] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='f25e834e-774e-485c-946e-8f8d2d15b903', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 48a31986

Comment 14 Eli Mesika 2022-04-11 11:24:34 UTC
While we have to honor the network-exception grace period when the host
is switched to the 'connecting' state, scaled by its load (number of
running VMs and SPM status), in the soft-fencing flow the host is
already in non-responsive status: another host has already taken the SPM
role and all of its running VMs are set to 'unknown' status. So we
should not consider the host load at all; a fixed grace period
(configurable, 1 min) is enough to restart the vdsmd service on the host
and get it up and running.

The solution was tested with the host as SPM running VMs (some of them HA),
with a non-SPM host running VMs, and with a regular host.

Results:

    Both the initial grace period (between Connecting and Non Responsive) and
    the grace period between soft fencing and hard fencing are honored.

    The code is more readable and straightforward.

Comment 16 Eli Mesika 2022-04-12 14:29:02 UTC
Suggested Tests: 
Configuration: two running hosts A and B, A with PM configured, an active SD, several running VMs, some of them HA

1)
    Host A is SPM and running several VMs, some of them HA
    stop vdsmd service on host A
    status is set to Connecting for the grace period
    host becomes Non Responsive
    soft fencing is executed on A and succeeds
    VMs restored on host A, HA VMs are up
    SPM transferred to host B

2)
    Host A is SPM and running several VMs, some of them HA
    block the vdsm port on A
    status is set to Connecting for the grace period
    host becomes Non Responsive
    soft fencing is executed and fails
    host is rebooted after 1 min
    VMs restored on host B, HA VMs are up
    SPM transferred to host B

3)
    Host A is SPM and running several VMs, some of them HA
    stop vdsmd service on A
    status is set to Connecting for the grace period
    start vdsmd service on A within 1 min
    host becomes Up
    VMs restored on host A, HA VMs are up
    SPM transferred to host B

4)
    Host A is SPM and running several VMs, some of them HA
    block the vdsm port on A
    status is set to Connecting for the grace period
    host becomes Non Responsive
    unblock the vdsm port on A within 1 min
    host A is set to Up
    VMs restored on host A, HA VMs are up
    SPM transferred to host B
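For the flows that block the vdsm port, a firewall rule is one way to simulate the step (a sketch, assuming vdsm's default port 54321, which also appears in the logs in this bug; run on host A):

```shell
# On host A: drop incoming connections to vdsm (default port 54321)
iptables -A INPUT -p tcp --dport 54321 -j DROP

# ...the engine sets the host to Connecting, then Non Responsive...

# To unblock within 1 min (flow 4), delete the same rule
iptables -D INPUT -p tcp --dport 54321 -j DROP
```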


There are two grace periods:

1. When we get a network error (in the Connecting state): this grace period is >= 60 sec, scaled by the host load (an extra 20 sec for SPM + 0.5 sec * <number of running vms>)
2. When the host is soft-fenced (in Non Responsive state, before hard fencing is executed): this should be exactly 1 min
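As a sketch, the load-scaled formula above can be written out like this. The exact base of 60 sec plus the SPM and per-VM terms is an assumption inferred from the log line "non-responsive_timeout (seconds)=81" for an SPM host with vm_count=2 in comment 12; names are illustrative, not the actual ovirt-engine code.

```java
// Sketch of the two grace periods described above (illustrative names).
public class GracePeriodSketch {

    // Grace period while the host is in Connecting state: load-dependent,
    // assumed as 60s base + 20s if SPM + 0.5s per running VM.
    static int connectingGraceSeconds(boolean isSpm, int runningVms) {
        return (int) (60 + (isSpm ? 20 : 0) + 0.5 * runningVms);
    }

    // Grace period between SSH soft fencing and hard fencing: fixed (configurable).
    static int softFencingGraceSeconds() {
        return 60;
    }

    public static void main(String[] args) {
        // Matches the logged timeout of 81s for the SPM host running 2 VMs.
        System.out.println(connectingGraceSeconds(true, 2)); // prints 81
    }
}
```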

Comment 19 Lucie Leistnerova 2022-05-24 05:49:15 UTC
Tested suggested flows in ovirt-engine-4.5.0.7-0.9.el8ev.noarch

1) After 1 minute fence agent triggered, SPM moved to other host, VMs restored
2) After 1 minute soft fencing triggered and failed, trying to restart via PM, SPM moved to other host, VMs restored
3) SPM moved to other host, back Up, VMs restored, no additional PM actions
4) SPM moved to other host, back Up, VMs restored, no additional PM actions

Comment 20 Lucie Leistnerova 2022-05-24 05:53:04 UTC
Sorry, I sent a comment that was not properly updated.
2) After 1 minute soft fencing triggered and failed, SPM moved to other host, trying to restart via PM, restarted, HA VM restored on other host

Comment 25 errata-xmlrpc 2022-05-26 16:23:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711