Bug 2071468
Summary: | Engine fenced host that was already reconnected and set to Up status. | | |
---|---|---|---|
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
Status: | CLOSED ERRATA | QA Contact: | Lucie Leistnerova <lleistne> |
Severity: | high | Docs Contact: | |
Priority: | high | | |
Version: | 4.4.10 | CC: | amashah, emarcus, lsvaty, mavital, michal.skrivanek, mperina, nashok, schandle |
Target Milestone: | ovirt-4.5.0-1 | Keywords: | Regression |
Target Release: | 4.5.0 | | |
Hardware: | x86_64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Fixed In Version: | ovirt-engine-4.5.0.5 | Doc Type: | Release Note |
Doc Text: | If SSH soft fencing needs to be executed on a problematic host, the Red Hat Virtualization Manager now waits the expected time interval before it continues with fencing. As a result, VDSM has enough time to start and respond to the Red Hat Virtualization Manager. | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-05-26 16:23:55 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Description
Germano Veit Michel 2022-04-04 00:23:25 UTC
There are some locking errors in the logs related to VDS_INIT, but those seem to belong to another host that failed to install; it is unclear whether they are related. Note that the host was set to Up while kdump detection was still running, and it went straight to STOP after SSH soft fencing and kdump detection finished.

~~~
2022-04-02 14:35:43,209-04 INFO [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] Running command: VdsKdumpDetectionCommand internal: true. Entities affected : ID: d6a9e3d8-ab11-4cd9-bcf9-4ef492fd1157 Type: VDS
2022-04-02 14:36:13,229-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] EVENT_ID: KDUMP_FLOW_NOT_DETECTED_ON_VDS(615), Kdump flow is not in progress on host xxx.
~~~

BZ2007286 modified isHostInGracePeriod:

~~~
// return when either attempts reached or timeout passed, the sooner takes
if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
    // too many unresponded attempts
    return false;
}
~~~

That unrespondedAttempts seems to be incremented by 1 on every NetworkException, and the limit is 2 until SSH soft fencing succeeds. Maybe this is reaching > 2 quickly?
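For orientation, here is a minimal, self-contained sketch of the kind of check quoted above. This is not the actual ovirt-engine implementation; the barrier and timeout values, the class name, and the surrounding fields and helpers are illustrative assumptions.

~~~
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: illustrates the "whichever limit is hit first" grace-period check
// discussed in this bug, not the real VdsManager code.
public class GracePeriodSketch {

    private final AtomicInteger unrespondedAttempts = new AtomicInteger();
    private long lastSuccessfulResponseMillis = System.currentTimeMillis();

    // Hypothetical limits: 2 unresponded attempts, 60 seconds of silence.
    private static final int UNRESPONDED_ATTEMPTS_BARRIER = 2;
    private static final long TIMEOUT_MILLIS = 60_000L;

    boolean isHostInGracePeriod() {
        // return when either attempts reached or timeout passed, the sooner takes
        if (unrespondedAttempts.get() > UNRESPONDED_ATTEMPTS_BARRIER) {
            // too many unresponded attempts
            return false;
        }
        return System.currentTimeMillis() - lastSuccessfulResponseMillis < TIMEOUT_MILLIS;
    }

    void onNetworkException() {
        // each NetworkException bumps the counter, so a burst of failed calls
        // can cross the attempt barrier long before the time limit expires
        unrespondedAttempts.incrementAndGet();
    }
}
~~~

In this reading, a few failed calls in quick succession (for example a failed OVF upload followed by a failed stats call, as in the case below) can end the grace period within seconds, which is the behaviour suspected here.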
Adding another case. Can reproduce this with bug 2070045.

OVF update failed:

~~~
2022-04-04 11:18:07,032+05 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true. Entities affected : ID: b30addc2-3067-42a2-98e6-765e4893d866 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2022-04-04 11:18:27,326+05 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

SSH soft fencing:

~~~
2022-04-04 11:18:27,467+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: 9184ea18-bade-4a82-8f36-cede09de0175 Type: VDS
2022-04-04 11:18:27,676+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Opening SSH Soft Fencing session on host 'dell-r530-3.'
2022-04-04 11:18:28,022+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Executing SSH Soft Fencing command on host 'dell-r530-3.'
~~~

GetAllVmStatsVDSCommand failed because of the SSH fencing:

~~~
2022-04-04 11:18:45,515+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'GetAllVmStatsVDSCommand(HostName = dell-r530-3., VdsIdVDSCommandParametersBase:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
~~~

Power-Management: STOP:

~~~
2022-04-04 11:18:48,351+05 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] Power-Management: STOP of host 'dell-r530-3.' initiated.
2022-04-04 11:18:48,722+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='aa954daa-9ac5-46bd-8311-e46e2e82920b', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 65284f25
~~~

So the host was stopped within 20 seconds of the first network failure, even though it was the SPM.

How reproducible: 100%

(In reply to Germano Veit Michel from comment #5)
> Maybe this is reaching > 2 quickly?

This can be configured from engine-config. Can you try setting VDSAttemptsToResetCount to a higher value, restart the engine, and test the scenario again?

(In reply to Eli Mesika from comment #8)
> (In reply to Germano Veit Michel from comment #5)
> > Maybe this is reaching > 2 quickly?
>
> This can be configured from engine-config. Can you try setting
> VDSAttemptsToResetCount to a higher value, restart the engine, and test the
> scenario again?

Checked with VDSAttemptsToResetCount=5. The host was fenced within 30 seconds.

~~~
2022-04-06 12:27:54,351+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-53) [5df9e8e3] Command 'UploadStreamVDSCommand(HostName = dell-r530-3, UploadStreamVDSCommandParameters:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: java.net.SocketTimeoutException: Read timed out
2022-04-06 12:28:17,524+05 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-104) [5df9e8e3] Server failed to respond, vds_id='9184ea18-bade-4a82-8f36-cede09de0175', vds_name='dell-r530-3', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: org.apache.http.NoHttpResponseException: dell-r530-3:54321 failed to respond
2022-04-06 12:28:22,249+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-96) [34af6e23] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='f25e834e-774e-485c-946e-8f8d2d15b903', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 48a31986
~~~

While we do have to honor the network-exception grace period, based on the host load (number of running VMs and SPM status), when the host is switched to the 'connecting' state, in the soft-fencing flow the host is already in non-responsive status: another host has already taken over the SPM role and all of its running VMs have been set to 'unknown' status. So we should not consider the host load at all; a fixed grace period (configurable, 1 minute) is enough to restart the vdsmd service on the host and get it up and running. The solution was tested with the host as SPM with running VMs (some of them HA), with a non-SPM host running VMs, and with a regular host.
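To make the two timeouts concrete, here is a minimal sketch of how the grace periods described in this bug could be computed. It is illustrative only, not the ovirt-engine code: the method and constant names are invented, and the assumption that the load-based extension sits on top of a 60-second base is inferred from the 81-second non-responsive_timeout logged above for an SPM host with 2 running VMs.

~~~
// Sketch only: the two grace periods discussed in this bug, not the real
// ovirt-engine implementation.
public class GraceTimeoutsSketch {

    // Assumed 60-second base before a 'connecting' host is declared non-responsive.
    private static final int BASE_TIMEOUT_SEC = 60;
    private static final int SPM_EXTRA_SEC = 20;         // extra time if the host is SPM
    private static final double PER_VM_EXTRA_SEC = 0.5;  // extra time per running VM

    // Grace period while the host is in 'connecting' state: scales with host load.
    // For an SPM host with 2 running VMs: 60 + 20 + 0.5 * 2 = 81 seconds, which
    // matches the non-responsive_timeout seen in the log above.
    static int connectingGraceSeconds(boolean isSpm, int runningVms) {
        double timeout = BASE_TIMEOUT_SEC
                + (isSpm ? SPM_EXTRA_SEC : 0)
                + PER_VM_EXTRA_SEC * runningVms;
        return (int) Math.round(timeout);
    }

    // Grace period between SSH soft fencing and hard fencing: the host is already
    // non-responsive, SPM and VM ownership have been handed off, so host load is
    // ignored and a fixed (configurable) interval is used instead.
    static int softFencingGraceSeconds() {
        return 60;
    }

    public static void main(String[] args) {
        System.out.println(connectingGraceSeconds(true, 2)); // 81
        System.out.println(softFencingGraceSeconds());       // 60
    }
}
~~~

With this split, a burst of network errors no longer shortens the soft-fencing window; VDSM always gets the full fixed interval to come back up before hard fencing is attempted.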
Results: Both the initial grace period between connecting and non-responsive and the grace period between soft fencing and hard fencing are honored. The code is also more readable and straightforward.

Suggested tests:

Configuration: two running hosts A and B, A with PM configured, an active SD, several running VMs, some of them HA.

1) Host A is SPM and running several VMs, some of them HA.
   - Stop the vdsmd service on host A.
   - Status is set to connecting for the grace period.
   - Host becomes non-responsive.
   - Soft fencing is executed on A and succeeds.
   - VMs are restored on host A, HA VMs are up.
   - SPM is transferred to host B.

2) Host A is SPM and running several VMs, some of them HA.
   - Block the vdsm port on A.
   - Status is set to connecting for the grace period.
   - Host becomes non-responsive.
   - Soft fencing is executed and fails.
   - Host is rebooted after 1 minute.
   - VMs are restored on host B, HA VMs are up.
   - SPM is transferred to host B.

3) Host A is SPM and running several VMs, some of them HA.
   - Stop the vdsmd service on A.
   - Status is set to connecting for the grace period.
   - Start the vdsmd service on A within 1 minute.
   - Host becomes Up.
   - VMs are restored on host A, HA VMs are up.
   - SPM is transferred to host B.

4) Host A is SPM and running several VMs, some of them HA.
   - Block the vdsm port on A.
   - Status is set to connecting for the grace period.
   - Host becomes non-responsive.
   - Unblock the vdsm port on A within 1 minute.
   - Host A is set to Up.
   - VMs are restored on host A, HA VMs are up.
   - SPM is transferred to host B.

There are two grace period times:

1. When we get a network error (in connecting state), the grace period is >= 60 seconds depending on the host load (20 seconds for SPM + 0.5 seconds * <number of running VMs>).
2. When the host is soft-fenced (in non-responsive state, before hard fencing is executed), the grace period should be exactly 1 minute.

Tested the suggested flows in ovirt-engine-4.5.0.7-0.9.el8ev.noarch:

1) After 1 minute the fence agent triggered, SPM moved to the other host, VMs restored.
2) After 1 minute soft fencing triggered and failed, trying to restart via PM, SPM moved to the other host, VMs restored.
3) SPM moved to the other host, host back Up, VMs restored, no additional PM actions.
4) SPM moved to the other host, host back Up, VMs restored, no additional PM actions.

Sorry, the previous comment was not updated properly; the corrected result for flow 2 is:

2) After 1 minute soft fencing triggered and failed, SPM moved to the other host, trying to restart via PM, host restarted, HA VM restored on the other host.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711