Bug 2071468
| Summary: | Engine fenced host that was already reconnected and set to Up status. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-engine | Assignee: | Eli Mesika <emesika> |
| Status: | CLOSED ERRATA | QA Contact: | Lucie Leistnerova <lleistne> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.4.10 | CC: | amashah, emarcus, lsvaty, mavital, michal.skrivanek, mperina, nashok, schandle |
| Target Milestone: | ovirt-4.5.0-1 | Keywords: | Regression |
| Target Release: | 4.5.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | ovirt-engine-4.5.0.5 | Doc Type: | Release Note |
| Doc Text: | If SSH soft fencing needs to be executed on a problematic host, the Red Hat Virtualization Manager now waits the expected time interval before it continues with fencing. As a result, VDSM has enough time to start and respond to the Red Hat Virtualization Manager. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-26 16:23:55 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Germano Veit Michel
2022-04-04 00:23:25 UTC
There are some locking errors in the logs related to VDS_INIT, but those seem related to another host that failed to install; unsure if related. Note the host was set to Up while KdumpDetection was running, and it went straight to STOP after SSH soft fencing and kdump detection finished.

~~~
2022-04-02 14:35:43,209-04 INFO [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] Running command: VdsKdumpDetectionCommand internal: true. Entities affected : ID: d6a9e3d8-ab11-4cd9-bcf9-4ef492fd1157 Type: VDS
2022-04-02 14:36:13,229-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] EVENT_ID: KDUMP_FLOW_NOT_DETECTED_ON_VDS(615), Kdump flow is not in progress on host xxx.
~~~

BZ2007286 modified isHostInGracePeriod:

~~~
// return when either attempts reached or timeout passed, the sooner takes
if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
    // too many unresponded attempts
    return false;
~~~

That unrespondedAttempts seems to be incremented by 1 on each NetworkException, and the limit is 2 until SSH soft fencing succeeds. Maybe this is reaching > 2 too quickly?

Adding another case. Can reproduce this with bug 2070045. OVF update failed.

~~~
2022-04-04 11:18:07,032+05 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true. Entities affected : ID: b30addc2-3067-42a2-98e6-765e4893d866 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN
2022-04-04 11:18:27,326+05 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

SSH soft fencing.

~~~
2022-04-04 11:18:27,467+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: 9184ea18-bade-4a82-8f36-cede09de0175 Type: VDS
2022-04-04 11:18:27,676+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Opening SSH Soft Fencing session on host 'dell-r530-3.'
2022-04-04 11:18:28,022+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Executing SSH Soft Fencing command on host 'dell-r530-3.'
~~~

GetAllVmStatsVDSCommand failed because of SSH fencing.

~~~
2022-04-04 11:18:45,515+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'GetAllVmStatsVDSCommand(HostName = dell-r530-3., VdsIdVDSCommandParametersBase:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
~~~

Power-Management: STOP

~~~
2022-04-04 11:18:48,351+05 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] Power-Management: STOP of host 'dell-r530-3.' initiated.
2022-04-04 11:18:48,722+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='aa954daa-9ac5-46bd-8311-e46e2e82920b', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 65284f25
~~~

So the host was stopped within 20 seconds of the first network failure, even when it was SPM.

How reproducible: 100%

(In reply to Germano Veit Michel from comment #5)
> Maybe this is reaching > 2 too quickly?

This can be configured from engine-config. Can you try setting VDSAttemptsToResetCount to a higher value, restart the engine, and test the scenario again?

(In reply to Eli Mesika from comment #8)
> This can be configured from engine-config.
> Can you try setting VDSAttemptsToResetCount to a higher value, restart
> the engine, and test the scenario again?

Checked with VDSAttemptsToResetCount=5. The host was fenced within 30 seconds.
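For reference, the engine-config change suggested above would be applied roughly as follows. This is a sketch assuming a standard RHV Manager installation; the value 5 matches the test reported in the comment.

```shell
# Raise the unresponded-attempts barrier, then restart the engine
# so the new value takes effect.
engine-config -s VDSAttemptsToResetCount=5
systemctl restart ovirt-engine

# Verify the configured value.
engine-config -g VDSAttemptsToResetCount
```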
~~~
2022-04-06 12:27:54,351+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-53) [5df9e8e3] Command 'UploadStreamVDSCommand(HostName = dell-r530-3, UploadStreamVDSCommandParameters:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: java.net.SocketTimeoutException: Read timed out
2022-04-06 12:28:17,524+05 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-104) [5df9e8e3] Server failed to respond, vds_id='9184ea18-bade-4a82-8f36-cede09de0175', vds_name='dell-r530-3', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: org.apache.http.NoHttpResponseException: dell-r530-3:54321 failed to respond
2022-04-06 12:28:22,249+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-96) [34af6e23] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='f25e834e-774e-485c-946e-8f8d2d15b903', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 48a31986
~~~

While we have to honor the network-exception grace period when the host is switched to 'connecting' state, scaled by its load (number of running VMs and SPM status), in the soft-fencing flow the host is already in non-responding status: another host has already taken the SPM role and all of its running VMs have been set to 'unknown' status. So we should not consider the host load at all; a fixed grace period (configurable, 1 min) is enough to restart the vdsmd service on the host and get it up and running.
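The decision described above could be sketched as follows. The class, method, and status names are hypothetical illustrations, not the actual engine code; the load extras (20 s for SPM, 0.5 s per running VM on top of a 60 s base) are taken from the grace-period description later in this report.

```java
public class FenceGrace {
    enum HostStatus { CONNECTING, NON_RESPONSIVE }

    // Sketch: while the host is still 'connecting', the grace period scales
    // with its load; once it is non-responsive and being soft-fenced, SPM
    // role and VM count no longer matter, so a fixed (configurable,
    // default 60 s) grace period applies before hard fencing.
    static double graceSeconds(HostStatus status, boolean isSpm, int runningVms) {
        if (status == HostStatus.NON_RESPONSIVE) {
            return 60.0; // fixed grace before hard fencing, regardless of load
        }
        return 60.0 + (isSpm ? 20.0 : 0.0) + 0.5 * runningVms;
    }

    public static void main(String[] args) {
        // Heavily loaded SPM host: load is ignored once non-responsive.
        System.out.println(graceSeconds(HostStatus.NON_RESPONSIVE, true, 50));
        // Same host while still connecting: load-based grace applies.
        System.out.println(graceSeconds(HostStatus.CONNECTING, true, 50));
    }
}
```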
The solution was tested with the host as SPM running VMs (some of them HA), with a non-SPM host running VMs, and with a regular host.
Results:
Both the initial grace period between 'connecting' and non-responding and the one between soft fencing and hard fencing are honored.
The code is more readable and straightforward.
Suggested Tests:
Configuration: two running hosts A and B, A with PM configured, an active SD, several running VMs, some of them HA
1)
Host A is SPM and running several VMs, some of them HA
stop the vdsmd service on host A
status is set to 'connecting' for the grace period
host becomes non-responding
soft fencing is executed on A and succeeds
VMs are restored on host A, HA VMs are up
SPM is transferred to host B
2)
Host A is SPM and running several VMs, some of them HA
block the vdsm port on A
status is set to 'connecting' for the grace period
host becomes non-responding
soft fencing is executed and fails
host is rebooted after 1 min
VMs are restored on host B, HA VMs are up
SPM is transferred to host B
3)
Host A is SPM and running several VMs, some of them HA
stop the vdsmd service on A
status is set to 'connecting' for the grace period
start the vdsmd service on A within 1 min
host becomes Up
VMs are restored on host A, HA VMs are up
SPM is transferred to host B
4)
Host A is SPM and running several VMs, some of them HA
block the vdsm port on A
status is set to 'connecting' for the grace period
host becomes non-responding
unblock the vdsm port on A within 1 min
host A is set to Up
VMs are restored on host A, HA VMs are up
SPM is transferred to host B
There are 2 grace period times:
1. when we get a network error (in 'connecting' state), the grace period is >= 60 sec, scaled by the host load (20 sec for SPM + 0.5 sec * <number of running VMs>)
2. when the host is soft-fenced (in non-responding state, before hard fencing is executed), the grace period should be exactly 1 min
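As a rough check, the load-based grace period in (1) can be modeled as a base timeout plus the SPM and VM extras. With a 60-second base this reproduces the `non-responsive_timeout (seconds)=81` seen in the log above for an SPM host with `vm_count=2`. The method names and the 60-second base are assumptions for illustration, not the engine's actual implementation.

```java
public class GracePeriod {
    // Hypothetical model of the 'connecting' grace period described above:
    // base timeout (assumed 60 s) + 20 s if the host is SPM
    // + 0.5 s per running VM.
    static double connectingGraceSeconds(boolean isSpm, int runningVms) {
        return 60.0 + (isSpm ? 20.0 : 0.0) + 0.5 * runningVms;
    }

    // The soft-fencing grace period in (2) is a fixed, configurable 1 min.
    static double softFencingGraceSeconds() {
        return 60.0;
    }

    public static void main(String[] args) {
        // SPM host with 2 running VMs: 60 + 20 + 0.5*2 = 81 s,
        // matching the logged non-responsive_timeout.
        System.out.println(connectingGraceSeconds(true, 2));
    }
}
```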
Tested the suggested flows in ovirt-engine-4.5.0.7-0.9.el8ev.noarch:
1) After 1 minute the fence agent was triggered, SPM moved to the other host, VMs restored
2) After 1 minute soft fencing was triggered and failed, restart via PM attempted, SPM moved to the other host, VMs restored
3) SPM moved to the other host, host back Up, VMs restored, no additional PM actions
4) SPM moved to the other host, host back Up, VMs restored, no additional PM actions

Sorry, the previous comment was not updated properly:
2) After 1 minute soft fencing was triggered and failed, SPM moved to the other host, restart via PM attempted, host restarted, HA VM restored on the other host

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711