Description of problem:

The engine lost connection to a host, then reconnected to it fine, but fenced it anyway while the host was already in Up status. Sequence of events:

1. libvirt on the hypervisor crashed (known problem), vdsm restarts.

2. Due to the above, GetAllVmStats fails and the host is set to non-responsive:

2022-04-02 14:35:33,362-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1162) [] EVENT_ID: VDS_FAILURE(12), Host xxx is non responsive.

3. Engine starts soft fencing (SSH) and checks Power Management status.

4. VDSM comes back up and the connection succeeds.

5. Engine runs the usual VDS_INIT steps, connects the storage server to this host, and the host is in 'Up' state about 22s after [2]:

2022-04-02 14:35:55,548-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-54) [783b13bb] EVENT_ID: VDS_DETECTED(13), Status of host xxx was set to Up.

6. VMs on the host are set to Up as well.

7. Engine powers off the host 18s after it was set to Up:

2022-04-02 14:36:13,298-04 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [39c35cb0] Power-Management: STOP of host 'xxx' initiated.

Version-Release number of selected component (if applicable):
rhvm-4.4.10.7-0.4.el8ev.noarch
vdsm-4.40.40-1.el8ev.x86_64

How reproducible:
Unknown
There are some locking errors in the logs related to VDS_INIT, but those seem to involve another host that failed to install; it is unclear whether they are related. Note that the host was set to Up while kdump detection was still running, and it went straight to STOP once SSH soft fencing and kdump detection finished:

2022-04-02 14:35:43,209-04 INFO [org.ovirt.engine.core.bll.pm.VdsKdumpDetectionCommand] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] Running command: VdsKdumpDetectionCommand internal: true. Entities affected : ID: d6a9e3d8-ab11-4cd9-bcf9-4ef492fd1157 Type: VDS

2022-04-02 14:36:13,229-04 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1163) [55a443d4] EVENT_ID: KDUMP_FLOW_NOT_DETECTED_ON_VDS(615), Kdump flow is not in progress on host xxx.
BZ2007286 modified isHostInGracePeriod:

    // return when either attempts reached or timeout passed, the sooner takes
    if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
        // too many unresponded attempts
        return false;
    }

That unrespondedAttempts counter seems to be incremented by 1 on each NetworkException, and the limit is 2 until SSH soft fencing succeeds. Maybe this is reaching > 2 quickly?
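To make the suspicion above concrete, here is a minimal self-contained sketch (not the actual ovirt-engine code; all names and values are assumptions) of how an attempts barrier like the one quoted can end the grace period well before any time-based limit is reached:

~~~
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration only: each NetworkException bumps a counter, and the
// grace period ends as soon as either the counter passes a barrier or a timeout
// elapses, whichever comes first.
public class GracePeriodSketch {
    private final AtomicInteger unrespondedAttempts = new AtomicInteger();
    private final int unrespondedAttemptsBarrier = 2;   // assumed: 2 until soft fencing succeeds
    private final long graceTimeoutMillis = 60_000L;    // assumed time-based limit
    private volatile long firstFailureMillis = -1;

    /** Called whenever monitoring of the host fails with a NetworkException. */
    public void onNetworkException() {
        unrespondedAttempts.incrementAndGet();
        if (firstFailureMillis < 0) {
            firstFailureMillis = System.currentTimeMillis();
        }
    }

    /** True while the host should still be given a chance before fencing. */
    public boolean isHostInGracePeriod() {
        // return when either attempts reached or timeout passed, the sooner takes
        if (unrespondedAttempts.get() > unrespondedAttemptsBarrier) {
            return false; // too many unresponded attempts
        }
        return firstFailureMillis < 0
                || System.currentTimeMillis() - firstFailureMillis < graceTimeoutMillis;
    }
}
~~~

With a barrier of 2, three quick NetworkExceptions end the grace period no matter how much of the time window is left.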
Adding another case. Can reproduce this with bug 2070045.

OVF update failed:

~~~
2022-04-04 11:18:07,032+05 INFO [org.ovirt.engine.core.bll.storage.ovfstore.ProcessOvfUpdateForStorageDomainCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Running command: ProcessOvfUpdateForStorageDomainCommand internal: true. Entities affected : ID: b30addc2-3067-42a2-98e6-765e4893d866 Type: StorageAction group MANIPULATE_STORAGE_DOMAIN with role type ADMIN

2022-04-04 11:18:27,326+05 ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-39) [315f61be] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

SSH soft fencing:

~~~
2022-04-04 11:18:27,467+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Running command: SshSoftFencingCommand internal: true. Entities affected : ID: 9184ea18-bade-4a82-8f36-cede09de0175 Type: VDS

2022-04-04 11:18:27,676+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Opening SSH Soft Fencing session on host 'dell-r530-3.'

2022-04-04 11:18:28,022+05 INFO [org.ovirt.engine.core.bll.pm.SshSoftFencingCommand] (EE-ManagedThreadFactory-engine-Thread-235) [6a9c4a94] Executing SSH Soft Fencing command on host 'dell-r530-3.'
~~~

GetAllVmStatsVDSCommand failed because of the SSH soft fencing:

~~~
2022-04-04 11:18:45,515+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-66) [] Command 'GetAllVmStatsVDSCommand(HostName = dell-r530-3., VdsIdVDSCommandParametersBase:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: VDSGenericException: VDSNetworkException: Broken pipe
~~~

Power-Management: STOP:

~~~
2022-04-04 11:18:48,351+05 INFO [org.ovirt.engine.core.bll.pm.StopVdsCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] Power-Management: STOP of host 'dell-r530-3.' initiated.

2022-04-04 11:18:48,722+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-235) [1ea685a5] START, FenceVdsVDSCommand(HostName = dell-r530-4., FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='aa954daa-9ac5-46bd-8311-e46e2e82920b', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 65284f25
~~~

So the host was stopped within 20 seconds of the first network failure, even though it was the SPM.

How reproducible:
100%
(In reply to Germano Veit Michel from comment #5)

> Maybe this is reaching > 2 quickly?

This can be configured from engine-config.
Can you try setting VDSAttemptsToResetCount to a higher value, restart the engine, and test the scenario again?
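For reference, on a standard RHV Manager host this change would typically be applied like the following (the value of 5 is just an example):

~~~
# engine-config -s VDSAttemptsToResetCount=5
# systemctl restart ovirt-engine
~~~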
(In reply to Eli Mesika from comment #8)

> (In reply to Germano Veit Michel from comment #5)
>
> > Maybe this is reaching > 2 quickly?
>
> This can be configured from engine-config.
> Can you try setting VDSAttemptsToResetCount to a higher value, restart the
> engine, and test the scenario again?

Checked with VDSAttemptsToResetCount=5. The host was fenced within 30 seconds.

2022-04-06 12:27:54,351+05 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-53) [5df9e8e3] Command 'UploadStreamVDSCommand(HostName = dell-r530-3, UploadStreamVDSCommandParameters:{hostId='9184ea18-bade-4a82-8f36-cede09de0175'})' execution failed: java.net.SocketTimeoutException: Read timed out

2022-04-06 12:28:17,524+05 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-104) [5df9e8e3] Server failed to respond, vds_id='9184ea18-bade-4a82-8f36-cede09de0175', vds_name='dell-r530-3', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: org.apache.http.NoHttpResponseException: dell-r530-3:54321 failed to respond

2022-04-06 12:28:22,249+05 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.FenceVdsVDSCommand] (EE-ManagedThreadFactory-engine-Thread-96) [34af6e23] START, FenceVdsVDSCommand(HostName = dell-r530-4, FenceVdsVDSCommandParameters:{hostId='213ae779-00b1-4a7c-abce-f8f5c116b7bf', targetVdsId='9184ea18-bade-4a82-8f36-cede09de0175', action='STOP', agent='FenceAgent:{id='f25e834e-774e-485c-946e-8f8d2d15b903', hostId='9184ea18-bade-4a82-8f36-cede09de0175', order='1', type='ipmilan', ip='10.65.177.172', port='null', user='rcuser', password='***', encryptOptions='false', options='privlvl=OPERATOR, delay=10, lanplus=1'}', policy='[]'}), log id: 48a31986
While we have to honor a load-based grace period for network exceptions when the host is switched to the 'connecting' state (the load being the number of running VMs and the SPM status), in the soft-fencing flow the host is already in non-responding status, another host has already taken the SPM role, and all of its running VMs have been set to 'unknown' status. So we should not consider the host load at all there; a fixed grace period (configurable, 1 min) is enough to restart the vdsmd service on the host and get it up and running.

The solution was tested with the host as SPM running VMs (some of them HA), with a non-SPM host running VMs, and with a regular host.

Results: both the initial grace period between 'connecting' and non-responding and the grace period between soft fencing and hard fencing are honored, and the code is more readable and straightforward.
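A minimal sketch of the selection described above, assuming hypothetical names and an additive load-based formula (see the grace period description in the suggested tests below); this is illustrative only, not the actual ovirt-engine implementation:

~~~
// Illustrative sketch: once the host is already non-responding and being
// soft-fenced, its load no longer matters and a fixed, configurable grace
// period is used; the load-based grace only applies while still 'connecting'.
public class GraceSelectionSketch {

    enum HostPhase { CONNECTING, NON_RESPONDING_SOFT_FENCING }

    // Assumed value standing in for the engine's configurable (1 min).
    static final long SOFT_FENCE_GRACE_SECONDS = 60;

    static long gracePeriodSeconds(HostPhase phase, boolean isSpm, int runningVms) {
        if (phase == HostPhase.NON_RESPONDING_SOFT_FENCING) {
            // SPM has already moved and the VMs are 'unknown' -> ignore host load.
            return SOFT_FENCE_GRACE_SECONDS;
        }
        // 'connecting': assumed additive form, 60s base + 20s if SPM + 0.5s per running VM.
        return 60 + (isSpm ? 20 : 0) + Math.round(0.5 * runningVms);
    }

    public static void main(String[] args) {
        System.out.println(gracePeriodSeconds(HostPhase.CONNECTING, true, 2));                  // 81
        System.out.println(gracePeriodSeconds(HostPhase.NON_RESPONDING_SOFT_FENCING, true, 2)); // 60
    }
}
~~~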
Suggested Tests:

Configuration: two running hosts A and B, A with PM configured, an active SD, several running VMs, some of them HA.

1) Host A is SPM and running several VMs, some of them HA
   - stop the vdsmd service on host A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - soft fencing is executed on A and succeeds
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

2) Host A is SPM and running several VMs, some of them HA
   - block the vdsm port on A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - soft fencing is executed and fails
   - host is rebooted after 1 min
   - VMs restored on host B, HA VMs are up
   - SPM transferred to host B

3) Host A is SPM and running several VMs, some of them HA
   - stop the vdsmd service on A
   - status set to 'connecting' for the grace period
   - start the vdsmd service on A within 1 min
   - host becomes Up
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

4) Host A is SPM and running several VMs, some of them HA
   - block the vdsm port on A
   - status set to 'connecting' for the grace period
   - host becomes non-responding
   - unblock the vdsm port on A within 1 min
   - host A is set to Up
   - VMs restored on host A, HA VMs are up
   - SPM transferred to host B

There are two grace period times:

1. When we get a network error (in 'connecting' state), the grace period is >= 60 sec according to the host load (20 sec for SPM + 0.5 sec * <number of running VMs>).
2. When the host is soft-fenced (in non-responding state, before hard fencing is executed), the grace period is exactly 1 min.
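As a concrete check of grace period [1], reading the formula as additive: an SPM host with vm_count=2 gets 60 + 20 + 0.5 * 2 = 81 seconds, which matches the "non-responsive_timeout (seconds)=81" printed by VdsManager in the log excerpt earlier in this report. Grace period [2] stays a flat 60 seconds regardless of how loaded the host is.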
Tested suggested flows in ovirt-engine-4.5.0.7-0.9.el8ev.noarch

1) After 1 minute fence agent triggered, SPM moved to other host, VMs restored
2) After 1 minute soft fencing triggered and failed, trying to restart via PM, SPM moved to other host, VMs restored
3) SPM moved to other host, back Up, VMs restored, no additional PM actions
4) SPM moved to other host, back Up, VMs restored, no additional PM actions
Sorry, the previous comment was not updated properly. Corrected result for scenario 2:

2) After 1 minute soft fencing triggered and failed, SPM moved to other host, restart via PM attempted, host restarted, HA VM restored on the other host
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.0] security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4711