Bug 2070045
Summary: | UploadStreamVDSCommand fails with java.net.SocketTimeoutException after 20 seconds | |
---|---|---|---
Product: | Red Hat Enterprise Virtualization Manager | Reporter: | nijin ashok <nashok>
Component: | vdsm | Assignee: | Mark Kemel <mkemel>
Status: | CLOSED ERRATA | QA Contact: | sshmulev
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.4.10 | CC: | ahadas, apinnick, asocha, dfodor, fdelorey, jclaretm, lsurette, mavital, mkemel, mperina, sfishbai, srevivo, ycui
Target Milestone: | ovirt-4.5.1 | Keywords: | ZStream
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | ovirt-engine-4.5.1.2 | Doc Type: | Bug Fix
Doc Text: | The host no longer enters a non-responsive state if the OVF store update operation times out because of network errors. | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2022-07-14 12:54:31 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
nijin ashok
2022-03-30 11:20:03 UTC
(In reply to nijin ashok from comment #0)
> I am unable to find which timeout is hitting here that is causing the
> SocketTimeoutException. From the tcpdump, I can see that the vdsm is closing
> the connection.

It is vdsConnectionTimeout.

Also, the delay of 20 seconds in downloadImageFromStream is not caused by storage performance: the thread was waiting for other LVM commands, executed in parallel, to complete. There was a hotunplugDisk task to unplug the disk from the backup/vendor VM, which started an LVM deactivate task. Those LVs are the ones used by the VM, so the deactivation failed with an "in use" error. However, LVM retries 25 times, so deactivating an open LV typically takes around 5 seconds. With two LVs here it took 10 seconds, and downloadImageFromStream had to wait during this time before it could run its own LVM command.

~~~
2022-03-30 01:42:27,267+0000 INFO (jsonrpc/6) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['1dea92cc-6522-4003-924c-0c7b50b1bae3', 'ae642a1e-42c9-4512-bf7f-15768b74a178'] (lvm:1802)   <--- jsonrpc/6 thread which is trying to deactivate the LVs

2022-03-30 01:42:27,355+0000 INFO (tasks/4) [storage.LVM] Activating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1791)   <--- tasks/4 downloadImageFromStream thread

2022-03-30 01:42:37,581+0000 WARN (jsonrpc/6) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/ma........ rc=5 out=[] err=[\' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:534)

2022-03-30 01:42:37,665+0000 INFO (tasks/4) [storage.StorageDomain] Creating image run directory '/run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353' (blockSD:1363)

2022-03-30 01:42:38,143+0000 INFO (tasks/4) [storage.fileUtils] Removing directory: /run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353 (fileUtils:182)

<----------- getAllVolumes() will be called here, which will run the lvs command ---->

2022-03-30 01:42:48,132+0000 WARN (jsonrpc/6) [storage.LVM] All 2 tries have failed: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', ........lobal { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--autobackup\', \'n\', \'--available\', \'n\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178\'] rc=5 out=[] err=[\' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:561)

2022-03-30 01:42:48,196+0000 INFO (tasks/4) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1802)
~~~

Shouldn't this use vdsTimeout, which is used for other tasks?
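As an aside, a minimal, self-contained sketch of the failure mode described above (this is not the engine's or vdsm's actual code; the port, payloads, and delays are made up): a server thread stays busy for 30 seconds, standing in for vdsm blocked behind the LVM retries, while the client enforces a 20-second socket timeout, standing in for the engine's vdsConnectionTimeout. The client times out while the server side is still working, which on the engine side surfaces as java.net.SocketTimeoutException.

~~~
import socket
import threading
import time

# Illustration only (not engine/vdsm code): the "server" stays busy for 30 s,
# standing in for vdsm blocked on LVM, while the client enforces a 20 s socket
# timeout, standing in for the engine's vdsConnectionTimeout.

def busy_server(listener):
    conn, _ = listener.accept()
    time.sleep(30)                    # simulates downloadImageFromStream waiting on LVM
    conn.sendall(b"done")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))       # arbitrary free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=busy_server, args=(listener,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port))
client.settimeout(20)                 # analogous to vdsConnectionTimeout=20
client.sendall(b"upload OVF")
try:
    client.recv(4096)                 # no reply arrives within 20 s
except socket.timeout:
    print("timed out after 20 s while the server side was still busy")
~~~

Raising the client timeout above the worst-case LVM delay (the engine-config workaround mentioned below) avoids the exception in this sketch as well.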
(In reply to nijin ashok from comment #0)
> Steps to Reproduce:
>
> 1. Added time.sleep(30) under deactivateImage to delay the
> downloadImageFromStream.
> 2. Created a disk in the storage domain and manually did "Update OVFs".
> 3. The UploadStreamVDSCommand fails after 20 seconds.

We have a longer timeout set for synchronous operations against VDSM, so it's interesting to see what triggers the timeout here - whether it's something in the lower-level infrastructure we use (the json-rpc library) or something else that explains the shorter timeout of 20 seconds.

This patch https://github.com/oVirt/ovirt-engine/pull/463 changes the engine behavior so that the host does not move to a non-responsive state when the OVF update fails on a timeout.

To make the OVF update succeed, vdsConnectionTimeout should be extended locally:

$ engine-config -s vdsConnectionTimeout=30

Verified.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

Verified it on an old version (ovirt-engine-4.4.10.7-0.4.el8ev) and saw that the host turned non-responsive after the "SocketTimeoutException". I can see the same exception on the latest build with the fix, but the host no longer goes non-responsive.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5555
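For reference, a rough sketch of the delay injection mentioned in step 1 of the Steps to Reproduce quoted above. It is only an illustration: the wrapper name is made up, the exact location and signature of deactivateImage in vdsm depend on the version, and the 30-second sleep matches the value the reporter used.

~~~
import time

# Hypothetical helper for step 1 of the reproducer: delay image deactivation by
# 30 seconds so that the OVF upload (downloadImageFromStream on the vdsm side)
# is still waiting when the engine's 20-second vdsConnectionTimeout expires.
# The wrapper below is illustrative; in vdsm the real deactivateImage method
# would be patched in place for the test.

def with_artificial_delay(deactivate_image, delay=30):
    def wrapper(*args, **kwargs):
        time.sleep(delay)             # stands in for slow LV deactivation
        return deactivate_image(*args, **kwargs)
    return wrapper

# Usage (illustrative only): hsm.deactivateImage = with_artificial_delay(hsm.deactivateImage)
~~~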