Description of problem:

downloadImageFromStream took 21 seconds to complete: 10 seconds to activate the LV and another 10 seconds for getAllVolumes().

~~~
2022-03-30 01:42:27,161+0000 INFO (Thread-263) [vdsm.api] START downloadImageFromStream(methodArgs={'fileObj': <_io.BufferedReader name=154>, 'length': 680448}, callback=<function ImageRequestHandler._createEventWithCallback.<locals>.setCallback at 0x7fd6847aa840>, spUUID='4ae7ba94-91a4-11ec-9bd7-00163e3b7ed0', sdUUID='b109b83d-f184-43bc-b4bd-ac608ca42215', imgUUID='14cdc8ca-5671-4339-b2ff-bc0aa604d353', volUUID='5379bf1a-8239-4252-8478-e08cca50a11e') from=internal, task_id=bbfe9e30-827f-4950-9891-c5da42faba46 (api:48)

2022-03-30 01:42:27,355+0000 INFO (tasks/4) [storage.LVM] Activating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1791)

< ---- 10 seconds --->

2022-03-30 01:42:37,665+0000 INFO (tasks/4) [storage.StorageDomain] Creating image run directory '/run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353' (blockSD:1363)

2022-03-30 01:42:38,143+0000 INFO (tasks/4) [storage.fileUtils] Removing directory: /run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353 (fileUtils:182)

< ---- 10 seconds --->

2022-03-30 01:42:48,196+0000 INFO (tasks/4) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1802)

2022-03-30 01:42:48,345+0000 INFO (tasks/4) [storage.ThreadPool.WorkerThread] FINISH task bbfe9e30-827f-4950-9891-c5da42faba46 (threadPool:148)
~~~

However, the engine timed out before this, after 20 seconds, with a "java.net.SocketTimeoutException" error:

~~~
2022-03-30 01:42:27,146Z INFO [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [6093d872] START, UploadStreamVDSCommand(HostName =, UploadStreamVDSCommandParameters:{hostId='3835317d-34be-4844-a091-2b13e8be4a59'}), log id: bc3440e

2022-03-30 01:42:47,265Z ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [6093d872] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

I am unable to find which timeout is being hit here and causing the SocketTimeoutException. From the tcpdump, I can see that vdsm is closing the connection. The host was moved to "not responding" status and was subsequently fenced after this error.

Version-Release number of selected component (if applicable):
rhvm-4.4.10.6-0.1.el8ev.noarch

How reproducible:
Observed twice in the customer environment; can be reproduced manually by delaying downloadImageFromStream.

Steps to Reproduce:
1. Added time.sleep(30) under deactivateImage to delay downloadImageFromStream (see the sketch at the end of this comment).
2. Created a disk in the storage domain and manually ran "Update OVFs".
3. The UploadStreamVDSCommand fails after 20 seconds.

Actual results:
UploadStreamVDSCommand fails with java.net.SocketTimeoutException after 20 seconds.

Expected results:

Additional info:
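For reference, the delay injection from step 1 can be sketched roughly as below. This is a minimal illustration only, assuming an HSM-style deactivateImage method; it is not the actual vdsm code that was patched.

~~~
# Illustrative sketch of the reproducer's delay injection -- not the actual
# vdsm source. "deactivateImage" here stands for the HSM verb the sleep was
# added under; the surrounding code in vdsm differs.
import time

class HSM:
    def deactivateImage(self, sdUUID, spUUID, imgUUID):
        # Artificial 30-second delay so that the OVF upload's
        # downloadImageFromStream call exceeds the engine's read timeout.
        time.sleep(30)
        # ... the original deactivation logic would follow here ...
~~~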
(In reply to nijin ashok from comment #0)
> I am unable to find which timeout is being hit here and causing the
> SocketTimeoutException. From the tcpdump, I can see that vdsm is closing
> the connection.

It is vdsConnectionTimeout.

Also, the ~20-second delay in downloadImageFromStream is not caused by storage performance; the thread was waiting for other LVM commands, executed in parallel, to complete. There was a hotunplugDisk task to unplug the disks from the backup/vendor VM, which started an LVM deactivate operation. Those LVs are the ones in use by the VM, and hence the deactivation failed with an "in use" error. However, LVM retries 25 times, so deactivating an open LV typically takes around 5 seconds. With two LVs here, it took 10 seconds to complete, and downloadImageFromStream had to wait during this time before it could run its own LVM command (a rough timing model is sketched after this comment).

~~~
2022-03-30 01:42:27,267+0000 INFO (jsonrpc/6) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['1dea92cc-6522-4003-924c-0c7b50b1bae3', 'ae642a1e-42c9-4512-bf7f-15768b74a178'] (lvm:1802)   <--- jsonrpc/6 thread trying to deactivate the LVs

2022-03-30 01:42:27,355+0000 INFO (tasks/4) [storage.LVM] Activating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1791)   <--- tasks/4 downloadImageFromStream thread

2022-03-30 01:42:37,581+0000 WARN (jsonrpc/6) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/ma........ rc=5 out=[] err=[\' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:534)

2022-03-30 01:42:37,665+0000 INFO (tasks/4) [storage.StorageDomain] Creating image run directory '/run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353' (blockSD:1363)

2022-03-30 01:42:38,143+0000 INFO (tasks/4) [storage.fileUtils] Removing directory: /run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353 (fileUtils:182)

< ----------- getAllVolumes() is called here, which runs the lvs command ---->

2022-03-30 01:42:48,132+0000 WARN (jsonrpc/6) [storage.LVM] All 2 tries have failed: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', ........lobal { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--autobackup\', \'n\', \'--available\', \'n\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178\'] rc=5 out=[] err=[\' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \' Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:561)

2022-03-30 01:42:48,196+0000 INFO (tasks/4) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1802)
~~~

Shouldn't this use vdsTimeout, which is used for other tasks?
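For illustration, a rough back-of-the-envelope model of the delay described above. The retry count comes from this comment ("LVM retries 25 times"); the per-attempt pause is an assumed value chosen to match the observed ~5 seconds per open LV, not a figure taken from LVM's configuration or source.

~~~
# Rough timing model for the "in use" deactivation delay described above.
RETRIES = 25              # from the comment: LVM retries 25 times
PAUSE_PER_ATTEMPT = 0.2   # seconds per retry -- assumption, not from LVM
OPEN_LVS = 2              # the two in-use LVs being deactivated

per_lv = RETRIES * PAUSE_PER_ATTEMPT   # ~5 s per open LV
total = per_lv * OPEN_LVS              # ~10 s holding the LVM operation lock

print(f"~{per_lv:.0f}s per LV, ~{total:.0f}s total before "
      "downloadImageFromStream can run its own LVM command")
~~~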
(In reply to nijin ashok from comment #0)
> Steps to Reproduce:
>
> 1. Added time.sleep(30) under deactivateImage to delay
> downloadImageFromStream.
> 2. Created a disk in storage domain and manually did "Update OVFs".
> 3. The UploadStreamVDSCommand fails after 20 seconds.

We have a longer timeout set for synchronous operations against VDSM, so it's interesting to see what triggers the timeout here - whether it's something in the lower-level infrastructure we use (the json-rpc library) or something else that can explain the shorter timeout of 20 seconds.
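To make the failure mode concrete, here is a small, self-contained illustration (in Python, not the engine's actual Java/json-rpc code) of how a client-side read timeout that is shorter than the server's processing time produces exactly this kind of error, even though the server would eventually respond:

~~~
# Conceptual illustration only: a client whose socket read timeout is shorter
# than the server's processing time gets a read timeout, even though the
# server would eventually answer.
import socket
import threading
import time

def slow_server(listener):
    conn, _ = listener.accept()
    time.sleep(21)            # the server "works" for 21 s, like the vdsm task
    conn.sendall(b"done")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=slow_server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
client.settimeout(20)         # analogous to a 20-second read timeout
try:
    client.recv(4)            # raises socket.timeout after 20 s
except socket.timeout:
    print("read timed out after 20s, like java.net.SocketTimeoutException")
~~~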
This patch https://github.com/oVirt/ovirt-engine/pull/463 changes the engine behavior so that the host does not move to the non-responsive state when the OVF update fails on a timeout.

To make the OVF update itself succeed, vdsConnectionTimeout should be extended locally:

$ engine-config -s vdsConnectionTimeout=30
Verified.

Versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

Verified first on an old version (ovirt-engine-4.4.10.7-0.4.el8ev) and saw that the host turned non-responsive after the "SocketTimeoutException". I can see the same exception on the latest build with the fix, but the host does not go non-responsive.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5555