Bug 2070045 - UploadStreamVDSCommand fails with java.net.SocketTimeoutException after 20 seconds
Summary: UploadStreamVDSCommand fails with java.net.SocketTimeoutException after 20 seconds
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.10
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Mark Kemel
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-03-30 11:20 UTC by nijin ashok
Modified: 2022-12-08 09:09 UTC
CC: 13 users

Fixed In Version: ovirt-engine-4.5.1.2
Doc Type: Bug Fix
Doc Text:
The host no longer enters a non-responsive state if the OVF store update operation times out because of network errors.
Clone Of:
Environment:
Last Closed: 2022-07-14 12:54:31 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID | Status | Summary | Last Updated
GitHub oVirt ovirt-engine pull 463 | open | core: ignore UploadStreamVDSCommand network failure | 2022-06-15 09:04:18 UTC
Red Hat Issue Tracker RHV-45480 | None | None | 2022-03-30 11:20:54 UTC
Red Hat Product Errata RHSA-2022:5555 | None | None | 2022-07-14 12:55:15 UTC

Description nijin ashok 2022-03-30 11:20:03 UTC
Description of problem:

The downloadImageFromStream task took 21 seconds to complete: 10 seconds to activate the LV and another 10 seconds for getAllVolumes().

~~~
2022-03-30 01:42:27,161+0000 INFO  (Thread-263) [vdsm.api] START downloadImageFromStream(methodArgs={'fileObj': <_io.BufferedReader name=154>, 'length': 680448}, callback=<function ImageRequestHandler._createEventWithCallback.<locals>.setCallback at 0x7fd6847aa840>, spUUID='4ae7ba94-91a4-11ec-9bd7-00163e3b7ed0', sdUUID='b109b83d-f184-43bc-b4bd-ac608ca42215', imgUUID='14cdc8ca-5671-4339-b2ff-bc0aa604d353', volUUID='5379bf1a-8239-4252-8478-e08cca50a11e') from=internal, task_id=bbfe9e30-827f-4950-9891-c5da42faba46 (api:48)
2022-03-30 01:42:27,355+0000 INFO  (tasks/4) [storage.LVM] Activating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1791)

< ---- 10 seconds --->

2022-03-30 01:42:37,665+0000 INFO  (tasks/4) [storage.StorageDomain] Creating image run directory '/run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353' (blockSD:1363)

2022-03-30 01:42:38,143+0000 INFO  (tasks/4) [storage.fileUtils] Removing directory: /run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353 (fileUtils:182)

< ---- 10 seconds --->

2022-03-30 01:42:48,196+0000 INFO  (tasks/4) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1802)
2022-03-30 01:42:48,345+0000 INFO  (tasks/4) [storage.ThreadPool.WorkerThread] FINISH task bbfe9e30-827f-4950-9891-c5da42faba46 (threadPool:148)
~~~
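
For reference, the ~21 seconds can be confirmed directly from the vdsm timestamps quoted above; a minimal sketch (timestamps copied from the log):

~~~
# Minimal sketch: compute the gaps between the vdsm log timestamps above.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"
start  = datetime.strptime("2022-03-30 01:42:27,161", FMT)  # START downloadImageFromStream
mkdir  = datetime.strptime("2022-03-30 01:42:37,665", FMT)  # Creating image run directory
finish = datetime.strptime("2022-03-30 01:42:48,345", FMT)  # FINISH task

print((mkdir - start).total_seconds())   # ~10.5 s: LV activation
print((finish - mkdir).total_seconds())  # ~10.7 s: getAllVolumes() and teardown
print((finish - start).total_seconds())  # ~21.2 s total, past the engine's 20 s timeout
~~~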

However, the engine timed out before this, after 20 seconds, with a "java.net.SocketTimeoutException" error.

~~~
2022-03-30 01:42:27,146Z INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.UploadStreamVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [6093d872] START, UploadStreamVDSCommand(HostName =, UploadStreamVDSCommandParameters:{hostId='3835317d-34be-4844-a091-2b13e8be4a59'}), log id: bc3440e

2022-03-30 01:42:47,265Z ERROR [org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [6093d872] Command 'org.ovirt.engine.core.bll.storage.ovfstore.UploadStreamCommand' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.SocketTimeoutException: Read timed out (Failed with error VDS_NETWORK_ERROR and code 5022)
~~~

I am unable to find which timeout is being hit here that causes the SocketTimeoutException. From the tcpdump, I can see that vdsm is closing the connection.

The host was moved to "not responding" status and was subsequently fenced after this error.


Version-Release number of selected component (if applicable):

rhvm-4.4.10.6-0.1.el8ev.noarch

How reproducible:

Observed twice in the customer environment; it can be reproduced manually by delaying downloadImageFromStream.

Steps to Reproduce:

1. Added time.sleep(30) under deactivateImage to delay downloadImageFromStream (see the sketch after these steps).
2. Created a disk in a storage domain and manually ran "Update OVFs".
3. The UploadStreamVDSCommand fails after 20 seconds.
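
A minimal sketch of the delay injection from step 1 (the placement of deactivateImage on vdsm.storage.hsm.HSM and the wrapper form are assumptions; on the test host the sleep was added directly in the vdsm source):

~~~
# Hedged sketch of the step-1 reproducer: delay vdsm's deactivateImage so the
# downloadImageFromStream flow exceeds the engine's 20-second timeout.
# The exact signature is not assumed; the wrapper forwards *args/**kwargs.
import time
from vdsm.storage.hsm import HSM  # assumed location of deactivateImage

_orig_deactivate = HSM.deactivateImage

def _slow_deactivate(self, *args, **kwargs):
    time.sleep(30)  # injected delay, as described in step 1
    return _orig_deactivate(self, *args, **kwargs)

HSM.deactivateImage = _slow_deactivate
~~~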

Actual results:

UploadStreamVDSCommand fails with java.net.SocketTimeoutException after 20 seconds.

Expected results:


Additional info:

Comment 2 nijin ashok 2022-03-31 08:24:15 UTC
(In reply to nijin ashok from comment #0)
> I am unable to find which timeout is being hit here that causes the
> SocketTimeoutException. From the tcpdump, I can see that vdsm is closing
> the connection.

It is vdsConnectionTimeout.

Also, the 20-second delay in downloadImageFromStream was not caused by storage performance; the thread was waiting for other LVM commands, executed in parallel, to complete.

There was a hotunplugDisk task to unplug the disk from the backup/vendor VM, which started an LVM deactivate task. These LVs were in use by the VM, so the deactivation failed with an "in use" error. However, LVM retries 25 times, so any deactivate operation on an open LV typically takes around 5 seconds to complete. With two LVs here, it took 10 seconds, and downloadImageFromStream had to wait during this time before it could run its own LVM command.

~~~
2022-03-30 01:42:27,267+0000 INFO  (jsonrpc/6) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['1dea92cc-6522-4003-924c-0c7b50b1bae3', 'ae642a1e-42c9-4512-bf7f-15768b74a178'] (lvm:1802)  <--- jsonrpc/6 thread that is trying to deactivate the LVs.

2022-03-30 01:42:27,355+0000 INFO  (tasks/4) [storage.LVM] Activating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1791)  <----- tasks/4 downloadImageFromStream thread 

2022-03-30 01:42:37,581+0000 WARN  (jsonrpc/6) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/ma........ rc=5 out=[] err=[\'  Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \'  Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:534)

2022-03-30 01:42:37,665+0000 INFO  (tasks/4) [storage.StorageDomain] Creating image run directory '/run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353' (blockSD:1363)

2022-03-30 01:42:38,143+0000 INFO  (tasks/4) [storage.fileUtils] Removing directory: /run/vdsm/storage/b109b83d-f184-43bc-b4bd-ac608ca42215/14cdc8ca-5671-4339-b2ff-bc0aa604d353 (fileUtils:182)

< ----------- getAllVolumes() will be called here that will run lvs command ---->

2022-03-30 01:42:48,132+0000 WARN  (jsonrpc/6) [storage.LVM] All 2 tries have failed: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', ........lobal {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--autobackup\', \'n\', \'--available\', \'n\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3\', \'b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178\'] rc=5 out=[] err=[\'  Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/1dea92cc-6522-4003-924c-0c7b50b1bae3 in use.\', \'  Logical volume b109b83d-f184-43bc-b4bd-ac608ca42215/ae642a1e-42c9-4512-bf7f-15768b74a178 in use.\']' (lvm:561)

2022-03-30 01:42:48,196+0000 INFO  (tasks/4) [storage.LVM] Deactivating lvs: vg=b109b83d-f184-43bc-b4bd-ac608ca42215 lvs=['5379bf1a-8239-4252-8478-e08cca50a11e'] (lvm:1802)
~~~
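
As a quick sketch of the arithmetic above (the per-retry interval is an assumption; only the 25-retry count and the observed ~5 s per LV come from the comment):

~~~
# Back-of-envelope for the 10-second gap explained above.
retries_per_lv = 25       # LVM retries for an in-use LV, per the comment
seconds_per_retry = 0.2   # assumption, chosen to match the observed ~5 s per LV
open_lvs = 2

per_lv = retries_per_lv * seconds_per_retry
total = open_lvs * per_lv
print(f"~{per_lv:.0f} s per LV, ~{total:.0f} s total")  # ~5 s per LV, ~10 s total
~~~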

Shouldn't this timeout be vdsTimeout, which is used for other tasks?

Comment 3 Arik 2022-04-04 14:44:56 UTC
(In reply to nijin ashok from comment #0)
> Steps to Reproduce:
> 
> 1. Added time.sleep(30) under deactivateImage to delay
> downloadImageFromStream.
> 2. Created a disk in storage domain and manually did "Update OVFs".
> 3. The UploadStreamVDSCommand fails after 20 seconds.

We have a longer timeout set for synchronous operations against VDSM.
So it's interesting to see what triggers the timeout here - whether it's something in the lower-level infrastructure we use (the json-rpc library) or something else that can explain the shorter timeout of 20 seconds.

Comment 13 Mark Kemel 2022-06-20 13:25:03 UTC
This patch, https://github.com/oVirt/ovirt-engine/pull/463, changes the engine behavior so that the host does not move to a non-responsive state if the OVF update fails on a timeout.
To make the OVF update succeed, vdsConnectionTimeout should be extended locally:
$ engine-config -s vdsConnectionTimeout=30
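
If the workaround needs to be scripted, a minimal sketch (the ovirt-engine restart is an assumption: engine-config changes generally require a service restart to take effect):

~~~
# Hedged sketch: apply the local workaround from this comment on the engine host.
import subprocess

subprocess.run(["engine-config", "-s", "vdsConnectionTimeout=30"], check=True)
# Assumption: a restart is needed for the new value to take effect.
subprocess.run(["systemctl", "restart", "ovirt-engine"], check=True)
~~~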

Comment 14 sshmulev 2022-06-23 12:11:02 UTC
Verified.

versions:
engine-4.5.1.2-0.11.el8ev
vdsm-4.50.1.3-1.el8ev

Verified on an old version (ovirt-engine-4.4.10.7-0.4.el8ev) that the host turned non-responsive after the "SocketTimeoutException".
The same exception appears on the latest build with the fix, but the host no longer goes non-responsive.

Comment 18 errata-xmlrpc 2022-07-14 12:54:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) [ovirt-4.5.1] security, bug fix and update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5555

