Bug 1993085 - OST: Timeout while starting NBD server
Summary: OST: Timeout while starting NBD server
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.80.4
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ovirt-4.4.9
: 4.40.90.1
Assignee: shani
QA Contact: Avihai
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-08-12 11:23 UTC by shani
Modified: 2021-12-23 16:35 UTC (History)
3 users (show)

Fixed In Version: vdsm-4.40.90.1
Clone Of:
Environment:
Last Closed: 2021-10-21 07:27:14 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43014 0 None None None 2021-08-12 11:25:04 UTC
Red Hat Issue Tracker RHV-43015 0 None None None 2021-08-12 11:33:03 UTC
oVirt gerrit 116177 0 master MERGED nbd: changing timeout on starting nbd server 2021-08-16 17:50:56 UTC

Description shani 2021-08-12 11:23:14 UTC
Description of problem:
It seems that 1 second might be too quick on OST.
Usually qemu-nbd starts in 20 milliseconds,
In case of a timeout, the following error appear on the vdsm log:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-233>", line 2, in start_nbd_server
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3886, in start_nbd_server
    url = nbd.start_server(server_id, config)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 155, in start_server
    .format(server_id, config))
vdsm.storage.nbd.Timeout: Timeout starting NBD server e0fa361f-76f9-499e-aca8-4341dbf9f951: {'discard': False, 'readonly': False, 'bitmap': None, 'sd_id': '3b943471-b6c1-4ed1-baa3-0be92d8366ab', 'img_id': 'dfc6432b-d86f-4607-a2c2-747b262f3dbc', 'vol_id': 'a557d5db-f03f-45bf-9810-3571554fc940', 'backing_chain': True}


Version-Release number of selected component (if applicable):
4.40.80.4

How reproducible:
Once in a while - not constantly (or easily) reproduced.

Comment 2 Eyal Shenitzky 2021-08-17 09:45:34 UTC
Shani, are there any steps for the QE to reproduce this issue?

Comment 3 shani 2021-08-17 10:15:51 UTC
This one happened once on OST, so it required involving to reproduce.
I've edited the vdsm compiled files and used time.sleep(2) instead of 0.02 in nbd.py#_wait_for_socket.
(/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py)

Since the timeout appear after 1 second, I believe that stopping a service (not sure if the vdsm service is the one needed)
should be done extremely fast before the connection to the nbd server will be established.

Comment 4 Eyal Shenitzky 2021-08-31 06:25:53 UTC
Shani, can you please verify this bug with a proper explanation?

Comment 5 shani 2021-08-31 08:19:23 UTC
This one may occur in case of a timeout while starting the nbd server.
Usually, qemu-nbd starts in 20 milliseconds, so one second seems a good timeout, 
but in the case of some overloaded environments (like ost), one second is not always enough and the nbd server reaches a timeout.

In order to trigger such a timeout, for example, we can use 'sleep' on the vdsm nbd.py file [1]:
vi /usr/lib/python3.6/site-packages/vdsm/storage/nbd.py
in def _wait_for_socket > change `time.sleep(0.02)` to `time.sleep(2)`

This way, we exceeded the 1-second limit and causing a timeout.

[1] https://github.com/oVirt/vdsm/blob/802caa1b6bfbd4f08933a48d739883ef11c82545/lib/vdsm/storage/nbd.py#L521


This patch changes the timeout to 10.0 seconds.

Comment 6 Sandro Bonazzola 2021-10-21 07:27:14 UTC
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.