Description of problem: It seems that 1 second might be too quick on OST. Usually qemu-nbd starts in 20 milliseconds, In case of a timeout, the following error appear on the vdsm log: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper result = ctask.prepare(func, *args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper return m(self, *a, **kw) File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare raise self.error File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run return fn(*args, **kargs) File "<decorator-gen-233>", line 2, in start_nbd_server File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method ret = func(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3886, in start_nbd_server url = nbd.start_server(server_id, config) File "/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py", line 155, in start_server .format(server_id, config)) vdsm.storage.nbd.Timeout: Timeout starting NBD server e0fa361f-76f9-499e-aca8-4341dbf9f951: {'discard': False, 'readonly': False, 'bitmap': None, 'sd_id': '3b943471-b6c1-4ed1-baa3-0be92d8366ab', 'img_id': 'dfc6432b-d86f-4607-a2c2-747b262f3dbc', 'vol_id': 'a557d5db-f03f-45bf-9810-3571554fc940', 'backing_chain': True} Version-Release number of selected component (if applicable): 4.40.80.4 How reproducible: Once in a while - not constantly (or easily) reproduced.
Shani, are there any steps for the QE to reproduce this issue?
This one happened once on OST, so it required involving to reproduce. I've edited the vdsm compiled files and used time.sleep(2) instead of 0.02 in nbd.py#_wait_for_socket. (/usr/lib/python3.6/site-packages/vdsm/storage/nbd.py) Since the timeout appear after 1 second, I believe that stopping a service (not sure if the vdsm service is the one needed) should be done extremely fast before the connection to the nbd server will be established.
Shani, can you please verify this bug with a proper explanation?
This one may occur in case of a timeout while starting the nbd server. Usually, qemu-nbd starts in 20 milliseconds, so one second seems a good timeout, but in the case of some overloaded environments (like ost), one second is not always enough and the nbd server reaches a timeout. In order to trigger such a timeout, for example, we can use 'sleep' on the vdsm nbd.py file [1]: vi /usr/lib/python3.6/site-packages/vdsm/storage/nbd.py in def _wait_for_socket > change `time.sleep(0.02)` to `time.sleep(2)` This way, we exceeded the 1-second limit and causing a timeout. [1] https://github.com/oVirt/vdsm/blob/802caa1b6bfbd4f08933a48d739883ef11c82545/lib/vdsm/storage/nbd.py#L521 This patch changes the timeout to 10.0 seconds.
This bugzilla is included in oVirt 4.4.9 release, published on October 20th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.9 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.