Description of problem:

When uploading a disk with the upload_disk.py script, the upload fails with an illegal device error. This is caused by a regression that changed the order of the stopNbdServer call [1].

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Upload a disk via the upload_disk.py script

Actual results:
Uploading fails.

Expected results:
Upload succeeds.

Additional info:
[1] https://gerrit.ovirt.org/#/c/110417/
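For context, this is roughly the client-side flow that upload_disk.py drives via the Python SDK (a minimal sketch, assuming ovirtsdk4 is installed; the engine URL, credentials, and sizes are placeholders, and polling/error handling is elided). The finalize call at the end is what triggers the engine-side teardown that breaks here:

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",  # placeholder
    username="admin@internal",
    password="password",                                # placeholder
    ca_file="ca.pem",
)

# Create the target disk on the storage domain.
disks_service = connection.system_service().disks_service()
disk = disks_service.add(
    types.Disk(
        name="fedora-31.qcow2",
        format=types.DiskFormat.COW,
        provisioned_size=6 * 1024**3,
        sparse=True,
        storage_domains=[types.StorageDomain(name="nfs_0")],
    )
)

# Start an upload transfer for the disk; vdsm starts a transient qemu-nbd
# service and ovirt-imageio gets a ticket pointing at its unix socket.
transfers_service = connection.system_service().image_transfers_service()
transfer = transfers_service.add(
    types.ImageTransfer(
        disk=types.Disk(id=disk.id),
        direction=types.ImageTransferDirection.UPLOAD,
    )
)
transfer_service = transfers_service.image_transfer_service(transfer.id)

# ... wait until the transfer phase is TRANSFERRING and PUT the image data
# to transfer.transfer_url (upload_disk.py uses the ovirt-imageio client) ...

# Finalizing triggers the engine-side teardown: remove the imageio ticket,
# stop the NBD server, verify the uploaded volume. The regression breaks
# the order of these steps.
transfer_service.finalize()
connection.close()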
Created attachment 1712565 [details] VDSM Log
Steven, can you attach the engine log showing this failure?
The issue is that qemu-img info cannot lock the image - another process is holding a lock on this image:

2020-08-25 15:36:30,400+0300 ERROR (jsonrpc/7) [storage.Dispatcher] FINISH verify_untrusted_volume error=Command ['/usr/bin/prlimit', '--cpu=30', '--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json', '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n' (dispatcher:87)

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1198, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-65>", line 2, in verify_untrusted_volume
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1571, in verify_untrusted_volume
    qemu_info = qemuimg.info(vol.getVolumePath(), trusted_image=False)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 115, in info
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 488, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/prlimit', '--cpu=30', '--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json', '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'
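This lock conflict is easy to reproduce outside of vdsm: as long as qemu-nbd has the image open read-write, plain qemu-img info (which takes a shared "write" lock) fails with exactly this error. A minimal sketch using an illustrative local test image (paths are placeholders):

import subprocess
import time

image = "/var/tmp/lock-test.qcow2"   # illustrative test image
sock = "/var/tmp/lock-test.sock"     # illustrative unix socket

subprocess.run(["qemu-img", "create", "-f", "qcow2", image, "1G"], check=True)

# Serve the image read-write over a unix socket, like the vdsm NBD server does.
nbd = subprocess.Popen(
    ["qemu-nbd", "--socket", sock, "--format", "qcow2", "--persistent", image])
time.sleep(1)

# While qemu-nbd holds the image, qemu-img info cannot take the shared
# "write" lock - the same failure verify_untrusted_volume hits above.
result = subprocess.run(
    ["qemu-img", "info", "--output", "json", image],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
print(result.returncode)   # 1
print(result.stderr)       # ... Failed to get shared "write" lock ...

nbd.terminate()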
The relevant flow is:

1. Starting the NBD server for this disk:

2020-08-25 15:36:00,929+0300 INFO (jsonrpc/3) [storage.nbd] Starting transient service vdsm-nbd-60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.service, serving volume 7119bdb0-64f0-4b42-b885-316cb28913f9/bff05295-a99d-4d0f-a6f0-0783257358a7 via unix socket /run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock (nbd:103)

2. Adding the image ticket for this disk:

2020-08-25 15:36:01,126+0300 INFO (jsonrpc/1) [vdsm.api] START add_image_ticket(ticket={'dirty': False, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock'}) from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=cd5d690a-0f46-4edc-b7e9-063b970c13a3 (api:48)

3. Monitoring the ticket status:

2020-08-25 15:36:04,300+0300 INFO (jsonrpc/0) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339450, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 2415788032}} from=::ffff:10.46.10.158,43758, flow ...
...
2020-08-25 15:36:10,261+0300 INFO (jsonrpc/4) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339456, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 5997658112}} from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=50e3cb6b-c1f2-4334-a596-bbd8d3618f0e (api:54)
...
2020-08-25 15:36:20,024+0300 INFO (jsonrpc/2) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339465, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 13247512576}} from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=2ec2e34d-0bd2-4294-8766-ebd808c4c6e2 (api:54)

4. Removing the image ticket:

2020-08-25 15:36:30,329+0300 INFO (jsonrpc/1) [vdsm.api] START remove_image_ticket(uuid='60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9') from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:48)
2020-08-25 15:36:30,334+0300 INFO (jsonrpc/1) [vdsm.api] FINISH remove_image_ticket return=None from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:54)
2020-08-25 15:36:30,334+0300 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.remove_image_ticket succeeded in 0.01 seconds (__init__:312)

At this point, if you are using ovirt-imageio >= 2.0.10-1 (make sure you do!), the connection to qemu-nbd is closed and we can stop the NBD server.

5. Stopping the NBD server - the engine did not send this request. This is the root cause (see the sketch after this list for the expected order).
6. Verifying the image:

2020-08-25 15:36:30,366+0300 INFO (jsonrpc/7) [vdsm.api] START verify_untrusted_volume(spUUID='602d1c80-e56d-11ea-8959-001a4a16039e', sdUUID='7119bdb0-64f0-4b42-b885-316cb28913f9', imgUUID='5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74', volUUID='bff05295-a99d-4d0f-a6f0-0783257358a7') from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=46c8795f-98e2-4892-a60a-3a154684e98f (api:48)

This breaks because qemu-nbd is still running. The engine did not stop the transfer properly.
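For reference, the teardown the engine is expected to drive once the client finalizes the transfer, written as a rough Python sketch. Here vdsm_call is a stand-in for the engine's jsonrpc client, and the verb names mirror the ones visible in this vdsm.log; the NBD stop verb name is an assumption, and the actual fix is in the engine flow changed by [1]:

def finalize_upload(vdsm_call, ticket_id, nbd_server_id,
                    sp_id, sd_id, img_id, vol_id):
    # 1. Remove the imageio ticket so no new client connections are accepted
    #    (Host.remove_image_ticket in the log above).
    vdsm_call("Host.remove_image_ticket", uuid=ticket_id)

    # 2. Stop the transient qemu-nbd service. This is the step the engine
    #    skipped here, leaving qemu-nbd holding the image open.
    vdsm_call("NBD.stop_server", server_id=nbd_server_id)  # assumed verb name

    # 3. Only now can qemu-img info take its shared "write" lock, so the
    #    verification of the uploaded volume succeeds.
    vdsm_call("verify_untrusted_volume",
              spUUID=sp_id, sdUUID=sd_id, imgUUID=img_id, volUUID=vol_id)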
Steven, can you test the attached patch? Does it solve the issue in your environment?
This is a regression, introduced in ovirt-engine 4.4.2. It mostly affects developers who do not run ovirt-imageio on the engine host. It is less likely to affect users, since ovirt-imageio is always running there, but if the ovirt-imageio service is stopped for some reason, they may hit the same issue. The result is a running NBD server that nobody stops, with an active LV that nobody can remove (since qemu-nbd never exits). This is a pretty bad situation. I think we should have this in the next 4.4.2 build.
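Until the fix is deployed, a hung transfer can be cleaned up manually on the host: the leftover qemu-nbd runs as a transient systemd unit named after the transfer (vdsm-nbd-<transfer id>.service, as seen in the vdsm log above), so stopping that unit terminates qemu-nbd and releases the image. A rough sketch (unit naming taken from the log; use with care on a production host):

import subprocess

# Find leftover transient NBD units started by vdsm.
out = subprocess.run(
    ["systemctl", "list-units", "vdsm-nbd-*", "--no-legend", "--plain"],
    stdout=subprocess.PIPE, universal_newlines=True, check=True).stdout

for line in out.splitlines():
    unit = line.split()[0]
    print("Stopping leftover NBD server:", unit)
    # Stopping the transient unit terminates qemu-nbd, releasing the image
    # lock so the volume (or LV) can finally be torn down.
    subprocess.run(["systemctl", "stop", unit], check=True)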
This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.
If it’s not affecting users, it’s not a blocker for the release. It’s too late for 4.4.2.
Created attachment 1712643 [details] engine log
As Michal mentioned above it’s too late for 4.4.2, moving this bug to 4.4.3.
Verified on rhv-4.4.3-4

[root@storage-ge5-vdsm3 examples]# python3 upload_disk.py ~/fedora-31.img --engine-url https://storage-ge-05xxx --username admin@internal --disk-format qcow2 --disk-sparse --sd-name nfs_0 -c ~/ca.pem
Checking image...
Image format: raw
Disk format: cow
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 1281425408
Disk name: fedora-31.qcow2
Disk backup: False
Connecting...
Password:
Creating disk...
Disk ID: 74ed499d-c8e7-406f-91fd-6e7cd46c9b83
Creating image transfer...
Transfer ID: 60ea83d8-38e6-48ac-a876-d531f793d4b5
Transfer host name: host_mixed_3
Uploading image...
[ 100.00% ] 6.00 GiB, 7.11 seconds, 864.08 MiB/s
Finalizing image transfer...
Upload completed successfully
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.