Bug 1872383

| Field | Value |
|---|---|
| Summary | Uploading disks via the upload_disk script fails due to an engine regression. |
| Product | [oVirt] ovirt-engine |
| Component | Backend.Core |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 4.4.0 |
| Target Milestone | ovirt-4.4.3 |
| Target Release | 4.4.3.3 |
| Fixed In Version | ovirt-engine-4.4.3.3 |
| Reporter | Steven Rosenberg <srosenbe> |
| Assignee | Nir Soffer <nsoffer> |
| QA Contact | Ilan Zuckerman <izuckerm> |
| CC | aoconnor, bugs, michal.skrivanek, sfishbai |
| Keywords | Regression |
| Flags | pm-rhel: ovirt-4.4+; aoconnor: blocker- |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Type | Bug |
| oVirt Team | Storage |
| Last Closed | 2020-11-11 06:42:04 UTC |

Doc Text:

- Cause: When uploading a disk to oVirt with the upload_disk script, a previous change meant to avoid duplicate code prevented the stopNbdServer function from being called: the removeImageTicketFromProxy function failed to remove a partial ticket and returned early, before stopNbdServer could run.
- Consequence: The image was still locked when it needed to be read during image verification.
- Fix: When removeImageTicketFromProxy reports failure, the return value is now ignored and teardown continues, so stopNbdServer is always called.
- Result: The upload_disk script uploads the disk successfully.
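The fix described in the Doc Text is a small control-flow change in the engine's transfer teardown. The real code is Java; the function names below mirror the ones named above, but everything else is a hypothetical Python sketch:

```python
import logging

log = logging.getLogger("teardown")

class TicketRemovalError(Exception):
    """Stands in for the failure removeImageTicketFromProxy can report."""

def remove_image_ticket_from_proxy(ticket_id):
    """Placeholder for the engine call that removes the imageio ticket."""

def stop_nbd_server(server_id):
    """Placeholder for the engine call that stops vdsm's NBD server."""

def tear_down_transfer(ticket_id, server_id):
    # Before the fix, a failure here returned early, so stop_nbd_server()
    # was never reached and qemu-nbd kept the image write-locked.
    try:
        remove_image_ticket_from_proxy(ticket_id)
    except TicketRemovalError:
        # The fix: log and ignore the failure so teardown continues.
        log.warning("Removing ticket %s failed; continuing teardown", ticket_id)
    # Always reached now; qemu-nbd exits and releases the image lock.
    stop_nbd_server(server_id)
```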
Description (Steven Rosenberg, 2020-08-25 15:53:14 UTC)

Created attachment 1712565 [details]
VDSM Log
Steven, can you attach the engine log showing this failure?

The issue is that qemu-img info cannot lock the image; another process is holding a lock on this image:
```
2020-08-25 15:36:30,400+0300 ERROR (jsonrpc/7) [storage.Dispatcher] FINISH verify_untrusted_volume error=Command ['/usr/bin/prlimit', '--cpu=30', '--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json', '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n' (dispatcher:87)
```
```
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1198, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-65>", line 2, in verify_untrusted_volume
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 1571, in verify_untrusted_volume
    qemu_info = qemuimg.info(vol.getVolumePath(), trusted_image=False)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 115, in info
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 488, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/prlimit', '--cpu=30', '--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json', '/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\': Failed to get shared "write" lock\nIs another process using the image [/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'
```
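The failing check is easy to reproduce outside vdsm: qemu-img takes file locks on the image by default, so `info` fails while qemu-nbd still has the volume open for writing. A minimal sketch of the check vdsm performs, without the prlimit resource limits it adds (the path is a placeholder):

```python
import json
import subprocess

def qemu_img_info(path):
    # Same kind of check vdsm's qemuimg.info() performs; qemu-img needs a
    # shared "write" lock on the image, which fails while qemu-nbd holds
    # the image open.
    cmd = ["qemu-img", "info", "--output", "json", path]
    result = subprocess.run(cmd, capture_output=True)
    if result.returncode != 0:
        # Produces the 'Failed to get shared "write" lock' error seen above.
        raise RuntimeError(result.stderr.decode())
    return json.loads(result.stdout)
```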
The relevant flow is:
1. Starting NBD server for this disk:
```
2020-08-25 15:36:00,929+0300 INFO (jsonrpc/3) [storage.nbd] Starting transient service vdsm-nbd-60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.service, serving volume 7119bdb0-64f0-4b42-b885-316cb28913f9/bff05295-a99d-4d0f-a6f0-0783257358a7 via unix socket /run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock (nbd:103)
```
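vdsm runs qemu-nbd inside a transient systemd unit, roughly like the sketch below. The unit name and socket path follow the pattern in the log above; the exact qemu-nbd flags vdsm passes differ, so treat this as illustrative:

```python
import subprocess

def start_transient_nbd(server_id, vol_path, img_format="qcow2"):
    # Roughly what vdsm's storage.nbd module does: run qemu-nbd in a
    # transient systemd unit, exporting the volume over a unix socket.
    sock = f"/run/vdsm/nbd/{server_id}.sock"
    unit = f"vdsm-nbd-{server_id}.service"
    subprocess.run(
        ["systemd-run", f"--unit={unit}",
         "qemu-nbd", "--socket", sock, "--format", img_format,
         "--persistent", vol_path],
        check=True)
    # The URL handed to the imageio ticket in the next step.
    return f"nbd:unix:{sock}"
```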
2. Adding image ticket for this disk
```
2020-08-25 15:36:01,126+0300 INFO (jsonrpc/1) [vdsm.api] START add_image_ticket(ticket={'dirty': False, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock'}) from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=cd5d690a-0f46-4edc-b7e9-063b970c13a3 (api:48)
```
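The ticket is a plain mapping; a sketch of building one shaped like the ticket logged above (the helper name is hypothetical, the field comments reflect how ovirt-imageio uses them):

```python
import uuid

def make_upload_ticket(transfer_id, nbd_url, size, timeout=300):
    # Hypothetical helper; mirrors the ticket in the add_image_ticket
    # call above.
    return {
        "dirty": False,            # not exposing dirty-bitmap information
        "ops": ["write"],          # this is an upload, so clients may write
        "size": size,              # virtual size of the disk, in bytes
        "sparse": True,            # holes may be punched when zeroing
        "transfer_id": transfer_id,
        "uuid": str(uuid.uuid4()), # ticket id used in the transfer URL
        "timeout": timeout,        # seconds of inactivity before expiry
        "url": nbd_url,            # backend imageio forwards I/O to
    }
```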
3. Monitor ticket status
```
2020-08-25 15:36:04,300+0300 INFO (jsonrpc/0) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339450, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 2415788032}} from=::ffff:10.46.10.158,43758, flow
...
2020-08-25 15:36:10,261+0300 INFO (jsonrpc/4) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339456, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 5997658112}} from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=50e3cb6b-c1f2-4334-a596-bbd8d3618f0e (api:54)
...
2020-08-25 15:36:20,024+0300 INFO (jsonrpc/2) [vdsm.api] FINISH get_image_ticket return={'result': {'active': True, 'canceled': False, 'connections': 4, 'expires': 5339465, 'idle_time': 0, 'ops': ['write'], 'size': 16106127360, 'sparse': True, 'dirty': False, 'timeout': 300, 'url': 'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock', 'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id': '0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 13247512576}} from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=2ec2e34d-0bd2-4294-8766-ebd808c4c6e2 (api:54)
```
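Engine polls the ticket to drive the progress display; 'transferred' grows toward 'size'. A sketch of such a poll loop (`vdsm` is a hypothetical client object exposing a get_image_ticket() call returning the structure logged above):

```python
import time

def watch_transfer(vdsm, ticket_uuid, interval=5.0):
    # Poll the ticket the way the logs above show; 'transferred' counts
    # bytes written so far, 'size' is the full virtual size.
    while True:
        ticket = vdsm.get_image_ticket(ticket_uuid)["result"]
        done, total = ticket["transferred"], ticket["size"]
        print(f"{done}/{total} bytes ({100.0 * done / total:.2f}%)")
        if ticket["canceled"] or done >= total:
            return ticket
        time.sleep(interval)
```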
4. Remove image ticket
```
2020-08-25 15:36:30,329+0300 INFO (jsonrpc/1) [vdsm.api] START remove_image_ticket(uuid='60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9') from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:48)
2020-08-25 15:36:30,334+0300 INFO (jsonrpc/1) [vdsm.api] FINISH remove_image_ticket return=None from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:54)
2020-08-25 15:36:30,334+0300 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.remove_image_ticket succeeded in 0.01 seconds (__init__:312)
```
At this point, if you are using ovirt-imageio >= 2.0.10-1 (make sure you do!), the connection to qemu-nbd is closed, and we can stop the NBD server.
5. Stop NBD server
Engine did not send this request; this is the root cause.
6. Verifying the image
```
2020-08-25 15:36:30,366+0300 INFO (jsonrpc/7) [vdsm.api] START verify_untrusted_volume(spUUID='602d1c80-e56d-11ea-8959-001a4a16039e', sdUUID='7119bdb0-64f0-4b42-b885-316cb28913f9', imgUUID='5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74', volUUID='bff05295-a99d-4d0f-a6f0-0783257358a7') from=::ffff:10.46.10.158,43758, flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d, task_id=46c8795f-98e2-4892-a60a-3a154684e98f (api:48)
```
This fails because qemu-nbd is still running: the engine did not stop the transfer properly.
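For reference, this is the teardown sequence steps 4-6 describe, with step 5 being the call the engine skipped. Method names here are illustrative, not vdsm's exact RPC verbs; `vdsm` is a hypothetical client handle:

```python
def finalize_upload(vdsm, ticket_uuid, server_id, sp, sd, img, vol):
    # Step 4: imageio drops the ticket and closes its qemu-nbd connection.
    vdsm.remove_image_ticket(ticket_uuid)
    # Step 5: the call missing in the buggy flow. Without it, qemu-nbd
    # keeps running and keeps the volume write-locked.
    vdsm.stop_nbd_server(server_id)
    # Step 6: qemu-img can now take its shared lock and verify the image.
    vdsm.verify_untrusted_volume(sp, sd, img, vol)
```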
Steven, can you test the attached patch? Does it solve the issue in your environment?

This is a regression, introduced in ovirt-engine-4.4.2. It mostly affects developers who do not run ovirt-imageio on the engine host. It is less likely to affect users, since ovirt-imageio is normally always running, but if the ovirt-imageio service is stopped for some reason, they may hit the same issue. The result is a running NBD server that nobody stops, with an active LV that nobody can remove (since qemu-nbd never exits). This is a pretty bad situation. I think we should have this in the next 4.4.2 build.

This bug report has Keywords: Regression or TestBlocker. Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

If it is not affecting users, it is not a blocker for the release. It is too late for 4.4.2.

Created attachment 1712643 [details]
engine log
As Michal mentioned above, it is too late for 4.4.2; moving this bug to 4.4.3.

Verified on rhv-4.4.3-4:

```
[root@storage-ge5-vdsm3 examples]# python3 upload_disk.py ~/fedora-31.img --engine-url https://storage-ge-05xxx --username admin@internal --disk-format qcow2 --disk-sparse --sd-name nfs_0 -c ~/ca.pem
Checking image...
Image format: raw
Disk format: cow
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 1281425408
Disk name: fedora-31.qcow2
Disk backup: False
Connecting...
Password:
Creating disk...
Disk ID: 74ed499d-c8e7-406f-91fd-6e7cd46c9b83
Creating image transfer...
Transfer ID: 60ea83d8-38e6-48ac-a876-d531f793d4b5
Transfer host name: host_mixed_3
Uploading image...
[ 100.00% ] 6.00 GiB, 7.11 seconds, 864.08 MiB/s
Finalizing image transfer...
Upload completed successfully
```

This bugzilla is included in the oVirt 4.4.3 release, published on November 10th 2020. Since the problem described in this bug report should be resolved in oVirt 4.4.3, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.