Bug 1872383

Summary: Uploading disks via the upload_disk script fails due to an engine regression.
Product: [oVirt] ovirt-engine
Reporter: Steven Rosenberg <srosenbe>
Component: Backend.Core
Assignee: Nir Soffer <nsoffer>
Status: CLOSED CURRENTRELEASE
QA Contact: Ilan Zuckerman <izuckerm>
Severity: high
Docs Contact:
Priority: high
Version: 4.4.0
CC: aoconnor, bugs, michal.skrivanek, sfishbai
Target Milestone: ovirt-4.4.3
Keywords: Regression
Target Release: 4.4.3.3
Flags: pm-rhel: ovirt-4.4+, aoconnor: blocker-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: ovirt-engine-4.4.3.3
Doc Type: Bug Fix
Doc Text:
Cause: When using the upload_disk script to upload a disk to oVirt, an earlier change intended to avoid duplicate code prevented the stopNbdServer function from being called: the removeImageTicketFromProxy function failed to remove a partial ticket and returned early, before stopNbdServer was reached.
Consequence: The image was still locked when it had to be checked during image verification.
Fix: When removeImageTicketFromProxy reports a failure, the return value is now ignored and cleanup continues, so stopNbdServer is still called.
Result: The upload_disk script uploads the disk successfully.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-11 06:42:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  VDSM Log (flags: none)
  engine log (flags: none)

Description Steven Rosenberg 2020-08-25 15:53:14 UTC
Description of problem: When uploading a disk with the upload_disk script, the upload fails with an illegal device error. This is due to a regression that changed the order of the stopNbdServer function call [1].


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Upload a disk via the upload_disk.py script
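
A concrete invocation of the script (the engine URL, storage domain name, image file, and CA certificate path below are placeholders; compare the verified run in comment 10):

    python3 upload_disk.py fedora-31.img \
        --engine-url https://<engine-fqdn> \
        --username admin@internal \
        --disk-format qcow2 \
        --disk-sparse \
        --sd-name <storage-domain> \
        -c ca.pem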

Actual results: Uploading fails


Expected results: The upload succeeds


Additional info:

[1] https://gerrit.ovirt.org/#/c/110417/

Comment 1 Steven Rosenberg 2020-08-25 15:53:50 UTC
Created attachment 1712565 [details]
VDSM Log

Comment 2 Nir Soffer 2020-08-25 16:12:06 UTC
Steven, can you attach the engine log showing this failure?

Comment 3 Nir Soffer 2020-08-25 16:15:16 UTC
The issue is that qemu-img info cannot lock the image because another
process is holding a lock on this image:

2020-08-25 15:36:30,400+0300 ERROR (jsonrpc/7) [storage.Dispatcher]
FINISH verify_untrusted_volume error=Command ['/usr/bin/prlimit',
'--cpu=30', '--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json',
'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7']
failed with rc=1 out=b'' err=b'qemu-img: Could not open
\'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\':
Failed to get shared "write" lock\nIs another process using the image
[/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'
(dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py",
line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
1198, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-65>", line 2, in verify_untrusted_volume
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line
1571, in verify_untrusted_volume
    qemu_info = qemuimg.info(vol.getVolumePath(), trusted_image=False)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py",
line 115, in info
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py",
line 488, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/prlimit', '--cpu=30',
'--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json',
'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7']
failed with rc=1 out=b'' err=b'qemu-img: Could not open
\'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\':
Failed to get shared "write" lock\nIs another process using the image
[/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'
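
For background, this failure is ordinary qemu image locking (qemu >= 2.10): while qemu-nbd still has the volume open read-write, another qemu-img invocation cannot take the shared "write" lock on the same file. A minimal sketch that reproduces the conflict outside of vdsm (the paths are hypothetical, and qemu-img/qemu-nbd must be installed on the host):

    import subprocess, time

    image = "/tmp/test.qcow2"      # hypothetical scratch image
    sock = "/tmp/test-nbd.sock"    # hypothetical unix socket path

    # Create a small qcow2 image to play with.
    subprocess.run(["qemu-img", "create", "-f", "qcow2", image, "1G"], check=True)

    # Serve the image over a unix socket, similar to what the transient
    # vdsm-nbd-<uuid>.service does on the host.
    nbd = subprocess.Popen(["qemu-nbd", "--socket", sock, "--format", "qcow2", image])
    time.sleep(1)  # crude wait for qemu-nbd to open the image and take its lock

    # While qemu-nbd holds the image open read-write, qemu-img cannot take
    # the shared "write" lock, which is exactly the failure shown above.
    rc = subprocess.run(["qemu-img", "info", "--output", "json", image]).returncode
    print("qemu-img info while served:", "failed" if rc != 0 else "succeeded")

    # After the NBD server exits (step 5 of the flow below), the same
    # command succeeds again.
    nbd.terminate()
    nbd.wait()
    subprocess.run(["qemu-img", "info", "--output", "json", image], check=True)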


The relevant flow is:


1. Starting NBD server for this disk:

2020-08-25 15:36:00,929+0300 INFO  (jsonrpc/3) [storage.nbd] Starting
transient service
vdsm-nbd-60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.service, serving volume
7119bdb0-64f0-4b42-b885-316cb28913f9/bff05295-a99d-4d0f-a6f0-0783257358a7
via unix socket
/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock (nbd:103)

2. Adding image ticket for this disk

2020-08-25 15:36:01,126+0300 INFO  (jsonrpc/1) [vdsm.api] START
add_image_ticket(ticket={'dirty': False, 'ops': ['write'], 'size':
16106127360, 'sparse': True, 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'uuid':
'60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock'})
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=cd5d690a-0f46-4edc-b7e9-063b970c13a3 (api:48)

3. Monitor ticket status

2020-08-25 15:36:04,300+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339450, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 2415788032}}
from=::ffff:10.46.10.158,43758, flow
...
2020-08-25 15:36:10,261+0300 INFO  (jsonrpc/4) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339456, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 5997658112}}
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=50e3cb6b-c1f2-4334-a596-bbd8d3618f0e (api:54)
...
2020-08-25 15:36:20,024+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339465, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 13247512576}}
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=2ec2e34d-0bd2-4294-8766-ebd808c4c6e2 (api:54)

4. Remove image ticket

2020-08-25 15:36:30,329+0300 INFO  (jsonrpc/1) [vdsm.api] START
remove_image_ticket(uuid='60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9')
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:48)
2020-08-25 15:36:30,334+0300 INFO  (jsonrpc/1) [vdsm.api] FINISH
remove_image_ticket return=None from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:54)
2020-08-25 15:36:30,334+0300 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer]
RPC call Host.remove_image_ticket succeeded in 0.01 seconds
(__init__:312)

At this point if you are using ovirt-imageio >= 2.0.10-1 (make sure
you do!), the connection to
qemu-nbd is closed, and we can stop the NBD server.

5. Stop NBD server

    engine did not send this; this is the root cause.

6.  Verifying the image

2020-08-25 15:36:30,366+0300 INFO  (jsonrpc/7) [vdsm.api] START
verify_untrusted_volume(spUUID='602d1c80-e56d-11ea-8959-001a4a16039e',
sdUUID='7119bdb0-64f0-4b42-b885-316cb28913f9',
imgUUID='5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74',
volUUID='bff05295-a99d-4d0f-a6f0-0783257358a7')
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=46c8795f-98e2-4892-a60a-3a154684e98f (api:48)

This breaks because qemu-nbd is still running.

Engine did not stop the transfer properly.
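
The engine-side fix (see the doc text above) is to make this cleanup independent of ticket removal succeeding. Below is a control-flow sketch in Python, since the real code lives in the engine's Java backend: removeImageTicketFromProxy and stopNbdServer are the names from the doc text, verifyImage is a hypothetical stand-in for the final verify_untrusted_volume call, and the stubs exist only to make the sketch runnable:

    # Control-flow sketch only; not the actual engine implementation.
    def removeImageTicketFromProxy(transfer):
        return False   # simulate the partial-ticket removal failure

    def stopNbdServer(transfer):
        print("stopping vdsm-nbd transient service")

    def verifyImage(transfer):
        print("running verify_untrusted_volume")

    # Before the fix: a ticket-removal failure caused an early return,
    # so stopNbdServer() was never called and qemu-nbd kept the image locked.
    def finalizeBeforeFix(transfer):
        if not removeImageTicketFromProxy(transfer):
            return
        stopNbdServer(transfer)
        verifyImage(transfer)

    # After the fix: the failure is ignored and cleanup continues, so the
    # NBD server is stopped before the image is verified.
    def finalizeAfterFix(transfer):
        removeImageTicketFromProxy(transfer)   # result ignored on failure
        stopNbdServer(transfer)
        verifyImage(transfer)

    finalizeBeforeFix("transfer")   # prints nothing: cleanup is skipped
    finalizeAfterFix("transfer")    # stops the NBD server, then verifies

In short, before the fix a failure while removing the ticket short-circuited the whole cleanup; after the fix the failure is logged and ignored, so qemu-nbd is stopped and the verification step can take the image lock.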

Comment 4 Nir Soffer 2020-08-25 16:30:16 UTC
Steven, can you test the attached patch? Does it solve the issue in 
your environment?

Comment 5 Nir Soffer 2020-08-25 16:50:14 UTC
This is a regression, introduced in ovirt-engine-4.4.2.

It mostly affects developers that do not run ovirt-imageio on the
engine host.

It is less likely to affect users, since ovirt-imageio is always
running, but if the ovirt-imageio service is stopped for some reason,
they may experience the same issue.

The result is a running NBD server that nobody stops, with an active
LV that nobody can remove (since qemu-nbd never exits). This is a
pretty bad situation.

I think we should have this in the next 4.4.2 build.
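
If a host is already stuck in this state, the leftover qemu-nbd can be cleared manually by stopping the transient unit that vdsm created (vdsm-nbd-<uuid>.service, where the UUID matches the image ticket / NBD server UUID seen in the log above). This is cleanup guidance inferred from the log, not something prescribed by the fix; a rough sketch, assuming root access on the host and a known unit name:

    import subprocess

    # The UUID below is the one from this bug's log; substitute the UUID
    # of the stuck transfer on your host.
    unit = "vdsm-nbd-60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.service"

    # Stopping the transient service makes qemu-nbd exit and release the
    # image lock, after which the active LV mentioned above can be
    # deactivated or removed normally.
    subprocess.run(["systemctl", "stop", unit], check=True)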

Comment 6 RHEL Program Management 2020-08-25 16:50:22 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 7 Michal Skrivanek 2020-08-26 03:55:28 UTC
If it’s not affecting users, it’s not a blocker for the release. It’s too late for 4.4.2.

Comment 8 Steven Rosenberg 2020-08-26 08:15:04 UTC
Created attachment 1712643 [details]
engine log

Comment 9 Shir Fishbain 2020-08-27 06:53:21 UTC
As Michal mentioned above, it’s too late for 4.4.2; moving this bug to 4.4.3.

Comment 10 Ilan Zuckerman 2020-09-13 08:40:20 UTC
Verified on rhv-4.4.3-4

[root@storage-ge5-vdsm3 examples]# python3 upload_disk.py ~/fedora-31.img --engine-url https://storage-ge-05xxx --username admin@internal --disk-format qcow2 --disk-sparse --sd-name nfs_0 -c ~/ca.pem 
Checking image...
Image format: raw
Disk format: cow
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 1281425408
Disk name: fedora-31.qcow2
Disk backup: False
Connecting...
Password: 
Creating disk...
Disk ID: 74ed499d-c8e7-406f-91fd-6e7cd46c9b83
Creating image transfer...
Transfer ID: 60ea83d8-38e6-48ac-a876-d531f793d4b5
Transfer host name: host_mixed_3
Uploading image...
[ 100.00% ] 6.00 GiB, 7.11 seconds, 864.08 MiB/s                               
Finalizing image transfer...
Upload completed successfully

Comment 11 Sandro Bonazzola 2020-11-11 06:42:04 UTC
This bugzilla is included in the oVirt 4.4.3 release, published on November 10th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.