Bug 1872383 - Uploading disks via the upload_disk script fails due to an engine regression.
Summary: Uploading disks via the upload_disk script fails due to an engine regression.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.3
Target Release: 4.4.3.3
Assignee: Nir Soffer
QA Contact: Ilan Zuckerman
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-25 15:53 UTC by Steven Rosenberg
Modified: 2020-11-11 06:46 UTC (History)
4 users

Fixed In Version: ovirt-engine-4.4.3.3
Clone Of:
Environment:
Last Closed: 2020-11-11 06:42:04 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.4+
aoconnor: blocker-


Attachments (Terms of Use)
VDSM Log (595.99 KB, application/gzip)
2020-08-25 15:53 UTC, Steven Rosenberg
no flags Details
engine log (253.14 KB, application/gzip)
2020-08-26 08:15 UTC, Steven Rosenberg
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 110937 0 master MERGED core: Fix stopping transfer session 2020-11-11 09:08:37 UTC
oVirt gerrit 110942 0 ovirt-engine-4.4.2.z MERGED core: Fix stopping transfer session 2020-11-11 09:08:37 UTC

Description Steven Rosenberg 2020-08-25 15:53:14 UTC
Description of problem: When uploading a disk with the upload_disk script, the upload fails with an illegal device error. This is due to a regression that changed the order of the stopNbdServer function call [1].


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Upload a disk via the upload_disk.py script
2.
3.

Actual results: Uploading fails


Expected results: Upload will succeed


Additional info:

[1] https://gerrit.ovirt.org/#/c/110417/

Comment 1 Steven Rosenberg 2020-08-25 15:53:50 UTC
Created attachment 1712565 [details]
VDSM Log

Comment 2 Nir Soffer 2020-08-25 16:12:06 UTC
Steven, can you attach the engine log showing this failure?

Comment 3 Nir Soffer 2020-08-25 16:15:16 UTC
The issue is that qemu-img info cannot lock the image - another
process is holding a lock on this
image:

2020-08-25 15:36:30,400+0300 ERROR (jsonrpc/7) [storage.Dispatcher]
FINISH verify_untrusted_volume error=Command ['/usr/bin/prlimit',
'--cpu=30', '--as=1073741824', '/usr/bi
n/qemu-img', 'info', '--output', 'json',
'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/image
s/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7']
failed with rc=1 out=b'' err=b'qemu-img: Could not open
\'/rhev/data-center/mnt/vserver-spider.
eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\':
Failed to get shared "write" lock\nIs another process using the image
[/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0
-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'
(dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py",
line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
1198, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line
887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-65>", line 2, in verify_untrusted_volume
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line
1571, in verify_untrusted_volume
    qemu_info = qemuimg.info(vol.getVolumePath(), trusted_image=False)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py",
line 115, in info
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py",
line 488, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/prlimit', '--cpu=30',
'--as=1073741824', '/usr/bin/qemu-img', 'info', '--output', 'json',
'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7']
failed with rc=1 out=b'' err=b'qemu-img: Could not open
\'/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7\':
Failed to get shared "write" lock\nIs another process using the image
[/rhev/data-center/mnt/vserver-spider.eng.lab.tlv.redhat.com:_pub_srosenbe_engine__master/7119bdb0-64f0-4b42-b885-316cb28913f9/images/5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74/bff05295-a99d-4d0f-a6f0-0783257358a7]?\n'


The relevant flow is:


1. Starting NBD server for this disk:

2020-08-25 15:36:00,929+0300 INFO  (jsonrpc/3) [storage.nbd] Starting
transient service
vdsm-nbd-60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.service, serving volume
7119bdb0-64f0-4b42-b885-316cb28913f9/bff05295-a99d-4d0f-a6f0-0783257358a7
via unix socket
/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock (nbd:103)

2. Adding image ticket for this disk

2020-08-25 15:36:01,126+0300 INFO  (jsonrpc/1) [vdsm.api] START
add_image_ticket(ticket={'dirty': False, 'ops': ['write'], 'size':
16106127360, 'sparse': True, 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'uuid':
'60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock'})
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=cd5d690a-0f46-4edc-b7e9-063b970c13a3 (api:48)

3. Monitor ticket status

2020-08-25 15:36:04,300+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339450, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 2415788032}}
from=::ffff:10.46.10.158,43758, flow
...
2020-08-25 15:36:10,261+0300 INFO  (jsonrpc/4) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339456, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 5997658112}}
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=50e3cb6b-c1f2-4334-a596-bbd8d3618f0e (api:54)
...
2020-08-25 15:36:20,024+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH
get_image_ticket return={'result': {'active': True, 'canceled': False,
'connections': 4, 'expires': 5339465, 'idle_time': 0, 'ops':
['write'], 'size': 16106127360, 'sparse': True, 'dirty': False,
'timeout': 300, 'url':
'nbd:unix:/run/vdsm/nbd/60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9.sock',
'uuid': '60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9', 'transfer_id':
'0b10a195-73fc-4032-be28-31275b1f59e8', 'transferred': 13247512576}}
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=2ec2e34d-0bd2-4294-8766-ebd808c4c6e2 (api:54)

4. Remove image ticket

2020-08-25 15:36:30,329+0300 INFO  (jsonrpc/1) [vdsm.api] START
remove_image_ticket(uuid='60a0f9e2-1cc2-4ed9-8f7c-41002fc44ea9')
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:48)
2020-08-25 15:36:30,334+0300 INFO  (jsonrpc/1) [vdsm.api] FINISH
remove_image_ticket return=None from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=e86a58d7-9916-44a3-890c-d8625c1433e5 (api:54)
2020-08-25 15:36:30,334+0300 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer]
RPC call Host.remove_image_ticket succeeded in 0.01 seconds
(__init__:312)

At this point, if you are using ovirt-imageio >= 2.0.10-1 (make sure
you do!), the connection to qemu-nbd is closed, and we can stop the
NBD server.

5. Stop NBD server

    Engine did not send this - this is the root cause.

6. Verifying the image

2020-08-25 15:36:30,366+0300 INFO  (jsonrpc/7) [vdsm.api] START
verify_untrusted_volume(spUUID='602d1c80-e56d-11ea-8959-001a4a16039e',
sdUUID='7119bdb0-64f0-4b42-b885-316cb28913f9',
imgUUID='5ef4210d-fa63-4d0e-a0b4-f060bd6d3a74',
volUUID='bff05295-a99d-4d0f-a6f0-0783257358a7')
from=::ffff:10.46.10.158,43758,
flow_id=1e9d265c-92ca-4abe-b78d-bf11a8b2454d,
task_id=46c8795f-98e2-4892-a60a-3a154684e98f (api:48)

This breaks because qemu-nbd is still running.

The engine did not stop the transfer properly.
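
To make the ordering concrete, here is a minimal pseudocode sketch of the
finalize flow described above. This is not the actual ovirt-engine code (the
engine is Java, and the real method and VDSM verb names differ);
finalize_upload, remove_image_ticket, stop_nbd_server, and
verify_untrusted_volume are placeholders for the corresponding engine/VDSM
calls.

# Pseudocode sketch only -- not the real ovirt-engine code. Names are
# placeholders for the corresponding engine/VDSM calls.

def finalize_upload(transfer):
    # 1. Remove the imageio ticket. With ovirt-imageio >= 2.0.10-1 this
    #    also closes the daemon's connection to qemu-nbd.
    remove_image_ticket(transfer.ticket_id)

    # 2. Stop the transient qemu-nbd service *before* verifying the volume.
    #    This is the step the engine skipped here, leaving qemu-nbd holding
    #    a write lock on the image.
    stop_nbd_server(transfer.host, transfer.nbd_server_id)

    # 3. Only now can qemu-img open the image, so verification succeeds.
    verify_untrusted_volume(transfer.sd_uuid, transfer.img_uuid,
                            transfer.vol_uuid)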

Comment 4 Nir Soffer 2020-08-25 16:30:16 UTC
Steven, can you test the attached patch? Does it solve the issue in 
your environment?

Comment 5 Nir Soffer 2020-08-25 16:50:14 UTC
This is a regression, introduced in ovirt-engine-4.4.2.

This mostly affects developers who do not run ovirt-imageio on the
engine host.

It is less likely to affect users, since ovirt-imageio is always
running, but if the ovirt-imageio service is stopped for some reason,
they may experience the same issue.

The result is a running NBD server that nobody stops, with an active
LV that nobody can remove (since qemu-nbd never exits). This is a
pretty bad situation.
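
As a hedged aside for anyone who hits this: the leftover qemu-nbd runs as a
transient systemd unit on the host, named vdsm-nbd-<uuid>.service per the
VDSM log above, so it can be stopped manually once no transfer is using it.
A minimal sketch, assuming that unit naming (check the systemctl output
first):

import subprocess

def leftover_nbd_units():
    # List transient vdsm-nbd units, matching the naming seen in the VDSM log.
    out = subprocess.run(
        ["systemctl", "list-units", "--no-legend", "vdsm-nbd-*.service"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout
    return [line.split()[0] for line in out.splitlines() if line.strip()]

for unit in leftover_nbd_units():
    # Stopping the unit terminates qemu-nbd and releases the image lock,
    # so the volume/LV can be deactivated and removed. Only do this for
    # transfers that are no longer active.
    print("stopping", unit)
    subprocess.run(["systemctl", "stop", unit], check=True)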

I think we should have this in the next 4.4.2 build.

Comment 6 RHEL Program Management 2020-08-25 16:50:22 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 7 Michal Skrivanek 2020-08-26 03:55:28 UTC
If it’s not affecting users, it’s not a blocker for the release. It’s too late for 4.4.2.

Comment 8 Steven Rosenberg 2020-08-26 08:15:04 UTC
Created attachment 1712643 [details]
engine log

Comment 9 Shir Fishbain 2020-08-27 06:53:21 UTC
As Michal mentioned above it’s too late for 4.4.2, moving this bug to 4.4.3.

Comment 10 Ilan Zuckerman 2020-09-13 08:40:20 UTC
Verified on rhv-4.4.3-4

[root@storage-ge5-vdsm3 examples]# python3 upload_disk.py ~/fedora-31.img --engine-url https://storage-ge-05xxx --username admin@internal --disk-format qcow2 --disk-sparse --sd-name nfs_0 -c ~/ca.pem 
Checking image...
Image format: raw
Disk format: cow
Disk content type: data
Disk provisioned size: 6442450944
Disk initial size: 1281425408
Disk name: fedora-31.qcow2
Disk backup: False
Connecting...
Password: 
Creating disk...
Disk ID: 74ed499d-c8e7-406f-91fd-6e7cd46c9b83
Creating image transfer...
Transfer ID: 60ea83d8-38e6-48ac-a876-d531f793d4b5
Transfer host name: host_mixed_3
Uploading image...
[ 100.00% ] 6.00 GiB, 7.11 seconds, 864.08 MiB/s                               
Finalizing image transfer...
Upload completed successfully

Comment 11 Sandro Bonazzola 2020-11-11 06:42:04 UTC
This bugzilla is included in oVirt 4.4.3 release, published on November 10th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.3 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

