Bug 1948532 - Resize disk when vm is up fails, vm cannot be stopped
Summary: Resize disk when vm is up fails, vm cannot be stopped
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.60.3
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: ovirt-4.4.6
: 4.40.60.5
Assignee: Vojtech Juranek
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 1903511
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-04-12 11:48 UTC by Ilan Zuckerman
Modified: 2021-11-04 19:28 UTC (History)
8 users (show)

Fixed In Version: vdsm-4.40.60.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-05 05:36:25 UTC
oVirt Team: Storage
Embargoed:


Attachments (Terms of Use)
logs and qemu core dump (5.06 MB, application/gzip)
2021-04-12 11:48 UTC, Ilan Zuckerman
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 114307 0 master MERGED spec: update qemu-kvm requirement 2021-04-16 08:40:34 UTC

Description Ilan Zuckerman 2021-04-12 11:48:00 UTC
Created attachment 1771340 [details]
logs and qemu core dump

Description of problem:

block-resize job for qcow2 disk never finish. The only way to complete the
job is to kill -9 qemu.

This breaks basic functionality in RHV in a very bad way.

It seems that iothread is deadlocked:

Thread 3 (Thread 0x7fe9fec25700 (LWP 70060)):
#0  0x00007fea0b8e565d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fea0b8dea44 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005653cd1da40d in qemu_mutex_lock_impl (mutex=0x5653cfa5e420, file=0x5653cd5352e0 "../util/async.c", line=645) at ../util/qemu-thread-posix.c:79
#3  0x00005653ccfeaf70 in virtio_scsi_acquire (s=0x5653d1333510) at /usr/src/debug/qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64/include/hw/virtio/virtio-scsi.h:136
#4  virtio_scsi_data_plane_handle_event (vdev=<optimized out>, vq=0x7fe9fc55b0a8) at ../hw/scsi/virtio-scsi-dataplane.c:84
#5  0x00005653cd030a6e in virtio_queue_notify_aio_vq (vq=<optimized out>) at ../hw/virtio/virtio.c:2326
#6  0x00005653cd1d5a9f in run_poll_handlers_once (timeout=<synthetic pointer>, now=4854163337034, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:397
#7  run_poll_handlers (timeout=<synthetic pointer>, max_ns=4000, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:493
#8  try_poll_mode (timeout=<synthetic pointer>, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:536
#9  aio_poll (ctx=0x5653cfa5e3c0, blocking=blocking@entry=true) at ../util/aio-posix.c:577
#10 0x00005653cd0c8eca in iothread_run (opaque=0x5653cf985f20) at ../iothread.c:73


Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64
libvirt-daemon-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64


How reproducible:
Always (4/4)


Steps to Reproduce:
1. Start vm with a qcow2 disk using virtio or virtio-scsi and iothread
2. Try to resize one of the qcow2 disks in RHV


Actual results:
Libvirt blockResize call does not return.
In vdsm log, warning about blocked worker every 60 seconds
Vm cannot be stopped
qemu process cannot be terminated (kill)
The only way to recover is to kill it (kill -9)


Expected results:
Disk resized quickly


Additional info:

Workaround: disbale iothread.


Vdsm log:

2021-04-11 18:29:40,871+0300 INFO  (jsonrpc/7) [api.virt] START diskSizeExtend(driveSpecs={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '31fdb830-422f-4f64-8d24-ae54bfcaca27', 'imageID': 'b48e41d8-0b37-4511-bf81-bfadb27f01ee', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, newSize='3221225472') from=::ffff:192.168.122.11,52826, flow_id=098c702f-3008-4c0a-b4c6-956a4c6d5dbf, vmId=46ecb982-50c6-406a-bc5c-16bf9b842ea4 (api:48)

Vdsm call libvirt like this:

    flags = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES
    self.log.debug("Calling blockResise drive=%s size=%s flags=%s",
                   drive.name, newSizeBytes, flags)
    self._dom.blockResize(drive.name, newSizeBytes, flags=flags)


Libvirt log:

2021-04-11 15:29:40.895+0000: 34970: debug : virDomainBlockResize:6057 : dom=0x7fc0c800b460, (VM: name=resize-disk, uuid=46ecb982-50c6-406a-bc5c-16bf9b842ea4), disk=vdb, size=3221225472, flags=0x1

2021-04-11 15:29:40.895+0000: 34970: info : qemuMonitorSend:950 : QEMU_MONITOR_SEND_MSG: mon=0x7fc11804a1b0 msg={"execute":"
block_resize","arguments":{"node-name":"libvirt-2-format","size":3221225472},"id":"libvirt-435"}
 fd=-1

No reply seen for id libvirt-435.

Then lot of errors like:

2021-04-11 15:30:18.532+0000: 34971: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)
...
2021-04-11 15:31:53.614+0000: 34967: warning : qemuDomainObjBeginJobInternal:953 : Cannot start job (query, none, none) for domain resize-disk; current job is (modify, none, none) owned by (34970 remoteDispatchDomainBlockResize, 0 <null>, 0 <null> (flags=0x0)) for (132s, 0s, 0s)
...
2021-04-11 15:44:24.066+0000: 34968: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)

Comment 1 Nir Soffer 2021-04-12 12:17:01 UTC
Vdsm need to require qemu version fixing this issue. We hope to get
a fix in RHEL AV 8.4 or 8.4.z.

Comment 4 Eyal Shenitzky 2021-04-15 06:42:53 UTC
Let's bump QEMU version in VDSM to consume the fix.

Comment 5 Ilan Zuckerman 2021-04-20 13:37:25 UTC
Verified on rhv-4.4.6-5

Tried to resize disks with both interfaces: virtio and virtio-scsi
All worked fine.

Comment 7 Sandro Bonazzola 2021-05-05 05:36:25 UTC
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.


Note You need to log in before you can comment on or make changes to this bug.