Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1948296

Summary: qemu iothread deadlock in block-resize
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Nir Soffer <nsoffer>
Component: qemu-kvm
Sub component: Storage
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: qing.wang <qinwang>
Status: CLOSED DUPLICATE
Severity: urgent
Priority: unspecified
CC: coli, kwolf, pkrempa, slopezpa, virt-maint
Version: 8.4
Flags: pm-rhel: mirror+
Target Milestone: rc
Target Release: 8.4
Hardware: Unspecified
OS: Unspecified
Last Closed: 2021-04-12 00:57:07 UTC
Type: Bug
Attachments: logs and qemu core dump

Description Nir Soffer 2021-04-11 16:30:10 UTC
Created attachment 1771193 [details]
logs and qemu core dump

Description of problem:

The block-resize job for a qcow2 disk never finishes. The only way to complete
the job is to kill -9 qemu.

It seems that the iothread is deadlocked:

Thread 3 (Thread 0x7fe9fec25700 (LWP 70060)):
#0  0x00007fea0b8e565d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fea0b8dea44 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005653cd1da40d in qemu_mutex_lock_impl (mutex=0x5653cfa5e420, file=0x5653cd5352e0 "../util/async.c", line=645) at ../util/qemu-thread-posix.c:79
#3  0x00005653ccfeaf70 in virtio_scsi_acquire (s=0x5653d1333510) at /usr/src/debug/qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64/include/hw/virtio/virtio-scsi.h:136
#4  virtio_scsi_data_plane_handle_event (vdev=<optimized out>, vq=0x7fe9fc55b0a8) at ../hw/scsi/virtio-scsi-dataplane.c:84
#5  0x00005653cd030a6e in virtio_queue_notify_aio_vq (vq=<optimized out>) at ../hw/virtio/virtio.c:2326
#6  0x00005653cd1d5a9f in run_poll_handlers_once (timeout=<synthetic pointer>, now=4854163337034, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:397
#7  run_poll_handlers (timeout=<synthetic pointer>, max_ns=4000, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:493
#8  try_poll_mode (timeout=<synthetic pointer>, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:536
#9  aio_poll (ctx=0x5653cfa5e3c0, blocking=blocking@entry=true) at ../util/aio-posix.c:577
#10 0x00005653cd0c8eca in iothread_run (opaque=0x5653cf985f20) at ../iothread.c:73
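The backtrace shows the iothread blocked in `virtio_scsi_acquire()`, trying to take the AioContext lock, presumably held indefinitely by another thread (likely the monitor thread handling block_resize). A minimal Python sketch of that pattern, purely illustrative and not QEMU code; the thread roles and the timeout are assumptions:

```python
import threading

# Illustrative only: one thread holds the "AioContext" lock indefinitely
# while a second thread (the iothread's poll handler in the backtrace)
# tries to take it and cannot.
ctx_lock = threading.Lock()
lock_held = threading.Event()
finish = threading.Event()
results = []

def monitor_thread():
    # Stands in for whichever thread holds the AioContext lock while
    # waiting on the stuck block-resize job.
    with ctx_lock:
        lock_held.set()
        finish.wait()  # in the real bug, this wait never returns

def iothread_poll():
    # Stands in for virtio_scsi_acquire(): here we use a timeout
    # instead of blocking forever in __lll_lock_wait.
    lock_held.wait()
    got_it = ctx_lock.acquire(timeout=0.1)
    results.append(got_it)
    if got_it:
        ctx_lock.release()

m = threading.Thread(target=monitor_thread)
io = threading.Thread(target=iothread_poll)
m.start()
io.start()
io.join()    # the sketch's iothread gives up after its timeout
finish.set()
m.join()
print(results)  # -> [False]: the lock could not be acquired
```

In the real process there is no timeout, so the iothread stays parked in `__lll_lock_wait` forever, which is why only kill -9 works.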


Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64
libvirt-daemon-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64


How reproducible:
Always (4/4)


Steps to Reproduce:
1. Start a VM with a qcow2 disk using virtio or virtio-scsi and an iothread
2. Try to resize one of the qcow2 disks in RHV


Actual results:
Libvirt blockResize call does not return.
In the vdsm log, a warning about a blocked worker appears every 60 seconds.
The VM cannot be stopped.
The qemu process cannot be terminated (kill).
The only way to recover is to kill it (kill -9).


Expected results:
Disk is resized quickly.


Additional info:

Workaround: disable iothread.
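For reference, the iothread attachment lives in the libvirt domain XML; a sketch with hypothetical values, element names per libvirt's domain format. Dropping the `iothread` attribute from the disk's `<driver>` (and the `<iothreads>` element) is what the workaround amounts to:

```xml
<domain type='kvm'>
  <!-- ... -->
  <iothreads>1</iothreads>
  <devices>
    <disk type='file' device='disk'>
      <!-- removing iothread='1' here detaches the disk from the iothread -->
      <driver name='qemu' type='qcow2' iothread='1'/>
      <!-- ... -->
    </disk>
  </devices>
</domain>
```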


Vdsm log:

2021-04-11 18:29:40,871+0300 INFO  (jsonrpc/7) [api.virt] START diskSizeExtend(driveSpecs={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '31fdb830-422f-4f64-8d24-ae54bfcaca27', 'imageID': 'b48e41d8-0b37-4511-bf81-bfadb27f01ee', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, newSize='3221225472') from=::ffff:192.168.122.11,52826, flow_id=098c702f-3008-4c0a-b4c6-956a4c6d5dbf, vmId=46ecb982-50c6-406a-bc5c-16bf9b842ea4 (api:48)
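For orientation, the newSize in that call is plain bytes, i.e. a 3 GiB resize target:

```python
# newSize from the vdsm log above, expressed as bytes
new_size = 3 * 1024 ** 3
print(new_size)  # -> 3221225472, matching newSize='3221225472'
```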

Vdsm calls libvirt like this:

    flags = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES
    self.log.debug("Calling blockResise drive=%s size=%s flags=%s",
                   drive.name, newSizeBytes, flags)
    self._dom.blockResize(drive.name, newSizeBytes, flags=flags)


Libvirt log:

2021-04-11 15:29:40.895+0000: 34970: debug : virDomainBlockResize:6057 : dom=0x7fc0c800b460, (VM: name=resize-disk, uuid=46ecb982-50c6-406a-bc5c-16bf9b842ea4), disk=vdb, size=3221225472, flags=0x1

2021-04-11 15:29:40.895+0000: 34970: info : qemuMonitorSend:950 : QEMU_MONITOR_SEND_MSG: mon=0x7fc11804a1b0 msg={"execute":"block_resize","arguments":{"node-name":"libvirt-2-format","size":3221225472},"id":"libvirt-435"} fd=-1

No reply seen for id libvirt-435.

Then many errors like:

2021-04-11 15:30:18.532+0000: 34971: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)
...
2021-04-11 15:31:53.614+0000: 34967: warning : qemuDomainObjBeginJobInternal:953 : Cannot start job (query, none, none) for domain resize-disk; current job is (modify, none, none) owned by (34970 remoteDispatchDomainBlockResize, 0 <null>, 0 <null> (flags=0x0)) for (132s, 0s, 0s)
...
2021-04-11 15:44:24.066+0000: 34968: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)

Comment 1 CongLi 2021-04-12 00:57:07 UTC

*** This bug has been marked as a duplicate of bug 1903511 ***

Comment 2 Nir Soffer 2021-04-12 09:56:01 UTC
(In reply to CongLi from comment #1)
Thanks CongLi,

But note that you converted a bug with:
- Severity: Urgent
- Dependent Products:	Red Hat Enterprise Virtualization Manager

To a bug with:
- Severity: High
- Dependent Products:	(none)

This bug breaks basic functionality in RHV, in a very bad way,
and we hope to get it fixed in 8.4.

Comment 3 CongLi 2021-04-12 10:44:56 UTC
(In reply to Nir Soffer from comment #2)
> (In reply to CongLi from comment #1)
> Thanks CongLi,
> 
> But note that you converted a bug with:
> - Severity: Urgent
> - Dependent Products:	Red Hat Enterprise Virtualization Manager
> 
> To a bug with:
> - Severity: High
> - Dependent Products:	(none)
> 
> This bug breaks basic functionality in RHV, in a very bad way,
> and we hope to get it fixed in 8.4.

Thanks Nir, I will take care of this info next time when closing a bug, and will update the corresponding bug.