Bug 1948296 - qemu iothread deadlock in block-resize
Summary: qemu iothread deadlock in block-resize
Keywords:
Status: CLOSED DUPLICATE of bug 1903511
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: qemu-kvm
Version: 8.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: 8.4
Assignee: Virtualization Maintenance
QA Contact: qing.wang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-11 16:30 UTC by Nir Soffer
Modified: 2021-04-12 10:44 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-12 00:57:07 UTC
Type: Bug
Target Upstream Version:
Embargoed:
pm-rhel: mirror+


Attachments
logs and qemu core dump (5.06 MB, application/gzip)
2021-04-11 16:30 UTC, Nir Soffer

Description Nir Soffer 2021-04-11 16:30:10 UTC
Created attachment 1771193 [details]
logs and qemu core dump

Description of problem:

The block-resize job for a qcow2 disk never finishes. The only way to
complete the job is to kill -9 qemu.

It seems that the iothread is deadlocked:

Thread 3 (Thread 0x7fe9fec25700 (LWP 70060)):
#0  0x00007fea0b8e565d in __lll_lock_wait () from target:/lib64/libpthread.so.0
#1  0x00007fea0b8dea44 in pthread_mutex_lock () from target:/lib64/libpthread.so.0
#2  0x00005653cd1da40d in qemu_mutex_lock_impl (mutex=0x5653cfa5e420, file=0x5653cd5352e0 "../util/async.c", line=645) at ../util/qemu-thread-posix.c:79
#3  0x00005653ccfeaf70 in virtio_scsi_acquire (s=0x5653d1333510) at /usr/src/debug/qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64/include/hw/virtio/virtio-scsi.h:136
#4  virtio_scsi_data_plane_handle_event (vdev=<optimized out>, vq=0x7fe9fc55b0a8) at ../hw/scsi/virtio-scsi-dataplane.c:84
#5  0x00005653cd030a6e in virtio_queue_notify_aio_vq (vq=<optimized out>) at ../hw/virtio/virtio.c:2326
#6  0x00005653cd1d5a9f in run_poll_handlers_once (timeout=<synthetic pointer>, now=4854163337034, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:397
#7  run_poll_handlers (timeout=<synthetic pointer>, max_ns=4000, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:493
#8  try_poll_mode (timeout=<synthetic pointer>, ctx=0x5653cfa5e3c0) at ../util/aio-posix.c:536
#9  aio_poll (ctx=0x5653cfa5e3c0, blocking=blocking@entry=true) at ../util/aio-posix.c:577
#10 0x00005653cd0c8eca in iothread_run (opaque=0x5653cf985f20) at ../iothread.c:73
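
The shape of this is the classic "wait while holding the lock the other side
needs" deadlock: the monitor thread appears to hold the iothread's AioContext
lock while waiting for progress, and the iothread's poll handler (frame #3
above) blocks trying to acquire that same lock. A rough Python analogue of the
pattern (illustrative only, none of these names exist in qemu; the actual root
cause is analyzed in bug 1903511):

    import threading

    ctx_lock = threading.Lock()       # stands in for the AioContext lock
    resize_done = threading.Event()   # stands in for block_resize completion

    def iothread_poll_handler():
        # Like virtio_scsi_data_plane_handle_event(), take the context lock
        # before doing any work. Here it blocks forever: the main thread
        # already holds the lock and will not release it.
        with ctx_lock:
            resize_done.set()         # never reached

    def qmp_block_resize():
        with ctx_lock:                # main thread takes the context lock...
            threading.Thread(target=iothread_poll_handler, daemon=True).start()
            # ...then waits for progress only the iothread can make.
            if not resize_done.wait(timeout=2):
                print("deadlock: iothread is stuck on the context lock")

    qmp_block_resize()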


Version-Release number of selected component (if applicable):
qemu-kvm-5.2.0-14.module+el8.4.0+10425+ad586fa5.x86_64
libvirt-daemon-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64


How reproducible:
Always (4/4)


Steps to Reproduce:
1. Start a VM with a qcow2 disk using virtio or virtio-scsi and an iothread
2. Try to resize one of the qcow2 disks in RHV (a minimal libvirt-only reproducer is sketched below)
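
The RHV step should be reducible to a direct libvirt call. A minimal sketch
using the libvirt Python bindings, with the connection URI as an assumption
and the domain/disk names taken from the logs below:

    import libvirt

    conn = libvirt.open("qemu:///system")    # URI is an assumption
    dom = conn.lookupByName("resize-disk")   # VM name from the libvirt log

    new_size = 3 * 1024**3                   # 3 GiB, as requested in the vdsm log
    flags = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES

    # On an affected build this call never returns; on a fixed build it
    # completes almost immediately.
    dom.blockResize("vdb", new_size, flags=flags)
    print("resize returned")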


Actual results:
Libvirt's blockResize call does not return.
In the vdsm log, a warning about a blocked worker appears every 60 seconds.
The VM cannot be stopped.
The qemu process cannot be terminated with a plain kill.
The only way to recover is to kill it with kill -9.


Expected results:
The disk is resized quickly.


Additional info:

Workaround: disable the iothread (e.g. drop the iothread attribute from the disk's <driver> element in the domain XML).


Vdsm log:

2021-04-11 18:29:40,871+0300 INFO  (jsonrpc/7) [api.virt] START diskSizeExtend(driveSpecs={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '31fdb830-422f-4f64-8d24-ae54bfcaca27', 'imageID': 'b48e41d8-0b37-4511-bf81-bfadb27f01ee', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, newSize='3221225472') from=::ffff:192.168.122.11,52826, flow_id=098c702f-3008-4c0a-b4c6-956a4c6d5dbf, vmId=46ecb982-50c6-406a-bc5c-16bf9b842ea4 (api:48)

Vdsm calls libvirt like this:

    flags = libvirt.VIR_DOMAIN_BLOCK_RESIZE_BYTES
    self.log.debug("Calling blockResise drive=%s size=%s flags=%s",
                   drive.name, newSizeBytes, flags)
    self._dom.blockResize(drive.name, newSizeBytes, flags=flags)


Libvirt log:

2021-04-11 15:29:40.895+0000: 34970: debug : virDomainBlockResize:6057 : dom=0x7fc0c800b460, (VM: name=resize-disk, uuid=46ecb982-50c6-406a-bc5c-16bf9b842ea4), disk=vdb, size=3221225472, flags=0x1

2021-04-11 15:29:40.895+0000: 34970: info : qemuMonitorSend:950 : QEMU_MONITOR_SEND_MSG: mon=0x7fc11804a1b0 msg={"execute":"block_resize","arguments":{"node-name":"libvirt-2-format","size":3221225472},"id":"libvirt-435"} fd=-1

No reply seen for id libvirt-435.
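
To rule libvirt out, the same command can be sent over a QMP socket directly.
A hedged sketch, assuming the guest exposes a QMP unix socket at /tmp/qmp.sock
(path is an assumption) and ignoring interleaved QMP events for brevity:

    import json
    import socket

    s = socket.socket(socket.AF_UNIX)
    s.connect("/tmp/qmp.sock")
    f = s.makefile("rw")

    f.readline()                          # discard the QMP greeting banner

    def cmd(c):
        f.write(json.dumps(c) + "\n")
        f.flush()
        return json.loads(f.readline())

    cmd({"execute": "qmp_capabilities"})  # leave capabilities-negotiation mode
    # On an affected build the next readline() blocks forever: the reply to
    # block_resize never arrives, matching the missing libvirt-435 reply.
    print(cmd({"execute": "block_resize",
               "arguments": {"node-name": "libvirt-2-format",
                             "size": 3221225472}}))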

Then many errors like:

2021-04-11 15:30:18.532+0000: 34971: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)
...
2021-04-11 15:31:53.614+0000: 34967: warning : qemuDomainObjBeginJobInternal:953 : Cannot start job (query, none, none) for domain resize-disk; current job is (modify, none, none) owned by (34970 remoteDispatchDomainBlockResize, 0 <null>, 0 <null> (flags=0x0)) for (132s, 0s, 0s)
...
2021-04-11 15:44:24.066+0000: 34968: error : qemuDomainObjBeginJobInternal:975 : Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainBlockResize)

Comment 1 CongLi 2021-04-12 00:57:07 UTC

*** This bug has been marked as a duplicate of bug 1903511 ***

Comment 2 Nir Soffer 2021-04-12 09:56:01 UTC
(In reply to CongLi from comment #1)
Thanks CongLi,

But note that you converted a bug with:
- Severity: Urgent
- Dependent Products:	Red Hat Enterprise Virtualization Manager

To a bug with:
- Severity: High
- Dependent Products:	(none)

This bug breaks basic functionality in RHV, in a very bad way,
and we hope to get it fixed in 8.4.

Comment 3 CongLi 2021-04-12 10:44:56 UTC
(In reply to Nir Soffer from comment #2)
> (In reply to CongLi from comment #1)
> Thanks CongLi,
> 
> But note that you converted a bug with:
> - Severity: Urgent
> - Dependent Products:	Red Hat Enterprise Virtualization Manager
> 
> To a bug with:
> - Severity: High
> - Dependent Products:	(none)
> 
> This bug breaks basic functionality in RHV, in a very bad way,
> and we hope to get it fixed in 8.4.

Thanks Nir, I will take care of the info next time when closing a bug, and will update the corresponding bug.

