Bug 1430847 - QEMU monitor blocking on slow storage
Summary: QEMU monitor blocking on slow storage
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: 8.1
Assignee: Virtualization Maintenance
QA Contact: Han Han
URL:
Whiteboard:
Depends On:
Blocks: 1427782 1758964
Reported: 2017-03-09 17:14 UTC by Milan Zamazal
Modified: 2020-02-11 13:27 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-11 13:27:46 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments
libvirt log (50.35 KB, application/x-xz)
2017-03-09 17:14 UTC, Milan Zamazal

Description Milan Zamazal 2017-03-09 17:14:39 UTC
Created attachment 1261635 [details]
libvirt log

Description of problem:

When the storage used by a VM is slow, retrieving VM storage stats may block for some time. This causes the virConnectGetAllDomainStats libvirt call to block instead of returning the stats immediately. Thus one VM with problematic storage may block retrieving stats for other VMs, and the management software may lose track of VM states and stats.

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
libvirt-2.0.0-10.el7_3.5.x86_64

How reproducible:

Always.

Steps to Reproduce (using virt-manager):
1. Add some slow disk to a VM (I used a newly created qcow2 image residing on slowfs: https://github.com/nirs/slowfs).
2. Start the VM and mount the slow disk in it.
3. Make sure the disk is really slow at this moment -- I set `getattr' delay in slowfs.cfg to 10 seconds.
4. On the host, run `virsh domstats DOMAIN'. The call blocks for 10 seconds before returning output.

Actual results:

virsh domstats command blocks for some time.

Expected results:

virsh domstats command returns immediately.

Additional info:

I attach libvirtd.log demonstrating the problem. Thread 14370 freezes for 10 seconds after the following line:

  2017-03-09 16:00:30.359+0000: 14370: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f74740033f0 msg={"execute":"query-block","id":"libvirt-38"}
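
A stall like this can be located in a libvirtd log by looking for long gaps between consecutive lines of the same thread. The following is a minimal sketch, assuming every log line starts with a "+0000" timestamp followed by the thread ID, as in the line above. The name find_stalls and the 5-second threshold are illustrative only, not part of libvirt.

```python
import re
from datetime import datetime, timedelta

# Start of a libvirtd log line: "<date> <time>+0000: <thread id>: <rest>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\+0000: (\d+): (.*)$"
)


def find_stalls(lines, min_gap=timedelta(seconds=5)):
    """Yield (thread_id, gap, message) for each silence of at least min_gap.

    `message` is the last thing the thread logged before going quiet.
    """
    last_seen = {}  # thread id -> (timestamp, message)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # wrapped continuation lines or unrelated output
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        thread_id, message = match.group(2), match.group(3)
        if thread_id in last_seen:
            previous_stamp, previous_message = last_seen[thread_id]
            gap = stamp - previous_stamp
            if gap >= min_gap:
                yield thread_id, gap, previous_message
        last_seen[thread_id] = (stamp, message)
```

Passing an open log file to find_stalls() should report thread 14370 together with the query-block message it sent before going quiet. A gap only shows that the thread logged nothing in that interval, so any hit still needs to be checked by hand.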

Comment 2 Ademar Reis 2017-03-10 12:32:41 UTC
Looks like a duplicate (variation) of Bug 665820: "[RFE] Send monitor event upon stuck storage".

If that's indeed the case, there's not much we can do at the moment. Making it asynchronous requires substantial architecture changes in the way QEMU works and is not planned for the short term. See https://bugzilla.redhat.com/show_bug.cgi?id=665820#c18 and https://bugzilla.redhat.com/show_bug.cgi?id=665820#c20 for more details.

Stefan: Just like Bug 665820 (assigned to you), I'm deferring this to 7.5.

If I'm mistaken and this BZ can actually be fixed or worked around somehow, then please forgive my ignorance and feel free to fix it in 7.4. :-)

Comment 3 Kevin Wolf 2017-03-10 13:42:35 UTC
Actually, this bug is described in terms of libvirt functions, so I'm not sure why
it's filed against qemu-kvm-rhev rather than libvirt.

(In reply to Milan Zamazal from comment #0)
> Actual results:
> 
> virsh domstats command blocks for some time.
> 
> Expected results:
> 
> virsh domstats command returns immediately.

More importantly however, this bug report doesn't even tell _what_ should be
returned immediately. Do you want a timeout and get only an error then? Because
obviously the real information is only available after some time.

What's the actual use case behind this request?

Comment 4 Daniel Berrangé 2017-03-10 13:46:22 UTC
Re-assigning to libvirt, since the problem is described in terms of libvirt APIs. We can open QEMU bugs later if we find that fixing libvirt requires new functionality in QEMU.

Comment 5 Milan Zamazal 2017-03-10 14:51:01 UTC
The actual use case is that we periodically query VM stats via the virConnectGetAllDomainStats call in RHV. If that blocks, we lose track of what's happening with the VMs on the host and the user no longer receives information about them.

As for virConnectGetAllDomainStats, I would expect it to respond immediately or perhaps after some timeout. Handling unexpectedly blocking calls is always a bit of a problem. The response should definitely contain stats for all the VMs without problems, and preferably also all immediately available stats for the VMs with problematic storage. Receiving incomplete current stats immediately is typically better than receiving complete but old stats after a long time, or not receiving any stats at all.
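
To illustrate the behaviour described above, here is a minimal client-side sketch. It assumes stats can be fetched per VM through some callable and that it is acceptable to leave a blocked query running in the background. The names collect_stats and demo_fetch are hypothetical. This is not how libvirt or RHV implement it, and it does not remove the blocking inside libvirt or QEMU.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def collect_stats(vm_names, fetch_stats, timeout=1.0):
    """Return (stats, pending) after waiting at most `timeout` seconds.

    `stats` maps VM name to the result of `fetch_stats(name)` for every VM
    that answered in time; `pending` lists the VMs whose query is still blocked.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(vm_names)))
    futures = {executor.submit(fetch_stats, name): name for name in vm_names}
    done, not_done = wait(futures, timeout=timeout)
    # Do not wait for stuck workers; they keep running in the background.
    executor.shutdown(wait=False)
    stats = {}
    for future in done:
        name = futures[future]
        try:
            stats[name] = future.result()
        except Exception as exc:  # report the failure instead of dropping the VM
            stats[name] = {"error": str(exc)}
    pending = sorted(futures[future] for future in not_done)
    return stats, pending


def demo_fetch(name):
    """Stand-in for a real stats query; 'slow-vm' simulates hung storage."""
    if name == "slow-vm":
        time.sleep(0.5)
    return {"name": name}
```

Calling collect_stats(["vm-a", "slow-vm", "vm-b"], demo_fetch, timeout=0.1) returns the stats for vm-a and vm-b and reports slow-vm as pending. The stuck worker thread is not cancelled, so repeated polling of a permanently hung VM would accumulate threads unless the caller skips VMs that are still pending.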

(Sorry for filing the bug against QEMU. We guessed that it was likely a QEMU problem, but it's better to start from libvirt when libvirt APIs are involved. I'll select the proper starting target next time :-).)

Comment 6 Stefan Hajnoczi 2017-03-16 08:31:40 UTC
Although there is an architectural challenge in QEMU which is hard to solve, this specific case of querying block stats might be amenable to a non-blocking approach.

Dou Liyang <douly.fnst.com> from Fujitsu recently worked on improving performance of query-blockstats.  There has been discussion about making the block stats accessible without a lock so the monitor command can complete unhindered even if the storage is hung.

Feel free to clone a child bug for qemu-kvm-rhev and we'll investigate if blocking can be avoided.

Comment 7 Han Han 2018-12-28 09:21:02 UTC
(In reply to Milan Zamazal from comment #0)

Hello, I tried your scenario on libvirt-4.5.0-10.el7_6.3.x86_64 and qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64.
When I cold-plug the slowfs disk into the VM and start the VM, I get this error:
# virsh -k0 start Q35        
error: Failed to start domain Q35
error: internal error: qemu unexpectedly closed the monitor: 2018-12-28T08:54:58.936295Z qemu-kvm: -drive file=/mnt/a,format=raw,if=none,id=drive-virtio-disk2: Could not open '/mnt/a': Permission denied

When I hot-plug the disk into the running VM, I get this error:
# virsh attach-disk Q35 /mnt/a vdc         
error: Failed to attach disk
error: internal error: unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized

SELinux is in permissive mode and the disk file permissions are 777. Could you tell me how you started the VM with the slowfs disk?

Comment 8 Milan Zamazal 2019-01-03 09:11:08 UTC
Hi Han, this is probably a problem with FUSE permissions. Please make sure that:

- You run slowfs with the --allow-other option.
- user_allow_other is enabled in /etc/fuse.conf.
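
The second point can be checked with a few lines of Python. This is only a sketch, assuming the usual fuse.conf syntax where "#" starts a comment. The function name is made up for illustration.

```python
def user_allow_other_enabled(fuse_conf_text):
    """Return True if an uncommented `user_allow_other` line is present."""
    for raw_line in fuse_conf_text.splitlines():
        # Drop anything after '#', then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if line == "user_allow_other":
            return True
    return False
```

For example, user_allow_other_enabled(open("/etc/fuse.conf").read()) should be true on a host where a non-root slowfs mount is meant to be accessible to QEMU.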

Comment 9 Fangge Jin 2019-02-18 09:19:43 UTC
There is a related bug on vdsm: Bug 1613514 - [RFE] send --nowait to libvirt when we collect qemu stats, to consume bz#1552092
--nowait means "report only stats that are accessible instantly", which implies that information requiring the QEMU monitor will not be returned.
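
A management script could pass the flag along these lines. This is a rough sketch, assuming a virsh recent enough to accept --nowait for domstats. The helper names and the 5-second guard timeout are illustrative and are not taken from vdsm.

```python
import subprocess


def domstats_command(domain=None, nowait=True):
    """Build the argv for `virsh domstats`, optionally with --nowait."""
    command = ["virsh", "domstats"]
    if nowait:
        command.append("--nowait")
    if domain is not None:
        command.append(domain)
    return command


def run_domstats(domain=None, nowait=True, timeout=5.0):
    """Run virsh and return its stdout, or None if it exceeds `timeout` seconds."""
    try:
        result = subprocess.run(
            domstats_command(domain, nowait),
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,  # raise CalledProcessError if virsh reports failure
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

With --nowait the output may lack the stats that need the monitor, so the caller has to treat missing fields as "unknown" rather than as zero.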

Comment 10 Jaroslav Suchanek 2019-04-24 12:26:43 UTC
This bug is going to be addressed in the next major release.

Comment 11 Jaroslav Suchanek 2020-02-11 13:27:46 UTC
This bug was closed deferred as a result of bug triage.

Please reopen if you disagree and provide justification why this bug should
get enough priority. Most important would be information about impact on
customer or layered product. Please indicate requested target release.

