Created attachment 1261635 [details]
libvirt log

Description of problem:

When storage used by a VM is slow, retrieving VM storage stats may block for some time. This causes the virConnectGetAllDomainStats libvirt call to block instead of returning the stats immediately. Thus one VM with problematic storage may block retrieving stats for other VMs, and the management software may lose track of VM states and stats.

Version-Release number of selected component (if applicable):

qemu-kvm-rhev-2.6.0-28.el7_3.6.x86_64
libvirt-2.0.0-10.el7_3.5.x86_64

How reproducible:

Always.

Steps to Reproduce (using virt-manager):
1. Add a slow disk to a VM (I used a newly created qcow2 image residing on slowfs: https://github.com/nirs/slowfs).
2. Start the VM and mount the slow disk in it.
3. Make sure the disk is really slow at this moment -- I set the `getattr' delay in slowfs.cfg to 10 seconds.
4. On the host, run `virsh domstats DOMAIN'. The call blocks for 10 seconds before returning output.

Actual results:

The virsh domstats command blocks for some time.

Expected results:

The virsh domstats command returns immediately.

Additional info:

I attach libvirtd.log demonstrating the problem. Thread 14370 freezes for 10 seconds after the following line:

2017-03-09 16:00:30.359+0000: 14370: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f74740033f0 msg={"execute":"query-block","id":"libvirt-38"}
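The failure mode can be sketched with a toy model (the function names and delays below are hypothetical stand-ins, not libvirt code): block stats are gathered by sending a QEMU monitor command per domain, and the reply from a domain whose storage is hung only arrives after the storage delay, so a serial collection loop stalls for all domains at once.

```python
import time

# Hypothetical stand-in for one per-domain stats query: libvirt sends
# "query-block" to the domain's QEMU monitor, and a domain with hung
# storage only replies after the storage delay elapses.
def query_block_stats(domain, delay):
    time.sleep(delay)  # models the slow getattr on the slowfs-backed image
    return {"domain": domain, "block.count": 1}

def get_all_domain_stats(domains):
    # A serial loop like this is why one slow domain delays stats for all:
    # the call cannot return until every monitor reply has arrived.
    return [query_block_stats(name, delay) for name, delay in domains]

start = time.monotonic()
stats = get_all_domain_stats([("fast-vm", 0.0), ("slow-vm", 0.2)])
elapsed = time.monotonic() - start
```

Even though fast-vm answers instantly, the caller sees nothing until slow-vm's delay has passed, which is exactly what the libvirtd.log excerpt shows for thread 14370.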
Looks like a duplicate (variation) of Bug 665820: "[RFE] Send monitor event upon stuck storage". If that's indeed the case, there's not much we can do at the moment. Making it asynchronous requires substantial architecture changes in the way QEMU works and is not planned for the short term. See https://bugzilla.redhat.com/show_bug.cgi?id=665820#c18 and https://bugzilla.redhat.com/show_bug.cgi?id=665820#c20 for more details. Stefan: Just like Bug 665820 (assigned to you), I'm deferring this to 7.5. If I'm mistaken and this BZ can actually be fixed or worked around somehow, then please forgive my ignorance and feel free to fix it in 7.4. :-)
Actually, this bug is described in terms of libvirt functions, so I'm not sure why it's filed against qemu-kvm-rhev rather than libvirt.

(In reply to Milan Zamazal from comment #0)
> Actual results:
>
> virsh domstats command blocks for some time.
>
> Expected results:
>
> virsh domstats command returns immediately.

More importantly, however, this bug report doesn't even tell _what_ should be returned immediately. Do you want a timeout, and then get only an error? Because obviously the real information is only available after some time. What's the actual use case behind this request?
Re-assigning to libvirt, since the problem is described in terms of libvirt APIs. We can open QEMU bugs later if we find that fixing libvirt requires new functionality in QEMU.
The actual use case is that RHV queries VM stats periodically via the virConnectGetAllDomainStats call. If that call blocks, we lose track of what's happening with the VMs on the host and the user no longer receives information about them.

As for virConnectGetAllDomainStats, I would expect it to respond immediately, or perhaps after some timeout. Handling unexpectedly blocking calls is always a bit of a problem. The response should definitely contain stats for all the unaffected VMs, and preferably also all immediately available stats for the VMs with problematic storage. Receiving incomplete current stats immediately is typically better than receiving complete but old stats after a long time, or not receiving any stats at all.

(Sorry for filing the bug against QEMU; we guessed it was likely a QEMU problem, but it's better to start from libvirt since the report is in terms of libvirt APIs. We'll pick a better initial target next time :-).)
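The expectation above — return the immediately available stats and skip the stuck VM — can also be approximated on the client side with a per-domain timeout. This is an illustrative pattern only, not RHV or libvirt code; all names are hypothetical, and each fetcher stands in for a real per-domain stats query.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def collect_stats_with_timeout(fetchers, timeout):
    """Run each per-domain stats fetcher in its own thread and skip any
    domain whose fetch does not finish within `timeout` seconds.

    `fetchers` maps a domain name to a zero-argument callable returning
    that domain's stats dict."""
    results, skipped = {}, []
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout)
            except TimeoutError:
                skipped.append(name)  # stats not available right now
    # Note: Python cannot cancel a running thread, so pool shutdown still
    # waits for the slow fetcher; a real client would keep the pool alive
    # across polling cycles or use a separate worker process instead.
    return results, skipped

results, skipped = collect_stats_with_timeout(
    {"fast-vm": lambda: {"block.count": 1},
     "slow-vm": lambda: (time.sleep(0.5), {"block.count": 1})[1]},
    timeout=0.1)
```

The fast domain's stats come back within the timeout while the wedged domain is merely reported as skipped, which matches the "incomplete current stats immediately" preference stated above.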
Although there is an architectural challenge in QEMU which is hard to solve, this specific case of querying block stats might be amenable to a non-blocking approach. Dou Liyang <douly.fnst.com> from Fujitsu recently worked on improving the performance of query-blockstats. There has been discussion about making the block stats accessible without a lock, so the monitor command can complete unhindered even if the storage is hung. Feel free to clone a child bug for qemu-kvm-rhev and we'll investigate whether blocking can be avoided.
(In reply to Milan Zamazal from comment #0)
> Steps to Reproduce (using virt-manager):
> 1. Add some slow disk to a VM (I used a newly created qcow2 image residing
> on slowfs: https://github.com/nirs/slowfs).

Hello, I tried your scenario on libvirt-4.5.0-10.el7_6.3.x86_64 and qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64.

When I cold-plug the slowfs disk into the VM and start it, I get this error:

# virsh -k0 start Q35
error: Failed to start domain Q35
error: internal error: qemu unexpectedly closed the monitor: 2018-12-28T08:54:58.936295Z qemu-kvm: -drive file=/mnt/a,format=raw,if=none,id=drive-virtio-disk2: Could not open '/mnt/a': Permission denied

When I hot-plug the disk into the running VM, I get this error:

# virsh attach-disk Q35 /mnt/a vdc
error: Failed to attach disk
error: internal error: unable to execute QEMU command '__com.redhat_drive_add': Device 'drive-virtio-disk2' could not be initialized

SELinux is Permissive and the disk file permissions are 777. Could you tell me how you started the VM with the slowfs disk?
Hi Han, this is probably a problem with FUSE permissions. Please make sure that:

- You run slowfs with the --allow-other option.
- user_allow_other is enabled in /etc/fuse.conf.
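For reference, the two settings look roughly like this; the slowfs invocation shown is a hypothetical example of a typical FUSE command line, so check the slowfs README for the exact syntax.

```shell
# 1) In /etc/fuse.conf, uncomment or add the line:
#        user_allow_other
#    Without it, FUSE rejects --allow-other for non-root users.
#
# 2) Start slowfs with --allow-other so that the qemu process, which runs
#    as a different user, can open files on the FUSE mount
#    (hypothetical invocation; see the slowfs README):
#        slowfs --allow-other <backing-dir> <mountpoint>
```

Without both settings, only the mounting user can access the mount, which matches the "Permission denied" qemu reports above even with 777 file permissions and permissive SELinux.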
There is a related vdsm bug: Bug 1613514 - [RFE] send --nowait to libvirt when we collect qemu stats, to consume bz#1552092. --nowait means "report only stats that are accessible instantly", which implies that information requiring the QEMU monitor will not be returned.
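On a host with libvirt >= 4.5.0 this behavior can be tried directly from the shell; the option name is as described in the RFE above.

```shell
# Report only stats that are available without waiting on a possibly
# stuck QEMU monitor; monitor-derived stats are simply omitted for
# domains whose storage is wedged:
#     virsh domstats --nowait DOMAIN
```

With this flag, one VM with hung storage no longer delays stats collection for the other VMs, which is exactly the behavior requested in comment #0.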
This bug is going to be addressed in the next major release.
This bug was closed as deferred as a result of bug triage. Please reopen if you disagree, and provide justification for why this bug should get enough priority. Most important would be information about the impact on a customer or layered product. Please indicate the requested target release.