Bug 2170472
Summary: [qemu] snapshot is failing with error libvirt.libvirtError: internal error: QEMU monitor reply exceeds buffer size (10485760 bytes)
Product: Red Hat Enterprise Linux 8
Component: libvirt
Version: 8.6
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Marian Jankular <mjankula>
Assignee: Peter Krempa <pkrempa>
QA Contact: Han Han <hhan>
CC: aesteve, ahadas, bzlotnik, coli, hhan, jdenemar, jinzhao, jsuchane, juzhang, lmen, michal.skrivanek, mtessun, pkrempa, vgoyal, virt-maint, xuwei, xuzhang, yisun, ymankad
Target Milestone: rc
Target Release: ---
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-8.0.0-20.module+el8.9.0+18894+a91bbb94
Doc Type: If docs needed, set a value
Clones: 2181575, 2181576, 2181577 (view as bug list)
Bug Blocks: 2181575, 2181576, 2181577
oVirt Team: Storage
Type: Bug
Last Closed: 2023-11-14 15:33:25 UTC
Deadline: 2023-04-03
Description
Marian Jankular
2023-02-16 13:07:17 UTC
(In reply to Marian Jankular from comment #0)
> Expected results:
> according to https://access.redhat.com/solutions/3123531 in 4.1 we supported
> up to 90 snapshots - did we decrease this limit in 4.4?

No, we didn't decrease the number of snapshots in 4.4 - this limit was actually dropped back in 2017 [1].

[1] https://github.com/oVirt/ovirt-engine/commit/dac5b69d1cbd01cd27d6643cf6d0e03ca98e69fc

Benny, Albert - could this also be a consequence of enabling incremental backup?

Marian, I see that you were able to reproduce it - could you please try the procedure that was used to verify bz 1524278 [1] and see if it makes any difference when incremental backup is not set?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1524278#c15

Hello Arik,

the reproducer I did was without incremental backup set for the disk. Regarding https://bugzilla.redhat.com/show_bug.cgi?id=1524278#c15 - do you want me to reproduce it on plain KVM, as I was reproducing on RHVH with a block-based SD?

(In reply to Marian Jankular from comment #5)
> the reproducer I did was without incremental backup set for the disk,
> regarding the https://bugzilla.redhat.com/show_bug.cgi?id=1524278#c15 - do
> you want me to reproduce it on plain kvm as i was reproducing on RHVH with
> blockbased SD

Ack, thanks Marian, I don't think that would be necessary in this case - I think it is likely to be a regression on the platform side.

Not sure if that's the right assignment, but we suspect this is an issue on the platform side: as part of the verification of bug 1524278 we only hit this error when making almost 200 snapshots, while here we seem to get it after 50 snapshots.

The error message actually comes from libvirt and is triggered when qemu's response object is more than 10MiB, to prevent unbounded memory consumption which could be triggered by a compromised qemu. In normal use this should not happen.
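As a standalone illustration (not taken from the bug report): in non-flat mode each named block node repeats its entire backing chain under 'backing-image', so the reply grows roughly quadratically with the length of the snapshot chain, while flat mode lists every node exactly once. A toy Python model of that growth, with made-up field names and sizes:

```python
import json

def node(name, backing=None):
    """Toy stand-in for one entry of qemu's query-named-block-nodes reply.

    The 200-byte padding stands in for the many per-node attributes a
    real reply carries; all values here are assumptions for illustration.
    """
    n = {"node-name": name, "drv": "qcow2", "padding": "x" * 200}
    if backing is not None:
        # non-flat mode: the whole backing chain is nested again here
        n["backing-image"] = backing
    return n

def reply_size(depth, flat):
    """Approximate JSON reply size for a backing chain of `depth` nodes."""
    nodes = []
    for i in range(depth):
        if flat:
            nodes.append(node(f"snap{i}"))
        else:
            chain = None
            for j in range(i + 1):
                chain = node(f"snap{j}", backing=chain)
            nodes.append(chain)  # node i redundantly embeds nodes 0..i-1
    return len(json.dumps(nodes))

for depth in (50, 100, 200):
    print(depth, reply_size(depth, flat=True), reply_size(depth, flat=False))
```

With ~50 snapshots per disk and realistic per-node metadata (far more than the 200 bytes assumed here), a quadratic reply can plausibly blow past libvirt's 10 MiB cap while the flat reply stays tiny.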
I've witnessed and fixed this some time ago by changing how we invoke 'query-named-block-nodes'. In the version here it should now be fixed. Unfortunately the sosreport doesn't seem to contain libvirt logs at all. Is there a possibility I could get a debug log from the libvirt daemon? The documentation for setting up logging is here:

https://libvirt.org/kbase/debuglogs.html#log-outputs

Please use the filters from the following example:

https://libvirt.org/kbase/debuglogs.html#targeted-logging-for-debugging-qemu-vms

I'll assign it to libvirt for now until I investigate what's happening.

Not reproduced for now on rhel8.6 with:
libvirt-8.0.0-5.7.module+el8.6.0+18204+67eb6a69.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+18167+43cf40f3.8.x86_64

Steps:
1. Have a running VM with 2 disks:

[root@dell-per730-66 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------------------------------
 sda      /rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.qcow2
 sdb      /rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.qcow2

2. Prepare the snapshot XML, since RHV uses snapshot XML to create snapshots:

[root@dell-per730-66 ~]# cat snap_pattern.xml
<domainsnapshot>
  <description>Snapshot test</description>
  <name>#snap_name#</name>
  <memory snapshot='external' file='#mem_path#'/>
  <disks>
    <disk name="sda" snapshot="external" type="file">
      <source file="#sda_snap_path#"/>
    </disk>
    <disk name="sdb" snapshot="external" type="file">
      <source file="#sdb_snap_path#"/>
    </disk>
  </disks>
</domainsnapshot>

3. Prepare a test shell script to generate 100 external mem+disk snapshots.
--reuse-external and --no-metadata are involved because RHV turns on these flags:

[root@dell-per730-66 ~]# cat test.sh
#!/bin/sh
vm_name="vm1"
snap_xml_pattern="snap_pattern.xml"
test_dir="/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/"

for i in {1..100}
do
    snap_name="snap"$i
    sda_snap_path=$test_dir"sda."$snap_name
    sdb_snap_path=$test_dir"sdb."$snap_name
    mem_path=$test_dir"mem."$snap_name
    snap_xml=$test_dir$snap_name".xml"
    # prepare snapshot images because --reuse-external will be used
    qemu-img create -f qcow2 $sda_snap_path 1G
    qemu-img create -f qcow2 $sdb_snap_path 1G
    # prepare snapshot xml
    sed -e "s,#snap_name#,$snap_name,g" $snap_xml_pattern > $snap_xml
    sed -i -e "s,#mem_path#,$mem_path,g" $snap_xml
    sed -i -e "s,#sda_snap_path#,$sda_snap_path,g" $snap_xml
    sed -i -e "s,#sdb_snap_path#,$sdb_snap_path,g" $snap_xml
    # start to create snapshot
    virsh snapshot-create $vm_name $snap_xml --reuse-external --no-metadata
done

4.
100 snapshots created without errors:

[root@dell-per730-66 ~]# sh test.sh
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.snap1', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.snap1', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Domain snapshot snap1 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap1.xml'
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.snap2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.snap2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Domain snapshot snap2 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap2.xml'
...
...
...
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.snap100', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.snap100', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Domain snapshot snap100 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap100.xml'

Also tried with the same libvirt version as comment 0, but not reproduced:

[root@dell-per730-66 ~]# rpm -qa | egrep "^libvirt-8|^qemu-kvm-6"
libvirt-8.0.0-5.6.module+el8.6.0+17751+d6559882.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+18167+43cf40f3.8.x86_64

[root@dell-per730-66 ~]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------------------------------
 sda      /rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.qcow2
 sdb      /rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.qcow2

[root@dell-per730-66 ~]# sh test.sh
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.snap1', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.snap1', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Domain snapshot snap1 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap1.xml'
...
...
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.snap100', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Formatting '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sdb.snap100', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=1073741824 lazy_refcounts=off refcount_bits=16
Domain snapshot snap100 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap100.xml'

Do you happen to have the full log? The log already starts with the monitor being broken. While I'll investigate why the error doesn't clear itself (as it should only apply to one operation), I'd really love to see what actually caused it.
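For reference, the kbase pages linked earlier boil down to a libvirtd.conf fragment along these lines (my paraphrase of the recommended settings, not copied from this bug; check the linked pages for the current recommendation):

```
# /etc/libvirt/libvirtd.conf -- targeted debug logging for QEMU VMs (sketch)
log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 1:*"
log_outputs="1:file:/var/log/libvirt/libvirtd-debug.log"
```

Restart libvirtd afterwards so the settings take effect.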
2023-03-15 13:05:42.811+0000: 1806689: debug : qemuMonitorSend:901 : Attempt to send command while error is set internal error: QEMU monitor reply exceeds buffer size (10485760 bytes) mon=0x7fd1a41cb460 vm=0x7fd1a41190b0 name=mj-rhel82

Thank you for the log, I was able to pinpoint the issue. The last invoked command before the error was set is:

2023-03-15 15:27:20.512+0000: 1988581: info : qemuMonitorIOWrite:404 : QEMU_MONITOR_IO_WRITE: mon=0x7fb2841ca740 buf={"execute":"query-named-block-nodes","id":"libvirt-2751"}^M

This was done in:

219477 2023-03-15 15:27:20.511+0000: 1960336: debug : qemuMonitorBlockStatsUpdateCapacityBlockdev:2067 : mon:0x7fb2841ca740 vm:0x7fb2841186d0 fd:41

which gathers statistics in:

219459 2023-03-15 15:27:20.492+0000: 1960336: debug : virDomainListGetStats:12078 : doms=0x7fb2dc093ed0, stats=0x20, retStats=0x7fb2e1e8c8f8, flags=0x40000000

The problem is that libvirt doesn't set the 'flat' mode for 'query-named-block-nodes', in which case qemu returns a massive JSON. All other uses of 'query-named-block-nodes' already set this. Additionally I've fixed this problem recently upstream:

commit bbd4d4899391b3bd1906cce61a3634f42f4b1bdf
Author: Peter Krempa <pkrempa>
Date:   Wed Nov 9 11:06:25 2022 +0100

    qemuMonitorJSONBlockStatsUpdateCapacityBlockdev: Use 'flat' mode of query-named-block-nodes

    'query-named-block-nodes' in non-flat mode returns redundantly nested
    data under the 'backing-image' field. Fortunately we don't need it
    when updating the capacity stats.

    This function was unfortunately not fixed originally when the support
    for flat mode was added.

    Use the flag cached in the monitor object to force flat mode if
    available.
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Ján Tomko <jtomko>

diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c
index 8a3421a6fc..0e34cb55ad 100644
--- a/src/qemu/qemu_monitor_json.c
+++ b/src/qemu/qemu_monitor_json.c
@@ -2503,7 +2503,7 @@ qemuMonitorJSONBlockStatsUpdateCapacityBlockdev(qemuMonitor *mon,
 {
     g_autoptr(virJSONValue) nodes = NULL;

-    if (!(nodes = qemuMonitorJSONQueryNamedBlockNodes(mon, false)))
+    if (!(nodes = qemuMonitorJSONQueryNamedBlockNodes(mon, mon->queryNamedBlockNodesFlat)))
         return -1;

     if (virJSONValueArrayForeachSteal(nodes,

Reproduced on libvirt-8.0.0-5.6.module+el8.6.0+17751+d6559882.x86_64 with the above info.

1. Test script modified to use a VM with only one disk:

[root@dell-per740xd-23 tmp]# virsh domblklist vm1
 Target   Source
--------------------------------------------------------------------------------
 sda      /rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/sda.qcow2

[root@dell-per740xd-23 tmp]# cat snap_pattern.xml
<domainsnapshot>
  <description>Snapshot test</description>
  <name>#snap_name#</name>
  <memory snapshot='external' file='#mem_path#'/>
  <disks>
    <disk name="sda" snapshot="external" type="file">
      <source file="#sda_snap_path#"/>
    </disk>
  </disks>
</domainsnapshot>

[root@dell-per740xd-23 tmp]# cat test.sh
#!/bin/sh
vm_name="vm1"
snap_xml_pattern="snap_pattern.xml"
test_dir="/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/"
for i in {1..100}
do
    snap_name="snap"$i
    sda_snap_path=$test_dir"sda."$snap_name
    mem_path=$test_dir"mem."$snap_name
    snap_xml=$test_dir$snap_name".xml"
    # prepare snapshot image because --reuse-external will be used
    qemu-img create -f qcow2 $sda_snap_path 1G > /dev/null
    # prepare snapshot xml
    sed -e "s,#snap_name#,$snap_name,g" $snap_xml_pattern > $snap_xml
    sed -i -e "s,#mem_path#,$mem_path,g" $snap_xml
    sed -i -e "s,#sda_snap_path#,$sda_snap_path,g" $snap_xml
    # start to create snapshot
    virsh snapshot-create $vm_name $snap_xml --reuse-external --no-metadata
done

2. Start the VM and run test.sh:

[root@dell-per740xd-23 tmp]# sh test.sh
Domain snapshot snap1 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap1.xml'
Domain snapshot snap2 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap2.xml'
...
Domain snapshot snap99 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap99.xml'
Domain snapshot snap100 created from '/rhev/data-center/mnt/blockSD/6814fac1-2e9a-4ab3-bbfe-02a14c2049fe/images/f5dcca7c-84f4-49b7-9df7-966d402b4747/0b8814ed-562a-4af8-b82e-17adfa5a110d/snap100.xml'

3. Clear the libvirtd log:

[root@dell-per740xd-23 tmp]# echo "" > /var/log/libvirtd-debug.log

4.
Use domblkinfo to trigger the error:

[root@dell-per740xd-23 tmp]# virsh domblkinfo vm1 sda
error: internal error: QEMU monitor reply exceeds buffer size (10485760 bytes)

[root@dell-per740xd-23 tmp]# cat /var/log/libvirtd-debug.log | grep '{"execute":"query-named-block-nodes"'
2023-03-20 12:27:14.244+0000: 163002: info : qemuMonitorSend:916 : QEMU_MONITOR_SEND_MSG: mon=0x7f50f41c9460 msg={"execute":"query-named-block-nodes","id":"libvirt-21"}
2023-03-20 12:27:14.244+0000: 163104: info : qemuMonitorIOWrite:404 : QEMU_MONITOR_IO_WRITE: mon=0x7f50f41c9460 buf={"execute":"query-named-block-nodes","id":"libvirt-21"}
<==== no "flat":true set here for query-named-block-nodes

[root@dell-per740xd-23 tmp]# virsh domblkthreshold vm1 sda 100000
error: internal error: QEMU monitor reply exceeds buffer size (10485760 bytes)

Test on:
libvirt-8.0.0-5.9.module+el8.6.0+18552+b96c5a91.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+14712+f96656d3.x86_64

1. Create 199 disk external snapshots for a running VM.
2. Get the info of the disk with snapshots:

➜ ~ virsh domblkinfo vm1 vda
Capacity:       10737418240
Allocation:     15732736
Physical:       7536640

No error happens. Checking the QMP of domblkinfo, "arguments":{"flat":true} is added to the "query-named-block-nodes" command.

Test against libvirt-8.0.0-20.module+el8.9.0+18894+a91bbb94.x86_64
Test result: PASS

Using the test script in the attachment, the log is as follows:
...
...
========== Round_100: ==========
Do domblkinfo to trigger qmp cmd: query-named-block-nodes
Capacity:       1073741824
Allocation:     204800
Physical:       196624
Do snapshot creation
Domain snapshot snap100 created from '/tmp/snap100.xml'
Test PASSED, please double check that query-named-block-nodes has flat=true in the log below.

The query-named-block-nodes calls issued by libvirt are as follows:

2023-06-09 08:24:40.393+0000: 286876: info : qemuMonitorSend:868 : QEMU_MONITOR_SEND_MSG: mon=0x7facac1c65d0 msg={"execute":"query-named-block-nodes","arguments":{"flat":true},"id":"libvirt-1998"}
2023-06-09 08:24:40.393+0000: 287026: info : qemuMonitorIOWrite:354 : QEMU_MONITOR_IO_WRITE: mon=0x7facac1c65d0 buf={"execute":"query-named-block-nodes","arguments":{"flat":true},"id":"libvirt-1998"}
2023-06-09 08:24:41.544+0000: 286877: info : qemuMonitorSend:868 : QEMU_MONITOR_SEND_MSG: mon=0x7facac1c65d0 msg={"execute":"query-named-block-nodes","arguments":{"flat":true},"id":"libvirt-2000"}
2023-06-09 08:24:41.545+0000: 287026: info : qemuMonitorIOWrite:354 : QEMU_MONITOR_IO_WRITE: mon=0x7facac1c65d0 buf={"execute":"query-named-block-nodes","arguments":{"flat":true},"id":"libvirt-2000"}

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: virt:rhel and virt-devel:rhel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6980