Description of problem:
A VM with 400 external snapshots abnormally shuts down after libvirtd is restarted, while an internal snapshot can still be created for the shut-off VM.

Version-Release number of selected component (if applicable):
libvirt-3.9.0-5.el7.x86_64
qemu-kvm-rhev-2.9.0-16.el7_4.12.x86_64
kernel-3.10.0-801.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
Step 1. Prepare a shut-off VM (without snapshots) with the following configuration:
# virsh domstate test
shut off

# virsh dumpxml test --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

Step 2. Start the VM and check its snapshot info:
# virsh start test
Domain test started

# virsh domstate test
running

# virsh snapshot-list test
 Name                 Creation Time             State
------------------------------------------------------------

# virsh snapshot-list test |wc -l
3

Step 3. Create 400 external snapshots for the VM:
# for i in {1..400}; do virsh snapshot-create-as test s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
...
Domain snapshot s399 created
Domain snapshot s400 created

# virsh snapshot-list test |wc -l
403

Step 4. Check the status of the VM and libvirtd:
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 3     test                           running

# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-12-11 14:18:23 CST; 5min ago

Step 5. Restart libvirtd, then check the status of the VM and libvirtd again:
# systemctl restart libvirtd
# systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-12-11 14:23:33 CST; 1s ago

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     test                           shut off

Step 6. Create an internal disk snapshot for the shut-off VM:
# virsh snapshot-create-as test
Domain snapshot 1512973444 created

# virsh snapshot-info test 1512973444
Name:           1512973444
Domain:         test
Current:        yes
State:          shutoff
Location:       internal

Actual results:
As Step 5 shows, after restarting libvirtd the running VM is reported as shut off.

Expected results:
In Step 5, the VM should still be in running state.

Additional info:
Additional info after Step 5 in the bug description:
Trying to start the VM raises the following error:

# virsh start test
error: Failed to start domain test1
error: Cannot open log file: '/var/log/libvirt/qemu/test.log': Device or resource busy
This happens because libxml2 refuses to parse nested elements with depth > 255. Unfortunately the only thing we can do here is refuse the snapshot if the chain is too deep.
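For reference, the libxml2 limit can be reproduced outside libvirt. A minimal sketch, assuming xmllint is available (the exact error text can vary between libxml2 versions):

# { for i in $(seq 1 300); do printf '<n>'; done; for i in $(seq 1 300); do printf '</n>'; done; } > /tmp/deep.xml
# xmllint --noout /tmp/deep.xml
/tmp/deep.xml:1: parser error : Excessive depth in document: 256 use XML_PARSE_HUGE option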
Hi Peter, it actually seems that 200 snapshots also cause a problem, while with 400 snapshots, after restarting libvirtd, the VM shows as shut off in 'virsh domstate' but the qemu process still exists. Could you please help check whether these are the same problem?

Test scenario for 200 snapshots:
libvirt-3.9.0-6.el7.x86_64
kernel-3.10.0-820.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64

# virsh domstate test1
shut off

# virsh dumpxml test1 --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

# virsh start test1
Domain test1 started

# virsh domstate test1
running

# for i in {1..200}; do virsh snapshot-create-as test1 test1s$i --disk-only; done
Domain snapshot test1s1 created
Domain snapshot test1s2 created
...
Domain snapshot test1s199 created
Domain snapshot test1s200 created

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 16:26:26 CST; 3h 37min ago

# virsh domstate test1
running

# systemctl restart libvirtd
# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 20:04:40 CST; 2s ago

# virsh domstate test1
running

# virsh snapshot-create-as test1 ss --disk-only
error: Timed out during operation: cannot acquire state change lock (held by qemuProcessReconnect)

Test scenario for 400 snapshots:
libvirt-3.9.0-6.el7.x86_64
kernel-3.10.0-820.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64

# virsh domstate test1
shut off

# virsh dumpxml test1 --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

# virsh start test1
Domain test1 started

# virsh domstate test1
running

# for i in {1..400}; do virsh snapshot-create-as test1 test1s$i --disk-only; done
Domain snapshot test1s1 created
Domain snapshot test1s2 created
...
Domain snapshot test1s399 created
Domain snapshot test1s400 created

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 16:56:57 CST; 3h 15min ago

# virsh domstate test1
running

# systemctl restart libvirtd
# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 20:12:42 CST; 2s ago

# virsh domstate test1
shut off

# ps -ef |grep test1
qemu      3832     1 12 20:10 ?        00:00:19 /usr/libexec/qemu-kvm -name guest=test1,debug-threads=on
That looks like a different problem. Looks like the reconnect thread got stuck or is taking some time to finish for some reason. The difference is that with 260+ snapshots the XML parser returns an error, thus the VM is dropped completely. In this case the status XML was parsed properly but something else broke while reconnecting. In such a case a debug log would be handy.
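To collect one, debug logging can be enabled in /etc/libvirt/libvirtd.conf before restarting the daemon. A minimal sketch; the filter set and log path here are just examples, not the only valid values:

log_filters="1:qemu 1:libvirt"
log_outputs="1:file:/var/log/libvirtd-debug.log"

# systemctl restart libvirtd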
Hi Peter, thanks for your checking. I filed another bug and uploaded the 'gstack `pidof libvirtd`' info as an attachment:
https://bugzilla.redhat.com/show_bug.cgi?id=1525496
This bug is going to be addressed in the next major release.
Fixed upstream:

commit 510d154a0b41aa70aadabc0918d16dee22882394
Author: Peter Krempa <pkrempa>
Date:   Wed Sep 4 16:58:08 2019 +0200

    qemu: Prevent storage causing too much nested XML

    Since libvirt stores the backing chain into the XML in a nested way it
    is the prime possibility to hit libxml2's parsing limit of 256 layers.

    Introduce code which will crawl the backing chain and verify that it's
    not too deep. The maximum nesting is set to 200 layers so that there's
    still some space left for additional properties or nesting into
    snapshot XMLs.

    The check is applied to all disk use cases (starting, hotplug, media
    change) as well as block copy which changes image and snapshots.

    We simply report an error and refuse the operation. Without this check
    a restart of libvirtd would result in the status XML failing to be
    parsed and thus losing the VM.

v5.7.0-14-g510d154a0b
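As a rough way to check an image's chain from the shell before handing it to libvirt, the layers can be counted with qemu-img. A sketch, assuming qemu-img is available and /path/to/top.qcow2 is a placeholder path; 'qemu-img info --backing-chain' prints one 'image:' line per layer:

# depth=$(qemu-img info --backing-chain /path/to/top.qcow2 | grep -c '^image:')
# [ "$depth" -gt 200 ] && echo "chain is $depth layers deep; libvirt with this fix will refuse it"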
Hi Peter, when testing with the latest libvirt & qemu:
libvirt-6.0.0-1.module+el8.2.0+5453+31b2b136.x86_64
qemu-kvm-4.2.0-6.module+el8.2.0+5453+31b2b136.x86_64

it seems that only 194 snapshots can be created, and after that qemu returns the error "No complete monitor response found in 10485760 bytes: Numerical result out of range". So the 200-snapshot error can no longer be triggered. Should libvirt report a friendlier error, or is this expected? Thx.

[root@dell-per730-67 ~]# virsh start vm1
Domain vm1 started

[root@dell-per730-67 ~]# virsh domblklist vm1
Target     Source
------------------------------------------------------------------------
vda        /var/lib/avocado/data/avocado-vt/images/jeos-27-x86_64.qcow2

[root@dell-per730-67 ~]# for i in {1..202}; do virsh snapshot-create-as vm1 s$i --disk-only; done
Domain snapshot s2 created
Domain snapshot s3 created
Domain snapshot s4 created
Domain snapshot s5 created
...
Domain snapshot s193 created
Domain snapshot s194 created
error: No complete monitor response found in 10485760 bytes: Numerical result out of range
error: No complete monitor response found in 10485760 bytes: Numerical result out of range
...
error: No complete monitor response found in 10485760 bytes: Numerical result out of range

[root@dell-per730-67 ~]# echo "" > /var/log/libvirtd-debug.log
[root@dell-per730-67 ~]# virsh snapshot-create-as vm1 s195 --disk-only
error: No complete monitor response found in 10485760 bytes: Numerical result out of range

The libvirtd debug log contains the following:
...
2020-01-19 13:44:46.520+0000: 5267: error : qemuMonitorIORead:489 : No complete monitor response found in 10485760 bytes: Numerical result out of range
2020-01-19 13:44:46.520+0000: 5267: debug : qemuMonitorIO:666 : Error on monitor No complete monitor response found in 10485760 bytes: Numerical result out of range
2020-01-19 13:44:46.521+0000: 5267: info : virEventPollUpdateHandle:147 : EVENT_POLL_UPDATE_HANDLE: watch=11 events=12
2020-01-19 13:44:46.521+0000: 5267: debug : virEventPollInterruptLocked:723 : Skip interrupt, 1 140682715274240
2020-01-19 13:44:46.521+0000: 5267: debug : qemuMonitorIO:697 : Triggering error callback
2020-01-19 13:44:46.521+0000: 5267: debug : qemuProcessHandleMonitorError:341 : Received error on 0x7ff2d82a5ec0 'vm1'
Well, the problem is that the response to 'query-named-block-nodes' is too long now. This is due to the nesting of structs returned by query-named-block-nodes. I've filed https://bugzilla.redhat.com/show_bug.cgi?id=1793036 which should track the replacement of that command by something more sane. (The qemu part is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1533105 )

Additionally, the error message reported here is rather hairy. I'll send a patch.

Also, I've noticed that the limit is not properly applied for external inactive snapshots. I'll send a patch for that as well. After that, those errors will be one of the ways to verify this until we can get saner responses from qemu.
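To gauge how large that reply actually gets, the command can be issued through the monitor passthrough and its size measured. A sketch, assuming a running domain named vm1 (the name is illustrative); compare the byte count against the 10485760-byte monitor buffer:

# virsh qemu-monitor-command vm1 '{"execute":"query-named-block-nodes"}' | wc -c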
(In reply to Peter Krempa from comment #11)
> Well, the problem is that the response to 'query-named-block-nodes' is too
> long now. This is due to the nesting of structs returned by
> query-named-block-nodes. I've filed
> https://bugzilla.redhat.com/show_bug.cgi?id=1793036 which should track the
> replacement of that command by something more sane. (The qemu part is
> tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1533105 )
Fixed upstream:

c07ef7c563 qemu: snapshot: go through cleanup on error
8e9b168fa88b8 qemu: snapshot: Prevent too-nested domain XML when doing inactive snapshot
e73a984 qemu: snapshot: Always rewrite backingStore data when reusing existing images
29d43bf96a qemu: monitor: Improve error message when QEMU reply is too large
Verified with libvirt-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64

root@yisun-test1 ~ 07:46:03$ rpm -qa | grep libvirt-6
libvirt-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64

1. With an active VM:

root@yisun-test1 /etc/yum.repos.d 04:32:49$ virsh dumpxml avocado-vt-vm1 --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>

root@yisun-test1 /etc/yum.repos.d 04:33:30$ virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

root@yisun-test1 /etc/yum.repos.d 04:35:53$ echo "" > /var/log/libvirtd-debug.log

root@yisun-test1 /etc/yum.repos.d 04:36:02$ for i in {1..202}; do virsh snapshot-create-as avocado-vt-vm1 snap_$i --disk-only; done
Domain snapshot snap_1 created
Domain snapshot snap_2 created
Domain snapshot snap_3 created
Domain snapshot snap_4 created
...
Domain snapshot snap_195 created
Domain snapshot snap_196 created
error: internal error: QEMU monitor reply exceeds buffer size (10485760 bytes)
...
error: internal error: QEMU monitor reply exceeds buffer size (10485760 bytes)
<=== error message is clear now

2. With an inactive VM:

root@yisun-test1 ~ 07:43:56$ virsh define /tmp/avocado-vt-vm1.xml
Domain avocado-vt-vm1 defined from /tmp/avocado-vt-vm1.xml

root@yisun-test1 ~ 07:44:01$ virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

root@yisun-test1 ~ 07:44:10$ virsh destroy avocado-vt-vm1
Domain avocado-vt-vm1 destroyed

root@yisun-test1 ~ 07:44:14$ for i in {1..202}; do virsh snapshot-create-as avocado-vt-vm1 snap_$i --disk-only; done
Domain snapshot snap_1 created
Domain snapshot snap_2 created
Domain snapshot snap_3 created
...
Domain snapshot snap_200 created
Domain snapshot snap_201 created
Domain snapshot snap_202 created

root@yisun-test1 ~ 07:45:15$ virsh start avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Operation not supported: backing chains more than 200 layers deep are not supported for disk 'vda'
<=== error message is clear now
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017