Bug 1525496
Summary: libvirtd fails to reconnect to a qemu process after creating 200 snapshots in a chain

Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
Reporter: jiyan <jiyan>
Assignee: Peter Krempa <pkrempa>
QA Contact: yisun
CC: dyuan, fjin, hhan, jdenemar, jiyan, lmen, meili, pkrempa, xuzhang, yisun
Target Milestone: rc
Target Release: ---
Fixed In Version: libvirt-4.3.0-1.el7
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2018-10-30 09:52:20 UTC
Thread 2 (Thread 0x7f63f0d70700 (LWP 7695)):
#0  0x00007f643c9fe8f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f643f6bd62e in virCondWait (c=c@entry=0x7f63c80010c8, m=m@entry=0x7f63c80010a0) at util/virthread.c:154
#2  0x00007f63f47538b3 in qemuMonitorSend (mon=mon@entry=0x7f63c8001090, msg=msg@entry=0x7f63f0d6fa20) at qemu/qemu_monitor.c:1067
#3  0x00007f63f476ea18 in qemuMonitorJSONCommandWithFd (mon=mon@entry=0x7f63c8001090, cmd=cmd@entry=0x7f63c8001710, scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f63f0d6faa0) at qemu/qemu_monitor_json.c:300
#4  0x00007f63f47843d8 in qemuMonitorJSONCommand (reply=0x7f63f0d6faa0, cmd=0x7f63c8001710, mon=0x7f63c8001090) at qemu/qemu_monitor_json.c:330
#5  qemuMonitorJSONQueryNamedBlockNodes (mon=mon@entry=0x7f63c8001090) at qemu/qemu_monitor_json.c:7854
#6  0x00007f63f4763b26 in qemuMonitorQueryNamedBlockNodes (mon=0x7f63c8001090) at qemu/qemu_monitor.c:4322
#7  0x00007f63f46ac976 in qemuBlockNodeNamesDetect (driver=driver@entry=0x7f63d013b640, vm=0x7f63d0139d80, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE) at qemu/qemu_block.c:337
#8  0x00007f63f47392ec in qemuProcessReconnect (opaque=<optimized out>) at qemu/qemu_process.c:7249

According to the attached stack trace, libvirtd is waiting on a response from qemu to the command used to detect node names for the backing chain. Could you also attach a stack trace of qemu, and look for lines containing "QEMU_MONITOR_IO" in the libvirtd debug log and attach the last few of them?

Created attachment 1367754 [details]
gstack qemu info
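The size of the query-named-block-nodes reply discussed here is easy to underestimate. A small sketch (illustrative only, not libvirt or qemu code; the function name is invented) of why such a reply grows quadratically with chain depth, assuming each node's image description embeds its full nested backing chain:

```python
# Illustrative model (not libvirt/qemu code): in a backing chain of depth N,
# the node k layers above the base describes its own image plus the k - 1
# backing images below it, so a reply that re-reports all nested data
# contains about N * (N + 1) / 2 image descriptions in total.

def total_image_entries(depth):
    """Image descriptions in one reply for a chain of the given depth."""
    return sum(k for k in range(1, depth + 1))

# 200 external snapshots on top of one base image -> chain depth 201
print(total_image_entries(201))  # 20301 entries in a single reply
```

With debug logging enabled, every chunk of that oversized reply is also written to the log, which is what makes the reconnect so slow here.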
This is a libvirt issue which happens when debug logging (or info logging) is enabled. The reason is that when reconnecting, libvirt calls query-named-block-nodes, which produces a lot of output for a 200-layer-deep chain. I'll file a qemu bug to provide a replacement/fix for query-block and query-named-block-nodes so that they do not report all nested data over and over again. Additionally, I'll try to optimize the amount of logging done in the monitor.

Fixed upstream by removing unnecessary logging:

commit f10bb3347b43d900ff361cda5fe1996782284991
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 13:09:07 2017 +0100

    qemu: monitor: Decrease logging verbosity

    The PROBE macro used in qemuMonitorIOProcess and the VIR_DEBUG message
    in qemuMonitorJSONIOProcess create a lot of logging churn when debug
    logging is enabled during monitor communication. The messages logged
    from the PROBE macro are rather useless since they are reporting the
    partial state of receiving the reply from qemu. The actual full reply
    is still logged in qemuMonitorJSONIOProcessLine once the full message
    is received.

commit f06e488d5484031a76e7ed231c8fef8fa1181d2c
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 12:58:36 2017 +0100

    util: probe: Add quiet versions of the "PROBE" macro

    The PROBE macro adds a logging entry; when used in places seeing a lot
    of traffic this can cause a significant slowdown.

Tested on libvirt-4.4.0-2.el7.x86_64. Internal snapshots work as expected. For external snapshots, a VM with 253 or more snapshots is no longer controlled by libvirt after libvirtd is restarted:

1. Create 253 external snapshots.

2. Start the VM:
# virsh start RRR
Domain RRR started

3. Restart libvirtd:
# systemctl restart libvirtd

4. Check with virsh:
# virsh list
 Id    Name                           State
----------------------------------------------------

5. Check the VM process:
# ps aux | grep qemu
qemu      6901 13.9  5.8 2158940 462412 ?  Sl   11:05   0:18 /usr/libexec/qemu-kvm -name guest=RRR,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-RRR/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu SandyBridge-IBRS -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 5da1f3b9-611f-4e41-92ba-9a089172ecc7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-RRR/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/var/lib/libvirt/images/253.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:ef:a4:6e,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=1 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
root      7882  0.0  0.0 112712   992 pts/16  S+   11:07   0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn qemu

So the VM is not controlled by libvirt after the libvirtd restart. Please check this issue.

Created attachment 1452799 [details]
The libvirt log of step 3 in comment 8

That issue is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1524278

Verified on: libvirt-4.5.0-6.el7.x86_64

# virsh domblklist vm1
Target     Source
------------------------------------------------
vda        /nfs/RHEL-7.5-x86_64-latest_5.qcow2

# virsh domstate vm1
running

# for i in {1..200}; do virsh snapshot-create-as vm1 s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
...
Domain snapshot s200 created

# systemctl status libvirtd | grep running
   Active: active (running) since Wed 2018-08-15 16:13:21 CST; 2 days ago

# systemctl restart libvirtd
# systemctl status libvirtd | grep running
   Active: active (running) since Fri 2018-08-17 16:48:43 CST; 5s ago

# virsh domstate vm1
running

# virsh snapshot-create-as vm1 ss --disk-only
Domain snapshot ss created

# virsh snapshot-list vm1
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2018-08-17 16:43:10 +0800 disk-snapshot
 s10                  2018-08-17 16:43:14 +0800 disk-snapshot
 ...
 ss                   2018-08-17 16:49:02 +0800 disk-snapshot

Checked that the VM can be logged in to and operated.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113
Created attachment 1367328 [details]
gstack libvirtd info

Description of problem:
Creating another external snapshot for a VM that already has 200 snapshots fails after restarting libvirtd, raising an abnormal error.

Version-Release number of selected component (if applicable):
libvirt-3.9.0-6.el7.x86_64
kernel-3.10.0-820.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a shut-off VM with the following configuration:
# virsh domstate test1
shut off

# virsh dumpxml test1 --inactive | grep "<disk" -A5
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
  <target dev='hda' bus='ide'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>

2. Start the VM and create 200 snapshots, confirming there is enough disk space:
# virsh start test1
Domain test1 started

# virsh domstate test1
running

# for i in {1..200}; do virsh snapshot-create-as test1 test1s$i --disk-only; done
Domain snapshot test1s1 created
Domain snapshot test1s2 created
...
Domain snapshot test1s199 created
Domain snapshot test1s200 created

3. Check the status of the VM and libvirtd:
# systemctl status libvirtd | grep running
   Active: active (running) since Wed 2017-12-13 16:26:26 CST; 3h 37min ago

# virsh domstate test1
running

4. Restart libvirtd, then check the status of the VM and libvirtd again:
# systemctl restart libvirtd
# systemctl status libvirtd | grep running
   Active: active (running) since Wed 2017-12-13 20:04:40 CST; 2s ago

# virsh domstate test1
running

5. Create another new snapshot for the VM:
# virsh snapshot-create-as test1 ss --disk-only
error: Timed out during operation: cannot acquire state change lock (held by qemuProcessReconnect)

Actual results:
As step 5 shows.

Expected results:
Creating the snapshot in step 5 should succeed.

Additional info:
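The error at step 5 is libvirt's per-domain job lock timing out: the snapshot API cannot begin a job while qemuProcessReconnect still holds it. A minimal sketch of that locking pattern in Python (JobLock, begin_job, and end_job are invented names for illustration, not libvirt's actual API):

```python
import threading
import time

class JobLock:
    """Illustrative sketch of a libvirt-style per-domain job lock.

    An API call must begin a job before it may talk to the qemu monitor;
    if the current owner holds the job past the timeout, the caller fails
    with the same kind of error seen in step 5 above."""

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None  # name of the API call currently holding the job

    def begin_job(self, name, timeout):
        deadline = time.monotonic() + timeout
        with self._cond:
            # wait for the current owner to finish, up to the deadline
            while self._owner is not None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise TimeoutError(
                        "cannot acquire state change lock (held by %s)"
                        % self._owner)
                self._cond.wait(remaining)
            self._owner = name

    def end_job(self):
        with self._cond:
            self._owner = None
            self._cond.notify_all()
```

In this bug the reconnect job never finished, because it was stuck waiting on the oversized monitor reply, so every subsequent API call against the domain hit this timeout.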