Bug 1525496

Summary: libvirtd fails to reconnect to a qemu process after creating 200 snapshots in a chain
Product: Red Hat Enterprise Linux 7
Reporter: jiyan <jiyan>
Component: libvirt
Assignee: Peter Krempa <pkrempa>
Status: CLOSED ERRATA
QA Contact: yisun
Severity: unspecified
Priority: unspecified
Version: 7.5
CC: dyuan, fjin, hhan, jdenemar, jiyan, lmen, meili, pkrempa, xuzhang, yisun
Target Milestone: rc
Hardware: x86_64
OS: Linux
Fixed In Version: libvirt-4.3.0-1.el7
Last Closed: 2018-10-30 09:52:20 UTC
Type: Bug
Attachments:
  gstack libvirtd info
  gstack qemu info
  The libvirt log of step 3 in comment 8

Description jiyan 2017-12-13 12:45:32 UTC
Created attachment 1367328 [details]
gstack libvirtd info

Description of problem:
Creating another external snapshot for a VM that already has 200 snapshots fails after restarting libvirtd and raises an abnormal error.

Version-Release number of selected component (if applicable):
libvirt-3.9.0-6.el7.x86_64
kernel-3.10.0-820.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a shutdown VM with the following configuration
# virsh domstate test1
shut off

# virsh dumpxml test1 --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

2. Start the VM and create 200 snapshots for it, after confirming there is enough disk space (a way to double-check the resulting chain depth is sketched under Additional info below)
# virsh start test1
Domain test1 started

# virsh domstate test1
running

# for i in {1..200}; do virsh snapshot-create-as test1 test1s$i --disk-only; done
Domain snapshot test1s1 created
Domain snapshot test1s2 created
...
Domain snapshot test1s199 created
Domain snapshot test1s200 created

3. Check status of VM and libvirtd
# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 16:26:26 CST; 3h 37min ago

# virsh domstate test1
running

4. Restart libvirtd, check status of VM and libvirtd again
# systemctl restart libvirtd

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 20:04:40 CST; 2s ago

# virsh domstate test1
running

5. Create another snapshot for the VM
# virsh snapshot-create-as test1 ss --disk-only
error: Timed out during operation: cannot acquire state change lock (held by qemuProcessReconnect)


Actual results:
As step 5 shows, the snapshot creation times out with "cannot acquire state change lock (held by qemuProcessReconnect)".
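
To see where libvirtd is blocked at this point (presumably how the attached gstack output was captured), a stack dump of the daemon can be taken, for example:

# gstack $(pidof libvirtd) > /tmp/libvirtd-gstack.txt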

Expected results:
Creating the snapshot in step 5 should succeed.

Additional info:
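For reference, the depth of the backing chain built in step 2 can be double-checked by counting the layers of the current top image (as reported by virsh domblklist). This is only a sketch; -U (--force-share) is probably needed because the running qemu holds the image lock:

# top=$(virsh domblklist test1 | awk '$1 == "hda" {print $2}')
# qemu-img info -U --backing-chain "$top" | grep -c '^image:'

This should print 201 (the base image plus 200 overlays).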

Comment 2 Peter Krempa 2017-12-13 12:53:17 UTC
Thread 2 (Thread 0x7f63f0d70700 (LWP 7695)):
#0  0x00007f643c9fe8f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f643f6bd62e in virCondWait (c=c@entry=0x7f63c80010c8, m=m@entry=0x7f63c80010a0) at util/virthread.c:154
#2  0x00007f63f47538b3 in qemuMonitorSend (mon=mon@entry=0x7f63c8001090, msg=msg@entry=0x7f63f0d6fa20) at qemu/qemu_monitor.c:1067
#3  0x00007f63f476ea18 in qemuMonitorJSONCommandWithFd (mon=mon@entry=0x7f63c8001090, cmd=cmd@entry=0x7f63c8001710, scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f63f0d6faa0) at qemu/qemu_monitor_json.c:300
#4  0x00007f63f47843d8 in qemuMonitorJSONCommand (reply=0x7f63f0d6faa0, cmd=0x7f63c8001710, mon=0x7f63c8001090) at qemu/qemu_monitor_json.c:330
#5  qemuMonitorJSONQueryNamedBlockNodes (mon=mon@entry=0x7f63c8001090) at qemu/qemu_monitor_json.c:7854
#6  0x00007f63f4763b26 in qemuMonitorQueryNamedBlockNodes (mon=0x7f63c8001090) at qemu/qemu_monitor.c:4322
#7  0x00007f63f46ac976 in qemuBlockNodeNamesDetect (driver=driver@entry=0x7f63d013b640, vm=0x7f63d0139d80, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE) at qemu/qemu_block.c:337
#8  0x00007f63f47392ec in qemuProcessReconnect (opaque=<optimized out>) at qemu/qemu_process.c:7249

According to the attached stack trace, libvirt is waiting for a response from qemu to query-named-block-nodes, which is used to detect node names in the backing chain.

Could you also attach a stack trace of qemu? Please also look for lines containing "QEMU_MONITOR_IO" in the libvirtd debug log and attach the last few of them.
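
For reference, that data could be gathered roughly like this (assuming a single qemu-kvm process; the log path is only an example and depends on log_outputs in libvirtd.conf):

# gstack $(pidof qemu-kvm) > /tmp/qemu-gstack.txt
# grep QEMU_MONITOR_IO /var/log/libvirt/libvirtd.log | tail -n 20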

Comment 3 jiyan 2017-12-14 02:36:14 UTC
Created attachment 1367754 [details]
gstack qemu info

Comment 4 Peter Krempa 2017-12-20 12:27:05 UTC
This is a libvirt issue that happens when debug (or info) logging is enabled. The reason is that during reconnect libvirt calls query-named-block-nodes, which produces a huge amount of output for a 200-layer-deep chain.

I'll file a qemu bug to provide a replacement/fix for query-block and query-named-block-nodes so that they do not report all nested data over and over again. Additionally I'll try to optimize the amount of logging done in the monitor.
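
As a rough illustration (example values, not necessarily the reporter's configuration), the kind of settings in /etc/libvirt/libvirtd.conf that trigger this look like:

log_level = 1
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"

Until the fix is available, raising log_level or narrowing log_filters so that qemu monitor traffic is not logged at debug level should avoid the slowdown.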

Comment 6 Peter Krempa 2018-01-03 14:27:58 UTC
Fixed upstream by removing unnecessary logging:

commit f10bb3347b43d900ff361cda5fe1996782284991 
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 13:09:07 2017 +0100

    qemu: monitor: Decrease logging verbosity
    
    The PROBE macro used in qemuMonitorIOProcess and the VIR_DEBUG message
    in qemuMonitorJSONIOProcess create a lot of logging churn when debug
    logging is enabled during monitor communication.
    
    The messages logged from the PROBE macro are rather useless since they
    are reporting the partial state of receiving the reply from qemu. The
    actual full reply is still logged in qemuMonitorJSONIOProcessLine once
    the full message is received.

commit f06e488d5484031a76e7ed231c8fef8fa1181d2c
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 12:58:36 2017 +0100

    util: probe: Add quiet versions of the "PROBE" macro
    
    PROBE macro adds a logging entry, when used in places seeing a lot of
    traffic this can cause a significant slowdown.

Comment 8 Han Han 2018-06-19 03:09:10 UTC
Tested on libvirt-4.4.0-2.el7.x86_64
Internal snapshots work as expected.
For external snapshots, a VM with 253 or more snapshots is no longer managed by libvirt after libvirtd is restarted.
1. Create 253 external snapshots
2. Start VM
# virsh start RRR                     
Domain RRR started
3. Restart libvirtd
# systemctl restart libvirtd
4. Check by virsh
# virsh list                
 Id    Name                           State
----------------------------------------------------


5. Check VM process:
# ps aux|grep qemu                      
qemu      6901 13.9  5.8 2158940 462412 ?      Sl   11:05   0:18 /usr/libexec/qemu-kvm -name guest=RRR,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-RRR/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu SandyBridge-IBRS -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 5da1f3b9-611f-4e41-92ba-9a089172ecc7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-RRR/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/var/lib/libvirt/images/253.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:ef:a4:6e,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=1 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
root      7882  0.0  0.0 112712   992 pts/16   S+   11:07   0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn qemu

So the VM is no longer managed by libvirt after the libvirtd restart. Please check this issue.

Comment 9 Han Han 2018-06-19 03:11:09 UTC
Created attachment 1452799 [details]
The libvirt log of step 3 in comment 8

Comment 10 Peter Krempa 2018-06-19 06:37:00 UTC
That issue is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1524278

Comment 11 yisun 2018-08-17 09:06:32 UTC
Verified on: libvirt-4.5.0-6.el7.x86_64

# virsh domblklist vm1
Target     Source
------------------------------------------------
vda        /nfs/RHEL-7.5-x86_64-latest_5.qcow2

# virsh domstate vm1
running



# for i in {1..200}; do virsh snapshot-create-as vm1 s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
...
Domain snapshot s200 created

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2018-08-15 16:13:21 CST; 2 days ago

# systemctl restart libvirtd

# systemctl status libvirtd |grep running
   Active: active (running) since Fri 2018-08-17 16:48:43 CST; 5s ago

# virsh domstate vm1
running

# virsh snapshot-create-as vm1 ss --disk-only
Domain snapshot ss created

# virsh snapshot-list vm1
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2018-08-17 16:43:10 +0800 disk-snapshot
 s10                  2018-08-17 16:43:14 +0800 disk-snapshot
...
 ss                   2018-08-17 16:49:02 +0800 disk-snapshot

Checked that the VM can be logged into and operated normally.

Comment 13 errata-xmlrpc 2018-10-30 09:52:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113