1525496 – libvirtd fails to reconnect to a qemu process after creating 200 snapshots in a chain

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1525496 - libvirtd fails to reconnect to a qemu process after creating 200 snapshots in a chain

Summary: libvirtd fails to reconnect to a qemu process after creating 200 snapshots in...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	libvirt
Sub Component:
Version:	7.5
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Krempa
QA Contact:	yisun
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2017-12-13 12:45 UTC by jiyan
Modified:	2018-10-30 09:52 UTC (History)
CC List:	10 users (show)
Fixed In Version:	libvirt-4.3.0-1.el7
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-30 09:52:20 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
gstatck libvirtd info (12.52 KB, text/plain) 2017-12-13 12:45 UTC, jiyan	no flags	Details
gstatck qemu info (14.56 KB, text/plain) 2017-12-14 02:36 UTC, jiyan	no flags	Details
The libvirt log of step3 in comment8 (334.32 KB, application/x-xz) 2018-06-19 03:11 UTC, Han Han	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2018:3113	0	None	None	None	2018-10-30 09:52:59 UTC

Description jiyan 2017-12-13 12:45:32 UTC

Created attachment 1367328 [details]
gstatck libvirtd info

Description of problem:
Create another external snapshot for VM already with 200 snapshots after restarting libvirtd will fail and raise abnormal error info

Version-Release number of selected component (if applicable):
libvirt-3.9.0-6.el7.x86_64
kernel-3.10.0-820.el7.x86_64
qemu-kvm-rhev-2.10.0-12.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Prepare a shutdown VM with the following configuration
# virsh domstate test1
shut off

# virsh dumpxml test1 --inactive |grep "<disk" -A5
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/RHEL-7.5-x86_64-latest.qcow2'/>
      <target dev='hda' bus='ide'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

2. Start VM and create 200 snapshots for VM, confirm there are enough space
# virsh start test1
Domain test1 started

# virsh domstate test1
running

# for i in {1..200}; do virsh snapshot-create-as test1 test1s$i --disk-only; done
Domain snapshot test1s1 created
Domain snapshot test1s2 created
...
Domain snapshot test1s199 created
Domain snapshot test1s200 created

3. Check status of VM and libvirtd
# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 16:26:26 CST; 3h 37min ago

# virsh domstate test1
running

4. Restart libvirtd, check status of VM and libvirtd again
# systemctl restart libvirtd

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2017-12-13 20:04:40 CST; 2s ago

# virsh domstate test1
running

5. Create another new snapshot for VM
# virsh snapshot-create-as test1 ss --disk-only
error: Timed out during operation: cannot acquire state change lock (held by qemuProcessReconnect)


Actual results:
As step 5 shows

Expected results:
Creating snapshot in step 5 should succeed.

Additional info:

Comment 2 Peter Krempa 2017-12-13 12:53:17 UTC

Thread 2 (Thread 0x7f63f0d70700 (LWP 7695)):
#0  0x00007f643c9fe8f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f643f6bd62e in virCondWait (c=c@entry=0x7f63c80010c8, m=m@entry=0x7f63c80010a0) at util/virthread.c:154
#2  0x00007f63f47538b3 in qemuMonitorSend (mon=mon@entry=0x7f63c8001090, msg=msg@entry=0x7f63f0d6fa20) at qemu/qemu_monitor.c:1067
#3  0x00007f63f476ea18 in qemuMonitorJSONCommandWithFd (mon=mon@entry=0x7f63c8001090, cmd=cmd@entry=0x7f63c8001710, scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f63f0d6faa0) at qemu/qemu_monitor_json.c:300
#4  0x00007f63f47843d8 in qemuMonitorJSONCommand (reply=0x7f63f0d6faa0, cmd=0x7f63c8001710, mon=0x7f63c8001090) at qemu/qemu_monitor_json.c:330
#5  qemuMonitorJSONQueryNamedBlockNodes (mon=mon@entry=0x7f63c8001090) at qemu/qemu_monitor_json.c:7854
#6  0x00007f63f4763b26 in qemuMonitorQueryNamedBlockNodes (mon=0x7f63c8001090) at qemu/qemu_monitor.c:4322
#7  0x00007f63f46ac976 in qemuBlockNodeNamesDetect (driver=driver@entry=0x7f63d013b640, vm=0x7f63d0139d80, asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE) at qemu/qemu_block.c:337
#8  0x00007f63f47392ec in qemuProcessReconnect (opaque=<optimized out>) at qemu/qemu_process.c:7249

According to the attached stack-trace libvirt is waiting on a response from qemu which is used to detect node names from the backing chain.

Could you also attach a stack trace of qemu and look for lines containing "QEMU_MONITOR_IO" in the libvirtd debug log and attach the last few of them?

Comment 3 jiyan 2017-12-14 02:36:14 UTC

Created attachment 1367754 [details]
gstatck qemu info

Comment 4 Peter Krempa 2017-12-20 12:27:05 UTC

This is a libvirt issue which happens when debug logging (or info-logging) is enabled. The reason for this is that when reconnecting libvirt calls query-named-block-nodes, which produces a lot of output for a 200 layer deep chain.

I'll file a qemu bug to provide a replacement/fix for query-block and query-named-block-nodes so that they do not report all nested data over and over again. Additionally I'll try to optimize the amount of logging done in the monitor.

Comment 6 Peter Krempa 2018-01-03 14:27:58 UTC

Fixed upstream by removing unnecessary logging:

commit f10bb3347b43d900ff361cda5fe1996782284991 
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 13:09:07 2017 +0100

    qemu: monitor: Decrease logging verbosity
    
    The PROBE macro used in qemuMonitorIOProcess and the VIR_DEBUG message
    in qemuMonitorJSONIOProcess create a lot of logging churn when debug
    logging is enabled during monitor communication.
    
    The messages logged from the PROBE macro are rather useless since they
    are reporting the partial state of receiving the reply from qemu. The
    actual full reply is still logged in qemuMonitorJSONIOProcessLine once
    the full message is received.

commit f06e488d5484031a76e7ed231c8fef8fa1181d2c
Author: Peter Krempa <pkrempa>
Date:   Wed Dec 20 12:58:36 2017 +0100

    util: probe: Add quiet versions of the "PROBE" macro
    
    PROBE macro adds a logging entry, when used in places seeing a lot of
    traffic this can cause a significant slowdown.

Comment 8 Han Han 2018-06-19 03:09:10 UTC

Test on libvirt-4.4.0-2.el7.x86_64
Internal snapshots work as expected. 
For external snapshots, VM with 253 or more snapshots will not be controlled by libvirt after libvirtd restarted.
1. Create 253 external snapshots
2. Start VM
# virsh start RRR                     
Domain RRR started
3. Restart libvirtd
# systemctl restart libvirtd
4. Check by virsh
# virsh list                
 Id    Name                           State
----------------------------------------------------


5. Check VM process:
# ps aux|grep qemu                      
qemu      6901 13.9  5.8 2158940 462412 ?      Sl   11:05   0:18 /usr/libexec/qemu-kvm -name guest=RRR,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-RRR/master-key.aes -machine pc-i440fx-rhel7.5.0,accel=kvm,usb=off,vmport=off,dump-guest-core=off -cpu SandyBridge-IBRS -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid 5da1f3b9-611f-4e41-92ba-9a089172ecc7 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-RRR/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -boot strict=on -device ich9-usb-ehci1,id=usb,bus=pci.0,addr=0x5.0x7 -device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pci.0,multifunction=on,addr=0x5 -device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pci.0,addr=0x5.0x1 -device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pci.0,addr=0x5.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x6 -drive file=/var/lib/libvirt/images/253.qcow2,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:ef:a4:6e,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0,bus=usb.0,port=1 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x8 -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
root      7882  0.0  0.0 112712   992 pts/16   S+   11:07   0:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn qemu

So VM is not controlled by libvirt after libvirtd restart. Please check this issue

Comment 9 Han Han 2018-06-19 03:11:09 UTC

Created attachment 1452799 [details]
The libvirt log of step3 in comment8

Comment 10 Peter Krempa 2018-06-19 06:37:00 UTC

That issue is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1524278

Comment 11 yisun 2018-08-17 09:06:32 UTC

Verified on: libvirt-4.5.0-6.el7.x86_64

# virsh domblklist vm1
Target     Source
------------------------------------------------
vda        /nfs/RHEL-7.5-x86_64-latest_5.qcow2

# virsh domstate vm1
running



# for i in {1..200}; do virsh snapshot-create-as vm1 s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
...
Domain snapshot s200 created

# systemctl status libvirtd |grep running
   Active: active (running) since Wed 2018-08-15 16:13:21 CST; 2 days ago

# systemctl restart libvirtd

# systemctl status libvirtd |grep running
   Active: active (running) since Fri 2018-08-17 16:48:43 CST; 5s ago

# virsh domstate vm1
running

# virsh snapshot-create-as vm1 ss --disk-only
Domain snapshot ss created

# virsh snapshot-list vm1
 Name                 Creation Time             State
------------------------------------------------------------
 s1                   2018-08-17 16:43:10 +0800 disk-snapshot
 s10                  2018-08-17 16:43:14 +0800 disk-snapshot
...
 ss                   2018-08-17 16:49:02 +0800 disk-snapshot

Check vm can be logged in and operate.

Comment 13 errata-xmlrpc 2018-10-30 09:52:20 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3113

Note You need to log in before you can comment on or make changes to this bug.