Bug 1054785

Summary: inappropriate error message when guest crashes while connecting to agent monitor
Product: Red Hat Enterprise Linux 7
Reporter: Peter Krempa <pkrempa>
Component: libvirt
Assignee: Peter Krempa <pkrempa>
Status: CLOSED CURRENTRELEASE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 7.0
CC: acathrow, ahadas, dallan, dyuan, fjin, honzhang, iheim, istein, jdenemar, jiahu, lpeer, michal.skrivanek, mzhan, pkrempa, Rhev-m-bugs, yeylon, zhwang, zpeng
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version: libvirt-1.1.1-19.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1047659
Environment:
Last Closed: 2014-06-13 12:38:31 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1047659
Bug Blocks:
Attachments: libvirtd log (flags: none)

Description Peter Krempa 2014-01-17 13:22:22 UTC
+++ This bug was initially created as a clone of Bug #1047659 +++

Description of problem:
When the VM crashes while libvirtd is attempting to connect to the guest agent monitor socket, the regular monitor object is removed. It is later accessed during the startup phase. This no longer causes a crash, but it reports an inappropriate error message:

error: invalid argument: monitor must not be NULL

instead of the real error message. This situation is likely to happen when doing migrations under heavy host load.
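
A toy, self-contained sketch of the sequence being described (hypothetical names, not the actual libvirt code): the domain lock is dropped around the blocking agent connect, the VM dies and its monitor is freed in the meantime, and a later startup step then reports the cryptic error on the NULL monitor:

    #include <stdio.h>
    #include <stdlib.h>

    struct vm {
        void *mon;   /* qemu monitor object; freed when the VM dies */
    };

    static void vm_died(struct vm *vm)
    {
        free(vm->mon);
        vm->mon = NULL;          /* the regular monitor object is removed */
    }

    static int enter_monitor(struct vm *vm)
    {
        if (vm->mon == NULL) {   /* the guard emitting the cryptic message */
            fprintf(stderr, "error: invalid argument: monitor must not be NULL\n");
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        struct vm vm = { .mon = malloc(1) };
        /* lock dropped here; the blocking agent connect would happen now */
        vm_died(&vm);            /* qemu-kvm is killed during the connect */
        /* lock reacquired; the agent failure is treated as non-fatal, so
         * startup continues and later enters the (now NULL) monitor: */
        return enter_monitor(&vm) == -1 ? 0 : 1;  /* 0 = race reproduced */
    }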

Version-Release number of selected component (if applicable):
libvirt-1.1.1-18.el7

--- Additional comment from Peter Krempa on 2014-01-14 19:29:36 CET ---

Reproducer:

commit ba95d2d1220061ce86fd0b66f0a2f19a90c6b51c
Author: Peter Krempa <pkrempa>
Date:   Tue Jan 14 19:09:36 2014 +0100

    DO NOT APPLY UPSTREAM: Reproducer
    
    Apply this patch and run a patched libvirt daemon. Then start a VM and
    kill its process after a few seconds:
    
     # virsh start VM & sleep 3; killall -9 qemu-kvm
    
    The upstream version doesn't crash, but the virsh command returns a
    cryptic error message:
     error: invalid argument: monitor must not be NULL
    
    0.10.2 and similar downstream versions segfault:
    
    Program received signal SIGSEGV, Segmentation fault.
    [Switching to Thread 0x7fffed467700 (LWP 29256)]
    __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
    50        unsigned int type = PTHREAD_MUTEX_TYPE (mutex);
    (gdb) bt
     #0  __pthread_mutex_lock (mutex=0x10) at pthread_mutex_lock.c:50
     #1  0x00007fffe891585e in qemuDomainObjEnterMonitorInternal (driver=0x7fffe00858f0, driver_locked=true, obj=0x7fffe01689c0, asyncJob=<value optimized out>)
         at qemu/qemu_domain.c:1008
     #2  0x00007fffe892662a in qemuProcessDetectVcpuPIDs (driver=0x7fffe00858f0, vm=0x7fffe01689c0) at qemu/qemu_process.c:1831
     #3  0x00007fffe892c417 in qemuProcessStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, migrateFrom=0x0, stdin_fd=-1, stdin_path=0x0, snapshot=0x0,
         vmop=VIR_NETDEV_VPORT_PROFILE_OP_CREATE, flags=1) at qemu/qemu_process.c:4022
     #4  0x00007fffe8973f3e in qemuDomainObjStart (conn=0x7fffdc000ae0, driver=0x7fffe00858f0, vm=0x7fffe01689c0, flags=<value optimized out>) at qemu/qemu_driver.c:6025

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 9331744..a27eded 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -226,40 +226,42 @@ qemuConnectAgent(virQEMUDriverPtr driver, virDomainObjPtr vm)
         return 0;

     if (virSecurityManagerSetDaemonSocketLabel(driver->securityManager,
                                                vm->def) < 0) {
         VIR_ERROR(_("Failed to set security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     /* Hold an extra reference because we can't allow 'vm' to be
      * deleted while the agent is active */
     virObjectRef(vm);

     ignore_value(virTimeMillisNow(&priv->agentStart));
     virObjectUnlock(vm);

     agent = qemuAgentOpen(vm,
                           config,
                           &agentCallbacks);

+    sleep(10);
+
     virObjectLock(vm);
     priv->agentStart = 0;

     if (virSecurityManagerClearSocketLabel(driver->securityManager,
                                            vm->def) < 0) {
         VIR_ERROR(_("Failed to clear security context for agent for %s"),
                   vm->def->name);
         goto cleanup;
     }

     if (agent == NULL)
         virObjectUnref(vm);

     if (!virDomainObjIsActive(vm)) {
         qemuAgentClose(agent);
         goto cleanup;
     }
     priv->agent = agent;

     if (priv->agent == NULL) {

--- Additional comment from Peter Krempa on 2014-01-15 18:11:14 CET ---

Fixed upstream:

commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e
Author: Peter Krempa <pkrempa>
Date:   Tue Jan 14 19:13:30 2014 +0100

    qemu: Avoid operations on NULL monitor if VM fails early
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1047659
    
    If a VM dies very early during an attempted connect to the guest agent
    while the locks are down the domain monitor object will be freed. The
    object is then accessed later as any failure during guest agent startup
    isn't considered fatal.
    
    In the current upstream version this doesn't lead to a crash as
    virObjectLock called when entering the monitor in
    qemuProcessDetectVcpuPIDs checks the pointer before attempting to
    dereference (lock) it. The NULL pointer is then caught in the monitor
    helper code.
    
    Before the introduction of virObjectLockable - observed on 0.10.2 - the
    pointer is locked directly via virMutexLock leading to a crash.
    
    To avoid this problem we need to differentiate between the guest agent
    not being present and the VM quitting when the locks were down. The fix
    reorganizes the code in qemuConnectAgent to add the check and then adds
    special handling to the callers.

Comment 3 zhenfeng wang 2014-01-23 09:30:00 UTC
I can reproduce this bug with the reproducer in comment 0 on libvirt-1.1.1-18, and I get the expected error after updating to libvirt-1.1.1-19, so this bug can be marked verified.

Reproduction steps
1. Rebuild the libvirt package with the reproducer patch from comment 0
2. Execute the following command:
# virsh start rheltest2 & sleep 3; killall -9 qemu-kvm
[1] 29173
error: invalid argument: monitor must not be NULL

# ps aux|grep libvirtd
root     14377  6.8  0.2 1127840 23560 ?       Ssl  16:29   2:55 /usr/sbin/libvirtd


Verification steps
1. Rebuild the libvirt package with the reproducer patch from comment 0
2. Execute the following command; libvirtd does not crash and the expected error is reported:
# virsh start rheltest2 & sleep 3; killall -9 qemu-kvm
error: Failed to start domain rheltest2
error: internal error guest crashed while connecting to the guest agent

# ps aux|grep libvirtd
root      9784  5.1  0.0 1058464 18544 ?       Sl   19:09   2:03 libvirtd --daemon


3. Start the guest; after it has started successfully, do S4 (suspend to disk) with the guest

# virsh start rheltest2
# virsh dompmsuspend rheltest2 --target disk
Domain rheltest2 successfully suspended
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     rheltest2                      shut off

4. Execute the command from step 2; libvirtd does not crash and the expected error is reported:
# virsh start rheltest2 & sleep 3; killall -9 qemu-kvm
error: Failed to start domain rheltest2
error: internal error guest crashed while connecting to the guest agent

# ps aux|grep libvirtd
root      9784  5.1  0.0 1058464 18544 ?       Sl   19:09   2:03 libvirtd --daemon


5. Start the guest; after it has started completely, do managedsave with the guest

# virsh start rheltest2
# virsh managedsave rheltest2

Domain rheltest2 state saved by libvirt

6. Execute the command from step 2; libvirtd does not crash and an error is reported:
# virsh start rheltest2 & sleep 3; killall -9 qemu-kvm
error: internal error: guest crashed while connecting to the guest agent

# ps aux|grep libvirtd
root      9784  5.1  0.0 1058464 18544 ?       Sl   19:09   2:03 libvirtd --daemon

7. Test the above scenarios without rebuilding the libvirt package; the guest ends up in shut off state with no error reported, and libvirtd does not crash

# virsh start rheltest2 & sleep 3; killall -9 qemu-kvm
[1] 14584

# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     rheltest2                      shut off

8. Start the guest, then check qemu-guest-agent functionality with the following commands; all of them work as expected:
# virsh start rheltest2
# virsh setvcpus rheltest2 2 --guest

# virsh vcpucount rheltest2 --guest
2
# virsh dompmsuspend rheltest2 --target mem
# virsh dompmwakeup rheltest2
# virsh dompmsuspend rheltest2 --target disk
# virsh start rheltest2
# virsh shutdown rheltest2 --mode agent
# virsh reboot rheltest2 --mode agent

Comment 4 Ludek Smid 2014-06-13 12:38:31 UTC
This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 6 Fangge Jin 2016-04-14 09:51:31 UTC
Hi Peter,
I'm testing this bug on build libvirt-1.3.3-1.el7.x86_64, but I can't get the error "guest crashed while connecting to the guest agent". I'm not sure whether this scenario is still testable; please have a look. Thank you.

Steps:
1. Rebuild libvirt with the following change in the source code:

    agent = qemuAgentOpen(vm,
                          &config->source,
                          &agentCallbacks);
    sleep(20);   /* ===> add this line */

    virObjectLock(vm);

2. In a terminal:
# tailf /var/log/libvirt/libvirtd.log|grep qemuAgentOpen

3. In another terminal:
# virsh start rhel7
Domain rhel7 started

4. After the following log line is printed, kill the qemu-kvm process:
# tailf /var/log/libvirt/libvirtd.log|grep qemuAgentOpen
2016-04-14 09:37:42.343+0000: 25074: debug : qemuAgentOpen:778 : New mon 0x7f21f41fb870 fd =22 watch=122

# killall -9 qemu-kvm

5. After 20 s, check the libvirtd log; no "guest crashed while connecting to the guest agent" message is found.

Comment 7 Fangge Jin 2016-04-14 09:52:49 UTC
Created attachment 1147103 [details]
libvirtd log

Comment 8 Peter Krempa 2016-07-26 15:10:22 UTC
It is no longer possible to trigger the message at the point where it would be emitted. The agent connection is now usually made in the callback handler for the VSERPORT_CHANGE event, and that handler takes a job. The destroy action uses a libvirt job too, so the two paths are interlocked. With a recent enough qemu the message won't appear.
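
A minimal self-contained sketch of the interlock being described (hypothetical names, with a plain mutex standing in for libvirt's per-domain job; not the actual libvirt code): because both paths take the same job, the agent connect can no longer observe a half-destroyed domain:

    #include <pthread.h>
    #include <stdbool.h>

    struct domain {
        pthread_mutex_t job;   /* stands in for libvirt's per-domain job */
        bool active;
    };

    /* VSERPORT_CHANGE handler: connects the agent under a job. */
    static void handle_vserport_change(struct domain *dom)
    {
        pthread_mutex_lock(&dom->job);       /* begin job */
        if (dom->active) {
            /* connect to the guest agent; destroy cannot interleave here */
        }
        pthread_mutex_unlock(&dom->job);     /* end job */
    }

    /* Destroy path: also takes the job, so it waits for the handler. */
    static void destroy_domain(struct domain *dom)
    {
        pthread_mutex_lock(&dom->job);
        dom->active = false;                 /* monitor teardown happens here */
        pthread_mutex_unlock(&dom->job);
    }

    int main(void)
    {
        struct domain dom = { .job = PTHREAD_MUTEX_INITIALIZER, .active = true };
        handle_vserport_change(&dom);
        destroy_domain(&dom);
        return 0;
    }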