Bug 1058149 - libvirt crashed when executing qemu-agent-command during shutting down virutal machine
Summary: libvirt crashed when executing qemu-agent-command during shutting down viruta...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Martin Kletzander
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-01-27 05:39 UTC by QiangGuan
Modified: 2014-04-16 08:21 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-04-16 08:21:23 UTC
Embargoed:


Attachments (Terms of Use)
libvirtd crash log (66.17 KB, text/plain)
2014-01-27 05:39 UTC, QiangGuan
no flags Details
libvirtd gdb debug info (12.18 KB, text/plain)
2014-03-13 01:27 UTC, QiangGuan
no flags Details
master branch gdb debug info (14.56 KB, text/plain)
2014-03-21 05:51 UTC, QiangGuan
no flags Details
crash_debug_info_4.3 (13.26 KB, text/plain)
2014-04-03 02:00 UTC, QiangGuan
no flags Details

Description QiangGuan 2014-01-27 05:39:55 UTC
Created attachment 855918 [details]
libvirtd crash log

Description of problem:
When exucuting qemu-agent-command during shutting down virutal machine,
a segmentation violation was caught and libvirtd crashed. Detailed libvirtd log can be seen from the attachment.


How reproducible:
exucuting qemu-agent-command during shutting down virutal machine

Steps to Reproduce:
1.Write two shell scripts as following:
hzguanqiang@debian:~$ cat qemu_agent_command.sh           
while ( true )
do
    virsh qemu-agent-command 4100964f-5d72-420a-abf6-f572164a7f28 '{"execute": "guest-ping"}' 
    sleep 1
done

hzguanqiang@debian:~$ cat shutdown_and_start_vm.sh           
while ( true )
do
virsh shutdown 4100964f-5d72-420a-abf6-f572164a7f28
sleep 30
virsh start 4100964f-5d72-420a-abf6-f572164a7f28
sleep 30
done

2. Running these two scripts.
3. After a while, the bug will be triggered.

Actual results:
libvirtd Caught an Segmentation violation and exited abnormally.

Expected results:
libvirtd run normally and qemu-agent-command failed.

Additional info:

Comment 1 QiangGuan 2014-01-27 05:48:15 UTC
libvirt version is 1.1.4 which can be seen from the attachment log file.
And I'm sorry to make a copy-paste mistake at the end of attachment log file, Contents behind the sentence "2014-01-27 04:58:12.511+0000: 661: info : libvirt version: 1.1.4" are redundant and please ignore them.

Comment 2 Martin Kletzander 2014-01-27 07:34:34 UTC
Please try reproducing this with latest version (v1.2.1) as this might have been already fixed by commit v1.2.1-rc2-12-gb952cbb:

commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e
Author: Peter Krempa <pkrempa>
Date:   Tue Jan 14 19:13:30 2014 +0100

    qemu: Avoid operations on NULL monitor if VM fails early

Comment 3 QiangGuan 2014-03-13 01:23:40 UTC
I've tried the latest master branch, it's ok.
But when I turned to the origin/v1.1.4-maint branch, the problem still exists.

The gdb details info can be seen in the attachment file 'libvirtd gdb debug info'

Comment 4 QiangGuan 2014-03-13 01:27:01 UTC
Created attachment 873776 [details]
libvirtd gdb debug info

Comment 5 Martin Kletzander 2014-03-13 09:12:26 UTC
I back-ported the patch to 1.1.4-maint and 1.2.0-maint branches, although I'm not 100% certain this really fixes it.  I'm closing it as UPSTREAM since this really is fixed upstream (see comment #3), but feel free to check 1.1.4-main as well and let me know whether this fixes it for you.

Comment 6 QiangGuan 2014-03-17 06:23:19 UTC
Hi Martin, I've tried the 1.1.4-maint with your back-ported patch, the problem still exists. The gdb debug info is just the same as the attachment file 'libvirtd gdb debug info'.

Comment 7 QiangGuan 2014-03-17 07:02:14 UTC
(In reply to Martin Kletzander from comment #5)
> I back-ported the patch to 1.1.4-maint and 1.2.0-maint branches, although
> I'm not 100% certain this really fixes it.  I'm closing it as UPSTREAM since
> this really is fixed upstream (see comment #3), but feel free to check
> 1.1.4-main as well and let me know whether this fixes it for you.


Hi Martin, I've tried the 1.1.4-maint with your back-ported patch, the problem still exists. The gdb debug info is just the same as the attachment file 'libvirtd gdb debug info'.

Comment 8 Martin Kletzander 2014-03-17 10:10:11 UTC
I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master branch (which this bugzilla is meant for), but that doesn't mean I'm not willing to help.  However, you'll probably have to do 'git bisect' between 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce).  Note that when you're looking for a fixing commit (not a commit that breaks something), you'll have to use 'git bisect bad' for those which are good (not reproducible) and vice versa.  If you find the commit ID, I'll be happy to backport it to maintenance branches.

Comment 9 QiangGuan 2014-03-18 06:49:17 UTC
(In reply to Martin Kletzander from comment #8)
> I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master
> branch (which this bugzilla is meant for), but that doesn't mean I'm not
> willing to help.  However, you'll probably have to do 'git bisect' between
> 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce).  Note that
> when you're looking for a fixing commit (not a commit that breaks
> something), you'll have to use 'git bisect bad' for those which are good
> (not reproducible) and vice versa.  If you find the commit ID, I'll be happy
> to backport it to maintenance branches.

I tried the v1.2.1-maint branch, the problem still exists. I guess maybe this was solved in v1.2.2 or later, and I'll try my best to find the fixing commit.

Comment 10 QiangGuan 2014-03-21 05:50:15 UTC
(In reply to Martin Kletzander from comment #8)
> I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master
> branch (which this bugzilla is meant for), but that doesn't mean I'm not
> willing to help.  However, you'll probably have to do 'git bisect' between
> 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce).  Note that
> when you're looking for a fixing commit (not a commit that breaks
> something), you'll have to use 'git bisect bad' for those which are good
> (not reproducible) and vice versa.  If you find the commit ID, I'll be happy
> to backport it to maintenance branches.

When I try to find the fixing commit, I found when I run the bug-reproduced test scripts as long as much, the problem can be reproduced at the master branch(with latest commitid b55cc5f4e31b488c4f9c3c8470c992c1f8f5d09c) too.

The detail debug info can be seen at the attachment file 'master branch debug info'.

Comment 11 QiangGuan 2014-03-21 05:51:25 UTC
Created attachment 877129 [details]
master branch gdb debug info

Comment 12 Martin Kletzander 2014-04-01 13:51:09 UTC
I have ran the reproducer for about an hour and a half now and I experienced no crash at all (with current master), the only thing I've seen is the error:

error: internal error: Missing monitor reply object

while doing the guest-agent-command.  I, however, see a codepath that might lead to a crash, so I've proposed a patch upstream:

https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html

Testing it and reporting whether this fixes it for you would be much appreciated.

Comment 13 QiangGuan 2014-04-02 06:45:29 UTC
(In reply to Martin Kletzander from comment #12)
> I have ran the reproducer for about an hour and a half now and I experienced
> no crash at all (with current master), the only thing I've seen is the error:
> 
> error: internal error: Missing monitor reply object
> 
> while doing the guest-agent-command.  I, however, see a codepath that might
> lead to a crash, so I've proposed a patch upstream:
> 
> https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html
> 
> Testing it and reporting whether this fixes it for you would be much
> appreciated.

Ok, I've started my testing with your patch. To reproduce this problem, I used to run more than 7 hours. So this time, I will let my scripts run as long as much. Testing results will be reported after I verified. Thanks very much for your quickly help.

Comment 14 QiangGuan 2014-04-03 01:46:04 UTC
(In reply to QiangGuan from comment #13)
> (In reply to Martin Kletzander from comment #12)
> > I have ran the reproducer for about an hour and a half now and I experienced
> > no crash at all (with current master), the only thing I've seen is the error:
> > 
> > error: internal error: Missing monitor reply object
> > 
> > while doing the guest-agent-command.  I, however, see a codepath that might
> > lead to a crash, so I've proposed a patch upstream:
> > 
> > https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html
> > 
> > Testing it and reporting whether this fixes it for you would be much
> > appreciated.
> 
> Ok, I've started my testing with your patch. To reproduce this problem, I
> used to run more than 7 hours. So this time, I will let my scripts run as
> long as much. Testing results will be reported after I verified. Thanks very
> much for your quickly help.

I ran the testing for a night and it seems that there are another explicit problem when qemuAgentComand() returned 0, but reply was NULL. Debug info can be seen from the attachment 'crash_debug_info_4.3'. I may have a try to fix it.

Comment 15 QiangGuan 2014-04-03 02:00:37 UTC
Created attachment 882048 [details]
crash_debug_info_4.3

Comment 16 Martin Kletzander 2014-04-03 06:04:40 UTC
Thank you for testing this.  I think I've found where this new issue comes from,  fixed it and posted the fix upstream:

https://www.redhat.com/archives/libvir-list/2014-April/msg00112.html

Any comments on the patch are appreciated.

Comment 17 QiangGuan 2014-04-08 02:09:11 UTC
Hi, Martin. I've ran my test over 3 days. Everything is ok with your patch. Can you backport the patch to the maintance branch? Thanks very much for help.

Comment 18 Ján Tomko 2014-04-16 08:21:23 UTC
Fixed upstream by these commits:
commit 736e017e3608ce4c97ee519a293ff7faecea040d
Author:     Martin Kletzander <mkletzan>
CommitDate: 2014-04-03 09:43:25 +0200

    qemu: make sure agent returns error when required data are missing
    
    Commit 5b3492fa aimed to fix this and caught one error but exposed
    another one.  When agent command is being executed and the thread
    waiting for the reply is woken up by an event (e.g. EOF in case of
    shutdown), the command finishes with no data (rxObject == NULL), but
    no error is reported, since this might be desired by the caller
    (e.g. suspend through agent).  However, in other situations, when the
    data are required (e.g. getting vCPUs), we proceed to getting desired
    data out of the reply, but none of the virJSON*() functions works well
    with NULLs.  I chose the way of a new parameter for qemuAgentCommand()
    function that specifies whether reply is required and behaves
    according to that.
    
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1058149
    
    Signed-off-by: Martin Kletzander <mkletzan>

commit e9d09fe19680fcb1810774023aa5c2ef007b10c6
Author:     Martin Kletzander <mkletzan>
CommitDate: 2014-04-02 13:59:32 +0200

    qemu: remove unneeded forward declaration
    
    by moving qemuAgentCommand() after qemuAgentCheckError().
    
    Signed-off-by: Martin Kletzander <mkletzan>

commit 5b3492fadb6bfddd370e263bf8a6953b1b26116f
Author:     Martin Kletzander <mkletzan>
CommitDate: 2014-04-02 07:47:01 +0200

    qemu: cleanup error checking on agent replies
    
    On all the places where qemuAgentComand() was called, we did a check
    for errors in the reply.  Unfortunately, some of the places called
    qemuAgentCheckError() without checking for non-null reply which might
    have resulted in a crash.
    
    So this patch makes the error-checking part of qemuAgentCommand()
    itself, which:
    
     a) makes it look better,
    
     b) makes the check mandatory and, most importantly,
    
     c) checks for the errors if and only if it is appropriate.
    
    This actually fixes a potential crashers when qemuAgentComand()
    returned 0, but reply was NULL.  Having said that, it *should* fix the
    following bug:
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1058149
    
    Signed-off-by: Martin Kletzander <mkletzan>

Which have been backported to maint branches from v1.0.3-maint to v1.2.3-maint.


Note You need to log in before you can comment on or make changes to this bug.