Created attachment 855918 [details] libvirtd crash log Description of problem: When exucuting qemu-agent-command during shutting down virutal machine, a segmentation violation was caught and libvirtd crashed. Detailed libvirtd log can be seen from the attachment. How reproducible: exucuting qemu-agent-command during shutting down virutal machine Steps to Reproduce: 1.Write two shell scripts as following: hzguanqiang@debian:~$ cat qemu_agent_command.sh while ( true ) do virsh qemu-agent-command 4100964f-5d72-420a-abf6-f572164a7f28 '{"execute": "guest-ping"}' sleep 1 done hzguanqiang@debian:~$ cat shutdown_and_start_vm.sh while ( true ) do virsh shutdown 4100964f-5d72-420a-abf6-f572164a7f28 sleep 30 virsh start 4100964f-5d72-420a-abf6-f572164a7f28 sleep 30 done 2. Running these two scripts. 3. After a while, the bug will be triggered. Actual results: libvirtd Caught an Segmentation violation and exited abnormally. Expected results: libvirtd run normally and qemu-agent-command failed. Additional info:
libvirt version is 1.1.4 which can be seen from the attachment log file. And I'm sorry to make a copy-paste mistake at the end of attachment log file, Contents behind the sentence "2014-01-27 04:58:12.511+0000: 661: info : libvirt version: 1.1.4" are redundant and please ignore them.
Please try reproducing this with latest version (v1.2.1) as this might have been already fixed by commit v1.2.1-rc2-12-gb952cbb: commit b952cbbccafd5ead8b5a70b2608a1d5a7f03b31e Author: Peter Krempa <pkrempa> Date: Tue Jan 14 19:13:30 2014 +0100 qemu: Avoid operations on NULL monitor if VM fails early
I've tried the latest master branch, it's ok. But when I turned to the origin/v1.1.4-maint branch, the problem still exists. The gdb details info can be seen in the attachment file 'libvirtd gdb debug info'
Created attachment 873776 [details] libvirtd gdb debug info
I back-ported the patch to 1.1.4-maint and 1.2.0-maint branches, although I'm not 100% certain this really fixes it. I'm closing it as UPSTREAM since this really is fixed upstream (see comment #3), but feel free to check 1.1.4-main as well and let me know whether this fixes it for you.
Hi Martin, I've tried the 1.1.4-maint with your back-ported patch, the problem still exists. The gdb debug info is just the same as the attachment file 'libvirtd gdb debug info'.
(In reply to Martin Kletzander from comment #5) > I back-ported the patch to 1.1.4-maint and 1.2.0-maint branches, although > I'm not 100% certain this really fixes it. I'm closing it as UPSTREAM since > this really is fixed upstream (see comment #3), but feel free to check > 1.1.4-main as well and let me know whether this fixes it for you. Hi Martin, I've tried the 1.1.4-maint with your back-ported patch, the problem still exists. The gdb debug info is just the same as the attachment file 'libvirtd gdb debug info'.
I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master branch (which this bugzilla is meant for), but that doesn't mean I'm not willing to help. However, you'll probably have to do 'git bisect' between 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce). Note that when you're looking for a fixing commit (not a commit that breaks something), you'll have to use 'git bisect bad' for those which are good (not reproducible) and vice versa. If you find the commit ID, I'll be happy to backport it to maintenance branches.
(In reply to Martin Kletzander from comment #8) > I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master > branch (which this bugzilla is meant for), but that doesn't mean I'm not > willing to help. However, you'll probably have to do 'git bisect' between > 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce). Note that > when you're looking for a fixing commit (not a commit that breaks > something), you'll have to use 'git bisect bad' for those which are good > (not reproducible) and vice versa. If you find the commit ID, I'll be happy > to backport it to maintenance branches. I tried the v1.2.1-maint branch, the problem still exists. I guess maybe this was solved in v1.2.2 or later, and I'll try my best to find the fixing commit.
(In reply to Martin Kletzander from comment #8) > I'm keeping this CLOSED UPSTREAM as this really is fixed in upstream master > branch (which this bugzilla is meant for), but that doesn't mean I'm not > willing to help. However, you'll probably have to do 'git bisect' between > 1.1.4 and 1.2.1 (or which ones you can and cannot reproduce). Note that > when you're looking for a fixing commit (not a commit that breaks > something), you'll have to use 'git bisect bad' for those which are good > (not reproducible) and vice versa. If you find the commit ID, I'll be happy > to backport it to maintenance branches. When I try to find the fixing commit, I found when I run the bug-reproduced test scripts as long as much, the problem can be reproduced at the master branch(with latest commitid b55cc5f4e31b488c4f9c3c8470c992c1f8f5d09c) too. The detail debug info can be seen at the attachment file 'master branch debug info'.
Created attachment 877129 [details] master branch gdb debug info
I have ran the reproducer for about an hour and a half now and I experienced no crash at all (with current master), the only thing I've seen is the error: error: internal error: Missing monitor reply object while doing the guest-agent-command. I, however, see a codepath that might lead to a crash, so I've proposed a patch upstream: https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html Testing it and reporting whether this fixes it for you would be much appreciated.
(In reply to Martin Kletzander from comment #12) > I have ran the reproducer for about an hour and a half now and I experienced > no crash at all (with current master), the only thing I've seen is the error: > > error: internal error: Missing monitor reply object > > while doing the guest-agent-command. I, however, see a codepath that might > lead to a crash, so I've proposed a patch upstream: > > https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html > > Testing it and reporting whether this fixes it for you would be much > appreciated. Ok, I've started my testing with your patch. To reproduce this problem, I used to run more than 7 hours. So this time, I will let my scripts run as long as much. Testing results will be reported after I verified. Thanks very much for your quickly help.
(In reply to QiangGuan from comment #13) > (In reply to Martin Kletzander from comment #12) > > I have ran the reproducer for about an hour and a half now and I experienced > > no crash at all (with current master), the only thing I've seen is the error: > > > > error: internal error: Missing monitor reply object > > > > while doing the guest-agent-command. I, however, see a codepath that might > > lead to a crash, so I've proposed a patch upstream: > > > > https://www.redhat.com/archives/libvir-list/2014-April/msg00015.html > > > > Testing it and reporting whether this fixes it for you would be much > > appreciated. > > Ok, I've started my testing with your patch. To reproduce this problem, I > used to run more than 7 hours. So this time, I will let my scripts run as > long as much. Testing results will be reported after I verified. Thanks very > much for your quickly help. I ran the testing for a night and it seems that there are another explicit problem when qemuAgentComand() returned 0, but reply was NULL. Debug info can be seen from the attachment 'crash_debug_info_4.3'. I may have a try to fix it.
Created attachment 882048 [details] crash_debug_info_4.3
Thank you for testing this. I think I've found where this new issue comes from, fixed it and posted the fix upstream: https://www.redhat.com/archives/libvir-list/2014-April/msg00112.html Any comments on the patch are appreciated.
Hi, Martin. I've ran my test over 3 days. Everything is ok with your patch. Can you backport the patch to the maintance branch? Thanks very much for help.
Fixed upstream by these commits: commit 736e017e3608ce4c97ee519a293ff7faecea040d Author: Martin Kletzander <mkletzan> CommitDate: 2014-04-03 09:43:25 +0200 qemu: make sure agent returns error when required data are missing Commit 5b3492fa aimed to fix this and caught one error but exposed another one. When agent command is being executed and the thread waiting for the reply is woken up by an event (e.g. EOF in case of shutdown), the command finishes with no data (rxObject == NULL), but no error is reported, since this might be desired by the caller (e.g. suspend through agent). However, in other situations, when the data are required (e.g. getting vCPUs), we proceed to getting desired data out of the reply, but none of the virJSON*() functions works well with NULLs. I chose the way of a new parameter for qemuAgentCommand() function that specifies whether reply is required and behaves according to that. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1058149 Signed-off-by: Martin Kletzander <mkletzan> commit e9d09fe19680fcb1810774023aa5c2ef007b10c6 Author: Martin Kletzander <mkletzan> CommitDate: 2014-04-02 13:59:32 +0200 qemu: remove unneeded forward declaration by moving qemuAgentCommand() after qemuAgentCheckError(). Signed-off-by: Martin Kletzander <mkletzan> commit 5b3492fadb6bfddd370e263bf8a6953b1b26116f Author: Martin Kletzander <mkletzan> CommitDate: 2014-04-02 07:47:01 +0200 qemu: cleanup error checking on agent replies On all the places where qemuAgentComand() was called, we did a check for errors in the reply. Unfortunately, some of the places called qemuAgentCheckError() without checking for non-null reply which might have resulted in a crash. So this patch makes the error-checking part of qemuAgentCommand() itself, which: a) makes it look better, b) makes the check mandatory and, most importantly, c) checks for the errors if and only if it is appropriate. This actually fixes a potential crashers when qemuAgentComand() returned 0, but reply was NULL. Having said that, it *should* fix the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=1058149 Signed-off-by: Martin Kletzander <mkletzan> Which have been backported to maint branches from v1.0.3-maint to v1.2.3-maint.