Bug 928661

Summary: libvirtd crash when destroying a Linux guest after a series of S3 and save/restore operations

Product: Red Hat Enterprise Linux 6
Component: libvirt
Reporter: zhenfeng wang <zhwang>
Assignee: Eric Blake <eblake>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: high
Version: 6.4
CC: acathrow, berrange, bili, cwei, dallan, dyuan, eblake, jdenemar, jiahu, mjenner, mzhan, shyu
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: libvirt-0.10.2-22.el6
Doc Type: Bug Fix
Doc Text:
Cause: Code refactoring to fix another bug left a case where locks were cleaned up incorrectly.
Consequence: libvirtd could crash in certain migrate-to-file scenarios.
Fix: The lock cleanup paths were corrected.
Result: libvirtd no longer crashes when saving a domain to file.
Story Points: ---
Clones: 928672 (view as bug list)
Last Closed: 2013-11-21 08:56:30 UTC
Type: Bug
Bug Depends On: 638512
Bug Blocks: 928672, 1017194
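The Doc Text above describes the bug class: a refactored cleanup path released a lock that the failing code path had never acquired, crashing libvirtd during migrate-to-file. A minimal Python sketch of that pattern and its fix (illustrative names only; this is not libvirt's code, which is C):

```python
import threading

lock = threading.Lock()

def buggy_job(fail_early):
    # Buggy cleanup path: always releases the lock, even when the
    # early-exit branch never acquired it.
    if not fail_early:
        lock.acquire()
    # ... job work would happen here ...
    lock.release()  # raises RuntimeError if we never acquired it

def fixed_job(fail_early):
    # Fixed cleanup path: track ownership, release only what we took.
    owned = False
    if not fail_early:
        lock.acquire()
        owned = True
    # ... job work would happen here ...
    if owned:
        lock.release()

fixed_job(fail_early=True)    # early exit, no release attempted
fixed_job(fail_early=False)   # normal path: acquire, then release

try:
    buggy_job(fail_early=True)
    crashed = False
except RuntimeError:
    crashed = True

print("buggy cleanup blew up:", crashed)  # → buggy cleanup blew up: True
```

In Python the mismatched release is a catchable RuntimeError; in C, unlocking a mutex the thread does not own is undefined behavior, which is why the same mistake crashes the daemon.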
Description (zhenfeng wang, 2013-03-28 07:49:37 UTC)

Created attachment 717485 [details]: the guest's XML
Created attachment 717557 [details]: the gdb info about the libvirtd crash
Please provide the *full* stack trace for all threads, i.e. 'thread apply all bt', not merely 'bt'.

Created attachment 717581 [details]: full stack trace for all threads
Per https://bugzilla.redhat.com/show_bug.cgi?id=928672#c7, we know the right fix. The regression was introduced when fixing bug 638512.

Hi, following the steps of https://bugzilla.redhat.com/show_bug.cgi?id=928661#c0, libvirtd no longer crashes, but the dompmsuspend command still hangs (in step 6) and the save command reports an error (in step 7). Could you help me confirm the two questions below?

1. For my step 6 below, I still reproduced the hang, and found an existing bug that matches it: bug 890648 - guest agent commands will hang if the guest agent crashes while executing a command.
2. For my step 7 below, the old error message disappeared, but a new one is reported: "error: Timed out during operation: cannot acquire state change lock".

Version:
libvirt-0.10.2-22.el6.x86_64
qemu-kvm-0.12.1.2-2.393.el6.x86_64
qemu-guest-agent-0.12.1.2-2.393.el6.x86_64
kernel-2.6.32-358.el6.x86_64

1. # getenforce
   Enforcing

2. Prepare a guest with a qemu-ga environment; add the config below to the domain XML:

   ...
   <pm>
     <suspend-to-mem enabled='yes'/>
     <suspend-to-disk enabled='yes'/>
   </pm>
   ...
   <channel type='unix'>
     <source mode='bind' path='/var/lib/libvirt/qemu/r6.agent'/>
     <target type='virtio' name='org.qemu.guest_agent.0'/>
     <address type='virtio-serial' controller='0' bus='0' port='1'/>
   </channel>
   ...

   # virsh list --all
    Id    Name                 State
   ----------------------------------------------------
    7     r6                   running

3. Start the qemu-ga service in the guest:
   # qemu-ga -d

4. Do S3 with the guest, then wake it up:
   [root@test ~]# virsh dompmsuspend r6 --target mem
   Domain r6 successfully suspended
   [root@test ~]# virsh dompmwakeup r6
   Domain r6 successfully woken up

5. Save and restore the guest:
   [root@test ~]# virsh save r6 /tmp/r6.save
   Domain r6 saved to /tmp/r6.save
   [root@test ~]# virsh restore /tmp/r6.save
   Domain restored from /tmp/r6.save

6. Do S3 with the guest again; the virsh command hangs here:
   [root@test ~]# virsh dompmsuspend r6 --target mem   <== hangs
   ^C

7. Save the guest again; the save fails:
   [root@test ~]# virsh save r6 /tmp/r6.save
   error: Failed to save domain r6 to /tmp/r6.save
   error: Timed out during operation: cannot acquire state change lock   <== new error
   [root@test ~]# virsh domjobinfo r6
   Job type: None

8. Destroy the guest; libvirtd crashes here:
   [root@test ~]# virsh destroy r6
   Domain r6 destroyed
   [root@test ~]# service libvirtd status
   libvirtd (pid 13877) is running...
   [root@test ~]# ps aux | grep libvirtd | grep -v grep
   root 13877 2.1 0.1 1050292 13868 ? Sl 18:47 0:16 libvirtd --daemon

(In reply to Hu Jianwei from comment #11)

> 1. For my step 6 below, I still reproduced the hang, and found an existing
> bug that matches it: bug 890648 - guest agent commands will hang if the
> guest agent crashes while executing a command.

Not good - that probably needs to be fixed.

> 2. For my step 7 below, the old error message disappeared, but a new one is
> reported: "error: Timed out during operation: cannot acquire state change lock".

This may be a result of the failure in step 6.

> 6. Do S3 with the guest again; the virsh command hangs here:
> [root@test ~]# virsh dompmsuspend r6 --target mem   <== hangs
> ^C

In general, when you don't allow one command to finish...

> 7. Save the guest again; the save fails:
> [root@test ~]# virsh save r6 /tmp/r6.save
> error: Failed to save domain r6 to /tmp/r6.save
> error: Timed out during operation: cannot acquire state change lock   <== new error

...then it is normal for other commands to fail to obtain the state change lock.

> [root@test ~]# virsh domjobinfo r6
> Job type: None
>
> 8. Destroy the guest; libvirtd crashes here:
> [root@test ~]# virsh destroy r6
> Domain r6 destroyed
>
> [root@test ~]# service libvirtd status
> libvirtd (pid 13877) is running...

How is that evidence of a crash? By "crash", I was expecting a core dump or the process disappearing (which is bad); but your paste shows it is still active.
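Eric's point above, that a hung job keeps the per-domain job lock held so later jobs time out waiting for it, can be sketched with a toy model (names and the timeout value are illustrative; this is not libvirt's actual code):

```python
import threading

# Per-domain job lock: state-changing commands must hold it.
state_change_lock = threading.Lock()

def hung_dompmsuspend():
    # Models the dompmsuspend job blocked forever on the guest agent:
    # it takes the job lock and never releases it.
    state_change_lock.acquire()

def save_to_file(timeout=0.2):
    # A later job waits for the lock with a timeout instead of
    # blocking forever, then surfaces an error to the caller.
    if not state_change_lock.acquire(timeout=timeout):
        return "error: Timed out during operation: cannot acquire state change lock"
    try:
        return "Domain saved"
    finally:
        state_change_lock.release()

hung_dompmsuspend()
result = save_to_file()
print(result)  # → error: Timed out during operation: cannot acquire state change lock
```

The timeout error is therefore expected behavior, not a second bug: the real problem is whatever kept the first job from finishing.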
Following up on my comment 11, I retested this bug with libvirt-0.10.2-23.el6.x86_64 and got a different error message, only in step 6 below; the other steps passed, with no hang or crash.

Version:
libvirt-0.10.2-23.el6.x86_64
qemu-kvm-0.12.1.2-2.398.el6.x86_64
kernel-2.6.32-412.el6.x86_64

1. getenforce
2. Added the related config to the domain XML.
3. Started the qemu-ga service in the guest.

4. Do S3 with the guest, then wake it up:
   [root@test777 ~]# virsh dompmsuspend r6 --target mem
   Domain r6 successfully suspended
   [root@test777 ~]# virsh dompmwakeup r6
   Domain r6 successfully woken up

5. Save and restore the guest:
   [root@test777 ~]# virsh save r6 /tmp/r6.save
   Domain r6 saved to /tmp/r6.save
   [root@test777 ~]# Domain restored from /tmp/r6.save

6. Do S3 with the guest again; the virsh command reports a new error:
   [root@test777 ~]# virsh dompmsuspend r6 --target mem
   error: Domain r6 could not be suspended
   error: internal error unable to execute QEMU command 'guest-suspend-ram': child process has failed to suspend   <== here

7. Save the guest again; save and restore now succeed:
   [root@test777 ~]# virsh save r6 /tmp/r6.save
   Domain r6 saved to /tmp/r6.save
   [root@test777 ~]# virsh restore /tmp/r6.save
   Domain restored from /tmp/r6.save
   [root@test777 ~]# virsh domjobinfo r6
   Job type: None

8. Destroy the guest; libvirtd stays alive:
   [root@test777 ~]# virsh destroy r6
   Domain r6 destroyed
   [root@test777 ~]# virsh list --all
    Id    Name                 State
   ----------------------------------------------------
    -     r6                   shut off
   [root@test777 ~]# service libvirtd status
   libvirtd (pid 5040) is running...
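The "child process has failed to suspend" message in step 6 is a JSON error reply from the guest agent, which libvirt parses and surfaces to the virsh user. A minimal re-parse of a payload of that shape (the literal string mirrors the qemu-ga error format; this is not libvirt's parsing code):

```python
import json

# Guest agent error reply of the same shape as the one libvirt
# receives for a failed guest-suspend-ram command.
line = ('{"error": {"class": "GenericError", '
        '"desc": "child process has failed to suspend", '
        '"data": {"message": "child process has failed to suspend"}}}')

reply = json.loads(line)
if "error" in reply:
    err = reply["error"]
    # libvirt reports the "desc" field in the virsh error output
    print(f"agent error ({err['class']}): {err['desc']}")
```

Since the error originates inside the guest (the agent's suspend helper process failed), it points at guest-side S3 support rather than at libvirtd.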
For step 6, we sometimes get the following error messages instead:

[root@test777 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: Guest agent is not responding: Guest agent not available for now

[root@test777 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: Guest agent is not responding: QEMU guest agent is not available due to an error

Note: log from libvirtd.log:
...
2013-08-29 08:02:14.330+0000: 5040: debug : qemuAgentIOProcessLine:317 : Line [{"error": {"class": "GenericError", "desc": "child process has failed to suspend", "data": {"message": "child process has failed to suspend"}}}]
2013-08-29 08:02:14.330+0000: 5040: debug : virJSONValueFromString:975 : string={"error": {"class": "GenericError", "desc": "child process has failed to suspend", "data": {"message": "child process has failed to suspend"}}}
...

Do I need to file a new bug to track the issues mentioned above?

> 6. Do S3 with the guest again; the virsh command reports a new error:
> [root@test777 ~]# virsh dompmsuspend r6 --target mem
> error: Domain r6 could not be suspended
> error: internal error unable to execute QEMU command 'guest-suspend-ram':
> child process has failed to suspend   <== here

For this issue, I found a workaround: start the qemu-ga daemon in the guest with "qemu-ga -d" instead of "service qemu-ga restart". However, I do not know the root cause of why my guest cannot do S3 with the default daemon. By the way, when we used the "qemu-ga -d" command in the guest, we still got a hang of "virsh dompmsuspend r6 --target mem"; the results of the remaining steps are the same as comment 11.

Hi Eric Blake, I tested with the latest libvirt version, following the detailed steps of comment 16, and got the same results as comment 17.

Version:
qemu-kvm-rhev-0.12.1.2-2.411.el6.x86_64
qemu-guest-agent-0.12.1.2-2.411.el6.x86_64
libvirt-0.10.2-29.el6.x86_64
kernel-2.6.32-421.el6.x86_64

1.
When using the default qemu-ga daemon in the guest (started automatically with the OS), we got an error:

(in guest) [root@dhcp-66-83-11 ~]# ps aux | grep qemu | grep -v grep
root 1525 0.0 0.1 15592 1024 ? Ss 18:14 0:00 /usr/bin/qemu-ga --daemonize --method virtio-serial --path /dev/virtio-ports/org.qemu.guest_agent.0 --logfile /var/log/qemu-ga/qemu-ga.log --pidfile /var/run/qemu-ga.pid --blacklist guest-file-open guest-file-close guest-file-read guest-file-write guest-file-seek guest-file-flush

(in host) [root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: internal error unable to execute QEMU command 'guest-suspend-ram': child process has failed to suspend

2. Kill the above qemu-ga process in the guest and start it with "qemu-ga -d"; "virsh dompmsuspend r6 --target mem" hangs again after the save/restore actions:

[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@intel-5130-16-2 ~]# virsh dompmwakeup r6
Domain r6 successfully woken up
[root@intel-5130-16-2 ~]# virsh save r6 /tmp/r6.save
Domain r6 saved to /tmp/r6.save
[root@intel-5130-16-2 ~]# virsh restore /tmp/r6.save
Domain restored from /tmp/r6.save
[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
^C
[root@intel-5130-16-2 ~]#

Regarding the above problem, is it associated with bug 890648? Could we skip this issue in order to verify this bug?

The libvirtd crash has been fixed according to comment 14 and comment 19, so this bug is changed to VERIFIED (only the libvirtd crash issue was verified). The S3 hang after save/restore mentioned above will be tracked in bug 890648 (https://bugzilla.redhat.com/show_bug.cgi?id=890648#c10). If there are any new issues after bug 890648 is fixed, I will re-verify.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html