Bug 928661 - libvirtd crashes when destroying a Linux guest that executed a series of S3 and save/restore operations
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: libvirt
Version: 6.4
Hardware: x86_64 Linux
Priority: high  Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Eric Blake
QA Contact: Virtualization Bugs
Keywords: Regression, ZStream
Depends On: 638512
Blocks: 928672 1017194
Reported: 2013-03-28 03:49 EDT by zhenfeng wang
Modified: 2013-11-21 03:56 EST

See Also:
Fixed In Version: libvirt-0.10.2-22.el6
Doc Type: Bug Fix
Doc Text:
Cause: Code refactoring done to fix another bug left a case where locks were cleaned up incorrectly. Consequence: libvirtd could crash in certain migration-to-file scenarios. Fix: The lock cleanup paths were corrected. Result: libvirtd no longer crashes when saving a domain to file.
Story Points: ---
Clone Of:
Clones: 928672
Environment:
Last Closed: 2013-11-21 03:56:30 EST
Type: Bug


Attachments
The guest's xml (2.51 KB, text/plain)
2013-03-28 03:56 EDT, zhenfeng wang
The gdb info about libvirtd crash (2.44 KB, text/plain)
2013-03-28 05:54 EDT, zhenfeng wang
full* stack trace for all threads (5.91 KB, text/plain)
2013-03-28 06:24 EDT, zhenfeng wang

Description zhenfeng wang 2013-03-28 03:49:37 EDT
Description of problem:
libvirtd crashes when destroying a guest that has executed the following sequence of operations:
dompmsuspend => dompmwakeup => save => restore => dompmsuspend (the virsh command hangs here) => save => destroy => libvirtd crashes

Version-Release number of selected component (if applicable):
libvirt-0.10.2-18.el6_4.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.355.el6_4.2.x86_64
kernel-2.6.32-358.2.1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1.# getenforce
Enforcing

2. Prepare a guest with a qemu-ga environment
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     rhelnew1                       running

3.Start the qemu-ga service in guest
# qemu-ga -d

4. Suspend the guest to S3, then wake it up
# virsh dompmsuspend rhelnew1 --target mem
# virsh dompmwakeup rhelnew1

5.Save and restore the guest
# virsh save rhelnew1 /tmp/rhelnew1.save

# virsh restore /tmp/rhelnew1.save

6. Suspend the guest to S3 again; the virsh command hangs here
# virsh dompmsuspend rhelnew1 --target mem

7. Save the guest again; the save fails
# virsh save rhelnew1 /tmp/rhelnew1.save
error: Failed to save domain rhelnew1 to /tmp/rhelnew1.save
error: internal error unexpected async job 3

# virsh domjobinfo rhelnew1
Job type:         None   

8. Destroy the guest; libvirtd crashes here
# virsh destroy rhelnew1
Domain rhelnew1 destroyed

# virsh list
error: Failed to reconnect to the hypervisor
error: no valid connection
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': Connection refused

# ps aux|grep libvirtd
root      5123  0.0  0.0 103244   828 pts/2    S+   15:39   0:00 grep libvirtd

# service libvirtd status
libvirtd dead but pid file exists

Actual results:
libvirtd crashed.

Expected results:
The virsh commands should not hang, and the libvirtd service should not crash.
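The sequence in steps 4-8 can be scripted; this is a minimal sketch, assuming a running guest named rhelnew1 with qemu-ga active as in steps 1-3. The `reproduce` helper name is ours, not part of the report.

```shell
#!/bin/sh
# Sketch of the reproduction sequence from the report.  Before the fix,
# the final destroy crashed libvirtd; with libvirt-0.10.2-22.el6 it
# should merely surface the earlier errors.
reproduce() {
    guest=$1
    statefile=/tmp/${guest}.save
    virsh dompmsuspend "$guest" --target mem || return 1
    virsh dompmwakeup "$guest"               || return 1
    virsh save "$guest" "$statefile"         || return 1
    virsh restore "$statefile"               || return 1
    # Step 6: this dompmsuspend is where virsh hung in the report.
    virsh dompmsuspend "$guest" --target mem
    # Step 7: this save failed with "unexpected async job 3".
    virsh save "$guest" "$statefile"
    # Step 8: the destroy that crashed libvirtd before the fix.
    virsh destroy "$guest"
}
```

Run as `reproduce rhelnew1` on a host showing the bug; each step's exit status mirrors the virsh results quoted above.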
Comment 1 zhenfeng wang 2013-03-28 03:56:55 EDT
Created attachment 717485 [details]
The guest's xml
Comment 2 zhenfeng wang 2013-03-28 05:54:57 EDT
Created attachment 717557 [details]
The gdb info about libvirtd crash
Comment 3 Daniel Berrange 2013-03-28 06:14:30 EDT
Please provide the *full* stack trace for all threads, i.e. 'thread apply all bt', not merely 'bt'.
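The trace Daniel asks for can be captured non-interactively; a sketch, assuming gdb and the libvirt debuginfo packages are installed and libvirtd is still running (for a core file, point gdb at the binary and core instead). The `collect_backtraces` name is ours.

```shell
#!/bin/sh
# Dump backtraces for every libvirtd thread, as requested in comment 3.
collect_backtraces() {
    # Find the daemon; fail cleanly if libvirtd is not running.
    pid=$(pidof libvirtd) || return 1
    # -batch exits after the -ex commands; redirect stdout to a file
    # when attaching this to the bug.
    gdb -p "$pid" -batch \
        -ex 'set pagination off' \
        -ex 'thread apply all bt'
}
```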
Comment 4 zhenfeng wang 2013-03-28 06:24:26 EDT
Created attachment 717581 [details]
full* stack trace for all threads
Comment 5 Eric Blake 2013-08-13 12:01:23 EDT
Per https://bugzilla.redhat.com/show_bug.cgi?id=928672#c7, we know the right fix.
Comment 6 Eric Blake 2013-08-13 17:09:21 EDT
Regression introduced when fixing bug 638512
Comment 11 Hu Jianwei 2013-08-15 07:34:28 EDT
Hi,

According to the steps of https://bugzilla.redhat.com/show_bug.cgi?id=928661#c0, libvirtd no longer crashes, but the dompmsuspend command still hangs (in step 6) and the save command reports an error (in step 7).

Could you help me confirm the two questions below?

1. For my step 6 below, I can still reproduce the hang, and found an existing bug that matches it: Bug 890648 - guest agent commands will hang if the guest agent crashes while executing a command.

2. For my step 7 below, the old error message disappeared, but a new one is reported: "error: Timed out during operation: cannot acquire state change lock".

Version:
libvirt-0.10.2-22.el6.x86_64
qemu-kvm-0.12.1.2-2.393.el6.x86_64
qemu-guest-agent-0.12.1.2-2.393.el6.x86_64
kernel-2.6.32-358.el6.x86_64

1.# getenforce
Enforcing

2. Prepare a guest with a qemu-ga environment and add the configuration below to the domain XML.
...
<pm>
    <suspend-to-mem enabled='yes'/>
    <suspend-to-disk enabled='yes'/>
</pm>
...
<channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/r6.agent'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
</channel>
...
# virsh list --all
 Id    Name                           State
----------------------------------------------------
 7     r6                             running

3.Start the qemu-ga service in guest
# qemu-ga -d

4. Suspend the guest to S3, then wake it up
[root@test ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@test ~]# virsh dompmwakeup r6
Domain r6 successfully woken up

5.Save and restore the guest
[root@test ~]# virsh save r6 /tmp/r6.save

Domain r6 saved to /tmp/r6.save

[root@test ~]# virsh restore /tmp/r6.save
Domain restored from /tmp/r6.save

6. Suspend the guest to S3 again; the virsh command hangs here
[root@test ~]# virsh dompmsuspend r6 --target mem             <==hung.
^C

7. Save the guest again; the save fails
[root@test ~]# virsh save r6 /tmp/r6.save
error: Failed to save domain r6 to /tmp/r6.save        
error: Timed out during operation: cannot acquire state change lock   <==New err

[root@test ~]# virsh domjobinfo r6
Job type:         None        

8.Destroy the guest,the libvirtd will be crashed here
[root@test ~]# virsh destroy r6
Domain r6 destroyed

[root@test ~]# service libvirtd status
libvirtd (pid  13877) is running...
[root@test ~]# ps aux | grep libvirtd | grep -v grep
root     13877  2.1  0.1 1050292 13868 ?       Sl   18:47   0:16 libvirtd --daemon
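Before re-running step 6, it can help to confirm the agent is actually responsive, since a dead agent would explain both the dompmsuspend hang and the later state-change-lock timeout. A sketch, assuming your libvirt exposes `virsh qemu-agent-command`; the `agent_alive` helper name is ours.

```shell
#!/bin/sh
# Probe the guest agent with guest-ping, the cheapest round trip the
# agent supports.  A short timeout keeps a dead agent from hanging
# this probe the way it hangs dompmsuspend.
agent_alive() {
    virsh qemu-agent-command "$1" --timeout 5 \
        '{"execute":"guest-ping"}' >/dev/null 2>&1
}
```

Usage: `agent_alive r6 && echo responding || echo "not responding"` before attempting dompmsuspend again.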
Comment 13 Eric Blake 2013-08-15 08:13:31 EDT
(In reply to Hu Jianwei from comment #11)

> 1. For my below step 6, I still reproduced it, and found one bug is same
> with it,Bug 890648 - guest agent commands will hang if the guest agent
> crashes while executing a command.

Not good - that probably needs to be fixed.

> 
> 2. For my below step 7, the old error message was disappeared, but reported
> a new, "error: Timed out during operation: cannot acquire state change lock".

This may be a result of the failure in step 6.


> 6.Do s3 with the guest again, the virsh command will hang here        
> #virsh dompmsuspend rhelnew1 --target mem
> [root@test ~]# virsh dompmsuspend r6 --target mem             <==hung.
> ^C

In general, when you don't allow one command to finish...

> 
> 7.Save the guest again,the guest will fail to save
> [root@test ~]# virsh save r6 /tmp/r6.save
> error: Failed to save domain r6 to /tmp/r6.save        
> error: Timed out during operation: cannot acquire state change lock   <==New
> err

...then other commands failing to obtain state lock is normal.

> 
> [root@test ~]# virsh domjobinfo r6
> Job type:         None        
> 
> 8.Destroy the guest,the libvirtd will be crashed here
> [root@test ~]# virsh destroy r6
> Domain r6 destroyed
> 
> [root@test ~]# service libvirtd status
> libvirtd (pid  13877) is running...

How is that evidence of a crash?  By crash, I was expecting a core dump or the process to disappear (which is bad); but your paste says it is still active.
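Eric's distinction can be sketched as a quick check: a "crash" means the process died (usually leaving a core dump), not an error message while the pid stays alive. The `libvirtd_crashed` helper is hypothetical, and the abrt spool path is an assumption about a default RHEL 6 setup.

```shell
#!/bin/sh
# Distinguish "libvirtd crashed" from "libvirtd returned an error":
# only the former leaves the process gone (and often a core dump).
libvirtd_crashed() {
    if pidof libvirtd >/dev/null 2>&1; then
        echo "libvirtd still running - not a crash"
        return 1
    fi
    echo "libvirtd process gone - look for a core (e.g. under /var/spool/abrt)"
    return 0
}
```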
Comment 16 Hu Jianwei 2013-08-29 06:44:08 EDT
According to my comment 11, I retested this bug with libvirt-0.10.2-23.el6.x86_64 and got a different error message in step 6 below; the other steps passed, with no hang or crash.

version:
libvirt-0.10.2-23.el6.x86_64
qemu-kvm-0.12.1.2-2.398.el6.x86_64
kernel-2.6.32-412.el6.x86_64

1.getenforce
2.Added related config to domain xml.
3.Start the qemu-ga service in guest.
4.Do s3 with the guest,then wakeup it
[root@test777 ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@test777 ~]# virsh dompmwakeup r6
Domain r6 successfully woken up
5.Save and restore the guest
[root@test777 ~]# virsh save r6 /tmp/r6.save

Domain r6 saved to /tmp/r6.save

[root@test777 ~]# virsh restore /tmp/r6.save
Domain restored from /tmp/r6.save

6. Suspend the guest to S3 again; virsh reports a new error.
[root@test777 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: internal error unable to execute QEMU command 'guest-suspend-ram': child process has failed to suspend            <===== here

7. Save the guest again; this time save and restore succeed.
[root@test777 ~]# virsh save r6 /tmp/r6.save

Domain r6 saved to /tmp/r6.save

[root@test777 ~]# virsh restore /tmp/r6.save
Domain restored from /tmp/r6.save

[root@test777 ~]# virsh domjobinfo r6
Job type:         None        

8. Destroy the guest; libvirtd stays alive.
[root@test777 ~]# virsh destroy r6
Domain r6 destroyed

[root@test777 ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     r6                             shut off

[root@test777 ~]# service libvirtd status
libvirtd (pid  5040) is running...


For step 6, we sometimes get other error messages instead, such as:
[root@test777 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: Guest agent is not responding: Guest agent not available for now

[root@test777 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: Guest agent is not responding: QEMU guest agent is not available due to an error

Note: Log from libvirtd.log
...
2013-08-29 08:02:14.330+0000: 5040: debug : qemuAgentIOProcessLine:317 : Line [{"error": {"class": "GenericError", "desc": "child process has failed to suspend", "data": {"message": "child process has failed to suspend"}}}]
2013-08-29 08:02:14.330+0000: 5040: debug : virJSONValueFromString:975 : string={"error": {"class": "GenericError", "desc": "child process has failed to suspend", "data": {"message": "child process has failed to suspend"}}}
...


Do I need to report a new bug to track the issues mentioned above?
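The GenericError replies quoted in the log excerpt above can be pulled out of libvirtd.log directly; a sketch, assuming the log keeps the JSON shape shown in the debug lines. The `agent_errors` helper name and the grep pattern are ours.

```shell
#!/bin/sh
# Extract the "desc" fields of qemu-ga error replies from a libvirtd
# debug log, e.g. "child process has failed to suspend" seen above.
agent_errors() {
    grep -o '"desc": "[^"]*"' "$1" | sort -u
}
```

Usage: `agent_errors /var/log/libvirt/libvirtd.log` lists each distinct agent error once.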
Comment 17 Hu Jianwei 2013-09-01 23:03:38 EDT
> 6.Do s3 with the guest again, the virsh command will report a new error.
> [root@test777 ~]# virsh dompmsuspend r6 --target mem
> error: Domain r6 could not be suspended
> error: internal error unable to execute QEMU command 'guest-suspend-ram':
> child process has failed to suspend            <===== here
> 

For this issue, I found a workaround: start the qemu-ga daemon in the guest with "qemu-ga -d" instead of "service qemu-ga restart". However, I don't know the root cause of why the guest cannot do S3 with the default daemon.

BTW, even when we started qemu-ga with "qemu-ga -d" in the guest, "virsh dompmsuspend r6 --target mem" still hung; the results of the remaining steps are the same as comment 11.
Comment 19 Hu Jianwei 2013-10-11 06:46:01 EDT
Hi Eric Blake,

I tested with the latest libvirt version, following the detailed steps in comment 16, and got the same results as comment 17.

Version:
qemu-kvm-rhev-0.12.1.2-2.411.el6.x86_64
qemu-guest-agent-0.12.1.2-2.411.el6.x86_64
libvirt-0.10.2-29.el6.x86_64
kernel-2.6.32-421.el6.x86_64

1. When using the default qemu-ga daemon in the guest (started automatically with the OS), we got an error.
(In guest)[root@dhcp-66-83-11 ~]# ps aux | grep qemu | grep -v grep
root      1525  0.0  0.1  15592  1024 ?        Ss   18:14   0:00 /usr/bin/qemu-ga --daemonize --method virtio-serial --path /dev/virtio-ports/org.qemu.guest_agent.0 --logfile /var/log/qemu-ga/qemu-ga.log --pidfile /var/run/qemu-ga.pid --blacklist guest-file-open guest-file-close guest-file-read guest-file-write guest-file-seek guest-file-flush

(In host)[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
error: Domain r6 could not be suspended
error: internal error unable to execute QEMU command 'guest-suspend-ram': child process has failed to suspend

2. Kill the qemu-ga process above in the guest, start it with "qemu-ga -d", and run "virsh dompmsuspend r6 --target mem" again; it hangs after the save/restore actions.

[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
Domain r6 successfully suspended
[root@intel-5130-16-2 ~]# virsh dompmwakeup r6
Domain r6 successfully woken up
[root@intel-5130-16-2 ~]# virsh save r6 /tmp/r6.save

Domain r6 saved to /tmp/r6.save

[root@intel-5130-16-2 ~]# virsh restore /tmp/r6.save 
Domain restored from /tmp/r6.save

[root@intel-5130-16-2 ~]# virsh dompmsuspend r6 --target mem
^C
[root@intel-5130-16-2 ~]# 

Regarding the problem above, is it associated with bug 890648? Could we set this issue aside when verifying this bug?
Comment 20 Hu Jianwei 2013-10-14 06:14:36 EDT
The libvirtd crash has been fixed according to comment 14 and comment 19; changing the status to VERIFIED (only the libvirtd crash issue was verified).
The S3 hang after save/restore actions mentioned above will be tracked in bug 890648: https://bugzilla.redhat.com/show_bug.cgi?id=890648#c10

If any new issues appear after bug 890648 is fixed, I'll re-verify.
Comment 22 errata-xmlrpc 2013-11-21 03:56:30 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1581.html
