Libvirt and/or virsh are unable to detect stale QMP/qemu-ga responses. Most of the time such stale responses are caused by bugs or malfunctioning components further down the management stack (i.e. outside of libvirt's scope). However, being able to detect stale responses would make libvirt and/or virsh more robust.

As an example, consider the (summarized) bug below, which the end user would not hit if libvirt and/or virsh were able to detect the missing response.

+++ This bug was initially created as a clone of Bug #872420 +++

Description of problem:
virsh setmem then dompmsuspend to disk will hang forever

Version-Release number of selected component (if applicable):
libvirt-0.10.2-6.el6.x86_64
qemu-guest-agent-0.12.1.2-2.333

How reproducible:
80%

Steps to Reproduce:
Prepare a 4G memory guest and start it.

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk
Domain aaa successfully suspended
--------------> no problem

[root@zhpeng ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     aaa                            shut off

[root@zhpeng ~]# virsh start aaa
Domain aaa started

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk
Domain aaa successfully suspended
-----------> no problem

[root@zhpeng ~]# virsh start aaa
Domain aaa started

[root@zhpeng ~]# virsh setmem --live aaa 2048000

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk
------------> it hangs forever

--- Additional comment from Luiz Capitulino on 2012-11-21 11:23:22 EST ---

The root cause of this problem is that pm-hibernate in RHEL6.4 does not return a failure exit code when suspending fails. It does in Fedora though, so only RHEL is affected.

Here's a quick reproducer:

1. Start a qemu VM with 2 GB of RAM and RHEL6.4 as a guest (comment 10 has a command-line example)

2. As soon as the guest has booted, change to qemu's monitor and run:

   (qemu) balloon 700

3. Then log into the system and check that hibernate will fail:

   # echo disk > /sys/power/state
   bash: echo: write error: Cannot allocate memory

4. Then try it with pm-hibernate:

   # pm-hibernate
   # echo $?
   0

On F16 pm-hibernate successfully detects the error and returns 128.

Some additional comments:

1. qemu-ga doesn't hang. Actually, it's acting as expected: pm-hibernate reports success, so qemu-ga assumes that suspending succeeded and doesn't emit a success response (see last paragraph of comment 15 for more details)

2. libvirt and/or virsh are also buggy, as they should have a timeout to detect stale responses (will clone this bz for libvirt)

3. As a workaround, you could remove the pm-utils package (however, having pm-utils installed is *strongly* recommended for regular use)
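
Until libvirt or virsh can notice the missing reply on its own, a script that drives virsh can at least bound its own wait. The following is only a rough sketch in Python: the helper name run_with_timeout and the 120-second value are made up for illustration and are not part of libvirt or virsh. It kills the virsh client when the deadline passes. It does not cancel whatever libvirtd itself is still waiting on, so the domain may remain busy on the daemon side.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it did not finish in time.

    Hypothetical helper: subprocess.run() kills the child process when the
    timeout expires and then raises TimeoutExpired, which is mapped to None.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example call (the 120 s limit is an arbitrary assumption, not a recommendation):
#   rc = run_with_timeout(["virsh", "dompmsuspend", "aaa", "--target", "disk"], 120)
#   if rc is None:
#       print("no reply from the guest agent within 120 s, giving up")
```

A None result only says that virsh did not return in time. It cannot tell a guest that is still writing its hibernation image from one where the suspend failed silently.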
libvirt doesn't like to be in the business of inventing hard-coded timeouts. Because of that, we will either have to introduce a new API for PM suspend that lets the caller specify a timeout, or provide a way of cancelling API calls that are waiting for a reply from the guest agent. Neither of these can be done in 6.4.
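
To make the two options concrete, here is a minimal sketch in Python of a reply wait that takes a caller-supplied timeout and can also be cancelled from another thread. This is not libvirt code: wait_for_reply, AgentReplyTimeout, AgentReplyCancelled and the queue-based reply channel are all hypothetical names chosen for illustration.

```python
import queue
import threading
import time


class AgentReplyTimeout(Exception):
    """Hypothetical error: no reply arrived before the caller's deadline."""


class AgentReplyCancelled(Exception):
    """Hypothetical error: the caller cancelled the wait."""


def wait_for_reply(replies, timeout_s, cancel, poll_s=0.05):
    """Wait for one item on `replies` (a queue.Queue standing in for the agent channel).

    The timeout is supplied by the caller rather than hard-coded, and `cancel`
    (a threading.Event) lets another thread abort the wait early.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if cancel.is_set():
            raise AgentReplyCancelled()
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise AgentReplyTimeout()
        try:
            # Wake up regularly so a cancel request is noticed promptly.
            return replies.get(timeout=min(poll_s, remaining))
        except queue.Empty:
            continue
```

In this sketch the policy stays with the caller: a management application picks the timeout or sets the cancel event, so nothing is hard-coded in the waiting code itself.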
Agreed this is not for 6.4. It's really something for the future.
This seems like the type of thing that is just going to sit dormant forever until we are hit by another real issue, so closing as DEFERRED.