Bug 878966

Summary: virsh setmem then dompmsuspend to disk will hang forever
Product: [Community] Virtualization Tools
Component: libvirt
Version: unspecified
Hardware: x86_64
OS: Linux
Status: CLOSED DEFERRED
Severity: medium
Priority: medium
Assignee: John Ferlan <jferlan>
Reporter: Luiz Capitulino <lcapitulino>
CC: amit.shah, crobinso, cwei, dyuan, juzhang, lcapitulino, mkenneth, mzhan, qzhang, rbalakri, rpacheco, shyu, virt-maint
Doc Type: Bug Fix
Type: Bug
Clone Of: 872420
Clones: 975376
Bug Blocks: 912287, 975376
Last Closed: 2016-03-24 00:57:46 UTC

Description Luiz Capitulino 2012-11-21 16:41:26 UTC
Libvirt and/or virsh are unable to detect stale QMP/qemu-ga responses. Most of the time such stale responses are caused by bugs or malfunctioning components further down the management stack (i.e. outside of libvirt's scope). However, being able to detect stale responses would make libvirt and/or virsh more robust.
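
For illustration, a minimal caller-side mitigation (a sketch only, not an existing libvirt feature) is to bound the wait in the management script itself, e.g. with coreutils timeout; the domain name "aaa" and the 120-second deadline below are made-up example values:

# Illustrative sketch: bound the wait on the caller's side so a missing
# qemu-ga reply cannot hang the calling script forever.
if ! timeout 120 virsh dompmsuspend aaa --target disk; then
    echo "dompmsuspend did not finish within 120s; guest agent reply missing?" >&2
fi

Note that this only unblocks the calling script; the API call inside libvirtd may still be left waiting on the agent.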

As an example, consider the (summarized) bug below, which wouldn't happen for the end user if libvirt and/or virsh were able to detect the missing response.

+++ This bug was initially created as a clone of Bug #872420 +++

Description of problem:
virsh setmem then dompmsuspend to disk will hang forever

Version-Release number of selected component (if applicable):
libvirt-0.10.2-6.el6.x86_64
qemu-guest-agent-0.12.1.2-2.333

How reproducible:
80%

Steps to Reproduce:
Prepare a guest with 4 GB of memory and start it.

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk
Domain aaa successfully suspended               --------------> no problem
[root@zhpeng ~]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 -     aaa                            shut off

[root@zhpeng ~]# virsh start aaa
Domain aaa started

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk
Domain aaa successfully suspended                  -----------> no problem
[root@zhpeng ~]# virsh start aaa
Domain aaa started

[root@zhpeng ~]# virsh setmem --live aaa 2048000

[root@zhpeng ~]# virsh dompmsuspend aaa --target disk      ------------> it hangs forever

--- Additional comment from Luiz Capitulino on 2012-11-21 11:23:22 EST ---

The root cause of this problem is that pm-hibernate in RHEL6.4 does not return a failure exit code when suspending fails. It does in Fedora though, so only RHEL is affected.

Here's a quick reproducer:

1. Start a qemu VM with 2 GB of RAM and RHEL6.4 as the guest (comment 10 has a command-line example)

2. As soon as the guest has booted, change to qemu's monitor and run:

(qemu) balloon 700

3. Then log into the system and check that hibernate will fail:

# echo disk > /sys/power/state
bash: echo: write error: Cannot allocate memory

4. Then try it with pm-hibernate

# pm-hibernate
# echo $?
0

On F16 pm-hibernate successfully detects the error and returns 128.
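
For comparison, writing to /sys/power/state directly surfaces the kernel's error instead of masking it; a rough guest-side check (run as root, illustrative only) could look like:

# Illustrative sketch: request hibernation through sysfs so a failure such as
# ENOMEM is reported directly rather than hidden by pm-hibernate's exit code.
if ! echo disk > /sys/power/state; then
    echo "hibernate request rejected by the kernel" >&2
    exit 1
fi

On a real hibernate the write only returns after resume, so reaching the error branch is a reliable sign that the request failed.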

Some additional comments:

1. qemu-ga doesn't hang. Actually, it's acting as expected: pm-hibernate reports success, so qemu-ga assumes that suspending succeeded and doesn't emit a success response (see last paragraph of comment 15 for more details)

2. libvirt and/or virsh are also buggy, as they should have a timeout to detect stale responses (will clone this bz for libvirt)

3. As a workaround, you could remove the pm-utils package, as shown below (however, having pm-utils installed is *strongly* recommended for regular use)
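
For reference, on a RHEL 6 guest the workaround in item 3 boils down to the command below; qemu-ga should then fall back to writing /sys/power/state itself, where the failure is visible (sketch only):

# Illustrative only: remove pm-utils inside the guest so qemu-ga stops relying
# on pm-hibernate's (broken) exit code.
yum remove -y pm-utils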

Comment 2 Jiri Denemark 2012-11-26 13:48:16 UTC
libvirt doesn't like to be in the business of inventing hard-coded
timeouts. And because of that, we will either have to introduce a new
API for PM suspend which supports specifying a timeout or provide a
way of cancelling APIs waiting for a reply from guest agent. Neither
of these can be done in 6.4.
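
Until such an API exists, the closest a caller can get to the second option is to cancel its own client call after a deadline. A rough sketch ("aaa" and the 120-second deadline are example values; killing virsh only unblocks the script, the agent call inside libvirtd may still be pending):

# Illustrative sketch: run the suspend in the background and give up on the
# client side if no reply arrives within the deadline.
virsh dompmsuspend aaa --target disk &
virsh_pid=$!
for _ in $(seq 120); do
    kill -0 "$virsh_pid" 2>/dev/null || break   # virsh has finished
    sleep 1
done
if kill -0 "$virsh_pid" 2>/dev/null; then
    echo "no reply from the guest agent after 120s, cancelling" >&2
    kill "$virsh_pid"
fi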

Comment 3 Luiz Capitulino 2012-11-27 11:36:22 UTC
Agreed this is not for 6.4. It's really something for the future.

Comment 9 Cole Robinson 2016-03-24 00:57:46 UTC
This seems like the type of thing that is just going to sit dormant forever until there's another real issue we are hit by, so closing as DEFERRED.