Bug 872420
| Field | Value |
| --- | --- |
| Summary | pm-hibernate exit code does not indicate failure when S4 fails |
| Product | Red Hat Enterprise Linux 6 |
| Component | pm-utils |
| Version | 6.4 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Reporter | zhpeng |
| Assignee | Jaroslav Škarvada <jskarvad> |
| QA Contact | Desktop QE <desktop-qa-list> |
| CC | amit.shah, cwei, dyuan, juzhang, lcapitulino, mkenneth, mzhan, qzhang, rbalakri, rpacheco, rvokal, thozza, tpelka, virt-maint |
| Target Milestone | rc |
| Target Release | --- |
| Keywords | FastFix, Patch |
| Doc Type | If docs needed, set a value |
| Clones | 878966 (view as bug list) |
| Bug Blocks | 912287 |
| Type | Bug |
| Regression | --- |
| Last Closed | 2017-09-06 07:22:36 UTC |
Description
zhpeng
2012-11-02 03:18:27 UTC
Created attachment 636969 [details]
qemu-ga log

Created attachment 636989 [details]
libvirtd crash log

I can reproduce the bug with the steps in comment #3, even using a normal guest (with windowsX installed) with 1 GB of memory. My packages are:

# rpm -qa libvirt qemu-kvm-rhev; uname -r
libvirt-0.10.2-7.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.330.el6.x86_64
2.6.32-335.el6.x86_64

with qemu-guest-agent-0.12.1.2-2.330.el6.x86_64.rpm installed in the guest.

Strangely, when I issued the command:

# virsh dompmsuspend test --target disk

the guest screen flashed briefly, the guest then resumed to its previous state, and the command simply hung. Please see the libvirtd crash log in the next comment.

Created attachment 641268 [details]
crashed libvirtd log
The bug can be reproduced using the qemu command line. After the operations below, the guest fails to save to disk and the JSON command { "execute": "guest-suspend-disk"} hangs forever. If we do not balloon memory before suspending the guest to disk, the suspend succeeds.

Steps:

1. Start a qemu process instance:

/usr/libexec/qemu-kvm -m 4096 -smp 1 -name "rhel6u3pm" \
    -drive file=/virt/rhel63.snap1 \
    -device virtio-serial-pci,id=virtio-serial0,bus=pci.0 \
    -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/rhel6u3.agent,server,nowait \
    -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0 -monitor stdio

2. Perform the balloon operation from 4G down to 2G:

(qemu) balloon 2048

3. Connect to the socket entry point of qemu-guest-agent:

nc -U /var/lib/libvirt/qemu/rhel6u3.agent
{"execute":"guest-sync", "arguments":{"id":1234}}
{"return": 1234}
{ "execute": "guest-suspend-disk"}

RPMs:
qemu-kvm-rhev-tools-0.12.1.2-2.334.el6.x86_64
qemu-img-rhev-0.12.1.2-2.334.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.334.el6.x86_64
kernel-firmware-2.6.32-341.el6.noarch
kernel-2.6.32-341.el6.x86_64

So, moving the component to qemu-kvm for help.

(In reply to comment #7)
> The bug can be reproduced using the qemu command line. [...]

Hi, Guannan

In RHEL 6.4 we have two options to enable S3 and S4: "-global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0". The default values are both 1 (S3 and S4 disabled), and it seems additional options need to be added to the XML files as well when using virsh. The command line you provided does not include these options, so S3/S4 are disabled. Could you guys have a try?

Thanks,
Qunfang

Btw, even if we disable S4, we can still suspend to disk; this is a workaround for Linux guests. The guest simulates S4 by shutting down while storing a hibernation image, rather than using the ACPI method. However, there is an issue when suspending to disk after ballooning memory; please refer to bug 806256.

(In reply to comment #8)
> In RHEL 6.4 we have two options to enable S3 and S4 [...] Could you guys have a try?

We tried this command:

/usr/libexec/qemu-kvm -m 4096 -smp 1 -name "rhelpme" -drive file=/virt/rhel63.snap1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/rhel6u3.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio

and it still hangs, so I think it's a qemu bug.

I tried again and the results are:

4096 -> 4000: S4 works fine
4096 -> 3500: S4 works fine
4096 -> 2048: S4 fails but qemu does not exit

So I think libvirt or qemu-kvm should detect this case instead of waiting forever or failing to exit.

Hi Amit,

Based on comment 11, this may be a duplicate of bug 806256, which was closed as WONTFIX. Could you help double-confirm? If we still don't plan to fix it in RHEL 6.4, we could close it; if not, we could keep this open.

Thanks,
Qunfang

Bug 806256 only dealt with the problem of the guest not suspending after ballooning. However, there seems to be an additional problem here: qmp and/or libvirt not recovering from a guest that won't suspend. Adding Luiz to check.

The current case is that the qmp command { "execute": "guest-suspend-disk"} hangs instead of returning a success or failure value, so libvirt hangs there too.

From what I've read I can think of two possible causes: either Amit is right in comment #13 (i.e. qemu-ga is not recovering from a failed suspend) or the _kernel_ is not reporting the error appropriately to user space.

I'm investigating this right now, but I have a few comments/questions:

1. zhpeng, re comment 3: libvirtd crashing is certainly a libvirt bug, so please open a new bz for that issue.

2. Do you have the pm-utils package installed? If you don't, could you please try to reproduce with it installed?
If you do have it installed, could you please remove it and try to reproduce the issue? Please report all results.

3. Could you please try to send the guest-ping command after qemu-ga is supposedly hung?

The guest-suspend-disk command does _not_ return a success response. This is done to avoid possible races with clients, as the guest can suspend before qemu-ga is able to send a success response. However, it obviously should either successfully suspend (in which case qemu will exit) or return an error response.

(In reply to comment #15)
> 2. Do you have the pm-utils package installed? [...]

This is all done in the guest, btw.

The root cause of this problem is that pm-hibernate in RHEL 6.4 does not return a failure exit code when suspending fails. It does in Fedora, though, so only RHEL is affected. Here's a quick reproducer:

1. Start a qemu VM with 2 GB of RAM and RHEL 6.4 as the guest (comment 10 has a command-line example)

2. As soon as the guest has booted, switch to qemu's monitor and run:

(qemu) balloon 700

3. Then log into the system and check that hibernate fails:

# echo disk > /sys/power/state
bash: echo: write error: Cannot allocate memory

4. Then try it with pm-hibernate:

# pm-hibernate
# echo $?
0

On F16, pm-hibernate successfully detects the error and returns 128.

Some additional comments:

1. qemu-ga doesn't hang. It is actually acting as expected: pm-hibernate reports success, so qemu-ga assumes that suspending succeeded and doesn't emit a success response (see the last paragraph of comment 15 for more details)

2. libvirt and/or virsh are also buggy, as they should have a timeout to detect stale responses (will clone this bz for libvirt)

3. As a workaround, you could remove the pm-utils package (however, having pm-utils installed is *strongly* recommended for regular usage)

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.

Created attachment 649590 [details]
Backported fix
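Throughout this report the agent is driven interactively with nc, which blocks forever when no response arrives. As an illustrative sketch (not part of the fix; the socket path matches the -chardev option from the reproducer, and the helper name is hypothetical), a client could apply a read timeout so that silence is surfaced to the caller instead of hanging the whole tool:

```python
import json
import socket

# Assumed socket path, taken from the -chardev line in the reproducer above.
QGA_SOCKET = "/var/lib/libvirt/qemu/rhel6u3.agent"

def qga_command(cmd, timeout=10.0, path=QGA_SOCKET):
    """Send one qemu-ga command, waiting at most `timeout` seconds for a reply.

    guest-suspend-disk deliberately sends no success response, so a timeout
    here means "suspended, or failed silently" -- the caller must then
    disambiguate (e.g. with a follow-up guest-ping) rather than block forever.
    Returns the decoded JSON reply, or None on timeout.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        s.sendall((json.dumps(cmd) + "\n").encode())
        return json.loads(s.recv(4096).decode())
    except socket.timeout:
        return None
    finally:
        s.close()

# Usage mirroring the nc session from the reproduction steps:
#   qga_command({"execute": "guest-sync", "arguments": {"id": 1234}})
#   qga_command({"execute": "guest-suspend-disk"})
```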
I've tested the attached patch and it fixes the problem; now you'll get:

{ "execute": "guest-suspend-disk"}
{"error": {"class": "UndefinedError", "desc": "An undefined error has ocurred", "data": {}}}

The error message is quite bad, but that's a different story.

Did this get fixed in RHEL-6.4? If yes, then the bz shouldn't be in ASSIGNED state.

(In reply to comment #23)
> Did this get fixed in RHEL-6.4? If yes, then the bz shouldn't be in ASSIGNED state.

No, it wasn't; proposing for 6.5.0.

(In reply to comment #15)
> 1. zhpeng on comment 3: libvirtd crashing is a libvirt bug for sure. So please, open a new bz for the issue

Retested with libvirt-0.10.2-18.el6.x86_64; the issue still exists, so I'll file a new bz for it.

> 2. Do you have the pm-utils package installed? If you don't, could you please try to reproduce with it installed? If you do have it installed could you please remove it and try to reproduce the issue?

pm-utils was installed (it is required by libvirt-client). I removed it and tested again; the result did not change.

> 3. Could you please try to send the guest-ping command after qemu-ga is supposedly hung?

When virsh dompmsuspend is 'hung', the guest is running, operations can still be performed, and the network is fine.

I'm confused about this BZ's status. A fix for the problem has existed since last year but hasn't made it into a release yet. How so?

Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com
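For completeness, the exit-status contract at the heart of this bug can be sketched as follows. This is an illustrative outline, not qemu-ga's actual implementation (function name and parameters are hypothetical): the caller trusts pm-hibernate's exit code when pm-utils is present, which the RHEL 6.4 pm-hibernate defeated by returning 0 on failure, and otherwise falls back to /sys/power/state, where a failed suspend surfaces as a write error (e.g. ENOMEM after aggressive ballooning, as in comment 11):

```python
import os
import subprocess

def suspend_to_disk(pm_hibernate="/usr/sbin/pm-hibernate",
                    state_file="/sys/power/state"):
    """Attempt suspend-to-disk, reporting failure instead of hiding it.

    If the pm-utils helper exists, run it and trust its exit code; a
    nonzero status means the suspend failed. Otherwise write 'disk' to
    the kernel's state file, where failure raises OSError directly.
    """
    if os.path.exists(pm_hibernate):
        rc = subprocess.call([pm_hibernate])
        if rc != 0:
            raise RuntimeError("pm-hibernate failed with exit code %d" % rc)
    else:
        try:
            with open(state_file, "w") as f:
                f.write("disk")
        except OSError as e:
            raise RuntimeError("suspend-to-disk failed: %s" % e)
```

With this contract in place, an agent such as qemu-ga can turn a failed suspend into the error response seen in the patched output above, rather than reporting nothing.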