Bug 872420
| Field | Value |
| --- | --- |
| Summary | pm-hibernate exit code does not indicate failure when S4 fails |
| Product | Red Hat Enterprise Linux 6 |
| Component | pm-utils |
| Version | 6.4 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| Reporter | zhpeng |
| Assignee | Jaroslav Škarvada <jskarvad> |
| QA Contact | Desktop QE <desktop-qa-list> |
| CC | amit.shah, cwei, dyuan, juzhang, lcapitulino, mkenneth, mzhan, qzhang, rbalakri, rpacheco, rvokal, thozza, tpelka, virt-maint |
| Target Milestone | rc |
| Target Release | --- |
| Keywords | FastFix, Patch |
| Doc Type | If docs needed, set a value |
| Clones | 878966 (view as bug list) |
| Bug Blocks | 912287 |
| Type | Bug |
| Regression | --- |
| Last Closed | 2017-09-06 07:22:36 UTC |
Description
zhpeng
2012-11-02 03:18:27 UTC
Created attachment 636969 [details]
qemu-ga log

Created attachment 636989 [details]
libvirtd crash log

I can reproduce the bug with the steps in comment #3, even using a normal guest (with windowsX installed) with 1 GB of memory. My packages are:

# rpm -qa libvirt qemu-kvm-rhev; uname -r
libvirt-0.10.2-7.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.330.el6.x86_64
2.6.32-335.el6.x86_64

with qemu-guest-agent-0.12.1.2-2.330.el6.x86_64.rpm installed in the guest.

Strangely, when I issued the command:

# virsh dompmsuspend test --target disk

the guest screen flashed briefly, the guest then resumed to its previous state, and the command simply hung. Please see the libvirtd crash log in the next comment.

Created attachment 641268 [details]
crashed libvirtd log
The bug can be reproduced using the qemu command line. After the operations below, the guest fails to save to disk and the JSON command { "execute": "guest-suspend-disk"} hangs forever. If we do not balloon memory before suspending the guest to disk, the suspend succeeds.

Steps:

1. Start a qemu process instance:

/usr/libexec/qemu-kvm -m 4096 -smp 1 -name "rhel6u3pm" \
    -drive file=/virt/rhel63.snap1 \
    -device virtio-serial-pci,id=virtio-serial0,bus=pci.0 \
    -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/rhel6u3.agent,server,nowait \
    -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0 -monitor stdio

2. Perform the balloon operation from 4G down to 2G:

(qemu) balloon 2048

3. Connect to the socket entry point of qemu-guest-agent:

nc -U /var/lib/libvirt/qemu/rhel6u3.agent
{"execute":"guest-sync", "arguments":{"id":1234}}
{"return": 1234}
{ "execute": "guest-suspend-disk"}

RPMs:
qemu-kvm-rhev-tools-0.12.1.2-2.334.el6.x86_64
qemu-img-rhev-0.12.1.2-2.334.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.334.el6.x86_64
kernel-firmware-2.6.32-341.el6.noarch
kernel-2.6.32-341.el6.x86_64

So, moving the component to qemu-kvm for help.

(In reply to comment #7)
> The bug can be reproduced using the qemu command line. [...]

Hi, Guannan

In RHEL 6.4 we have two options to enable S3 and S4: "-global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0". The default values are both 1 (S3 and S4 disabled), and it seems additional options need to be added to the XML files as well when using virsh. The command line you provided does not include these options, so S3/S4 are disabled. Could you guys have a try?

Thanks,
Qunfang

Btw, even if we disable S4, we can still suspend to disk; this is a workaround for Linux guests. The guest simulates S4 by shutting down while storing a hibernation image, rather than using the ACPI method. However, there is an issue when suspending to disk after ballooning memory; please refer to bug 806256.

(In reply to comment #8)
> In RHEL 6.4 we have two options to enable S3 and S4 [...] Could you guys have a try?

We tried this command:

/usr/libexec/qemu-kvm -m 4096 -smp 1 -name "rhelpme" -drive file=/virt/rhel63.snap1 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/rhel6u3.agent,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0 -global PIIX4_PM.disable_s3=0 -global PIIX4_PM.disable_s4=0 -monitor stdio

and it still hangs, so I think it's a qemu bug.

I tried again and the results are:

4096 -> 4000: S4 works fine
4096 -> 3500: S4 works fine
4096 -> 2048: S4 fails but qemu does not exit

So I think libvirt or qemu-kvm should detect this case instead of waiting forever or failing to exit.

Hi Amit,

Based on comment 11, this may be a duplicate of bug 806256, which was closed as WONTFIX. Could you help double-confirm? If we still don't plan to fix it in RHEL 6.4, we could close it; if not, we could keep this open.

Thanks,
Qunfang

Bug 806256 only dealt with the problem of the guest not suspending after ballooning. However, there seems to be an additional problem here: qmp and/or libvirt not recovering from a guest that won't suspend. Adding Luiz to check.

The current case is that the qmp command { "execute": "guest-suspend-disk"} hangs instead of returning a success or failure value, so libvirt hangs there too.

From what I've read I can think of two possible causes: either Amit is right in comment #13 (i.e. qemu-ga is not recovering from a failed suspend) or the _kernel_ is not reporting the error appropriately to user space.

I'm investigating this right now, but I have a few comments/questions:

1. zhpeng, re comment 3: libvirtd crashing is certainly a libvirt bug, so please open a new bz for that issue.

2. Do you have the pm-utils package installed? If you don't, could you please try to reproduce with it installed?
If you do have it installed, could you please remove it and try to reproduce the issue? Please report all results.

3. Could you please try to send the guest-ping command after qemu-ga is supposedly hung?

The guest-suspend-disk command does _not_ return a success response. This is done to avoid possible races with clients, as the guest can suspend before qemu-ga is able to send a success response. However, it obviously should either successfully suspend (in which case qemu will exit) or return an error response.

(In reply to comment #15)
> 2. Do you have the pm-utils package installed? [...]

This is all done in the guest, btw.

The root cause of this problem is that pm-hibernate in RHEL 6.4 does not return a failure exit code when suspending fails. It does in Fedora, though, so only RHEL is affected. Here's a quick reproducer:

1. Start a qemu VM with 2 GB of RAM and RHEL 6.4 as the guest (comment 10 has a command-line example)

2. As soon as the guest has booted, switch to qemu's monitor and run:

(qemu) balloon 700

3. Then log into the system and check that hibernate fails:

# echo disk > /sys/power/state
bash: echo: write error: Cannot allocate memory

4. Then try it with pm-hibernate:

# pm-hibernate
# echo $?
0

On F16, pm-hibernate successfully detects the error and returns 128.

Some additional comments:

1. qemu-ga doesn't hang. It is actually acting as expected: pm-hibernate reports success, so qemu-ga assumes that suspending succeeded and doesn't emit a success response (see the last paragraph of comment 15 for more details)

2. libvirt and/or virsh are also buggy, as they should have a timeout to detect stale responses (will clone this bz for libvirt)

3. As a workaround, you could remove the pm-utils package (however, having pm-utils installed is *strongly* recommended for regular usage)

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate, in the next release of Red Hat Enterprise Linux.

Created attachment 649590 [details]
Backported fix
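Throughout this report the agent is driven interactively with nc, which blocks forever when no response arrives. As an illustrative sketch (not part of the fix; the socket path matches the -chardev option from the reproducer, and the helper name is hypothetical), a client could apply a read timeout so that silence is surfaced to the caller instead of hanging the whole tool:

```python
import json
import socket

# Assumed socket path, taken from the -chardev line in the reproducer above.
QGA_SOCKET = "/var/lib/libvirt/qemu/rhel6u3.agent"

def qga_command(cmd, timeout=10.0, path=QGA_SOCKET):
    """Send one qemu-ga command, waiting at most `timeout` seconds for a reply.

    guest-suspend-disk deliberately sends no success response, so a timeout
    here means "suspended, or failed silently" -- the caller must then
    disambiguate (e.g. with a follow-up guest-ping) rather than block forever.
    Returns the decoded JSON reply, or None on timeout.
    """
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        s.sendall((json.dumps(cmd) + "\n").encode())
        return json.loads(s.recv(4096).decode())
    except socket.timeout:
        return None
    finally:
        s.close()

# Usage mirroring the nc session from the reproduction steps:
#   qga_command({"execute": "guest-sync", "arguments": {"id": 1234}})
#   qga_command({"execute": "guest-suspend-disk"})
```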
I've tested the attached patch and it fixes the problem; now you'll get:

{ "execute": "guest-suspend-disk"}
{"error": {"class": "UndefinedError", "desc": "An undefined error has ocurred", "data": {}}}

The error message is quite bad, but that's a different story.

Did this get fixed in RHEL-6.4? If yes, then the bz shouldn't be in ASSIGNED state.

(In reply to comment #23)
> Did this get fixed in RHEL-6.4? If yes, then the bz shouldn't be in ASSIGNED state.

No, it wasn't; proposing for 6.5.0.

(In reply to comment #15)
> 1. zhpeng on comment 3: libvirtd crashing is a libvirt bug for sure. So please, open a new bz for the issue

Retested with libvirt-0.10.2-18.el6.x86_64; the issue still exists, so I'll file a new bz for it.

> 2. Do you have the pm-utils package installed? If you don't, could you please try to reproduce with it installed? If you do have it installed could you please remove it and try to reproduce the issue?

pm-utils was installed (it is required by libvirt-client). I removed it and tested again; the result did not change.

> 3. Could you please try to send the guest-ping command after qemu-ga is supposedly hung?

When virsh dompmsuspend is 'hung', the guest is running, operations can still be performed, and the network is fine.

I'm confused about this BZ's status. A fix for the problem has existed since last year but hasn't made it into a release yet. How so?

Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017. During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available. The official life cycle policy can be reviewed here: http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for Production Phase 3 and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL: https://access.redhat.com
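For completeness, the exit-status contract at the heart of this bug can be sketched as follows. This is an illustrative outline, not qemu-ga's actual implementation (function name and parameters are hypothetical): the caller trusts pm-hibernate's exit code when pm-utils is present, which the RHEL 6.4 pm-hibernate defeated by returning 0 on failure, and otherwise falls back to /sys/power/state, where a failed suspend surfaces as a write error (e.g. ENOMEM after aggressive ballooning, as in comment 11):

```python
import os
import subprocess

def suspend_to_disk(pm_hibernate="/usr/sbin/pm-hibernate",
                    state_file="/sys/power/state"):
    """Attempt suspend-to-disk, reporting failure instead of hiding it.

    If the pm-utils helper exists, run it and trust its exit code; a
    nonzero status means the suspend failed. Otherwise write 'disk' to
    the kernel's state file, where failure raises OSError directly.
    """
    if os.path.exists(pm_hibernate):
        rc = subprocess.call([pm_hibernate])
        if rc != 0:
            raise RuntimeError("pm-hibernate failed with exit code %d" % rc)
    else:
        try:
            with open(state_file, "w") as f:
                f.write("disk")
        except OSError as e:
            raise RuntimeError("suspend-to-disk failed: %s" % e)
```

With this contract in place, an agent such as qemu-ga can turn a failed suspend into the error response seen in the patched output above, rather than reporting nothing.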