Bug 714271
Summary: | libvirt pinned to single CPU after suspend/resume cycle -> all VMs running on the same single core | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Ronald Wahl <rwahl> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 17 | CC: | adel.gadllah, ajia, aquini, belegdol, berrange, cfergeau, clalance, crobinso, dallan, djasa, drago01, fdanapfe, gansalmon, hous3y, itamar, j2, jadams1217, jforbes, jistone, john.newman.0, jonathan, kchamart, kernel-maint, knoel, laine, madhu.chinakonda, marcandre.lureau, mprivozn, mschmidt, pasteur, rs, srivatsa.bhat, unicell, veillard, virt-maint |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | cpuset suspend virt | ||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2012-08-05 21:24:37 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 813228 | ||
Bug Blocks: |
Description
Ronald Wahl
2011-06-17 18:26:09 UTC
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Biting me too, making virt unusable on a laptop.

Yeah, for now there's at least a semi-ok workaround.

Set libvirt to save the VMs to disk when the service is stopped.
Stop the service in your suspend script
On wakeup restart the service

I don't believe any of this is actually necessary, but I have it commented out in the resume block:

#cpus="0-7"
#for i in `find /sys/fs/cgroup/cpuset/libvirt/ -iname "cpuset.cpus"` ; do echo "${cpus}" | sudo tee $i ; cat $i ; done
#for i in `pgrep libvirt` ; do taskset -c -p $i ; done
#for i in `pgrep qemu` ; do taskset -c -p $i ; done

The only drawback is suspend/resume takes a lot longer, and if you have a VM running that is not marked to auto start, you have to remember to start it on wakeup; or add the bits to the script to detect and then start it for you.

Unfortunately it looks like I'm running into another bug where all of the restored VMs will _rarely_ become totally unresponsive. I haven't had the time to look into this yet, but I've also seen it more frequently when using the vhost_net module..

Also, I vaguely recall a discussion about adding the power management API to the VM, so before suspending the host, you'd suspend all of the guests to ram, and wake them up.
I would think just pausing the emulation would effectively do the same thing, but if I recall, until that is added, suspending the host is not officially supported.

Looks like a fix was committed for Linux 3.3.

Avi, do you have a pointer to the fix? Also, do you think there is anything that libvirt can do to mitigate this?

(In reply to comment #10)
> Avi, do you have a pointer to the fix?

linux.git 8f2f748b0656257153bc

> Also, do you think there is anything
> that libvirt can do to mitigate this?

If there is an API that allows you to receive notifications on resume events, you can use that to reconfigure cpusets. But it's a kernel problem, so we should just backport the fix IMO.

Agreed, that seems like a lot of work for a workaround. I've moved the BZ to kernel.

That commit was CC'd to stable. It should hopefully get picked up in 3.2.10.

(In reply to comment #13)
> That commit was CC'd to stable. It should hopefully get picked up in 3.2.10.

Actually, looking further it seems that Linus is going to revert that patch from 3.3.

http://thread.gmane.org/gmane.linux.kernel/1262802

*** Bug 749191 has been marked as a duplicate of this bug. ***

There has been a workaround proposed:
https://www.redhat.com/archives/libvir-list/2012-April/msg00777.html

Recent findings by Srivatsa S. Bhat:
http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289

Hi,

Recently, I posted a new set of patches to fix this issue in the kernel.
http://thread.gmane.org/gmane.linux.documentation/4805

It is still under discussion.

Regards,
Srivatsa S. Bhat

*** Bug 820625 has been marked as a duplicate of this bug. ***

*** Bug 787467 has been marked as a duplicate of this bug. ***

Since F15 is approaching end of life, moving this to F17 where it is still an issue. F16 is also affected, so any fix should be pushed there as well.

(In reply to comment #18)
> Hi,
>
> Recently, I posted a new set of patches to fix this issue in the kernel.
> http://thread.gmane.org/gmane.linux.documentation/4805
>
> It is still under discussion.

Seems like those patches got NAKed and we are still left with the "slow after suspend" issue. How to move forward with that bug?

Hi,

v6 of the patchset was posted here:
http://thread.gmane.org/gmane.linux.kernel/1302893

And Peter Zijlstra (the maintainer) said that he has queued them up to push them to mainline later.
http://thread.gmane.org/gmane.linux.kernel/1302893/focus=1303390

It hasn't hit mainline yet though.

Regards,
Srivatsa S. Bhat

(In reply to comment #23)
> Hi,
>
> v6 of the patchset was posted here:
> http://thread.gmane.org/gmane.linux.kernel/1302893
>
> And Peter Zijlstra (the maintainer) said that he has queued them up
> to push them to mainline later.
> http://thread.gmane.org/gmane.linux.kernel/1302893/focus=1303390
>
> It hasn't hit mainline yet though.

Ah nice. Josh can we backport them and ship them in F17/16 or are they considered too invasive?

Oh by the way, I forgot to mention this:

The real "bug-fix" is just patch 1 only. All the remaining patches (2, 3, and 4) are only cleanups/optimizations. The cover-letter (0/4) explains this structuring.

So I guess backporting becomes easy since the fix is contained in just one patch.

Regards,
Srivatsa S. Bhat

(In reply to comment #25)
> Oh by the way, I forgot to mention this:
>
> The real "bug-fix" is just patch 1 only. All the remaining patches (2, 3,
> and 4)
> are only cleanups/optimizations. The cover-letter (0/4) explains this
> structuring.
>
> So I guess backporting becomes easy since the fix is contained in just one
> patch.

It's CC'd to stable. It'll get into 3.4.2 or whatever as soon as it's in Linus' tree and we'll pick it up automatically.

*** Bug 833655 has been marked as a duplicate of this bug. ***

OK, so the tip bot commit which is here:

http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=commitdiff;h=0c1508129adc051fabaf8debefea79baa2f1a81b

doesn't have stable CC'd.
Is that an oversight on the upstream tip maintainer's part?

Srivatsa, does this bug really just need that single commit or are all 4 patches needed?

IIRC stable CC was stripped because this is not a regression fix, so it is not allowed to add this patch to stable.

*** Bug 842406 has been marked as a duplicate of this bug. ***

(In reply to comment #28)
> OK, so the tip bot commit which is here:
>
> http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=commitdiff;h=0c1508129adc051fabaf8debefea79baa2f1a81b
>
> doesn't have stable CC'd. Is that an oversight on the upstream tip
> maintainer's part?

Well, I had added the CC to stable while submitting the patchset. But Ingo Molnar stripped it saying that it is not a regression fix. This is what he said:
http://thread.gmane.org/gmane.linux.kernel/1302893/focus=1316019

> Srivatsa, does this bug really just need that single commit or are all 4
> patches needed?

Just that single commit. The remaining 3 patches are cleanups and optimizations. Only the first patch is the bug-fix.

Regards,
Srivatsa S. Bhat

I've added a backport of the single commit needed to the following scratch build. I'd appreciate it if someone could test this kernel once it finishes building and let us know if it works as expected. If so, we'll roll this patch into Fedora.

http://koji.fedoraproject.org/koji/taskinfo?taskID=4325884

Josh - got a F16 version?

(In reply to comment #33)
> Josh - got a F16 version?

http://koji.fedoraproject.org/koji/taskinfo?taskID=4326270 when it finishes building. Will be a bit yet.

(In reply to comment #34)
> (In reply to comment #33)
> > Josh - got a F16 version?
>
> http://koji.fedoraproject.org/koji/taskinfo?taskID=4326270

This works for me.. F16 x86_64, starting a vm after suspend/resume will pin/use all cpus.

(In reply to comment #35)
> (In reply to comment #34)
> > (In reply to comment #33)
> > > Josh - got a F16 version?
> >
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=4326270
>
> This works for me..
> F16 x86_64, starting a vm after suspend/resume will
> pin/use all cpus.

OK, great. Thanks for testing.

(In reply to comment #35)
> (In reply to comment #34)
> > (In reply to comment #33)
> > > Josh - got a F16 version?
> >
> > http://koji.fedoraproject.org/koji/taskinfo?taskID=4326270
>
> This works for me.. F16 x86_64, starting a vm after suspend/resume will
> pin/use all cpus.

Thanks for testing! Good to know it works for you :-)

Regards,
Srivatsa S. Bhat

(In reply to comment #32)
> I've added a backport of the single commit needed to the following scratch
> build. I'd appreciate it if someone could test this kernel once it finishes
> building and let us know if it works as expected. If so, we'll roll this
> patch into Fedora.
>
> http://koji.fedoraproject.org/koji/taskinfo?taskID=4325884

I just tried F17 x86_64, and it also looks good. The cpusets for libvirt and qemu are now maintained after a suspend/resume. Thanks!

OK, fixed in Fedora git. It will be included in the next update. Bodhi will leave a comment in the bug when it is available.

F17 x86_64 version seems to work for me too.

kernel-3.4.7-1.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/kernel-3.4.7-1.fc16

Package kernel-3.4.7-1.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.4.7-1.fc16'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-11348/kernel-3.4.7-1.fc16
then log in and leave karma (feedback).

kernel-3.4.7-1.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
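
For anyone still on a kernel without the fix, the manual workaround from earlier in this thread (rewriting cpuset.cpus under the libvirt cgroup after resume) can be sketched roughly as below. This is only an illustration and was not posted by anyone in this bug: the function names count_cpus and restore_cpusets are made up, the cgroup path is the one quoted in the workaround comment, and it assumes /sys/devices/system/cpu/online is readable and that the script runs as root. It only touches cpusets that hold fewer CPUs than are currently online.

```shell
#!/bin/sh
# Hypothetical sketch; names and defaults are assumptions, not part of libvirt or the kernel fix.
CPUSET_ROOT="${CPUSET_ROOT:-/sys/fs/cgroup/cpuset/libvirt}"
ONLINE_FILE="${ONLINE_FILE:-/sys/devices/system/cpu/online}"

# Count the CPUs named by a kernel cpulist string such as "0-3,5".
count_cpus() {
    total=0
    old_ifs=$IFS
    IFS=,
    for range in $1; do
        case $range in
            '')  ;;                               # skip empty fields
            *-*) lo=${range%-*}; hi=${range#*-}
                 total=$((total + hi - lo + 1)) ;; # inclusive range
            *)   total=$((total + 1)) ;;          # single CPU
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Widen every cpuset.cpus below CPUSET_ROOT that has fewer CPUs than are online.
restore_cpusets() {
    online=$(cat "$ONLINE_FILE")
    want=$(count_cpus "$online")
    # find lists a directory before anything inside it, so parent cpusets
    # are widened before their children.
    find "$CPUSET_ROOT" -type d | while read -r d; do
        [ -f "$d/cpuset.cpus" ] || continue
        have=$(count_cpus "$(cat "$d/cpuset.cpus")")
        if [ "$have" -lt "$want" ]; then
            echo "$online" > "$d/cpuset.cpus"
        fi
    done
}
```

Calling restore_cpusets from whatever resume hook you already use should be enough to try it; like the commented-out commands above, it is a stopgap, and the kernel update is the real fix.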