Bug 647115
| Summary: | guest cannot resume from S4 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Chao Yang <chayang> |
| Component: | kvm | Assignee: | Zachary Amsden <zamsden> |
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | 5.6 | CC: | ehabkost, gcosta, juzhang, michen, mkenneth, shuang, tburke, virt-maint, xfu, xwei |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | general operation | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 716706 (view as bug list) | Environment: | |
| Last Closed: | 2011-06-26 15:01:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 580946, 580954, 716706 | | |
| Attachments: | | | |

Description
Chao Yang
2010-10-27 09:26:36 UTC
Retry without spice.

Hit the same issue with vnc.

(In reply to comment #2)
> Hit the same issue with vnc

There is no attachment to look at. Try using another NIC model.

Created attachment 456180 [details]
when resume from S4
(In reply to comment #4)
> Created attachment 456180 [details]
> when resume from S4

Are you doing resume from console or from X? If from X, try doing it from console. Also redirect kernel console output to ttyS0 and capture it on the host. Can you ssh into the guest after resume?

Created attachment 456193 [details]
kernel console output
1. Are you doing resume from console or from X?
Doing resume from X. Hit the issue again from console.
2. Redirect kernel console output to ttyS0 and capture it.
Please look at the attachment. Resume from console gives the same kernel console output as resume from X.
3. Can you ssh into the guest after resume?
Failed to ssh into the guest after resume.

(In reply to comment #6)
> Created attachment 456193 [details]
> kernel console output
>
> 1.Are you doing resume from console or from X?
> Doing resume from X.Hit again from console.
> 2.redirect kernel console output to ttyS0 and capture it
> Please look at attachment.Resume from console is the same kernel console

From the attachment I see you are using virtio net. This is not supposed to work. Use something else.

Comment on attachment 456193 [details]
kernel console output
I am sorry for that mistake.
I filed this bug using the e1000 NIC, then tried rtl8139, and still hit this issue.

Created attachment 456202 [details]
kernel console output

Try an older guest. S4 resume problems are in most cases guest bugs. In the case of a Linux guest, I don't remember it ever being a kvm bug.

(In reply to comment #10)
> Try older guest. S4 resume problems in most cases are guest bugs. In case of
> Linux guest I don't remember it ever was kvm bug.

1. CLI:
/usr/libexec/qemu-kvm -M rhel5.6.0 -m 2G -smp 2 -drive file=/root/chayang//testcasefor5u6.raw,if=ide,format=raw,boot=on,cache=none,werror=stop -net nic,vlan=0,macaddr=24:23:12:25:b1:5a,model=e1000 -net tap,vlan=0,script=/etc/qemu-ifup -boot c -monitor stdio -vnc :18

I tried an older guest kernel on a RHEL 5.6 host; resume from S4 succeeded.
Guest kernel: 2.6.18-164.el5
# dmesg|grep -i kvm
I did not see " time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer." in dmesg, but I did see "kvm_get_tsc_khz:cpu 0,msr 0:2401001".

2. CLI:
/usr/libexec/qemu-kvm -M rhel6.0.0 -m 2G -smp 2 -drive file=/root/testcasefor5u6.raw,if=ide,format=raw,boot=on,cache=none,werror=stop -net nic,vlan=0,macaddr=24:23:12:25:b1:5a,model=e1000 -net tap,vlan=0,script=/etc/qemu-ifup -boot c -monitor stdio -vnc :19

I also tried guest kernel 2.6.18-194.el5 on a RHEL 6 host; it can resume from S4. Running "dmesg|grep -i kvm" prints " time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.".
Booting guest kernel 2.6.18-164.el5 on a RHEL 6 host can resume from S4 too. Running "dmesg|grep -i kvm" prints "kvm_get_tsc_khz:cpu 0,msr 0:238c001".

NOTE: I did these two steps without adding no-kvmclock to the guest kernel parameters.

Glauber, can you look at the comment above please? Any ideas?

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

Created attachment 475595 [details]
messages when resume from s4
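
The manual "dmesg|grep -i kvm" check used in the comments above can be automated. The following is a minimal sketch, assuming the "time.c: Using ... MHz WALL ... GTOD ... timer." line has the format quoted in this bug; the function names are hypothetical and the script only reports what that line says.

```python
import re
from typing import Optional, Tuple

# Matches the RHEL 5 guest boot line quoted in this bug, for example:
#   time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
# Assumption: the two tokens after WALL and GTOD name the timer sources.
TIME_LINE = re.compile(r"time\.c: Using (\d+\.\d+) MHz WALL (\S+) GTOD (\S+) timer")


def parse_time_line(dmesg_text: str) -> Optional[Tuple[str, str, str]]:
    """Return (frequency_mhz, wall_token, gtod_token) from the first match, or None."""
    match = TIME_LINE.search(dmesg_text)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def reports_kvm_timer(dmesg_text: str) -> bool:
    """True if the time.c line is present and names KVM in either position."""
    parsed = parse_time_line(dmesg_text)
    if parsed is None:
        return False
    _, wall_token, gtod_token = parsed
    return "KVM" in (wall_token, gtod_token)
```

Feeding it the output of dmesg from inside the guest gives a quick yes/no on whether the guest printed the KVM timer line, which is the distinction the reporter drew between the 2.6.18-164.el5 and 2.6.18-194.el5 guests.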

Please try booting your guest with clock=pmtmr. We need to rule out a clock issue here.

(In reply to comment #20)
> Please try booting your guest with clock=pmtmr. We need to rule out a clock
> issue here.

Glauber, I have tested 14 times with clock=pmtmr, 7 times with a desktop guest and 7 times with a guest without a desktop. The issue disappears.

Host kernel: 2.6.18-238.el5
Guest kernel: 2.6.18-238.el5
kvm version:
# rpm -qa|grep kvm
kvm-tools-83-227.el5
kvm-83-227.el5
kmod-kvm-83-224.el5
kvm-debuginfo-83-227.el5
etherboot-zroms-kvm-5.4.4-13.el5
kvm-qemu-img-83-227.el5

Works with pmtmr. Looks like a kvmclock problem.

(In reply to comment #11)
> (In reply to comment #10)
> > Try older guest. S4 resume problems in most cases are guest bugs. In case of
> > Linux guest I don't remember it ever was kvm bug.
>
> 1.
> CLI:/usr/libexec/qemu-kvm -M rhel5.6.0 -m 2G -smp 2 -drive
> file=/root/chayang//testcasefor5u6.raw,if=ide,format=raw,boot=on,cache=none,werror=stop
> -net nic,vlan=0,macaddr=24:23:12:25:b1:5a,model=e1000 -net
> tap,vlan=0,script=/etc/qemu-ifup -boot c -monitor stdio -vnc :18
>
> I tried with older guest kernel on rhel5.6 host,resume from S4 successfully.
> guest kernel:2.6.18-164.el5
> #dmesg|grep -i kvm
> I did not see " time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer." in dmesg
> but "kvm_get_tsc_khz:cpu 0,msr 0:2401001"

So it looks like we have a bug in RHEL 5.6 KVM clock, which may or may not be fixable, and which has been corrected in RHEL 6. The bug happens with kernels which use KVM clock. We are either missing something from the RHEL 5.6 host or the 5.6 guest which is fixed in later kernels.

There is a possibility that we can't fix the RHEL 5.6 host kernel at all to work around whatever is causing this bug; the code change in RHEL 6 with improved timer infrastructure may not be possible to backport, and at this point is probably too complex to manage in time for a RHEL 5 kernel release.

If we can't find out what is causing this soon, my recommendation is going to be disabling KVM clock in RHEL 5.6 kernels.

One important point of data - is the guest kernel 32-bit or 64-bit?

Second important point - does the bug reproduce with -smp 1 ?

(In reply to comment #23)
>
> One important point of data - is the guest kernel 32-bit or 64-bit?

So far, I haven't reproduced this bug on 32-bit (tried on RHEL5.6-32 and RHEL5.7-32 guests); it seems to happen only on 64-bit.
RHEL-Server-5.7-32.qcow2
RHEL-Server-5.6-32.qcow2

> Second important point - does the bug reproduce with -smp 1 ?

Yes, it can be reproduced with -smp 1 on an x86_64 guest; I will attach the log.

CLI:
/usr/libexec/qemu-kvm -M rhel5.6.0 -no-hpet -rtc-td-hack -startdate now -name rhel5.7 -smp 1 -m 2048 -cpu qemu64,+sse2 -uuid `uuidgen` -boot c -net nic,vlan=1,macaddr=F0:4D:A2:24:ad:89,model=e1000 -net tap,vlan=1,script=/etc/qemu-ifup -drive file=/root/virtual-NIC/rhel5.7-64.qcow2,media=disk,if=ide,cache=none,boot=on,format=qcow2 -vnc :1 -notify all -balloon none -monitor stdio -serial unix:/tmp/test.sock,server,nowait

Created attachment 498765 [details]
s4 fails with -smp 1
Created attachment 498766 [details]
launch guest with one cpu,
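
The command line in the comment above exposes the guest serial port as a listening UNIX socket (-serial unix:/tmp/test.sock,server,nowait). A minimal sketch of capturing that console into a log file from the host follows. It assumes the guest kernel console has already been redirected to ttyS0 as requested earlier; the function names and the log file name are hypothetical.

```python
import socket
from typing import Iterator


def iter_console_chunks(conn: socket.socket, chunk_size: int = 4096) -> Iterator[bytes]:
    """Yield raw chunks from the console socket until the peer closes it."""
    while True:
        chunk = conn.recv(chunk_size)
        if not chunk:  # empty read means the other side closed the connection
            return
        yield chunk


def capture_console(socket_path: str, log_path: str) -> int:
    """Connect to the serial socket and append everything read to log_path.

    Returns the number of bytes written.
    """
    written = 0
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as conn:
        conn.connect(socket_path)
        with open(log_path, "ab") as log:
            for chunk in iter_console_chunks(conn):
                log.write(chunk)
                log.flush()  # keep the log current in case the guest hangs
                written += len(chunk)
    return written
```

Calling capture_console("/tmp/test.sock", "guest-console.log") before triggering S4 in the guest would keep the resume messages even when the guest hangs and ssh is unavailable.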

(In reply to comment #25)
> Created attachment 498765 [details]
> s4 fails with -smp 1

Ignore Comment #25; attachment 498765 [details] was generated by a guest with two CPUs.

(In reply to comment #24)
> (In reply to comment #23)
> >
> > One important point of data - is the guest kernel 32-bit or 64-bit?
> So far, I haven't reproduced this bug on 32-bit(tried on RHEL5.6-32 and
> RHEL5.7-32 guest), seems only happens on 64-bit.
> RHEL-Server-5.7-32.qcow2 RHEL-Server-5.6-32.qcow2

Sounds to me like we are missing a patch for 64-bit RHEL 5 which has already been applied on 32-bit. That is quite easy to do, as the 32-bit and 64-bit kernels here are separate and I believe there were a bunch of kvmclock patches backported from upstream.

*** Bug 701606 has been marked as a duplicate of this bug. ***

After investigating further, and discovering I was looking at the wrong tree, I found that this bug should already be fixed.
>-Guest
>#uname -r
>2.6.18-194.el5
This guest is too old to use kvmclock effectively on an SMP guest; it is missing the atomic backwards protection which was added later.
Can you retry with an updated el5 kernel? I looked and the proper fixes are in the following:
2.6.18-238.9.1.el5
I'm still a bit confused about where the 5.6 kernels on the install media come from, but if they are recent enough to be updated to -194, they should be recent enough to be updatable to -238 as well.
It really looks like this should be fixed; please verify with an updated guest kernel.

(In reply to comment #31)
> It really looks like this should be fixed, please verify with an updated guest
> kernel.

I tested twice. The first S4 worked fine, but the second time it got stuck at:

Trying to resume from /dev/VolGroup00/LogVol01
Resuming from /dev/VolGroup00/LogVol01.
Attempting manual resume
Disabling non-boot CPUs ...
CPU 1 is now offline
SMP alternatives: switching to UP code
CPU1 is down
Stopping tasks: ======|
Shrinking memory... done (0 pages freed)
Loading image data pages (68039 pages) ... done
Read 272156 kbytes in 5.24 seconds (51.93 MB/s)

And cpu usage is:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11565 root      15   0 2262m 351m 3436 S 99.9  4.4   4:59.02 qemu-kvm

host:
# uname -a
Linux localhost.localdomain 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

guest:
# uname -a
uname -a
Linux localhost.localdomain 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
# dmesg|grep -i time.c
dmesg|grep -i time.c
time.c: Using 1.193182 MHz WALL KVM GTOD KVM timer.
time.c: Detected 2666.754 MHz processor.
Real Time Clock Driver v1.12ac

CLI:
/usr/libexec/qemu-kvm -M rhel5.6.0 -no-hpet -rtc-td-hack -startdate now -name rhel5.6 -smp 2 -m 2048 -cpu qemu64,+sse2 -uuid `uuidgen` -boot c -net nic,vlan=1,macaddr=64:31:50:43:49:45,model=e1000 -net tap,vlan=1,script=/etc/qemu-ifup -drive file=RHEL-Server-5.6-64.qcow2,media=disk,if=ide,cache=none,boot=on,format=qcow2 -vnc :1 -notify all -balloon none -monitor stdio -serial unix:/tmp/chayang.unix,server,nowait

And the network is not reachable:
# ping 10.66.9.185
PING 10.66.9.185 (10.66.9.185) 56(84) bytes of data.
From 10.66.11.212 icmp_seq=2 Destination Host Unreachable
From 10.66.11.212 icmp_seq=3 Destination Host Unreachable
From 10.66.11.212 icmp_seq=4 Destination Host Unreachable
--- 10.66.9.185 ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3999ms, pipe 3

It's not clear what the failure rate was before, but it appeared to be 100%; now it is not, and a hang during S4 resume could be caused by any number of kernel changes. Can we implicate or rule out the clocksource again with the new kernel by testing with clock=pmtmr?

(In reply to comment #34)
> It's not clear what the failure rate was before, but it appeared to be 100%,
> now it is not, and a hang during S4 resume could be caused by any number of
> kernel changes. Can we implicate or rule out the clocksource again with the
> new kernel by testing with clock=pmtmr

I tested 30 times with kernel 2.6.18-238.12.1.el5 (x86_64).

Clock      Times   Result
pmtmr      15      ALL PASS
kvmclock   15      6 FAIL, 9 PASS

It's not clear how to proceed on this bug at this point in time.

Yes, it's a real bug, and yes, it is fixed by moving to pmtmr. It should be fixed when running on a RHEL 6.1 hypervisor, but earlier RHEV releases may cause problems.

There are two problems that conspire to cause the bug. The first was the lack of backwards protection in the guest, which has already been fixed by updating the kernel. The second problem is that the hypervisor is missing S4 suspend kvm clock compensation, which was added in RHEL 6.1 and was not present before.

It's not going to be possible to easily backport that code into RHEL5, certainly not in time for this release. The S4 suspend compensation was an incremental improvement built on major infrastructure work on both kvm clock and guest timekeeping in general, and also depends on other pieces of infrastructure (high-res clocksource changes) which are extremely risky and complex to backport.

The way I see it, we have essentially 4 choices:

1) document the bug and known workaround - as most people running virtual machines are not going to be putting their systems into S4 sleep anyway, this may be an acceptable solution.

2) add patches to either disable or default kvmclock to off when running under a RHEL5 hypervisor. Not sure if we publish a recognizable version field, so this may have the undesirable side effect of turning off KVM clock even when it is fully usable.

3) a complex and tedious backport of the whole kvmclock and clocksource improvements. This is by far the riskiest option.

4) a selective backport of just the S4 clock compensation into RHEL5; it may be possible, but the code involved is subtle and hasn't been tested in that order of application. Seeing as the S4 compensation isn't even upstream yet, this is also a risky choice for a RHEL release, especially a RHEL5 update.

Requesting additional opinions about which path to proceed down, but my vote is #1.

I think #1 is better as well.

*** Bug 707839 has been marked as a duplicate of this bug. ***

Actually, delving further into this... I was under the mistaken impression that the guest failed when the host was put into S4 suspend. That still won't be possible on a RHEL5 hypervisor, but it is possible on a RHEL6 hypervisor - however, this is a completely separate bug, and one that is probably a WONTFIX for RHEL5 and a feature improvement for RHEL6.

The solutions I proposed in Comment 36 were based on this misunderstanding.

However, what's going on in this bug is actually a GUEST S4 suspend / resume. The fact that a kvmclock guest can't come back from this is certainly a guest bug, not a hypervisor issue, so it would require a RHEL5 guest kernel patch.

I don't believe S4 suspend was an original design parameter for KVM clock, and there are a number of things that could go wrong along the resume path.

It's not clear whether the bug is low probability, or possibly fixed on 32-bit and not 64-bit, but there are some other factors at work here causing it to work in some configurations and not in others.

Our recommendation is almost certainly going to be - don't do S4 suspend if you use KVM clock. It's entirely unnecessary, as you can do loadvm / savevm, which provides nearly the same facility.

So our actual choices then are going to be:

1) disable S4 suspend when KVM clock is in use

2) disable kvmclock

3) diagnose and fix the problem (which is still an issue in 6.1 - see BZ 694801, comment 4 - and thus likely also upstream), get the fix upstream, then backport the fix to all of the 5.6 and later releases.

4) document the problem and recommend not using kvmclock and S4 suspend in combination when running in a VM in the 5.6 release notes

Obviously #3 is the best choice, but as far as 5.6 goes, that ship has sailed, and there isn't sufficient time to do anything at all about it. For now, if we do anything at all to change the guest kernel (#1, #2, or #3), it is going to take a while to get into the next update.

Given that state of affairs and the relative obscurity of the issue, I would propose not rushing this, taking approaches #3 and #4 in parallel, fixing it properly upstream, documenting it as a known issue, and backporting changes only if they can be shown to be either low risk or highly demanded by users of 5.x kernels.

Glauber, do you remember any bugs with S4 resume of a kvmclock guest, or any patches that might have missed one of the 64-bit kernel paths and been fixed on 32-bit? I'll go back and look over the code again now with a proper understanding of the bug.

I think the root of the problem here is that kvmclock doesn't support a clocksource resume method...

I have cloned this against rhel6 and I will close it for rhel5.
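
For reference, the 30-run comparison reported earlier in this thread (pmtmr: 15 runs, all pass; kvmclock: 15 runs, 6 fail) can be summarised with a short calculation. This is a rough sanity check that was not part of the original analysis: it computes the observed failure rates and a one-sided Fisher exact probability, a technique chosen here for illustration, and the function names are hypothetical.

```python
from math import comb


def failure_rate(failures: int, runs: int) -> float:
    """Fraction of runs that failed."""
    return failures / runs


def fisher_one_sided(fail_a: int, runs_a: int, fail_b: int, runs_b: int) -> float:
    """Probability that group A gets at least fail_a of the observed failures
    if both groups actually share the same failure probability
    (hypergeometric tail with all margins fixed)."""
    total_fail = fail_a + fail_b
    total_runs = runs_a + runs_b
    upper = min(total_fail, runs_a)
    favourable = sum(
        comb(runs_a, k) * comb(runs_b, total_fail - k)
        for k in range(fail_a, upper + 1)
    )
    return favourable / comb(total_runs, total_fail)


# Counts as reported in the thread for guest kernel 2.6.18-238.12.1.el5.
kvmclock_rate = failure_rate(6, 15)
pmtmr_rate = failure_rate(0, 15)
p_value = fisher_one_sided(6, 15, 0, 15)
```

With these counts the kvmclock failure rate is 40%, and the chance of all six failures landing in the kvmclock group by coincidence is under 1%, which is consistent with the conclusion in the thread that the clock source matters.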