Red Hat Bugzilla – Bug 584310
non-smp guests become unresponsive and use 100% cpu with clock source kvm-clock
Last modified: 2010-11-09 08:14:55 EST
This is actually occurring under CentOS 5.4 x86_64 (fully updated) with CentOS 5.4 32-bit guests. Reporting here since you are upstream.
I have five guests running on this machine. At least once a day one or more (normally more) guests will hit 100% cpu and become unresponsive.
Setting kernel.panic=10 in the guest does not reboot the guest. The only solution is to destroy the guest and start it again.
# rpm -qa|grep kvm
The guests were using the virtio_blk disk device vda but I have switched them back to hda. The network cards use virtio. The disks are qcow2. All other settings are standard.
Created attachment 408066 [details]
strace -s 250 -f -p MADGUESTPID
Maybe this strace will help.
Created attachment 408069 [details]
kvmtrace -o blah -w 5 (blah.kvmtrace.0)
Created attachment 408070 [details]
kvmtrace -o blah -w 5 (blah.kvmtrace.1)
Created attachment 408071 [details]
kvmtrace -o blah -w 5 (blah.kvmtrace.2)
Created attachment 408072 [details]
kvmtrace -o blah -w 5 (blah.kvmtrace.3)
Trace files from kvmtrace and strace attached.
Please could you mark these files as sensitive/confidential.
Setting severity to high since this is a hard crash.
Using hangcheck has no effect at all. It never fires.
Using kernel.panic=10 doesn't help either.
Nothing in the host's logs. No console messages in the guest.
To get this into a "supported" state, so that you can worry about it (!), I converted the three least busy machines to
i) not use virtio_blk
ii) not use the virtio network device
iii) not use the qcow2 disk format
One of these newly supported machines just did it again: 100% cpu and not responding.
What can I try please?
(In reply to comment #1)
> Created an attachment (id=408066) [details]
> strace -s 250 -f -p MADGUESTPID
Could be a time drift issue. Do you see clock skews in the guests? Are you using kvmclock in the guest?
I have noticed a few seconds difference. I am using kvm-clock:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
The guests with problems are almost idle.
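For anyone else checking which clocksource a guest is using, here is a minimal sketch of the check above. The function name `show_clocksource` and its optional directory argument are made up for illustration; the two sysfs files are the standard ones, but `available_clocksource` is treated as optional in case a kernel does not expose it.

```shell
#!/bin/sh
# show_clocksource [DIR]
# Prints the active clocksource and, if exposed, the list of available ones.
# DIR is a hypothetical override (handy for testing); it defaults to the
# standard sysfs location.
show_clocksource() {
    dir="${1:-/sys/devices/system/clocksource/clocksource0}"
    if [ ! -r "$dir/current_clocksource" ]; then
        echo "clocksource info not readable under $dir" >&2
        return 1
    fi
    echo "current: $(cat "$dir/current_clocksource")"
    # Treat the list of alternatives as optional.
    if [ -r "$dir/available_clocksource" ]; then
        echo "available: $(cat "$dir/available_clocksource")"
    fi
}
```

Running `show_clocksource` with no argument inside the guest reads the real sysfs files.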
Time difference is about 1 second: the guests are behind by a second.
Do you recommend installing ntpd in the guests?
I installed ntpd in the guests along with step tickers. I just had the guest crash again. Same strace. ARGH! :(
Three crashes so far today. If I can put kvm into debug mode and you want me to post that, please let me know.
I switched from clocksource=kvm-clock to clocksource=acpi_pm, keeping ntpd and it seems a lot more stable. No crashes since the change.
I will watch it for another 24 hours with ntpd now off to see if it crashes again and if the time drifts.
Perhaps important is that the machines which are loaded never or almost never crash.
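In case it helps others try the same workaround, this is a rough sketch of switching the clocksource at runtime through sysfs. The helper name `switch_clocksource` and its optional directory argument are hypothetical. It assumes the guest kernel lists the target in `available_clocksource`, and on a real system the write needs root.

```shell
#!/bin/sh
# switch_clocksource TARGET [DIR]
# Selects TARGET as the active clocksource if the kernel lists it as available.
# DIR is a hypothetical override for testing; it defaults to the sysfs location.
switch_clocksource() {
    target="$1"
    dir="${2:-/sys/devices/system/clocksource/clocksource0}"
    # Refuse to switch to a source the kernel does not offer.
    if ! tr ' ' '\n' < "$dir/available_clocksource" | grep -qx "$target"; then
        echo "$target is not listed in $dir/available_clocksource" >&2
        return 1
    fi
    # Writing the name selects it immediately; the change is lost on reboot.
    echo "$target" > "$dir/current_clocksource"
}
```

Example: `switch_clocksource acpi_pm`. A runtime switch like this does not survive a reboot; the `clocksource=acpi_pm` form mentioned above reads like the kernel boot parameter, which is the persistent way to make the same choice.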
24 hours is up: not one single extra crash.
I've changed all guests away from using kvm-clock. Is this a known problem with 5.4?
Still no more crashes!
Interesting is how the cpu flags on the host and guest compare:
HOST: tsc constant_tsc nonstop_tsc
i.e. no constant_tsc in the guest, which means it might be linked to bug 475598, but it's not clear to me from that bug whether constant_tsc is missing in the guests or on the host.
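For anyone who wants to repeat the flag comparison, a small sketch that pulls only the TSC-related flags out of /proc/cpuinfo. The function name `tsc_flags` and its optional file argument are made up for illustration; the three flag names are the ones listed for the host above.

```shell
#!/bin/sh
# tsc_flags [FILE]
# Prints, one per line, whichever of tsc / constant_tsc / nonstop_tsc appear
# in the first "flags" line of FILE (default: /proc/cpuinfo).
tsc_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first flags line, split it into one token per line,
    # and keep only exact matches for the three TSC-related flags.
    sed -n '/^flags/{p;q;}' "$cpuinfo" \
        | tr ' \t' '\n\n' \
        | grep -E '^(tsc|constant_tsc|nonstop_tsc)$'
}
```

Run it once on the host and once in the guest and compare the two outputs by eye or with `diff`.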
I hit this problem on a rhel5u5 (32-bit server) guest. I used rhevm (sm70) to create these guests (with the virtio disk driver and virtio network driver) on rhev-hypervisor-5.5-2.2.1.
These rhel5u5 (32-bit server) guests hang after running for some time.
Ok, in fact there is a known kvmclock problem with rhel5.5.
The known bug, so far, is only known to bite SMP. Are you doing SMP in the guest?
For the record, this is the bug:
RPMs for it are likely to arrive soon.
(In reply to comment #18)
> Ok, in fact there is a known kvmclock problem with rhel5.5.
> The known bug, so far, is only known to bite SMP. Are you doing SMP in the guest?
> For the record, this is the bug:
> RPMs for it are likely to arrive soon.
I'm having the same issue (5.5 32-bit guests) and yes, they all have multiple CPUs assigned.
Please do test a single-cpu guest to make sure it goes away.
I was seeing this on single cpu guests.
Note added to all kvmclock bugs:
Please retest with kernel-2.6.18-202.el5 (RHEL5) or kernel-2.6.32-33.el6 (RHEL6) in your guest kernel. In case it works, please close as a DUP of bugs 570824 (RHEL5) or 569603 (RHEL6)
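For the RHEL5 case, a quick sketch for checking whether a guest is already running build 202 or later of the 2.6.18 kernel. Both function names are hypothetical, and the parsing assumes the release string has the usual `2.6.18-<build>.el5` shape; it says nothing about RHEL6 kernels.

```shell
#!/bin/sh
# kernel_build RELEASE
# Prints the build number that follows the first "-" in a kernel release
# string, e.g. "2.6.18-203.el5" gives 203.
kernel_build() {
    echo "$1" | sed -n 's/^[0-9.]*-\([0-9][0-9]*\).*/\1/p'
}

# at_least_build_202 [RELEASE]
# Returns 0 if RELEASE (default: output of uname -r) is a 2.6.18 kernel at
# build 202 or later, 1 if it is an older 2.6.18 build, and 2 if it is not a
# 2.6.18 kernel at all, in which case this check does not apply.
at_least_build_202() {
    release="${1:-$(uname -r)}"
    case "$release" in
        2.6.18-*) ;;
        *) return 2 ;;
    esac
    build=$(kernel_build "$release")
    [ -n "$build" ] && [ "$build" -ge 202 ]
}
```

Called with no argument inside the guest, `at_least_build_202 && echo "new enough to retest"` checks the running kernel.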
Did any of you retest this?
Kernel 202 is not available yet. I can't find a place to get the kernel-2.6.18-202.el5 you are referring to. The closest thing I found by looking through the referenced bug IDs was version 203 at http://people.redhat.com/jwilson/el5 (the URL comes from https://bugzilla.redhat.com/show_bug.cgi?id=570824).
Where can I get it?
203 will do.
This sounds exactly like the bug I was hitting running RHEL6 kvmclock guests.
The problem only happens for SMP guests, as described here, and resulted in hangs. Switching away from kvmclock or switching to UP guest fixed the problem.
Glauber's patches to the guest kernel to make the kvmclock not go backwards fixed the problem. I'm highly suspicious this is a dup.
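A crude way to look for a clock that goes backwards from inside a guest, as a sketch only. The function name `count_backward_steps` is made up. It counts how often a timestamp in a stream is smaller than the previous one. Feeding it wall-clock readings is an assumption of this example: an ntpd step can also move the wall clock backwards, so a non-zero count is a hint rather than proof of this bug.

```shell
#!/bin/sh
# count_backward_steps
# Reads one integer timestamp per line on stdin and prints how many readings
# were smaller than the reading immediately before them.
count_backward_steps() {
    prev=""
    steps=0
    while read -r t; do
        if [ -n "$prev" ] && [ "$t" -lt "$prev" ]; then
            steps=$((steps + 1))
        fi
        prev="$t"
    done
    echo "$steps"
}

# Example use in a guest (assumes GNU date, which supports %N for nanoseconds):
#   i=0
#   while [ "$i" -lt 1000 ]; do date +%s%N; i=$((i + 1)); done \
#       | count_backward_steps
```

Each `date` call forks a process, so this only samples the clock coarsely; very short backward jumps could be missed.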
> as described here
Please could everyone stop stealing my bug.
This bug is for single cpu guests.
Ok, let's start from the supposition this is not the same bug.
Would be good to give the new kernel a test anyway, so we can start from a fresher base. Many kvmclock patches went in, so maybe your problem is fixed by them.
If it is not, please report back, together with the contents of /proc/cpuinfo on the host.
Created attachment 426358 [details]
/proc/cpuinfo on host
kernel-2.6.18-203.el5.i686.rpm installed in guest and rebooted.
So far, no 100% craziness.
Still working well. It's never lasted anywhere near this long before. Timekeeping is also fine.
Do you want me to do anything else?
If it is working for you, I'll close it as a dup.
*** This bug has been marked as a duplicate of bug 570824 ***