584310 – non-smp guests become unresponsive and use 100% cpu with clock source kvm-clock

Summary: non-smp guests become unresponsive and use 100% cpu with clock source kvm-clock

Keywords:
Status:	CLOSED DUPLICATE of bug 570824
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kvm
Sub Component:
Version:	5.4
Hardware:	All
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Glauber Costa
QA Contact:	Virtualization Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	Rhel5KvmTier1

Reported:	2010-04-21 10:52 UTC by Need Real Name
Modified:	2010-11-09 13:14 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-06-28 13:07:43 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
strace -s 250 -f -p MADGUESTPID (19.40 KB, text/plain) 2010-04-21 13:34 UTC, Need Real Name
kvmtrace -o blah -w 5 (blah.kvmtrace.0) (3.33 MB, application/octet-stream) 2010-04-21 13:40 UTC, Need Real Name
kvmtrace -o blah -w 5 (blah.kvmtrace.1) (337.30 KB, application/octet-stream) 2010-04-21 13:40 UTC, Need Real Name
kvmtrace -o blah -w 5 (blah.kvmtrace.2) (2.68 MB, application/octet-stream) 2010-04-21 13:42 UTC, Need Real Name
kvmtrace -o blah -w 5 (blah.kvmtrace.3) (794.70 KB, application/octet-stream) 2010-04-21 13:43 UTC, Need Real Name
/proc/cpuinfo on host (2.73 KB, text/plain) 2010-06-23 18:54 UTC, Need Real Name

Description Need Real Name 2010-04-21 10:52:31 UTC

This is actually occurring under CentOS 5.4 x86_64 (fully updated) with CentOS 5.4 32-bit guests. Reporting here since you are upstream.

I have five guests running on this machine. At least once a day one or more (normally more) guests will hit 100% cpu and become unresponsive.

Setting kernel.panic=10 in the guest does not reboot the guest. The only solution is to destroy the guest and start it again.

# rpm -qa|grep kvm
kmod-kvm-83-105.el5_4.28
etherboot-zroms-kvm-5.4.4-10.el5.centos
kvm-tools-83-105.el5_4.28
kvm-qemu-img-83-105.el5_4.28
kvm-83-105.el5_4.28

kernel 2.6.18-164.15.1.el5

The guests were using the virtio_blk disk device vda but I have switched them back to hda. The network cards use virtio. The disks are qcow2. All other settings are standard.

Comment 1 Need Real Name 2010-04-21 13:34:20 UTC

Created attachment 408066
strace -s 250 -f -p MADGUESTPID

Maybe this strace will help.

Comment 2 Need Real Name 2010-04-21 13:40:40 UTC

Created attachment 408069
kvmtrace -o blah -w 5 (blah.kvmtrace.0)

Comment 3 Need Real Name 2010-04-21 13:40:52 UTC

Created attachment 408070
kvmtrace -o blah -w 5 (blah.kvmtrace.1)

Comment 4 Need Real Name 2010-04-21 13:42:20 UTC

Created attachment 408071
kvmtrace -o blah -w 5 (blah.kvmtrace.2)

Comment 5 Need Real Name 2010-04-21 13:43:27 UTC

Created attachment 408072
kvmtrace -o blah -w 5 (blah.kvmtrace.3)

Comment 6 Need Real Name 2010-04-21 13:44:35 UTC

Trace files from kvmtrace and strace attached.

Please could you mark these files as sensitive/confidential.

Comment 7 Need Real Name 2010-04-21 14:16:27 UTC

Setting severity to high since this is a hard crash.

Comment 8 Need Real Name 2010-04-21 14:34:16 UTC

Using hangcheck has no effect at all. It never fires.
Using kernel.panic=10 doesn't help either.
Nothing in the host's logs. No console messages in the guest.

Comment 9 Need Real Name 2010-04-21 18:44:43 UTC

To get this into a "supported" state, so that you can worry about it (!), I converted the three least busy machines to:
i) not use virtio_blk
ii) not use the virtio network device
iii) not use the qcow2 disk format

One of these newly supported machines just did it again: 100% cpu and not responding.

What can I try please?

Comment 10 Amit Shah 2010-04-21 19:07:03 UTC

(In reply to comment #1)
> Created an attachment (id=408066) [details]
> strace -s 250 -f -p MADGUESTPID

Could be a time drift issue. Do you see clock skews in the guests? Are you using kvmclock in the guest?

Comment 11 Need Real Name 2010-04-21 19:23:20 UTC

I have noticed a few seconds difference. I am using kvm-clock:

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource 
kvm-clock 

The guests with problems are almost idle.

Time difference is about 1 second: the guests are behind by a second.

Do you recommend installing ntpd in the guests?
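
The clock source check shown in this comment can also be scripted, which helps when several guests need to be surveyed. The following is a minimal sketch in Python, assuming the standard Linux sysfs layout; the function names are hypothetical, and the `available_clocksource` file is read from the same directory as `current_clocksource`.

```python
from pathlib import Path

# Standard sysfs directory on Linux; current_clocksource is the file shown
# in the comment above, available_clocksource sits next to it.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return the active clock source and the sources the kernel offers."""
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return {"current": current, "available": available}


def uses_kvm_clock(info):
    """True when the guest is currently timed by kvm-clock."""
    return info["current"] == "kvm-clock"
```

Calling `read_clocksource()` inside a guest and passing the result to `uses_kvm_clock()` answers the question asked in comment 10 without logging in to inspect the file by hand.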

Comment 12 Need Real Name 2010-04-21 20:23:50 UTC

I installed ntpd in the guests along with step tickers. I just had the guest crash again. Same strace. ARGH! :(

Comment 13 Need Real Name 2010-04-22 09:15:05 UTC

Three crashes so far today. If I can put kvm into debug mode and you want me to post that, please let me know.

Comment 14 Need Real Name 2010-04-22 19:38:21 UTC

I switched from clocksource=kvm-clock to clocksource=acpi_pm, keeping ntpd and it seems a lot more stable. No crashes since the change.

I will watch it for another 24 hours with ntpd now off to see if it crashes again and if the time drifts.

It may be significant that the machines which are under load never, or almost never, crash.

Comment 15 Need Real Name 2010-04-23 14:35:19 UTC

24 hours is up: not one single extra crash.

I've changed all guests away from using kvm-clock. Is this a known problem with 5.4?

Comment 16 Need Real Name 2010-04-25 14:20:59 UTC

Still no more crashes!

It is interesting how the CPU flags on the host and guest compare:

HOST: tsc constant_tsc nonstop_tsc
GUEST: tsc

i.e. no constant_tsc, which means it might be linked to bug 475598 - but it's not clear to me in that bug if there is a missing constant_tsc in the guests or on the host.
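
The host/guest flag comparison above can be reproduced mechanically. This is a rough sketch in Python, assuming the usual `flags : ...` line format of /proc/cpuinfo on x86; the function names and the default list of TSC-related flags are illustrative choices, taken from the three flags quoted in this comment.

```python
def cpu_flags(cpuinfo_text):
    """Collect the flags of the first CPU listed in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()  # no flags line found


def missing_in_guest(host_text, guest_text,
                     wanted=("tsc", "constant_tsc", "nonstop_tsc")):
    """List the wanted flags that the host exposes but the guest does not."""
    host = cpu_flags(host_text)
    guest = cpu_flags(guest_text)
    return [flag for flag in wanted if flag in host and flag not in guest]
```

Feeding it the contents of /proc/cpuinfo from the host and from a guest gives the same kind of result as the manual comparison, in a form that can be attached to a report.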

Comment 17 XinSun 2010-06-01 03:49:37 UTC

I hit this problem on rhel5u5 (32-bit server) guests. I used rhevm (sm70) to create these guests (with the virtio disk driver and virtio network driver) on rhev-hypervisor-5.5-2.2.1.

These rhel5u5 (32-bit server) guests hang after running for some time.

Comment 18 Glauber Costa 2010-06-01 12:07:13 UTC

OK, in fact there is a known kvmclock problem with rhel5.5.

The known bug, so far, is known to bite SMP. Are you doing SMP in the guest?

For the record, this is the bug:
https://bugzilla.redhat.com/show_bug.cgi?id=570824

RPMs for it are likely to arrive soon.

Comment 19 philippe.plouffe 2010-06-01 13:53:41 UTC

(In reply to comment #18)
> OK, in fact there is a known kvmclock problem with rhel5.5.
> 
> The known bug, so far, is known to bite SMP. Are you doing SMP in the guest?
> 
> For the record, this is the bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=570824
> 
> RPMs for it are likely to arrive soon.    

I'm having the same issue (5.5 32-bit guests) and yes, they all have multiple CPUs assigned.

Comment 20 Glauber Costa 2010-06-01 15:43:19 UTC

Please do test a single-CPU guest to make sure it does go away.

Comment 21 Need Real Name 2010-06-01 16:20:25 UTC

I was seeing this on single cpu guests.

Comment 22 Glauber Costa 2010-06-07 14:26:35 UTC

Note added to all kvmclock bugs:

Please retest with kernel-2.6.18-202.el5 (RHEL5) or kernel-2.6.32-33.el6 (RHEL6) in your guest kernel. In case it works, please close as a DUP of bugs 570824 (RHEL5) or 569603 (RHEL6)
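
To check whether a guest already runs a kernel at or beyond the builds named above, the release string reported by `uname -r` can be compared programmatically. The sketch below is in Python and rests on two assumptions: that release strings follow the `2.6.18-164.15.1.el5` pattern seen in this report, and that any later build in the same series is equally suitable for the retest. The function names are hypothetical.

```python
import re

# Minimum guest kernel builds named in the comment above, keyed by base version.
MINIMUM_BUILD = {"2.6.18": (202,), "2.6.32": (33,)}


def parse_release(release):
    """Split e.g. '2.6.18-164.15.1.el5' into ('2.6.18', (164, 15, 1))."""
    match = re.match(r"^(\d+\.\d+\.\d+)-([\d.]+?)\.el\d", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    base, build = match.groups()
    return base, tuple(int(part) for part in build.split("."))


def meets_retest_minimum(release):
    """True if the release is at least the build requested for retesting."""
    base, build = parse_release(release)
    minimum = MINIMUM_BUILD.get(base)
    return minimum is not None and build >= minimum
```

With these definitions, the kernel from the original description (`2.6.18-164.15.1.el5`) is below the threshold, while a `2.6.18-203.el5` guest kernel is above it.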

Comment 23 Glauber Costa 2010-06-22 13:25:05 UTC

Did any of you retest this?

Comment 24 philippe.plouffe 2010-06-22 14:01:01 UTC

Kernel 202 is not available yet. I can't find a place to get the kernel-2.6.18-202.el5 you are referring to. The closest thing I found by looking at the referenced bug IDs was version 203, at http://people.redhat.com/jwilson/el5 (URL comes from https://bugzilla.redhat.com/show_bug.cgi?id=570824).

Where can I get it?

Comment 25 Glauber Costa 2010-06-22 19:24:45 UTC

203 will do.

Comment 26 Zachary Amsden 2010-06-22 20:59:30 UTC

This sounds exactly like the bug I was hitting running RHEL6 kvmclock guests.

The problem only happens for SMP guests, as described here, and resulted in hangs.  Switching away from kvmclock or switching to UP guest fixed the problem.

Glauber's patches to the guest kernel to make the kvmclock not go backwards fixed the problem.  I'm highly suspicious this is a dup.
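
The idea of a clock that must not go backwards can be illustrated in user space. The Python sketch below is not the guest kernel patch referred to above, which this report does not show; it only demonstrates the general technique of remembering the last value handed out and never returning anything smaller. The class name and the `read_raw` callable are hypothetical.

```python
import threading


class MonotonicClock:
    """Wrap a raw time reader so callers never see time go backwards."""

    def __init__(self, read_raw):
        self._read_raw = read_raw  # callable returning a timestamp (e.g. ns)
        self._last = 0             # largest value returned so far
        self._lock = threading.Lock()

    def now(self):
        raw = self._read_raw()
        with self._lock:
            # A reading below the last returned value is clamped up to it.
            if raw > self._last:
                self._last = raw
            return self._last
```

If the raw source yields 100, 105, 103, 110, successive calls to `now()` return 100, 105, 105, 110: the backwards step is hidden from callers instead of being passed on.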

Comment 27 Need Real Name 2010-06-22 21:12:53 UTC

> as described here

Please could everyone stop stealing my bug.

This bug is for single cpu guests.

Comment 28 Glauber Costa 2010-06-23 13:23:33 UTC

Ok, let's start from the supposition this is not the same bug.

Would be good to give the new kernel a test anyway, so we can start from a
fresher base. Many kvmclock patches went in, so maybe your problem is fixed by it.

In case it is not, please report, together with the info present in your /proc/cpuinfo (host)

Comment 29 Need Real Name 2010-06-23 18:54:33 UTC

Created attachment 426358
/proc/cpuinfo on host

Comment 30 Need Real Name 2010-06-23 18:56:45 UTC

kernel-2.6.18-203.el5.i686.rpm installed in guest and rebooted.

Comment 31 Need Real Name 2010-06-25 06:40:51 UTC

So far, no 100% craziness.

Comment 32 Need Real Name 2010-06-26 15:10:36 UTC

Still working well. It's never lasted anywhere near this long before. Timekeeping is also fine.

Do you want me to do anything else?

Comment 33 Glauber Costa 2010-06-28 13:06:57 UTC

If it is working for you, I'll close it as a dup.

Comment 34 Glauber Costa 2010-06-28 13:07:43 UTC


*** This bug has been marked as a duplicate of bug 570824 ***
