181856 – Xen guest reporting, "BUG: soft lockup detected on CPU#0!"

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel-xen
Sub Component:
Version:	5
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	James Morris
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	185081 186049
Depends On:
Blocks:	179599

Reported:	2006-02-17 07:08 UTC by Jason
Modified:	2007-11-30 22:11 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2006-09-19 00:49:17 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Xen guest console showing error messages (2.33 KB, text/plain) 2006-02-17 07:08 UTC, Jason
Errors still occuring with 2054 kernel under high CPU load (5.38 KB, text/plain) 2006-03-17 10:32 UTC, Jason
Console output with Xen0/XenU 2069 installed (1.92 KB, text/plain) 2006-03-28 10:45 UTC, Jason
Updated patch (849 bytes, text/x-patch) 2006-04-02 05:44 UTC, James Morris

Description Jason 2006-02-17 07:08:47 UTC

I am receiving an error message while running a Xen guest: "BUG: soft 
lockup detected on CPU#0!"

This occurs frequently when running 'yum -y update' on both the Xen host and 
guest, both of which are running Fedora Core 5 Test 2.  The error occurs at 
other times as well, but is virtually 100% reproducible for me if I run yum on 
both the host and guest at the same time.  The console output from one attempt 
to run yum is attached to this bug report.

The guest console does not respond for several seconds upon receiving this 
message, but does return and continue exactly where it left off without any 
problems.

Nothing is recorded in /var/log/dmesg when these errors occur.

Comment 1 Jason 2006-02-17 07:08:47 UTC

Created attachment 124800
Xen guest console showing error messages

Comment 2 Jason 2006-02-17 10:19:40 UTC

Additional Info: 
This problem existed in every version of the hypervisor/guest kernels I have 
used, up to and including kernel-xen-hypervisor-2.6.15-1.1955_FC5 and 
kernel-xen-guest-2.6.15-1.1955_FC5.

I am also not sure whether the system's low specs could be contributing to the 
frequency of the error messages.  The system is a Dell Inspiron 8000 laptop 
with a PIII 700 MHz processor and 512 MB of RAM.

Comment 3 Jason 2006-02-20 00:44:23 UTC

This still occurs on FC5T3 host/guests.  I have been running through some 
different scenarios to try to determine what is causing this.

It seems that the guest locks up when the host starts the line that reads:
developmen: ################################################## 4303/4303

It looks as though this happens any time I am connected to the Xen console 
(using 'xm console fc5t3xen') and try to run yum on both simultaneously.  It 
does not matter how the console is reached: two tabs in a Gnome Terminal, two 
separate Gnome Terminals, two ssh sessions to the host with the guest console 
attached in one of them, or a Gnome session open on the desktop plus an ssh 
session to the host with the guest console open.

In short, it occurs whenever I run yum on both while using the Xen console to 
start the process on the guest.

However, if I open an ssh session to each individually (or at least an ssh 
session to the guest) and run yum on both without using the Xen guest console, 
everything appears to work normally.  Because of that I tend to believe that 
the guest is having a problem updating its console while the host is updating 
one of its terminal sessions.

Comment 4 Rahul Sundaram 2006-02-20 11:14:55 UTC


These bugs are being closed since a large number of updates have been released
after the FC5 test1 and test2 releases. Kindly update your system by running
'yum update' as the root user, or try out the third and final test version of
FC5, which is being released shortly, and verify whether the bugs are still
present on the system. Reopen or file new bug reports as appropriate after
confirming the presence of this issue. Thanks.

Comment 5 Jason 2006-02-20 11:25:17 UTC

Read previous comment.  This still occurs with FC5T3 guest and host on a clean 
install.

Comment 6 Stephen Tweedie 2006-02-24 20:59:20 UTC

This is almost certainly because of the raised priority dom0 has over domU.  It
may well be worthwhile modifying HV scheduler defaults to deal with this case.

Comment 7 Rik van Riel 2006-02-24 21:07:31 UTC

From the xen-devel mailing list.  It may be an idea to have defaults like this
in our hypervisor...

Date: Tue, 21 Feb 2006 11:01:06 +1100
From: James Harper
Subject: RE: [Xen-devel] dom0 starves guests off CPU

I found this too when doing a compile in dom0. Search the archives for a
thread titled 'Performance problems' from January this year.

Something like:
xm sched-sedf <domID> 0 0 0 1 1

was suggested there and it works for me!

Comment 8 Stephen Tweedie 2006-02-24 21:27:12 UTC

http://lists.xensource.com/archives/html/xen-devel/2006-02/msg00720.html

for the recent xen-devel thread.

Comment 9 Jason 2006-02-25 03:57:32 UTC

xm sched-sedf 0 0 0 0 1 1 seems to have made the two play much more nicely 
together.  While compiling and running yum on the host I was able to run yum 
from the guest console and not once suffer one of these error messages.
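
For readers unfamiliar with the arguments, the workaround discussed above can be sketched as a small shell helper. This is only a sketch under assumptions and was not posted in this bug: the helper name sedf_workaround_cmd is hypothetical, and the argument order (period, slice, latency hint, extratime flag, weight) is taken from the xm man page of the Xen 3.0 era, so it should be checked against the installed version. The helper only prints the command, so it can be reviewed before being run as root in dom0.

```shell
#!/bin/sh
# Sketch only: prints the sched-sedf workaround from this bug for review.
# Assumed argument order (check "man xm" on your system):
#   xm sched-sedf <domID> <period> <slice> <latency-hint> <extratime> <weight>

# sedf_workaround_cmd is a hypothetical helper name, not part of Xen.
sedf_workaround_cmd() {
    dom_id="${1:-0}"   # default to domain 0 (dom0), as used in comment 9
    # period=0 slice=0 latency-hint=0 extratime=1 weight=1
    printf 'xm sched-sedf %s 0 0 0 1 1\n' "$dom_id"
}

# Print the command for dom0.
sedf_workaround_cmd 0
```

If the assumed argument order is right, a period and slice of 0 with the extratime flag set would leave domain 0 without a reserved CPU share, so that it runs only in spare time weighted against the other domains. That would be consistent with the guest no longer being starved, but it is an interpretation rather than something stated in this bug.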

Comment 10 James Morris 2006-03-10 15:45:02 UTC

*** Bug 185081 has been marked as a duplicate of this bug. ***

Comment 11 James Morris 2006-03-15 16:21:03 UTC

This should be fixed now in kernel-xen0-2.6.15-1.2054_FC5.

Please verify.

Comment 12 Jason 2006-03-16 13:54:55 UTC

Appears to be working well.  Can't force a soft lockup message out of my 
system.  :)

Comment 13 Jason 2006-03-17 10:32:43 UTC

Created attachment 126269
Errors still occuring with 2054 kernel under high CPU load

Comment 14 Jason 2006-03-17 10:33:41 UTC

I hate to do this after saying that everything was good.  Last night I 
downloaded the BOINC client (http://boinc.berkeley.edu) to play with because 
I was bored, and left it running all night.

When I woke up in the morning the console connected to my xen guest had all the 
errors in the attachment sitting on it.

The problem seems to have improved in that, when I catch the error message 
occurring, the guest is immediately responsive again, unlike in the past, 
when it would hang for several seconds, possibly even minutes.  Whatever is 
catching the condition and printing the error message seems almost too 
sensitive.  Previously, even if I did not see the message, I would know that 
something was wrong because the guest locked up and became unresponsive.

Now, I would have no idea there was a problem, save for the error message...

Comment 15 James Morris 2006-03-17 14:19:11 UTC

What if you re-run the above load all night with the manual dom0 workaround?

xm sched-sedf <domID> 0 0 0 1 1

Comment 16 Jason 2006-03-19 05:11:55 UTC

Running the manual workaround seems to prevent the errors from occurring 
altogether.  Without it, the errors are harder to produce than with previous 
kernels, but still possible.

Comment 17 Stephen Tweedie 2006-03-22 18:27:42 UTC

*** Bug 186049 has been marked as a duplicate of this bug. ***

Comment 18 Stephen Tweedie 2006-03-22 18:30:31 UTC

There was a spec file problem which prevented the updated scheduler defaults
from taking effect in 1.2054; we are preparing an update kernel to fix this.

Comment 19 Robert Story 2006-03-24 21:33:10 UTC

I'm also seeing this, on a lowly PIII 450 with 768 MB RAM, with no real load on
either dom0 or domU. FC5 final, yum updated. Not too much to add, but I didn't
see a way to get on the CC list without adding a comment. :-/

Comment 20 Stephen Tweedie 2006-03-27 16:51:37 UTC

This should be fixed with the latest kernels on fedora-updates-testing
(currently at 2.6.16-1.2069_FC5), can you confirm?

Thanks.

Comment 21 Jason 2006-03-28 10:42:32 UTC

Still happening with 2069 installed on the host and guest.  Attached a copy of
my console output from the guest.

Comment 22 Jason 2006-03-28 10:45:19 UTC

Created attachment 126892
Console output with Xen0/XenU 2069 installed

This is the output from a console after installing the 2069 kernel from Fedora
Test Updates.

Comment 23 James Morris 2006-03-28 13:49:33 UTC

What if you try the manual setting again:

xm sched-sedf <domID> 0 0 0 1 1

Comment 24 Jason 2006-03-28 14:05:15 UTC

I entered in the manual fix after getting the errors (I received about 5 error 
messages in about an hour of testing before entering the manual setting) and 
once again get no error messages with the manual setting.
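
When comparing runs like this (about 5 messages in an hour without the setting, none with it), it can help to count the messages in a saved copy of the guest console output rather than watching the console. The following is a minimal sketch, assuming the console output has been saved to a file by some means; the helper name count_soft_lockups and the file name in the usage note are hypothetical.

```shell
#!/bin/sh
# Sketch only: count soft lockup reports in a saved guest console log.
# count_soft_lockups is a hypothetical helper name.
count_soft_lockups() {
    log_file="$1"
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps scripts using "set -e" running.
    grep -c 'BUG: soft lockup detected on CPU#' "$log_file" || true
}
```

For example, count_soft_lockups guest-console.log prints the number of lockup reports in guest-console.log (an assumed file name), which makes it easy to compare a run with the manual setting against one without it.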

Comment 25 James Morris 2006-04-02 05:44:08 UTC

Created attachment 127197
Updated patch

I've reworked the hypervisor patch to correctly simulate the effect of the
manual workaround.

I haven't been able to reproduce the problem with the existing patch, but heavy
testing with this new patch has not revealed any new problems.

Not sure when Xen will be enabled in the rawhide kernel rpm, so you may want to
try this patch with an srpm build -- just drop this patch on top of the file
xen-sched-sedf.patch in SOURCES.

Comment 26 Jason 2006-04-04 05:42:04 UTC

I applied the patch and recompiled, and I have been running for about 3 days 
now with the host as close to 100% CPU as I could keep it the entire time.  I 
performed various tasks with the guest at various times and was unable to 
produce the errors.  It looks like this patch is working well.

Comment 27 James Morris 2006-04-04 06:25:22 UTC

Thanks for the testing.
