192083 – BUG: soft lockup detected on CPU#1!

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:	MassClosed
Depends On:
Blocks:

Reported:	2006-05-17 14:25 UTC by Christopher Johnson
Modified:	2008-01-20 04:38 UTC
CC List:	2 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2008-01-20 04:38:33 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
A few iptables kernel log lines followed by the BUG entry in /var/log/messages (5.54 KB, text/plain) 2006-05-17 14:25 UTC, Christopher Johnson
dmidecode output from this machine (15.55 KB, text/plain) 2006-05-17 14:34 UTC, Christopher Johnson

Description Christopher Johnson 2006-05-17 14:25:45 UTC

Description of problem:
The message "BUG: soft lockup detected on CPU#1!" appears on the console along
with a partial register dump and traceback.  In one instance the entire message
was captured in the log and survived filesystem recovery at the next boot.  The
server then becomes totally unresponsive to both network and console, forcing a
hard power-off and reboot.
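
For anyone triaging similar reports, a minimal sketch of counting these
messages per CPU in a saved log follows.  It assumes only the message text
quoted above; the function name and the sample log lines are hypothetical and
are not taken from the attached log.

```python
import re
from collections import Counter

# Message text as quoted in this report; the CPU number is captured.
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")


def count_lockups(lines):
    """Return a Counter mapping CPU number to the count of soft-lockup reports."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines; only the BUG message text comes from the report.
    sample = [
        "May 17 10:00:01 fw kernel: (hypothetical iptables reject log line)",
        "May 17 10:00:02 fw kernel: BUG: soft lockup detected on CPU#1!",
    ]
    print(count_lockups(sample))
```

To scan a real log, the lines could be read with something like
open("/var/log/messages") and passed to count_lockups().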

Version-Release number of selected component (if applicable):
kernel-2.6.16-1.2111_FC5

How reproducible:
This machine is a new iptables-based firewall between a 5Mbps-capable Internet
service and a private network.  The lockup happened twice within a relatively
short time while running speed tests of the Internet connection through this
firewall.  The tests used speakeasy.net/speedtest/ and were run from a client on
the private network side of the firewall.

Steps to Reproduce:
1. Define iptables firewall rules and enable IPv4 forwarding.
2. Bring up the firewall with the Internet and private network interfaces up.
3. Run the speed test.

  
Actual results:
System locks up.

Expected results:
Stable continued execution.

Additional info:
This is an Opteron 280 based HP DL385 machine running a kernel released through
Fedora updates.  The kernel is not customized and no third-party modules are
loaded.
Log excerpt attached.

Comment 1 Christopher Johnson 2006-05-17 14:25:45 UTC

Created attachment 129320
A few iptables kernel log lines followed by the BUG entry in /var/log/messages

Comment 2 Christopher Johnson 2006-05-17 14:34:19 UTC

Created attachment 129321
dmidecode output from this machine

Comment 3 Christopher Johnson 2006-05-17 15:46:20 UTC

I removed ip_conntrack_netbios_ns from the iptables-config loaded modules and
restarted iptables this morning.
The server has not yet crashed since then in spite of repeated Internet speed tests.

That might be coincidence, but since that is a relatively new module and the
traceback indicated iptables connection tracking involvement, I thought it worth
testing.

I will post again tomorrow if the system remains stable, or sooner if it locks up.
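
A minimal sketch of the module removal described in this comment, for readers
who want to script it.  It assumes the module list sits on a line of the form
IPTABLES_MODULES="..." in the iptables-config file; that variable name, the
function name, and the sample contents are assumptions for illustration, not
details taken from this report.

```python
import re


def remove_module(config_text, module):
    """Return config_text with `module` dropped from the IPTABLES_MODULES list."""

    def rewrite(match):
        # Keep every listed module except the one being removed.
        kept = [name for name in match.group(2).split() if name != module]
        return match.group(1) + '"' + " ".join(kept) + '"'

    # Assumed layout: IPTABLES_MODULES="mod_a mod_b" at the start of a line.
    return re.sub(
        r'^(IPTABLES_MODULES=)"([^"]*)"',
        rewrite,
        config_text,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    # Hypothetical file contents for illustration only.
    sample = 'IPTABLES_MODULES="ip_conntrack_netbios_ns ip_conntrack_ftp"\n'
    print(remove_module(sample, "ip_conntrack_netbios_ns"), end="")
```

The function only transforms text; writing the result back and restarting
iptables are left to the administrator.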

Comment 4 Christopher Johnson 2006-05-19 01:13:20 UTC

Uptime 38 hours so far.  If it is still up tomorrow morning I will re-include
ip_conntrack_netbios_ns in loaded modules, restart iptables, and see if it soon
breaks.

Comment 5 Christopher Johnson 2006-05-26 23:23:14 UTC

On 2006-05-19 I restarted iptables with ip_conntrack_netbios_ns included in the
loaded modules.  It ran without lockup until I rebooted into the
2.6.16-1.2122_FC5 kernel on 2006-05-26 at 17:31 (without
ip_conntrack_netbios_ns).  So what I thought would reproduce the lockup did not.

But under production network traffic the 2122 kernel promptly locked up 3 times
in less than 2 hours: sometimes on CPU#0, sometimes on CPU#1, and once during
boot, with little to no output on the console and nothing in the log.

On the last power cycle I booted back into the 2111 kernel to see whether it is
more stable.

Comment 6 Christopher Johnson 2006-05-27 03:22:15 UTC

2111 is going into soft lockup in less than an hour under consistent network
load.  3 or 4 times in 4 hours including the time it sat in lockup while I
grabbed a late dinner.

Comment 7 Christopher Johnson 2006-05-28 14:24:36 UTC

Back on the 2122 kernel.  I believe I have found the condition that exposes this
lockup bug on my firewall: the vga=792 boot option.

Here's the collection of observations that led me to investigate that:
1) The first few lines of the bug message, which often do appear on the console,
several times referenced something about a console semaphore.
2) The console output is strangely slow on this hardware with my usual vga=792
boot option (I like to see more on the screen).  As a firewall this machine logs
(with rate limiting) various iptables rejects, so the slow console output could
contribute to a race condition involving SMP and iptables logging.
3) I found a report online from someone using an NV driver who got "soft
lockup detected on CPU#n", which was fixed with newer NV code.
4) It is my understanding that the vga=792 boot option causes a framebuffer
driver to be used.

Booting without vga=792 the firewall has not locked up.  It had about 1.5 hours
of the same network load before I lost my test window, and with no traffic it
has remained up ever since.

I then went back through the previous week's logs and verified that on the one
occasion when 2111 stayed up for 9 days, I had edited the boot line and booted
without vga=792; the Bootdata log entry confirms that.  The Bootdata entries for
every boot that ended in a crash (usually in less than an hour) all include
vga=792.
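
The cross-check described above can be sketched as a small tally of boots by
whether the kernel command line carried a vga= option and whether that boot
ended in a lockup.  The function names and the sample records are hypothetical;
the real Bootdata entries are not reproduced here.

```python
def vga_mode(cmdline):
    """Return the value of the vga= boot option, or None if it is absent."""
    for token in cmdline.split():
        if token.startswith("vga="):
            return token[len("vga="):]
    return None


def tally(boots):
    """Count boots keyed by (had_vga_option, locked_up).

    `boots` is an iterable of (cmdline, locked_up) pairs.
    """
    counts = {}
    for cmdline, locked_up in boots:
        key = (vga_mode(cmdline) is not None, locked_up)
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical boot records for illustration; not the actual log data.
    boots = [
        ("ro root=/dev/sda1 vga=792", True),
        ("ro root=/dev/sda1 vga=792", True),
        ("ro root=/dev/sda1", False),
    ]
    for (had_vga, locked_up), n in sorted(tally(boots).items()):
        print(f"vga option: {had_vga}, locked up: {locked_up}, boots: {n}")
```

A table with entries only on the (True, True) and (False, False) keys would
match the pattern reported here, though it would show correlation only and
would not by itself identify the root cause.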

I will do some more testing of the DL385 under production load later in the week
for higher confidence, but this is the only factor I have found that is
consistent with when the lockup bug presents, and when it does not.

I'm not a kernel developer, but it seems likely that this merely influences the
likelihood of recreating the bug triggering conditions, and that the root cause
has nothing to do with the video driver.

Believing the bug to be specific to SMP kernels, I have temporarily replaced
the DL385 exhibiting the bug with a DL360 running identical software versions
and configuration (without vga=792), and it does not have the lockup problem.
The DL360 is a uniprocessor PIII Coppermine 1GHz machine, which of course runs
the i686 build of the 2.6.16-1.2122_FC5 kernel.  It has been running with
production network load for 33 hours now.

Comment 8 Dave Jones 2006-10-16 18:00:06 UTC

A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 9 Jon Stanley 2008-01-20 04:38:33 UTC

(this is a mass-close to kernel bugs in NEEDINFO state)

As indicated previously, there has been no update on the progress of this bug;
therefore I am closing it as INSUFFICIENT_DATA. Please re-open if the issue
still occurs for you and I will try to assist in its resolution. Thank you for
taking the time to report the initial bug.

If you believe that this bug was closed in error, please feel free to reopen
this bug.
