Bug 236764 - Heavy disk activity like tar/cp causes "spinlock lockup on CPU#2"
Status: CLOSED WONTFIX
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: x86_64 Linux
Priority: medium
Severity: urgent
Assigned To: Kernel Maintainer List
QA Contact: Brian Brock
Keywords: bzcl34nup
Reported: 2007-04-17 11:56 EDT by bobsed
Modified: 2008-05-06 15:30 EDT (History)
CC: 2 users

Last Closed: 2008-05-06 15:30:37 EDT
Attachments
dmesg for 2.6.20-1.2307.fc5 (32.27 KB, text/plain)
2007-04-17 11:56 EDT, bobsed
Description bobsed 2007-04-17 11:56:44 EDT
Description of problem:

My server just toasted for the third or fourth time in a few days with a total
lockup. The only error message I was able to get from the console (via the
serial port) was:

BUG: spinlock lockup on CPU#2, httpd/26200, ffff8102fe8a5600 (Not tainted)
BUG: spinlock lockup on CPU#3, classifieds.cgi/27066, ffff8102fe8a5600 (Not
tainted)


These lockups have all occurred when there has been heavy disk activity such as
tar'ing or cp'ing lots of files. Doing so seems to cause the load to spike
considerably and then the server hangs. In one previous case I saw a kernel
trace on the console which included some references to ext3 and jbd, but I've
not been able to catch a copy of it. The most recent crash resulted only in the
above error messages being displayed on the serial console.


Version-Release number of selected component (if applicable):

 kernel-2.6.20-1.2312.fc5

also had the same crashes with ext3 kernel messages for:

 kernel-2.6.19-1.2288.2.4.fc5

I've had a similar crash for 2.6.20-1.2307.fc5, but was unable to retrieve
any messages from the console for this one. That's what I'm running now.


How reproducible:

tar or cp lots of files, look at the console.

Steps to Reproduce:
1. Boot up
2. tar lots of files
3. watch it crash
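The trigger described above is simply sustained heavy disk I/O. A minimal sketch of a reproduction script (the scratch paths, file counts, and sizes here are made up for illustration, not taken from the reporter's setup) might look like:

```shell
#!/bin/sh
# Generate heavy disk activity similar to the reported trigger:
# create many small files, then tar and copy the whole tree.
set -e
WORK=$(mktemp -d)          # scratch area; point this at the filesystem under test
mkdir -p "$WORK/src"
for i in $(seq 1 200); do
    # many small writes to build up a file tree
    dd if=/dev/zero of="$WORK/src/file$i" bs=4k count=4 2>/dev/null
done
tar -cf "$WORK/archive.tar" -C "$WORK" src   # heavy sequential read+write
cp -r "$WORK/src" "$WORK/copy"               # heavy copy load
sync                                          # force the journal/flush path
```

Watching the (serial) console while this runs in a loop is the closest analogue to the reporter's workload; remove `$WORK` afterwards.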
  
Actual results:

The server locks up and needs to be rebooted

Expected results:

The server doesn't fall over, and has the hundreds of days of rock-solid uptime
that the zealots claim makes Linux so much better than Windows.

Additional info:

I realize that this sucks as far as error messages, but I'll update if I can get
a backtrace. This is a 2x Dual-Core AMD Opteron(tm) Processor 2220 SE box from
Penguin with an Adaptec raid controller using aacraid driver (1.1-5[2423]-mh3).

If anyone has any tips on boot cmdline params to use to make this thing more
stable, I'd be appreciative.
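(Not an official recommendation, but two boot parameters commonly suggested for debugging hard lockups of this kind, both documented in the kernel's kernel-parameters.txt, are shown below; the serial port and baud rate are examples and must match the actual hardware.)

```
console=ttyS0,115200 console=tty0   # mirror oops/panic output to the serial console
nmi_watchdog=1                      # enable the NMI watchdog so hard lockups produce a backtrace
```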

dmesg for 2.6.20-1.2307.fc5 attached

Colin
Comment 1 bobsed 2007-04-17 11:56:44 EDT
Created attachment 152826 [details]
dmesg for 2.6.20-1.2307.fc5
Comment 2 Chuck Ebbert 2007-04-17 17:28:48 EDT
Which Adaptec card is the server using?

Does it have the latest BIOS/firmware update installed?
(Version 8832 is over a year old.)
Comment 3 bobsed 2007-04-18 09:28:30 EDT
Its an Adaptec 2130S with BIOS v8832. I had looked into upgrading half a year
ago when I first got this server as I'd been having problems with XFS crashes,
but at the newer BIOS's came with big warnings about being unstable, and the
most common response was that it was XFS. Now it is running ext3 and had been
stable for a month.

It just crashed a few minutes ago with this stack trace, but looking at the
trace, it appears to be a network card related issue this time. Our monitoring
shows that the load was twice the usual, but still only averaging about 1.2 for
an hour.

BUG: spinlock bad magic on CPU#0, swapper/0 (Not tainted)
 lock: ffff81019cee3470, .magic: ffffffff, .owner: /0, .owner_cpu: -1662110600

Call Trace:
 <IRQ>  [<ffffffff802076e5>] _raw_spin_lock+0x1e/0xe9
 [<ffffffff80260adb>] _spin_lock_irqsave+0x9/0xe
 [<ffffffff8022d9cc>] __wake_up+0x22/0x4f
 [<ffffffff80251927>] sk_stream_write_space+0x5c/0x82
 [<ffffffff8021b86b>] tcp_rcv_established+0x851/0x8fe
 [<ffffffff8023a0d8>] tcp_v4_do_rcv+0x1b5/0x4cf
 [<ffffffff80227188>] tcp_v4_rcv+0x95d/0x9f1
 [<ffffffff80428a04>] ip_local_deliver_finish+0x0/0x1fd
 [<ffffffff802335a5>] ip_local_deliver+0x1b1/0x275
 [<ffffffff80234679>] ip_rcv+0x497/0x4de
 [<ffffffff802201cf>] netif_receive_skb+0x34f/0x3d9
 [<ffffffff880f707d>] :forcedeth:nv_napi_poll+0x438/0x54a
 [<ffffffff8020c4d3>] net_rx_action+0xa8/0x1ad
 [<ffffffff880f56d9>] :forcedeth:nv_nic_irq+0x1a7/0x23e
 [<ffffffff80211fa5>] __do_softirq+0x55/0xc3
 [<ffffffff8025b23c>] call_softirq+0x1c/0x28
 [<ffffffff802685b7>] do_softirq+0x2c/0x85
 [<ffffffff8026875c>] do_IRQ+0x14c/0x16d
 [<ffffffff80266fb1>] default_idle+0x0/0x3d
 [<ffffffff8025a631>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80266fda>] default_idle+0x29/0x3d
 [<ffffffff80246a20>] cpu_idle+0x8c/0xaf
 [<ffffffff805e4792>] start_kernel+0x236/0x23b
 [<ffffffff805e415c>] _sinittext+0x15c/0x160
Comment 4 bobsed 2007-04-18 10:03:20 EDT
Incidentally, I'm not trying to be an idiot here, I just didn't get coffee yet. 

That should read "but at the time, the newer RAID card BIOS's came with dire
warnings about being unstable".

I also realize that this stack trace shows nothing to do with a file system
issue. But it has been crashed each time a heavy load was placed on it via
tar/cp, and this has happened with a number of the recent RH kernel RPM's, hence
my original post. I'm not sure whether this stack trace is related to the same
issue, but since I had it, I thought I would include it anyway.
Unfortunately this is currently a production server, so, much as I would like to
try breaking it, I can't at the moment as its replacement is still being
shipped. Once it arrives I'll probably end up moving customers on to that box
and then I'll be able to test more readily.

Colin.
Comment 5 Bug Zapper 2008-04-04 02:52:45 EDT
Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6, thirty days
from now, it will be closed 'WONTFIX'. If you can reproduce this bug in
the latest Fedora version, please change the bug's version field to the
respective release. If
you are unable to do this, please add a comment to this bug requesting
the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers
Comment 6 Bug Zapper 2008-05-06 15:30:35 EDT
This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora, please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
