Bug 229516 - BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Summary: BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-02-21 18:05 UTC by Orion Poplawski
Modified: 2007-11-30 22:11 UTC (History)
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-17 19:55:33 UTC
Type: ---
Embargoed:


Attachments
var log messages from March 25th 2007 where Kernel error occurs. (3.55 KB, text/plain)
2007-03-26 02:50 UTC, James Collins

Description Orion Poplawski 2007-02-21 18:05:07 UTC
Description of problem:

Machine locked up.  On console:

BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)
Call Trace:
  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  __drain_alien_cache+0x2d/0x71
  cache_reap+0x244/0x251
  run_workqueue+0xb2/0x10c
  worker_thread+0xec/0x11e
  kthread+0xd0/0x100
  child_rip+0xa/0x12

This is with 2.6.19-1.2288.fc5debug
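
For context, a rough sketch of what that reap path does (paraphrased from
2.6.19's mm/slab.c, not quoted verbatim): cache_reap() periodically drains the
per-node "alien" object caches, and __drain_alien_cache() has to take the
owning node's slab list_lock to hand the objects back, so the trace above is
the events/3 worker spinning on a remote node's list_lock:

  /* Paraphrased sketch, not the verbatim kernel source: free objects that
   * were allocated on another NUMA node back to that node's slab lists. */
  static void __drain_alien_cache(struct kmem_cache *cachep,
                                  struct array_cache *ac, int node)
  {
          struct kmem_list3 *rl3 = cachep->nodelists[node];

          if (ac->avail) {
                  spin_lock(&rl3->list_lock);   /* the lock being spun on */
                  free_block(cachep, ac->entry, ac->avail, node);
                  ac->avail = 0;
                  spin_unlock(&rl3->list_lock);
          }
  }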

Above this was the tail end of another trace of some sort, perhaps a recursive
lockdep message.  It didn't get flushed to /var/log/messages either.

  :xfs:xfs_iflush+0x168/0x51e
  :xfs:xfs_inode_flush+0x13c/0x165
  :xfs:xfs_fs_write_inode+0x28/0x50
  __writeback_single_inode+0x213/0x352
  sync_sb_inodes+0x1c1/0x282
  writeback_inodes+0x93/0xee
  background_writeout+0x82/0xb5
  pdflush+0x134/0x1df
  kthread+0xd0/0x100
  child_rip+0xa/0x12

Dual Opteron 275, Tyan S2882-D MB

Machine has been locking up and spontaneously rebooting.  Had gotten some EDAC
chipkill errors before, but those seem to be gone after removing one of the DIMMs.

Comment 1 Chuck Ebbert 2007-02-21 19:24:40 UTC
Are you using XFS on top of LVM and/or RAID?


Comment 2 Orion Poplawski 2007-02-21 20:28:22 UTC
(In reply to comment #1)
> Are you using XFS on top of LVM and/or RAID?
> 

Nope, straight partition.

# mount
/dev/hdc6 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hdc1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hdc7 on /scratch type xfs (rw)  *********
/dev/hdc3 on /usr type ext3 (rw)
/dev/hdc5 on /var type ext3 (rw)
tmpfs on /tmp type tmpfs (rw)


Do you think xfs could be causing the lockups?  Could easily make it an ext3 fs
instead...

Comment 3 Orion Poplawski 2007-02-22 18:24:19 UTC
System ran fine overnight with a couple of CPU-intensive loads.  Started up a
memory-intensive load that caused swapping and it crashed fairly quickly (a few
minutes).

I'm going to reinstall with FC6 and no xfs /scratch partition and see what
happens there.

Screen from today's crash:

  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  cache_alloc_refill+0x8b/0x219
  kmem_cache_alloc+0xb1/0x107
  copy_process+0xa8/0x1748
  do_fork+0xb41/0x1af
  ptregscall_common+0x67/0xb0
  system_call+0x7e/0x83

BUG: spinlock lockup on CPU#2, events/2/16 .... (Not tainted)

Call Trace:
  show_trace+0x34/0x47
  dump_stack+0x12/0x17
  _raw_spin_lock+0xc4/0xe5
  drain_freelist+0x29/0xa9
  cache_reap+0x12d/0x251
  run_workqueue+0xb2/0x10c
  worker_thread+0xec/0x11e
  kthread+0xd0/0x100
  child_rip+0xa/0x12

Tried to collect vmcore with kdump but ran out of space.  Will make sure there
is enough next time...

Comment 4 Orion Poplawski 2007-02-22 20:44:41 UTC
With FC6:

Feb 22 12:51:37 apapane kernel: BUGging on (!spin_is_locked(&cachep->nodelists[({ typeof(_proxy_pda.nodenumber) ret__; switch (sizeof(_proxy_pda.nodenumber)) { case 2: asm("mov" "w %%gs:%c1,%0" : "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 4: asm("mov" "l %%gs:%c1,%0": "=r" (ret__): "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 8: asm("mov" "q %%gs:%c1,%0": "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; default: __bad_pda_field(); } ret__; })]->list_lock))
Feb 22 12:51:37 apapane kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb 22 12:51:37 apapane kernel: Kernel BUG at mm/slab.c:2350
Feb 22 12:51:37 apapane kernel: invalid opcode: 0000 [1] SMP
Feb 22 12:51:37 apapane kernel: last sysfs file: /devices/pci0000:00/0000:00:06.0/irq
Feb 22 12:51:37 apapane kernel: CPU 3
Feb 22 12:51:37 apapane kernel: Modules linked in: nfs lockd nfs_acl autofs4 w83627hf eeprom lm85 hwmon_vid i2c_isa sunrpc ipv6 cpufreq_ondemand dm_mirror dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport amd_rng i2c_amd8111 i2c_amd756 k8_edac pcspkr shpchp i2c_core edac_mc k8temp hwmon e100 mii serio_raw tg3 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Feb 22 12:51:37 apapane kernel: Pid: 2744, comm: cupsd Not tainted 2.6.19-1.2911.fc6debug #1
Feb 22 12:51:37 apapane kernel: RIP: 0010:[<ffffffff8025d136>]  [<ffffffff8025d136>] cache_alloc_refill+0x112/0x219
Feb 22 12:51:37 apapane kernel: RSP: 0018:ffff8100bfa4ba08  EFLAGS: 00010086
Feb 22 12:51:37 apapane kernel: RAX: 0000000000000263 RBX: 0000000000000000 RCX: ffffffff80586390
Feb 22 12:51:37 apapane kernel: RDX: ffff8100bfec7100 RSI: 0000000000000000 RDI: ffffffff80586360
Feb 22 12:51:37 apapane kernel: RBP: ffff8100bfc81c78 R08: ffffffff80586390 R09: ffffffff80497c39
Feb 22 12:51:37 apapane kernel: R10: ffff81005a3ab590 R11: ffff810037e90300 R12: ffff810075f76da0
Feb 22 12:51:37 apapane kernel: R13: ffff810037f5e5c0 R14: ffff810037ec4700 R15: 000000000000000c
Feb 22 12:51:37 apapane kernel: FS:  00002aaaaefce560(0000) GS:ffff8100bff54470(0000) knlGS:0000000000000000
Feb 22 12:51:37 apapane kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 22 12:51:37 apapane kernel: CR2: 0000000012331000 CR3: 00000000bece7000 CR4: 00000000000006e0
Feb 22 12:51:37 apapane kernel: Process cupsd (pid: 2744, threadinfo ffff8100bfa4a000, task ffff8100bfec7100)
Feb 22 12:51:38 apapane kernel: Stack:  0000000000000850 0000000000000000 ffff810037f5e5c0 0000000000000246
Feb 22 12:51:38 apapane kernel:  0000000000000850 ffffffff88022c3d 0000000000000001 ffffffff8020abf8
Feb 22 12:51:38 apapane kernel:  0000000000000000 ffff81007f7e93b0 ffff810069bad6d0 0000000000000000
Feb 22 12:51:39 apapane kernel: Call Trace:

This seems to point to hardware corruption?  How else would we generate an invalid
opcode?  memtest86 does run fine with no errors.
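
For reference, the macro soup in that BUG line is just the expansion of
numa_node_id(), which on x86_64 reads the node number out of the per-CPU PDA
with a %gs-relative mov.  Stripped of that expansion, the assertion at
mm/slab.c:2350 (hit from cache_alloc_refill(), per the RIP) is roughly the
slab-debug check below; this is a reconstruction from the log, not the
verbatim 2.6.19 source:

  /* slab debug: the local node's list_lock must be held at this point */
  BUG_ON(!spin_is_locked(&cachep->nodelists[numa_node_id()]->list_lock));

In other words, it fires because the per-node list_lock was not held where the
debug code expected it to be, which is the same lock family the earlier
spinlock-lockup reports were spinning on.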

Comment 5 Orion Poplawski 2007-02-22 23:54:46 UTC
Downgraded to 2.6.18-1.2798.fc6 and the system is much more stable now.  Need to
let it run for a few days to be positive, but before it was crashing in minutes.
Up for 3 hours now.

Comment 6 Chuck Ebbert 2007-02-23 14:27:02 UTC
(In reply to comment #4)
> With FC6:
> 
> <SNIP>
> Feb 22 12:51:39 apapane kernel: Call Trace:

Where is the call trace from the syslog? Is it missing, or did you
just not post the entire oops message?

> seems to point to hardware corruption?  How else would we generate an invalid
> opcode?  memtest86 does run fine with no errors.

We deliberately execute an invalid instruction to force the oops message
when we hit BUG().
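
On x86_64 that boils down to something like the sketch below (simplified, not
the literal include/asm-x86_64/bug.h): ud2 is an instruction the CPU
guarantees will never be valid, and the resulting invalid-opcode trap is what
prints the "Kernel BUG at file:line" banner.

  /* Simplified sketch of the BUG() idea, not the verbatim kernel macro:
   * execute a guaranteed-invalid opcode so the invalid-opcode trap handler
   * can print the oops and kill the offending task. */
  #define BUG()                           \
          do {                            \
                  asm volatile("ud2");    \
          } while (0)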

Comment 7 Orion Poplawski 2007-02-23 15:43:51 UTC
(In reply to comment #6)
> 
> Where is the call trace from the syslog? Is it missing, or did you
> just not post the entire oops message?

It was missing from syslog.  alt-sysrq-c didn't trigger the kdump kernel either.
 Machine is still happily running this morning with the older kernel.

Comment 8 Orion Poplawski 2007-02-26 17:08:58 UTC
System has been running fine for almost 4 days now on the old kernel.  Looks
like it may be a kernel issue.  Let me know if there is anything else I can send
that would be useful.  Unfortunately, it's in heavy use at the moment, but we
may be able to do some testing later.

Comment 9 Orion Poplawski 2007-03-09 16:30:38 UTC
System will crash with old kernel (2.6.18-1.2798.fc6), though not as frequently.
 Console will lock up and it will start spewing *something* out of the gigE
ports that shuts down those networks.  Hooked up a serial console and got *tons*
of the following:

BUG: soft lockup detected on CPU#2!

Call Trace:
 [<ffffffff8026929b>] show_trace+0x34/0x47
 [<ffffffff802692c0>] dump_stack+0x12/0x17
 [<ffffffff802b4900>] softlockup_tick+0xdb/0xf6
 [<ffffffff80293837>] update_process_times+0x42/0x68
 [<ffffffff80273e48>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80274504>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c
Leftover inexact backtrace:
 <IRQ>  <EOI>  [<ffffffff80400d55>] pci_conf1_read+0x0/0xc0
 [<ffffffff80272a5c>] __smp_call_function+0x64/0x8b
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff80272b0f>] smp_call_function+0x32/0x49
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff8028e4b2>] on_each_cpu+0x10/0x22
 [<ffffffff8026e86f>] mcheck_timer+0x1c/0x6c
 [<ffffffff8024bf5f>] run_workqueue+0x9a/0xed
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024893a>] worker_thread+0xf0/0x122
 [<ffffffff80287150>] default_wake_function+0x0/0xe
 [<ffffffff80232843>] kthread+0xf6/0x12a
 [<ffffffff8025cea5>] child_rip+0xa/0x11
 [<ffffffff8023274d>] kthread+0x0/0x12a
 [<ffffffff8025ce9b>] child_rip+0x0/0x11



Comment 10 James Collins 2007-03-26 02:50:43 UTC
Created attachment 150867 [details]
var log messages from March 25th 2007 where Kernel error occurs.

Comment 11 James Collins 2007-03-26 02:54:41 UTC
Upgrading to kernel-xen-2.6.20-1.2933.fc6 caused very flaky operation on a
PostgreSQL server. Errors started at 21:02:07 with "kernel: no vm86_info: BAD".
This caused postmaster to roll back, and during that process the machine
spontaneously rebooted, so I didn't get any more error info. Something else
happened the next morning at 1:20 AM, but there was no error information in the
log; it just rebooted. Postgres died again at 2:04 AM, kernel errors streamed by
on the screen, eventually locking solid with "BUG: spinlock lockup on CPU#0".
Reverted to the previous kernel, kernel-xen-2.6.19-1.2911.6.5.fc6, and since
then everything has been smooth sailing.



Comment 12 Orion Poplawski 2007-10-17 19:55:33 UTC
System has been running RHEL now for 73 days without issue.  Closing.

