Description of problem:
Machine locked up.  On console:

BUG: spinlock lockup on CPU#3, events/3/17, ffff81007d611ee8 (Not tainted)

Call Trace:
 show_trace+0x34/0x47
 dump_stack+0x12/0x17
 _raw_spin_lock+0xc4/0xe5
 __drain_alien_cache+0x2d/0x71
 cache_reap+0x244/0x251
 run_workqueue+0xb2/0x10c
 worker_thread+0xec/0x11e
 kthread+0xd0/0x100
 child_rip+0xa/0x12

This is with 2.6.19-1.2288.fc5debug.

Above this was the tail end of another trace of some sort, perhaps a
recursive lockdep message.  It didn't get flushed to /var/log/messages
either:

 :xfs:xfs_iflush+0x168/0x51e
 :xfs:xfs_inode_flush+0x13c/0x165
 :xfs:xfs_fs_write_inode+0x28/0x50
 __writeback_single_inode+0x213/0x352
 sync_sb_inodes+0x1c1/0x282
 write_back_inodes+0x93/0xee
 background_writeout+0x82/0xb5
 pdflush+0x134/0x1df
 kthread+0xd0/0x100
 child_rip+0xa/0x12

Hardware: dual Opteron 275, Tyan S2882-D motherboard.

The machine has been locking up and spontaneously rebooting.  Had gotten
some EDAC chipkill errors before, but those seem to be gone after removing
one of the DIMMs.
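For reference, the "BUG: spinlock lockup" message comes from the debug
spinlock code in the fc5debug kernel: instead of spinning forever, the
debug slow path spins for roughly a second and then reports a suspected
lockup.  A minimal sketch of the idea, loosely reconstructed from 2.6-era
lib/spinlock_debug.c (names and details from memory, not verbatim):

#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/sched.h>

/* Sketch of the debug spinlock slow path (cf. 2.6-era lib/spinlock_debug.c).
 * Simplified from memory -- the real code also throttles how often it
 * prints and records the owner CPU/task inside the lock. */
static void __spin_lock_debug(spinlock_t *lock)
{
	u64 i;
	u64 loops = loops_per_jiffy * HZ;	/* roughly one second of spinning */

	for (;;) {
		for (i = 0; i < loops; i++) {
			if (__raw_spin_trylock(&lock->raw_lock))
				return;		/* got the lock; no lockup */
			__delay(1);
		}
		/* Still contended after ~1s: assume a lockup and report it --
		 * this is the message seen on the console above. */
		printk(KERN_EMERG "BUG: spinlock lockup on CPU#%d, %s/%d, %p\n",
		       raw_smp_processor_id(), current->comm,
		       current->pid, lock);
		dump_stack();
	}
}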
Are you using XFS on top of LVM and/or RAID?
(In reply to comment #1)
> Are you using XFS on top of LVM and/or RAID?

Nope, straight partition.

# mount
/dev/hdc6 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/hdc1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hdc7 on /scratch type xfs (rw)   *********
/dev/hdc3 on /usr type ext3 (rw)
/dev/hdc5 on /var type ext3 (rw)
tmpfs on /tmp type tmpfs (rw)

Do you think xfs could be causing the lockups?  Could easily make it an
ext3 fs instead...
System ran fine overnight with a couple of CPU-intensive loads.  Started up
a memory-intensive load that caused swapping, and it crashed fairly quickly
(a few minutes).  I'm going to reinstall with FC6 and no xfs /scratch
partition and see what happens there.

Screen from today's crash:

 show_trace+0x34/0x47
 dump_stack+0x12/0x17
 _raw_spin_lock+0xc4/0xe5
 cache_alloc_refill+0x8b/0x219
 kmem_cache_alloc+0xb1/0x107
 copy_process+0xa8/0x1748
 do_fork+0xb41/0x1af
 ptregscall_common+0x67/0xb0
 system_call+0x7e/0x83

BUG: spinlock lockup on CPU#2, events/2/16 .... (Not tainted)

Call Trace:
 show_trace+0x34/0x47
 dump_stack+0x12/0x17
 _raw_spin_lock+0xc4/0xe5
 drain_freelist+0x29/0xa9
 cache_reap+0x12d/0x251
 run_workqueue+0xb2/0x10c
 worker_thread+0xec/0x11e
 kthread+0xd0/0x100
 child_rip+0xa/0x12

Tried to collect a vmcore with kdump but ran out of space.  Will make sure
there is enough next time...
With FC6:

Feb 22 12:51:37 apapane kernel: BUGging on (!spin_is_locked(&cachep->nodelists[({ typeof(_proxy_pda.nodenumber) ret__; switch (sizeof(_proxy_pda.nodenumber)) { case 2: asm("mov" "w %%gs:%c1,%0" : "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 4: asm("mov" "l %%gs:%c1,%0" : "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; case 8: asm("mov" "q %%gs:%c1,%0" : "=r" (ret__) : "i" (__builtin_offsetof(struct x8664_pda,nodenumber)), "m" (_proxy_pda.nodenumber)); break; default: __bad_pda_field(); } ret__; })]->list_lock))
Feb 22 12:51:37 apapane kernel: ----------- [cut here ] --------- [please bite here ] ---------
Feb 22 12:51:37 apapane kernel: Kernel BUG at mm/slab.c:2350
Feb 22 12:51:37 apapane kernel: invalid opcode: 0000 [1] SMP
Feb 22 12:51:37 apapane kernel: last sysfs file: /devices/pci0000:00/0000:00:06.0/irq
Feb 22 12:51:37 apapane kernel: CPU 3
Feb 22 12:51:37 apapane kernel: Modules linked in: nfs lockd nfs_acl autofs4 w83627hf eeprom lm85 hwmon_vid i2c_isa sunrpc ipv6 cpufreq_ondemand dm_mirror dm_multipath dm_mod video sbs i2c_ec button battery asus_acpi ac parport_pc lp parport amd_rng i2c_amd8111 i2c_amd756 k8_edac pcspkr shpchp i2c_core edac_mc k8temp hwmon e100 mii serio_raw tg3 ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Feb 22 12:51:37 apapane kernel: Pid: 2744, comm: cupsd Not tainted 2.6.19-1.2911.fc6debug #1
Feb 22 12:51:37 apapane kernel: RIP: 0010:[<ffffffff8025d136>]  [<ffffffff8025d136>] cache_alloc_refill+0x112/0x219
Feb 22 12:51:37 apapane kernel: RSP: 0018:ffff8100bfa4ba08  EFLAGS: 00010086
Feb 22 12:51:37 apapane kernel: RAX: 0000000000000263 RBX: 0000000000000000 RCX: ffffffff80586390
Feb 22 12:51:37 apapane kernel: RDX: ffff8100bfec7100 RSI: 0000000000000000 RDI: ffffffff80586360
Feb 22 12:51:37 apapane kernel: RBP: ffff8100bfc81c78 R08: ffffffff80586390 R09: ffffffff80497c39
Feb 22 12:51:37 apapane kernel: R10: ffff81005a3ab590 R11: ffff810037e90300 R12: ffff810075f76da0
Feb 22 12:51:37 apapane kernel: R13: ffff810037f5e5c0 R14: ffff810037ec4700 R15: 000000000000000c
Feb 22 12:51:37 apapane kernel: FS:  00002aaaaefce560(0000) GS:ffff8100bff54470(0000) knlGS:0000000000000000
Feb 22 12:51:37 apapane kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Feb 22 12:51:37 apapane kernel: CR2: 0000000012331000 CR3: 00000000bece7000 CR4: 00000000000006e0
Feb 22 12:51:37 apapane kernel: Process cupsd (pid: 2744, threadinfo ffff8100bfa4a000, task ffff8100bfec7100)
Feb 22 12:51:38 apapane kernel: Stack:  0000000000000850 0000000000000000 ffff810037f5e5c0 0000000000000246
Feb 22 12:51:38 apapane kernel:  0000000000000850 ffffffff88022c3d 0000000000000001 ffffffff8020abf8
Feb 22 12:51:38 apapane kernel:  0000000000000000 ffff81007f7e93b0 ffff810069bad6d0 0000000000000000
Feb 22 12:51:39 apapane kernel: Call Trace:

Seems to point to hardware corruption?  How else would we generate an
invalid opcode?  memtest86 does run fine with no errors.
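The "BUGging on (...)" wall of inline asm at the top is just the
stringified BUG condition after macro expansion.  Before expansion, the
check that fires at mm/slab.c:2350 looks approximately like this
(reconstructed from 2.6.19-era mm/slab.c, not verbatim from the Fedora
debug patch).  numa_node_id() on x86-64 reads the node number out of the
per-CPU PDA via %gs, which is where all the asm("mov" ... %%gs ...) noise
in the log comes from:

#include <linux/spinlock.h>
#include <linux/topology.h>

/* Approximate unexpanded form of the slab NUMA debug check
 * (2.6.19-era mm/slab.c, from memory).  assert_spin_locked()
 * is roughly BUG_ON(!spin_is_locked(lock)). */
static void check_spinlock_acquired(struct kmem_cache *cachep)
{
#ifdef CONFIG_SMP
	check_irq_off();
	assert_spin_locked(&cachep->nodelists[numa_node_id()]->list_lock);
#endif
}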
Downgraded to 2.6.18-1.2798.fc6 and the system is much more stable now.
Need to let it run for a few days to be sure, but before, it was crashing
within minutes.  Up for 3 hours now.
(In reply to comment #4)
> With FC6:
>
> <SNIP>
> Feb 22 12:51:39 apapane kernel: Call Trace:

Where is the call trace from the syslog?  Is it missing, or did you just
not post the entire oops message?

> seems to point to hardware corruption? How else would we generate an
> invalid opcode? memtest86 does run fine with no errors.

We deliberately execute an invalid instruction to force the oops message
when we hit BUG().
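For reference, a minimal sketch of how BUG() forces that on x86-64
(simplified from memory; the real macro in 2.6-era include/asm-x86_64/bug.h
also embeds __FILE__/__LINE__ so the trap handler can print "Kernel BUG at
mm/slab.c:2350"):

/* Simplified sketch of the x86 BUG() mechanism.  "ud2" is an
 * architecturally guaranteed-invalid opcode; the CPU traps on it,
 * the invalid-opcode handler recognizes the BUG() site, and the
 * kernel prints the register dump and backtrace. */
#define BUG()								\
	do {								\
		asm volatile("ud2");	/* force a #UD trap */		\
		for (;;)		/* unreachable; keeps gcc quiet */ \
			;						\
	} while (0)

So "invalid opcode: 0000 [1] SMP" in the log is the expected signature of
hitting BUG(), not evidence of corruption by itself.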
(In reply to comment #6)
> Where is the call trace from the syslog?  Is it missing, or did you
> just not post the entire oops message?

It was missing from syslog.  alt-sysrq-c didn't trigger the kdump kernel
either.

Machine is still happily running this morning with the older kernel.
System has been running fine for almost 4 days now on the old kernel. Looks like it may be a kernel issue. Let me know if there is anything else I can send that would be useful. Unfortunately, it's in heavy use at the moment, but we may be able to do some testing later.
System will crash with the old kernel (2.6.18-1.2798.fc6) too, though not
as frequently.  The console will lock up and it will start spewing
*something* out of the gigE ports that shuts down those networks.  Hooked
up a serial console and got *tons* of the following:

BUG: soft lockup detected on CPU#2!

Call Trace:
 [<ffffffff8026929b>] show_trace+0x34/0x47
 [<ffffffff802692c0>] dump_stack+0x12/0x17
 [<ffffffff802b4900>] softlockup_tick+0xdb/0xf6
 [<ffffffff80293837>] update_process_times+0x42/0x68
 [<ffffffff80273e48>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80274504>] smp_apic_timer_interrupt+0x41/0x47
 [<ffffffff8025cb82>] apic_timer_interrupt+0x66/0x6c
DWARF2 unwinder stuck at apic_timer_interrupt+0x66/0x6c

Leftover inexact backtrace:

 <IRQ> <EOI> [<ffffffff80400d55>] pci_conf1_read+0x0/0xc0
 [<ffffffff80272a5c>] __smp_call_function+0x64/0x8b
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff80272b0f>] smp_call_function+0x32/0x49
 [<ffffffff8026f4fe>] mcheck_check_cpu+0x0/0x2f
 [<ffffffff8028e4b2>] on_each_cpu+0x10/0x22
 [<ffffffff8026e86f>] mcheck_timer+0x1c/0x6c
 [<ffffffff8024bf5f>] run_workqueue+0x9a/0xed
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024884a>] worker_thread+0x0/0x122
 [<ffffffff8024893a>] worker_thread+0xf0/0x122
 [<ffffffff80287150>] default_wake_function+0x0/0xe
 [<ffffffff80232843>] kthread+0xf6/0x12a
 [<ffffffff8025cea5>] child_rip+0xa/0x11
 [<ffffffff8023274d>] kthread+0x0/0x12a
 [<ffffffff8025ce9b>] child_rip+0x0/0x11
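A soft lockup is a different detector from the spinlock lockup above: it
means a CPU went about ten seconds without ever scheduling its per-CPU
watchdog thread, e.g. because it is stuck spinning in the kernel.  A rough
sketch of the mechanism (cf. 2.6.18-era kernel/softlockup.c, simplified
from memory):

#include <linux/percpu.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

/* Each CPU has a watchdog kthread that stamps this whenever it runs. */
static DEFINE_PER_CPU(unsigned long, touch_timestamp);

/* Called from the timer interrupt on every tick. */
void softlockup_tick(void)
{
	int this_cpu = smp_processor_id();
	unsigned long stamp = per_cpu(touch_timestamp, this_cpu);

	/* Watchdog thread starved for >10s: this CPU never scheduled,
	 * so something in the kernel is hogging it.  Report it. */
	if (time_after(jiffies, stamp + 10 * HZ)) {
		printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
		       this_cpu);
		dump_stack();
	}
}

In the backtrace above, CPU#2 appears to be stuck in __smp_call_function
waiting for other CPUs to respond to the MCE poll, which would be
consistent with another CPU already being wedged.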
Created attachment 150867 [details]
/var/log/messages from March 25th, 2007, where the kernel error occurs.
Upgrading to kernel-xen-2.6.20-1.2933.fc6 caused very flaky operation on a
PostgreSQL server.  Errors started at 21:02:07 with "kernel: no vm86_info:
BAD".  This caused postmaster to roll back, and during that process the
machine spontaneously rebooted, so I didn't get any more error info.
Something else happened the next morning at 1:20 AM, but there was no error
information in the log; it just rebooted.  Postgres died again at 2:04 AM;
kernel errors streamed by on the screen, eventually locking solid with
"BUG: spinlock lockup on CPU#0".

Reverted to the previous kernel, kernel-xen-2.6.19-1.2911.6.5.fc6, and
since then everything has been smooth sailing.
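The "no vm86_info: BAD" message comes from the 32-bit vm86 exit path: it
fires when the kernel returns from vm86 mode but the task has no saved
vm86 state to restore, which should never happen.  Roughly where it
originates (an approximate excerpt from 2.6-era arch/i386/kernel/vm86.c,
reconstructed from memory):

/* Approximate excerpt of save_v86_state() -- the source of
 * "no vm86_info: BAD".  Reaching this branch means the task's vm86
 * bookkeeping is gone, so the kernel gives up and kills the task. */
struct pt_regs *save_v86_state(struct kernel_vm86_regs *regs)
{
	if (!current->thread.vm86_info) {
		printk("no vm86_info: BAD\n");
		do_exit(SIGSEGV);
	}
	/* ... otherwise restore the saved 32-bit state and return ... */
}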
System has been running RHEL now for 73 days without issue. Closing.