Bug 221787 - Kernel crash -- BUG: soft lockup detected on CPU#0!
Summary: Kernel crash -- BUG: soft lockup detected on CPU#0!
Keywords:
Status: CLOSED DUPLICATE of bug 211672
Alias: None
Product: Fedora
Classification: Fedora
Component: xen
Version: 6
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Xen Maintainance List
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2007-01-08 00:36 UTC by Dave Bradley
Modified: 2007-11-30 22:11 UTC
CC: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-02-20 23:05:10 UTC
Type: ---
Embargoed:



Description Dave Bradley 2007-01-08 00:36:12 UTC
Description of problem:
After a period of time (1 to 8 days), the kernel crashes with a soft lockup
(see the trace under Additional info below).

Version-Release number of selected component (if applicable):
Linux zeus.bradleyland.com 2.6.18-1.2869.fc6xen #1 SMP Wed Dec 20 15:28:06 EST
2006 i686 athlon i386 GNU/Linux

How reproducible:
Always so far. 

Steps to Reproduce:
Unknown -- the machine has crashed 6 times since upgrading to the 2868 and 2869
kernels.

Actual results:
Crash

Expected results:
Stability

Additional info:

BUG: soft lockup detected on CPU#0!
 [<c0405707>] dump_trace+0x69/0x1af
 [<c0405865>] show_trace_log_lvl+0x18/0x2c
 [<c0405e05>] show_trace+0xf/0x11
 [<c0405e34>] dump_stack+0x15/0x17
 [<c0441aa5>] softlockup_tick+0xad/0xc4
 [<c0409280>] timer_interrupt+0x54b/0x593
 [<c0441d2a>] handle_IRQ_event+0x27/0x51
 [<c0441dea>] __do_IRQ+0x96/0xf2
 [<c0406ca1>] do_IRQ+0xab/0xbc
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
DWARF2 unwinder stuck at hypervisor_callback+0x3d/0x48

Leftover inexact backtrace:

 [<ee13cc74>] sis900_interrupt+0x42/0x6ab [sis900]
 [<c056054e>] ide_intr+0xfe/0x1ab
 [<c0441d2a>] handle_IRQ_event+0x27/0x51
 [<c0441dea>] __do_IRQ+0x96/0xf2
 [<c0406c94>] do_IRQ+0x9e/0xbc
 [<ee31b8ac>] br_handle_frame_finish+0xb2/0xcf [bridge]
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
 [<ee115502>] rtl8169_poll+0x77/0x1d4 [r8169]
 [<c05afc54>] net_rx_action+0x96/0x18b
 [<c0421140>] __do_softirq+0x5e/0xc3
 [<c0406d0b>] do_softirq+0x59/0xc6
 [<c0430aaa>] ktime_get+0x12/0x34
 [<c0406ca6>] do_IRQ+0xb0/0xbc
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
 [<c0408673>] raw_safe_halt+0x8c/0xaf
 [<c0402cb4>] xen_idle+0x22/0x2e
 [<c0402de5>] cpu_idle+0x91/0xab
 [<c073a7f8>] start_kernel+0x3aa/0x3b2
 [<c073a24a>] unknown_bootoption+0x0/0x204
 =======================

I've also seen this trace:

"zeus.log" [converted] 6813L, 370176C
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
 [<c0546168>] force_evtchn_callback+0xa/0xc
 [<c0441d1c>] handle_IRQ_event+0x19/0x51
 [<c0441dea>] __do_IRQ+0x96/0xf2
 [<c0406c94>] do_IRQ+0x9e/0xbc
 [<ee3198ac>] br_handle_frame_finish+0xb2/0xcf [bridge]
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
 [<ee104502>] rtl8169_poll+0x77/0x1d4 [r8169]
 [<c05afc54>] net_rx_action+0x96/0x18b
 [<c0421140>] __do_softirq+0x5e/0xc3
 [<c0406d0b>] do_softirq+0x59/0xc6
 [<c0430aaa>] ktime_get+0x12/0x34
 [<c0406ca6>] do_IRQ+0xb0/0xbc
 [<c0546849>] evtchn_do_upcall+0x64/0x9b
 [<c040506d>] hypervisor_callback+0x3d/0x48
 [<c0408673>] raw_safe_halt+0x8c/0xaf
 [<c0402cb4>] xen_idle+0x22/0x2e
 [<c0402de5>] cpu_idle+0x91/0xab
 [<c073a7f8>] start_kernel+0x3aa/0x3b2
 [<c073a24a>] unknown_bootoption+0x0/0x204
 =======================
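
For reference, here is a simplified userspace model of the watchdog behind the
"soft lockup" message, patterned on kernel/softlockup.c from 2.6.18-era kernels
(the 10-second threshold matches that code; everything else is reduced to a
sketch):

#include <stdio.h>

#define HZ 1000                        /* timer ticks per second */

static unsigned long jiffies;          /* tick counter, advanced by the timer */
static unsigned long touch_timestamp;  /* last time the watchdog thread ran */

/* The per-CPU watchdog thread calls this whenever it gets scheduled. */
static void touch_softlockup_watchdog(void)
{
        touch_timestamp = jiffies;
}

/* Called from the timer interrupt (cf. softlockup_tick in the trace above):
 * more than 10 seconds since the watchdog thread last ran means the CPU is
 * servicing interrupts but never scheduling anything, so the warning fires
 * and the kernel dumps the stack. */
static void softlockup_tick(int cpu)
{
        if (jiffies - touch_timestamp > 10 * HZ)
                printf("BUG: soft lockup detected on CPU#%d!\n", cpu);
}

int main(void)
{
        touch_softlockup_watchdog();
        jiffies += 11 * HZ;     /* 11 simulated seconds without scheduling */
        softlockup_tick(0);     /* prints the message from this report */
        return 0;
}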

Comment 1 Dave Bradley 2007-01-23 17:23:14 UTC
I've isolated my FC6/xen crashing problems to cron backup jobs that were running
in my dom0. I have moved the jobs into a domU and now observe the crashing
behavior in the domU. At least the entire environment doesn't come down when
it's in a domU.

In my crontab, I have a series of rsnapshot backup jobs to back up a handful of
Windows and Linux servers. For Windows machines, the script mounts a share on
the Windows machine using CIFS (Samba). It seems only the Windows backup jobs
crash the machine, and then only when two are scheduled to start at exactly the
same time.

I can replicate the problem by running the crontab commands from the command
line. If I run the commands one at a time, there is no crash. If I start them
both back to back, the crash occurs within 30 seconds or so.
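
The overlap can also be forced without cron; a throwaway harness along these
lines (the share names, mount points, and credentials file here are made up)
starts both CIFS mounts back to back:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Mount one Windows share over CIFS in a child process. */
static pid_t mount_share(const char *share, const char *mnt)
{
        pid_t pid = fork();
        if (pid == 0) {
                execlp("mount", "mount", "-t", "cifs", share, mnt,
                       "-o", "credentials=/root/.smbcred", (char *)NULL);
                _exit(127);     /* exec failed */
        }
        return pid;
}

int main(void)
{
        /* start both mounts with no delay between them, just as when
         * two cron jobs fire in the same minute */
        pid_t a = mount_share("//winbox1/backup", "/mnt/winbox1");
        pid_t b = mount_share("//winbox2/backup", "/mnt/winbox2");
        waitpid(a, NULL, 0);
        waitpid(b, NULL, 0);
        return 0;
}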

Under FC4, these scripts/backup jobs worked fine for almost a year without
intervention. I've read there have been a host of problems with CIFS in FC6, but
I thought they had been resolved. As a workaround, I can change the job schedule
for now, but something is still broken in the kernel, Samba, or both.

Here's a trace:

list_del corruption. prev->next should be c2f5c640, but was c2f50080
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:65!
invalid opcode: 0000 [#1]
SMP 
last sysfs file: /block/ram0/range
Modules linked in: nls_utf8 cifs ipv6 autofs4 hidp l2cap bluetooth iptable_raw
xt_policy xt_multiport ipt_ULOG ipt_TTL ipt_ttl ipt_TOS ipt_tos ipt_TCPMSS
ipt_SAME ipt_REJECT ipt_REDIRECT ipt_recent ipt_owner ipt_NETMAP ipt_MASQUERADE
ipt_LOG ipt_iprange ipt_hashlimit ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah
ipt_addrtype ip_nat_tftp ip_nat_snmp_basic ip_nat_pptp ip_nat_irc ip_nat_ftp
ip_nat_amanda ip_conntrack_tftp ip_conntrack_pptp ip_conntrack_netbios_ns
ip_conntrack_irc ip_conntrack_ftp ts_kmp ip_conntrack_amanda xt_tcpmss
xt_pkttype xt_physdev bridge xt_NFQUEUE xt_MARK xt_mark xt_mac xt_limit
xt_length xt_helper xt_dccp xt_conntrack xt_CONNMARK xt_connmark xt_CLASSIFY
xt_tcpudp xt_state iptable_nat ip_nat ip_conntrack iptable_mangle nfnetlink
iptable_filter ip_tables x_tables tun sunrpc xennet parport_pc lp parport pcspkr
dm_snapshot dm_zero dm_mirror dm_mod raid456 xor ext3 jbd xenblk
CPU:    0
EIP:    0061:[<c04e9d0b>]    Not tainted VLI
EFLAGS: 00010082   (2.6.19-1.2895.fc6xen #1)
EIP is at list_del+0x23/0x6c
eax: 00000048   ebx: c2f5c640   ecx: c0683b30   edx: f5416000
esi: c117a7c0   edi: c32af000   ebp: c117eda0   esp: c0d2def0
ds: 007b   es: 007b   ss: 0069
Process events/0 (pid: 5, ti=c0d2d000 task=c006e030 task.ti=c0d2d000)
Stack: c0646145 c2f5c640 c2f50080 c2f5c640 c0467706 c078afc0 c028c980 c0619b9d 
       00000014 00000002 c1176228 c1176220 00000014 c1176200 00000000 c0467809 
       00000000 00000000 c117eda0 c117a7e4 c117a7c0 c117eda0 c0d404a0 00000000 
Call Trace:
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468e22>] cache_reap+0x85/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
 =======================
Code: 00 00 89 c3 eb e8 90 90 53 89 c3 83 ec 0c 8b 40 04 8b 00 39 d8 74 1c 89 5c
24 04 89 44 24 08 c7 04 24 45 61 64 c0 e8 9a 4b f3 ff <0f> 0b 41 00 82 61 64 c0
8b 03 8b 40 04 39 d8 74 1c 89 5c 24 04 
EIP: [<c04e9d0b>] list_del+0x23/0x6c SS:ESP 0069:c0d2def0
 <3>BUG: sleeping function called from invalid context at kernel/rwsem.c:20
in_atomic():0, irqs_disabled():1
 [<c04056ff>] dump_trace+0x69/0x1b6
 [<c0405864>] show_trace_log_lvl+0x18/0x2c
 [<c0405e4b>] show_trace+0xf/0x11
 [<c0405e7a>] dump_stack+0x15/0x17
 [<c0433252>] down_read+0x12/0x28
 [<c042aca2>] blocking_notifier_call_chain+0xe/0x29
 [<c0420d75>] do_exit+0x1b/0x787
 [<c0405dec>] die+0x2af/0x2d4
 [<c0406262>] do_invalid_op+0xa2/0xab
 [<c0619deb>] error_code+0x2b/0x30
 [<c04e9d0b>] list_del+0x23/0x6c
 [<c0467706>] free_block+0x77/0xf0
 [<c0467809>] drain_array+0x8a/0xb5
 [<c0468e22>] cache_reap+0x85/0x117
 [<c042d603>] run_workqueue+0x97/0xdd
 [<c042dfc0>] worker_thread+0xd9/0x10d
 [<c043058c>] kthread+0xc0/0xec
 [<c0405253>] kernel_thread_helper+0x7/0x10
 =======================
BUG: spinlock lockup on CPU#0, rsync/11148, c117a7e4 (Not tainted)
 [<c04056ff>] dump_trace+0x69/0x1b6
 [<c0405864>] show_trace_log_lvl+0x18/0x2c
 [<c0405e4b>] show_trace+0xf/0x11
 [<c0405e7a>] dump_stack+0x15/0x17
 [<c04e9b6f>] _raw_spin_lock+0xbf/0xdc
 [<c0467a45>] cache_alloc_refill+0x74/0x4dc
 [<c04679b8>] kmem_cache_alloc+0x54/0x6d
 [<c0413ec1>] pgd_alloc+0x54/0x230
 [<c041c020>] mm_init+0x94/0xb9
 [<c047056d>] do_execve+0x6f/0x1f5
 [<c0402e08>] sys_execve+0x2f/0x4f
 [<c0404efb>] syscall_call+0x7/0xb
 [<00b98402>] 0xb98402
 =======================
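
For reference, the BUG that fired is the list sanity check in list_del(); a
simplified userspace model (patterned on lib/list_debug.c of this era, with
abort() standing in for the kernel's BUG()):

#include <stdio.h>
#include <stdlib.h>

struct list_head {
        struct list_head *next, *prev;
};

/* Before unlinking, verify the neighbours still point at this entry;
 * a slab object freed twice or scribbled on fails this test. */
static void debug_list_del(struct list_head *entry)
{
        if (entry->prev->next != entry) {
                fprintf(stderr, "list_del corruption. prev->next should be "
                        "%p, but was %p\n",
                        (void *)entry, (void *)entry->prev->next);
                abort();        /* kernel: BUG(), giving the trace above */
        }
        entry->next->prev = entry->prev;
        entry->prev->next = entry->next;
}

int main(void)
{
        struct list_head a, b;

        a.next = &b; a.prev = &b;
        b.next = &a; b.prev = &a;
        b.prev->next = NULL;    /* simulate the slab corruption */
        debug_list_del(&b);     /* prints the message, then aborts */
        return 0;
}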



Comment 2 Dave Bradley 2007-01-27 00:42:36 UTC
Further investigation revealed that this problem is caused by a bug in the cifs
driver, specifically in the sess.c file. The following patch (which I applied
manually and recompiled) appears to have solved the problem -- no more crashes
on mount:

diff -u sess.c sess.c.mod
--- sess.c      2006-08-02 16:15:17.000000000 -0500
+++ sess.c.mod  2006-12-21 09:43:19.000000000 -0600
@@ -179,10 +179,9 @@
        cFYI(1,("bleft %d",bleft));


-       /* word align, if bytes remaining is not even */
-       if(bleft % 2) {
+       /* word align, if bytes remaining is even */
+       if(!(bleft % 2)) {
                bleft--;
-               data++;
        }
        words_left = bleft / 2;

@@ -506,6 +505,7 @@
        /* and lanman response is 3 */
        bytes_remaining = BCC(smb_buf);
        bcc_ptr = pByteArea(smb_buf);
+       bcc_ptr++;

        if(smb_buf->WordCount == 4) {
                __u16 blob_len;
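
For anyone following along, the first hunk touches the word-align step that
precedes the driver's 16-bit (Unicode) string parsing. A standalone sketch of
that logic (decode_unicode_area() and the sample buffer are made up for
illustration) shows how a one-byte shift in the starting offset garbles every
character that follows:

#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in for the sess.c parsing loop: align to a 16-bit
 * boundary the same way the original code does, then print UTF-16LE
 * code units until a NUL. */
static void decode_unicode_area(const uint8_t *data, int bleft)
{
        int i, words_left;

        if (bleft % 2) {        /* odd byte count: skip the pad byte */
                bleft--;
                data++;
        }
        words_left = bleft / 2;
        for (i = 0; i < words_left; i++) {
                uint16_t c = data[2 * i] | (data[2 * i + 1] << 8);
                if (c == 0)
                        break;
                putchar(c < 128 ? (int)c : '?');
        }
        putchar('\n');
}

int main(void)
{
        /* one pad byte, then "host" in UTF-16LE: nine bytes in total */
        uint8_t area[] = { 0x00, 'h', 0, 'o', 0, 's', 0, 't', 0 };

        decode_unicode_area(area, sizeof(area));        /* prints "host" */
        decode_unicode_area(area, sizeof(area) - 1);    /* prints "????" */
        return 0;
}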




Comment 3 Dave Bradley 2007-02-20 23:05:10 UTC

*** This bug has been marked as a duplicate of 211672 ***

