Description of problem:
This bug is possibly related to bug 199944. It appears that xennet is occasionally crashing my domU.

Version-Release number of selected component (if applicable):
Linux epic 2.6.17-1.2157_FC5xenU #1 SMP Tue Jul 11 23:47:25 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

The system was fully updated according to yum pointed at default repositories as of August 4th.

How reproducible:
I haven't figured out how to reproduce yet. It seems easier to reproduce when I'm copying a lot of files up to the domU via an NFS mount, and I've seen the exact same kernel panic twice:

Unable to handle kernel NULL pointer dereference at 00000000000000d4 RIP:
<ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
PGD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: nfsd exportfs lockd nfs_acl deflate zlib_deflate twofish serpent aes blowfish des sha256 crypto_null af_key ipv6 sunrpc xennet ip_conntrack_netbios_ns ipt_LOG xt_limit xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables dm_snapshot dm_zero dm_mirror dm_mod
Pid: 0, comm: swapper Not tainted 2.6.17-1.2157_FC5xenU #1
RIP: e030:[<ffffffff880571f1>] <ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
RSP: e02b:ffffffff804b5e68 EFLAGS: 00010046
RAX: 0000000000000013 RBX: 0000000000000028 RCX: 0000000000000000
RDX: 0000000000000028 RSI: 0000000000000000 RDI: ffff88001e651738
RBP: ffff88001e650580 R08: ffffc20000000000 R09: 0000000000000000
R10: ffff88000089c330 R11: 0000000000000246 R12: 0000000000000010
R13: ffff88001e650000 R14: 000000000000be17 R15: 000000000000be18
FS: 00002aaaaae0f6f0(0000) GS:ffffffff8051b000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process swapper (pid: 0, threadinfo ffffffff80530000, task ffffffff80447dc0)
Stack: 0000be0800000028 0000000000000000 ffff88001e650000 ffff88001e650680
 ffff88001e650580 ffffffff80531ea8 ffffffff80531ea8 ffffffff8805845f
 0000000000000000 ffff88001ef07280
Call Trace: <IRQ> <ffffffff8805845f>{:xennet:netif_int+46}
 <ffffffff80212b5b>{handle_IRQ_event+78} <ffffffff802a6244>{__do_IRQ+154}
 <ffffffff80271ba4>{do_IRQ+60} <ffffffff8036868e>{evtchn_do_upcall+134}
 <ffffffff80267052>{do_hypervisor_callback+30} <EOI> <ffffffff802063aa>{hypercall_page+938}
 <ffffffff802063aa>{hypercall_page+938} <ffffffff8029550c>{rcu_pending+38}
 <ffffffff80272add>{safe_halt+132} <ffffffff8026fcd9>{xen_idle+112}
 <ffffffff8024ff79>{cpu_idle+174} <ffffffff805337e0>{start_kernel+524}
 <ffffffff805331e5>{_sinittext+485}

Code: f0 41 ff 8c 24 c4 00 00 00 0f 94 c0 84 c0 74 74 48 8b 15 a0
RIP <ffffffff880571f1>{:xennet:network_tx_buf_gc+227} RSP <ffffffff804b5e68>
CR2: 00000000000000d4
<3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <IRQ> <ffffffff80291cb2>{blocking_notifier_call_chain+31}
 <ffffffff80218000>{do_exit+32} <ffffffff8020c257>{do_page_fault+4470}
 <ffffffff80282c01>{__wake_up_common+62} <ffffffff80232b19>{__wake_up+56}
 <ffffffff80266fa7>{error_exit+0} <ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
 <ffffffff880571cc>{:xennet:network_tx_buf_gc+190} <ffffffff8805845f>{:xennet:netif_int+46}
 <ffffffff80212b5b>{handle_IRQ_event+78} <ffffffff802a6244>{__do_IRQ+154}
 <ffffffff80271ba4>{do_IRQ+60} <ffffffff8036868e>{evtchn_do_upcall+134}
 <ffffffff80267052>{do_hypervisor_callback+30} <EOI> <ffffffff802063aa>{hypercall_page+938}
 <ffffffff802063aa>{hypercall_page+938} <ffffffff8029550c>{rcu_pending+38}
 <ffffffff80272add>{safe_halt+132} <ffffffff8026fcd9>{xen_idle+112}
 <ffffffff8024ff79>{cpu_idle+174} <ffffffff805337e0>{start_kernel+524}
 <ffffffff805331e5>{_sinittext+485}
Kernel panic - not syncing: Aiee, killing interrupt handler!
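A rough sanity check on the oops above, in case it helps whoever picks this up. It assumes the Code: bytes start at the faulting instruction and that the first nine bytes decode as `lock decl 0xc4(%r12)`. That decoding was done by hand, not with a disassembler, so treat it as an assumption. Under it, the faulting address should be R12 plus the 32-bit displacement, which can be compared against CR2:

```python
# Values copied verbatim from the oops above.
code_bytes = bytes.fromhex("f0 41 ff 8c 24 c4 00 00 00")
r12 = 0x0000000000000010
cr2 = 0x00000000000000D4

# Assumed hand decoding (not verified with a disassembler):
#   f0          LOCK prefix
#   41          REX.B
#   ff /1       DEC r/m32
#   8c 24       ModRM + SIB selecting [r12 + disp32]
#   c4 00 00 00 disp32, little-endian
disp32 = int.from_bytes(code_bytes[5:9], "little")
fault_addr = r12 + disp32

print(hex(disp32), hex(fault_addr), fault_addr == cr2)
```

If the decoding holds, the computed address matches CR2, and the write fault is consistent with the error code 0002 in the "Oops:" line. This only shows that the register dump and the fault address agree; it does not by itself say why R12 held such a small value.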
These code paths are now completely different with the addition of SG support, so chances are that whatever bug is causing this is no longer there. The new code should filter through to Fedora soon.
I'm merging this with #199944 since the backtraces point to one issue.

*** This bug has been marked as a duplicate of 199944 ***
Ben, are you running with an Athlon processor? Is Herbert? I'm wondering if it is a weird bug where it matters exactly which processor is used.
I'm using a single dual-core Opteron, 4GB of RAM, and a Tyan K8WE motherboard. I can go for weeks without hitting this bug, and then there will be days when I crash every few hours. The frequency of crashing is at least roughly correlated with how much network traffic I'm generating. FYI, since I've updated to the new kernel (2.6.17-1.2174_FC5xenU) I haven't crashed... yet.
I ask because I have a single-core P4 that is running two XenUs (2.6.17-1.2145_FC5xenU) without a hitch so far. One is my primary mail and DNS server and gets a fair bit of activity, while the other is my LDAP server. The dual-core Athlon that runs two different XenUs hits that bug at least once a week. I didn't notice the .2174 update and will check whether it helps, although I don't know if this is the one with the SG code that Herbert was writing about. I also hope the SG code will help me with a different problem: running a PPPoE session with my ISP in a Xen0 while letting XenUs talk over that link.
Yes, I am on 32-bit, so this could well be the reason. I'll go look for possible 64-bit issues in the code.
Never mind, I just noticed that your original report is 32-bit :)
Good news, I've just reproduced the problem here. It should be straightforward now.