Bug 201504

Summary:	kernel crashing (xennet?)
Product:	[Fedora] Fedora	Reporter:	Ben <bench>
Component:	xen	Assignee:	Herbert Xu <herbert.xu>
Status:	CLOSED DUPLICATE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	5	CC:	bstein, katzj, russell
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2006-08-15 10:23:50 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Ben 2006-08-06 16:17:51 UTC

Description of problem:

This bug is possibly related to bug 199944. It appears that xennet is
occasionally crashing my domU.

Version-Release number of selected component (if applicable):

Linux epic 2.6.17-1.2157_FC5xenU #1 SMP Tue Jul 11 23:47:25 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux

The system was fully updated according to yum pointed at default repositories as
of August 4th.

How reproducible:

I haven't figured out how to reproduce it yet. It seems easier to reproduce when
I'm copying a lot of files up to the domU via an NFS mount, and I've seen the
exact same kernel panic twice:

Unable to handle kernel NULL pointer dereference at 00000000000000d4 RIP:
<ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
PGD 0 
Oops: 0002 [1] SMP 
CPU 0 
Modules linked in: nfsd exportfs lockd nfs_acl deflate zlib_deflate twofish
serpent aes blowfish des sha256 crypto_null af_key ipv6 sunrpc xennet
ip_conntrack_netbios_ns ipt_LOG xt_limit xt_state ip_conntrack nfnetlink
xt_tcpudp iptable_filter ip_tables x_tables dm_snapshot dm_zero dm_mirror dm_mod
Pid: 0, comm: swapper Not tainted 2.6.17-1.2157_FC5xenU #1
RIP: e030:[<ffffffff880571f1>] <ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
RSP: e02b:ffffffff804b5e68  EFLAGS: 00010046
RAX: 0000000000000013 RBX: 0000000000000028 RCX: 0000000000000000
RDX: 0000000000000028 RSI: 0000000000000000 RDI: ffff88001e651738
RBP: ffff88001e650580 R08: ffffc20000000000 R09: 0000000000000000
R10: ffff88000089c330 R11: 0000000000000246 R12: 0000000000000010
R13: ffff88001e650000 R14: 000000000000be17 R15: 000000000000be18
FS:  00002aaaaae0f6f0(0000) GS:ffffffff8051b000(0000) knlGS:0000000000000000
CS:  e033 DS: 0000 ES: 0000
Process swapper (pid: 0, threadinfo ffffffff80530000, task ffffffff80447dc0)
Stack: 0000be0800000028 0000000000000000 ffff88001e650000 ffff88001e650680 
       ffff88001e650580 ffffffff80531ea8 ffffffff80531ea8 ffffffff8805845f 
       0000000000000000 ffff88001ef07280 
Call Trace: <IRQ> <ffffffff8805845f>{:xennet:netif_int+46}
       <ffffffff80212b5b>{handle_IRQ_event+78} <ffffffff802a6244>{__do_IRQ+154}
       <ffffffff80271ba4>{do_IRQ+60} <ffffffff8036868e>{evtchn_do_upcall+134}
       <ffffffff80267052>{do_hypervisor_callback+30} <EOI>
       <ffffffff802063aa>{hypercall_page+938} <ffffffff802063aa>{hypercall_page+938}
       <ffffffff8029550c>{rcu_pending+38} <ffffffff80272add>{safe_halt+132}
       <ffffffff8026fcd9>{xen_idle+112} <ffffffff8024ff79>{cpu_idle+174}
       <ffffffff805337e0>{start_kernel+524} <ffffffff805331e5>{_sinittext+485}

Code: f0 41 ff 8c 24 c4 00 00 00 0f 94 c0 84 c0 74 74 48 8b 15 a0 
RIP <ffffffff880571f1>{:xennet:network_tx_buf_gc+227} RSP <ffffffff804b5e68>
CR2: 00000000000000d4
 <3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():1, irqs_disabled():1

Call Trace: <IRQ> <ffffffff80291cb2>{blocking_notifier_call_chain+31}
       <ffffffff80218000>{do_exit+32} <ffffffff8020c257>{do_page_fault+4470}
       <ffffffff80282c01>{__wake_up_common+62} <ffffffff80232b19>{__wake_up+56}
       <ffffffff80266fa7>{error_exit+0}
<ffffffff880571f1>{:xennet:network_tx_buf_gc+227}
       <ffffffff880571cc>{:xennet:network_tx_buf_gc+190}
<ffffffff8805845f>{:xennet:netif_int+46}
       <ffffffff80212b5b>{handle_IRQ_event+78} <ffffffff802a6244>{__do_IRQ+154}
       <ffffffff80271ba4>{do_IRQ+60} <ffffffff8036868e>{evtchn_do_upcall+134}
       <ffffffff80267052>{do_hypervisor_callback+30} <EOI>
       <ffffffff802063aa>{hypercall_page+938} <ffffffff802063aa>{hypercall_page+938}
       <ffffffff8029550c>{rcu_pending+38} <ffffffff80272add>{safe_halt+132}
       <ffffffff8026fcd9>{xen_idle+112} <ffffffff8024ff79>{cpu_idle+174}
       <ffffffff805337e0>{start_kernel+524} <ffffffff805331e5>{_sinittext+485}
Kernel panic - not syncing: Aiee, killing interrupt handler!
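
A note on reading the oops above, offered as a sketch rather than a diagnosis. Assuming the Code: bytes start at the faulting RIP (an assumption about this oops format, not something the report states), the first nine bytes can be hand-decoded as a locked 32-bit decrement of memory at base register plus displacement. The short Python script below does that decoding for this one instruction form only and checks the computed address against CR2. The register values and bytes are copied from the log; the helper name decode_lock_dec is made up for illustration.

```python
# Values copied from the oops above.
R12 = 0x0000000000000010
CR2 = 0x00000000000000d4
CODE = bytes.fromhex("f0 41 ff 8c 24 c4 00 00 00")


def decode_lock_dec(code):
    """Hand-decode the single instruction form seen in this oops:
    LOCK, REX prefix, FF /1 (DEC r/m32), ModRM with disp32 + SIB.
    Returns (base register number, displacement).
    Hypothetical helper: it rejects anything else via assertions."""
    assert code[0] == 0xF0                # LOCK prefix
    rex = code[1]
    assert rex & 0xF0 == 0x40             # REX prefix byte
    assert code[2] == 0xFF                # opcode group containing INC/DEC
    modrm = code[3]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert (mod, reg, rm) == (2, 1, 4)    # disp32, /1 = DEC, SIB byte follows
    sib = code[4]
    assert (sib >> 3) & 7 == 4            # index field 100 = no index register
    base = (sib & 7) | ((rex & 1) << 3)   # REX.B extends the SIB base field
    disp = int.from_bytes(code[5:9], "little", signed=True)
    return base, disp


base, disp = decode_lock_dec(CODE)
fault_addr = R12 + disp
print(f"base=r{base} disp={disp:#x} fault={fault_addr:#x} "
      f"matches_cr2={fault_addr == CR2}")
```

Under that assumption the instruction is a locked decrement at 0xc4(%r12), and R12 (0x10) plus 0xc4 gives 0xd4, which matches CR2. The base "pointer" is therefore a small integer, not a kernel address. A locked decrement followed by a test looks like a reference-count drop, but which structure and field sit at offset 0xc4 is not established by this report.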

Comment 1 Herbert Xu 2006-08-07 13:13:06 UTC

These code paths are now completely different with the addition of SG support.
So chances are whatever bug is causing this is no longer there.

The new code should filter through to Fedora soon.

Comment 2 Herbert Xu 2006-08-15 10:23:50 UTC

I'm merging this with #199944 since the backtraces point to one issue.

*** This bug has been marked as a duplicate of 199944 ***

Comment 3 Russell McOrmond 2006-08-15 14:08:49 UTC

Ben,

Are you running with an Athlon processor?

Is Herbert?

I'm wondering if it is a weird bug where it matters exactly which processor is used.

Comment 4 Russell McOrmond 2006-08-15 14:09:20 UTC

Ben,

Are you running with an Athlon processor?

Is Herbert?

I'm wondering if it is a weird bug where it matters exactly which processor is used.

Comment 5 Ben 2006-08-15 15:04:06 UTC

I'm using a single dual-core Opteron, 4GB of RAM, and a Tyan K8WE motherboard. I
can go for weeks without hitting this bug, and then there will be days when I
crash every few hours. The frequency of crashing is at least roughly correlated
with how much network traffic I'm generating.

FYI, since I've updated to the new kernel (2.6.17-1.2174_FC5xenU) I haven't
crashed..... yet.

Comment 6 Russell McOrmond 2006-08-15 20:15:13 UTC

I ask because I have a single core P4 that is running two XenU's
(2.6.17-1.2145_FC5xenU) without a hitch so far.  One is my primary mail and DNS
server and gets a fair bit of activity, while the other is my LDAP server.

The Dual-core Athlon that runs 2 different XenU's hits that bug at least once a
week.


I didn't notice the .2174 update and will check out if it will help, although I
don't know if this is the one with the SG code that Herbert was writing about.

I also hope the SG code will help me with a different problem of running a PPPoE
session with my ISP in a Xen0 and being able to have XenU's talk over that link.

Comment 7 Herbert Xu 2006-08-16 01:44:28 UTC

Yes I am on 32-bit so this could well be the reason.  I'll go look for possible
64-bit issues in the code.

Comment 8 Herbert Xu 2006-08-16 01:46:48 UTC

Never mind, I just noticed that your original report is 32-bit :)

Comment 9 Herbert Xu 2006-08-16 03:13:43 UTC

Good news, I've just reproduced the problem here.  It should be straightforward now.