135432 – Kernel panic in ipv6_rcv

Summary: Kernel panic in ipv6_rcv

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	5
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	David Miller
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	FC3Blocker 131589

Reported:	2004-10-12 17:45 UTC by Seth Nickell
Modified:	2007-11-30 22:10 UTC
CC List:	11 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-03-17 20:27:21 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments
Screen shot showing the kernel panic (122.29 KB, image/jpeg) 2004-10-12 17:46 UTC, Seth Nickell
kernel panic on 640 (776.14 KB, image/jpeg) 2004-10-21 20:13 UTC, Bryan W Clark

Description Seth Nickell 2004-10-12 17:45:11 UTC

The attached screen shot shows the panic. It's triggered by enabling
NetworkManager on at least two laptops (IBM T41 and a Latitude LS)
with a variety of wireless cards (Cisco Aironet mini-PCI & PCMCIA, and
Prism 54 cardbus). We have no idea if it's related to wireless or not,
though that's going to be where NetworkManager exercises the kernel
most heavily.

It's present in at least kernel builds 541 & 585, though it occurs *far*
more frequently with 541 (usually within several minutes of starting
NetworkManager).

This seriously hampers our ability to deploy NetworkManager for RHEL4
(and, for example, has made us unable to safely demo NetworkManager
for fear the kernel will die and make our product look terribly unstable).

Comment 1 Seth Nickell 2004-10-12 17:46:40 UTC

Created attachment 105084
Screen shot showing the kernel panic

Comment 2 Dan Williams 2004-10-12 19:23:28 UTC

Was also triggered just by using iwconfig on the Latitude LS with the
prism54 card.  So it's not just a NetworkManager problem, though
NetworkManager might expose the issue more due to the fact that it can
make the ioctl() and iwlib calls faster than a human can on the
command line.

Comment 3 Seth Nickell 2004-10-15 16:42:43 UTC

Any other information we can provide to help track this down?

Comment 4 Bill Nottingham 2004-10-20 05:17:25 UTC

Presuming it persists on .639?

Comment 5 Dan Williams 2004-10-21 19:33:34 UTC

Persists in .640

Comment 6 David Miller 2004-10-21 19:45:17 UTC

After a cursory scan of net/core/wireless.c I'd say that
get_spydata() looks very suspicious.  If the driver has
a NULL dev->wireless_data and dev->wireless_handlers->spy_offset
is some random value, this will spam the driver private data
in indeterminate ways during iwconfig calls.

Which exact iwconfig commands are you invoking?  This may help
us track this down.

In general, the wireless kernel APIs are a complete mess and
much more complicated than they should be IMHO.
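
The hazard suspected above can be illustrated with a small userspace model. This is only a sketch, not the kernel code: the private-area layout, the constants, and the helper names (make_priv, get_spydata_view, read_tx_count) are hypothetical. It assumes the fallback works as described in this comment, i.e. when wireless_data is NULL the spy data is located by adding spy_offset to the driver private data.

```python
import struct

# Hypothetical layout of a driver's private area (not from any real driver):
#   bytes 0..3   tx_count (little-endian unsigned int)
#   bytes 4..11  spy data
TX_COUNT_OFFSET = 0
SPY_OFFSET = 4
SPY_SIZE = 8


def make_priv(tx_count):
    """Build a 12-byte private area with the given tx_count and zeroed spy data."""
    return bytearray(struct.pack("<I8s", tx_count, b""))


def get_spydata_view(priv, wireless_data, spy_offset):
    """Model of the lookup: prefer wireless_data, else index priv by spy_offset."""
    if wireless_data is not None:
        return memoryview(wireless_data)
    # Fallback path: spy_offset is trusted blindly, with no validation.
    return memoryview(priv)[spy_offset:spy_offset + SPY_SIZE]


def read_tx_count(priv):
    """Read the unrelated tx_count field back out of the private area."""
    return struct.unpack_from("<I", priv, TX_COUNT_OFFSET)[0]


if __name__ == "__main__":
    good = make_priv(7)
    get_spydata_view(good, None, SPY_OFFSET)[:] = b"\xff" * SPY_SIZE
    print("correct offset, tx_count =", read_tx_count(good))

    bad = make_priv(7)
    # A bogus offset of 0 makes the "spy data" write land on tx_count.
    get_spydata_view(bad, None, 0)[:] = b"\xff" * SPY_SIZE
    print("bogus offset, tx_count =", read_tx_count(bad))
```

With the correct offset the unrelated field survives; with the bogus offset the same write overwrites it, which is the kind of silent corruption of driver private data this comment is worried about.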

Comment 7 Bryan W Clark 2004-10-21 20:13:56 UTC

Created attachment 105607
kernel panic on 640

Comment 8 Dan Williams 2004-10-21 20:17:32 UTC

Any and all of the iwconfig commands.  NetworkManager (I've also had
the crash using iwconfig) uses wireless ioctl() commands quite a bit.
It's no particular command; it just usually freezes at various points
after you've been exercising the wireless driver (i.e., taking it down,
setting essid and key, bringing it up, etc).  You're unlikely to
reproduce this with iwconfig since you can't physically type the
commands fast enough.

Comment 9 Bryan W Clark 2004-10-28 18:56:16 UTC

still happening on 643 kernels

Comment 10 David Miller 2004-11-02 22:57:45 UTC

Bryan, can you check against 648 please?  There are a bunch
of ipv6 device and route refcounting fixes in there which
might cure this.

The bug seems to have to do with downing and upping interfaces
which ipv6 runs over.  Do the NetworkManager scripts up and
down interfaces quite a lot?  If so, then the whole deal with
the wireless ioctl() was a total red herring and has no bearing
upon this bug report.

Comment 11 Dan Williams 2004-11-03 14:38:16 UTC

Yes, NM used to up/down the interface quite a bit, for example
whenever it would connect to another access point, bring the interface
down, set WEP key and essid and mode, then bring it up.  While I still
think this is the technically correct approach, it seems to interfere
with drivers that load firmware onto the card, so that every time the
interface goes up again a firmware hotplug is triggered, at least on
Prism54 cards.  Upstream CVS no longer up/downs the interface nearly
as much, and doesn't do it when switching wireless networks which
seemed to be the trigger for the panic.
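
For anyone trying to reproduce this by hand, the down / configure / up cycle described above can be sketched as a list of commands. This is an illustration only: NetworkManager issues ioctl() calls directly rather than running these tools, and the helper name reconfigure_commands and its parameters are hypothetical. The function only builds the command lines; it does not execute anything.

```python
def reconfigure_commands(iface, essid, wep_key, mode="Managed"):
    """Return argv lists for one down / configure / up cycle.

    Nothing is executed here; the caller decides whether and how to run them.
    The order follows the description above: down, WEP key, essid, mode, up.
    """
    return [
        ["ifconfig", iface, "down"],
        ["iwconfig", iface, "key", wep_key],
        ["iwconfig", iface, "essid", essid],
        ["iwconfig", iface, "mode", mode],
        ["ifconfig", iface, "up"],
    ]


if __name__ == "__main__":
    # Example values are placeholders, not settings from this bug report.
    for argv in reconfigure_commands("eth1", "example-net", "0123456789"):
        print(" ".join(argv))
```

Actually running the cycle needs root and a real wireless card, and on an affected kernel it may hang the machine, so it is best tried only on a test system.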

Comment 12 Tim Burke 2004-11-03 22:42:58 UTC

So is this a bug to pursue from the kernel space, or something to work around
at user level?

Comment 13 David Miller 2004-11-03 23:40:32 UTC

It's a kernel side bug, we just need Bryan to,
as I stated in comment #10, retest with the most up to
date kernel images to see if it's been fixed or not.

I believe that upstream fixes that occurred going into
648 may have fixed this.  There has been a lot of churn
in ipv6 lately.

Comment 14 Dan Williams 2004-11-04 12:47:24 UTC

Well the bug here appears to be in kernel space since the thing
panics.  However, while we _don't_ know yet, I think NetworkManager
should trigger this bug less.  We'll have to see though.

Comment 15 Bryan W Clark 2004-11-04 19:59:01 UTC

I'm having a hard time reproducing this one, I'm using the 667 kernel
now.  I played with it for a while this morning and was able to make
it hang once.  However I haven't been able to do it again and I didn't
get the kernel panic output from it.

Comment 17 Jay Turner 2004-12-06 12:42:05 UTC

It's been over a month, so closing this out.  Please holler quickly and loudly
if this becomes an issue again.

Comment 18 Jon Orris 2005-01-03 21:37:11 UTC

I can consistently reproduce this on my laptop with a prism54 card
using the 681 kernel. If NetworkManager is started as a service, the
laptop will lock up completely within minutes of booting.

Laptop is a Thinkpad T30 running FC3 updated as of 1/3/05.
Wireless card is a Netgear WG511. 

The following is what shows up in /var/log/messages before the system
freezes:

Jan  3 14:56:05 localhost NetworkManager: nm_create_device_and_add_to_list(): adding device 'eth1' (wireless)
Jan  3 14:56:05 localhost NetworkManager: nm_create_device_and_add_to_list(): adding device 'eth0' (wired)
Jan  3 14:56:05 localhost NetworkManager: AUTO: Best wired device = (null)
Jan  3 14:56:05 localhost NetworkManager: AUTO: Best wireless device = eth1  ()
Jan  3 14:56:05 localhost NetworkManager:     SWITCH: best device changed
Jan  3 14:56:05 localhost NetworkManager: nm_state_modification_monitor(): beginning activation for device 'eth1'
Jan  3 14:56:07 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:07 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:07 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan  3 14:56:09 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:09 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:09 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan  3 14:56:11 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:11 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:11 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan  3 14:56:13 localhost kernel: eth1: timeout waiting for mgmt response
Jan  3 14:56:13 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:13 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:13 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan  3 14:56:15 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:15 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan  3 14:56:15 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan  3 14:56:16 localhost kernel: ACPI: PCI interrupt 0000:01:00.0[A] -> GSI 11 (level, low) -> IRQ 11

The only way I can get NetworkManager to work is:
 Disable NM service startup.
 Start eth1 on boot.
 After logging in, start NetworkManager and NetworkManagerInfo.
 Immediately select a network that NM finds. Delays in selecting the
 network will cause the laptop to lock up.

Comment 20 John W. Linville 2005-01-04 15:18:37 UTC

From comment 18, "Delays in selecting the network will cause the
laptop to lock up."  Interesting...?

Anyone (dcbw?) familiar enough w/ the workings of NetworkManager to
theorize on how its behaviour changes between before and after the
user selects a network?  Particularly w.r.t. how it interacts w/ the kernel?

Comment 21 Jon Orris 2005-01-04 16:49:59 UTC

Sometime yesterday after posting my comments, the 724 kernel became
available. I installed it on my laptop and NetworkManager no longer
appears to lock the machine. It has run for about 15 minutes without a
hang. I'm letting it run for a longer span to make sure.

However, wireless with NM no longer works at all. The log is full of
'eth1: timeout waiting for mgmt response' messages interspersed with
NM messages from comment 18. If I shut down the NM service, reinsert
the card & start networking normally, wireless works fine.

Comment 22 Jon Orris 2005-01-04 16:56:24 UTC

Heh. I spoke too soon. The laptop still locks up running NM; this time
it did it in about 10 minutes.

Comment 23 John Devereaux 2005-01-10 15:54:34 UTC

Started NetworkManagerInfo and I experienced FC3 locking up after
trying to connect to a WEP-enabled Access Point. The application was
attempting to connect then it locked up the system. I could not change
different consoles to reboot or terminate the PID.

Comment 25 Dan Williams 2005-02-10 13:09:37 UTC

davem:  I've seen the airo patch that sets all the data to NULL after releasing
it, but I'm unsure what version of the kernel that patch is in.  Is it likely
that other drivers may need to be patched in the same way?

Comment 26 Dave Jones 2005-10-06 03:07:50 UTC

Is this still occurring in current Fedora kernels?
If it's also occurring in RHEL4, this bug should be cloned, and appropriately
reclassified.

Comment 27 Dan Williams 2006-03-17 20:27:21 UTC

seems to have gone away
