Attached screen shot shows the panic. It's triggered by enabling NetworkManager on at least two laptops (IBM T41 and a Latitude LS) with a variety of wireless cards (Cisco Aironet mini-PCI & PCMCIA, and Prism 54 cardbus). We have no idea if it's related to wireless or not, though that's going to be where NetworkManager exercises the kernel most heavily. It's present in at least kernel builds 541 & 585, though it occurs *far* more frequently with 541 (usually within several minutes of starting NetworkManager). This seriously hampers our ability to deploy NetworkManager for RHEL4 (and, for example, has made us unable to safely demo NetworkManager for fear the kernel will die and make our product look terribly unstable).
Created attachment 105084 [details] Screen shot showing the kernel panic
Was also triggered just by using iwconfig on the Latitude LS with the prism54 card. So it's not just a NetworkManager problem, though NetworkManager might expose the issue more because it can make the ioctl() and iwlib calls faster than a human can on the command line.
Any other information we can provide to help track this down?
Presuming it persists on .639?
Persists in .640
After a cursory scan of net/core/wireless.c I'd say that get_spydata() looks very suspicious. If the driver has a NULL dev->wireless_data and dev->wireless_handlers->spy_offset is some random value, this will spam the driver private data in indeterminate ways during iwconfig calls. Which exact iwconfig commands are you invoking? This may help us track this down. In general, the wireless kernel APIs are a complete mess and much more complicated than they should be IMHO.
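To illustrate the hazard described above, here is a minimal user-space sketch, not the kernel's actual get_spydata(). The struct names (fake_dev, fake_priv, spy_data), the priv_size field, and the function get_spydata_checked() are all hypothetical stand-ins; the real kernel structures differ and may not carry a size to check against. It only shows why adding an unvalidated offset to the driver private pointer can land on unrelated memory.

```c
#include <stddef.h>

/* Hypothetical stand-in for the spy data the wireless core wants to reach. */
struct spy_data {
    int spy_number;
};

/* Hypothetical driver private area with the spy data embedded in it. */
struct fake_priv {
    char other_state[32];
    struct spy_data spy;
};

/* Hypothetical device: an explicit pointer (may be NULL) plus a legacy offset. */
struct fake_dev {
    void *priv;                  /* driver private data */
    size_t priv_size;            /* assumed known here; an illustration only */
    struct spy_data *spy_ptr;    /* plays the role of the explicit pointer */
    size_t spy_offset;           /* offset of the spy data inside priv */
};

/* Return the spy data, or NULL when the offset cannot be trusted. */
struct spy_data *get_spydata_checked(const struct fake_dev *dev)
{
    if (dev->spy_ptr != NULL)
        return dev->spy_ptr;
    if (dev->priv == NULL)
        return NULL;
    /* Reject offsets that would point past the end of the private area. */
    if (dev->spy_offset > dev->priv_size ||
        dev->priv_size - dev->spy_offset < sizeof(struct spy_data))
        return NULL;
    return (struct spy_data *)((char *)dev->priv + dev->spy_offset);
}
```

Without the bounds check, a garbage spy_offset would yield a pointer into arbitrary memory, and any write through it would corrupt whatever happens to live there.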
Created attachment 105607 [details] kernel panic on 640
Any and all of the iwconfig commands. NetworkManager (I've also had the crash using iwconfig) uses wireless ioctl() commands quite a bit. It's no particular command; it just usually freezes at various points after you've been exercising the wireless driver (i.e., taking it down, setting essid and key, bringing it up, etc). You're unlikely to reproduce this with iwconfig since you can't physically type the commands fast enough.
still happening on 643 kernels
Bryan, can you check against 648 please? There are a bunch of ipv6 device and route refcounting fixes in there which might cure this. The bug seems to have to do with downing and upping interfaces which ipv6 runs over. Do the NetworkManager scripts up and down interfaces quite a lot? If so, then the whole deal with the wireless ioctl() was a total red herring and has no bearing upon this bug report.
Yes, NM used to up/down the interface quite a bit. For example, whenever it would connect to another access point, it would bring the interface down, set WEP key and essid and mode, then bring it up. While I still think this is the technically correct approach, it seems to interfere with drivers that load firmware onto the card, so that every time the interface goes up again a firmware hotplug is triggered, at least on Prism54 cards. Upstream CVS no longer up/downs the interface nearly as much, and doesn't do it when switching wireless networks, which seemed to be the trigger for the panic.
So is this a bug to pursue from the kernel space, or something to work-around the behavior at user level?
It's a kernel side bug, we just need Bryan to, as I stated in comment #10, retest with the most up to date kernel images to see if it's been fixed or not. I believe that upstream fixes that occurred going into 648 may have fixed this. There has been a lot of churn in ipv6 lately.
Well the bug here appears to be in kernel space since the thing panics. However, while we _don't_ know yet, I think NetworkManager should trigger this bug less. We'll have to see though.
I'm having a hard time reproducing this one. I'm using the 667 kernel now. I played with it for a while this morning and was able to make it hang once. However, I haven't been able to do it again and I didn't get the kernel panic output from it.
It's been over a month, so closing this out. Please holler quickly and loudly if this becomes an issue again.
I can consistently reproduce this on my laptop with a prism54 card using the 681 kernel. If NetworkManager is started as a service, the laptop will lock up completely within minutes of booting. Laptop is a Thinkpad T30 running FC3 updated as of 1/3/05. Wireless card is a Netgear WG511.

The following is what shows up in /var/log/messages before the system freezes:

Jan 3 14:56:05 localhost NetworkManager: nm_create_device_and_add_to_list(): adding device 'eth1' (wireless)
Jan 3 14:56:05 localhost NetworkManager: nm_create_device_and_add_to_list(): adding device 'eth0' (wired)
Jan 3 14:56:05 localhost NetworkManager: AUTO: Best wired device = (null)
Jan 3 14:56:05 localhost NetworkManager: AUTO: Best wireless device = eth1 ()
Jan 3 14:56:05 localhost NetworkManager: SWITCH: best device changed
Jan 3 14:56:05 localhost NetworkManager: nm_state_modification_monitor(): beginning activation for device 'eth1'
Jan 3 14:56:07 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:07 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:07 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan 3 14:56:09 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:09 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:09 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan 3 14:56:11 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:11 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:11 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan 3 14:56:13 localhost kernel: eth1: timeout waiting for mgmt response
Jan 3 14:56:13 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:13 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:13 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan 3 14:56:15 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:15 localhost NetworkManager: HAVELINK: act=0 && (dev_crypt=0 <= prev_crypt=0)
Jan 3 14:56:15 localhost NetworkManager: LINK: !HAVE=1, (best_ap=0x0 && (is_enc=0 && (!source=1 || !len_source=0)))
Jan 3 14:56:16 localhost kernel: ACPI: PCI interrupt 0000:01:00.0[A] -> GSI 11 (level, low) -> IRQ 11

The only way I can get NetworkManager to work is:

1. Disable NM service startup.
2. Start eth1 on boot.
3. After logging in, start NetworkManager and NetworkManagerInfo.
4. Immediately select a network that NM finds.

Delays in selecting the network will cause the laptop to lock up.
From comment 18, "Delays in selecting the network will cause the laptop to lock up." Interesting...? Is anyone (dcbw?) familiar enough w/ the workings of NetworkManager to theorize on how its behaviour changes between before and after the user selects a network? Particularly w.r.t. how it interacts w/ the kernel?
Sometime yesterday after posting my comments, the 724 kernel became available. I installed it on my laptop and NetworkManager no longer appears to lock the machine. It has run for about 15 minutes without a hang. I'm letting it run for a longer span to make sure. However, wireless with NM no longer works at all. The log is full of 'eth1: timeout waiting for mgmt response' messages interspersed with NM messages from comment 18. If I shut down the NM service, reinsert the card & start networking normally, wireless works fine.
Heh. I spoke too soon. The laptop still locks up running NM; this time it did it in about 10 minutes.
I started NetworkManagerInfo and FC3 locked up after I tried to connect to a WEP-enabled access point. The application was attempting to connect when it locked up the system. I could not switch to a different console to reboot or to kill the process.
davem: I've seen the airo patch that sets all the data to NULL after releasing it, but I'm unsure what version of the kernel that patch is in. Is it likely that other drivers may need to be patched in the same way?
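For readers unfamiliar with the pattern being asked about, here is a minimal user-space sketch of "set the pointer to NULL after releasing it". It is not the airo patch itself; struct fake_card, fake_card_up() and fake_card_down() are hypothetical names used only for illustration. The point is that a repeated down (or a later access) sees NULL instead of a dangling pointer.

```c
#include <stdlib.h>

/* Hypothetical per-card state holding one dynamically allocated buffer. */
struct fake_card {
    unsigned char *buf;
    size_t len;
};

/* Allocate the buffer when the interface comes up. Returns 0 or -1. */
int fake_card_up(struct fake_card *c, size_t len)
{
    if (c->buf != NULL)
        return 0;               /* already up, nothing to do */
    c->buf = calloc(1, len);
    if (c->buf == NULL)
        return -1;
    c->len = len;
    return 0;
}

/* Release the buffer and clear the pointer so a second call is harmless. */
void fake_card_down(struct fake_card *c)
{
    free(c->buf);               /* free(NULL) is a no-op */
    c->buf = NULL;
    c->len = 0;
}
```

If the pointer were left untouched after free(), a second fake_card_down() would be a double free, which is the kind of corruption that rapid up/down cycling tends to expose.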
Is this still occurring in current Fedora kernels? If it's also occurring in RHEL4, this bug should be cloned and appropriately reclassified.
seems to have gone away