Bug 575470

Summary: Hard lock-up with b43 / Broadcom Corporation BCM4312 802.11a/b/g (rev 02) on kernels after 2.6.31.6-166.fc12.i686
Product: [Fedora] Fedora
Component: kernel
Version: 12
Hardware: i686
OS: Linux
Status: CLOSED CANTFIX
Severity: high
Priority: low
Keywords: Reopened
Reporter: Luke Ross <luke>
Assignee: John W. Linville <linville>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: anton, dougsland, gansalmon, itamar, jonathan, kernel-maint, larry.finger, linville, sgruszka
Doc Type: Bug Fix
Last Closed: 2010-06-21 14:29:06 UTC

Description Luke Ross 2010-03-20 22:16:23 UTC
Created attachment 401498
dmesg from boot, through to login and attempt to associate with a network. Terminates when the machine locks. Captured using firescope.

Description of problem:
On kernels since 2.6.31.9-174.fc12.i686 my machine regularly locks up hard when wireless is enabled, requiring a reboot. About 50% of the time it locks up somewhere between halfway through the charge animation and completing log-in. On the approx. 20% of restarts where I can reach the desktop, attempting to associate with any wireless network (encrypted or unencrypted) is guaranteed to lock the machine. If the wireless is disabled using the hardware wireless disable button, the machine does not lock.

Version-Release number of selected component (if applicable):
First happened with kernel 2.6.31.9-174.fc12.i686, and has affected every F12 update kernel since, up to and including 2.6.32.9-70.fc12.i686. 2.6.31.6-166.fc12.i686 and earlier are unaffected: wireless works normally with those kernels.

How reproducible:
The point of lock-up varies; it seems it can happen at any time from about mid-way through boot. It does not occur at all if the radio is disabled. If the machine survives boot, I can always force a lock-up by associating with a wireless network.
  
Additional info:
Machine is an HP 2133 laptop. lspci details the card as follows:
02:00.0 Network controller: Broadcom Corporation BCM4312 802.11a/b/g (rev 02)
	Subsystem: Hewlett-Packard Company Broadcom 802.11a/b/g WLAN
	Flags: bus master, fast devsel, latency 0, IRQ 24
	Memory at fdffc000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: <access denied>
	Kernel driver in use: b43-pci-bridge
	Kernel modules: ssb

I'm using the Broadcom firmware and not the openfwwf. Non-graphical booting reveals nothing logged to the screen at the point of lock.

Comment 1 John W. Linville 2010-04-07 13:13:06 UTC

*** This bug has been marked as a duplicate of bug 533746 ***

Comment 2 Luke Ross 2010-04-07 22:43:18 UTC
Are you sure that this is a duplicate of bug 533746?

As far as I can see that bug is about people who went from having a working machine with no wireless to getting a consistent hard boot at udev. I've gone from fully working wireless to getting random lockups that occur during or after boot, although not prior to starting udev - on a good day the machine may run for ten minutes or so prior to locking.

kernel-2.6.31.6-166.fc12.i686 was rock-solid with fully working wireless; everything since has failed on this machine. I tried kernel-2.6.32.10-90.fc12.i686 and kernel-2.6.32.10-94.fc12.i686 (the recommended resolutions in #533746) but these both hard-lock too.

Comment 3 John W. Linville 2010-04-08 14:21:17 UTC
Sorry, timing seemed to coincide with a number of other duplicate reports.

Unfortunately, I'm not sure what to suggest.  Are you in a position to build and test kernels extracted from git?

Comment 4 Luke Ross 2010-04-09 09:58:31 UTC
I will certainly give it a try, if you point me towards which ones I need to build.

Comment 5 John W. Linville 2010-04-09 14:01:37 UTC
OK, let's try to establish some baselines...

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.31.y.git
cd linux-2.6.31.y
git checkout -b testing v2.6.31
# use oldest 2.6.31 config you still have...
cp /boot/config-2.6.31.6-166.fc12.i686 .config
make oldconfig
make
make modules_install install

Reboot and verify that the virgin 2.6.31 kernel works for you.  Assuming it does, try the following:

cd linux-2.6.31.y
git reset --hard v2.6.31.9
make oldconfig # select defaults for any config options that need attention
make
make modules_install install

Reboot and verify that the hang exists.  If not, repeat with 2.6.31.10.  In the end, you should have identified a working kernel version and a broken kernel version.  You can now proceed with a bisection.  Below I will presume that v2.6.31 works and v2.6.31.9 does not -- adjust if you find different versions for the endpoints.

cd linux-2.6.31.y
git bisect start
git bisect good v2.6.31
git bisect bad v2.6.31.9
# wait for it to finish churning...
make oldconfig # select defaults for any config options that need attention
make
make modules_install install

Reboot and evaluate the results.  If it works do this:

cd linux-2.6.31.y
git bisect good
# wait for it to finish churning...

If it fails, after rebooting into a working kernel do this:

cd linux-2.6.31.y
git bisect bad
# wait for it to finish churning...

In either case, proceed like this:

make oldconfig # select defaults for any config options that need attention
make
make modules_install install

Lather, rinse, repeat...hopefully you shouldn't need but a handful of iterations until it identifies the first bad commit.  Please copy that information and post it here.  Thanks! :-)
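
The build step repeated in every round above could be wrapped in a small shell function. This is a minimal sketch and not part of the original instructions: the names `run` and `build_round` and the `DRY_RUN` switch are hypothetical, and it assumes it is called from the kernel tree with enough privilege for the install step.

```shell
# Hypothetical helper (not from the thread): one round's build step.
# With DRY_RUN=1 the commands are printed instead of being executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_round() {
    # Stop at the first failing step so a broken build is never installed.
    run make oldconfig &&
    run make &&
    run make modules_install install   # the install step needs root
}

# Usage, from the kernel tree once git has checked out a revision:
#   build_round && echo "built $(git describe); reboot and test"
```

Chaining the steps with `&&` means a failed compile stops the round before anything is installed, which keeps a half-built kernel out of /boot.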

Comment 6 Luke Ross 2010-04-13 13:30:57 UTC
I'm trying to work my way through this, but not making much progress because 2.6.31 hangs during boot. It first throws a trace because the kernel is sleeping whilst holding a lock; the trace is thrown several times in succession, so the original scrolls off the screen before I can read it. It then says "note: switch_root[344] exited with preempt_count 1", and after that the boot stops. I tried building 2.6.31.6, as it was the closest base version to the last working RPM I had, but it dies with similar traces.

Comment 7 John W. Linville 2010-04-13 14:11:11 UTC
Hmmm...I'm sorry to hear that. Doing a bisect can be problematic... :-( Do make sure that you are sticking close to the Fedora config, just in case there is some "magic" setting required to be compatible with the Fedora userland. You might also try taking a closer look at anything that needs attention during the "make oldconfig", just in case the default isn't actually correct -- sorry, it is hard to offer generic advice in that regard.

The switch_root error sounds like it is coming from the initrd.  I'm not sure what that could be either, but you should look to see if there are any errors coming from installkernel (invoked during the "make modules_install install" phase above).

I'm sorry this is so painful.  Bisection can be very useful in pinpointing the origin of a problem when you don't know where else to start.  Unfortunately, that is the situation we are in now. :-(

Comment 8 Luke Ross 2010-04-22 21:00:49 UTC
The problem was that vanilla kernels don't seem to boot a Fedora encrypted install. I had to install F12 to a USB key and have the vanilla kernel boot that.

Using this, I tried vanilla 2.6.31.6, 2.6.31.10 and 2.6.31.13. I was unable to replicate the problem with any of these vanilla kernels, but the same USB install showed the problem readily with the non-vanilla 2.6.31.9-174.fc12.i686.

This suggests to me that one of the Fedora patches is triggering the problem.

Comment 9 Chuck Ebbert 2010-04-27 12:01:18 UTC
2.6.32.11-104 and later kernels have some b43 DMA fixes, and add a way of forcing PIO at module load.

Comment 10 Luke Ross 2010-04-28 19:13:14 UTC
I gave kernel-2.6.32.11-105.fc12.i686 a whirl - same thing. I then tried using the pio=1 option but still the same lock-up.
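
For anyone repeating this test, the `pio=1` option mentioned above is a b43 module parameter and can be passed either when loading the module or through a modprobe configuration file. A minimal sketch follows; the function name `b43_pio_conf` and the file name used in the comments are assumptions for illustration, not something taken from this report.

```shell
# Print the modprobe configuration line that sets the b43 "pio" option.
# The function name is hypothetical.
b43_pio_conf() {
    printf 'options b43 pio=%s\n' "$1"
}

b43_pio_conf 1   # prints: options b43 pio=1

# To make the setting persistent (as root; the file name is an assumption):
#   b43_pio_conf 1 > /etc/modprobe.d/b43-pio.conf
# For a one-off test, reload the module with the option instead:
#   modprobe -r b43 && modprobe b43 pio=1
# "modinfo -p b43" lists the parameters the installed module accepts.
```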

Comment 11 Luke Ross 2010-05-03 09:28:50 UTC
Going back to the vanilla kernels, 2.6.31.x (I tried .6, .10, .11 and .13) all appear to work, but vanilla 2.6.32 locks on connecting over wireless, so perhaps it was something Fedora pulled in early that was then mainstreamed in 2.6.32?

Comment 12 John W. Linville 2010-05-03 14:45:10 UTC
Well, I checked kernel-2.6.31.9-174.fc12 -- there are no b43 or ssb patches in it.  So while what you suggest in comment 11 is possible, it doesn't seem to be the case for any patches obviously related to your hardware.

I'll Cc: one of the b43 team in case he has some suggestions.  If you are bored or desperate, you might try a git bisect from 2.6.31 to 2.6.32... :-)

Comment 13 Larry Finger 2010-05-03 16:27:12 UTC
I am not able to duplicate your result as my BCM4312 802.11a/b/g card is on loan to the openfwwf group in Italy; however, you should be able to speed up the bisection by using

git bisect start drivers/ssb/ drivers/net/wireless/b43/
git bisect bad v2.6.32
git bisect good v2.6.31

On the mainline tree, this results in 53 revisions. About half of them cover LP PHY addition, and are not relevant to your situation.
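
A bisection halves the candidate range each round, so the number of build-and-test rounds grows with the base-2 logarithm of the revision count. A rough sketch of that arithmetic for the 53 revisions mentioned above; the function name `bisect_rounds` is hypothetical, and git's own "roughly N steps" estimate may differ by one.

```shell
# Estimate the build-and-test rounds needed to bisect n candidate
# revisions: the smallest r such that 2^r >= n.
bisect_rounds() {
    n=$1
    rounds=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}

bisect_rounds 53   # prints: 6

# The revision count itself can be checked in a kernel tree with:
#   git log --oneline v2.6.31..v2.6.32 -- drivers/ssb/ drivers/net/wireless/b43/ | wc -l
```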

Comment 14 John W. Linville 2010-06-09 17:59:32 UTC
Luke, any progress on the git bisect for this issue?

Comment 15 Luke Ross 2010-06-09 20:01:21 UTC
I managed to screw up the bisection and arrived at an unhelpful answer, so I had to start over. As each kernel takes 2-3 evenings to build on this machine it is proving slow going, but I'm hopefully getting there.
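
When a wrong good/bad answer has been given, git can replay a saved bisection log instead of starting from scratch. A minimal sketch using `git bisect log` and `git bisect replay`; the function names and the log file path are hypothetical.

```shell
# Save the answers given so far. Run this after each "git bisect good/bad".
bisect_checkpoint() {
    git bisect log > "$1"
}

# Abandon the current state and replay the answers from a saved log.
# Edit the log first to delete any answer that turned out to be wrong.
bisect_resume() {
    git bisect reset &&
    git bisect replay "$1"
}

# Usage, inside the kernel tree:
#   bisect_checkpoint ~/b43-bisect.log
#   ...remove the mistaken line from ~/b43-bisect.log...
#   bisect_resume ~/b43-bisect.log
```

Only the rounds after the removed answer then need to be rebuilt and retested.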

Comment 16 Luke Ross 2010-06-20 21:26:20 UTC
Sorry, but after 3 complete bisections I've drawn a blank. Each bisection arrives at a different commit, and none of those commits lines up with the b43/ssb drivers, or indeed with any meaningful change (two of the three were merge points). I can only assume the lock-up happens only under specific circumstances, which is confusing the bisect. I suggest closing as CANTFIX for now, and hope someone else with the same problem stumbles across it.
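
One possible explanation for inconsistent results, offered as an assumption rather than a diagnosis, is that a bad kernel happened to survive its test and was marked good; the original description notes that the lock-up does not occur on every boot. A sketch of a stricter rule follows, in which a single lock-up is conclusive but "good" requires several clean trials. The function name and the threshold are hypothetical.

```shell
# Hypothetical rule for an intermittent lock-up.
# Arguments: lock-ups seen, clean trials completed, clean trials required.
bisect_verdict() {
    lockups=$1
    clean=$2
    required=$3
    if [ "$lockups" -gt 0 ]; then
        echo bad
    elif [ "$clean" -ge "$required" ]; then
        echo good
    else
        echo undecided   # keep testing, or use "git bisect skip"
    fi
}

bisect_verdict 0 2 5   # prints: undecided
bisect_verdict 1 0 5   # prints: bad
bisect_verdict 0 5 5   # prints: good
```

Only a "good" or "bad" verdict would be passed on to `git bisect`; an undecided revision can be tested further or excluded with `git bisect skip`.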

Comment 17 John W. Linville 2010-06-21 14:29:06 UTC
OK, sorry we can't be more helpful at the moment...