Bug 769920

Summary: Occasional oops in firmware_loading_store while booting installer
Product: [Fedora] Fedora Reporter: Jason Tibbitts <tibbs>
Component: kernelAssignee: Neil Horman <nhorman>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 16CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mlb, nhorman
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-02-06 16:59:20 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
Picture of warning and oops
none
patch to fix reported oops none

Description Jason Tibbitts 2011-12-22 18:29:39 UTC
Created attachment 549229 [details]
Picture of warning and oops

Very occasionally when booting the Fedora 16 installer, I'll get an oops in firmware_loading_store.  I was able to get a picture of the oops and a warning immediately preceding it.  The system was still alive, but it appears the installer was hung trying to configure the network, and my attempt to run the 'ip' command to see if the network was up also hung.

I'll attach the picture; the oops was at firmware_loading_store+0xf9/0x17b and the backtrace went:

dev_attr_store+0x20/0x22
sysfs_write_file+0x101/0x134
vfs_write+0xac/0xf3
sys_write+0x4a/0x6e
system_call_fastpath+0x16/0x1b

Comment 1 Neil Horman 2011-12-22 19:41:47 UTC
I see the problem

Comment 2 Neil Horman 2011-12-22 19:45:41 UTC
Sorry, hit save too early.  Theres a race on the use of fw_priv->fw.  In case 1 in firmware_loading_store we dereference fw_priv->fw->data without the protection of the fw_lock mutex.  It can be set to NULL in _request_firmware prior to the sysfs file being unlinked, allowing for a user space app to write into it and see a NULL pointer there. I'll get it fixed in a bit.

Comment 3 Neil Horman 2011-12-22 20:04:20 UTC
Created attachment 549234 [details]
patch to fix reported oops

Its untested, but this should prevent your oops.

Comment 4 Neil Horman 2011-12-22 20:19:27 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=3601889

theres a test build for you.  Please install it on your system and confirm that it boots, and that you no longer see the reported oops.  Thanks!

Comment 5 Jason Tibbitts 2011-12-22 20:33:01 UTC
It might be a bit difficult to test this.  I've only ever seen this when booting the F16 installer, and then only three times out of the perhaps a hundred times that I've booted it recently.

Fortunately I use the pxe images so I _think_ I can just replace the vmlinuz there with one pulled from the koji build when it finishes.  But saying that it conclusively solves the problem is going to be pretty difficult, though I do have another hundred or so machines to install.

Comment 6 Neil Horman 2011-12-22 20:45:35 UTC
Thats ok, The problem is pretty clear.  Mostly I want to be sure that I havent caused any deadlock regressions by moving the mutex around like that.  installing a few systems should confirm that.  If you don't see any problems with this during your next set of installs with this kernel, post a comment to that affect here, and I'll make sure this change gets in.

Comment 7 Jason Tibbitts 2011-12-22 21:32:33 UTC
It finally occurs to me that of course I can't just swap in an updated vmlinuz as the modules in the initrd won't match up and pretty much nothing will work.  If you have any hints about how I might regenerate the pxe images I'm happy to take a look, but otherwise I've no easy way to test this.  Given that we're about to close for the holidays, all I can do is take this back up in January.  Maybe I can do some boot testing in a VM if I can figure out how to generate the images.

Comment 8 Neil Horman 2011-12-23 01:01:53 UTC
Well, the normal method for using pxe images is to generate an initrd to go with them. Presumably you have initrds that you download with the existing pxe images that run your installer.  you should just be able to loop mount the initrd image (after decompression), or you can just expand it, then go in and replace the modules in the installer with the one from the new kernel rpm, as well as the firmware.  Then repack the initramfs, and specify it to use with the new kernel image.

Alternatively, we can setup a new reproducer to test this out.  After installing a system, update the kernel with this package. Then determine which module load is triggering the firmware request (you can monitor udev for this).  When you know that, you can start two processs, one to constantly load and unload the module, and a second to blindly echo 0 > /sys/bus/pci/<device>/.../loading

The latter will fail most of the time (as the file doesn't exist), but during the module load it will exist for a brief period of time, during firmware load time.  On the old kernel it should randomly fail, on the new kernel it should not (although note that the device will consistently not work because doing the above echo will cause firmware load failures).  That shouldn't matter for the purposes of this bug though.

Comment 9 Neil Horman 2011-12-31 13:55:43 UTC
ping, any update here?

Comment 10 Jason Tibbitts 2012-01-01 20:40:10 UTC
Given that my employer is still closed and I've been on vacation, I haven't had access to any machines to use to work on this.  Probably won't have much time to look at it during the first couple of days back, either.

Comment 11 Neil Horman 2012-01-02 20:29:13 UTC
Ok, Well, I've managed to get hold of a system and confirmed that it doesn't break anything, which is all I was really after, given the root cause is fairly obvious.  I'm going to go ahead and post upstream.  I'll backport when its accepted.

Comment 12 Neil Horman 2012-01-02 20:34:27 UTC
http://marc.info/?l=linux-kernel&m=132553646826909&w=2

Upstream thread

Comment 13 Neil Horman 2012-01-03 11:47:34 UTC
Looks like greg likes the patch.  I'll backport it asap.

Comment 14 Neil Horman 2012-01-03 11:58:20 UTC
comitted to the f16 git branch as commit f05bbd88785e2ea554a6f9e1ad433da7f7d1e296