Created attachment 549229 [details]
Picture of warning and oops
Very occasionally when booting the Fedora 16 installer, I'll get an oops in firmware_loading_store. I was able to get a picture of the oops and a warning immediately preceding it. The system was still alive, but it appears the installer was hung trying to configure the network, and my attempt to run the 'ip' command to see if the network was up also hung.
I'll attach the picture; the oops was at firmware_loading_store+0xf9/0x17b and the backtrace went:
I see the problem
Sorry, hit save too early. Theres a race on the use of fw_priv->fw. In case 1 in firmware_loading_store we dereference fw_priv->fw->data without the protection of the fw_lock mutex. It can be set to NULL in _request_firmware prior to the sysfs file being unlinked, allowing for a user space app to write into it and see a NULL pointer there. I'll get it fixed in a bit.
Created attachment 549234 [details]
patch to fix reported oops
Its untested, but this should prevent your oops.
theres a test build for you. Please install it on your system and confirm that it boots, and that you no longer see the reported oops. Thanks!
It might be a bit difficult to test this. I've only ever seen this when booting the F16 installer, and then only three times out of the perhaps a hundred times that I've booted it recently.
Fortunately I use the pxe images so I _think_ I can just replace the vmlinuz there with one pulled from the koji build when it finishes. But saying that it conclusively solves the problem is going to be pretty difficult, though I do have another hundred or so machines to install.
Thats ok, The problem is pretty clear. Mostly I want to be sure that I havent caused any deadlock regressions by moving the mutex around like that. installing a few systems should confirm that. If you don't see any problems with this during your next set of installs with this kernel, post a comment to that affect here, and I'll make sure this change gets in.
It finally occurs to me that of course I can't just swap in an updated vmlinuz as the modules in the initrd won't match up and pretty much nothing will work. If you have any hints about how I might regenerate the pxe images I'm happy to take a look, but otherwise I've no easy way to test this. Given that we're about to close for the holidays, all I can do is take this back up in January. Maybe I can do some boot testing in a VM if I can figure out how to generate the images.
Well, the normal method for using pxe images is to generate an initrd to go with them. Presumably you have initrds that you download with the existing pxe images that run your installer. you should just be able to loop mount the initrd image (after decompression), or you can just expand it, then go in and replace the modules in the installer with the one from the new kernel rpm, as well as the firmware. Then repack the initramfs, and specify it to use with the new kernel image.
Alternatively, we can setup a new reproducer to test this out. After installing a system, update the kernel with this package. Then determine which module load is triggering the firmware request (you can monitor udev for this). When you know that, you can start two processs, one to constantly load and unload the module, and a second to blindly echo 0 > /sys/bus/pci/<device>/.../loading
The latter will fail most of the time (as the file doesn't exist), but during the module load it will exist for a brief period of time, during firmware load time. On the old kernel it should randomly fail, on the new kernel it should not (although note that the device will consistently not work because doing the above echo will cause firmware load failures). That shouldn't matter for the purposes of this bug though.
ping, any update here?
Given that my employer is still closed and I've been on vacation, I haven't had access to any machines to use to work on this. Probably won't have much time to look at it during the first couple of days back, either.
Ok, Well, I've managed to get hold of a system and confirmed that it doesn't break anything, which is all I was really after, given the root cause is fairly obvious. I'm going to go ahead and post upstream. I'll backport when its accepted.
Looks like greg likes the patch. I'll backport it asap.
comitted to the f16 git branch as commit f05bbd88785e2ea554a6f9e1ad433da7f7d1e296