Created attachment 786668 [details] syslog Description of problem: Scratch built Anaconda boot.iso from rawhide gets stuck after storage spoke. Nothing obvious in the Anaconda logs, but after letting it sit this message shows up in syslog: 18:39:05,825 ERR kernel:[ 243.684161] INFO: task systemd-udevd:757 blocked for more than 120 seconds. 18:39:05,825 ERR kernel:[ 243.684167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 18:39:05,825 INFO kernel:[ 243.684169] systemd-udevd D ffff88010cefdc00 4568 757 742 0x00000000 18:39:05,825 WARNING kernel:[ 243.684178] ffff88010d219c08 0000000000000046 00000000001d5300 ffff88010d219fd8 18:39:05,825 WARNING kernel:[ 243.684183] ffff88010d219fd8 00000000001d5300 ffff88010c0e5c00 ffff880118cd8b50 18:39:05,825 WARNING kernel:[ 243.684188] 0000000000000246 ffff88010c0e5c00 ffff880118cd8b48 ffffffffa0254580 18:39:05,825 WARNING kernel:[ 243.684193] Call Trace: 18:39:05,825 WARNING kernel:[ 243.684207] [<ffffffff8172b9c9>] schedule_preempt_disabled+0x39/0x90 18:39:05,825 WARNING kernel:[ 243.684211] [<ffffffff81727ced>] mutex_lock_nested+0x1ad/0x3e0 18:39:05,825 WARNING kernel:[ 243.684227] [<ffffffff81486129>] ? subsys_interface_register+0x59/0xf0 18:39:05,825 WARNING kernel:[ 243.684232] [<ffffffff81486129>] ? subsys_interface_register+0x59/0xf0 18:39:05,825 WARNING kernel:[ 243.684237] [<ffffffff8172d3b6>] ? _raw_write_unlock_irqrestore+0x36/0x70 18:39:05,825 WARNING kernel:[ 243.684243] [<ffffffff81486129>] subsys_interface_register+0x59/0xf0 18:39:05,825 WARNING kernel:[ 243.684258] [<ffffffff810e6c8d>] ? trace_hardirqs_on+0xd/0x10 18:39:05,825 WARNING kernel:[ 243.684270] [<ffffffff8159589f>] cpufreq_register_driver+0x9f/0x200 18:39:05,825 WARNING kernel:[ 243.684278] [<ffffffffa025b16c>] acpi_cpufreq_init+0x16c/0x1000 [acpi_cpufreq] 18:39:05,825 WARNING kernel:[ 243.684291] [<ffffffffa025b000>] ? 0xffffffffa025afff 18:39:05,825 WARNING kernel:[ 243.684300] [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0 18:39:05,825 WARNING kernel:[ 243.684312] [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50 18:39:05,825 WARNING kernel:[ 243.684324] [<ffffffff810f74ff>] load_module+0x1c6f/0x27f0 18:39:05,825 WARNING kernel:[ 243.684328] [<ffffffff810f2ce0>] ? store_uevent+0x40/0x40 18:39:05,825 WARNING kernel:[ 243.684337] [<ffffffff810f8216>] SyS_finit_module+0x86/0xb0 18:39:05,825 WARNING kernel:[ 243.684344] [<ffffffff81736e99>] system_call_fastpath+0x16/0x1b 18:39:05,825 WARNING kernel:[ 243.684347] 1 lock held by systemd-udevd/757: 18:39:05,825 WARNING kernel:[ 243.684349] #0: (subsys mutex#4){+.+.+.}, at: [<ffffffff81486129>] subsys_interface_register+0x59/0xf0 18:39:16,833 INFO kernel:[ 254.692354] microcode: Microcode Update Driver: v2.00 <tigran.co.uk>, Peter Oruba 18:39:16,840 WARNING kernel:[ 254.699383] systemd-udevd (768) used greatest stack depth: 3112 bytes left Version-Release number of selected component (if applicable): systemd-206-2.fc20
kernel bug with load_module() Might be related that the kernel config does not set: CONFIG_FW_LOADER_USER_HELPER=n otherwise firmware loading can take a looooong time
(In reply to Harald Hoyer from comment #1) > kernel bug with load_module() > > Might be related that the kernel config does not set: > CONFIG_FW_LOADER_USER_HELPER=n > > otherwise firmware loading can take a looooong time Yeah, that whole mess is... well it's a mess. It'd be helpful if the udev maintainers gave us a heads up before they went and made the upstream DEFAULT kernel setting not work anymore.
(In reply to Josh Boyer from comment #2) > (In reply to Harald Hoyer from comment #1) > > kernel bug with load_module() > > > > Might be related that the kernel config does not set: > > CONFIG_FW_LOADER_USER_HELPER=n > > > > otherwise firmware loading can take a looooong time > > Yeah, that whole mess is... well it's a mess. It'd be helpful if the udev > maintainers gave us a heads up before they went and made the upstream > DEFAULT kernel setting not work anymore. So we can't turn it off without disabling the Dell RBU driver apparently. That driver selects it the user helper because it needs it for some reason. Like I said, this is a mess.
Also see https://lkml.org/lkml/2013/7/6/112
(In reply to Josh Boyer from comment #2) > Yeah, that whole mess is... well it's a mess. It'd be helpful if the udev > maintainers gave us a heads up before they went and made the upstream > DEFAULT kernel setting not work anymore. Oh, we we tried and gave a head-up that we disabled the udev firmware handling: Mar 20 01:44:51 <kay> jwb: FYI, we wil disabled the userspace firmware loading code in udev now too; so all pieces are gone, it's history ... http://cgit.freedesktop.org/systemd/systemd/commit/?id=a3bd8447be4ea2ce230eb8ae0e815c04d85fa15a But nobody knew that this wasn't enough, or that the broken DELL stuff would pull-in the firmware loader, or that other stuff was now relying on userspace to just blindly *cancel* all firmware requests. It all makes not much sense, like the whole userspace firmware loading never did, and right, it is really just a huge mess. Today, the kernel should not rely on userspace to do anything, not wait for 60 seconds for something that will never happen, especially when it itself has already found out that there is nothing to find in for userspace. Seems we have these 3 options: 1. apply Takashi's patch, which is the cleanest and simplest solution, and the kernel can provide backwards compat as long as it needs, and userspace is not involved at all in firmware loading, like it should be: https://lkml.org/lkml/2013/7/8/81 2. disable the DELL rbu firmware mis-user and let DELL fix it. 3. add an awkward udev rule to immediately cancel all kernel firmware requests: SUBSYSTEM=="firmware", ACTION=="add", ATTR{loading}="-1"
(In reply to Kay Sievers from comment #5) > (In reply to Josh Boyer from comment #2) > > Yeah, that whole mess is... well it's a mess. It'd be helpful if the udev > > maintainers gave us a heads up before they went and made the upstream > > DEFAULT kernel setting not work anymore. > > Oh, we we tried and gave a head-up that we disabled the udev firmware > handling: > Mar 20 01:44:51 <kay> jwb: FYI, we wil disabled the userspace firmware > loading code in udev now too; so all pieces are gone, it's history ... My mistake. I clearly didn't remember that. Though looking in the git history for the Fedora config fragments, we did turn this option off back in Februrary with the 3.8 merge window kernels. I guess we missed the Dell RBU thing bringing it back in. > http://cgit.freedesktop.org/systemd/systemd/commit/ > ?id=a3bd8447be4ea2ce230eb8ae0e815c04d85fa15a > > But nobody knew that this wasn't enough, or that the broken DELL stuff > would pull-in the firmware loader, or that other stuff was now relying > on userspace to just blindly *cancel* all firmware requests. Yeah :\. > It all makes not much sense, like the whole userspace firmware loading > never did, and right, it is really just a huge mess. > > Today, the kernel should not rely on userspace to do anything, not wait > for 60 seconds for something that will never happen, especially when it > itself has already found out that there is nothing to find in for userspace. > > Seems we have these 3 options: > > 1. apply Takashi's patch, which is the cleanest and simplest solution, > and the kernel can provide backwards compat as long as it needs, and > userspace is not involved at all in firmware loading, like it should be: > https://lkml.org/lkml/2013/7/8/81 > > 2. disable the DELL rbu firmware mis-user and let DELL fix it. I'll look at one of these two. I don't like the third option at all.
*** Bug 983319 has been marked as a duplicate of this bug. ***
I disabled the Dell rbu driver. Takashi's patch isn't any better and likely isn't going upstream.
In case it wasn't clear - sorry! - this is a showstopper for Alpha at present. Can we get a kernel build with the fix pretty soon, so this isn't a problem for TC builds? Thanks!
So, a new kernel build was done: http://koji.fedoraproject.org/koji/buildinfo?buildID=457729 I built a live image with that kernel and latest anaconda, and tried it. After patching #997690, anaconda still gets stuck on returning from the partitioning spoke. I don't know if a) the kernel update didn't really fix the kernel problem, b) the kernel update fixed the kernel problem but that was not really what caused the anaconda problem, or c) something else. But the anaconda issue definitely still seems to be there, for me.
(In reply to Adam Williamson from comment #10) > So, a new kernel build was done: > > http://koji.fedoraproject.org/koji/buildinfo?buildID=457729 > > I built a live image with that kernel and latest anaconda, and tried it. > After patching #997690, anaconda still gets stuck on returning from the > partitioning spoke. > > I don't know if a) the kernel update didn't really fix the kernel problem, > b) the kernel update fixed the kernel problem but that was not really what > caused the anaconda problem, or c) something else. But the anaconda issue > definitely still seems to be there, for me. Likely either b) or c). If this only happens when you go into the storage spoke, then it's possible udev might be waiting but I find it pretty unlikely. It was already an educated guess on Harald and Brian's part that udev hanging was the issue, as that was the only thing that showed up in the logs.
Previously posted on: https://bugzilla.redhat.com/show_bug.cgi?id=996295 Here is what I was seeing (prior to kernel-3.11.0-0.rc6.git1.1.fc20.x86_64) [root@tlondon systemd]# journalctl -b | grep systemd-udevd Aug 20 06:01:58 tlondon.localhost.org systemd-udevd[142]: starting version 206 Aug 20 06:02:04 tlondon.localhost.org systemd-udevd[322]: starting version 206 Aug 20 06:02:05 tlondon.localhost.org systemd-udevd[338]: error opening ATTR{/sys/devices/pci0000:00/0000:00:1a.1/usb4/4-1/4-1:1.0/power/control} for writing: No such file or directory Aug 20 06:02:07 tlondon.localhost.org systemd-udevd[329]: renamed network interface wlan0 to wlp3s0 Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [330] /devices/system/cpu/cpu0 timeout; kill it Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: seq 1890 '/devices/system/cpu/cpu0' killed Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [336] /devices/system/cpu/cpu1 timeout; kill it Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: seq 1891 '/devices/system/cpu/cpu1' killed Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [336] terminated by signal 9 (Killed) Aug 20 06:04:06 tlondon.localhost.org systemd-udevd[322]: worker [330] terminated by signal 9 (Killed) [root@tlondon systemd]# Updated/installed kernel-3.11.0-0.rc6.git1.1.fc20.x86_64 and rebooted my Lenovo Thinkpad X200. Here is what I now see: [root@tlondon ~]# journalctl -b | grep systemd-udevd Aug 21 06:22:03 tlondon.localhost.org systemd-udevd[143]: starting version 206 Aug 21 06:22:05 tlondon.localhost.org kernel: systemd-udevd (145) used greatest stack depth: 2568 bytes left Aug 21 06:22:12 tlondon.localhost.org systemd-udevd[309]: starting version 206 Aug 21 06:22:14 tlondon.localhost.org systemd-udevd[329]: error opening ATTR{/sys/devices/pci0000:00/0000:00:1a.1/usb4/4-1/4-1:1.0/power/control} for writing: No such file or directory Aug 21 06:22:18 tlondon.localhost.org systemd-udevd[327]: renamed network interface wlan0 to wlp3s0 [root@tlondon ~]# So, at least for one boot, the very pronounced pause during boot and the above spew is gone. Will try rebooting several times to verify.
Rebooted. No systemd-udevd spew. Probably unrelated, but I see this: [ 19.073545] ieee80211 phy0: Selected rate control algorithm 'iwl-agn-rs' [ 19.447105] input: ThinkPad Extra Buttons as /devices/platform/thinkpad_acpi/input/input13 [ 19.786006] ====================================================== [ 19.786006] [ INFO: possible circular locking dependency detected ] [ 19.786006] 3.11.0-0.rc6.git1.1.fc20.x86_64 #1 Not tainted [ 19.791127] ------------------------------------------------------- [ 19.791127] crda/385 is trying to acquire lock: [ 19.791127] (genl_mutex){+.+.+.}, at: [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180 [ 19.791127] but task is already holding lock: [ 19.791127] (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8161da41>] netlink_dump+0x21/0x230 [ 19.791127] which lock already depends on the new lock. [ 19.791127] the existing dependency chain (in reverse order) is: [ 19.791127] -> #1 (nlk->cb_mutex){+.+.+.}: [ 19.791127] [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0 [ 19.791127] [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0 [ 19.791127] [<ffffffff8161e605>] __netlink_dump_start+0xb5/0x1c0 [ 19.791127] [<ffffffff816226d4>] genl_family_rcv_msg+0x264/0x300 [ 19.791127] [<ffffffff816227fe>] genl_rcv_msg+0x8e/0xd0 [ 19.791127] [<ffffffff816204b9>] netlink_rcv_skb+0xa9/0xc0 [ 19.791127] [<ffffffff81620a18>] genl_rcv+0x28/0x40 [ 19.791127] [<ffffffff8161fafd>] netlink_unicast+0xdd/0x190 [ 19.791127] [<ffffffff8161fed9>] netlink_sendmsg+0x329/0x750 [ 19.791127] [<ffffffff815d0579>] sock_sendmsg+0x99/0xd0 [ 19.791127] [<ffffffff815d099e>] ___sys_sendmsg+0x39e/0x3b0 [ 19.791127] [<ffffffff815d1f02>] __sys_sendmsg+0x42/0x80 [ 19.791127] [<ffffffff815d1f52>] SyS_sendmsg+0x12/0x20 [ 19.791127] [<ffffffff81736619>] system_call_fastpath+0x16/0x1b [ 19.791127] -> #0 (genl_mutex){+.+.+.}: [ 19.791127] [<ffffffff810e88b4>] __lock_acquire+0x17b4/0x1b20 [ 19.791127] [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0 [ 19.791127] [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0 [ 19.791127] [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180 [ 19.791127] [<ffffffff8161daa0>] netlink_dump+0x80/0x230 [ 19.791127] [<ffffffff8161df28>] netlink_recvmsg+0x2d8/0x390 [ 19.791127] [<ffffffff815d0c28>] sock_recvmsg+0xa8/0xe0 [ 19.791127] [<ffffffff815d1693>] ___sys_recvmsg+0x113/0x290 [ 19.791127] [<ffffffff815d2172>] __sys_recvmsg+0x42/0x80 [ 19.791127] [<ffffffff815d21c2>] SyS_recvmsg+0x12/0x20 [ 19.791127] [<ffffffff81736619>] system_call_fastpath+0x16/0x1b [ 19.791127] other info that might help us debug this: [ 19.791127] Possible unsafe locking scenario: [ 19.791127] CPU0 CPU1 [ 19.791127] ---- ---- [ 19.791127] lock(nlk->cb_mutex); [ 19.791127] lock(genl_mutex); [ 19.791127] lock(nlk->cb_mutex); [ 19.791127] lock(genl_mutex); [ 19.791127] *** DEADLOCK *** [ 19.791127] 1 lock held by crda/385: [ 19.791127] #0: (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8161da41>] netlink_dump+0x21/0x230 [ 19.791127] stack backtrace: [ 19.791127] CPU: 0 PID: 385 Comm: crda Not tainted 3.11.0-0.rc6.git1.1.fc20.x86_64 #1 [ 19.791127] Hardware name: LENOVO 74585FU/74585FU, BIOS 6DET72WW (3.22 ) 10/25/2012 [ 19.791127] ffffffff82586930 ffff880121841910 ffffffff81723746 ffffffff82586930 [ 19.791127] ffff880121841950 ffffffff8171f347 ffff8801218419a0 ffff8801241f0850 [ 19.791127] ffff8801241f0000 0000000000000001 0000000000000001 ffff8801241f0850 [ 19.791127] Call Trace: [ 19.791127] [<ffffffff81723746>] dump_stack+0x54/0x74 [ 19.791127] [<ffffffff8171f347>] print_circular_bug+0x201/0x20f [ 19.791127] [<ffffffff810e88b4>] __lock_acquire+0x17b4/0x1b20 [ 19.791127] [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0 [ 19.791127] [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180 [ 19.791127] [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0 [ 19.791127] [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180 [ 19.791127] [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180 [ 19.791127] [<ffffffff815da4e1>] ? __kmalloc_reserve.isra.25+0x31/0x90 [ 19.791127] [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180 [ 19.791127] [<ffffffff8161daa0>] netlink_dump+0x80/0x230 [ 19.791127] [<ffffffff8161df28>] netlink_recvmsg+0x2d8/0x390 [ 19.791127] [<ffffffff815d0c28>] sock_recvmsg+0xa8/0xe0 [ 19.791127] [<ffffffff810e359d>] ? trace_hardirqs_off+0xd/0x10 [ 19.791127] [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0 [ 19.791127] [<ffffffff810e9868>] ? lock_release_non_nested+0x308/0x350 [ 19.791127] [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0 [ 19.791127] [<ffffffff815d1693>] ___sys_recvmsg+0x113/0x290 [ 19.791127] [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0 [ 19.791127] [<ffffffff810a5b5f>] ? up_read+0x1f/0x40 [ 19.791127] [<ffffffff81210197>] ? fget_light+0x3a7/0x510 [ 19.791127] [<ffffffff815d2172>] __sys_recvmsg+0x42/0x80 [ 19.791127] [<ffffffff815d21c2>] SyS_recvmsg+0x12/0x20 [ 19.791127] [<ffffffff81736619>] system_call_fastpath+0x16/0x1b
(In reply to Tom London from comment #12) > So, at least for one boot, the very pronounced pause during boot and the > above spew is gone. Right. That's what I expect to see with the change we made. I doubt it is the cause of the spoke hang. (In reply to Tom London from comment #13) > Rebooted. No systemd-udevd spew. > > Probably unrelated, but I see this: Yes, that is unrelated. It should be fixed in an updated kernel shortly, as upstream is aware of it and the patch that caused it should be reverted.
Dropping blocker status from this and re-opening 983319, as it seems this isn't causing the anaconda bug and doesn't seem inherently blockery itself.
*** Bug 1109970 has been marked as a duplicate of this bug. ***