997149 – task systemd-udevd:757 blocked for more than 120 seconds.

Bug 997149 - task systemd-udevd:757 blocked for more than 120 seconds.

Summary: task systemd-udevd:757 blocked for more than 120 seconds.

Keywords:
Status:	CLOSED RAWHIDE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1109970 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-08-14 19:06 UTC by Brian Lane
Modified:	2014-07-07 13:49 UTC (History)
CC List:	19 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-08-21 18:06:00 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
syslog (63.99 KB, text/plain) 2013-08-14 19:06 UTC, Brian Lane	no flags	Details
View All

Description Brian Lane 2013-08-14 19:06:58 UTC

Created attachment 786668 [details]
syslog

Description of problem:
Scratch built Anaconda boot.iso from rawhide gets stuck after storage spoke. Nothing obvious in the Anaconda logs, but after letting it sit this message shows up in syslog:

18:39:05,825 ERR kernel:[  243.684161] INFO: task systemd-udevd:757 blocked for more than 120 seconds.
18:39:05,825 ERR kernel:[  243.684167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
18:39:05,825 INFO kernel:[  243.684169] systemd-udevd   D ffff88010cefdc00  4568   757    742 0x00000000
18:39:05,825 WARNING kernel:[  243.684178]  ffff88010d219c08 0000000000000046 00000000001d5300 ffff88010d219fd8
18:39:05,825 WARNING kernel:[  243.684183]  ffff88010d219fd8 00000000001d5300 ffff88010c0e5c00 ffff880118cd8b50
18:39:05,825 WARNING kernel:[  243.684188]  0000000000000246 ffff88010c0e5c00 ffff880118cd8b48 ffffffffa0254580
18:39:05,825 WARNING kernel:[  243.684193] Call Trace:
18:39:05,825 WARNING kernel:[  243.684207]  [<ffffffff8172b9c9>] schedule_preempt_disabled+0x39/0x90
18:39:05,825 WARNING kernel:[  243.684211]  [<ffffffff81727ced>] mutex_lock_nested+0x1ad/0x3e0
18:39:05,825 WARNING kernel:[  243.684227]  [<ffffffff81486129>] ? subsys_interface_register+0x59/0xf0
18:39:05,825 WARNING kernel:[  243.684232]  [<ffffffff81486129>] ? subsys_interface_register+0x59/0xf0
18:39:05,825 WARNING kernel:[  243.684237]  [<ffffffff8172d3b6>] ? _raw_write_unlock_irqrestore+0x36/0x70
18:39:05,825 WARNING kernel:[  243.684243]  [<ffffffff81486129>] subsys_interface_register+0x59/0xf0
18:39:05,825 WARNING kernel:[  243.684258]  [<ffffffff810e6c8d>] ? trace_hardirqs_on+0xd/0x10
18:39:05,825 WARNING kernel:[  243.684270]  [<ffffffff8159589f>] cpufreq_register_driver+0x9f/0x200
18:39:05,825 WARNING kernel:[  243.684278]  [<ffffffffa025b16c>] acpi_cpufreq_init+0x16c/0x1000 [acpi_cpufreq]
18:39:05,825 WARNING kernel:[  243.684291]  [<ffffffffa025b000>] ? 0xffffffffa025afff
18:39:05,825 WARNING kernel:[  243.684300]  [<ffffffff810020fa>] do_one_initcall+0xfa/0x1b0
18:39:05,825 WARNING kernel:[  243.684312]  [<ffffffff8105cb93>] ? set_memory_nx+0x43/0x50
18:39:05,825 WARNING kernel:[  243.684324]  [<ffffffff810f74ff>] load_module+0x1c6f/0x27f0
18:39:05,825 WARNING kernel:[  243.684328]  [<ffffffff810f2ce0>] ? store_uevent+0x40/0x40
18:39:05,825 WARNING kernel:[  243.684337]  [<ffffffff810f8216>] SyS_finit_module+0x86/0xb0
18:39:05,825 WARNING kernel:[  243.684344]  [<ffffffff81736e99>] system_call_fastpath+0x16/0x1b
18:39:05,825 WARNING kernel:[  243.684347] 1 lock held by systemd-udevd/757:
18:39:05,825 WARNING kernel:[  243.684349]  #0:  (subsys mutex#4){+.+.+.}, at: [<ffffffff81486129>] subsys_interface_register+0x59/0xf0
18:39:16,833 INFO kernel:[  254.692354] microcode: Microcode Update Driver: v2.00 <tigran.co.uk>, Peter Oruba
18:39:16,840 WARNING kernel:[  254.699383] systemd-udevd (768) used greatest stack depth: 3112 bytes left

Version-Release number of selected component (if applicable):
systemd-206-2.fc20

Comment 1 Harald Hoyer 2013-08-20 11:29:39 UTC

kernel bug with load_module()

Might be related that the kernel config does not set:
CONFIG_FW_LOADER_USER_HELPER=n

otherwise firmware loading can take a looooong time

Comment 2 Josh Boyer 2013-08-20 12:59:14 UTC

(In reply to Harald Hoyer from comment #1)
> kernel bug with load_module()
> 
> Might be related that the kernel config does not set:
> CONFIG_FW_LOADER_USER_HELPER=n
> 
> otherwise firmware loading can take a looooong time

Yeah, that whole mess is... well it's a mess.  It'd be helpful if the udev maintainers gave us a heads up before they went and made the upstream DEFAULT kernel setting not work anymore.

Comment 3 Josh Boyer 2013-08-20 13:04:45 UTC

(In reply to Josh Boyer from comment #2)
> (In reply to Harald Hoyer from comment #1)
> > kernel bug with load_module()
> > 
> > Might be related that the kernel config does not set:
> > CONFIG_FW_LOADER_USER_HELPER=n
> > 
> > otherwise firmware loading can take a looooong time
> 
> Yeah, that whole mess is... well it's a mess.  It'd be helpful if the udev
> maintainers gave us a heads up before they went and made the upstream
> DEFAULT kernel setting not work anymore.

So we can't turn it off without disabling the Dell RBU driver apparently.  That driver selects it the user helper because it needs it for some reason.  Like I said, this is a mess.

Comment 4 Jóhann B. Guðmundsson 2013-08-20 13:19:44 UTC

Also see

https://lkml.org/lkml/2013/7/6/112

Comment 5 Kay Sievers 2013-08-20 14:17:35 UTC

(In reply to Josh Boyer from comment #2)
> Yeah, that whole mess is... well it's a mess.  It'd be helpful if the udev
> maintainers gave us a heads up before they went and made the upstream
> DEFAULT kernel setting not work anymore.

Oh, we we tried and gave a head-up that we disabled the udev firmware
handling:
  Mar 20 01:44:51 <kay> jwb: FYI, we wil disabled the userspace firmware
  loading code in udev now too; so all pieces are gone, it's history ...

  http://cgit.freedesktop.org/systemd/systemd/commit/?id=a3bd8447be4ea2ce230eb8ae0e815c04d85fa15a

But nobody knew that this wasn't enough, or that the broken DELL stuff
would pull-in the firmware loader, or that other stuff was now relying
on userspace to just blindly *cancel* all firmware requests.

It all makes not much sense, like the whole userspace firmware loading
never did, and right, it is really just a huge mess.

Today, the kernel should not rely on userspace to do anything, not wait
for 60 seconds for something that will never happen, especially when it
itself has already found out that there is nothing to find in for userspace.

Seems we have these 3 options:

1. apply Takashi's patch, which is the cleanest and simplest solution,
and the kernel can provide backwards compat as long as it needs, and
userspace is not involved at all in firmware loading, like it should be:
  https://lkml.org/lkml/2013/7/8/81

2. disable the DELL rbu firmware mis-user and let DELL fix it.

3. add an awkward udev rule to immediately cancel all kernel firmware requests:
  SUBSYSTEM=="firmware", ACTION=="add", ATTR{loading}="-1"

Comment 6 Josh Boyer 2013-08-20 14:53:38 UTC

(In reply to Kay Sievers from comment #5)
> (In reply to Josh Boyer from comment #2)
> > Yeah, that whole mess is... well it's a mess.  It'd be helpful if the udev
> > maintainers gave us a heads up before they went and made the upstream
> > DEFAULT kernel setting not work anymore.
> 
> Oh, we we tried and gave a head-up that we disabled the udev firmware
> handling:
>   Mar 20 01:44:51 <kay> jwb: FYI, we wil disabled the userspace firmware
>   loading code in udev now too; so all pieces are gone, it's history ...

My mistake.  I clearly didn't remember that.  Though looking in the git history for the Fedora config fragments, we did turn this option off back in Februrary with the 3.8 merge window kernels.  I guess we missed the Dell RBU thing bringing it back in.

> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=a3bd8447be4ea2ce230eb8ae0e815c04d85fa15a
> 
> But nobody knew that this wasn't enough, or that the broken DELL stuff
> would pull-in the firmware loader, or that other stuff was now relying
> on userspace to just blindly *cancel* all firmware requests.

Yeah :\.

> It all makes not much sense, like the whole userspace firmware loading
> never did, and right, it is really just a huge mess.
> 
> Today, the kernel should not rely on userspace to do anything, not wait
> for 60 seconds for something that will never happen, especially when it
> itself has already found out that there is nothing to find in for userspace.
> 
> Seems we have these 3 options:
> 
> 1. apply Takashi's patch, which is the cleanest and simplest solution,
> and the kernel can provide backwards compat as long as it needs, and
> userspace is not involved at all in firmware loading, like it should be:
>   https://lkml.org/lkml/2013/7/8/81
> 
> 2. disable the DELL rbu firmware mis-user and let DELL fix it.

I'll look at one of these two.  I don't like the third option at all.

Comment 7 Brian Lane 2013-08-20 17:07:46 UTC

*** Bug 983319 has been marked as a duplicate of this bug. ***

Comment 8 Josh Boyer 2013-08-20 17:31:25 UTC

I disabled the Dell rbu driver.

Takashi's patch isn't any better and likely isn't going upstream.

Comment 9 Adam Williamson 2013-08-20 17:36:17 UTC

In case it wasn't clear - sorry! - this is a showstopper for Alpha at present. Can we get a kernel build with the fix pretty soon, so this isn't a problem for TC builds? Thanks!

Comment 10 Adam Williamson 2013-08-21 07:04:04 UTC

So, a new kernel build was done:

http://koji.fedoraproject.org/koji/buildinfo?buildID=457729

I built a live image with that kernel and latest anaconda, and tried it. After patching #997690, anaconda still gets stuck on returning from the partitioning spoke.

I don't know if a) the kernel update didn't really fix the kernel problem, b) the kernel update fixed the kernel problem but that was not really what caused the anaconda problem, or c) something else. But the anaconda issue definitely still seems to be there, for me.

Comment 11 Josh Boyer 2013-08-21 12:59:42 UTC

(In reply to Adam Williamson from comment #10)
> So, a new kernel build was done:
> 
> http://koji.fedoraproject.org/koji/buildinfo?buildID=457729
> 
> I built a live image with that kernel and latest anaconda, and tried it.
> After patching #997690, anaconda still gets stuck on returning from the
> partitioning spoke.
> 
> I don't know if a) the kernel update didn't really fix the kernel problem,
> b) the kernel update fixed the kernel problem but that was not really what
> caused the anaconda problem, or c) something else. But the anaconda issue
> definitely still seems to be there, for me.

Likely either b) or c).

If this only happens when you go into the storage spoke, then it's possible udev might be waiting but I find it pretty unlikely.  It was already an educated guess on Harald and Brian's part that udev hanging was the issue, as that was the only thing that showed up in the logs.

Comment 12 Tom London 2013-08-21 13:35:16 UTC

Previously posted on: https://bugzilla.redhat.com/show_bug.cgi?id=996295

Here is what I was seeing (prior to kernel-3.11.0-0.rc6.git1.1.fc20.x86_64)

[root@tlondon systemd]# journalctl -b | grep systemd-udevd
Aug 20 06:01:58 tlondon.localhost.org systemd-udevd[142]: starting version 206
Aug 20 06:02:04 tlondon.localhost.org systemd-udevd[322]: starting version 206
Aug 20 06:02:05 tlondon.localhost.org systemd-udevd[338]: error opening ATTR{/sys/devices/pci0000:00/0000:00:1a.1/usb4/4-1/4-1:1.0/power/control} for writing: No such file or directory
Aug 20 06:02:07 tlondon.localhost.org systemd-udevd[329]: renamed network interface wlan0 to wlp3s0
Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [330] /devices/system/cpu/cpu0 timeout; kill it
Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: seq 1890 '/devices/system/cpu/cpu0' killed
Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [336] /devices/system/cpu/cpu1 timeout; kill it
Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: seq 1891 '/devices/system/cpu/cpu1' killed
Aug 20 06:02:46 tlondon.localhost.org systemd-udevd[322]: worker [336] terminated by signal 9 (Killed)
Aug 20 06:04:06 tlondon.localhost.org systemd-udevd[322]: worker [330] terminated by signal 9 (Killed)
[root@tlondon systemd]# 


Updated/installed kernel-3.11.0-0.rc6.git1.1.fc20.x86_64 and rebooted my Lenovo Thinkpad X200.

Here is what I now see:

[root@tlondon ~]# journalctl -b | grep systemd-udevd
Aug 21 06:22:03 tlondon.localhost.org systemd-udevd[143]: starting version 206
Aug 21 06:22:05 tlondon.localhost.org kernel: systemd-udevd (145) used greatest stack depth: 2568 bytes left
Aug 21 06:22:12 tlondon.localhost.org systemd-udevd[309]: starting version 206
Aug 21 06:22:14 tlondon.localhost.org systemd-udevd[329]: error opening ATTR{/sys/devices/pci0000:00/0000:00:1a.1/usb4/4-1/4-1:1.0/power/control} for writing: No such file or directory
Aug 21 06:22:18 tlondon.localhost.org systemd-udevd[327]: renamed network interface wlan0 to wlp3s0
[root@tlondon ~]# 

So, at least for one boot, the very pronounced pause during boot and the above spew is gone.

Will try rebooting several times to verify.

Comment 13 Tom London 2013-08-21 14:19:18 UTC

Rebooted. No systemd-udevd spew.

Probably unrelated, but I see this:

[   19.073545] ieee80211 phy0: Selected rate control algorithm 'iwl-agn-rs'
[   19.447105] input: ThinkPad Extra Buttons as /devices/platform/thinkpad_acpi/input/input13

[   19.786006] ======================================================
[   19.786006] [ INFO: possible circular locking dependency detected ]
[   19.786006] 3.11.0-0.rc6.git1.1.fc20.x86_64 #1 Not tainted
[   19.791127] -------------------------------------------------------
[   19.791127] crda/385 is trying to acquire lock:
[   19.791127]  (genl_mutex){+.+.+.}, at: [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180
[   19.791127] 
but task is already holding lock:
[   19.791127]  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8161da41>] netlink_dump+0x21/0x230
[   19.791127] 
which lock already depends on the new lock.

[   19.791127] 
the existing dependency chain (in reverse order) is:
[   19.791127] 
-> #1 (nlk->cb_mutex){+.+.+.}:
[   19.791127]        [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0
[   19.791127]        [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0
[   19.791127]        [<ffffffff8161e605>] __netlink_dump_start+0xb5/0x1c0
[   19.791127]        [<ffffffff816226d4>] genl_family_rcv_msg+0x264/0x300
[   19.791127]        [<ffffffff816227fe>] genl_rcv_msg+0x8e/0xd0
[   19.791127]        [<ffffffff816204b9>] netlink_rcv_skb+0xa9/0xc0
[   19.791127]        [<ffffffff81620a18>] genl_rcv+0x28/0x40
[   19.791127]        [<ffffffff8161fafd>] netlink_unicast+0xdd/0x190
[   19.791127]        [<ffffffff8161fed9>] netlink_sendmsg+0x329/0x750
[   19.791127]        [<ffffffff815d0579>] sock_sendmsg+0x99/0xd0
[   19.791127]        [<ffffffff815d099e>] ___sys_sendmsg+0x39e/0x3b0
[   19.791127]        [<ffffffff815d1f02>] __sys_sendmsg+0x42/0x80
[   19.791127]        [<ffffffff815d1f52>] SyS_sendmsg+0x12/0x20
[   19.791127]        [<ffffffff81736619>] system_call_fastpath+0x16/0x1b
[   19.791127] 
-> #0 (genl_mutex){+.+.+.}:
[   19.791127]        [<ffffffff810e88b4>] __lock_acquire+0x17b4/0x1b20
[   19.791127]        [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0
[   19.791127]        [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0
[   19.791127]        [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180
[   19.791127]        [<ffffffff8161daa0>] netlink_dump+0x80/0x230
[   19.791127]        [<ffffffff8161df28>] netlink_recvmsg+0x2d8/0x390
[   19.791127]        [<ffffffff815d0c28>] sock_recvmsg+0xa8/0xe0
[   19.791127]        [<ffffffff815d1693>] ___sys_recvmsg+0x113/0x290
[   19.791127]        [<ffffffff815d2172>] __sys_recvmsg+0x42/0x80
[   19.791127]        [<ffffffff815d21c2>] SyS_recvmsg+0x12/0x20
[   19.791127]        [<ffffffff81736619>] system_call_fastpath+0x16/0x1b
[   19.791127] 
other info that might help us debug this:

[   19.791127]  Possible unsafe locking scenario:

[   19.791127]        CPU0                    CPU1
[   19.791127]        ----                    ----
[   19.791127]   lock(nlk->cb_mutex);
[   19.791127]                                lock(genl_mutex);
[   19.791127]                                lock(nlk->cb_mutex);
[   19.791127]   lock(genl_mutex);
[   19.791127] 
 *** DEADLOCK ***

[   19.791127] 1 lock held by crda/385:
[   19.791127]  #0:  (nlk->cb_mutex){+.+.+.}, at: [<ffffffff8161da41>] netlink_dump+0x21/0x230
[   19.791127] 
stack backtrace:
[   19.791127] CPU: 0 PID: 385 Comm: crda Not tainted 3.11.0-0.rc6.git1.1.fc20.x86_64 #1
[   19.791127] Hardware name: LENOVO 74585FU/74585FU, BIOS 6DET72WW (3.22 ) 10/25/2012
[   19.791127]  ffffffff82586930 ffff880121841910 ffffffff81723746 ffffffff82586930
[   19.791127]  ffff880121841950 ffffffff8171f347 ffff8801218419a0 ffff8801241f0850
[   19.791127]  ffff8801241f0000 0000000000000001 0000000000000001 ffff8801241f0850
[   19.791127] Call Trace:
[   19.791127]  [<ffffffff81723746>] dump_stack+0x54/0x74
[   19.791127]  [<ffffffff8171f347>] print_circular_bug+0x201/0x20f
[   19.791127]  [<ffffffff810e88b4>] __lock_acquire+0x17b4/0x1b20
[   19.791127]  [<ffffffff810e9412>] lock_acquire+0xa2/0x1f0
[   19.791127]  [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180
[   19.791127]  [<ffffffff81727357>] mutex_lock_nested+0x87/0x3e0
[   19.791127]  [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180
[   19.791127]  [<ffffffff81621512>] ? ctrl_dumpfamily+0x172/0x180
[   19.791127]  [<ffffffff815da4e1>] ? __kmalloc_reserve.isra.25+0x31/0x90
[   19.791127]  [<ffffffff81621512>] ctrl_dumpfamily+0x172/0x180
[   19.791127]  [<ffffffff8161daa0>] netlink_dump+0x80/0x230
[   19.791127]  [<ffffffff8161df28>] netlink_recvmsg+0x2d8/0x390
[   19.791127]  [<ffffffff815d0c28>] sock_recvmsg+0xa8/0xe0
[   19.791127]  [<ffffffff810e359d>] ? trace_hardirqs_off+0xd/0x10
[   19.791127]  [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0
[   19.791127]  [<ffffffff810e9868>] ? lock_release_non_nested+0x308/0x350
[   19.791127]  [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0
[   19.791127]  [<ffffffff815d1693>] ___sys_recvmsg+0x113/0x290
[   19.791127]  [<ffffffff810e40ff>] ? lock_release_holdtime.part.28+0xf/0x1a0
[   19.791127]  [<ffffffff810a5b5f>] ? up_read+0x1f/0x40
[   19.791127]  [<ffffffff81210197>] ? fget_light+0x3a7/0x510
[   19.791127]  [<ffffffff815d2172>] __sys_recvmsg+0x42/0x80
[   19.791127]  [<ffffffff815d21c2>] SyS_recvmsg+0x12/0x20
[   19.791127]  [<ffffffff81736619>] system_call_fastpath+0x16/0x1b

Comment 14 Josh Boyer 2013-08-21 14:57:37 UTC

(In reply to Tom London from comment #12) 
> So, at least for one boot, the very pronounced pause during boot and the
> above spew is gone.

Right.  That's what I expect to see with the change we made.  I doubt it is the cause of the spoke hang.

(In reply to Tom London from comment #13)
> Rebooted. No systemd-udevd spew.
> 
> Probably unrelated, but I see this:

Yes, that is unrelated.  It should be fixed in an updated kernel shortly, as upstream is aware of it and the patch that caused it should be reverted.

Comment 15 Adam Williamson 2013-08-21 16:59:31 UTC

Dropping blocker status from this and re-opening 983319, as it seems this isn't causing the anaconda bug and doesn't seem inherently blockery itself.

Comment 16 Josh Boyer 2014-07-07 13:49:04 UTC

*** Bug 1109970 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.