Bug 918511

Summary: Reduce the possibility of timed out jobs and do not continue to boot with an unsettled udev
Product: Red Hat Enterprise Linux 6
Reporter: Peter Rajnoha <prajnoha>
Component: udev
Assignee: Harald Hoyer <harald>
Status: CLOSED ERRATA
QA Contact: qe-baseos-daemons
Severity: urgent
Docs Contact:
Priority: high
Version: 6.4
CC: agk, azelinka, jprokes, jwest, kay, ovasik, prajnoha, qe-baseos-daemons, salmy, sauchter, sbradley, skito, udev-maint-list, ykawada, zkabelac
Target Milestone: rc
Target Release: 6.5
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The number of workers udevd spawns depended only on the RAM size.
Consequence: On machines with relatively large RAM and many disks, a lot of udevd workers ran in parallel, maxing out CPU and I/O. Hardware bottlenecks could then cause udev events to time out.
Fix: The number of udevd workers is now also limited by the CPU count and is significantly lower on machines with large RAM.
Result: With fewer udev workers running concurrently, bottlenecks are not hit as easily and fewer or no timeouts occur.
Cause2: start_udev called udevadm settle and timed out after the default 180 seconds.
Consequence2: Some devices were not completely assembled and the boot process continued, causing various failures.
Fix2: start_udev now waits indefinitely until udev has settled.
Result2: All devices are now assembled and the boot process should continue without errors.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 22:00:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 835616, 960054, 1005262    

Description Peter Rajnoha 2013-03-06 12:32:18 UTC
Description of problem:
If numerous events occur in the system at once, we can get into a situation in which event timeouts occur and udev simply kills the worker process that is handling the event.

For example, numerous events can be generated when many block devices are activated at once (see also bug #885978 and bug #917538 for examples where this causes real problems).

Actual results:
The udev worker process is killed without any notification and without any way to perform a failsafe operation. We end up with an incomplete/incorrect udev state and no way to detect, when later reading the udev database or processing events, that something went wrong.

Expected results:
Udev should have a special rule that is processed on event timeout. This would provide a way to notify any part of the system that depends on the udev (database) state, to recognize that something went wrong, and to make a fallback operation possible that deals with the error condition correctly.

This could be something like RUN_TIMEOUT+="...". It would work exactly like the existing RUN key, but fire only on timeouts.
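
For illustration only - RUN_TIMEOUT is the proposed key and does not exist in udev, and the helper script path is made up - such a rule could look like:

  SUBSYSTEM=="block", RUN_TIMEOUT+="/usr/local/sbin/notify-udev-timeout $env{DEVNAME}"

The helper would only be invoked when the event for that device timed out, giving the dependent subsystem (lvm in our case) a chance to mark the device state as unreliable.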

Additional info:
The question might arise what we should do if RUN_TIMEOUT itself causes a timeout. I suggest just ending up like we do today on timeout - if something like that happens, it is a very serious error and should be resolved individually.

Briefly:

Today we have:
 - event processing
 - timeout
 - kill

I'm suggesting:
 - event processing
 - timeout
 - RUN_TIMEOUT+=".."
 - timeout (if it happens even here)
 - kill

Comment 2 Kay Sievers 2013-03-22 09:45:54 UTC
We cannot start new programs when we try to recover from the fact that
things hang because we already started too many programs. When we run
into a timeout, we have already lost, or have already done too much, and
we cannot start executing additional error-handling programs as a result
of that.

The use of overly expensive tools like lvm needs to be limited to the
resources of the machine, or the amount of resources lvm takes needs to
be limited; tuning the timeouts or the error handling to push the limits
does not seem like a convincing strategy.

Comment 3 Zdenek Kabelac 2013-03-22 12:19:45 UTC
So - let's put some numbers behind 'expensive tools':

I'm playing with my oldish T61 with 4 GB RAM, a 3.8 kernel with some debugging enabled, systemd running in verbose mode, a fresh Rawhide and my git lvm. The LVM devices use a 'ram disk' as the back-end PV device, so they play in the same playground as scsi_debug.

--

time (modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle)
5.5s

vgchange -ay  with 200 linear LVs  (without lvmetad which is suboptimal)
5.2s

--

scsi_debug and 1700 devices:       1m 7s

vgchange -ay  with 1700 linear LVs:   52s

--

Using a higher number of devices (like 10000) leads to OOM, thanks to the huge number of forked udevd worker processes on my 4 GB system - it simply kills my desktop.

Looking at the CPU hogs - libvirtd, udisksd (massive leaking already reported as bug 922286) and gvfs-udisks2-volume-monitor seem to be eating a lot of CPU - so let's kill them and repeat (leaving just systemd-journal logging the systemd debug messages, which is still not exactly cheap...)

scsi_debug 200:    ~4s
vgchange -ay:      ~4s

scsi_debug 1700:  ~35s
vgchange -ay:     ~35s
   real	0m34.855s
   user	0m0.777s
   sys	0m2.767s
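
For reference, the test bed above can be assembled roughly like this (a sketch only; device counts, sizes and the VG name are illustrative, not verbatim what was run):

  # many small fake SCSI disks
  modprobe scsi_debug dev_size_mb=2 num_tgts=200
  udevadm settle

  # many small linear LVs on a ram-disk-backed PV
  modprobe brd rd_nr=1 rd_size=1048576        # one 1 GiB ram disk -> /dev/ram0
  pvcreate /dev/ram0
  vgcreate testvg /dev/ram0
  for i in $(seq 1 200); do
      lvcreate -n lv$i -L 1M testvg >/dev/null
  done
  vgchange -an testvg                         # deactivate, then time the reactivation
  time vgchange -ay testvg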

- what could be observed from the above results -

The lvm command itself takes a very small portion of the time/CPU - the major portion of the time is spent in udev processing.

The dm udev rules execute dmsetup twice, but that is still negligible, since scsi_debug performs at approximately the same speed for non-dm devices.


Here is the 'perf report' from the 'vgchange -ay' command above:

    16.36%  [kernel.kallsyms]     [k] pcpu_alloc_area
     4.37%  [kernel.kallsyms]     [k] __lock_acquire
     3.83%  [kernel.kallsyms]     [k] insert_vmalloc_vmlist
     3.32%  [kernel.kallsyms]     [k] lock_is_held
     2.31%  lvm                   [.] find_lv_in_vg_by_lvid
     2.12%  [kernel.kallsyms]     [k] bd_link_disk_holder
     1.87%  [kernel.kallsyms]     [k] lock_release
     1.64%  libc-2.17.so          [.] __GI___strncmp_ssse3
     1.64%  [kernel.kallsyms]     [k] lock_acquire
     1.58%  libc-2.17.so          [.] __memset_sse2
     1.47%  [kernel.kallsyms]     [k] debug_smp_processor_id
     1.38%  [kernel.kallsyms]     [k] sched_clock_local
     1.36%  [kernel.kallsyms]     [k] native_sched_clock
     1.36%  [kernel.kallsyms]     [k] __lock_is_held
     1.27%  [kernel.kallsyms]     [k] trace_hardirqs_off_caller
     1.23%  [kernel.kallsyms]     [k] mark_lock
     1.20%  [kernel.kallsyms]     [k] add_preempt_count
     1.14%  [kernel.kallsyms]     [k] mark_held_locks
     1.12%  [kernel.kallsyms]     [k] __module_address
     1.12%  [kernel.kallsyms]     [k] sub_preempt_count
     1.07%  libdevmapper.so.1.02  [.] _is_whitelisted_char
     1.01%  [kernel.kallsyms]     [k] strcmp
     0.98%  [kernel.kallsyms]     [k] local_clock
     0.89%  [kernel.kallsyms]     [k] rcu_lockdep_current_cpu_online
     0.89%  [kernel.kallsyms]     [k] debug_lockdep_rcu_enabled
     0.85%  [kernel.kallsyms]     [k] trace_hardirqs_on_caller
     0.81%  libc-2.17.so          [.] vfprintf
     0.78%  libdevmapper.so.1.02  [.] unmangle_string
     0.76%  [kernel.kallsyms]     [k] mutex_lock_nested
     0.76%  [kernel.kallsyms]     [k] time_hardirqs_off
     0.74%  [kernel.kallsyms]     [k] rcu_is_cpu_idle
     0.74%  [kernel.kallsyms]     [k] __might_sleep
     0.72%  [kernel.kallsyms]     [k] sysfs_link_sibling
     0.67%  libdevmapper.so.1.02  [.] mangle_string


now with the 'perf record -ag' version:
     2.61%          dmsetup    [k] __lock_acquire
     1.50%    systemd-udevd    [k] __lock_acquire
     1.33%          dmsetup    [k] lock_is_held
     1.15%   udisks-part-id    [k] __lock_acquire
     1.11%          dmsetup    [k] lock_release
     1.11%  udisks-dm-expor    [k] __lock_acquire
     0.98%          dmsetup    [k] page_fault
     0.96%          dmsetup    [k] lock_acquire
     0.95%          dmsetup    [k] debug_smp_processor_id
     0.93%          dmsetup    [k] native_sched_clock
     0.90%          dmsetup    [k] sched_clock_local
     0.79%         vgchange    [k] pcpu_alloc_area
     0.77%    systemd-udevd    [k] lock_is_held
     0.76%          dmsetup    [k] trace_hardirqs_off_caller
     0.75%          dmsetup    [k] add_preempt_count
     0.67%    systemd-udevd    [k] copy_pte_range

For completeness, 'perf record -ag' for modprobe scsi_debug:

     4.20%    systemd-udevd  [k] __lock_acquire
     1.90%    systemd-udevd  [k] lock_release
     1.61%    systemd-udevd  [k] lock_acquire
     1.52%    systemd-udevd  [k] sched_clock_local
     1.50%    systemd-udevd  [k] native_sched_clock
     1.45%    systemd-udevd  [k] trace_hardirqs_off_caller
     1.45%   udisks-part-id  [k] __lock_acquire
     1.30%    systemd-udevd  [k] lock_is_held
     1.21%    systemd-udevd  [k] debug_smp_processor_id
     1.03%    systemd-udevd  [k] mutex_lock_nested
     1.02%    systemd-udevd  [k] local_clock
     0.88%    systemd-udevd  [k] mark_lock
     0.86%    systemd-udevd  [k] add_preempt_count
     0.85%    systemd-udevd  [k] time_hardirqs_off
     0.76%    systemd-udevd  [k] sub_preempt_count
     0.73%    systemd-udevd  [k] do_raw_spin_trylock
     0.72%          scsi_id  [k] __lock_acquire
     0.69%  systemd-journal  [k] __lock_acquire
     0.68%    systemd-udevd  [k] mark_held_locks
     0.67%   udisks-part-id  [k] lock_is_held
     0.63%      kworker/u:3  [k] pcpu_alloc_area
     0.62%    systemd-udevd  [k] trace_hardirqs_off
     0.61%   udisks-part-id  [k] lock_release
     0.59%    systemd-udevd  [k] trace_hardirqs_on_caller
     0.57%   udisks-part-id  [k] page_fault
     0.54%    systemd-udevd  [k] __lock_is_held
     0.53%   udisks-part-id  [k] lock_acquire
     0.53%   udisks-part-id  [k] debug_smp_processor_id
     0.52%    systemd-udevd  [k] sched_clock_cpu
     0.51%    systemd-udevd  [k] __mutex_unlock_slowpath
     0.50%   udisks-part-id  [k] native_sched_clock
     0.49%   udisks-part-id  [k] sched_clock_local
     0.45%    systemd-udevd  [.] _int_malloc
     0.44%          systemd  [k] __lock_acquire


The same analysis should now happen on a RHEL system to compare for differences.

IMHO udev is simply forking way too many tasks - would lowering the number of tasks increase performance and avoid the timeouts?

Comment 4 Harald Hoyer 2013-03-22 12:59:47 UTC
kernel command line: "udev.children-max="

Comment 5 Harald Hoyer 2013-03-22 13:03:07 UTC
or in a running system:

# udevadm control --children-max=value
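
For example, to pick a limit in the same spirit as the eventual fix (CPU-count based) - the factor here is purely illustrative, not the formula udevd uses:

  udevadm control --children-max=$(( $(nproc) * 2 ))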

Comment 6 Zdenek Kabelac 2013-03-22 13:05:45 UTC
Yep, known.

However, the 'default' behavior is simply wrong and needs to be fixed.

Comment 7 Kay Sievers 2013-03-22 13:17:19 UTC
Can you find out what value provides the best compromise on your box?

Limiting the concurrency certainly sounds like a better approach than
running against the wall and then picking up the pieces with error
handling that forks yet more things. :)
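
One way to hunt for such a value, reusing the testvg test bed sketched in comment 3 (the candidate limits are illustrative):

  for n in 4 8 16 32 64; do
      udevadm control --children-max=$n
      vgchange -an testvg
      /usr/bin/time -f "children-max=$n: %e s" \
          sh -c 'vgchange -ay testvg; udevadm settle'
  done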

Comment 9 Peter Rajnoha 2013-03-27 09:26:41 UTC
(In reply to comment #8)
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=8cc3f8c0bcd23bb68166cb197a4c541d7621b19c

Thanks!

That is surely a nice patch to have, of course, but we still haven't provided a full solution to the problems that can arise - we've just made them less probable with this patch.

So my question is: how can we detect the validity of the udev database if a timeout happens, either because the system is totally overloaded or because some command hung and failed? If we want to rely on the udev database, it would be good to have a way to detect that it is "tainted" when the timeout/kill happens for whatever reason...
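
(For completeness, the closest thing to a taint check we have today is to verify that properties our own rules should have set are actually present - for example, the dm rules export DM_UDEV_RULES_VSN. A rough sketch, with /dev/dm-0 just as an example device:

  udevadm info --query=property --name=/dev/dm-0 \
      | grep -q '^DM_UDEV_RULES_VSN=' \
      || echo "udev db record for /dev/dm-0 looks incomplete" >&2

This only proves the rules ran at all, not that the whole RUN queue completed, which is exactly the gap this report is about.)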

Comment 10 Peter Rajnoha 2013-03-27 09:36:54 UTC
Just one more question for my curiosity: is there at least a transaction when records are written to the udev database? If a timeout happens, the command is killed and the rest of the RUN queue is cut off, does the udev database end up in the previous state, or can it contain half of the old information and half of the new information that was updated before the RUN queue died?

Comment 11 Peter Rajnoha 2013-03-27 09:57:49 UTC
Also, now that systemd and udev are one project, and considering that systemd's journal has quite a nice bunch of features, could part of that interface be reused for the udev database - as another instance of the same interface - to remove the symlinky nature of the current udev database, to aid lookups, to provide more integrity, etc.? Just my curiosity; I have no idea whether that is viable at all...

Comment 13 Harald Hoyer 2013-05-28 13:00:54 UTC
(In reply to Harald Hoyer from comment #8)
> http://cgit.freedesktop.org/systemd/systemd/commit/
> ?id=8cc3f8c0bcd23bb68166cb197a4c541d7621b19c

devel_ack for this patch.

Comment 15 Harald Hoyer 2013-07-11 10:11:12 UTC
To clarify:

- We will not add something like RUN_TIMEOUT="..."
- We will reduce the maximum number of udev child workers
- We will patch start_udev not to time out so early, so that LVM assembly can still happen later on (see the sketch below)
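
Conceptually, the start_udev change amounts to something like this (a sketch only, not the shipped patch; it assumes udevadm settle exits non-zero when it gives up with events still queued):

  while ! udevadm settle --timeout=180; do
      echo "udev event queue still not empty, continuing to wait" >&2
  done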

Comment 22 Harald Hoyer 2013-08-09 10:29:40 UTC
First alpha build for testing:
http://people.redhat.com/harald/downloads/udev/udev-147-2.47.el6/

It would be nice if you could test it.

Comment 28 errata-xmlrpc 2013-11-21 22:00:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1675.html