Description of problem:
If there are numerous events occurring in the system at once, we might get into a situation in which event timeouts occur and udev just kills the worker process that is handling the event.
For example, numerous events can be generated when many block devices are activated at once (see also bug #885978 and bug #917538 for examples where this causes real problems).
The udev process is killed without any notification and without any way to perform a failsafe operation, leaving an incomplete/incorrect udev state with no way to detect that something went wrong when reading the udev database or processing events afterwards.
Udev should have a special udev rule that is processed on event timeout. That would provide a way to notify any part of the system that depends on the udev (database) state, a way to recognize that something went wrong, and thus the possibility of a fallback operation that deals with the error condition correctly.
This could be something like RUN_TIMEOUT+="..." - exactly like the existing RUN udev rule, but fired only on timeouts.
The question might arise what we should do if the RUN_TIMEOUT program itself causes a timeout. I suggest just ending as we do today on timeout - it is a very serious error if something like that happens and should be resolved individually.
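To make the proposal concrete, a rule using the suggested syntax might look like this (purely hypothetical: RUN_TIMEOUT was never implemented, and the fallback script path is made up for illustration):

```
# Hypothetical syntax: run a fallback program only when this event's
# RUN programs were killed because of an event timeout.
ACTION=="add", SUBSYSTEM=="block", RUN_TIMEOUT+="/usr/local/bin/udev-timeout-fallback.sh $kernel"
```

The fallback program could, for example, mark the device's database entry as tainted, so that later readers know the recorded state may be incomplete.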
Today we have:
- event processing
- timeout

The proposal would add:
- event processing
- timeout
- error handling
- timeout (if it happens even here)
We cannot start new programs when we try to recover from the fact that
things hang, because we have already started too many programs. When we
run into a timeout, we have already lost or done too much, and we
cannot start executing additional error programs as a result of that.
The use of overly expensive tools like lvm needs to be limited to the
resources of the machine or to the amount of resources lvm takes; tuning
the timeouts or the error handling to push the limits does not seem
like a convincing strategy.
So let's put some numbers behind 'expensive tools':
Testbed: my oldish T61 with 4 GB RAM, a 3.8 kernel with some debugging enabled, systemd running in verbose mode, fresh Rawhide and my git lvm. The LVM devices use a ram disk as the back-end PV device, to play in the same playground as scsi_debug.
time (modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle)
vgchange -ay with 200 linear LVs (without lvmetad, which is suboptimal)
scsi_debug and 1700 devices: 1m 7s
vgchange -ay with 1700 linear LVs: 52s
Using a higher number of devices (like 10000) leads to OOM, thanks to the huge number of forked udevd processes on my mere 4 GB system - simply killing my desktop.
Looking at the CPU hogs: libvirtd, udisksd (massive leaking already reported as bug 922286) and gvfs-udisks2-volume-monitor seem to be eating a lot of CPU - so let's kill them and repeat (leaving just systemd-journal logging of systemd debug messages, which is still not exactly cheap...):
scsi_debug 200: ~4s
vgchange -ay: ~4s
scsi_debug 1700: ~35s
vgchange -ay: ~35s
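The timings above can be reproduced with a small helper. This is only a sketch, assuming a root shell with scsi_debug and LVM available; the device counts mirror the experiment above:

```shell
# time_cmd: run a command, discard its output, and print the elapsed
# wall-clock time in seconds (GNU date with nanosecond resolution assumed).
time_cmd() {
    local start end
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f\n", e - s }'
}

# Example usage (root only), mirroring the measurements above:
# time_cmd sh -c 'modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle'
# time_cmd vgchange -ay
```

Timing `udevadm settle` separately from the lvm command is what separates udev processing time from the tool's own work.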
What can be observed from the above results:
The lvm command itself takes a very small portion of the time/CPU - the major portion of the time is spent in udev processing.
The dm udev rules execute dmsetup twice - but that is still negligible, since scsi_debug performs at approximately the same speed for non-dm devices.
Here is the 'perf report' from the 'vgchange -ay' command above:
16.36% [kernel.kallsyms] [k] pcpu_alloc_area
4.37% [kernel.kallsyms] [k] __lock_acquire
3.83% [kernel.kallsyms] [k] insert_vmalloc_vmlist
3.32% [kernel.kallsyms] [k] lock_is_held
2.31% lvm [.] find_lv_in_vg_by_lvid
2.12% [kernel.kallsyms] [k] bd_link_disk_holder
1.87% [kernel.kallsyms] [k] lock_release
1.64% libc-2.17.so [.] __GI___strncmp_ssse3
1.64% [kernel.kallsyms] [k] lock_acquire
1.58% libc-2.17.so [.] __memset_sse2
1.47% [kernel.kallsyms] [k] debug_smp_processor_id
1.38% [kernel.kallsyms] [k] sched_clock_local
1.36% [kernel.kallsyms] [k] native_sched_clock
1.36% [kernel.kallsyms] [k] __lock_is_held
1.27% [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.23% [kernel.kallsyms] [k] mark_lock
1.20% [kernel.kallsyms] [k] add_preempt_count
1.14% [kernel.kallsyms] [k] mark_held_locks
1.12% [kernel.kallsyms] [k] __module_address
1.12% [kernel.kallsyms] [k] sub_preempt_count
1.07% libdevmapper.so.1.02 [.] _is_whitelisted_char
1.01% [kernel.kallsyms] [k] strcmp
0.98% [kernel.kallsyms] [k] local_clock
0.89% [kernel.kallsyms] [k] rcu_lockdep_current_cpu_online
0.89% [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
0.85% [kernel.kallsyms] [k] trace_hardirqs_on_caller
0.81% libc-2.17.so [.] vfprintf
0.78% libdevmapper.so.1.02 [.] unmangle_string
0.76% [kernel.kallsyms] [k] mutex_lock_nested
0.76% [kernel.kallsyms] [k] time_hardirqs_off
0.74% [kernel.kallsyms] [k] rcu_is_cpu_idle
0.74% [kernel.kallsyms] [k] __might_sleep
0.72% [kernel.kallsyms] [k] sysfs_link_sibling
0.67% libdevmapper.so.1.02 [.] mangle_string
Now the 'perf record -ag' version:
2.61% dmsetup [k] __lock_acquire
1.50% systemd-udevd [k] __lock_acquire
1.33% dmsetup [k] lock_is_held
1.15% udisks-part-id [k] __lock_acquire
1.11% dmsetup [k] lock_release
1.11% udisks-dm-expor [k] __lock_acquire
0.98% dmsetup [k] page_fault
0.96% dmsetup [k] lock_acquire
0.95% dmsetup [k] debug_smp_processor_id
0.93% dmsetup [k] native_sched_clock
0.90% dmsetup [k] sched_clock_local
0.79% vgchange [k] pcpu_alloc_area
0.77% systemd-udevd [k] lock_is_held
0.76% dmsetup [k] trace_hardirqs_off_caller
0.75% dmsetup [k] add_preempt_count
0.67% systemd-udevd [k] copy_pte_range
For completeness, 'perf record -ag' for modprobe scsi_debug:
4.20% systemd-udevd [k] __lock_acquire
1.90% systemd-udevd [k] lock_release
1.61% systemd-udevd [k] lock_acquire
1.52% systemd-udevd [k] sched_clock_local
1.50% systemd-udevd [k] native_sched_clock
1.45% systemd-udevd [k] trace_hardirqs_off_caller
1.45% udisks-part-id [k] __lock_acquire
1.30% systemd-udevd [k] lock_is_held
1.21% systemd-udevd [k] debug_smp_processor_id
1.03% systemd-udevd [k] mutex_lock_nested
1.02% systemd-udevd [k] local_clock
0.88% systemd-udevd [k] mark_lock
0.86% systemd-udevd [k] add_preempt_count
0.85% systemd-udevd [k] time_hardirqs_off
0.76% systemd-udevd [k] sub_preempt_count
0.73% systemd-udevd [k] do_raw_spin_trylock
0.72% scsi_id [k] __lock_acquire
0.69% systemd-journal [k] __lock_acquire
0.68% systemd-udevd [k] mark_held_locks
0.67% udisks-part-id [k] lock_is_held
0.63% kworker/u:3 [k] pcpu_alloc_area
0.62% systemd-udevd [k] trace_hardirqs_off
0.61% udisks-part-id [k] lock_release
0.59% systemd-udevd [k] trace_hardirqs_on_caller
0.57% udisks-part-id [k] page_fault
0.54% systemd-udevd [k] __lock_is_held
0.53% udisks-part-id [k] lock_acquire
0.53% udisks-part-id [k] debug_smp_processor_id
0.52% systemd-udevd [k] sched_clock_cpu
0.51% systemd-udevd [k] __mutex_unlock_slowpath
0.50% udisks-part-id [k] native_sched_clock
0.49% udisks-part-id [k] sched_clock_local
0.45% systemd-udevd [.] _int_malloc
0.44% systemd [k] __lock_acquire
The same analysis should now happen on a RHEL system to compare for differences.
IMHO udev is simply forking way too many tasks - does lowering the number of tasks increase performance and avoid the timeouts?
kernel command line: "udev.children-max="
or in a running system:
# udevadm control --children-max=value
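One way to look for a good children-max value is to sweep a few settings and time the same device storm under each. A rough sketch, assuming a root shell and the scsi_debug workload from the measurements above (the candidate values are arbitrary, not recommendations):

```shell
# sweep_children_max: for each candidate limit, reload scsi_debug and
# time how long it takes udev to settle. Root-only; call explicitly.
sweep_children_max() {
    local n start
    for n in 4 8 16 64 256; do
        udevadm control --children-max="$n"
        modprobe -r scsi_debug 2>/dev/null
        start=$(date +%s)
        modprobe scsi_debug dev_size_mb=2 num_tgts=200
        udevadm settle
        echo "children-max=$n: $(( $(date +%s) - start ))s"
    done
}
```

The value with the lowest settle time on a given machine is the interesting one; it will differ between a 4 GB laptop and a large server.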
However the 'default' behavior is simply wrong and needs to be fixed.
Can you find out what value provides the best compromise on your machine?
It certainly sounds like the better approach to limit the concurrency, instead of running against the wall and picking up the pieces later with error handling by forking new processes.
(In reply to comment #8)
That is surely a nice patch to have, of course, but we still haven't provided a full solution to the problems that could arise - we've just made them less probable with this patch.
So my question is: how can we detect the validity of the udev database if a timeout happens, either because the system is totally overloaded or because some command hung and failed? If we want to rely on the udev database, it would be good to have a way to detect it being "tainted" when the timeout/kill happens, for whatever reason...
Just one more question out of curiosity: is there at least a transaction when records are written to the udev database? If a timeout happens, the command is killed and the rest of the RUN queue is cut off, does the udev database end up in its previous state, or does it contain half old information and half new information that was updated before the RUN queue died?
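There is no taint flag to query, but one can at least compare the libudev view of a device with the raw on-disk record behind it. A heuristic sketch, assuming the modern /run/udev/data layout (older releases kept the database elsewhere, e.g. under /dev/.udev):

```shell
# show_udev_record: print a block device's properties as libudev reports
# them, followed by the raw on-disk database record (bash, for $((16#...))).
show_udev_record() {
    local dev="$1" maj min            # e.g. /dev/sda
    udevadm info --query=property --name="$dev"
    maj=$(stat -c '%t' "$dev")        # major number, hex
    min=$(stat -c '%T' "$dev")        # minor number, hex
    cat "/run/udev/data/b$((16#$maj)):$((16#$min))" 2>/dev/null
}
```

A record that is missing or visibly truncated relative to what the rules should have written is a hint that event processing was cut short.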
Also, now that systemd and udev are one project, and considering systemd's journal, which has quite a nice bunch of features: could part of that interface be reused for the udev database, as another instance of the same interface, to remove the symlinky nature of the current udev database, to aid lookups and to provide more integrity etc.? Just my curiosity - I've no idea whether that is viable at all...
(In reply to Harald Hoyer from comment #8)
devel_ack for this patch.
- We will not add something like a RUN_TIMEOUT="..."
- We will reduce the number of udev max children
- We will patch start_udev to not time out so early, so that LVM assembly can happen later on
First alpha build for testing:
It would be nice if you could test it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.