Bug 918511
Summary: Reduce the possibility of timed out jobs and do not continue to boot with an unsettled udev

Product: Red Hat Enterprise Linux 6
Reporter: Peter Rajnoha <prajnoha>
Component: udev
Assignee: Harald Hoyer <harald>
Status: CLOSED ERRATA
QA Contact: qe-baseos-daemons
Severity: urgent
Docs Contact:
Priority: high
Version: 6.4
CC: agk, azelinka, jprokes, jwest, kay, ovasik, prajnoha, qe-baseos-daemons, salmy, sauchter, sbradley, skito, udev-maint-list, ykawada, zkabelac
Target Milestone: rc
Target Release: 6.5
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause:
The number of workers udevd spawns depends only on the amount of RAM.
Consequence:
On machines with relatively large RAM and lots of disks, many udevd workers run in parallel, maximizing CPU and I/O. Hardware bottlenecks can then cause udev events to time out.
Fix:
The number of udevd workers is now also limited by the CPU count, and is significantly lower on machines with large RAM.
Result:
Fewer udev workers run concurrently, so they do not hit bottlenecks as easily, and timeouts happen rarely or not at all.
Cause2:
start_udev called "udevadm settle", which timed out after the default 180 seconds.
Consequence2:
Some devices were not completely assembled, and the boot process continued, causing various failures.
Fix2:
start_udev now waits indefinitely until udev has settled.
Result2:
All devices are now assembled and the boot process should continue without errors.
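Fix2 above replaces the single bounded wait with an unbounded retry. A minimal sketch of that retry-until-settled logic, with `udevadm settle` mocked by a local function so the example is self-contained; the real call in start_udev would be `udevadm settle --timeout=180`, and the assumption that settle exits nonzero while the event queue is still busy is ours, not quoted from the script:

```shell
# Sketch only: keep retrying settle instead of aborting after one fixed
# timeout.  'settle' stands in for 'udevadm settle --timeout=180' and is
# mocked here to succeed on the third call.
attempts=0
settle() {
    attempts=$(( attempts + 1 ))
    [ "$attempts" -ge 3 ]   # pretend the event queue drains on the 3rd try
}

until settle; do
    echo "udev queue not yet settled (attempt $attempts), waiting..." >&2
done
echo "settled after $attempts attempts"
```

With the mock, the loop body runs twice and the script reports settling on the third attempt; with the real command it simply blocks until the queue is empty.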
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-11-21 22:00:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 835616, 960054, 1005262
Description
Peter Rajnoha
2013-03-06 12:32:18 UTC
We cannot start new programs when we try to recover from the fact that things hang because we already started too many programs. When we run into a timeout, we have already lost or already done too much, and we cannot start executing additional error programs as a result. The use of overly expensive tools like LVM needs to be limited by the resources of the machine or by the amount of resources LVM takes; tuning the timeouts or the error handling to push the limits does not seem like a convincing strategy.

So let's put some numbers on 'expensive tools'. Playing with my oldish T61 (4 GiB RAM), a 3.8 kernel which has some debugging enabled and runs systemd in verbose mode, fresh Rawhide and my git LVM. The LVM devices use a RAM disk as the backing PV device, so they play in the same playground as scsi_debug.

time $(modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle): 5.5s
vgchange -ay with 200 linear LVs (without lvmetad, which is suboptimal): 5.2s

scsi_debug with 1700 devices: 1m 7s
vgchange -ay with 1700 linear LVs: 52s

Using a higher number of devices (like 10000) leads to OOM thanks to the huge number of forked udevd processes on my mere 4 GiB system, simply killing my desktop.

Looking at the CPU hogs: libvirtd, udisksd (massive leaking already reported as bug 922286) and gvfs-udisks2-volume-monitor seem to be eating a lot of CPU, so let's kill them and repeat (leaving just systemd-journal logging of systemd debug messages, which is still not exactly cheap...):

scsi_debug 200: ~4s
vgchange -ay: ~4s
scsi_debug 1700: ~35s
vgchange -ay: ~35s

real 0m34.855s
user 0m0.777s
sys 0m2.767s

What can be observed from the above results: the lvm command itself takes a very small portion of the time/CPU; the major portion of the time is spent in udev processing. The dm udev rules execute dmsetup twice, but that is still negligible, since scsi_debug performs at approximately the same speed for non-dm devices.
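The measurement procedure above can be captured in a small harness. This is a dry-run sketch: each command is echoed instead of executed, since modprobe and vgchange need root and a scratch machine; dropping the echo in run() would perform the steps for real.

```shell
# Dry-run harness: prints each benchmark step instead of executing it.
run() { echo "+ $*"; }              # replace the echo with "$@" to really run

run modprobe scsi_debug dev_size_mb=2 num_tgts=200
run udevadm settle                  # wait for the uevent storm to finish
run vgchange -ay                    # activate the test LVs on top of scsi_debug
run modprobe -r scsi_debug          # tear down between runs
```

Wrapping each step in `time` (as in the comment above) then gives comparable numbers across different device counts.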
Here is the 'perf report' from the 'vgchange -ay' command above:

16.36%  [kernel.kallsyms]     [k] pcpu_alloc_area
 4.37%  [kernel.kallsyms]     [k] __lock_acquire
 3.83%  [kernel.kallsyms]     [k] insert_vmalloc_vmlist
 3.32%  [kernel.kallsyms]     [k] lock_is_held
 2.31%  lvm                   [.] find_lv_in_vg_by_lvid
 2.12%  [kernel.kallsyms]     [k] bd_link_disk_holder
 1.87%  [kernel.kallsyms]     [k] lock_release
 1.64%  libc-2.17.so          [.] __GI___strncmp_ssse3
 1.64%  [kernel.kallsyms]     [k] lock_acquire
 1.58%  libc-2.17.so          [.] __memset_sse2
 1.47%  [kernel.kallsyms]     [k] debug_smp_processor_id
 1.38%  [kernel.kallsyms]     [k] sched_clock_local
 1.36%  [kernel.kallsyms]     [k] native_sched_clock
 1.36%  [kernel.kallsyms]     [k] __lock_is_held
 1.27%  [kernel.kallsyms]     [k] trace_hardirqs_off_caller
 1.23%  [kernel.kallsyms]     [k] mark_lock
 1.20%  [kernel.kallsyms]     [k] add_preempt_count
 1.14%  [kernel.kallsyms]     [k] mark_held_locks
 1.12%  [kernel.kallsyms]     [k] __module_address
 1.12%  [kernel.kallsyms]     [k] sub_preempt_count
 1.07%  libdevmapper.so.1.02  [.] _is_whitelisted_char
 1.01%  [kernel.kallsyms]     [k] strcmp
 0.98%  [kernel.kallsyms]     [k] local_clock
 0.89%  [kernel.kallsyms]     [k] rcu_lockdep_current_cpu_online
 0.89%  [kernel.kallsyms]     [k] debug_lockdep_rcu_enabled
 0.85%  [kernel.kallsyms]     [k] trace_hardirqs_on_caller
 0.81%  libc-2.17.so          [.] vfprintf
 0.78%  libdevmapper.so.1.02  [.] unmangle_string
 0.76%  [kernel.kallsyms]     [k] mutex_lock_nested
 0.76%  [kernel.kallsyms]     [k] time_hardirqs_off
 0.74%  [kernel.kallsyms]     [k] rcu_is_cpu_idle
 0.74%  [kernel.kallsyms]     [k] __might_sleep
 0.72%  [kernel.kallsyms]     [k] sysfs_link_sibling
 0.67%  libdevmapper.so.1.02  [.] mangle_string

Now the 'perf record -ag' version:

 2.61%  dmsetup          [k] __lock_acquire
 1.50%  systemd-udevd    [k] __lock_acquire
 1.33%  dmsetup          [k] lock_is_held
 1.15%  udisks-part-id   [k] __lock_acquire
 1.11%  dmsetup          [k] lock_release
 1.11%  udisks-dm-expor  [k] __lock_acquire
 0.98%  dmsetup          [k] page_fault
 0.96%  dmsetup          [k] lock_acquire
 0.95%  dmsetup          [k] debug_smp_processor_id
 0.93%  dmsetup          [k] native_sched_clock
 0.90%  dmsetup          [k] sched_clock_local
 0.79%  vgchange         [k] pcpu_alloc_area
 0.77%  systemd-udevd    [k] lock_is_held
 0.76%  dmsetup          [k] trace_hardirqs_off_caller
 0.75%  dmsetup          [k] add_preempt_count
 0.67%  systemd-udevd    [k] copy_pte_range

For completeness, 'perf record -ag' for modprobe scsi_debug:

 4.20%  systemd-udevd    [k] __lock_acquire
 1.90%  systemd-udevd    [k] lock_release
 1.61%  systemd-udevd    [k] lock_acquire
 1.52%  systemd-udevd    [k] sched_clock_local
 1.50%  systemd-udevd    [k] native_sched_clock
 1.45%  systemd-udevd    [k] trace_hardirqs_off_caller
 1.45%  udisks-part-id   [k] __lock_acquire
 1.30%  systemd-udevd    [k] lock_is_held
 1.21%  systemd-udevd    [k] debug_smp_processor_id
 1.03%  systemd-udevd    [k] mutex_lock_nested
 1.02%  systemd-udevd    [k] local_clock
 0.88%  systemd-udevd    [k] mark_lock
 0.86%  systemd-udevd    [k] add_preempt_count
 0.85%  systemd-udevd    [k] time_hardirqs_off
 0.76%  systemd-udevd    [k] sub_preempt_count
 0.73%  systemd-udevd    [k] do_raw_spin_trylock
 0.72%  scsi_id          [k] __lock_acquire
 0.69%  systemd-journal  [k] __lock_acquire
 0.68%  systemd-udevd    [k] mark_held_locks
 0.67%  udisks-part-id   [k] lock_is_held
 0.63%  kworker/u:3      [k] pcpu_alloc_area
 0.62%  systemd-udevd    [k] trace_hardirqs_off
 0.61%  udisks-part-id   [k] lock_release
 0.59%  systemd-udevd    [k] trace_hardirqs_on_caller
 0.57%  udisks-part-id   [k] page_fault
 0.54%  systemd-udevd    [k] __lock_is_held
 0.53%  udisks-part-id   [k] lock_acquire
 0.53%  udisks-part-id   [k] debug_smp_processor_id
 0.52%  systemd-udevd    [k] sched_clock_cpu
 0.51%  systemd-udevd    [k] __mutex_unlock_slowpath
 0.50%  udisks-part-id   [k] native_sched_clock
 0.49%  udisks-part-id   [k] sched_clock_local
 0.45%  systemd-udevd    [.] _int_malloc
 0.44%  systemd          [k] __lock_acquire

The same analysis should now happen on a RHEL system to compare for differences.

IMHO udev is simply forking way too many tasks. Does lowering the number of tasks increase performance and avoid the timeouts? This can be tuned on the kernel command line with "udev.children-max=" or in a running system with:

# udevadm control --children-max=value

Yep, known. However, the 'default' behavior is simply wrong and needs to be fixed.

Can you find out what value provides the best compromise on your box? It certainly sounds like the better approach to limit the concurrency instead of running against the wall and picking up the pieces later with error handling by forking new things. :)

(In reply to comment #8)
> http://cgit.freedesktop.org/systemd/systemd/commit/?id=8cc3f8c0bcd23bb68166cb197a4c541d7621b19c

Thanks! That is surely a nice patch to have, of course, but we still haven't provided a full solution to the problems that could arise; we've just made them less probable with this patch. So my question is: how can we detect the validity of the udev database if the timeout happens, either because the system is totally overloaded or because some command hung and failed? If we wanted to rely on the udev database, it would be good to have a way to detect it being "tainted" when the timeout/kill happens for whatever reason...

Just one more question for my curiosity: is there at least a transaction when records are written to the udev database? If the timeout happens, the command is killed and the rest of the RUN queue is cut off, does the udev database end up with the previous state, or does it contain half the old information and half the new information that was updated before the RUN queue died?
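To illustrate the direction of the fix discussed in this thread: cap the worker count by CPU as well as RAM and keep the smaller value. Both formulas below are made-up assumptions for the sake of the example, not the code udevd actually ships.

```shell
# Illustrative only: combine a RAM-derived worker cap with a CPU-derived
# one and take the smaller.  The two formulas are assumptions.
ram_mb=32768                     # e.g. a 32 GiB server
cpus=8

ram_cap=$(( ram_mb / 8 ))        # RAM-only rule: would allow 4096 workers(!)
cpu_cap=$(( 8 + cpus * 2 ))      # CPU-aware rule: 24 workers

if [ "$cpu_cap" -lt "$ram_cap" ]; then
    children_max=$cpu_cap
else
    children_max=$ram_cap
fi
echo "children_max=$children_max"
```

Whatever default the computation yields, it can still be overridden at boot via the udev.children-max= kernel parameter or at runtime with udevadm control --children-max=value, as quoted in the thread.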
Also, now that systemd and udev are in one, and considering systemd's journal, which has quite a nice bunch of features, could part of that interface be reused for the udev database as another instance, to remove the symlinky nature of the current udev database, to aid lookups, and to provide more integrity etc.? Just my curiosity; I have no idea whether that is viable at all...

(In reply to Harald Hoyer from comment #8)
> http://cgit.freedesktop.org/systemd/systemd/commit/?id=8cc3f8c0bcd23bb68166cb197a4c541d7621b19c

devel_ack for this patch.

To clarify:
- We will not add something like a RUN_TIMEOUT="..."
- We will reduce the maximum number of udev children
- We will patch start_udev to not time out so early, so that LVM assembly can happen later on

First alpha build for testing: http://people.redhat.com/harald/downloads/udev/udev-147-2.47.el6/
It would be nice if you could test it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1675.html