Description of problem:
If there are numerous events occurring in the system at once, we might get into a situation in which event timeouts occur and udev just kills the worker process that is handling the event.
For example, numerous events can be generated when many block devices are activated at once (see also bug #885978 and bug #917538 for examples where this causes real problems).
The udev process is killed without any notification and without any way to perform a failsafe operation, leaving an incomplete/incorrect udev state with no way to detect that something went wrong when reading the udev database or processing events afterwards.
Udev should have a special udev rule that is processed on event timeout. That would provide a way to notify any part of the system that depends on the udev (database) state, a way to recognize that something went wrong, and thus the possibility of a fallback operation that deals with the error condition correctly.
This could be something like RUN_TIMEOUT+="..." - exactly like the existing RUN udev rule, but fired only on timeouts.
The question might arise what we should do if the RUN_TIMEOUT program itself causes a timeout. I suggest just ending as we do today on timeout - it is a very serious error if something like that happens and should be resolved individually.
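To make the proposal concrete, a rule using the suggested syntax might look like this (purely hypothetical: RUN_TIMEOUT was never implemented, and the fallback script path is made up for illustration):

```
# Hypothetical syntax: run a fallback program only when this event's
# RUN programs were killed because of an event timeout.
ACTION=="add", SUBSYSTEM=="block", RUN_TIMEOUT+="/usr/local/bin/udev-timeout-fallback.sh $kernel"
```

The fallback program could, for example, mark the device's database entry as tainted, so that later readers know the recorded state may be incomplete.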
Today we have:
- event processing
- timeout

The proposal would add:
- event processing
- timeout
- error handling
- timeout (if it happens even here)
We cannot start new programs when we try to recover from the fact that
things hang, because we have already started too many programs. When we
run into a timeout, we have already lost or done too much, and we
cannot start executing additional error programs as a result of that.
The use of overly expensive tools like lvm needs to be limited to the
resources of the machine or to the amount of resources lvm takes; tuning
the timeouts or the error handling to push the limits does not seem
like a convincing strategy.
So let's put some numbers behind 'expensive tools':
Testbed: my oldish T61 with 4 GB RAM, a 3.8 kernel with some debugging enabled, systemd running in verbose mode, fresh Rawhide and my git lvm. The LVM devices use a ram disk as the back-end PV device, to play in the same playground as scsi_debug.
time (modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle)
vgchange -ay with 200 linear LVs (without lvmetad, which is suboptimal)
scsi_debug and 1700 devices: 1m 7s
vgchange -ay with 1700 linear LVs: 52s
Using a higher number of devices (like 10000) leads to OOM, thanks to the huge number of forked udevd processes on my mere 4 GB system - simply killing my desktop.
Looking at the CPU hogs: libvirtd, udisksd (massive leaking already reported as bug 922286) and gvfs-udisks2-volume-monitor seem to be eating a lot of CPU - so let's kill them and repeat (leaving just systemd-journal logging of systemd debug messages, which is still not exactly cheap...):
scsi_debug 200: ~4s
vgchange -ay: ~4s
scsi_debug 1700: ~35s
vgchange -ay: ~35s
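The timings above can be reproduced with a small helper. This is only a sketch, assuming a root shell with scsi_debug and LVM available; the device counts mirror the experiment above:

```shell
# time_cmd: run a command, discard its output, and print the elapsed
# wall-clock time in seconds (GNU date with nanosecond resolution assumed).
time_cmd() {
    local start end
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    awk -v s="$start" -v e="$end" 'BEGIN { printf "%.1f\n", e - s }'
}

# Example usage (root only), mirroring the measurements above:
# time_cmd sh -c 'modprobe scsi_debug dev_size_mb=2 num_tgts=200; udevadm settle'
# time_cmd vgchange -ay
```

Timing `udevadm settle` separately from the lvm command is what separates udev processing time from the tool's own work.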
What can be observed from the above results:
The lvm command itself takes a very small portion of the time/CPU - the major portion of the time is spent in udev processing.
The dm udev rules execute dmsetup twice - but that is still negligible, since scsi_debug performs at approximately the same speed for non-dm devices.
Here is the 'perf report' from the 'vgchange -ay' command above:
16.36% [kernel.kallsyms] [k] pcpu_alloc_area
4.37% [kernel.kallsyms] [k] __lock_acquire
3.83% [kernel.kallsyms] [k] insert_vmalloc_vmlist
3.32% [kernel.kallsyms] [k] lock_is_held
2.31% lvm [.] find_lv_in_vg_by_lvid
2.12% [kernel.kallsyms] [k] bd_link_disk_holder
1.87% [kernel.kallsyms] [k] lock_release
1.64% libc-2.17.so [.] __GI___strncmp_ssse3
1.64% [kernel.kallsyms] [k] lock_acquire
1.58% libc-2.17.so [.] __memset_sse2
1.47% [kernel.kallsyms] [k] debug_smp_processor_id
1.38% [kernel.kallsyms] [k] sched_clock_local
1.36% [kernel.kallsyms] [k] native_sched_clock
1.36% [kernel.kallsyms] [k] __lock_is_held
1.27% [kernel.kallsyms] [k] trace_hardirqs_off_caller
1.23% [kernel.kallsyms] [k] mark_lock
1.20% [kernel.kallsyms] [k] add_preempt_count
1.14% [kernel.kallsyms] [k] mark_held_locks
1.12% [kernel.kallsyms] [k] __module_address
1.12% [kernel.kallsyms] [k] sub_preempt_count
1.07% libdevmapper.so.1.02 [.] _is_whitelisted_char
1.01% [kernel.kallsyms] [k] strcmp
0.98% [kernel.kallsyms] [k] local_clock
0.89% [kernel.kallsyms] [k] rcu_lockdep_current_cpu_online
0.89% [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
0.85% [kernel.kallsyms] [k] trace_hardirqs_on_caller
0.81% libc-2.17.so [.] vfprintf
0.78% libdevmapper.so.1.02 [.] unmangle_string
0.76% [kernel.kallsyms] [k] mutex_lock_nested
0.76% [kernel.kallsyms] [k] time_hardirqs_off
0.74% [kernel.kallsyms] [k] rcu_is_cpu_idle
0.74% [kernel.kallsyms] [k] __might_sleep
0.72% [kernel.kallsyms] [k] sysfs_link_sibling
0.67% libdevmapper.so.1.02 [.] mangle_string
Now the 'perf record -ag' version:
2.61% dmsetup [k] __lock_acquire
1.50% systemd-udevd [k] __lock_acquire
1.33% dmsetup [k] lock_is_held
1.15% udisks-part-id [k] __lock_acquire
1.11% dmsetup [k] lock_release
1.11% udisks-dm-expor [k] __lock_acquire
0.98% dmsetup [k] page_fault
0.96% dmsetup [k] lock_acquire
0.95% dmsetup [k] debug_smp_processor_id
0.93% dmsetup [k] native_sched_clock
0.90% dmsetup [k] sched_clock_local
0.79% vgchange [k] pcpu_alloc_area
0.77% systemd-udevd [k] lock_is_held
0.76% dmsetup [k] trace_hardirqs_off_caller
0.75% dmsetup [k] add_preempt_count
0.67% systemd-udevd [k] copy_pte_range
For completeness, 'perf record -ag' for modprobe scsi_debug:
4.20% systemd-udevd [k] __lock_acquire
1.90% systemd-udevd [k] lock_release
1.61% systemd-udevd [k] lock_acquire
1.52% systemd-udevd [k] sched_clock_local
1.50% systemd-udevd [k] native_sched_clock
1.45% systemd-udevd [k] trace_hardirqs_off_caller
1.45% udisks-part-id [k] __lock_acquire
1.30% systemd-udevd [k] lock_is_held
1.21% systemd-udevd [k] debug_smp_processor_id
1.03% systemd-udevd [k] mutex_lock_nested
1.02% systemd-udevd [k] local_clock
0.88% systemd-udevd [k] mark_lock
0.86% systemd-udevd [k] add_preempt_count
0.85% systemd-udevd [k] time_hardirqs_off
0.76% systemd-udevd [k] sub_preempt_count
0.73% systemd-udevd [k] do_raw_spin_trylock
0.72% scsi_id [k] __lock_acquire
0.69% systemd-journal [k] __lock_acquire
0.68% systemd-udevd [k] mark_held_locks
0.67% udisks-part-id [k] lock_is_held
0.63% kworker/u:3 [k] pcpu_alloc_area
0.62% systemd-udevd [k] trace_hardirqs_off
0.61% udisks-part-id [k] lock_release
0.59% systemd-udevd [k] trace_hardirqs_on_caller
0.57% udisks-part-id [k] page_fault
0.54% systemd-udevd [k] __lock_is_held
0.53% udisks-part-id [k] lock_acquire
0.53% udisks-part-id [k] debug_smp_processor_id
0.52% systemd-udevd [k] sched_clock_cpu
0.51% systemd-udevd [k] __mutex_unlock_slowpath
0.50% udisks-part-id [k] native_sched_clock
0.49% udisks-part-id [k] sched_clock_local
0.45% systemd-udevd [.] _int_malloc
0.44% systemd [k] __lock_acquire
The same analysis should now happen on a RHEL system to compare for differences.
IMHO udev is simply forking way too many tasks - does lowering the number of tasks increase performance and avoid the timeouts?
kernel command line: "udev.children-max="
or in a running system:
# udevadm control --children-max=value
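One way to look for a good children-max value is to sweep a few settings and time the same device storm under each. A rough sketch, assuming a root shell and the scsi_debug workload from the measurements above (the candidate values are arbitrary, not recommendations):

```shell
# sweep_children_max: for each candidate limit, reload scsi_debug and
# time how long it takes udev to settle. Root-only; call explicitly.
sweep_children_max() {
    local n start
    for n in 4 8 16 64 256; do
        udevadm control --children-max="$n"
        modprobe -r scsi_debug 2>/dev/null
        start=$(date +%s)
        modprobe scsi_debug dev_size_mb=2 num_tgts=200
        udevadm settle
        echo "children-max=$n: $(( $(date +%s) - start ))s"
    done
}
```

The value with the lowest settle time on a given machine is the interesting one; it will differ between a 4 GB laptop and a large server.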
However the 'default' behavior is simply wrong and needs to be fixed.
Can you find out what value provides the best compromise on your machine?
It certainly sounds like the better approach to limit the concurrency, instead of running against the wall and picking up the pieces later with error handling by forking new processes.
(In reply to comment #8)
That is surely a nice patch to have, of course, but we still haven't provided a full solution to the problems that could arise - we've just made them less probable with this patch.
So my question is: how can we detect the validity of the udev database if a timeout happens, either because the system is totally overloaded or because some command hung and failed? If we want to rely on the udev database, it would be good to have a way to detect it being "tainted" when the timeout/kill happens, for whatever reason...
Just one more question out of curiosity: is there at least a transaction when records are written to the udev database? If a timeout happens, the command is killed and the rest of the RUN queue is cut off, does the udev database end up in its previous state, or does it contain half old information and half new information that was updated before the RUN queue died?
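There is no taint flag to query, but one can at least compare the libudev view of a device with the raw on-disk record behind it. A heuristic sketch, assuming the modern /run/udev/data layout (older releases kept the database elsewhere, e.g. under /dev/.udev):

```shell
# show_udev_record: print a block device's properties as libudev reports
# them, followed by the raw on-disk database record (bash, for $((16#...))).
show_udev_record() {
    local dev="$1" maj min            # e.g. /dev/sda
    udevadm info --query=property --name="$dev"
    maj=$(stat -c '%t' "$dev")        # major number, hex
    min=$(stat -c '%T' "$dev")        # minor number, hex
    cat "/run/udev/data/b$((16#$maj)):$((16#$min))" 2>/dev/null
}
```

A record that is missing or visibly truncated relative to what the rules should have written is a hint that event processing was cut short.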
Also, now that systemd and udev are one project, and considering systemd's journal, which has quite a nice bunch of features: could part of that interface be reused for the udev database, as another instance of the same interface, to remove the symlinky nature of the current udev database, to aid lookups and to provide more integrity etc.? Just my curiosity - I've no idea whether that is viable at all...
(In reply to Harald Hoyer from comment #8)
devel_ack for this patch.
- We will not add something like a RUN_TIMEOUT="..."
- We will reduce the number of udev max children
- We will patch start_udev to not time out so early, so that LVM assembly can happen later on
First alpha build for testing:
It would be nice if you could test it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.