Red Hat Bugzilla – Bug 1020150
[RFE] Use udev monitoring to enhance synchronization
Last modified: 2017-10-31 10:50:57 EDT
Description of problem:
lvm2 should start to use udev monitoring.
This would streamline udev rule processing a little, since at least one dmsetup call (for the semaphore counter) could be eliminated.
Another advantage: no limited IPC resources (SysV semaphores) would be used.
We would also gain the option to eliminate many of the retry steps for closing a device, which we put in because of the randomized watch rule. We would just 'settle' if the first attempt fails.
Other distributions would stop messing up lvm2 code with their incorrect patches.
settle was ruled out of any solution a long time ago because it waits for unrelated changes (which could be blocked).
What's proposed here was ruled out before the current solution was written, and I don't believe udev has been enhanced sufficiently since then for this to work.
settle could be used when we hit a race between close and the watch rule.
For example, when I use 'sleep xxx < /dev/vg/lv' to keep the device open, we could either retry the close for ~3 seconds, or just wait until all udev events since the starting point have been processed, so we can be sure that no relevant watch rules are still running, e.g. as a result of an umount that could be in progress.
There are cases where 'settle' is a clear win, and there could be cases where settling takes much longer. We could possibly combine the best of both worlds: retry for at most 3 seconds, and if we detect that udev has 'settled' earlier, stop retrying immediately and report a device-open error.
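A minimal sketch of that combined retry, only to make the control flow concrete. Everything named here is hypothetical (the callback types, `close_with_retry`, the `sim_*` helpers); nothing is taken from lvm2 or libudev, and how "udev has settled" would actually be detected is left open behind a callback. The settled state is sampled before each close attempt, on the assumption that a failure after udev was already idle cannot be blamed on a watch rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callbacks; real code would wrap the actual close/deactivate
 * call and whatever "udev queue is idle" check turns out to be usable. */
typedef bool (*try_close_fn)(void *ctx);    /* true when the close succeeded */
typedef bool (*udev_settled_fn)(void *ctx); /* true when no udev events are pending */
typedef void (*sleep_ms_fn)(void *ctx, unsigned ms);

enum close_result { CLOSE_OK, CLOSE_BUSY_SETTLED, CLOSE_BUSY_TIMEOUT };

/* Retry the close for at most budget_ms, but give up early once udev was
 * already idle before a failed attempt (the device is held by a real user). */
static enum close_result close_with_retry(try_close_fn try_close,
                                          udev_settled_fn settled,
                                          sleep_ms_fn sleep_ms, void *ctx,
                                          unsigned budget_ms, unsigned step_ms)
{
    unsigned waited = 0;

    assert(try_close && settled && sleep_ms && step_ms > 0);

    for (;;) {
        /* Sample first: a watch rule finishing during the attempt must
         * not be mistaken for "udev was idle the whole time". */
        bool was_settled = settled(ctx);

        if (try_close(ctx))
            return CLOSE_OK;
        if (was_settled)
            return CLOSE_BUSY_SETTLED;
        if (waited >= budget_ms)
            return CLOSE_BUSY_TIMEOUT;

        sleep_ms(ctx, step_ms);
        waited += step_ms;
    }
}

/* Tiny simulation used only to exercise the loop without real devices. */
struct sim {
    unsigned close_ok_at; /* attempt number at which close succeeds, 0 = never */
    unsigned settled_at;  /* check number from which udev is idle, 0 = never */
    unsigned attempts, checks, slept_ms;
};

static bool sim_try_close(void *ctx)
{
    struct sim *s = ctx;
    s->attempts++;
    return s->close_ok_at && s->attempts >= s->close_ok_at;
}

static bool sim_settled(void *ctx)
{
    struct sim *s = ctx;
    s->checks++;
    return s->settled_at && s->checks >= s->settled_at;
}

static void sim_sleep(void *ctx, unsigned ms)
{
    struct sim *s = ctx;
    s->slept_ms += ms;
}
```

With a 3000 ms budget and 100 ms steps, a device that stays open while udev goes idle on the second check is reported busy after a single sleep instead of after the full 3 seconds.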
IMHO there is now a 30-second timeout in udev, which should keep even a plain settle from blocking things for very long.
Anyway, settle is just one minor point here. The primary goal would be eliminating the semaphore from the picture, since a monitor could easily filter for disk events only and we would just wait for our own rules.
As a major bonus, I could then see embedding the pvscan handling for lvmetad so that it is executed in the proper context of the lvm command instead of the over-complex udev/systemd/service chain, since I believe CPU accounting should be on behalf of the command that caused the device manipulation and not hidden somewhere deep in systemd services.
Another point to consider here is that tracing and debugging would be more relevant.
Another thing to consider here (with patches like this:
is the sheer complexity of building a correctly configured lvm2 package for a user. I think we are already past the point where an average user can find the proper options, and it seems even distro maintainers will now find it hard to tune the right set of options.
Compared with that, built-in support for monitoring, where lvmetad is just a udev monitor with the ability to fork lvchange auto-activation, makes this considerably easier.
From a discussion with a udev developer: we may get the same cookieID behavior by using the 'TAG' feature and udev_monitor_filter_add_match_tag().
While the man page for TAG support states "Excessive use might result in inefficient event handling", that refers to attaching many different TAGs to a device; using a single TAG is perfectly OK.
So we should be able to monitor for just the exact device, much as we currently decrement a semaphore ID (of which SysV provides only a limited number).
Udev will also likely implement a much simpler 'CANCEL' message for when a udev worker is killed.
Just to add a few more words about the 'TAGS' limitation: all the TAGs associated with a single device are hashed into a 64-bit bitmask used for quick in-kernel socket filtering. The udev server broadcasts the event to every monitor that listens for a given mask, so there can be some false positives (since all tags are OR-ed together), and those are filtered out at the end on the udev monitor 'client' side. So our cookieID should probably be designed and checked to ensure good bit-spreading.
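A rough way to check bit-spreading for a candidate cookie naming scheme, as a sketch under loud assumptions: the hash below is plain 32-bit FNV-1a standing in for whatever udev really uses, the number of bits set per tag (`BITS_PER_TAG`) is assumed, and the "prefix plus number" tag format is hypothetical. Only the shape follows the description above: each tag becomes a few bits in a 64-bit mask, a device's mask is the OR of its tag masks, and a listener passes an event when all of its own bits are present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define BITS_PER_TAG 4 /* assumed; not taken from udev */

/* Stand-in hash (32-bit FNV-1a), NOT udev's actual hash function. */
static uint32_t fnv1a32(const char *s)
{
    uint32_t h = 2166136261u;

    assert(s);
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Map one tag to up to BITS_PER_TAG bits of a 64-bit mask. */
static uint64_t tag_mask(const char *tag)
{
    uint32_t h = fnv1a32(tag);
    uint64_t m = 0;

    for (int i = 0; i < BITS_PER_TAG; i++)
        m |= UINT64_C(1) << ((h >> (6 * i)) & 63);
    return m;
}

/* A device's mask is the OR of the masks of all its tags. */
static uint64_t device_mask(const char *const *tags, size_t n)
{
    uint64_t m = 0;

    for (size_t i = 0; i < n; i++)
        m |= tag_mask(tags[i]);
    return m;
}

/* The coarse filter passes when every wanted bit is set in the device mask;
 * a pass may still be a false positive and needs an exact tag check. */
static bool mask_may_contain(uint64_t dev, uint64_t want)
{
    return (dev & want) == want;
}

/* Count ordered pairs (i, j), i != j, where a device carrying only the tag
 * "<prefix>j" would pass the coarse filter of a listener for "<prefix>i". */
static unsigned count_false_positives(const char *prefix, unsigned n)
{
    char want_tag[64], dev_tag[64];
    unsigned hits = 0;

    for (unsigned i = 0; i < n; i++) {
        snprintf(want_tag, sizeof want_tag, "%s%u", prefix, i);
        uint64_t want = tag_mask(want_tag);

        for (unsigned j = 0; j < n; j++) {
            if (i == j)
                continue;
            snprintf(dev_tag, sizeof dev_tag, "%s%u", prefix, j);
            if (mask_may_contain(tag_mask(dev_tag), want))
                hits++;
        }
    }
    return hits;
}
```

Running `count_false_positives()` over the range of cookie values we expect to have in flight at once gives a quick feel for how often the client-side exact check would have to discard an event. The absolute numbers only mean something once the stand-in hash is replaced with the real one.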
Another thing not to forget is synchronization against device removal.
Currently we are not able to synchronize the removal of a volume group with its recreation, since the check for the VG name's existence in /dev may return true while we are not waiting for udev to actually remove the link.
Unfortunately, at present there are more tasks being executed, e.g. via systemd, which makes the synchronization nearly impossible. We simply don't have any source of scheduled/in-flight operations for a device, so enhancing the kernel API with some flags for suspend/remove operations might be necessary; otherwise we will have to fight hard with other processes that manipulate a PV/VG/LV.