561425 – Remove assumption that 'change' uevents only originates from libdevmapper

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	6.0
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Peter Rajnoha
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	548870

Reported:	2010-02-03 16:20 UTC by David Zeuthen
Modified:	2010-11-15 14:32 UTC
CC List:	11 users
Fixed In Version:	lvm2-2.02.68-1.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-11-15 14:32:37 UTC
Target Upstream Version:
Embargoed:

Description David Zeuthen 2010-02-03 16:20:18 UTC

See bug 548870 for details.

Comment 2 RHEL Program Management 2010-02-03 16:37:08 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 4 David Zeuthen 2010-03-08 20:29:55 UTC

Any progress on this? Thanks.

Comment 5 Alasdair Kergon 2010-03-08 20:49:03 UTC

Discussion on some other bz (I forget which).  Still got an impasse with upstream udev, unfortunately.

Comment 6 David Zeuthen 2010-03-09 14:50:08 UTC

First, I don't know why you think that you have control over uevents - if you examine the OS you will find several places where both user-space and kernel-space request uevents from various subsystems.

Second, from a 50,000-foot point of view, it is not a very good idea to make your udev rules depend on transient things like the values of certain environment variables in the first place. Doing it this way makes things brittle, prone to race conditions and much harder to debug. Instead, it is much better to make the rules read values from e.g. sysfs files. Which means that you need a couple of lines of code kernel-side to cache these values. Handwaving here: So, from where I'm sitting, all you need to do is to export sysfs files that will serve the contents of whatever you put in the environment variables you were emitting in the 'change' event from libdevmapper originally.

Third, I'd also like to point out that the OS has relied on this behavior for a very long time now - we've used OPTIONS+="watch" for a long time - and it is absolutely essential for presentation interfaces: it is what allows the user interface to be dynamically updated when you do e.g. 'mkfs /dev/sda2' or 'e2label /dev/sda2' from the terminal. It is also what ensures that the /dev/disk/by-{id,...} hierarchies are properly updated - a mechanism that many parts of the OS depend on.

It is not reasonable to expect people to do things differently just because we are dealing with a device-mapper device instead of a libata device. Especially not when there is a path for you to rectify such bad assumptions.
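
To make the sysfs suggestion in this comment concrete, here is a minimal Python sketch of a consumer that reads device-mapper properties from sysfs files rather than from uevent environment variables. It assumes the kernel exposes a `dm` directory (with files such as `name` or `uuid`) under each `/sys/block/dm-N`; the helper name `read_dm_sysfs` and its `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path


def read_dm_sysfs(block_name, sysfs_root="/sys/block"):
    """Return the readable attributes under <sysfs_root>/<block_name>/dm as a dict.

    Hypothetical helper: it assumes a 'dm' attribute directory exists for
    device-mapper devices and returns an empty dict when it does not.
    """
    dm_dir = Path(sysfs_root) / block_name / "dm"
    attrs = {}
    if not dm_dir.is_dir():
        return attrs
    for entry in sorted(dm_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            attrs[entry.name] = entry.read_text().strip()
        except OSError:
            # Some sysfs files cannot be read; skip them instead of failing.
            continue
    return attrs
```

Values obtained this way are re-read on every call, so they do not depend on which process or subsystem triggered the uevent.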

Comment 7 Alasdair Kergon 2010-03-09 15:14:49 UTC

sysfs really should not be used to store userspace properties.  I doubt I could get that past lkml.  Whether we can find a way of pretending they are 'kernel' things, rather than arbitrary blobs of data stored on behalf of userspace, I don't know. More possible workarounds/layer violations for us to consider...

mkfs should be issuing whatever notification is necessary directly.  Using 'watch' in that case appears to be a temporary (and inefficient) workaround until that (and other tools) get fixed.

"not reasonable to expect people to do things differently"
- I completely agree.  Unfortunately some existing udev design & implementation assumptions (which do not apply in the case of dm devices) make that impossible for us at present.

Comment 8 Milan Broz 2010-03-09 16:10:37 UTC

David,
please do not apply assumptions which are true for plain block devices but not for device-mapper. It is not that device-mapper devices are special; it is mainly that the event architecture does not seem to be properly integrated with them yet.

Creating a device-mapper device is not a plain allocation of an underlying block device (like libata or a partition, and even partitions are handled specially in the kernel) but a more complex operation (we can simplify it to three steps - create/load_table/resume_device).

With a volume manager it is even more complicated - several devices are combined into another one, and only after that is finished (several create/resume steps) is the final device usable, not before.

But the kernel kobject architecture has no idea about this (yes, device-mapper should be integrated into the block layer to fix it properly) and handles the events as if there were only basic, simple devices.

So here are examples of several problems you are seeing:

1) Add event is sent when device is not yet initialised
2) There are multiple REMOVE events
3) triggering event from userspace causes kernel to reissue event, but because it is handled through generic layer, there are no callbacks to device-mapper code so it causes several misconfigurations (like disappearing of links etc).
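
Problem 1 can be illustrated with a toy Python model of the create/load_table/resume sequence described above. This is only a sketch: the `ToyDmDevice` class, its event log and the example table string are hypothetical and do not correspond to any kernel or libdevmapper API. The only behaviour taken from the discussion is that 'add' is emitted when the device is created, before it is usable.

```python
class ToyDmDevice:
    """Toy model of a device-mapper device going through create/load_table/resume."""

    def __init__(self, name):
        self.name = name
        self.table = None      # no mapping table loaded yet
        self.usable = False    # I/O is not possible until resume()
        # 'add' is tied to kobject creation, so it is logged here together
        # with the usability of the device at that moment.
        self.event_log = [("add", self.usable)]

    def load_table(self, table):
        self.table = table

    def resume(self):
        if self.table is None:
            raise RuntimeError("cannot resume a device without a table")
        self.usable = True


dev = ToyDmDevice("vg0-lv0")
dev.load_table("0 2048 linear 8:16 0")   # illustrative table line only
dev.resume()
print(dev.event_log)   # [('add', False)]
```

The log shows that, in this model, a listener reacting to 'add' sees a device that cannot yet be opened for I/O.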

(unfortunately, device-mapper is here only an add-on over the basic device kobject, not a special class of device)

(And I wonder what should happen if userspace requests REMOVE event.)

I fully agree that dm devices should behave like normal block devices from the userspace (udev) point of view. I have already written some kernel patches which should move the ADD/REMOVE events to the proper place - I am not sure whether they will be accepted upstream, but we are testing them, and I think they should simplify and unify some udev rules.
(Peter is testing them now.)

But because there is no callback to device-mapper code when some userspace requests re-sending of a uevent, the additional attributes (which are needed in rules) must be stored in sysfs or in the udev db.

But (IMHO) we should not store in sysfs attributes which are not related to kernel device state (like the lvm visible/invisible flag, which affects _only_ whether some symlinks are created). LVM has its metadata handled in userspace; that was a basic design principle. It is not about a few lines in the kernel - it is about the basic design of the DM/LVM logic, which has worked reliably for years.

But I can still imagine some solution to this - either store some "userspace only" attributes in udev db or discuss some kernel solution.

What is IMHO a serious design mistake is the "watch" (IOW inotify on close-after-write of the device). I know the story behind it and there are surely some situations when this can be used (but this should be a very rare situation).

But requesting watch on _all_ devices cannot work reliably (especially when it is followed by a scan program which locks the device).

Every udev-aware application which modifies some on-disk metadata should send a change event - ok.

But it should not be assumed that write/close on a disk == we need to scan the device. This can seriously break more complex operations which require writes to several devices or a sequence of steps to achieve the final state. The application(s) know when the transaction is finished; how can you guess that a finished write == data on the disk(s) are ready to scan? Has anyone audited all programs to check that they follow this logic?

I think that if we want to use events this way (inform listening applications that a device has updated metadata on it) the change event must be issued by the application itself which is managing this data, not by any watch(dog).

BTW a more complex situation happens in a cluster with shared storage. We already must handle this in clustered lvm. "Watch" will not help here if the operation is performed from another node. Again, a cluster-udev-aware application should inform other nodes that metadata changed (issue a change event for you) on _all_ nodes.

(And what happens if some 3rd party uses some journal block device, and after every transaction it closes the device? Why should you issue a scan for this? What if this happens 100 times per second? Scan will lock the device and that application will randomly fail with -EBUSY? I know it is a bad example, but there are many bad programs in the wild...)

IOW it creates a race between udev slaves (which are following the event and scanning the affected device) and the caller, which is manipulating that device.

Please do not take this as resistance to change - I myself want to make it work properly - but we cannot change the design principles. We can surely implement some compromise but it is not as simple as you are trying to show in comment #6.

And discussion is very hard if you start with "your bad assumption....".

Comment 9 David Zeuthen 2010-03-09 16:46:48 UTC

Hi,

I don't really care whether you store things in sysfs or read it from somewhere in userspace - as I said in comment 6, it was only a suggestion.

My main point is really that device-mapper block devices are so very different from ordinary block devices. And I think a lot of this is self-inflicted because of your implementation decision to have most of it implemented in user space.

For example, look at md-raid - from an interface point of view, here's the main things

1. there is no complicated user space that runs to collect properties
   about the RAID device, e.g. /sys/block/md%d/md has everything we
   need

2. there is no complicated user space to collect properties from the
   component device (similar to LVM2 PV), basically mdadm --export
   opens just the component device and reads simple values that are
   imported into the device environment

3. life-cycle rules are clear-cut (at least from a user space
   view) - there are no weird rules that you can't use the device
   under certain circumstances

   (actually there are a couple of windows where you can't use the
   device but this is fine as it's before the device is announced
   to user space.)

4. anyone can request uevents

Let us compare this point-by-point with device-mapper / lvm2

1. an ioctl has to be issued against the device - this is OK but it
   would be nicer with sysfs properties as it's easier to debug (e.g.
   when ping-ponging with people on bugzilla).

2. liblvm2app scans all of /dev and easily deadlocks (bug 570332) - this
   is unacceptable especially because we only ask for the metadata on the
   given PV.

3. life-cycle rules are very unintuitive and you expect all users
   to special-case device-mapper devices (in particular check the
   value of the properties DM_UDEV_DISABLE_OTHER_RULES_FLAG and
   DM_UDEV_DISABLE_DISK_RULES_FLAG)

4. things easily fall over if uevents originate from outside
   libdevmapper (this bug - also see bug 570355).

I'm sorry, but I don't know how to make forward progress here and it's not helping that conversations are spread on 10+ bugs. But I hope this comment helps illustrate two points

- just how _different_ a device-mapper block device is. And that you
  guys actually expect user space to totally special case it. Which I
  find not acceptable.

- that a lot of this is self-inflicted because of the kernel/user split
  that you chose.

I also don't think that it's realistic for udev to change because of this - sorry to sound like a jerk, but maybe you should have asked for that when kobject/udev happened five years ago.

I think the only realistic path is that you make the user-visible differences between device-mapper block devices and e.g. libata block devices (or mdraid block devices for that matter) as small as possible. In a way, I'd like to yell "do what mdraid does!" if you know what I mean...

Thanks,
David

Comment 10 Alasdair Kergon 2010-03-09 17:25:05 UTC

"I also don't think that it's realistic for udev to change because of this "

And I don't think it's realistic for dm/lvm2 to be rewritten because of this either.  udev is providing a *service* to dm/lvm2 - managing /dev on its behalf - but its interface is not yet rich enough to reimplement *existing* dm/lvm2 functionality.  We're bending over backwards at the moment trying to invent hacks and workarounds to salvage something workable out of this mess, but perhaps we should just admit defeat and accept that the two will never be able to co-exist happily?

This is only going to get resolved by co-operative changes on *both* sides.

Comment 11 David Zeuthen 2010-03-09 18:20:08 UTC

(In reply to comment #10)
> "I also don't think that it's realistic for udev to change because of this "
> 
> And I don't think it's realistic for dm/lvm2 to be rewritten because of this
> either.  udev is providing a *service* to dm/lvm2 - managing /dev on its behalf
> - but its interface is not yet rich enough to reimplement *existing* dm/lvm2
> functionality.  

No. In many ways, udev is just the tiny user-space portion. A substantial part that you are conveniently ignoring is kobject/sysfs. Which follows exactly the same model of exposing kernel objects to user-space 1:1.

Now, device-mapper and lvm2 are undeniably using kobject/sysfs as a service, to use your own words ... in fact we're in this mess exactly because this happened without you explicitly choosing it to be so ... e.g. it happened when the block layer started getting exposed in sysfs. That's when this mess started five years ago.

> This is only going to get resolved by co-operative changes on
> *both* sides.

Sure. Maybe it would help if we can agree on the user-space visible interface, e.g. points 1-4 in comment 9 first. Thanks.

Comment 12 Alasdair Kergon 2010-03-09 18:38:39 UTC

The state of the internal *kernel* kobjects does *not* correspond to the availability/status of devices in *userspace* - they are quite different things!  Just because a device number has been assigned in-kernel does not mean that a device exists that should be visible to normal system applications in userspace - those two things have been separate ever since device-mapper was accepted into the kernel.

There has to be a 'wrapper' layer either in userspace (which has been the existing approach we've all been struggling with) or in kernel (which Milan has been playing with recently - this requires kernel changes *outside* device-mapper code).

md has a different architecture from dm and does not afford a valid comparison.

Comment 13 David Zeuthen 2010-03-09 19:24:09 UTC

(In reply to comment #12)
> The state of the internal *kernel* kobjects does *not* correspond to the
> availability/status of devices in *userspace* - they are quite different
> things!  Just because a device number has been assigned in-kernel does not mean
> that a device exists that should be visible to normal system applications in
> userspace - those two things have been separate ever since device-mapper was
> accepted into the kernel.

Right, but the way the block layer is wired into kobject/uevent/sysfs does mean that it is exposed 1:1 to user space. For better or worse, this isn't very compatible with the device-mapper.

> There has to be a 'wrapper' layer either in userspace (which has been the
> existing approach we've all been struggling with) or in kernel (which Milan has
> been playing with recently - this requires kernel changes *outside*
> device-mapper code).

Sounds great. Will such a layer have the salient qualities such that the block devices act and look like e.g. libata block devices? (E.g. points 3. and 4. in comment 9)

> md has a different architecture from dm 

I'm well-aware of that. 

> and does not afford a valid comparison.    

It sure does offer a valid comparison and even a _good_ one at that.

It shows that complex virtual block device infrastructures _can_ be made to work just like other block devices work. FWIW, I think a lot of this discussion is about characterizing what expectations user-space like udev + co has when it comes to what a block device really _is_. Basically, we've found that user space at least expects these two requirements to be true:

 - once the block device has been created (e.g. when the 'add' uevent has
   been sent) it needs to be operational (meaning: user space may do
   open/read/write/close/etc on it) until the 'remove' uevent has been sent.

 - you need to handle that uevents from any origin including kernel
   subsystems (for example, a fs driver may emit 'change' on the block
   device that is backing the filesystem. Ditto for a HBA LLD), user
   space (e.g. udevadm trigger) and udev itself (e.g. OPTIONS+="watch")
   may occur

I really don't think these requirements are too much to ask for. And I really very strongly object to having to play games like "you can't use the block device if the udev property XYZ has the value ABC or the sysfs file FOO has the values BAR and BAZ".

So to speak, the fact that device-mapper uses the block device itself for configuration is what is causing the problems. Before we had udev/kobject/sysfs this didn't really cause many problems because everything was done synchronously _anyway_.

In a modern asynchronous udev/kobject/sysfs based world (which is the world we _need_ to live in if we want to handle e.g. hotpluggable disks) this of course won't work and which is why we're starting to see these problems. So the addition of a wrapper-layer (or other configuration back-channel) will help a lot here.

(Hey, I'm fully aware that I'm repeating myself here and I apologize for that. I'm simply trying really hard to emphasize exactly what to expect (and not expect) from a block device. Because I think the lack of agreement in that area is what is causing the current mess.)

Thanks,
David

Comment 14 Alasdair Kergon 2010-03-10 02:47:22 UTC

"- once the block device has been created (e.g. when the 'add' uevent has
been sent) it needs to be operational (meaning: user space may do
open/read/write/close/etc on it) until the 'remove' uevent has been sent."

There are different definitions of the point at which the "block device has been created" for different parts of the system.

Milan's latest attempt at this part of the problem takes your practical userspace definition - "user space may do operations on it" - and attaches the kernel's "add" uevent to that *instead* of attaching it to the creation of the kernel kobject (which happened an arbitrary amount of time earlier).

In other words "add" is then defined for *userspace* convenience rather than *kernel* convenience (as today). This makes the userspace udev rules simpler at the expense of more complexity in the kernel.

Regarding the 'uevents from any source'. Well, udev is serving two independent purposes for us: maintaining dm nodes and maintaining lvm2 nodes. The information for the dm nodes comes from the kernel - relatively straightforward. But the information for the lvm2 nodes (ie symlinks to dm) comes from userspace lvm2 and is nothing to do with kernel device-mapper. In the current incarnation of our rules, this *userspace* lvm2 state is being passed via a 'cookie' variable attached to the uevent, opaque to the kernel. It depends entirely on userspace lvm2 state - some of which is held in the memory of the running lvm2 process. This state is what we were suggesting might be cached in the udev database and re-supplied from there along with future uevents from other sources. We don't really have any solution to this problem yet.

Comment 15 Peter Rajnoha 2010-06-14 14:22:44 UTC

Back to the original bug report...

The request to remove the assumption that the event has its origin in libdevmapper calling ioctl can't be satisfied because of the nature of lvm2 and device-mapper. We really need additional information to be passed into the udev rules to direct their processing.

This kind of information is hard to obtain just by calling a simple lvm command - it is generated at runtime based on relations among other devices and the stage of activation in which it occurs. And storing this information in sysfs is not the way to go either, for the reasons already discussed here in this report.

However, there's a new IMPORT{db} udev rule that allows us to retrieve the information stored in the udev database and use it in case we don't have it provided in the event directly (through the DM_COOKIE environment variable that also includes the flags we use to direct the rule application).
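
The fallback described here can be sketched as a small Python model. It is not the actual 10-dm.rules logic: the function `effective_properties`, the dict standing in for the udev database, and the assumption that the flags arrive already decoded as separate properties are all simplifications made for illustration. Only the property names come from this report.

```python
def effective_properties(event_env, udev_db, keys):
    """Pick the rule-steering properties for one uevent.

    event_env: environment of the incoming uevent (dict)
    udev_db:   properties remembered from earlier events (dict, updated in place)
    keys:      names of the properties that steer rule processing
    """
    if event_env.get("DM_COOKIE"):
        # Event comes from libdevmapper: trust its values and remember them.
        props = {k: event_env[k] for k in keys if k in event_env}
        udev_db.update(props)
        return props
    # Event from another origin (trigger, watch, ...): fall back to the
    # stored values, which is the role IMPORT{db} plays in the rules.
    return {k: udev_db[k] for k in keys if k in udev_db}


keys = ["DM_UDEV_DISABLE_OTHER_RULES_FLAG", "DM_UDEV_DISABLE_DISK_RULES_FLAG"]
db = {}
first = effective_properties(
    {"DM_COOKIE": "0x1", "DM_UDEV_DISABLE_OTHER_RULES_FLAG": "1"}, db, keys
)
second = effective_properties({"ACTION": "change"}, db, keys)
print(first == second)   # True
```

In this model an event without a cookie yields the same steering properties as the last libdevmapper-originated event, provided the stored database has been preserved.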

For this to work properly, we also need to preserve the udev database from the initrd where dm devices could be activated (see also bug #603724).

Once we have udev's start_udev script fixed, we should be able to fully support artificial events (though we still won't be able to synchronize with such events) - we'll be able to decide whether we should react to an event based on the information stored in the udev db...

Comment 16 Peter Rajnoha 2010-06-22 09:01:43 UTC

Current RHEL6 version of udev (147-2.18) preserves the udev database and backports the IMPORT{db} udev rule. Now, we can rely on udev database and we can fully support the use of "udevadm trigger --action={add,change}", "echo {add,change} > /sys/.../<dm_device>/uevent" as well as the events generated as a result of the OPTIONS+="watch" rule (or even any newer ones added later).
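
The `echo {add,change} > /sys/.../<dm_device>/uevent` form mentioned above amounts to writing the action name into the device's `uevent` file. A minimal Python sketch, assuming the caller has write permission on that file (normally root only); `request_uevent` is a hypothetical helper name, and the sysfs directory has to be supplied by the caller because the path is elided in the comment.

```python
from pathlib import Path

ALLOWED_ACTIONS = ("add", "change")   # the two actions named in the comment


def request_uevent(device_sysfs_dir, action="change"):
    """Ask the kernel to emit a synthetic uevent for one device."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    uevent_file = Path(device_sysfs_dir) / "uevent"
    # Equivalent to: echo <action> > <device_sysfs_dir>/uevent
    with open(uevent_file, "w") as f:
        f.write(action + "\n")
    return uevent_file
```

Such a write only requests the event; it gives the caller no way to wait for udev to finish processing it.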

We still can't synchronize with such events. But since this problem is covered as a consequence in other bug reports like bug #570359 or bug #577798, let's consider that the exact problem reported here is resolved.

The correction for device-mapper udev rules (10-dm.rules) is scheduled for the next lvm2/device-mapper build.

Comment 18 releng-rhel@redhat.com 2010-11-15 14:32:37 UTC

Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.
