Bug 814782

Summary: Handle different client filters / missing devices in lvmetad
Product: Red Hat Enterprise Linux 6 Reporter: Alasdair Kergon <agk>
Component: lvm2    Assignee: Petr Rockai <prockai>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: medium Docs Contact:
Priority: medium    
Version: 6.4    CC: agk, cmarthal, coughlan, dwysocha, heinzm, jbrassow, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac
Target Milestone: rc   
Target Release: 6.4   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.98-1.el6 Doc Type: Bug Fix
Doc Text:
The interaction of LVM filters and lvmetad could have led to unexpected and undesirable results, and updates to the "filter" setting while lvmetad was running would not force lvmetad to forget about devices forbidden by the filter. Since the normal "filter" setting in lvm.conf is often used on the command line, a new option was added to lvm.conf (global_filter) which also applies to lvmetad. The traditional "filter" setting only applies at the command level and does not affect device visibility to lvmetad. Both options are documented in more detail in the example configuration file.
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-02-21 08:09:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 817776    

Description Alasdair Kergon 2012-04-20 16:15:45 UTC
Currently, LVM filters are a property of the client, not the lvmetad server.  Different clients might see different subsets of the devices on the system.
A uevent for the addition or removal of a device might have got lost.

Devices can also become missing if they disappear from the system or if they are filtered out by a particular client.

Ensure lvmetad copes with these situations, presenting a consistent picture to each client, without ever needing to be restarted.

Consider further self-correction or manual correction mechanisms that avoid a daemon restart.

Comment 1 Petr Rockai 2012-07-01 10:08:13 UTC
I disagree about the requirement to automatically handle missed uevents (which equate to kernel/udev bugs). In those cases, a manual re-scan can be issued to rectify the situation.

Basically, I believe this bug amounts to making sure that an appropriate scan command (vgscan --cache?) is available to sync the state of the daemon from a scan result. Care needs to be taken to avoid races to the extent possible.

Comment 2 Peter Rajnoha 2012-07-17 08:07:38 UTC
See also related bug #837599 comment #4.

As for missed uevents, I agree with Petr here that if anything like that happens, the kernel/udev must be fixed. Missing events from the kernel are inadmissible - it's a serious kernel/udev bug otherwise. Well, unless the udev rules are badly written or lvmetad itself misbehaves, but that should be fixed directly as well...

As for the filters, I'd raise the priority on this a bit: in an environment where LVM is used as a backend for virtual machines, such a configuration might become unusable if we're not able to filter the volumes used within the guests out of scanning (guests can define their own LVM layout inside the volume itself, and this clashes with the host's LVM configuration - even worse if the volume names are the same). That's just an example where people might hit this quite often.
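For illustration only (the VG and LV names are made up), a client-side lvm.conf filter that keeps such guest-image LVs out of the host's scanning could look roughly like this:

  devices {
      # Reject the LVs that hold guest disk images so the guests' own LVM
      # metadata inside them is never scanned on the host; accept the rest.
      filter = [ "r|^/dev/vg_images/guest.*|", "a|.*|" ]
  }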

A suggestion here was to apply filtering to the information returned from lvmetad directly in the client. Now the question is whether the "pvscan --cache" command used for updating lvmetad should provide a filtered or unfiltered view for the lvmetad update:

  - if unfiltered "pvscan --cache" is used, only one client update is needed, directly from the command line/on boot within a boot script/called from a udev rule. BUT we might get into trouble, just like in the example with host/guest volumes where the guests must be filtered out.

  - if filtered, each client is responsible for updating the lvmetad itself with "pvscan --cache" call.

What will the udev rule use? Filtered or unfiltered? There's only one common udev rule used globally in the system, in contrast to the several possible lvmetad clients. It's not possible, nor intended by the design, to run several udev daemons in parallel...
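For context, here is a simplified sketch of what that single system-wide rule conceptually does - not the literal contents of the shipped lvm2 udev rule:

  # On an add/change event for a block device that blkid identified as an
  # LVM2 PV, update lvmetad about that one device.
  SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="LVM2_member", \
      RUN+="/sbin/lvm pvscan --cache --major $major --minor $minor"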

Then the question is: what's the exact use case for several clients with different filtering rules? (Is testing the most common situation?) Each client should then have its own lvmetad udev rule as well, right?

What I'm trying to say is that we're trying to put together two designs - several lvmetad clients with different filtering on one side vs. one instance of udevd in the system processing the rules, which is what primarily updates the lvmetad state.

Comment 3 Zdenek Kabelac 2012-07-17 09:00:53 UTC
I think this would require a sensitive change in the filtering layer of the lvm command. We will need to support different filtering for lvmetad.

Currently, filters are applied when accessing devices - but with this change we will need to let pvscan --cache run without any filters, so that lvmetad always has a full view of all devices. It therefore needs to be able to handle cases where the user has the same UUID & VG name and only the device name differs.

When lvm queries lvmetad - it will then apply its filtering rules.

Comment 4 Peter Rajnoha 2012-07-17 09:13:01 UTC
Yes, I think we need to implement something like that - it seems to be the most straightforward and cleanest solution. So we need to be sure that duplicates are handled correctly and that the underlying device makes the point of difference (which the filter will look at as well, with filtering done on the client side only, on the lvmetad answer). Alasdair, mornfall, what do you think?

Comment 5 Petr Rockai 2012-07-20 15:36:22 UTC
"could have same UUID" -- no, that's not supported, and cannot be -- it's Universally Unique... uniqueness is the whole point of a UUID. Anything else can clash, but an UUID is what defines a VG identity.

As for filtering, lvmetad always needs a full view of the system. A pvscan --cache must never apply any filtering, and all commands that want to filter need to do so on data coming from lvmetad.

*IF* (and that's a big if) it's unavoidable that someone has multiple clashing copies of a VG with the same UUID, they will have to apply a consistent filter to pvscan --cache (and to all pvscan --cache invocations, including those in udev rules). If anything ever goes out of sync on those filters, they are out of luck, with a broken system. Any filter change can only happen with lvmetad turned off.

Comment 6 Peter Rajnoha 2012-07-23 10:15:14 UTC
(In reply to comment #5)
> "could have same UUID" -- no, that's not supported, and cannot be -- it's
> Universally Unique... uniqueness is the whole point of a UUID. Anything else
> can clash, but an UUID is what defines a VG identity.
> 
> As for filtering, lvmetad always needs a full view of the system. A pvscan
> --cache must never apply any filtering, and all commands that want to filter
> need to do so on data coming from lvmetad.
> 

If filtering is not applied here, then we must ensure that any kind of inconsistency is silently ignored, and that "pvscan --cache <device>" as well as plain "pvscan --cache" (without specifying a device) just looks at what is on the disk and updates lvmetad with what it has found, no matter whether there are inconsistent VGs, cluster VGs, duplicate names, or whatever corruption it might be... I think it works like that today, right? (With the exception that "pvscan --cache" uses filtering, which it probably should not if we want a consistent solution.)


But there's really one situation that bothers me the most:

Filtering comes in *very handy* in host-guest environments where we just need to keep the "inner/guest" LVM state out of the "outer/host" LVM state. Duplicate names are common. UUID duplicates are probably less common if things are done right. But I can imagine someone just copying a template guest image and then making several copies of it (which copies the whole LVM layout as well if LVM is used inside). So in this case we need an extra step to make all the UUIDs unique again. We can do without filtering, but it's a bit inconvenient compared to the old/non-lvmetad functionality where the filters did the job of "separation" for us here.
And this is just a practical example that comes to my mind now (as I'm using lots of VMs with lots of assigned LVs and I'm hitting these problems now when I switch lvmetad on :) But they might not be as problematic as they seem to be, see the comment below...).
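For the copied-template case, the extra step of making the UUIDs unique again already exists as a command; a sketch with a made-up device path:

  # Rewrite the VG name and all PV/VG UUIDs on the cloned image's PV so it no
  # longer collides with the original VG.
  vgimportclone --basevgname guest2vg /dev/vg_images/guest2_disk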

> *IF* (and that's a big if) it's unavoidable that someone has multiple
> clashing copies of a VG with the same UUID, they will have to apply a
> consistent filter to pvscan --cache (and to all pvscan --cache invocations,
> including those in udev rules). If anything ever goes out of sync on those
> filters, they are out of luck, with a broken system. Any filter change can
> only happen with lvmetad turned off.

I'm not saying that going without filtering for "pvscan --cache" is wrong. That's fine. We just need to be sure that we cover it consistently (so the same behaviour for "pvscan --cache" and "pvscan --cache <device>"). Also, nowadays, when using filtering, I get:

  [0] rawhide/~ # lvs
    No device found for PV JNBIxh-lBFr-nTYX-pVj0-Uq0O-PscL-QeezwW.

...if that device is filtered out. So we have to differentiate here between what has been filtered out (the 1st stage) and what is really missing because it is not filtered out yet we still can't find the device (the 2nd stage). So this is just a small bug that has to be fixed, which is in fact easy to do...

If we can't deal with duplicate UUIDs with lvmetad because the filter is not active, we have to document how to deal with this some other way or how to work around it (just like in the example described above), as users might be used to the filters doing that fine job for them.

So the only thing we need to make sure of is *to have a clear procedure* for inconsistencies, duplicates and possible metadata corruption we might find on devices that would normally be filtered out (just like in the example with the host/guest environment - it's really not the job of the host to deal with this; the guest must see this and make any corrections inside so that it does not clash with any existing host metadata).

Also, we definitely don't want to autoactivate volumes on the host that belong to the guest (and which should be activated there instead). But that's just a side effect which could be easily fixed as well - just by looking at the filter when doing the autoactivation...

Comment 7 Petr Rockai 2012-07-31 08:22:37 UTC
Re duplicate UUIDs and filtering: I disagree that it is a practical solution. It is actually extremely fragile. If your filters ever go out of sync for whatever reason, all LVM commands will consider all VGs that share a UUID to be a single inconsistent group, and will happily wipe out all metadata belonging to all copies that happen to have a lower seqno. Running lvmetad only makes this problem more explicit, since it always needs to see all VGs. It simply cannot distinguish between "inconsistent VG" and "duplicate UUID", since UUID is the only thing that allows us to distinguish VGs reliably.

Anyway, the only time I can see this happening is when you are running multiple snapshots of a virtual image that has LVM in it and you expose those snapshots in the host as block devices. Maybe a simple thing like "global/recursive_lvm = 0" (as default) would work. Actually, thinking of it more, I have an idea...

Either with a "recursive_lvm" setting or with a global filter, we could do the following trick: with every request to lvmetad, send in the "global" configuration that affects which devices are scanned. If we have the "complete = X" support mentioned in #839941 we can use this to trigger replies with "complete = 0" whenever we run into a mismatch between a cached (in lvmetad) value of the filter or the recursive_lvm setting and the value sent in a request. However, this also means that LVM commands that obtain a "complete = 0" reply must be able to trigger a re-scan (either by forking off pvscan --cache or running the equivalent internally), so they can continue working normally after the administrator edits lvm.conf.

Opinions? (Btw., do we actually need recursive_lvm for anything nowadays? It used to be handy to obtain raid10-like functionality IIRC, but with the new raid target, that is covered, right? I would still make it available in the config file, but off by default; it's not very well tested and it's kinda dangerous anyway...) This solution would also most likely predicate a D-3 solution to #839941, even if there are currently dissenting voices?

Comment 8 Peter Rajnoha 2012-08-01 09:25:32 UTC
(In reply to comment #7)
> sync for whatever reason, all LVM commands will consider all VGs that share
> a UUID to be a single inconsistent group, and will happily wipe out all
> metadata belonging to all copies that happen to have a lower seqno. Running

Fair enough...

> Anyway, the only time I can see this happening is when you are running
> multiple snapshots of a virtual image that has LVM in it and you expose
> those snapshots in the host as block devices. Maybe a simple thing like
> "global/recursive_lvm = 0" (as default) would work. Actually, thinking of it
> more, I have an idea...
> 

Hmm, but such a global setting would enable/disable that for all volumes, right? So if I wanted to have stacked LVM (for some reason) and disable it only for a certain set of volumes, I'd have no choice here (which is what filters used to provide).

> Either with a "recursive_lvm" setting or with a global filter, we could do
> the following trick: with every request to lvmetad, send in the "global"
> configuration that affects which devices are scanned. If we have the
> "complete = X" support mentioned in #839941 we can use this to trigger
> replies with "complete = 0" whenever we run into a mismatch between a cached
> (in lvmetad) value of the filter or the recursive_lvm setting and the value
> sent in a request. However, this also means that LVM commands that obtain a
> "complete = 0" reply must be able to trigger a re-scan (either by forking
> off pvscan --cache or running the equivalent internally), so they can
> continue working normally after the administrator edits lvm.conf.

Sure, we need to avoid any "manual" command re-execution. We can't say "if you see this error message in the output, just re-execute the command or, even worse, run pvscan --cache directly". This must be handled inside the command itself: it will parse the answer from lvmetad and trigger the rescan as needed, as if pvscan --cache had been used (and yes, that would also solve 839941 - we'd just remove the ExecPost=pvscan --cache and rely on the first LVM command to trigger the scan and fill lvmetad with the info).

So if I understand this correctly, in case we have several clients with different filters and these clients' requests are interleaved, we always end up with direct scanning, so lvmetad would effectively be out of the game this way, right?

> 
> Opinions? (Btw., do we actually need recursive_lvm for anything nowadays? It
> used to be handy to obtain raid10-like functionality IIRC, but with the new
> raid target, that is covered, right? I would still make it available in the
> config file, but off by default; it's not very well tested and it's kinda
> dangerous anyway...) This solution would also most likely predicate a D-3
> solution to #839941, even if there are currently dissenting voices?

What about bug #817960? Isn't that interfering? People can still rely on this to work. IOW, the solution should provide the same functionality as before (and with filters in place). But it's hard to tell who's actually using that and what the exact use cases are...

Comment 9 Petr Rockai 2012-08-07 20:29:55 UTC
Last week, we had a meatspace discussion on this in Brno and the conclusions were:

1) In pvscan and pvscan --cache, use (a new) lvmetad/filter setting for filtering and always ignore devices/filter; lvmetad/filter is empty by default and its usage is discouraged
2) Pass results from lvmetad through devices/filter on client side
3) With each request to lvmetad, send in a "validity token" (currently a hash of lvmetad/filter);
  a) When the validity token is the same as currently held by lvmetad, lvmetad proceeds as normal
  b) If the token does not match, lvmetad still gives an answer, but it is tagged with "incomplete = 1"; the client encountering "incomplete = 1" will, normally, trigger a "pvscan --cache" (internally), which also updates the token cached in lvmetad
4) Users are responsible to either avoid duplicated UUIDs entirely (preferred) or, failing that, they should filter out anything with duplicated UUIDs using lvmetad/filter;
  a) to access those devices, they need to provide --config with use_lvmetad = 0 and a filter to only make those specific devices accessible;
  b) alternatively, they could run multiple lvmetad instances, using different sockets, and using different lvm.conf, with mutually exclusive lvmetad/filter (possible, but NOT recommended)

I believe that the above constitutes a very reasonable compromise, and also magically fixes #839941. I'll go ahead with the implementation, aiming for having it all done this week.
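A minimal sketch of how point 3 is meant to work from the administrator's side (the setting name follows this comment; it was later renamed, see comment 12):

  # After editing the lvmetad-side filter in lvm.conf, the next LVM command
  # sends a token that no longer matches the one cached in lvmetad, receives
  # an "incomplete" answer and rescans internally, which also updates the
  # cached token. The same rescan can be forced by hand:
  pvscan --cache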

Comment 10 Peter Rajnoha 2012-08-08 08:26:28 UTC
(In reply to comment #9)
> Last week, we had a meatspace discussion on this in Brno and the conclusions
> were:
> 
> 1) In pvscan and pvscan --cache, use (a new) lvmetad/filter setting for
> filtering and always ignore devices/filter; lvmetad/filter is empty by
> default and its usage is discouraged
> 2) Pass results from lvmetad through devices/filter on client side

Yes! That's consistent now - a *global* filter for the *global* lvmetad and udev, and then specific client filters on top.

> 3) With each request to lvmetad, send in a "validity token" (currently a
> hash of lvmetad/filter);
>   a) When the validity token is the same as currently held by lvmetad,
> lvmetad proceeds as normal
>   b) If the token does not match, lvmetad still gives an answer, but it is
> tagged with "incomplete = 1"; the client encountering "incomplete = 1" will,
> normally, trigger a "pvscan --cache" (internally), which also updates the
> token cached in lvmetad
> 4) Users are responsible to either avoid duplicated UUIDs entirely
> (preferred) or, failing that, they should filter out anything with
> duplicated UUIDs using lvmetad/filter;
>   a) to access those devices, they need to provide --config with use_lvmetad
> = 0 and a filter to only make those specific devices accessible;

For example, libguestfs+guestfsbrowser uses filters this way to access the guest images and make changes in them. We need to let them know OR document this visibly somewhere. Since such use is very rare, use_lvmetad=0 should not do any harm in this case. It just needs to be documented well!
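A sketch of such an invocation (the device path and filter are made up):

  # Bypass the system lvmetad and look only at the copied guest image.
  lvs --config 'global { use_lvmetad = 0 }
                devices { filter = [ "a|^/dev/mapper/guest_image$|", "r|.*|" ] }'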

>   b) alternatively, they could run multiple lvmetad instances, using
> different sockets, and using different lvm.conf, with mutually exclusive
> lvmetad/filter (possible, but NOT recommended)

That would be a bit harder with respect to systemd/initscripts (if we want to start such services automatically on boot), but possible. Still, let's try to avoid this and use "multiple" lvmetad instances for testing only. I see solution "a" above as a much, much better way to go.

> 
> I believe that the above constitutes a very reasonable compromise, and also
> magically fixes #839941.

Yes, that's the nice thing about this solution as well :) Minimal changes to the current lvmetad code, nothing complex to introduce, and it solves the problem very well and consistently...

Comment 11 Corey Marthaler 2012-08-09 17:38:08 UTC
Adding QA ack for 6.4. 

Devel will need to provide unit testing results however before this bug can be
ultimately verified by QA.

Comment 12 Alasdair Kergon 2012-08-09 18:01:01 UTC
1. How do the two filters interact?

I'd prefer that the new filter is not tied to lvmetad: perhaps call it devices/global_filter and always apply it in *all* code, client and server.

It's then clearer again that if you exclude devices using that filter, no lvm code will use those devices.  (If it's just called lvmetad/filter, then when you *aren't* using lvmetad, the filter would be ignored, making the transition to/from lvmetad harder: you're using lvmetad and some devices are ignored, then you turn off lvmetad and suddenly those devices become visible again.)

Only ignore the existing devices/filter in the variant of the pvscan command that updates lvmetad viz. pvscan --cache.  (The code should already do this.)
Normal 'pvscan' will apply devices/filter.

2. Yes, the existing filter is then client-side.
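A sketch of how the proposed pair of settings would look in lvm.conf (the shipped option is indeed devices/global_filter, per the Doc Text; device names are made up):

  devices {
      # global_filter is honoured everywhere, including pvscan --cache, so
      # devices rejected here never reach lvmetad or any LVM command.
      global_filter = [ "r|^/dev/vg_images/.*|", "a|.*|" ]

      # filter is applied per command, on top of global_filter (and, with
      # lvmetad, on the data the daemon returns).
      filter = [ "r|^/dev/sdq.*|", "a|.*|" ]
  }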

Comment 13 Alasdair Kergon 2012-08-09 18:32:56 UTC
I don't understand what's been written about point 4.  (UUID is ambiguous here too - PV UUID? VG UUID? vgname?  LVID?)

Regarding all types of UUIDs, I don't believe a system with lvmetad enabled should behave any differently from one that doesn't use it.

So can you please break this down into:

a) a system using lvmetad needs to behave differently from a non-lvmetad system in respect of some specific situation(s) because...

and/or

b) both an lvmetad system and a non-lvmetad system don't handle some specific situation very well and so we propose...

Comment 14 Alasdair Kergon 2012-08-09 19:04:47 UTC
(There are many situations where it is legitimate for duplicated IDs of various types to occur, and the way these situations are handled shouldn't depend upon whether or not lvmetad is in use.)

Regarding things like vgimportclone, libguestfs, the existing precedence rules should be sufficient for lvmetad - if there's already a device of the same type with the same PV UUID, then the existing one will continue to be used and the new one ignored.  Client code that wants to manipulate the "copy" should indeed not use the system's lvmetad.

Comment 15 Alasdair Kergon 2012-08-09 19:18:30 UTC
Should we support running multiple lvmetad daemons on a (test) system?  Yes, but keep this simple - it looks like LVM_LVMETAD_SOCKET might already do the trick  and should be added to the ENVIRONMENT VARIABLES section of the lvm(8) man page.
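A sketch of how a second daemon could be run on a test box (the -s option and socket path are assumptions; check lvmetad(8)):

  # Start a private lvmetad instance listening on its own socket...
  lvmetad -s /var/run/lvm/lvmetad-test.socket
  # ...and point individual commands at it through the environment variable.
  LVM_LVMETAD_SOCKET=/var/run/lvm/lvmetad-test.socket pvs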

Comment 16 Alasdair Kergon 2012-08-09 19:54:15 UTC
(The existing precedence rules are in lvmcache_add() and _insert_vginfo in lvmcache.c.  Need to be sure that 'pvscan --cache' still applies these the same way when populating lvmetad.)

Item 3, if I understand it correctly, is saying that the precise subset of data that any instance of lvmetad stores is defined by the global_filter.  Different lvmetad instances (with different sockets) may use different global_filters.  An operation exists to compare the client's global_filter against the server's and if they are different, the client knows it can't trust the data to some degree.  To handle start up and global filter changes, separate states might be recorded in the daemon and presented to clients: "global_filter_hash this daemon is currently using", "another connected client is scanning", "this data includes the results of a completed scan with the current global filter hash".

Then for example, if a client connects and finds the global filter hash does not match AND another client is currently scanning, then it knows there's a problem as two clients are connecting concurrently to the same daemon but with different global filters.

Comment 17 Alasdair Kergon 2012-08-09 20:05:28 UTC
If a client connects and
  the global filter hash matches
  there is no connected client already scanning
  the data does not contain results of a complete scan
then the client scans the remaining devices, skipping any devices already cached - there's no need to re-scan anything already known about.

If the global filter hash doesn't match, or if 'pvscan --cache' is run explicitly, the existing lvmetad cache is emptied and everything is rescanned.

Comment 18 Petr Rockai 2012-09-26 18:43:00 UTC
Fixed upstream.

Comment 21 Nenad Peric 2013-01-29 10:26:21 UTC
Tested with iSCSI devices on two servers.
The device was used on one machine to create a VG and was then scanned on another, with and without global filtering.
The normal filter remained the default one ( "a/.*/" ).

vgscan -vvv without global filter (on r6-node02) :

        Using /dev/sdd1
        Opened /dev/sdd1 RO O_DIRECT
        /dev/sdd1: block size is 1024 bytes
      /dev/sdd1: lvm2 label detected at sector 1
        lvmcache: /dev/sdd1: now in VG #orphans_lvm2 (#orphans_lvm2) with 0 mdas
        /dev/sdd1: Found metadata at 4608 size 717 (in area at 4096 size 1044480) for vg (cxNKiv-krT3-2bTI-mF7M-MJus-syod-MD1aRG)
        lvmcache: /dev/sdd1: now in VG vg with 1 mdas
        lvmcache: /dev/sdd1: setting vg VGID to cxNKivkrT32bTImF7MMJussyodMD1aRG
        lvmcache: /dev/sdd1: VG vg: Set creation host to r6-node01.
        Closed /dev/sdd1


vgscan -vvv --cache with filtering enabled (global_filter = ["r|/dev/sdd|"] )

       Closed /dev/sdc1
      Setting response to OK
      Setting response to OK
        /dev/sdd: Skipping (regex)
        /dev/sdd1: Skipping (regex)
        Opened /dev/sde RO O_DIRECT


Marking verified with:

lvm2-2.02.98-8.el6.x86_64

Comment 22 errata-xmlrpc 2013-02-21 08:09:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html