Bug 951600
Summary: Provide error-free read-only access to clustered VG metadata visible on a non-clustered system using --readonly.
Product: Red Hat Enterprise Linux 6
Reporter: Anil Vettathu <avettath>
Component: lvm2
Assignee: Alasdair Kergon <agk>
lvm2 sub component: Displaying and Reporting (RHEL6)
QA Contact: Cluster QE <mspqa-list>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: agk, amureini, anande, avyadav, bazulay, cmarthal, cpelland, cshao, cww, dparikh, dwysocha, gouyang, hadong, hchiramm, heinzm, huiwa, iheim, jbrassow, jkt, jraju, leiwang, lpeer, lyarwood, marcobillpeter, mkalinin, msnitzer, mspqa-list, nperic, prajnoha, prockai, psubrama, rbalakri, sbhat, scohen, slevine, spanjikk, sputhenp, thornber, tvvcox, yaniwang, ycui, yeylon, zkabelac
Version: 6.5
Target Milestone: rc
Target Release: 6.6
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version: lvm2-2.02.107-1.el6
Doc Type: Enhancement

Doc Text:
LVM commands that report the state of Logical Volumes, Volume Groups or Physical Volumes acquire a new command-line parameter: --readonly. This uses a special read-only mode that accesses on-disk metadata without needing locks. Uses include:
* peeking at disks inside virtual machines while they are in use;
* peeking inside Volume Groups that are marked as clustered when the necessary clustered locking is unavailable for whatever reason.
The commands are unable to report whether or not Logical Volumes are actually in use because there is no communication with any device-mapper kernel driver. The lv_attr field of the 'lvs' command shows an X where this information would normally appear.

Story Points: ---
Clones: 1116944 (view as bug list)
Last Closed: 2014-10-14 08:24:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 820991
Bug Blocks: 988951, 1116944
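The Doc Text above notes that under --readonly the lv_attr field of 'lvs' shows an X wherever activation state would normally appear, because there is no kernel driver to ask. A minimal sketch of detecting that placeholder, using attr strings hard-coded from the transcripts later in this report so it runs without a live LVM setup:

```shell
#!/bin/sh
# Sketch only: spot the 'X' placeholders that `lvs --readonly` leaves in
# lv_attr when it cannot query the device-mapper driver. The sample attr
# strings below are copied from this report; no LVM installation is needed.
lv_state_known() {
    case "$1" in
        *X*) echo "unknown" ;;   # --readonly had no access to kernel state
        *)   echo "known" ;;
    esac
}

lv_state_known '-wi-XX----'   # as printed by `lvs --readonly`
lv_state_known '-wi-ao----'   # typical attr string under normal locking
```

A consumer such as vdsm could use a check like this to avoid treating the unknown state bits as meaningful data.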
Comment 8
Ayal Baron
2013-04-28 11:21:22 UTC
Please obtain a -vvvv trace, as usual for any LVM question, so we can see what is happening. (No PVs are listed explicitly on the command line, so it expands the command line to the list of all PVs, then finds that some are in clustered VGs, so it skips them and gives the error because the requested output is incomplete. As a workaround, try listing the PVs to be reported upon explicitly on the command line.)

I can only speculate about what might be going on, but my quick attempt to replicate this without RHEV with the current upstream LVM2 code has not shown this behaviour. So it might be something already fixed upstream that could be backported, or it might be something specific to the RHEV environment.

The exit code 5 is sensible in these circumstances from lvm's point of view. Say PV1 is clustered and PV2 is not clustered:

    pvs PV1      - gives error
    pvs PV2      - gives success
    pvs PV1 PV2  - gives error
    pvs          - gives error

Now I suppose we could look into adding a global lvm.conf option to ignore clustered VGs silently - IOW providing incomplete output in these circumstances silently. That would mean that 'pvs' would hide PVs that were clustered without indicating that it had done so. (I'd be very worried about an alternative of including these PVs in the output with information in the columns that is not reliable and still returning success.)

With a 'hide clustered VG' option, perhaps:

    pvs PV1      - error
    pvs PV2      - PV2 displayed OK; success
    pvs PV1 PV2  - PV2 displayed OK; error from PV1
    pvs          - PV2 displayed; success
    pvs -a       - PV2 displayed; success

The rules here are:
* clustered PV1 is never displayed;
* if clustered PV1 is explicitly mentioned on the cmdline, you still get an error.

(In reply to comment #43)
> Now I suppose we could look into adding a global lvm.conf option to ignore
> clustered VGs silently - IOW providing incomplete output in these
> circumstances silently. That would mean that 'pvs' would hide PVs that were
> clustered without indicating that it had done so.
...this is already reported as bug #820991.

(In reply to comment #47)
> (In reply to comment #43)
> > Now I suppose we could look into adding a global lvm.conf option to ignore
> > clustered VGs silently - IOW providing incomplete output in these
> > circumstances silently. That would mean that 'pvs' would hide PVs that were
> > clustered without indicating that it had done so.
>
> ...this is already reported as bug #820991.

I cannot say that I agree with the approach, since the use case is totally valid and from the user's pov she did nothing wrong, yet things start to fail. However, if this is the approach we're going to take, then, since this has hit us multiple times already, can we increase the priority of this ability? We will then proceed to incorporate it in vdsm's default config (which is passed from the command line, not relying on lvm.conf). It would probably be necessary, though, to be able to list the clustered PVs (just names would suffice).

(In reply to comment #48)
> I cannot say that I agree with the approach since the use case is totally
> valid and from the user pov she did nothing wrong yet things start to fail.

From the lvm point of view this is what we have: a system containing some PVs in clustered VGs and some in non-clustered VGs. To obtain information about the clustered VGs you *must* run with clustered locking enabled. If you ask for information about the clustered VGs without clustered locking, lvm commands skip over them and return an error. You can run queries against the non-clustered VGs and not get an error. The current behaviour of lvm is sensible and correct.

So: what is the minimum amount of information vdsm needs to obtain about these clustered VGs without obtaining clustered locks? Is there then some safe and correct way for lvm tools to provide that information without obtaining clustered locks? For example, what does vdsm try to do with pe_alloc_count on one of these clustered VGs?
For example, could vdsm make do with just the "Physical Volume Label Fields" in respect of the PVs that are in clustered VGs?

    pvs -o help
    Physical Volume Label Fields
    ----------------------------
      pv_all       - All fields in this section.
      pv_fmt       - Type of metadata.
      pv_uuid      - Unique identifier.
      dev_size     - Size of underlying device in current units.
      pv_name      - Name.
      pv_mda_free  - Free metadata area space on this device in current units.
      pv_mda_size  - Size of smallest metadata area on this device in current units.

(pv_all there is a bug)

What we're currently retrieving is:

    uuid,name,size,vg_name,vg_uuid,pe_start,pe_count,pe_alloc_count,mda_count,dev_size

However, we can split this into 2 different calls:
1. for getting physical info as you suggest (but we would need an indication of whether the PV belongs to a VG, just to know whether it's in use or not);
2. for getting VG-related info on specific PVs that we manage (these would obviously need to be non-clustered).

So that would mean:

1) Run pvs for the PV label fields - this will tell you all PVs, including clustered ones, without error, but would not tell you which are clustered and which are not.

Then either:

2a) Run pvs with the VG-based fields (the other PV ones are actually obtained from the VG metadata), specifying the PVs you are interested in that you know are not clustered. Is this possible for you and sufficient?

or:

2b) We change lvm as described in comment #45, and then you run pvs with the new option and get given details of all PVs in VGs that are not clustered, without error. You get no further information about the clustered VGs.

If that's still insufficient, then:

3) We need to look in more detail at 2b to see if there is any other way we can indicate which PVs appear to belong to clustered VGs and which PVs don't, if it's important for you to know that.

What happens if a sysadmin or someone else mistakenly sets the 'cluster bit' (maybe from another server) on ONE of the VGs (== a RHEV storage domain) that is in use?
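The two-call split discussed in this exchange can be sketched as follows. This is illustrative only: the pipe-separated sample stands in for real `pvs --noheadings --separator '|' -o pv_name,vg_name` output, and the clustered VG name "cluster" and device paths are hypothetical, so the filtering logic runs without a live LVM setup.

```shell
#!/bin/sh
# Sketch of the proposed two-call split. The hard-coded sample below stands
# in for `pvs --noheadings --separator '|' -o pv_name,vg_name` output; the
# VG name "cluster" and the device paths are hypothetical.
pvs_output='/dev/sda|cluster
/dev/sdb|cluster
/dev/vda2|vg_virt063'

# Call 1 equivalent: every PV by name (PV-label-level information only).
all_pvs=$(printf '%s\n' "$pvs_output" | cut -d'|' -f1)

# Call 2 equivalent: query VG-derived fields only for PVs whose VG is known
# (out of band) to be non-clustered -- here, everything outside "cluster".
safe_pvs=$(printf '%s\n' "$pvs_output" | awk -F'|' '$2 != "cluster" {print $1}')

echo "$all_pvs"
echo "$safe_pvs"
```

The open problem in the thread is precisely the comment in call 2: with label fields alone there is no way to learn which PVs sit in clustered VGs, so the "known out of band" step is the missing piece.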
If vdsm skips the 'cluster bit' on it and proceeds, that could cause issues in future, couldn't it? Is there a way to get a list of the VGs skipped because the cluster bit is set from 'pvs'? If yes, couldn't vdsm proceed and then check whether it really cares about those VGs, considering that the storage domains (== VGs) under RHEV's control always carry the 'RHAT_storage_domain' tag? I may be missing something, but I thought I'd share these bits in case they help.

(In reply to Alasdair Kergon from comment #54)
> So that would mean:
>
> 1) run pvs for the PV label fields
> - this will tell you all PVs including clustered ones without error, but
> would not tell you which are clustered and which are not

Why not specify which PVs are clustered so we'd know to skip them?

> Then either
>
> 2a) run pvs with the VG-based fields (the other PV ones are actually
> obtained from the VG metadata) specifying the PVs you are interested in that
> you know are not clustered. Is this possible for you and sufficient?

How would we differentiate?

> or
>
> 2b) We change lvm as described in comment #45, and then you run pvs with the
> new option and get given details of all PVs in VGs that are not clustered,
> without error.
>
> You get no further information about the clustered VG.
>
> -----
>
> If that's still insufficient, then
>
> 3) We need to look in more detail at 2b to see if there is any other way we
> can indicate which PVs appeared to belong to clustered VGs and which PVs
> don't, if it's important for you to know that.

It is, since we need to report this to the user. There are several flows here:

1. creating a new VG - we need to be able to report a list of PVs/potential PVs and what they have on them (it could be that a clustered VG is an old remnant that needs to be overwritten);
2. ongoing work - we need to ignore clustered PVs, since clearly we don't need them.

For number 2 I'm fine with LVM ops just ignoring these PVs (we'll pass the required config param).
For number 1 we need a way to make the distinction.

(In reply to Ayal Baron from comment #56)
> Why not specify which PVs are clustered so we'd know to skip them?

'Clustered' is a VG property and is not available with PV label fields.

> 2. on going work - need to ignore clustered PVs since clearly we don't need
> them.
> For number 2 I'm fine with LVM ops just ignoring these PVs (we'll pass the
> required config param).

OK - so something like comment #45 could cover case 2.

> 1. creating a new VG - we need to be able to report a list of PVs/potential
> PVs and what they have on them (could be that clustered vg is an old remnant
> that needs to be overwritten).
> For number 1 we need a way to make the distinction.

At the point of "creating a new VG" - you *know* that the volumes are not in use and the lvm cluster locking is not active (as you're contemplating overwriting it), and therefore that it is safe to access the VG while not holding the clustered lock that would normally be required? Otherwise, on what basis are you trying to access the clustered VG information? If this metadata was added by a guest that is running, shouldn't you query lvm inside that running guest to obtain this information definitively?

Or is what you're really asking for here some mechanism to peer inside volumes owned/shared by a guest from outside, *while they might be in use*, and without any co-operation from the guest?

(In reply to Alasdair Kergon from comment #58)
> > 1. creating a new VG - we need to be able to report a list of PVs/potential
> > PVs and what they have on them (could be that clustered vg is an old remnant
> > that needs to be overwritten).
> >
> > For number 1 we need a way to make the distinction.
> At the point of "creating a new VG" - you *know* that the volumes are not in
> use and the lvm cluster locking is not active (as you're contemplating
> overwriting it) and therefore it is safe to access the VG while not holding
> the clustered lock that would normally be required? Otherwise on what basis
> are you trying to access the clustered VG information? If this metadata was
> added by a guest that is running, shouldn't you query lvm inside that
> running guest to obtain this information definitively?
>
> Or is what you're really asking for here some mechanism to peer inside
> volumes owned/shared by a guest from outside *while they might be in use*
> and without any co-operation from the guest?

This is exactly what I'm asking for here: the ability to determine whether a PV is part of a clustered VG so that:
1. we can know not to touch it;
2. we can tell the user 'there be danger'.

Under bug 820991, I have added an --ignoreskippedcluster option that allows commands to ignore clustered objects that they cannot properly read, which may form part of the solution to this problem.

"2. on going work - need to ignore clustered PVs since clearly we don't need them. For number 2 I'm fine with LVM ops just ignoring these PVs (we'll pass the required config param)."

So far the new parameter is accepted by: pvs, vgs, lvs, pvdisplay, vgdisplay, lvdisplay, vgchange, lvchange.

Is it needed by any other commands at this stage?

(In reply to Alasdair Kergon from comment #68)
> So far the new parameter is accepted by:
>
> pvs, vgs, lvs, pvdisplay, vgdisplay, lvdisplay, vgchange, lvchange
>
> Is it needed by any other commands at this stage?
None that I can think of.

So we need to be able to report upon the state of any Volume Group metadata regardless of whether it is being used, whether it is consistent, or whether it is being changed at the time it is being accessed.

* Any locks we might take during these operations would be meaningless, because we are running in a different domain from the one that holds the locks. We cannot lock the objects. This is like global/locking_type 0 (no locking).

* The metadata cannot be cached. This means lvmetad must not be used. This is like global/use_lvmetad = 0.

* We must not perform any action that attempts to change the metadata. This is like locking/metadata_read_only = 1:

      # If set to 1, no operations that change on-disk metadata will be permitted.
      # Additionally, read-only commands that encounter metadata in need of repair
      # will still be allowed to proceed exactly as if the repair had been
      # performed (except for the unchanged vg_seqno).

* We must not activate the LV nor probe any activation state (because we have no direct access to the domain where it might be active). This is like --driverloaded n:

      Whether or not the device-mapper kernel driver is loaded. If you set
      this to n, no attempt will be made to contact the driver.

* We must not attempt to back up any metadata, because no local metadata backup is seen. This is like --autobackup n.

Commands needing this support would be: pvs, vgs, lvs, pvdisplay, vgdisplay, lvdisplay, vgcfgbackup; plus lvmdiskscan, lvscan, pvscan, vgscan for completeness; plus built-ins that don't use metadata. This amounts to all commands flagged PERMITTED_READ_ONLY, with the exception of the xxchange commands, which could do nothing useful under these restrictions.

I'm experimenting with adding a new hybrid command-line option --readonly to the 11 commands I mentioned above that will set up the particular configuration described in comment 88.
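Taken together, the restrictions above correspond roughly to the following lvm.conf fragment. This is a sketch for orientation only: --readonly sets the equivalent state internally rather than reading it from lvm.conf, and the placement of all three settings in the global section is an assumption based on the RHEL 6 lvm.conf layout (the comment above names locking/metadata_read_only).

```
global {
    locking_type = 0          # no locking at all
    use_lvmetad = 0           # do not consult the lvmetad metadata cache
    metadata_read_only = 1    # refuse operations that change on-disk metadata
}
```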
    --readonly
        Run the command in a special read-only mode which will read on-disk
        metadata without needing to take any locks. This can be used to peek
        inside metadata used by a virtual machine image while the virtual
        machine is running. It can also be used to peek inside the metadata
        of clustered Volume Groups when clustered locking is not configured
        or running. No attempt will be made to communicate with the
        device-mapper kernel driver, so this option is unable to report
        whether or not Logical Volumes are actually in use.

https://www.redhat.com/archives/lvm-devel/2014-April/msg00098.html

Also changed the lv_attr char to X when we don't know the right value (because the --readonly flag has no access to the LV's lock domain) and removed 'LV Status' from lvdisplay output.

      LV    VG  Attr       LSize  Pool Origin Data% Move Log Cpy%Sync Convert
      lvol1 vg2 twi-XXtz-- 52.00m

Before:

    # vgs
      Skipping clustered volume group vg3

After:

    # vgs --readonly
      VG  #PV #LV #SN Attr   VSize  VFree
      vg3   2   1   0 wz--nc 56.00m 44.00m

Didn't add it to vgscan as that's pointless.

This option should work on a machine that can see disks belonging to a cluster it is not a member of. It should also work when pointed at disks inside a running VM. It should also cope if the cluster or VM is changing the metadata while the command is run. (In some cases it will issue some warnings if it detects this happening, but it should still report the metadata.)

    [root@virt-063 yum.repos.d]# vgs
      connect() failed on local socket: No such file or directory
      Internal cluster locking initialisation failed.
      WARNING: Falling back to local file-based locking.
      Volume Groups with the clustered attribute will be inaccessible.
      Skipping clustered volume group cluster
      Skipping volume group cluster
      VG         #PV #LV #SN Attr   VSize VFree
      vg_virt063   1   2   0 wz--n- 7.51g    0

    [root@virt-063 yum.repos.d]# vgs --readonly
      VG         #PV #LV #SN Attr   VSize  VFree
      cluster      5   2   0 wz--nc 74.98g 18.75g
      vg_virt063   1   2   0 wz--n-  7.51g     0

    [root@virt-063 yum.repos.d]# lvs --readonly
      LV      VG         Attr       LSize   Pool Origin Data% Meta% Move Log Cpy%Sync Convert
      biglv1  cluster    -wi-XX----  18.75g
      biglv2  cluster    -wi-XX----  37.49g
      lv_root vg_virt063 -wi-XX----   6.71g
      lv_swap vg_virt063 -wi-XX---- 816.00m

    [root@virt-063 yum.repos.d]# pvs
      connect() failed on local socket: No such file or directory
      Internal cluster locking initialisation failed.
      WARNING: Falling back to local file-based locking.
      Volume Groups with the clustered attribute will be inaccessible.
      Skipping clustered volume group cluster
      Skipping volume group cluster
      Skipping clustered volume group cluster
      Skipping volume group cluster
      Skipping clustered volume group cluster
      Skipping volume group cluster
      Skipping clustered volume group cluster
      Skipping volume group cluster
      Skipping clustered volume group cluster
      Skipping volume group cluster
      PV        VG         Fmt  Attr PSize PFree
      /dev/vda2 vg_virt063 lvm2 a--  7.51g    0

    [root@virt-063 yum.repos.d]# pvs --readonly
      PV        VG         Fmt  Attr PSize  PFree
      /dev/sda  cluster    lvm2 a--  15.00g     0
      /dev/sdb  cluster    lvm2 a--  15.00g     0
      /dev/sdc  cluster    lvm2 a--  15.00g  3.75g
      /dev/sdd  cluster    lvm2 a--  15.00g     0
      /dev/sdh  cluster    lvm2 a--  15.00g 15.00g
      /dev/vda2 vg_virt063 lvm2 a--   7.51g     0

Marking VERIFIED with:

    lvm2-2.02.107-2.el6               BUILT: Fri Jul 11 15:47:33 CEST 2014
    lvm2-libs-2.02.107-2.el6          BUILT: Fri Jul 11 15:47:33 CEST 2014
    lvm2-cluster-2.02.107-2.el6       BUILT: Fri Jul 11 15:47:33 CEST 2014
    udev-147-2.56.el6                 BUILT: Fri Jul 11 16:53:07 CEST 2014
    device-mapper-1.02.86-2.el6       BUILT: Fri Jul 11 15:47:33 CEST 2014
    device-mapper-libs-1.02.86-2.el6  BUILT: Fri Jul 11 15:47:33 CEST 2014
    device-mapper-event-1.02.86-2.el6 BUILT: Fri Jul 11 15:47:33 CEST 2014
    device-mapper-event-libs-1.02.86-2.el6    BUILT: Fri Jul 11 15:47:33 CEST 2014
    device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr  4 15:43:06 CEST 2014
    cmirror-2.02.107-2.el6                    BUILT: Fri Jul 11 15:47:33 CEST 2014

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html