Bug 475550 - HALVM - no VG metadata locking can cause serious corruption.
Status: CLOSED WONTFIX
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Jonathan Earl Brassow
QA Contact: Cluster QE
Reported: 2008-12-09 11:50 EST by Issue Tracker
Modified: 2010-10-23 02:27 EDT
CC: 13 users

Doc Type: Bug Fix
Last Closed: 2010-05-20 11:53:16 EDT

Attachments: None
Description Issue Tracker 2008-12-09 11:50:03 EST
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2008-12-09 11:50:04 EST
State the problem

   1. Provide time and date of the problem

      Any time/date; reproducible

   2. Indicate the platform(s) (architectures) the problem is being reported against.

      All Archs

   3. Provide clear and concise problem description as it is understood at the time of escalation

          HALVM relies on VG (or LV) tags to fail VGs or LVs over across nodes in a cluster. As its most basic form of locking, HALVM uses the activation { volume_list = [] } feature of LVM, which allows or disallows the activation of a device-mapper table for the LV in question. However, because there is no locking of the VG metadata itself, nor of the VG activation in memory, it is easy to have the same VG active on both nodes at the same time (though not the LVs), and therefore to destroy the VG metadata from the node that does not hold the tags for a VG/LV: activation/volume_list only restricts LV activation, not vgrename, lvresize, or any other operation on the VG metadata.
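For reference, the volume_list filter described above lives in lvm.conf. The fragment below is an illustrative sketch only; the VG and host names are made up, not taken from this report:

```
# /etc/lvm/lvm.conf (illustrative excerpt)
activation {
    # Only LVs in these VGs, or LVs/VGs carrying this node's tag, may be
    # activated on this node. Nothing here restricts metadata commands
    # such as vgrename or lvresize.
    volume_list = [ "VolGroup00", "@node1.example.com" ]
}
```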

          * Observed behavior

          vgscan activates the VG metadata on all cluster nodes, vgdisplay shows the VGs as active, and it is possible to run commands such as lvresize, vgreduce, vgrename, etc. on the node that does not hold tags for that VG.

          * Desired behavior 

          A locking mechanism that prevents the node that does not possess the appropriate tag from activating the VG metadata (i.e. vgdisplay will not show it, and it will not be possible to perform VG metadata modifications on it using the standard LVM command set).

   4. State specific action requested of SEG

         Evaluate with engineering whether this is possible (apparently it is not), and open a discussion with engineering. This behaviour has already caused data loss for our customers, who successfully resized LVs on the wrong machine and destroyed their data. This should be treated as a bug because HALVM is supposed to protect the LVM metadata, and it does not.

   5. State whether or not a defect in the product is suspected

         Yes: the use of activation/volume_list as a locking mechanism, which does not protect VG metadata.

   6. If there is a proposed patch, make sure it is in unified diff format (diff -pruN)

         N/A

   7. Refrain from using the word "hang", as it can mean different things to different people in different contexts. Use a better and more specific description of your problem.

         N/A

   8. This is especially important for severity one and two issues. What is the impact to the customer when they experience this problem?
          
         This is particularly important because more and more customers are adopting HALVM without knowing that it is easy to destroy data without any warning or protection.


Provide supporting info

   1. State other actions already taken in working the problem:

      bugzilla searches, reproducer in house, discussion with LVM devel team members.

   2. Attach sosreport

      none

   3. Attach other supporting data

      as comments on ticket.

   4. Provide issue repro information:

      - setup a cluster with shared storage
      - configure halvm
      - create a VG and LVs on one node.
      - configure rgmanager to use it with halvm on a service
      - activate the service

      - on the passive node for this service do a vgscan and notice that the VG is there available.
      - on the passive node start changing the vg and lvs, doing lvresize, vgrename etc.
      - on the active node do a vgscan and notice the change


   5. List any known hot-fix packages on the system

      N/A

   6. List any customer applied changes from the last 30 days 

      N/A


GFS/Cluster Suite Specific

   1. Attach sosreport for all nodes

      N/A

   2. State fencing method being used

      indifferent

   3. Do the following:
          * Verify that all the ''cluster.conf'' files are the same
          * Verify that all the cluster node names are in ''/etc/hosts''
          * Verify ''/etc/hosts'' names match what is in ''ifconfig''
          * Verify that there is a fence device (Red Hat does not support manual fencing.) 

          indifferent

This event sent from IssueTracker by fleitner  [Support Engineering Group]
 issue 244080
Comment 2 Issue Tracker 2008-12-09 11:50:06 EST
1- Machines after boot, the halvm service is disabled: 

[root@pe1950-3 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  pe1950-3-hb                              Online, Local, rgmanager
  pe1950-4-hb                              Online, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  halvm                (none)                         disabled        


[root@pe1950-4 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  pe1950-3-hb                              Online, rgmanager
  pe1950-4-hb                              Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  halvm                (none)                         disabled    


2- Configuration is correct: VG FC is not on volume_list and will have its
LVs activated only if the tag is present. But there are no tags at the moment.

BUT the problem already reveals itself: vgs and lvs show VG FC with its
metadata active on both nodes. That is already dangerous, even though the
LVs are not active.

[root@pe1950-3 ~]# lvm dumpconfig activation/volume_list
  volume_list=["myvg", "VolGroup00", "P3P4",
"@pe1950-3.gsslab.fab.redhat.com"]

[root@pe1950-3 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree VG Tags
  FC     1   3   0 wz--n- 10.00G 7.00G        

[root@pe1950-3 ~]#  lvs -o +lv_tags FC
  /dev/dm-10: Checksum error
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G                                              
  lv02 FC   -wi--- 1.00G                                              
  lv03 FC   -wi--- 1.00G     


[root@pe1950-4 ~]#  lvm dumpconfig activation/volume_list 
  volume_list=["myvg", "VolGroup00", "P3P4",
"@pe1950-4.gsslab.fab.redhat.com"]

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  /dev/cdrom: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-9: Checksum error
  VG   #PV #LV #SN Attr   VSize  VFree VG Tags
  FC     1   3   0 wz--n- 10.00G 7.00G 

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G                                              
  lv02 FC   -wi--- 1.00G                                              
  lv03 FC   -wi--- 1.00G   

3- Service halvm is simply the activation of VG FC, which will activate all
LVs from FC.

                <resources>
                        <lvm name="mylv" vg_name="FC"/>
                        ...
                </resources>
                <service autostart="0" recovery="restart" name="halvm">
                        <lvm ref="mylv"/>
                </service>
        </rm>


4- Activating halvm service:

[root@pe1950-3 ~]# clusvcadm -e halvm
Member pe1950-3.gsslab.fab.redhat.com trying to enable halvm...success
Service halvm is now running on pe1950-3.gsslab.fab.redhat.com

[root@pe1950-3 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  pe1950-3.gsslab.fab.redhat.com           Online, Local, rgmanager
  pe1950-4.gsslab.fab.redhat.com           Online, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  halvm                pe1950-3.gsslab.fab.redhat.com started         

[root@pe1950-4 ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  pe1950-3.gsslab.fab.redhat.com           Online, rgmanager
  pe1950-4.gsslab.fab.redhat.com           Online, Local, rgmanager

  Service Name         Owner (Last)                   State         
  ------- ----         ----- ------                   -----         
  halvm                pe1950-3.gsslab.fab.redhat.com started    

service running on pe1950-3.gsslab.fab.redhat.com.

4.1- Status of the VGs: both nodes have the FC metadata active and can see
FC; ideally node pe1950-4 should NOT have FC active at all.

[root@pe1950-3 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree VG Tags                       
  FC     1   3   0 wz--n- 10.00G 7.00G pe1950-3.gsslab.fab.redhat.com

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree VG Tags                       
  FC     1   3   0 wz--n- 10.00G 7.00G pe1950-3.gsslab.fab.redhat.com

4.2- As expected HALVM did its job and allowed only pe1950-3 to activate
the LVs.

[root@pe1950-3 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi-a- 1.00G                                              
  lv02 FC   -wi-a- 1.00G                                              
  lv03 FC   -wi-a- 1.00G      

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G                                              
  lv02 FC   -wi--- 1.00G                                              
  lv03 FC   -wi--- 1.00G    

5- But now it is easy to see that, because pe1950-4 has activated the VG
metadata for FC,

5.1- we can start changing it while there are LVs active on pe1950-3:

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree VG Tags                       
  FC     1   3   0 wz--n- 10.00G 7.51G pe1950-3.gsslab.fab.redhat.com

[root@pe1950-4 ~]# lvresize -L 500M FC/lv01
  Reducing logical volume lv01 to 500.00 MB
  Logical volume lv01 successfully resized

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  /dev/dm-9: Checksum error
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 500.00M                                              
  lv02 FC   -wi---   1.00G                                              
  lv03 FC   -wi---   1.00G      

FC/lv01 is now 500M, but on the other node it is active and using 1G. The
filesystem on top would now be in danger as soon as lv01 is deactivated on
pe1950-3.

5.2- There is already an inconsistency, because pe1950-3 shows a size for
the LV based on the metadata:

[root@pe1950-3 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi-a- 500.00M                                              
  lv02 FC   -wi-a-   1.00G                                              
  lv03 FC   -wi-a-   1.00G    

and a different size on the dmsetup table:

[root@pe1950-3 ~]# dmsetup table FC-lv01
0 2097152 linear 253:10 384

6- It is possible to run many LVM commands and destroy the LVs and VGs to a
much greater extent. It is not clear to the user on which machine to make
the changes, so HALVM delivers a false sense of locking and data safety.

7- The recommendation is that HALVM either use another type of locking
instead of activation/volume_list, or force the VG metadata to become
read-only on non-owning nodes, to avoid these problems.

Thanks
Eduardo.



Issue escalated to Support Engineering Group by: edamato.
edamato assigned to issue for EMEA Production Escalation.
Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by fleitner  [Support Engineering Group]
 issue 244080
Comment 3 RHEL Product and Program Management 2008-12-09 12:19:43 EST
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".
Comment 4 Jonathan Earl Brassow 2008-12-16 13:03:23 EST
This bug boils down to a weakness in the tag enforcement.  We are in discussions to design a fix.  Changes would likely revolve around volume_list and being able to specify policies.  The policies would be something like:
- ACTIVATION - those not on the list will not be allowed to activate (this is the current enforcement)
- COMMIT - those not on the list will not be allowed to alter metadata, but they can still read/display information about a volume
- READ - those not on the list will not be allowed to even read/display information about the volume.

Selecting a 'COMMIT' or 'READ' policy would fix this bug.

Discussion is still ongoing, and may result in a 'conditional NACK - design' for rhel4.8.
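A configuration for such policies might look like the sketch below. The policy keyword is purely hypothetical; no such setting exists in LVM, and it is shown only to illustrate the proposal in this comment:

```
activation {
    volume_list = [ "VolGroup00", "@node1.example.com" ]
    # Hypothetical keyword, one of:
    #   ACTIVATION (current enforcement) | COMMIT | READ
    volume_list_policy = "COMMIT"
}
```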
Comment 8 Alasdair Kergon 2009-11-04 06:47:24 EST
Firstly, a misconception here.  VG metadata is never 'active' on a node.  In this setup both nodes will always see all the metadata at all times.  What activation/volume_list does is act as a filter instructing certain LVM operations to ignore it.  

The request here is for additional filter mechanisms to restrict which node may *view* and/or *change* the VG metadata.

Now 'activation/volume_list' is an LV-based list.  The request here is for a filter at VG level.  How could you extract a VG-based filter from an LV-based one?  You could say 'if any LV in the VG matches, then treat the VG as matching'.  That would certainly provide consistency, but it would not be efficient to implement (you'd have to test every LV in the VG) and it would be easy to misconfigure.

Alternatively, we could provide a new VG-level filter setting which matches directly at VG level - so it would perform matches against VG name and VG tags - and take precedence over 'activation/volume_list'.   So if this new setting does not find a match for a particular VG, no LVs in that VG will be activated regardless of what 'activation/volume_list' says.

The simplest implementation could be as 'global/volume_group_list'.  But at what level would the filter be applied?  To apply it, the VG tags would need to be known. We could capture these during scanning, adding them to struct lvmcache_vginfo.  However, that would be prone to races.  The filter could be applied up-front: it would prevent the locking of such a VG for write operations regardless of locking_type.  To do this without races, it would have to acquire the write lock first and then check the filter to decide whether to drop the lock and return.  (It would also need to be applied during activation before checking 'activation/volume_list'.)  

But should it return an error to the user if the write lock request fails for this reason?  'vgchange -xn vg0' with vg0 filtered out - yes, an error.  'vgchange -x' with no args - no, not an error.  That's rather awkward as it means quite a bit of code has to 'know' whether each VG being processed was specifically asked for (error), or a result of an 'all VGs' or tag list expansion (not error).  Hiding the VGs from the display output too?  Well in one way that's similar, but means changing more code paths to be capable of 'backing out without causing an error'.  Once we start trying to change code, it'll probably become obvious that one way (filter writes; filter reads and writes; offer choice of both) is easier to do than the other (it depends how much code is shared/not shared between the code paths), and I think that will decide which solution we go for.
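The VG-level matching described above (match on VG name, or on VG tags via '@'-prefixed entries, as in activation/volume_list) can be sketched in shell. The function name and list format here are illustrative, not actual LVM code:

```shell
#!/bin/sh
# Illustrative sketch (not LVM source): a VG passes the proposed VG-level
# filter if its name, or any of its tags, appears in the list.

# matches_vg_list VG_NAME "TAG..." "ENTRY..." -> success on a match
matches_vg_list() {
    vg_name=$1; vg_tags=$2; list=$3
    for entry in $list; do
        case $entry in
            @*) t=${entry#@}                # tag entry: compare to VG tags
                for tag in $vg_tags; do
                    [ "$tag" = "$t" ] && return 0
                done ;;
            *)  [ "$entry" = "$vg_name" ] && return 0 ;;  # plain VG name
        esac
    done
    return 1
}

# Example: VG "FC" tagged with the owning node's hostname
if matches_vg_list FC "pe1950-3.gsslab.fab.redhat.com" \
        "VolGroup00 @pe1950-3.gsslab.fab.redhat.com"; then
    echo "FC: matches - LVs may be activated"
else
    echo "FC: filtered out"
fi
```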
Comment 11 Jonathan Earl Brassow 2010-05-11 18:18:44 EDT
The limitations outlined in this bug put it in the same family as bz 572311, 583769, 509368.  The proposed solution is found in bug 585229 (RHEL5) or 585217 (RHEL6/upstream).

The solution is dependent on being able to use single machine device-mapper targets when LVs are activated exclusively in a cluster.  If this ability is not added to RHEL4, then the solution will not be viable there.
Comment 12 Jonathan Earl Brassow 2010-05-20 11:53:16 EDT
Closing bug WONTFIX.  A solution exists on RHEL5 (bug 585229) and upstream/RHEL6 (bug 585217).

The "workaround" for RHEL4 will have to rely on users putting policies in place so that no alterations are made to LVs and VGs which are actively being managed by rgmanager.  It is ok to make alterations on the machine where the service is running, but not from machines where it is not.
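A minimal sketch of such a policy check, assuming the administrator wraps metadata commands in a script: the wrapper, function, and names below are hypothetical, though reading tags with 'vgs --noheadings -o vg_tags' is a real LVM command:

```shell
#!/bin/sh
# Hypothetical guard for the workaround: refuse to touch a VG unless this
# node holds its tag.

owns_vg() {
    # owns_vg VG_TAG NODE_NAME -> success when this node holds the tag
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# In real use the tag would come from:
#   tag=$(vgs --noheadings -o vg_tags FC | tr -d ' ')
tag="pe1950-3.gsslab.fab.redhat.com"
node="pe1950-4.gsslab.fab.redhat.com"

if owns_vg "$tag" "$node"; then
    echo "FC: owned here, safe to modify"
else
    echo "FC: owned by '$tag', refusing to modify from $node"
fi
```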
