Escalated to Bugzilla from IssueTracker
State the problem

1. Provide time and date of the problem
   Any time/date; reproducible.

2. Indicate the platform(s) (architectures) the problem is being reported against.
   All archs.

3. Provide a clear and concise problem description as it is understood at the time of escalation.
   HALVM relies on VG (or LV) tags to fail VGs or LVs over across nodes in a cluster. As its most basic form of locking, HALVM uses the activation { volume_list = [] } feature of LVM, which allows or disallows the activation of a device-mapper table for the LV in question. However, because there is no locking of the VG metadata itself, and no locking of the VG activation in memory, it is possible and easy to have the same VG active on both nodes at the same time (though not the LVs), and therefore to destroy the VG metadata from the node that does not hold the tags for a VG/LV: activation/volume_list only restricts LV activation, not vgrename, lvresize, or any other operation on the VG metadata.

   * Observed behavior
     vgscan activates the VG metadata on all cluster nodes, vgdisplay shows the VG as active, and it is possible to run commands such as lvresize, vgreduce, vgrename, etc. on a node that does not hold the tags for that VG.

   * Desired behavior
     A locking mechanism that prevents a node lacking the appropriate tag from activating the VG metadata, i.e. vgdisplay would not show the VG, and it would not be possible to modify the VG metadata on that node using the standard LVM command set.

4. State specific action requested of SEG
   Evaluate with engineering whether this is possible (apparently it is not), and open a discussion with engineering. This behavior has already caused data loss for our customers, who successfully resized LVs on the wrong machine and destroyed their data. This should be treated as a bug because HALVM is supposed to protect the LVM metadata, and it does not.

5. State whether or not a defect in the product is suspected
   Yes: the use of activation/volume_list as a locking mechanism, which does not protect the VG metadata.

6. If there is a proposed patch, make sure it is in unified diff format (diff -pruN)
   N/A

7. Refrain from using the word "hang", as it can mean different things to different people in different contexts. Use a better and more specific description of your problem.
   N/A

8. This is especially important for severity one and two issues. What is the impact to the customer when they experience this problem?
   This is particularly important because more and more customers are adopting HALVM without knowing that it is easily possible to destroy data without any warning or protection.

Provide supporting info

1. State other actions already taken in working the problem:
   Bugzilla searches, in-house reproducer, discussion with LVM development team members.

2. Attach sosreport
   None.

3. Attach other supporting data as comments on ticket.

4. Provide issue repro information:
   - Set up a cluster with shared storage.
   - Configure HALVM.
   - Create a VG and LVs on one node.
   - Configure rgmanager to use it with HALVM in a service.
   - Activate the service.
   - On the passive node for this service, run vgscan and notice that the VG is available there.
   - On the passive node, start changing the VG and LVs (lvresize, vgrename, etc.).
   - On the active node, run vgscan and notice the change.

5. List any known hot-fix packages on the system
   N/A

6. List any customer applied changes from the last 30 days
   N/A

GFS/Cluster Suite Specific

1. Attach sosreport for all nodes
   N/A

2. State fencing method being used
   Indifferent.

3. Do the following:
   * Verify that all the ''cluster.conf'' files are the same
   * Verify that all the cluster node names are in ''/etc/hosts''
   * Verify ''/etc/hosts'' names match what is in ''ifconfig''
   * Verify that there is a fence device (Red Hat does not support manual fencing.)
Indifferent.

This event sent from IssueTracker by fleitner [Support Engineering Group]
issue 244080
1- After boot, the halvm service is disabled on both machines:

[root@pe1950-3 ~]# clustat
Member Status: Quorate

  Member Name       Status
  ------ ----       ------
  pe1950-3-hb       Online, Local, rgmanager
  pe1950-4-hb       Online, rgmanager

  Service Name      Owner (Last)      State
  ------- ----      ----- ------      -----
  halvm             (none)            disabled

[root@pe1950-4 ~]# clustat
Member Status: Quorate

  Member Name       Status
  ------ ----       ------
  pe1950-3-hb       Online, rgmanager
  pe1950-4-hb       Online, Local, rgmanager

  Service Name      Owner (Last)      State
  ------- ----      ----- ------      -----
  halvm             (none)            disabled

2- The configuration is correct: VG FC is not on volume_list and will have its LVs activated only if the tag is present (there are no tags at the moment). But the problem already reveals itself: vgs and lvs show VG FC with its metadata active on both nodes. This is already dangerous, even though the LVs are not active.

[root@pe1950-3 ~]# lvm dumpconfig activation/volume_list
volume_list=["myvg", "VolGroup00", "P3P4", "@pe1950-3.gsslab.fab.redhat.com"]

[root@pe1950-3 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree  VG Tags
  FC     1   3   0 wz--n- 10.00G  7.00G

[root@pe1950-3 ~]# lvs -o +lv_tags FC
  /dev/dm-10: Checksum error
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G
  lv02 FC   -wi--- 1.00G
  lv03 FC   -wi--- 1.00G

[root@pe1950-4 ~]# lvm dumpconfig activation/volume_list
volume_list=["myvg", "VolGroup00", "P3P4", "@pe1950-4.gsslab.fab.redhat.com"]

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  /dev/cdrom: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-9: Checksum error
  VG   #PV #LV #SN Attr   VSize  VFree  VG Tags
  FC     1   3   0 wz--n- 10.00G  7.00G

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G
  lv02 FC   -wi--- 1.00G
  lv03 FC   -wi--- 1.00G

3- The halvm service is simply the activation of VG FC, which activates all LVs from FC:

<resources>
    <lvm name="mylv" vg_name="FC"/>
    ...
</resources>
<service autostart="0" recovery="restart" name="halvm">
    <lvm ref="mylv"/>
</service>
</rm>

4- Activating the halvm service:

[root@pe1950-3 ~]# clusvcadm -e halvm
Member pe1950-3.gsslab.fab.redhat.com trying to enable halvm...success
Service halvm is now running on pe1950-3.gsslab.fab.redhat.com

[root@pe1950-3 ~]# clustat
Member Status: Quorate

  Member Name                       Status
  ------ ----                       ------
  pe1950-3.gsslab.fab.redhat.com    Online, Local, rgmanager
  pe1950-4.gsslab.fab.redhat.com    Online, rgmanager

  Service Name   Owner (Last)                     State
  ------- ----   ----- ------                     -----
  halvm          pe1950-3.gsslab.fab.redhat.com   started

[root@pe1950-4 ~]# clustat
Member Status: Quorate

  Member Name                       Status
  ------ ----                       ------
  pe1950-3.gsslab.fab.redhat.com    Online, rgmanager
  pe1950-4.gsslab.fab.redhat.com    Online, Local, rgmanager

  Service Name   Owner (Last)                     State
  ------- ----   ----- ------                     -----
  halvm          pe1950-3.gsslab.fab.redhat.com   started

The service is running on pe1950-3.gsslab.fab.redhat.com.

4.1- Status of the VGs: both nodes have the FC metadata active and both see FC; ideally node pe1950-4 should NOT have FC active.

[root@pe1950-3 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree  VG Tags
  FC     1   3   0 wz--n- 10.00G  7.00G pe1950-3.gsslab.fab.redhat.com

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree  VG Tags
  FC     1   3   0 wz--n- 10.00G  7.00G pe1950-3.gsslab.fab.redhat.com

4.2- As expected, HALVM did its job and allowed only pe1950-3 to activate the LVs.
[root@pe1950-3 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi-a- 1.00G
  lv02 FC   -wi-a- 1.00G
  lv03 FC   -wi-a- 1.00G

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 1.00G
  lv02 FC   -wi--- 1.00G
  lv03 FC   -wi--- 1.00G

5- But now it is easy to see that, because pe1950-4 has activated the VG metadata for FC,

5.1- we can start to change it while there are LVs active on pe1950-3:

[root@pe1950-4 ~]# vgs -o +vg_tags FC
  VG   #PV #LV #SN Attr   VSize  VFree  VG Tags
  FC     1   3   0 wz--n- 10.00G  7.51G pe1950-3.gsslab.fab.redhat.com

[root@pe1950-4 ~]# lvresize -L 500M FC/lv01
  Reducing logical volume lv01 to 500.00 MB
  Logical volume lv01 successfully resized

[root@pe1950-4 ~]# lvs -o +lv_tags FC
  /dev/dm-9: Checksum error
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi--- 500.00M
  lv02 FC   -wi--- 1.00G
  lv03 FC   -wi--- 1.00G

FC/lv01 is now 500M, but on the other node it is active and using 1G. The filesystem on top would be in danger as soon as lv01 is deactivated on pe1950-3.

5.2- There is already an inconsistency: pe1950-3 shows a size for the LV based on the metadata:

[root@pe1950-3 ~]# lvs -o +lv_tags FC
  LV   VG   Attr   LSize   Origin Snap%  Move Log Copy%  Convert LV Tags
  lv01 FC   -wi-a- 500.00M
  lv02 FC   -wi-a- 1.00G
  lv03 FC   -wi-a- 1.00G

but a different size in the dmsetup table:

[root@pe1950-3 ~]# dmsetup table FC-lv01
0 2097152 linear 253:10 384

6- It is possible to run many LVM commands and destroy the LVs and VGs to a much greater extent. It is not clear to the user on which machine to make changes, so HALVM is delivering a false sense of locking and data security.

7- The recommendation is that HALVM either use another type of locking instead of activation/volume_list, or force the VG metadata to become read-only, to avoid these problems.

Thanks,
Eduardo.

Issue escalated to Support Engineering Group by: edamato.
edamato assigned to issue for EMEA Production Escalation.
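The size mismatch in step 5.2 can be confirmed with quick sector arithmetic (a device-mapper table length is given in 512-byte sectors):

```shell
# The active dmsetup table on pe1950-3 still has the pre-resize length.
sectors=2097152
bytes=$((sectors * 512))
gib=$((bytes / 1024 / 1024 / 1024))
echo "active table: ${sectors} sectors = ${gib} GiB"

# The on-disk metadata, changed from pe1950-4, now says 500 MB:
meta_sectors=$((500 * 1024 * 1024 / 512))
echo "metadata:     ${meta_sectors} sectors = 500 MiB"
```

So the running table still maps 1 GiB while the metadata describes only 500 MiB, which is exactly the inconsistency shown above.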
Internal Status set to 'Waiting on SEG'
Status set to: Waiting on Tech

This event sent from IssueTracker by fleitner [Support Engineering Group]
issue 244080
This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".
This bug boils down to a weakness in the tag enforcement. We are in discussions to design a fix. The changes would likely revolve around volume_list and being able to specify policies. The policies would be something like:

- ACTIVATION - those not on the list will not be allowed to activate (this is the current enforcement)
- COMMIT - those not on the list will not be allowed to alter metadata, but they can still read/display information about a volume
- READ - those not on the list will not be allowed to even read/display information about the volume

Selecting a 'COMMIT' or 'READ' policy would fix this bug. Discussion is still ongoing, and may result in a 'conditional NACK - design' for rhel4.8.
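As a rough sketch of what such a policy knob might look like in lvm.conf (the setting name 'volume_list_policy' and its values are hypothetical, taken from the proposal above, and do not exist in LVM):

```
activation {
    # Existing behavior: LVs/VGs not matching this list cannot be activated.
    volume_list = [ "myvg", "@pe1950-3.gsslab.fab.redhat.com" ]

    # Hypothetical policy selector (NOT an implemented LVM option):
    #   "activation" - block activation only (current enforcement)
    #   "commit"     - also block metadata changes from non-matching nodes
    #   "read"       - also hide the volumes from display commands
    volume_list_policy = "commit"
}
```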
Firstly, a misconception here: VG metadata is never 'active' on a node. In this setup both nodes will always see all the metadata at all times. What activation/volume_list does is act as a filter instructing certain LVM operations to ignore it. The request here is for additional filter mechanisms to restrict which node may *view* and/or *change* the VG metadata.

Now, 'activation/volume_list' is an LV-based list, while the request here is for a filter at VG level. How could you extract a VG-based filter from an LV-based one? You could say 'if any LV in the VG matches, then treat the VG as matching'. That would certainly provide consistency, but it would not be efficient to implement (you'd have to test every LV in the VG) and it would be easy to misconfigure.

Alternatively, we could provide a new VG-level filter setting which matches directly at VG level - so it would perform matches against VG name and VG tags - and take precedence over 'activation/volume_list'. So if this new setting does not find a match for a particular VG, no LVs in that VG will be activated regardless of what 'activation/volume_list' says. The simplest implementation could be as 'global/volume_group_list'.

But at what level would the filter be applied? To apply it, the VG tags would need to be known. We could capture these during scanning, adding them to struct lvmcache_vginfo. However, that would be prone to races.

The filter could instead be applied up-front: it would prevent the locking of such a VG for write operations regardless of locking_type. To do this without races, it would have to acquire the write lock first and then check the filter to decide whether to drop the lock and return. (It would also need to be applied during activation before checking 'activation/volume_list'.)

But should it return an error to the user if the write lock request fails for this reason? 'vgchange -xn vg0' with vg0 filtered out - yes, an error. 'vgchange -x' with no args - no, not an error.
That's rather awkward as it means quite a bit of code has to 'know' whether each VG being processed was specifically asked for (error), or a result of an 'all VGs' or tag list expansion (not error). Hiding the VGs from the display output too? Well in one way that's similar, but means changing more code paths to be capable of 'backing out without causing an error'. Once we start trying to change code, it'll probably become obvious that one way (filter writes; filter reads and writes; offer choice of both) is easier to do than the other (it depends how much code is shared/not shared between the code paths), and I think that will decide which solution we go for.
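For illustration, the 'global/volume_group_list' idea discussed above might look like this in lvm.conf (a sketch only; this setting is a proposal from this discussion, not an existing lvm.conf option):

```
global {
    # Hypothetical VG-level filter. Matches are made against VG names and
    # VG tags, and take precedence over activation/volume_list: if a VG
    # does not match here, none of its LVs can be activated regardless of
    # what activation/volume_list says.
    volume_group_list = [ "FC", "@pe1950-3.gsslab.fab.redhat.com" ]
}
```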
The limitations outlined in this bug put it in the same family as bz 572311, 583769, 509368. The proposed solution is found in bug 585229 (RHEL5) or 585217 (RHEL6/upstream). The solution is dependent on being able to use single machine device-mapper targets when LVs are activated exclusively in a cluster. If this ability is not added to RHEL4, then the solution will not be viable there.
Closing bug WONTFIX. A solution exists on RHEL5 (bug 585229) and upstream/RHEL6 (bug 585217). The "workaround" for RHEL4 will have to rely on users putting policies in place not to make alterations to LVs and VGs which are actively being managed by rgmanager. It is OK to make alterations on the machine where the service is running, but not from machines where it is not.
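One way to enforce such a policy by convention is to wrap destructive LVM commands in a check of which node owns the rgmanager service. A minimal sketch, assuming the clustat output format shown earlier in this ticket and that the local hostname matches the member name clustat reports (the function name safe_lvm is made up for this example):

```shell
# Refuse to run an LVM command unless this host owns the given rgmanager
# service. Parses the service/owner columns of `clustat` output.
safe_lvm() {
    service="$1"; shift
    owner=$(clustat 2>/dev/null | awk -v s="$service" '$1 == s {print $2}')
    if [ "$owner" != "$(hostname)" ]; then
        echo "refusing: service ${service} owned by '${owner:-unknown}', not $(hostname)" >&2
        return 1
    fi
    lvm "$@"
}
```

Usage would be, e.g., `safe_lvm halvm lvresize -L 500M FC/lv01`, which would have refused to run on pe1950-4 in the scenario above. This is only an administrative guard, not real locking.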