Description of problem:

Version-Release number of selected component (if applicable):
RHEL 5 RHCS beta1

How reproducible:
Follow steps below.

Steps to Reproduce:
We started a new test with cmirror activated (it was not activated earlier), starting from a scratched (clean) LVM configuration.

1. Start the cluster on both nodes:
     service ccsd start
     service cman start
     service cmirror start
     service clvmd start
     service fenced start
     service rgmanager start
   OK - no problems.

2. Create a new VG:
     vgcreate testvg1 /dev/emcpowera /dev/emcpowerc
   OK - no problems.

3. Create a new mirrored LV:
     lvcreate -L 500M -m 1 --corelog -n testlv1 testvg1
   OK - no problems.

4. Create an ext3 filesystem on the volume:
     mke2fs -j /dev/testvg1/testlv1
   OK - no problems.

5. Configure the cluster with the LVM volume and the filesystem as resources, and add them to a failover service.
   OK - no problems.

6. With the cluster active and a job writing to the filesystem, we removed one of the disks in the mirrored volume from the SAN side (/dev/emcpowerc).
   OK - the volume is automatically converted to a linear volume without downtime. The job writing to the filesystem continues without problems. LVM status is OK.

7. The same test as above, but in addition we forced a power-off on the active cluster node (with the write job still running against the filesystem).
   OK - the volume behaves as above, and in addition the cluster fails over to the second node. LVM status is OK. The write job is forced to halt, as expected ;-)

-> Problems from this point!

We now bring up the node that was forced down by the power-off and join it back into the cluster. The SAN disk is also reactivated. From this point on, the cluster service fails to handle the filesystem/volume. For example, we get the following messages when trying to activate the volume (on both nodes):

[root@tnscl02cn001 ~]# vgchange -a y
  Volume group "testvg1" inconsistent
  Inconsistent metadata copies found - updating to use version 188
  Error locking on node tnscl02cn001: Volume group for uuid not found: kigNllj6NfwPVqvTyihozk2MBX2Z3hqNyXXyTC4s5jR8RJVo18CqqCgqkyJCiCWn
  Error locking on node tnscl02cn002: Volume group for uuid not found: kigNllj6NfwPVqvTyihozk2MBX2Z3hqNyXXyTC4s5jR8RJVo18CqqCgqkyJCiCWn
  0 logical volume(s) in volume group "testvg1" now active

So at the moment we are not able to activate the volumes. Any tips (cleaning, etc.)?

Actual results:
Unable to activate the volume on the host that rejoined after the power failure.

Expected results:
Able to activate the volume on the host that rejoined after the power failure.

Additional info:
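A minimal diagnostic sketch (assuming the VG and device names above) of commands that can show which PV is carrying the stale metadata when this error appears:

  # Show the PVs, VGs and LVs each node sees, with UUIDs, to spot the
  # device whose metadata disagrees with the rest of the VG:
  pvs -o pv_name,pv_uuid,vg_name
  vgs -o vg_name,vg_uuid,vg_attr
  lvs testvg1

  # Confirm clvmd is running and rescan the devices:
  service clvmd status
  pvscan
  vgscan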
Created attachment 153058 [details] LVM diag before node 1 outage
Created attachment 153059 [details] LVM diag before node 2 outage
Created attachment 153060 [details] LVM diag after node1 outage
Created attachment 153061 [details] LVM diag after node2 outage
Created attachment 153062 [details] lvmdump node1
Created attachment 153063 [details] lvmdump node2
*** Bug 237174 has been marked as a duplicate of this bug. ***
I ran into this problem with another customer who is using strictly HA LVM. I resolved that issue with an update to the /usr/share/cluster/lvm.sh script.

I'm a little confused by the setup you are running... it appears that you are trying to run both HA LVM and CLVM at the same time. If you are using CLVM, there is no need to set up LVM in rgmanager. rgmanager is for failing over services; there is no need to fail over a cluster mirror, because it is already running on the second node. That fact can be a bit confusing... I would advocate opening another bug against 'cluster suite/rgmanager' stating that the LVM fail-over script should detect (and not operate on) clustered volumes (a quick check for this is sketched below).

So, you should decide whether you want CLVM or HA LVM.

CLVM pros:
1) relatively easy setup
2) usable in active/active environments
CLVM cons:
1) mirror performance is somewhat slower than the single-machine variant

HA LVM pros:
1) better suited to active/passive environments
2) better performance than the cluster variant
HA LVM cons:
1) setup is difficult, and you must get it right
2) cannot be used in an active/active environment
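A minimal sketch of such a check, using the VG name from this report (the clustered flag is the last character of vg_attr):

  # List the VG attributes; a trailing 'c' in vg_attr (e.g. "wz--nc")
  # marks a clustered volume group.
  vgs -o vg_name,vg_attr testvg1

  # The same information from vgdisplay:
  vgdisplay testvg1 | grep -i clustered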
I'm going to assume you want to go with CLVM. Please fix the following:
1) Set ordered="0" in your cluster.conf; otherwise the service will attempt to revert back to the original node when it comes back up.
2) Remove the lvm resource references from your cluster.conf file.

Things to watch out for:
When you bring back a disk and a machine, remember that the disk you are bringing back has stale LVM metadata on it. That is what the conflict was when the service was trying to start. LVM sees the old metadata on one disk and the new metadata on the other disks and says "Hey, this volume group is inconsistent."

Let's call the device that failed and was brought back 'X'. You will need to re-initialize the device (pvcreate X) and re-add it to the volume group (vgextend testvg1 X), as sketched below. This will eliminate the conflict. Normally, if a disk fails, you replace it - then there is also no conflict.

We are currently working on the LVM tools to automagically handle conflicts in the metadata (bug 221613).
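To make that concrete, a sketch using the device names from this report (assuming /dev/emcpowerc is the device 'X' that failed and came back; adjust to your setup):

  # If the VG metadata still references the missing PV, clear that first:
  #   vgreduce --removemissing testvg1

  # Re-initialize the returned device; pvcreate may need to be forced
  # (-ff) because the device still carries the stale LVM label/metadata.
  pvcreate -ff /dev/emcpowerc

  # Re-add the device to the volume group:
  vgextend testvg1 /dev/emcpowerc

  # Not part of the steps above, but to restore the mirror (the LV was
  # down-converted to linear when the disk was pulled) and activate,
  # something like this could follow:
  lvconvert -m 1 --corelog testvg1/testlv1
  vgchange -a y testvg1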
Thanks. Actually, the customer is more interested in strictly HA LVM, since they will use it in a failover environment. What's the procedure for setting that up? They want to test it and verify that it works OK for them.
The procedure for setting up HA LVM is now in the release notes [bug 241907]:
http://www.redhat.com/docs/manuals/csgfs/release-notes/CS_4-RHEL4U5-relnotes.html

I've also updated the HA LVM script for RHEL 4.6/RHEL 5.1 to allow the (mis)configuration of putting a CLVM volume as a resource for rgmanager (HA LVM).
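For reference, the setup is roughly the following (a sketch only; the linked release notes are the authoritative procedure, and the hostname/VG names here are placeholders):

  # In /etc/lvm/lvm.conf on each node: use local, non-cluster locking and
  # let only the root VG (plus VGs tagged with this node's name) activate
  # outside of rgmanager, e.g.:
  #   locking_type = 1
  #   volume_list = [ "VolGroup00", "@tnscl02cn001" ]

  # Rebuild the initrd so the updated lvm.conf is used at boot time:
  mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)

  # Finally, define the lvm and fs resources for the fail-over service in
  # /etc/cluster/cluster.conf; rgmanager (lvm.sh) then activates the
  # VG/LV on whichever node owns the service.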
Closing this; I think the response in comment #12 answers the question. If there is still a problem, please reopen this bug.