Description of problem:
This works when using the "new" clvm locking method for HA LVM, but not when using the "old" tags locking method.

<rm>
  <failoverdomains>
    <failoverdomain name="TAFT_domain" ordered="1" restricted="1">
      <failoverdomainnode name="taft-01" priority="1"/>
      <failoverdomainnode name="taft-02" priority="1"/>
      <failoverdomainnode name="taft-03" priority="1"/>
      <failoverdomainnode name="taft-04" priority="1"/>
    </failoverdomain>
  </failoverdomains>
  <resources>
    <lvm name="lvm" vg_name="TAFT" lv_name="ha1"/>
    <fs name="fs1" device="/dev/TAFT/ha1" force_fsck="0" force_unmount="1" self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
  </resources>
  <service autostart="1" domain="TAFT_domain" name="halvm" recovery="relocate">
    <lvm ref="lvm"/>
    <fs ref="fs1"/>
  </service>
</rm>

[cmarthal@silver bin]$ ./revolution_9 -R ../../var/share/resource_files/taft.xml -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.8 -v TAFT -A -i 3
only supported IO load for HA LVM is NONE, changing default ioload.
mirror device(s) (ha1) found in TAFT on taft-02

================================================================================
Iteration 0.1 started at Wed Dec 21 16:12:38 CST 2011
================================================================================
Scenario kill_random_legs: Kill random legs

********* Mirror info for this scenario *********
* mirrors:            ha1
* leg devices:        /dev/sdb1 /dev/sdc1
* log devices:        /dev/sdh1
* failpv(s):          /dev/sdb1
* failnode(s):        taft-02
* leg fault policy:   remove
* log fault policy:   allocate
* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-02
*************************************************

Mirror Structure(s):
  LV              Attr   LSize  Copy%  Devices
  ha1             Mwi-ao 25.00G 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0]  iwi-ao 25.00G        /dev/sdb1(0)
  [ha1_mimage_1]  iwi-ao 25.00G        /dev/sdc1(0)
  [ha1_mlog]      lwi-ao  4.00M        /dev/sdh1(0)

PV=/dev/sdb1
        ha1_mimage_0: 6
PV=/dev/sdb1
        ha1_mimage_0: 6

Disabling device sdb on taft-02

Attempting I/O to cause mirror down conversion(s) on taft-02
1+0 records in
1+0 records out
512 bytes (512 B) copied, 3.8e-05 seconds, 13.5 MB/s

RELOCATING halvm FROM taft-02 TO taft-03
Relocation attempt of service halvm to taft-03 failed.
FI_engine: recover() method failed

"OLD" Owner:
Dec 21 16:12:56 taft-02 lvm[8358]: Primary mirror device 253:3 has failed (D).
Dec 21 16:12:56 taft-02 lvm[8358]: Device failure in TAFT-ha1.
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 145669554176: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 145669664768: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 0: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 4096: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 2048 at 0: Input/output error
Dec 21 16:12:57 taft-02 lvm[8358]: Couldn't find device with uuid tQq8sX-K9LF-bWcn-QTTH-vBxb-QBXi-4cLc3D.
Dec 21 16:13:03 taft-02 clurgmgrd[7795]: <notice> Stopping service service:halvm
Dec 21 16:13:13 taft-02 lvm[8358]: Mirror status: 1 of 2 images failed.
Dec 21 16:13:13 taft-02 lvm[8358]: Repair of mirrored LV TAFT/ha1 finished successfully.
Dec 21 16:13:19 taft-02 clurgmgrd: [7795]: <notice> Deactivating TAFT/ha1
Dec 21 16:13:19 taft-02 clurgmgrd: [7795]: <notice> Making resilient : lvchange -an TAFT/ha1
Dec 21 16:13:21 taft-02 clurgmgrd: [7795]: <notice> Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdc1|","a|/dev/sdd1|","a|/dev/sde1|","a|/dev/sdf1|","a|/dev/sdg1|","a|/dev/sdh1|","a|unknown|","a|device|","r|.*|"]}
Dec 21 16:13:21 taft-02 lvm[8358]: TAFT-ha1 has unmirrored portion.
Dec 21 16:13:21 taft-02 lvm[8358]: dm_task_run failed, errno = 6, No such device or address
Dec 21 16:13:21 taft-02 lvm[8358]: TAFT-ha1 disappeared, detaching
Dec 21 16:13:21 taft-02 lvm[8358]: No longer monitoring mirror device TAFT-ha1 for events.
Dec 21 16:13:23 taft-02 clurgmgrd: [7795]: <notice> Removing ownership tag (taft-02) from TAFT/ha1
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Unable to delete tag from TAFT/ha1
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Failed to stop TAFT/ha1
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Failed to stop TAFT/ha1
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <notice> stop on lvm "lvm" returned 1 (generic error)
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <crit> #12: RG service:halvm failed to stop; intervention required
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <notice> Service service:halvm is failed
Dec 21 16:13:26 taft-02 clurgmgrd[7795]: <alert> #2: Service service:halvm returned failure code. Last Owner: taft-02
Dec 21 16:13:26 taft-02 clurgmgrd[7795]: <alert> #4: Administrator intervention required.

[root@taft-02 ~]# lvs -a -o +devices,lv_tags
  Couldn't find device with uuid tQq8sX-K9LF-bWcn-QTTH-vBxb-QBXi-4cLc3D.
  LV   VG   Attr   LSize  Copy%  Convert Devices       LV Tags
  ha1  TAFT -wi--- 25.00G                /dev/sdc1(0)  taft-02

"NEW" Owner:
Dec 21 16:13:01 taft-03 qarshd[8850]: Running cmdline: clusvcadm -r halvm -m taft-03
Dec 21 16:13:23 taft-03 clurgmgrd[7783]: <err> #43: Service service:halvm has failed; can not start.

[root@taft-03 ~]# lvs -a -o +devices,lv_tags
  LV   VG   Attr   LSize  Copy%  Convert Devices       LV Tags
  ha1  TAFT -wi--- 25.00G                /dev/sdc1(0)  taft-02

Version-Release number of selected component (if applicable):
2.6.18-301.el5
lvm2-2.02.88-5.el5                  BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5          BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5         BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5   BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-14.el5               BUILT: Wed Nov  2 17:25:33 CDT 2011
kmod-cmirror-0.1.22-3.el5           BUILT: Tue Dec 22 13:39:47 CST 2009
This is easily reproduced.

Dec 22 11:55:05 taft-03 clurgmgrd: [7233]: <notice> Removing ownership tag (taft-03) from TAFT/ha1
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Unable to delete tag from TAFT/ha1
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Failed to stop TAFT/ha1
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Failed to stop TAFT/ha1
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <notice> stop on lvm "lvm" returned 1 (generic error)
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <crit> #12: RG service:halvm failed to stop; intervention required
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <notice> Service service:halvm is failed
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <alert> #2: Service service:halvm returned failure code. Last Owner: taft-03
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <alert> #4: Administrator intervention required.
When I try to run revolution_9, it bails out with "only supported IO load for HA LVM is NONE". You seem to have managed to override this. How?
I /think/ I understand what is going on here. Since I have been unable to run the test, I've broken down the steps that I believe cause the problem:

1) add a tag to a mirror ('lvchange --addtag foo vg/lv')
2) disable a mirror leg (with no I/O and after sync has completed)
3) run 'lvconvert --repair --use-policies vg/lv'
4) run 'lvchange --deltag foo vg/lv'

Step #4 now fails because you cannot change the metadata of a VG (i.e. remove the tag on the LV) while there are missing PVs. This comes up now because the dmeventd script has switched from using 'vgreduce --removemissing' to 'lvconvert --repair'. The former would remove the missing PV from the VG, while the latter does not. The output from the 'lvchange --deltag' operation is suppressed by rgmanager - instead we get the generic "Unable to delete tag from vg/lv" message.

The solution would be to run 'vgreduce --removemissing' if the 'lvchange --deltag' operation fails, and then try a second time to remove the tag (a rough sketch of this fallback follows below). The solution is also simple to verify - if I can run the test.
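A minimal shell sketch of the fallback described above, not the actual rgmanager patch; the vg, lv, and tag values are placeholders (the real lvm.sh resource agent derives them from the cluster.conf resource attributes):

  #!/bin/bash
  # Sketch only: retry tag removal after cleaning missing PVs out of the VG.
  vg="vg"
  lv="lv"
  tag="foo"

  # First attempt to strip the ownership tag.
  if ! lvchange --deltag "$tag" "$vg/$lv"; then
      # VG metadata cannot be changed while PVs are missing, so remove the
      # failed PV(s) from the VG first, then retry the tag removal once.
      vgreduce --removemissing "$vg"
      lvchange --deltag "$tag" "$vg/$lv"
  fi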
Created attachment 551698: Patch to fix problem
Testing has confirmed that the problem was correctly identified in comment #3. The script was changed to handle this particular failure case - something made necessary by the change in LVM behaviour. Other operations (i.e. most failure cases and nominal operation) are unaffected.
Fix verified in rgmanager-2.0.52-28.el5.

Scenario kill_primary_leg: Kill primary leg

********* Mirror info for this scenario *********
* mirrors:            ha1
* leg devices:        /dev/sdc1 /dev/sdd1
* log devices:        /dev/sdh1
* failpv(s):          /dev/sdc1
* failnode(s):        taft-04
* leg fault policy:   allocate
* log fault policy:   allocate
* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-04
*************************************************

Mirror Structure(s):
  LV              Attr   LSize  Copy%  Devices
  ha1             mwi-ao 25.00G 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0]  iwi-ao 25.00G        /dev/sdc1(0)
  [ha1_mimage_1]  iwi-ao 25.00G        /dev/sdd1(0)
  [ha1_mlog]      lwi-ao  4.00M        /dev/sdh1(0)

PV=/dev/sdc1
        ha1_mimage_0: 5.1
PV=/dev/sdc1
        ha1_mimage_0: 5.1

Disabling device sdc on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-04
1+0 records in
1+0 records out
512 bytes (512 B) copied, 3.8e-05 seconds, 13.5 MB/s

RELOCATING halvm FROM taft-04 TO taft-01

Verifying current sanity of lvm after the failure

Mirror Structure(s):
  LV              Attr   LSize  Copy%  Devices
  ha1             mwi-ao 25.00G   4.97 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0]  Iwi-ao 25.00G        /dev/sdd1(0)
  [ha1_mimage_1]  Iwi-ao 25.00G        /dev/sde1(0)
  [ha1_mlog]      lwi-ao  4.00M        /dev/sdh1(1)

Verify that each of the mirror repairs finished successfully
Verifying FAILED device /dev/sdc1 is *NOT* in the volume(s)
olog: 1
Verifying LOG device(s) /dev/sdh1 *ARE* in the mirror(s)
Verifying LEG device /dev/sdd1 *IS* in the volume(s)
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of ha1_mimage_0 on: taft-01
Checking EXISTENCE of ha1_mimage_1 on: taft-01

Enabling device sdc on taft-04

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: (   6.66% )
   0/1 mirror(s) are fully synced: (   8.07% )
   [...]
   0/1 mirror(s) are fully synced: (  98.72% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Jan 11 11:30:37 taft-01 clurgmgrd: [11593]: <notice> Removing ownership tag (taft-01) from TAFT/ha1
Jan 11 11:30:39 taft-01 clurgmgrd: [11593]: <err> Unable to delete tag from TAFT/ha1
Jan 11 11:30:39 taft-01 clurgmgrd: [11593]: <err> Attempting volume group clean-up and retry
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
Cause:

Consequence:

Fix:

Result:
Technical note updated. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,7 +1,11 @@
 Cause:
+The behaviour of the LVM commands has changed, causing failed devices to be removed from a mirror but not from the volume group as it was previously.

 Consequence:
+Volume groups cannot be changed while a failed device is present. This includes changing the tags on a logical volume. Failure to alter the tags on the affected logical volume causes an inability to relocate the service.

 Fix:
+An additional LVM command is called to remove the failed physical volume from the volume group. The tags on the logical volumes can then be altered.

-Result:
+Result:
+Two commands are now required to perform the operations that one command did in the past. This change resolves the issue.
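To illustrate the "two commands" the note refers to, this is roughly the sequence that now runs against the volume group from this report (shown here as if entered by hand; the tag value taft-02 is the node ownership tag visible in the lvs output above, and the exact invocation used by the resource agent may differ):

  # remove the failed PV from the VG so its metadata can be modified again
  vgreduce --removemissing TAFT
  # the ownership tag can now be removed, allowing the service to relocate
  lvchange --deltag taft-02 TAFT/ha1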
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0163.html