Bug 772773 - HA LVM service relocation after leg device failure fails due to inability to remove tag
Summary: HA LVM service relocation after leg device failure fails due to inability to remove tag
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: resource-agents
Version: 6.2
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 769731
Blocks:
 
Reported: 2012-01-09 22:50 UTC by Jonathan Earl Brassow
Modified: 2012-06-20 14:39 UTC
CC List: 15 users

Fixed In Version: resource-agents-3.9.2-12.el6
Doc Type: Bug Fix
Doc Text:
No documentation needed.
Clone Of: 769731
Environment:
Last Closed: 2012-06-20 14:39:20 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch for 769731 fix (which this bug was cloned from) (918 bytes, patch) - 2012-01-09 22:55 UTC, Jonathan Earl Brassow
Patch 1 of 2 to fix the problem (1.30 KB, patch) - 2012-04-19 23:59 UTC, Jonathan Earl Brassow
Patch 2 of 2 to fix the problem (1.04 KB, patch) - 2012-04-20 00:00 UTC, Jonathan Earl Brassow


Links
Red Hat Product Errata RHBA-2012:0947 (normal, SHIPPED_LIVE): resource-agents bug fix and enhancement update - last updated 2012-06-19 20:59:56 UTC

Description Jonathan Earl Brassow 2012-01-09 22:50:29 UTC
+++ This bug was initially created as a clone of Bug #769731 +++

Description of problem:
This works when using the "new" clvm locking method for HA LVM, but not when using the "old" tags locking method.

  <rm>
    <failoverdomains>
      <failoverdomain name="TAFT_domain" ordered="1" restricted="1">
        <failoverdomainnode name="taft-01" priority="1"/>
        <failoverdomainnode name="taft-02" priority="1"/>
        <failoverdomainnode name="taft-03" priority="1"/>
        <failoverdomainnode name="taft-04" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm name="lvm" vg_name="TAFT" lv_name="ha1"/>
      <fs name="fs1" device="/dev/TAFT/ha1" force_fsck="0" force_unmount="1" self_fence="0" fstype="ext3" mountpoint="/mnt/fs1" options=""/>
    </resources>
    <service autostart="1" domain="TAFT_domain" name="halvm" recovery="relocate">
      <lvm ref="lvm"/>
      <fs ref="fs1"/>
    </service>
  </rm>
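
For reference, the "old" tags-based locking mentioned above keeps clvmd out of the picture and instead restricts activation through lvm.conf on each node, roughly like the sketch below (the root VG name "vg_root" is an assumption; the actual lvm.conf from this cluster is not part of this report):

  # /etc/lvm/lvm.conf on each node - illustrative sketch only
  locking_type = 1                          # local locking; clvmd not used
  volume_list = [ "vg_root", "@taft-01" ]   # root VG (assumed name) plus anything
                                            # tagged with this node's hostname

The LVM resource agent then adds or strips the node name as a tag on the HA volume, so only the current owner can activate it; failing to strip that tag is exactly what breaks relocation in the logs below.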


[cmarthal@silver bin]$ ./revolution_9 -R ../../var/share/resource_files/taft.xml  -l /home/msp/cmarthal/work/rhel5/sts-root -r /usr/tests/sts-rhel5.8  -v TAFT -A -i 3
only supported IO load for HA LVM is NONE, changing default ioload.
mirror device(s) (ha1) found in TAFT on taft-02

================================================================================
Iteration 0.1 started at Wed Dec 21 16:12:38 CST 2011
================================================================================
Scenario kill_random_legs: Kill random legs

********* Mirror info for this scenario *********
* mirrors:            ha1
* leg devices:        /dev/sdb1 /dev/sdc1
* log devices:        /dev/sdh1
* failpv(s):          /dev/sdb1
* failnode(s):        taft-02
* leg fault policy:   remove
* log fault policy:   allocate

* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-02
*************************************************

Mirror Structure(s):
  LV             Attr   LSize  Copy%  Devices                        
  ha1            Mwi-ao 25.00G 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0] iwi-ao 25.00G        /dev/sdb1(0)                   
  [ha1_mimage_1] iwi-ao 25.00G        /dev/sdc1(0)                   
  [ha1_mlog]     lwi-ao  4.00M        /dev/sdh1(0)                   

PV=/dev/sdb1
        ha1_mimage_0: 6
PV=/dev/sdb1
        ha1_mimage_0: 6

Disabling device sdb on taft-02

Attempting I/O to cause mirror down conversion(s) on taft-02
1+0 records in
1+0 records out
512 bytes (512 B) copied, 3.8e-05 seconds, 13.5 MB/s

RELOCATING halvm FROM taft-02 TO taft-03

Relocation attempt of service halvm to taft-03 failed.
FI_engine: recover() method failed



"OLD" Owner:

Dec 21 16:12:56 taft-02 lvm[8358]: Primary mirror device 253:3 has failed (D).
Dec 21 16:12:56 taft-02 lvm[8358]: Device failure in TAFT-ha1.
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 145669554176: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 145669664768: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 0: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 512 at 4096: Input/output error
Dec 21 16:12:56 taft-02 lvm[8358]: /dev/sdb1: read failed after 0 of 2048 at 0: Input/output error
Dec 21 16:12:57 taft-02 lvm[8358]: Couldn't find device with uuid tQq8sX-K9LF-bWcn-QTTH-vBxb-QBXi-4cLc3D.
Dec 21 16:13:03 taft-02 clurgmgrd[7795]: <notice> Stopping service service:halvm 
Dec 21 16:13:13 taft-02 lvm[8358]: Mirror status: 1 of 2 images failed.
Dec 21 16:13:13 taft-02 lvm[8358]: Repair of mirrored LV TAFT/ha1 finished successfully.
Dec 21 16:13:19 taft-02 clurgmgrd: [7795]: <notice> Deactivating TAFT/ha1 
Dec 21 16:13:19 taft-02 clurgmgrd: [7795]: <notice> Making resilient : lvchange -an TAFT/ha1 
Dec 21 16:13:21 taft-02 clurgmgrd: [7795]: <notice> Resilient command: lvchange -an TAFT/ha1 --config devices{filter=["a|/dev/sda2|","a|/dev/sdc1|","a|/dev/sdd1|","a|/dev/sde1|","a|/dev/sdf1|","a|/dev/sdg1|","a|/dev/sdh1|","a|unknown|","a|device|","r|.*|"]} 
Dec 21 16:13:21 taft-02 lvm[8358]: TAFT-ha1 has unmirrored portion.
Dec 21 16:13:21 taft-02 lvm[8358]: dm_task_run failed, errno = 6, No such device or address
Dec 21 16:13:21 taft-02 lvm[8358]: TAFT-ha1 disappeared, detaching
Dec 21 16:13:21 taft-02 lvm[8358]: No longer monitoring mirror device TAFT-ha1 for events.
Dec 21 16:13:23 taft-02 clurgmgrd: [7795]: <notice> Removing ownership tag (taft-02) from TAFT/ha1 
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Unable to delete tag from TAFT/ha1 
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Failed to stop TAFT/ha1 
Dec 21 16:13:25 taft-02 clurgmgrd: [7795]: <err> Failed to stop TAFT/ha1 
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <notice> stop on lvm "lvm" returned 1 (generic error) 
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <crit> #12: RG service:halvm failed to stop; intervention required 
Dec 21 16:13:25 taft-02 clurgmgrd[7795]: <notice> Service service:halvm is failed 
Dec 21 16:13:26 taft-02 clurgmgrd[7795]: <alert> #2: Service service:halvm returned failure code.  Last Owner: taft-02 
Dec 21 16:13:26 taft-02 clurgmgrd[7795]: <alert> #4: Administrator intervention required. 

[root@taft-02 ~]# lvs -a -o +devices,lv_tags
  Couldn't find device with uuid tQq8sX-K9LF-bWcn-QTTH-vBxb-QBXi-4cLc3D.
  LV    VG      Attr   LSize  Copy%  Convert Devices       LV Tags
  ha1   TAFT    -wi--- 25.00G                /dev/sdc1(0)  taft-02



"NEW" Owner:
Dec 21 16:13:01 taft-03 qarshd[8850]: Running cmdline: clusvcadm -r halvm -m taft-03
Dec 21 16:13:23 taft-03 clurgmgrd[7783]: <err> #43: Service service:halvm has failed; can not start.

[root@taft-03 ~]# lvs -a -o +devices,lv_tags
  LV    VG      Attr   LSize  Copy%  Convert Devices       LV Tags
  ha1   TAFT    -wi--- 25.00G                /dev/sdc1(0)  taft-02


Version-Release number of selected component (if applicable):
2.6.18-301.el5

lvm2-2.02.88-5.el5    BUILT: Fri Dec  2 12:25:45 CST 2011
lvm2-cluster-2.02.88-5.el5    BUILT: Fri Dec  2 12:48:37 CST 2011
device-mapper-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
device-mapper-event-1.02.67-2.el5    BUILT: Mon Oct 17 08:31:56 CDT 2011
cmirror-1.1.39-14.el5    BUILT: Wed Nov  2 17:25:33 CDT 2011
kmod-cmirror-0.1.22-3.el5    BUILT: Tue Dec 22 13:39:47 CST 2009

--- Additional comment from cmarthal on 2011-12-22 15:04:20 EST ---

This is easily reproduced.


Dec 22 11:55:05 taft-03 clurgmgrd: [7233]: <notice> Removing ownership tag (taft-03) from TAFT/ha1 
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Unable to delete tag from TAFT/ha1 
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Failed to stop TAFT/ha1 
Dec 22 11:55:07 taft-03 clurgmgrd: [7233]: <err> Failed to stop TAFT/ha1 
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <notice> stop on lvm "lvm" returned 1 (generic error) 
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <crit> #12: RG service:halvm failed to stop; intervention required 
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <notice> Service service:halvm is failed 
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <alert> #2: Service service:halvm returned failure code.  Last Owner: taft-03 
Dec 22 11:55:07 taft-03 clurgmgrd[7233]: <alert> #4: Administrator intervention required.

--- Additional comment from jbrassow on 2012-01-04 16:53:32 EST ---

When I try to run revolution_9, it craps out saying
"only supported IO load for HA LVM is NONE".  You seem to have managed to override this.  How?

--- Additional comment from jbrassow on 2012-01-05 11:48:35 EST ---

I /think/ I understand what is going on here.  Since I have been unable to run the test, I've broken down the steps that I believe are causing the problem:
1) add tag to a mirror ('lvchange --addtag foo vg/lv')
2) disable a mirror leg (with no I/O and after sync has completed)
3) run 'lvconvert --repair --use-policies vg/lv'
4) run 'lvchange --deltag foo vg/lv'

Step #4 now fails because you can't change the metadata of a VG (i.e. remove the tag on the LV) while there are missing PVs.  This comes up now because the dmeventd script has switched from using 'vgreduce --removemissing' to 'lvconvert --repair'.  The former would remove the missing PV from the VG while the latter does not.

The output from the 'lvchange --deltag' operation is suppressed by rgmanager - instead we get the generic "Unable to delete tag from vg/lv" message.

The solution would be to run 'vgreduce --removemissing' if the 'lvchange --deltag' operation fails and try a second time to remove the tag.  The solution is also simple to verify - if I can run the test.
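
As a minimal shell sketch of that fallback (a hypothetical helper, not the actual resource-agents change; the function name and arguments are assumptions):

  # Hypothetical helper illustrating the proposed retry; not the actual lvm.sh patch.
  remove_ownership_tag() {
      local vg=$1 lv=$2 tag=$3

      # First attempt - this is the call that currently fails while the VG
      # still references the missing PV left behind by 'lvconvert --repair'.
      lvchange --deltag "$tag" "$vg/$lv" && return 0

      # Drop the missing PV(s) from the VG metadata, then retry the tag removal once.
      vgreduce --removemissing "$vg"
      lvchange --deltag "$tag" "$vg/$lv"
  }

  # e.g. remove_ownership_tag TAFT ha1 taft-02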

Comment 1 Jonathan Earl Brassow 2012-01-09 22:55:41 UTC
Created attachment 551697 [details]
Patch for 769731 fix (which this bug was cloned from)

Comment 6 Jonathan Earl Brassow 2012-04-19 23:59:26 UTC
Created attachment 578810 [details]
Patch 1 of 2 to fix the problem

This first patch removes a redundant test.  If the 'strip_and_add_tag' function succeeds, the operating node will be the owner - that is certain.  If it fails, then the ownership is unknown, but we will try again anyway.

The problem comes when we try 'strip_and_add_tag' and it fails.  We may clean up the volume group, but we don't have a chance to add the proper tag before 'vg_owner' is called - making this test not only redundant in the good case, but fatal to operation in the bad case.

This patch only fixes this bug if the service is moving due to a node failure.  If the service is moving due to an administrative command, the next node will see a tag on the VG that belongs to a live node and refuse to activate it.

An additional patch will clear up the issue of moving the service via admin command.
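
To make the ordering concrete, a self-contained sketch of the pre-patch flow is below; only the names 'strip_and_add_tag' and 'vg_owner' come from this comment, and the stub bodies are assumptions, not the real lvm.sh code:

  #!/bin/bash
  # Illustrative stubs only; the real agent functions operate on the actual VG.

  strip_and_add_tag() {
      # Failure case from this bug: the VG may get cleaned up, but this node's
      # ownership tag has not been re-added by the time this returns non-zero.
      return 1
  }

  vg_owner() {
      # Checks whether this node's tag is on the volume; immediately after the
      # failure above it necessarily answers "no".
      return 1
  }

  strip_and_add_tag
  if ! vg_owner; then
      # This is the re-check patch 1 removes: redundant after a successful
      # strip_and_add_tag, and fatal to recovery right after a failed one.
      echo "start aborted even though a retry could have succeeded"
      exit 1
  fi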

Comment 7 Jonathan Earl Brassow 2012-04-20 00:00:26 UTC
Created attachment 578811 [details]
Patch 2 of 2 to fix the problem

Comment 8 Jonathan Earl Brassow 2012-04-20 00:01:56 UTC
The second patch attempts to clean up the VG after a failure to strip the tags, and then tries again.

Comment 10 Chris Feist 2012-04-30 21:22:21 UTC
Rebuilt in resource-agents-3.9.2-12.el6.

Comment 12 Chris Feist 2012-04-30 22:34:16 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
No documentation needed.

Comment 15 errata-xmlrpc 2012-06-20 14:39:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2012-0947.html

