Bug 743112

Summary: multiple failed cmirror devices are unable to be replaced due to locking errors
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA QA Contact: Cluster QE <mspqa-list>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 6.2CC: agk, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, syeghiay, thornber, zkabelac
Target Milestone: rcKeywords: Regression, TestBlocker
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: lvm2-2.02.87-6.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-12-06 17:03:22 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 746254    
Bug Blocks: 748554    

Description Corey Marthaler 2011-10-03 21:05:22 UTC
Description of problem:
Scenario: Kill multiple legs of synced 3 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_multiple_legs_3legs_1
* sync:               1
* striped:            0
* leg devices:        /dev/sdf1 /dev/sdd1 /dev/sdg1
* log devices:        /dev/sdb1
* no MDA devices:     
* failpv(s):          /dev/sdf1 /dev/sdg1
* failnode(s):        taft-01 taft-02 taft-03 taft-04
* leg fault policy:   allocate
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-04...
taft-04: lvcreate -m 2 -n syncd_multiple_legs_3legs_1 -L 300M helter_skelter /dev/sdf1:0-1000 /dev/sdd1:0-1000 /dev/sdg1:0-1000 /dev/sdb1:0-150

PV=/dev/sdg1
        syncd_multiple_legs_3legs_1_mimage_2: 5.1
PV=/dev/sdf1
        syncd_multiple_legs_3legs_1_mimage_0: 5.1
PV=/dev/sdg1
        syncd_multiple_legs_3legs_1_mimage_2: 5.1
PV=/dev/sdf1
        syncd_multiple_legs_3legs_1_mimage_0: 5.1

Waiting until all mirrors become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating gfs2 on top of mirror(s) on taft-01...
Mounting mirrored gfs2 filesystems on taft-01...
Mounting mirrored gfs2 filesystems on taft-02...
Mounting mirrored gfs2 filesystems on taft-03...
Mounting mirrored gfs2 filesystems on taft-04...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Sleeping 10 seconds to get some outsanding GFS I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----
        ---- taft-02 ----
        ---- taft-03 ----
        ---- taft-04 ----

Disabling device sdf on taft-01
Disabling device sdg on taft-01
Disabling device sdf on taft-02
Disabling device sdg on taft-02
Disabling device sdf on taft-03
Disabling device sdg on taft-03
Disabling device sdf on taft-04
Disabling device sdg on taft-04

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.34491 s, 122 MB/s
Verifying current sanity of lvm after the failure
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
Verifying FAILED device /dev/sdf1 is *NOT* in the volume(s)
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
Verifying FAILED device /dev/sdg1 is *NOT* in the volume(s)
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
olog: 1
Verifying LOG device(s) /dev/sdb1 *ARE* in the mirror(s)
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
Verifying LEG device /dev/sdd1 *IS* in the volume(s)
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
verify the newly allocated dm devices were added as a result of the failures
Checking EXISTENCE of syncd_multiple_legs_3legs_1_mimage_0 on:  taft-01 taft-02
syncd_multiple_legs_3legs_1_mimage_0 on taft-02 should now exist
** ARE THERE ENOUGH SPARE PVS TO ALLOCATE? **


# BEFORE THE FAILURE:
[root@taft-02 ~]# lvs -a -o +devices
  LV                                     Attr   LSize   Log                              Copy%  Devices
  syncd_multiple_legs_3legs_1            mwi-ao 300.00m syncd_multiple_legs_3legs_1_mlog 100.00 syncd_multiple_legs_3legs_1_mimage_0(0),syncd_multiple_legs_3legs_1_mimage_1(0),syncd_multiple_legs_3legs_1_mimage_2(0)
  [syncd_multiple_legs_3legs_1_mimage_0] iwi-ao 300.00m                                         /dev/sdf1(0)
  [syncd_multiple_legs_3legs_1_mimage_1] iwi-ao 300.00m                                         /dev/sdd1(0)
  [syncd_multiple_legs_3legs_1_mimage_2] iwi-ao 300.00m                                         /dev/sdg1(0)
  [syncd_multiple_legs_3legs_1_mlog]     lwi-ao   4.00m                                         /dev/sdb1(0)

# AFTER THE FAILURE:
[root@taft-02 ~]# lvs -a -o +devices
  Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
  Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
  LV                                     Attr   LSize   Log Copy%  Devices
  syncd_multiple_legs_3legs_1            mwi-ao 300.00m     100.00 syncd_multiple_legs_3legs_1_mimage_2(0),syncd_multiple_legs_3legs_1_mimage_3(0)
  syncd_multiple_legs_3legs_1_mimage_0   -wi--- 300.00m            unknown device(0)
  syncd_multiple_legs_3legs_1_mimage_1   vwi-a- 300.00m 
  [syncd_multiple_legs_3legs_1_mimage_2] iwi-ao 300.00m            /dev/sdd1(0)
  [syncd_multiple_legs_3legs_1_mimage_3] iwi-ao 300.00m            /dev/sdh1(0)
  syncd_multiple_legs_3legs_1_mlog       -wi-s-   4.00m            /dev/sdb1(0)


[root@taft-02 ~]# grep lvm\\[ /var/log/messages
Oct  3 15:46:10 taft-02 lvm[1257]: helter_skelter-syncd_multiple_legs_3legs_1 is now in-sync.
Oct  3 15:51:09 taft-02 lvm[2433]: Monitoring mirror device helter_skelter-syncd_multiple_legs_3legs_1 for events.
Oct  3 15:51:20 taft-02 lvm[2433]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_3legs_1 for events.
Oct  3 15:53:15 taft-02 lvm[2433]: Monitoring mirror device helter_skelter-syncd_multiple_legs_3legs_1 for events.
Oct  3 15:53:40 taft-02 lvm[2433]: helter_skelter-syncd_multiple_legs_3legs_1 is now in-sync.
Oct  3 15:54:16 taft-02 lvm[2433]: Primary mirror device 253:4 has failed (D).
Oct  3 15:54:16 taft-02 lvm[2433]: Secondary mirror device 253:6 has failed (D).
Oct  3 15:54:16 taft-02 lvm[2433]: Device failure in helter_skelter-syncd_multiple_legs_3legs_1.
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdf1: read failed after 0 of 512 at 145669554176: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdf1: read failed after 0 of 512 at 145669664768: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdf1: read failed after 0 of 512 at 0: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdf1: read failed after 0 of 512 at 4096: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdf1: read failed after 0 of 2048 at 0: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdg1: read failed after 0 of 512 at 145669554176: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdg1: read failed after 0 of 512 at 145669664768: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdg1: read failed after 0 of 512 at 0: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdg1: read failed after 0 of 512 at 4096: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: /dev/sdg1: read failed after 0 of 2048 at 0: Input/output error
Oct  3 15:54:17 taft-02 lvm[2433]: Couldn't find device with uuid GF8iiQ-zZQw-aLHL-GXEx-OZSx-tM1m-663Iu3.
Oct  3 15:54:17 taft-02 lvm[2433]: Couldn't find device with uuid XNUu12-W3jG-5ID1-YBhQ-2cC7-9pDx-6eEcWF.
Oct  3 15:54:20 taft-02 lvm[2433]: Error locking on node taft-01: LV helter_skelter/syncd_multiple_legs_3legs_1_mimage_0 in use: not deactivating
Oct  3 15:54:20 taft-02 lvm[2433]: Mirror status: 2 of 3 images failed.
Oct  3 15:54:20 taft-02 lvm[2433]: Trying to up-convert to 3 images, 1 logs.
Oct  3 15:54:20 taft-02 lvm[2433]: Unable to create LV syncd_multiple_legs_3legs_1_mlog in Volume Group helter_skelter: name already in use.
Oct  3 15:54:20 taft-02 lvm[2433]: Failed to create mirror log.
Oct  3 15:54:20 taft-02 lvm[2433]: Trying to up-convert to 2 images, 1 logs.
Oct  3 15:54:20 taft-02 lvm[2433]: Unable to create LV syncd_multiple_legs_3legs_1_mlog in Volume Group helter_skelter: name already in use.
Oct  3 15:54:20 taft-02 lvm[2433]: Failed to create mirror log.
Oct  3 15:54:20 taft-02 lvm[2433]: Trying to up-convert to 2 images, 0 logs.
Oct  3 15:54:20 taft-02 lvm[2433]: Monitoring mirror device helter_skelter-syncd_multiple_legs_3legs_1 for events.
Oct  3 15:54:20 taft-02 lvm[2433]: Another thread is handling an event. Waiting...
Oct  3 15:54:21 taft-02 lvm[2433]: WARNING: Failed to replace 1 of 3 images in volume syncd_multiple_legs_3legs_1
Oct  3 15:54:21 taft-02 lvm[2433]: WARNING: Failed to replace 1 of 1 logs in volume syncd_multiple_legs_3legs_1
Oct  3 15:54:21 taft-02 lvm[2433]: helter_skelter/syncd_multiple_legs_3legs_1: Converted: 1.3%
Oct  3 15:54:36 taft-02 lvm[2433]: helter_skelter/syncd_multiple_legs_3legs_1: Converted: 100.0%
Oct  3 15:54:51 taft-02 lvm[2433]: Repair of mirrored LV helter_skelter/syncd_multiple_legs_3legs_1 finished successfully.
Oct  3 15:54:51 taft-02 lvm[2433]: helter_skelter-syncd_multiple_legs_3legs_1 is now in-sync.
Oct  3 15:54:53 taft-02 lvm[2433]: No longer monitoring mirror device helter_skelter-syncd_multiple_legs_3legs_1 for events.


Version-Release number of selected component (if applicable):
2.6.32-203.el6.x86_64

lvm2-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
lvm2-libs-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
lvm2-cluster-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-libs-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-event-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
device-mapper-event-libs-1.02.66-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011
cmirror-2.02.87-3.el6    BUILT: Wed Sep 21 09:54:55 CDT 2011


How reproducible:
Everytime

Comment 1 Corey Marthaler 2011-10-03 21:27:29 UTC
This test case works in single machine mode. It appears to be the cmirror locking failures that cause this issue.

Comment 2 Corey Marthaler 2011-10-03 22:10:57 UTC
This test case also appears to work when only one device is failed.

Comment 5 Jonathan Earl Brassow 2011-10-14 14:25:22 UTC
Requires a kernel change too...  The fix will have to wait for more than just userspace changes (which are already committed upstream.

Comment 7 Jonathan Earl Brassow 2011-10-14 20:17:33 UTC
git commit id for the LVM changes are:
681ceb16d80e9cfbb768a05b5296e23da427b9cd
c954b73149f992c1a29c6ce5abc8ac4b7ecaf4aa

Comment 9 Jonathan Earl Brassow 2011-10-17 16:38:46 UTC
If this bug is reproducible in a single machine test case, then the proposed patches - while still necessary - will not yield a solution to this bug.

Comment 13 Corey Marthaler 2011-11-03 21:59:52 UTC
Fix verified in the lvm latest rpms with latest kernel.

2.6.32-216.el6.x86_64

lvm2-2.02.87-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
lvm2-libs-2.02.87-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
lvm2-cluster-2.02.87-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.66-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
device-mapper-libs-1.02.66-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
device-mapper-event-1.02.66-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
device-mapper-event-libs-1.02.66-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011
cmirror-2.02.87-6.el6    BUILT: Wed Oct 19 06:46:31 CDT 2011

Comment 14 errata-xmlrpc 2011-12-06 17:03:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1522.html