Bug 728299 - log device failure on multiple HA mirrors resulted in a corrupt VG
Summary: log device failure on multiple HA mirrors resulted in a corrupt VG
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.1
Hardware: x86_64
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: Jonathan Earl Brassow
QA Contact: Corey Marthaler
URL:
Whiteboard:
Depends On:
Blocks: 756082
TreeView+ depends on / blocked
 
Reported: 2011-08-04 15:57 UTC by Corey Marthaler
Modified: 2012-07-12 16:42 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-07-12 16:42:00 UTC


Attachments (Terms of Use)
log from taft-03 (853.99 KB, text/plain)
2011-08-04 16:09 UTC, Corey Marthaler
no flags Details

Description Corey Marthaler 2011-08-04 15:57:51 UTC
Description of problem:
I have three mirrors included in one HA service, and then failed the log device. The VG that these mirrors resided on ended up inconsistent after the failure.


Cluster Status for TAFT @ Thu Aug  4 10:02:15 2011
Member Status: Quorate

 Member Name      ID   Status
 ------ ----      ---- ------
 taft-01             1 Online, rgmanager
 taft-02             2 Online, rgmanager
 taft-03             3 Online, Local, rgmanager
 taft-04             4 Online, rgmanager

 Service Name    Owner (Last)  State
 ------- ----    ----- ------  -----
 service:halvm   taft-03       started


Scenario: Kill log

********* Mirror info for this scenario *********
* mirrors:            ha1 ha2 ha3
* leg devices:        /dev/sdb1 /dev/sdc1
* log devices:        /dev/sdd1
* failpv(s):          /dev/sdd1
* failnode(s):        taft-03
* leg fault policy:   allocate
* log fault policy:   remove

* HA mirror:          YES
* HA service name:    halvm
* HA current owner:   taft-03
*************************************************

PV=/dev/sdd1
        ha1_mlog: 2
PV=/dev/sdd1
        ha1_mlog: 2


Disabling device sdd on taft-03

Attempting I/O to cause mirror down conversion(s) on taft-03
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.4389e-05 s, 9.4 MB/s
1+0 records in
1+0 records out
512 bytes (512 B) copied, 4.9121e-05 s, 10.4 MB/s
1+0 records in
1+0 records out
512 bytes (512 B) copied, 5.4579e-05 s, 9.4 MB/s
[DEADLOCK]

[root@taft-03 ~]# lvs -a -o +devices
  /dev/sdd1: open failed: No such device or address
  /dev/sdd2: read failed after 0 of 2048 at 0: Input/output error
  /dev/sdd3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde1: Checksum error
  /dev/sde2: Checksum error
  /dev/sde3: Checksum error
  /dev/sdf1: Checksum error
  /dev/sdf2: Checksum error
  /dev/sdf3: Checksum error
  /dev/sdg1: Checksum error
  /dev/sdg2: Checksum error
  /dev/sdg3: Checksum error
  /dev/sdh1: Checksum error
  /dev/sdh2: Checksum error
  /dev/sdh3: Checksum error
  /dev/sdb1: Checksum error
  /dev/sdb2: Checksum error
  /dev/sdb3: Checksum error
  /dev/sdc1: Checksum error
  /dev/sdc2: Checksum error
  /dev/sdc3: Checksum error
[DEADLOCK]

[root@taft-03 ~]# mount
/dev/mapper/TAFT-ha1 on /mnt/fs1 type ext3 (rw)
/dev/mapper/TAFT-ha2 on /mnt/fs2 type ext3 (rw)
/dev/mapper/TAFT-ha3 on /mnt/fs3 type ext3 (rw)
[root@taft-03 ~]# touch /mnt/fs3/foobar
[DEADLOCK]

[root@taft-03 ~]# cat /var/log/messages | grep lvm\\[
Aug  3 16:38:16 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha1 for events.
Aug  3 16:38:16 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha2 for events.
Aug  3 16:38:16 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha3 for events.
Aug  3 16:38:58 taft-03 lvm[7397]: No longer monitoring mirror device TAFT-ha1 for events.
Aug  3 16:38:59 taft-03 lvm[7397]: No longer monitoring mirror device TAFT-ha2 for events.
Aug  3 16:38:59 taft-03 lvm[7397]: No longer monitoring mirror device TAFT-ha3 for events.
Aug  3 16:39:05 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha1 for events.
Aug  3 16:39:05 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha2 for events.
Aug  3 16:39:05 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha3 for events.
Aug  3 16:42:18 taft-03 lvm[7397]: TAFT-ha1 is now in-sync.
Aug  3 16:43:15 taft-03 lvm[7397]: TAFT-ha2 is now in-sync.
Aug  3 16:43:17 taft-03 lvm[7397]: TAFT-ha3 is now in-sync.
Aug  3 18:19:52 taft-03 lvm[7397]: Log device 253:3 has failed (D).
Aug  3 18:19:52 taft-03 lvm[7397]: Device failure in TAFT-ha1.
Aug  3 18:19:53 taft-03 lvm[7397]: /dev/sdd1: read failed after 0 of 4096 at 48553721856: Input/output error
[...]
Aug  3 18:19:53 taft-03 lvm[7397]: /dev/sdd2: read failed after 0 of 512 at 48553721856: Input/output error
[...]
Aug  3 18:19:53 taft-03 lvm[7397]: /dev/sdd3: read failed after 0 of 4096 at 48561979392: Input/output error
[...]
Aug  3 18:19:53 taft-03 lvm[7397]: Another thread is handling an event. Waiting...
Aug  3 18:19:53 taft-03 lvm[7397]: Couldn't find device with uuid 1vFlLa-d3zr-Aeki-rWOV-Ny2c-6QBf-pXjO10.
Aug  3 18:19:53 taft-03 lvm[7397]: Couldn't find device with uuid 2JLe6g-Ed6M-wuG1-o86o-8trB-GLR0-3uSrLe.
Aug  3 18:19:53 taft-03 lvm[7397]: Couldn't find device with uuid YJRe5Z-SD7i-uwuF-wPR5-pXn8-Kor9-0YRyBs.
Aug  3 18:19:54 taft-03 lvm[7397]: Another thread is handling an event. Waiting...
Aug  3 18:20:02 taft-03 lvm[7397]: Mirror log status: 1 of 1 images failed - switching to core
Aug  3 18:20:19 taft-03 lvm[7397]: Monitoring mirror device TAFT-ha1 for events.
Aug  3 18:20:19 taft-03 lvm[7397]: Another thread is handling an event. Waiting...
Aug  3 18:20:41 taft-03 lvm[7397]: /dev/sdd1: open failed: No such device or address
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdg3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: Couldn't read volume group metadata.
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: Couldn't read volume group metadata.
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: Couldn't read volume group metadata.
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: Couldn't read volume group metadata.
Aug  3 18:20:49 taft-03 lvm[7397]: Inconsistent pre-commit metadata copies for volume group TAFT
Aug  3 18:20:49 taft-03 lvm[7397]: Volume group TAFT metadata is inconsistent
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sde1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sde2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sde3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdf1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdf2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdf3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdg1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdg2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdg3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdh3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: Volume group for uuid not found: 3lHsWDW2DxtCFsmYLor0bkTOnSGLT32DZhywE4y8JW7ggRRPWplApfe0Ab4udn9N
Aug  3 18:20:49 taft-03 lvm[7397]: Problem reactivating ha1
Aug  3 18:20:49 taft-03 lvm[7397]: Repair of mirrored LV TAFT/ha1 failed.
Aug  3 18:20:49 taft-03 lvm[7397]: Failed to remove faulty devices in TAFT-ha1.
Aug  3 18:20:49 taft-03 lvm[7397]: Log device 253:11 has failed (D).
Aug  3 18:20:49 taft-03 lvm[7397]: Device failure in TAFT-ha3.
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdb3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc1: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc2: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdc3: Checksum error
Aug  3 18:20:49 taft-03 lvm[7397]: /dev/sdd2: read failed after 0 of 512 at 48553721856: Input/output error
[...]
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdd3: read failed after 0 of 4096 at 48561979392: Input/output error
[...]
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sde1: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sde2: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sde3: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdf1: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdf2: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdf3: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdg1: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdg2: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdg3: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdh1: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdh2: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: /dev/sdh3: Checksum error
Aug  3 18:20:50 taft-03 lvm[7397]: Volume group "TAFT" not found
Aug  3 18:20:50 taft-03 lvm[7397]: Repair of mirrored LV TAFT/ha3 failed.
Aug  3 18:20:50 taft-03 lvm[7397]: Failed to remove faulty devices in TAFT-ha3.
Aug  3 18:20:50 taft-03 lvm[7397]: Log device 253:7 has failed (D).
Aug  3 18:20:50 taft-03 lvm[7397]: Device failure in TAFT-ha2.
Aug  3 18:20:51 taft-03 lvm[7397]: device-mapper: waitevent ioctl failed: Interrupted system call
Aug  3 18:20:51 taft-03 lvm[7397]: No longer monitoring mirror device TAFT-ha1 for events.




Version-Release number of selected component (if applicable):
2.6.32-174.el6.x86_64
rgmanager-3.0.12.1-2.el6.x86_64

lvm2-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6    BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6    BUILT: Fri Mar 18 09:31:10 CDT 2011

Comment 1 Corey Marthaler 2011-08-04 16:04:07 UTC
Here's what the mirror volumes looked like before /dev/sdd was failed. This HA setup was using the tag method for HA.

    # If volume_list is defined, each LV is only activated if there is a
    # match against the list.
    #   "vgname" and "vgname/lvname" are matched exactly.
    #   "@tag" matches any tag set in the LV or VG.
    #   "@*" matches if any tag defined on the host is also set in the LV or VG
    #
    # volume_list = [ "vg1", "vg2/lvol1", "@tag1", "@*" ]
    volume_list = [ "vg_taft03", "@taft-03" ]


[root@taft-03 ~]# vgs -o +vg_tags
  VG        #PV #LV #SN Attr   VSize   VFree   VG Tags
  TAFT       21   4   0 wz-pn- 949.57g 901.55g taft-03

[root@taft-03 ~]# lvs -a -o +devices
  LV             VG    Attr   LSize  Log      Copy%  Devices
  ha1            TAFT  mwi-ao  8.00g ha1_mlog 100.00 ha1_mimage_0(0),ha1_mimage_1(0)
  [ha1_mimage_0] TAFT  iwi-ao  8.00g                 /dev/sdb1(0)
  [ha1_mimage_1] TAFT  iwi-ao  8.00g                 /dev/sdc1(0)
  [ha1_mlog]     TAFT  lwi-ao  4.00m                 /dev/sdd1(0)
  ha2            TAFT  mwi-ao  8.00g ha2_mlog 100.00 ha2_mimage_0(0),ha2_mimage_1(0)
  [ha2_mimage_0] TAFT  iwi-ao  8.00g                 /dev/sdb2(0)
  [ha2_mimage_1] TAFT  iwi-ao  8.00g                 /dev/sdc2(0)
  [ha2_mlog]     TAFT  lwi-ao  4.00m                 /dev/sdd2(0)
  ha3            TAFT  mwi-ao  8.00g ha3_mlog 100.00 ha3_mimage_0(0),ha3_mimage_1(0)
  [ha3_mimage_0] TAFT  iwi-ao  8.00g                 /dev/sdb3(0)
  [ha3_mimage_1] TAFT  iwi-ao  8.00g                 /dev/sdc3(0)
  [ha3_mlog]     TAFT  lwi-ao  4.00m                 /dev/sdd3(0)

Comment 2 Corey Marthaler 2011-08-04 16:09:00 UTC
Created attachment 516743 [details]
log from taft-03

Comment 3 Jonathan Earl Brassow 2011-09-22 18:32:45 UTC
There's not quite enough information to go on.  I'm hoping for a reproducer.  Was this a one time event?

I'm not sure how the lvm commands would produce metadata with checksum errors.  Perhaps another machine was involved somehow.

Since we are using the 'tag' method.  There is a higher probability of other machine interaction.

Comment 8 Corey Marthaler 2012-07-12 16:42:00 UTC
I can no longer reproduce this issue with the latest rpms. Closing...


2.6.32-279.el6.x86_64
rgmanager-3.0.12.1-12.el6.x86_64
lvm2-2.02.95-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
lvm2-libs-2.02.95-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
lvm2-cluster-2.02.95-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
udev-147-2.41.el6    BUILT: Thu Mar  1 13:01:08 CST 2012
device-mapper-1.02.74-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
device-mapper-libs-1.02.74-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
device-mapper-event-1.02.74-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
device-mapper-event-libs-1.02.74-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012
cmirror-2.02.95-10.el6    BUILT: Fri May 18 03:26:00 CDT 2012


Note You need to log in before you can comment on or make changes to this bug.