Bug 226815 - Loss of clvmd volumes due to cmirror primary leg failures
Summary: Loss of clvmd volumes due to cmirror primary leg failures
Keywords:
Status: CLOSED DUPLICATE of bug 232711
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Assignee: Jonathan Earl Brassow
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-02-01 18:52 UTC by Corey Marthaler
Modified: 2010-01-12 02:02 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-04-12 18:18:35 UTC
Embargoed:


Description Corey Marthaler 2007-02-01 18:52:30 UTC
Description of problem:
On the two-node x86_64 (kool/salem) cluster, I had three 5G cmirrors, each with
GFS on top and each running I/O. None of them were fully synced yet.
 
[root@salem ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       72G  1.7G   66G   3% /
/dev/sda1              99M   19M   76M  20% /boot
none                 1004M     0 1004M   0% /dev/shm
/dev/mapper/vg-mirror1
                      4.8G   20K  4.8G   1% /mnt/gfs1
/dev/mapper/vg-mirror2
                      4.8G   20K  4.8G   1% /mnt/gfs2
/dev/mapper/vg-mirror3
                      4.8G   20K  4.8G   1% /mnt/gfs3


[root@salem ~]# lvs -a -o +devices
  LV                 VG   Attr   LSize Origin Snap%  Move Log          Copy%  Devices
  mirror1            vg   mwi-ao 5.00G                    mirror1_mlog  20.08 mirror1_mimage_0(0),mirror1_mimage_1(0)
  [mirror1_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(0)
  [mirror1_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(0)
  [mirror1_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(0)
  mirror2            vg   mwi-ao 5.00G                    mirror2_mlog  16.72 mirror2_mimage_0(0),mirror2_mimage_1(0)
  [mirror2_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(1280)
  [mirror2_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(1280)
  [mirror2_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(1)
  mirror3            vg   mwi-ao 5.00G                    mirror3_mlog  24.45 mirror3_mimage_0(0),mirror3_mimage_1(0)
  [mirror3_mimage_0] vg   iwi-ao 5.00G                                        /dev/sdc(2560)
  [mirror3_mimage_1] vg   iwi-ao 5.00G                                        /dev/sde(2560)
  [mirror3_mlog]     vg   lwi-ao 4.00M                                        /dev/sdd(2)


I then attempted to create two more mirrors, and while the 5th one was being
created I failed /dev/sdb and /dev/sdc (which was the primary leg for the
original 3 mirrors) and, I'm sure, a part of the two new cmirrors. At that point
the I/O stopped and I waited for the mirrors to down convert. According to the
log messages, these conversions did appear to take place; however, none of the
I/O restarted, and now all the filesystems are inaccessible.

[root@salem ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       72G  1.7G   66G   3% /
/dev/sda1              99M   19M   76M  20% /boot
none                 1004M     0 1004M   0% /dev/shm
df: `/mnt/gfs1': Input/output error
df: `/mnt/gfs2': Input/output error
/dev/mapper/vg-mirror3
                      4.8G  228K  4.8G   1% /mnt/gfs3
[root@salem ~]# ls -lrt /mnt/gfs3
[HANG]

Also, all my lvs commands hang as well. 
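
When commands such as df or lvs hang like this, it can help to probe them with a deadline so the shell doing the probing does not get stuck too. The following is only a minimal sketch; the helper name probe and its timeout values are made up for illustration and are not part of LVM or cmirror.

```python
import subprocess
import sys
import time


def probe(cmd, timeout_s=5.0, poll_s=0.1):
    """Run cmd and return its exit code, or None if it has not finished
    within timeout_s seconds (i.e. it looks hung)."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()  # non-blocking check, never waits on the child
        if rc is not None:
            return rc
        time.sleep(poll_s)
    # Best effort only: a process blocked in uninterruptible I/O will not
    # exit until that I/O completes, so we deliberately do not wait for it.
    proc.kill()
    return None


# Harmless demonstration command; on the affected node the command of
# interest would be something like ["df", "-h", "/mnt/gfs3"].
print(probe([sys.executable, "-c", "pass"]))
```

A return value of None only says the command did not finish in time; it does not say why. The loop polls instead of waiting because a blocking wait on a child stuck in the kernel would hang the probe as well.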


Here's what's in the logs:
[...]
Feb  1 12:24:03 kool lvm[4270]: WARNING: Bad device removed from mirror volume, vg/mirror1
Feb  1 12:24:03 kool lvm[4270]: WARNING: Mirror volume, vg/mirror1 converted to linear due to device failure.
[...]
Feb  1 12:24:07 kool lvm[4270]: WARNING: Bad device removed from mirror volume, vg/mirror2
Feb  1 12:24:08 kool lvm[4270]: WARNING: Mirror volume, vg/mirror2 converted to linear due to device failure.
[...]



[...]
Feb  1 12:23:41 salem lvm[4320]: No longer monitoring mirror device vg-mirror1 for events
Feb  1 12:23:45 salem lvm[4320]: No longer monitoring mirror device vg-mirror2 for events
Feb  1 12:23:50 salem lvm[4320]: No longer monitoring mirror device vg-mirror3 for events
Feb  1 12:33:54 salem kernel: device-mapper: A read failure occurred on a mirror device.
Feb  1 12:33:54 salem kernel: device-mapper: Unable to retry read.
[...]

Version-Release number of selected component (if applicable):
[root@salem ~]# uname -ar
Linux salem 2.6.9-43.ELsmp #1 SMP Wed Jan 10 19:57:37 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
[root@salem ~]# rpm -qa | grep lvm2
lvm2-cluster-2.02.20-1.el4
lvm2-cluster-debuginfo-2.02.06-6.0.RHEL4
lvm2-2.02.20-1.el4
[root@salem ~]# rpm -qa | grep cmirror
cmirror-kernel-largesmp-2.6.9-18.6
cmirror-debuginfo-1.0.1-1
cmirror-kernel-debuginfo-2.6.9-18.6
cmirror-kernel-smp-2.6.9-18.7
cmirror-kernel-2.6.9-18.6
cmirror-1.0.1-1

Comment 1 Corey Marthaler 2007-02-01 22:46:07 UTC
I can hit what also appears to be this bug by pulling the FC cable from one of
the machines in the two-node cluster while there is I/O going to the cmirrors,
and then plugging it back in later. The cmirror filesystems end up deadlocked,
as do any other fs stat cmds like 'df'. This also left the sync percentages
wrong (cmirror1 drops from 100.00 to 0.00 in the output below).


Before pulling the cable:
[root@salem ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)


[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog 100.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)



Here I pulled the cable from salem:
[root@salem ~]# lvs -a -o +devices
  /dev/dm-2: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-4: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-6: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-7: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-8: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-10: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-11: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-12: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-13: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdb: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdc: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd: read failed after 0 of 4096 at 0: Input/output error
  /dev/sde: read failed after 0 of 4096 at 0: Input/output error
  No volume groups found


Saw the scsi errors on salem and this on kool:
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
dm-cmirror: Error while listening for server response: -110
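
Assuming the -110 in these dm-cmirror messages is a negated Linux errno, as kernel code conventionally returns, it decodes to ETIMEDOUT, which would fit kool timing out while waiting for a response from the log server. A small sketch of the lookup follows; the table is hard-coded with Linux values so the result does not depend on the host it runs on, and the helper name decode_kernel_rc is made up for illustration.

```python
# Linux errno values for the two errors seen in this report.
# Only these two are listed; anything else is reported as unknown.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
}


def decode_kernel_rc(rc):
    """Translate a negative kernel return code into its errno name."""
    name, text = LINUX_ERRNO.get(abs(rc), ("UNKNOWN", "not in this table"))
    return "%d: %s (%s)" % (rc, name, text)


print(decode_kernel_rc(-110))
```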


Then I plugged the cable back in:
[root@salem ~]# lvs -a -o +devices
  /dev/dm-5: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-9: read failed after 0 of 4096 at 0: Input/output error
  /dev/dm-13: read failed after 0 of 4096 at 0: Input/output error
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)

[root@kool ~]# lvs -a -o +devices
  LV                  VG    Attr   LSize Origin Snap%  Move Log           Copy%  Devices
  cmirror1            corey Mwi-ao 5.00G                    cmirror1_mlog   0.00 cmirror1_mimage_0(0),cmirror1_mimage_1(0)
  [cmirror1_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(0)
  [cmirror1_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(0)
  [cmirror1_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(0)
  cmirror2            corey Mwi-ao 5.00G                    cmirror2_mlog 100.00 cmirror2_mimage_0(0),cmirror2_mimage_1(0)
  [cmirror2_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(1280)
  [cmirror2_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(1280)
  [cmirror2_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(1)
  cmirror3            corey Mwi-ao 5.00G                    cmirror3_mlog 100.00 cmirror3_mimage_0(0),cmirror3_mimage_1(0)
  [cmirror3_mimage_0] corey iwi-ao 5.00G                                         /dev/sdc(2560)
  [cmirror3_mimage_1] corey iwi-ao 5.00G                                         /dev/sde(2560)
  [cmirror3_mlog]     corey lwi-ao 4.00M                                         /dev/sdd(2)


[root@kool ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                       72G  1.8G   66G   3% /
/dev/sda1              99M   23M   71M  25% /boot
none                 1004M     0 1004M   0% /dev/shm
[HANG]


Comment 2 Jonathan Earl Brassow 2007-02-20 22:39:11 UTC
I'm still trying to weed through this to find what could be addressed better...
Of course you're going to have problems when you fail your primary device
before the mirror is in sync.  That's not a bug...  I would expect file system
commands to have problems as well.
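
Since failing the primary leg before the mirror is in sync is expected to cause trouble, a test could first check Copy% and hold off until every mirror reports 100. The sketch below is one way to do that, not part of any test suite: it assumes the report text was produced by something like `lvs -a --noheadings --separator '|' -o lv_name,copy_percent vg`, and the helper name unsynced_mirrors is hypothetical. The sample values are the Copy% figures from the description.

```python
def unsynced_mirrors(lvs_report, separator="|"):
    """Return {lv_name: copy_percent} for every LV whose Copy% is below 100.

    Rows with an empty Copy% field (mirror images, mirror logs, linear
    volumes) are skipped.
    """
    pending = {}
    for line in lvs_report.splitlines():
        fields = [field.strip() for field in line.split(separator)]
        if len(fields) < 2 or not fields[1]:
            continue
        name, copy_percent = fields[0], float(fields[1])
        if copy_percent < 100.0:
            pending[name] = copy_percent
    return pending


# Assumed report format; names and percentages taken from the description.
sample = """\
  mirror1|20.08
  [mirror1_mimage_0]|
  [mirror1_mimage_1]|
  [mirror1_mlog]|
  mirror2|16.72
  mirror3|24.45
"""

print(unsynced_mirrors(sample))
```

An empty result would mean every mirror in the report is fully synced, so a primary-leg failure at that point exercises the supported case rather than the one described here.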

The important issues here are:
dm-cmirror: Error while listening for server response: -110
and:
are we properly swapping out a mirror for an error target when the primary
device goes out on a non-sync'ed mirror?


Comment 3 Jonathan Earl Brassow 2007-02-21 20:30:26 UTC
Please help me reproduce with the latest cmirror-kernel package (>= 2/21/2007).


Comment 4 Jonathan Earl Brassow 2007-02-27 17:01:33 UTC
Marking modified, as I believe this has been fixed in the process of fixing
other bugs.

Comment 5 Corey Marthaler 2007-04-12 18:18:35 UTC
It sounds like this is another case where the primary leg is being failed before
it's synced, so this is a dup of bz 232711 and 233031.

*** This bug has been marked as a duplicate of 232711 ***

