Bug 199498

Summary: mirror leg failure during I/O causes I/O hang and apparent volume corruption
Product: [Retired] Red Hat Cluster Suite
Component: cmirror
Version: 4
Hardware: All
OS: Linux
Severity: high
Priority: high
Status: CLOSED CURRENTRELEASE
Reporter: Corey Marthaler <cmarthal>
Assignee: Jonathan Earl Brassow <jbrassow>
QA Contact: Cluster QE <mspqa-list>
CC: agk, cfeist, dwysocha, mbroz
Doc Type: Bug Fix
Last Closed: 2008-08-05 21:32:19 UTC

Description Corey Marthaler 2006-07-19 21:20:31 UTC
Description of problem:

We created a mirror out of three partitions, 2 on one RAID unit (leg 1 and the
mlog) and 1 on another (leg 0). We then started continuous I/O to the mirror
and failed leg 0 by powering off the second RAID unit. After countless SCSI
errors (signifying the failed device) and "device-mapper: Attempting to revert
sync status of region" messages, the I/O eventually hung. The volumes appear
corrupt.
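A minimal sketch of the setup described above. The exact commands were not recorded in this report, so the sizes, option ordering, and PV placement below are assumptions inferred from the `lvs` output that follows:

```shell
# Hypothetical reproduction sketch -- device names and sizes are taken from
# the lvs output in this report; the actual commands used were not recorded.
make_mirror() {
    pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1     # sdb1/sdc1 on RAID unit 1, sdd1 on unit 2
    vgcreate vg /dev/sdb1 /dev/sdc1 /dev/sdd1
    # -m1 = one mirror copy; leg 0 on sdd1, leg 1 on sdb1, mlog on sdc1
    lvcreate -m1 -L 2G -n mirror vg /dev/sdd1 /dev/sdb1 /dev/sdc1
}
```

Running this requires root and real block devices; the leg failure itself was induced externally, by powering off the RAID unit holding /dev/sdd1.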

Before we failed the RAID:
[root@taft-03 ~]# lvs -a -o +devices
  LV                VG   Attr   LSize Origin Snap%  Move Log         Copy%  Devices
  mirror            vg   mwi-a- 2.00G                    mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] vg   iwi-ao 2.00G                                       /dev/sdd1(0)
  [mirror_mimage_1] vg   iwi-ao 2.00G                                       /dev/sdb1(0)
  [mirror_mlog]     vg   lwi-ao 4.00M                                       /dev/sdc1(0)

After we failed the RAID:
[root@taft-04 ~]# lvs
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 2147418112: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 1999063744512: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'bSXkBN-aVrT-zIm8-KvaA-QToI-loW4-1kNJNC'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'bSXkBN-aVrT-zIm8-KvaA-QToI-loW4-1kNJNC'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'bSXkBN-aVrT-zIm8-KvaA-QToI-loW4-1kNJNC'.
  Couldn't find all physical volumes for volume group vg.
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid 'bSXkBN-aVrT-zIm8-KvaA-QToI-loW4-1kNJNC'.
  Couldn't find all physical volumes for volume group vg.
  Volume group "vg" not found


We then attempted to write once again to that mirror with a simple dd and saw
the following messages on the console:

Jul 19 11:18:39 taft-04 kernel: SCSI error : <1 0 1 0> return code = 0x10000
Jul 19 11:18:39 taft-04 kernel: end_request: I/O error, dev sdd, sector 16449
Jul 19 11:18:39 taft-04 kernel: device-mapper: Error during write occurred.
Jul 19 11:18:39 taft-04 kernel: device-mapper: incrementing error_count on 253:3
Jul 19 11:18:39 taft-04 dmeventd[4383]: Mirror device, 253:3, has failed.
Jul 19 11:18:39 taft-04 dmeventd[4383]: Device failure in vg-mirror
Jul 19 11:18:39 taft-04 kernel: device-mapper: Attempting to revert sync status
of region #0

This then causes an lvs command to hang:
[root@taft-04 ~]# lvs
  /dev/dm-3: read failed after 0 of 4096 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
[HANG]


stuck lvscan:
[...]
write(2, "\n", 1
)                       = 1
close(4)                                = 0
stat("/proc/lvm/VGs/vg", 0x7fbfffb580)  = -1 ENOENT (No such file or directory)
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
write(3, "3\1\377\277\0\0\0\0\0\0\0\0\7\0\0\0\0\1\4V_vg\0\220", 25) = 25
read(3,


[root@taft-04 ~]# dmsetup ls
vg-mirror_mimage_1      (253, 4)
vg-mirror_mimage_0      (253, 3)
vg-mirror       (253, 5)
VolGroup00-LogVol01     (253, 1)
VolGroup00-LogVol00     (253, 0)
vg-mirror_mlog  (253, 2)



Version-Release number of selected component (if applicable):
[root@taft-04 ~]# uname -ar
Linux taft-04 2.6.9-42.ELsmp #1 SMP Wed Jul 12 23:32:02 EDT 2006 x86_64 x86_64
x86_64 GNU/Linux
[root@taft-04 ~]# rpm -q lvm2
lvm2-2.02.06-6.0.RHEL4
[root@taft-04 ~]# rpm -q lvm2-cluster
lvm2-cluster-2.02.06-6.0.RHEL4
[root@taft-04 ~]# rpm -q cmirror
cmirror-1.0.1-0
[root@taft-04 ~]# rpm -q cmirror-kernel
cmirror-kernel-2.6.9-10.2
[root@taft-04 ~]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4

Comment 1 Corey Marthaler 2006-07-19 22:10:11 UTC
Reproduced this issue with I/O on a non-cmirror server.

Trying again with I/O on the cmirror server...

Comment 2 Corey Marthaler 2006-07-19 22:30:07 UTC
With I/O on the cmirror server, the device failure case worked: after about 5
minutes of SCSI and write errors, the device was finally converted to a linear
volume.

Re-attempting the case in comment #1 and will let it hang overnight to see if
we are just not waiting long enough.
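One way to check whether the conversion to linear actually completed is sketched below; the `vg-mirror` and `vg` names come from the dmsetup listing earlier in this report, and this assumes the repair path described in comment #2:

```shell
# Sketch: inspect the device after dmeventd repairs the failed mirror.
check_conversion() {
    dmsetup status vg-mirror     # the target line should now read "linear"
    lvs -a -o +devices vg        # the failed leg should be gone from Devices
}
```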

Comment 3 Corey Marthaler 2006-07-20 14:47:40 UTC
Attempted the case in comment #1 again (I/O on a non-cmirror server and then
fail one of the legs) and this time the volumes didn't get corrupted; however,
it didn't appear to get properly converted to a linear volume either.

After the failed leg: 
[root@taft-02 ~]# lvscan
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  ACTIVE            '/dev/vg/mirror' [1.00 GB] inherit
[root@taft-02 ~]# lvs -a -o +devices
  /dev/sdd1: read failed after 0 of 2048 at 0: Input/output error
  LV                VG   Attr   LSize Origin Snap%  Move Log         Copy%  Devices
  mirror            vg   mwi-s- 1.00G                    mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0)
  [mirror_mimage_0] vg   iwi-so 1.00G
  [mirror_mimage_1] vg   iwi-so 1.00G                                       /dev/sdb1(0)
  [mirror_mlog]     vg   lwi-so 4.00M                                       /dev/sdc1(0)

A couple of minutes after the leg failure, the I/O ended up hanging (as did
clvmd). I let it hang overnight; it never came back.

Comment 4 Corey Marthaler 2006-07-20 22:18:43 UTC
Hit this issue while attempting the same leg failure but with I/O from all nodes
in the cluster to the mirror. CLVMD ended up hung as well as the I/O.

Comment 7 Jonathan Earl Brassow 2006-10-17 14:40:35 UTC
clvmd should no longer hang, and no volume corruption should occur.

As far as the I/O hanging goes, that could be bug #199724, and could be a
result of mirror reconfiguration taking too long.

Comment 8 Corey Marthaler 2007-03-19 22:03:05 UTC
Marking this verified; there are other leg-failure bugs open for more specific
cases.

Comment 9 Chris Feist 2008-08-05 21:32:19 UTC
Closing as this has been fixed in the current (4.7) release.