Bug 159276

Summary:      Recovering failed IO path causes problems
Product:      Red Hat Enterprise Linux 2.1
Reporter:     Björn Augustsson <oggust>
Component:    kernel
Assignee:     Jim Paradis <jparadis>
Status:       CLOSED WONTFIX
QA Contact:   Brian Brock <bbrock>
Severity:     high
Priority:     medium
Version:      2.1
CC:           peterm
Hardware:     i686
OS:           Linux
Doc Type:     Bug Fix
Last Closed:  2006-06-08 21:58:53 UTC

Description Björn Augustsson 2005-06-01 10:46:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050512 Red Hat/1.7.8-1.1.3.1

Description of problem:
[First of all: yes, this isn't the latest kernel version for 2.1, or even
 close. This is a test system for one where we can't change the versions
 easily.]

An RHEL AS 2.1 box has two QLA4010 iSCSI adapters (running qla4xxx-v3.22-2,
the latest driver version), talking to an IBM DS300 iSCSI array.

/etc/raidtab:

raiddev                 /dev/md0
raid-level              multipath
persistent-superblock   1
nr-raid-disks           2

device                  /dev/sdb1
raid-disk               0

device                  /dev/sdd1
raid-disk               1

########################################

raiddev                 /dev/md1
raid-level              multipath
persistent-superblock   1
nr-raid-disks           2

device                  /dev/sdc1
raid-disk               0

device                  /dev/sde1
raid-disk               1




dmesg(the md part of it):

md: autorun ...
md: considering sdd1 ...
md:  adding sdd1 ...
md:  adding sdb1 ...
md: created md0
md: running: <sdd1><sdb1>
md: multipath personality registered as nr 7
md0: max total readahead window set to 124k
md0: 1 data-disks, max readahead per data-disk: 124k
multipath: device sdd1 operational as IO path 0
multipath: making IO path sdb1 a spare path (not in sync)
(checking disk 0)
multipath: array md0 active with 1 out of 1 IO paths (1 spare IO paths)
md: updating md0 RAID superblock on device
md: ... autorun DONE.
md: autorun ...
md: considering sde1 ...
md:  adding sde1 ...
md:  adding sdc1 ...
md: created md1
md: running: <sde1><sdc1>
md1: max total readahead window set to 124k
md1: 1 data-disks, max readahead per data-disk: 124k
multipath: making IO path sde1 a spare path (not in sync)
multipath: device sdc1 operational as IO path 0
(checking disk 0)
multipath: array md1 active with 1 out of 1 IO paths (1 spare IO paths)
md: updating md1 RAID superblock on device
md: ... autorun DONE.
md: mount(pid 302) used obsolete MD ioctl, upgrade your software to use new ictls.
md: mount(pid 302) used obsolete MD ioctl, upgrade your software to use new ictls.


So it runs them as active-passive. Sure. We failed the active path (made
an ext3 filesystem, dd'd lots of data onto it, then pulled the plug), and
after a while it started using the other path, logging lots of
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053376
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053496
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053384
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053504
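
For clarity, the failover test that produced those errors was, in sketch form (filesystem and dd parameters here are reconstructed for illustration, not the literal commands we ran):

```shell
# Rough reconstruction of the failover test (paths and sizes illustrative):
mkfs.ext3 /dev/md1                          # ext3 on the multipath md device
mount /dev/md1 /mnt/test
dd if=/dev/zero of=/mnt/test/fill bs=1M &   # write lots of data in the background
# ...then physically disconnect the active iSCSI path (sdc1 on this box)
```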

Fine. So we re-plug the old path, and (following http://docs.hp.com/en/B9903-90012/ch08s03.html) run

raidsetfaulty -c raidtab /dev/md1 /dev/sde1

Which causes loads of 

messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 816312 to another IO path
messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 816440 to another IO path
messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 81

but after that the command hangs in the D state. SysRq-T gives:

May 27 12:18:36 curtis kernel: raidsetfaulty D CC497E30  5000  4847   2386                     (L-TLB)
May 27 12:18:36 curtis kernel: Call Trace: [__wait_on_buffer+118/160] __wait_on_buffer [kernel] 0x76 (0xcc497e44)
May 27 12:18:36 curtis kernel: Call Trace: [<c0146e66>] __wait_on_buffer [kernel] 0x76 (0xcc497e44)
May 27 12:18:36 curtis kernel: [wait_for_locked_buffers+132/176] wait_for_locked_buffers [kernel] 0x84 (0xcc497e88)
May 27 12:18:36 curtis kernel: [<c0147124>] wait_for_locked_buffers [kernel] 0x84 (0xcc497e88)
May 27 12:18:36 curtis kernel: [sync_buffers+53/64] sync_buffers [kernel] 0x35 (0xcc497eb0)
May 27 12:18:36 curtis kernel: [<c0147185>] sync_buffers [kernel] 0x35 (0xcc497eb0)
May 27 12:18:36 curtis kernel: [fsync_no_super+22/32] fsync_no_super [kernel] 0x16 (0xcc497edc)
May 27 12:18:36 curtis kernel: [<c0147266>] fsync_no_super [kernel] 0x16 (0xcc497edc)
May 27 12:18:36 curtis kernel: [blkdev_put+64/224] blkdev_put [kernel] 0x40 (0xcc497ef4)
May 27 12:18:36 curtis kernel: [<c014da70>] blkdev_put [kernel] 0x40 (0xcc497ef4)
May 27 12:18:36 curtis kernel: [__fput+43/208] __fput [kernel] 0x2b (0xcc497f0c)
May 27 12:18:36 curtis kernel: [<c0146b9b>] __fput [kernel] 0x2b (0xcc497f0c)
May 27 12:18:36 curtis kernel: [filp_close+158/176] filp_close [kernel] 0x9e (0xcc497f38)
May 27 12:18:36 curtis kernel: [<c01457ae>] filp_close [kernel] 0x9e (0xcc497f38)
May 27 12:18:36 curtis kernel: [put_files_struct+77/224] put_files_struct [kernel] 0x4d (0xcc497f5c)
May 27 12:18:36 curtis kernel: [<c011f1fd>] put_files_struct [kernel] 0x4d (0xcc497f5c)
May 27 12:18:36 curtis kernel: [do_exit+311/624] do_exit [kernel] 0x137 (0xcc497f78)
May 27 12:18:36 curtis kernel: [<c011fa47>] do_exit [kernel] 0x137 (0xcc497f78)
May 27 12:18:36 curtis kernel: [blkdev_ioctl+38/64] blkdev_ioctl [kernel] 0x26 (0xcc497f80)
May 27 12:18:36 curtis kernel: [<c014db56>] blkdev_ioctl [kernel] 0x26 (0xcc497f80)
May 27 12:18:36 curtis kernel: [sys_ioctl+599/672] sys_ioctl [kernel] 0x257 (0xcc497f94)
May 27 12:18:36 curtis kernel: [<c01558a7>] sys_ioctl [kernel] 0x257 (0xcc497f94)
May 27 12:18:36 curtis kernel: [sys_ioctl+659/672] sys_ioctl [kernel] 0x293 (0xcc497fa4)
May 27 12:18:36 curtis kernel: [<c01558e3>] sys_ioctl [kernel] 0x293 (0xcc497fa4)
May 27 12:18:36 curtis kernel: [system_call+51/56] system_call [kernel] 0x33 (0xcc497fc0)
May 27 12:18:36 curtis kernel: [<c01073e3>] system_call [kernel] 0x33 (0xcc497fc0)
May 27 12:18:36 curtis kernel:

(That's for the raidsetfaulty command; I have the rest of the dump if 
 you want it.)

All kinds of stuff stops working after that. We rebooted, and the system
came up fine with both paths.
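
(For reference, the recovery procedure we were attempting is, as we read that HP document, the usual raidtools sequence. Device names are ours; this is the intended sequence, not a verified fix, since the first step is where it hung:)

```shell
# raidtools path-recovery sequence (as we understand the HP doc; untested here):
raidsetfaulty -c raidtab /dev/md1 /dev/sde1   # mark the stale path faulty
raidhotremove /dev/md1 /dev/sde1              # detach it from the array
raidhotadd /dev/md1 /dev/sde1                 # re-add it; md takes it back as a spare path
```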

/August.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.9-e.35

How reproducible:
Didn't try

Steps to Reproduce:
1. See above.

Additional info:

Comment 1 Jim Paradis 2006-06-08 21:58:53 UTC
RHEL2.1 is currently accepting only critical security fixes.  This issue is
outside the current scope of support.