Bug 159276 - Recovering failed IO path causes problems.
Summary: Recovering failed IO path causes problems.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jim Paradis
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-06-01 10:46 UTC by Björn Augustsson
Modified: 2013-08-06 01:14 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-06-08 21:58:53 UTC
Target Upstream Version:
Embargoed:



Description Björn Augustsson 2005-06-01 10:46:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050512 Red Hat/1.7.8-1.1.3.1

Description of problem:
[ First of all, yes, this isn't the latest kernel version for 2.1, or even
 close. This is a test system standing in for one where we can't change
 versions easily.]

A RHEL AS 2.1 box has two QLA4010 iSCSI adapters (running qla4xxx-v3.22-2,
the latest driver version), talking to an IBM DS300 iSCSI array.

/etc/raidtab:

raiddev                 /dev/md0
raid-level              multipath
persistent-superblock   1
nr-raid-disks           2

device                  /dev/sdb1
raid-disk               0

device                  /dev/sdd1
raid-disk               1

########################################

raiddev                 /dev/md1
raid-level              multipath
persistent-superblock   1
nr-raid-disks           2

device                  /dev/sdc1
raid-disk               0

device                  /dev/sde1
raid-disk               1
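
[For context, not from the original report: with the raidtools package,
 arrays like these are typically created and started as sketched below.
 mkraid and raidstart read /etc/raidtab by default; -c selects an
 alternate config file. Exact options may vary by raidtools version.]

# create the multipath superblocks (once)
mkraid -c /etc/raidtab /dev/md0
mkraid -c /etc/raidtab /dev/md1

# start the arrays (md autorun also assembles them at boot, as in the
# dmesg below)
raidstart -c /etc/raidtab /dev/md0
raidstart -c /etc/raidtab /dev/md1

# confirm that both paths got attached
cat /proc/mdstat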




dmesg (the md part of it):

md: autorun ...
md: considering sdd1 ...
md:  adding sdd1 ...
md:  adding sdb1 ...
md: created md0
md: running: <sdd1><sdb1>
md: multipath personality registered as nr 7
md0: max total readahead window set to 124k
md0: 1 data-disks, max readahead per data-disk: 124k
multipath: device sdd1 operational as IO path 0
multipath: making IO path sdb1 a spare path (not in sync)
(checking disk 0)
multipath: array md0 active with 1 out of 1 IO paths (1 spare IO paths)
md: updating md0 RAID superblock on device
md: ... autorun DONE.
md: autorun ...
md: considering sde1 ...
md:  adding sde1 ...
md:  adding sdc1 ...
md: created md1
md: running: <sde1><sdc1>
md1: max total readahead window set to 124k
md1: 1 data-disks, max readahead per data-disk: 124k
multipath: making IO path sde1 a spare path (not in sync)
multipath: device sdc1 operational as IO path 0
(checking disk 0)
multipath: array md1 active with 1 out of 1 IO paths (1 spare IO paths)
md: updating md1 RAID superblock on device
md: ... autorun DONE.
md: mount(pid 302) used obsolete MD ioctl, upgrade your software to use new ictls.
md: mount(pid 302) used obsolete MD ioctl, upgrade your software to use new ictls.


So it runs them as active-passive. Sure. We failed the active path (made an
ext3 filesystem, dd'd lots of data, pulled the plug), and after a while it
started using the other one, logging lots of:

messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053376
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053496
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053384
messages.1:May 27 10:01:17 curtis kernel:  I/O error: dev 08:41, sector 1053504
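
[Aside, not from the original report: the failover state also shows up in
 /proc/mdstat, where the 2.4 md driver marks a failed path with (F).
 Illustrative output only; the block count is made up:]

# cat /proc/mdstat
Personalities : [multipath]
md0 : active multipath sdb1[1] sdd1[0](F)
      17920384 blocks [1/1] [U]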

Fine. So we re-plugged the old path and ran (following
http://docs.hp.com/en/B9903-90012/ch08s03.html):

raidsetfaulty -c raidtab /dev/md1 /dev/sde1

Which caused loads of:

messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 816312 to another IO path
messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 816440 to another IO path
messages.1:May 27 10:02:47 curtis kernel: multipath: sdc1: redirecting sector 81

but after that the command hangs in the D state. SysRq-T gives:

May 27 12:18:36 curtis kernel: raidsetfaulty D CC497E30  5000  4847   2386                     (L-TLB)
May 27 12:18:36 curtis kernel: Call Trace: [__wait_on_buffer+118/160] __wait_on_buffer [kernel] 0x76 (0xcc497e44)
May 27 12:18:36 curtis kernel: Call Trace: [<c0146e66>] __wait_on_buffer [kernel] 0x76 (0xcc497e44)
May 27 12:18:36 curtis kernel: [wait_for_locked_buffers+132/176] wait_for_locked_buffers [kernel] 0x84 (0xcc497e88)
May 27 12:18:36 curtis kernel: [<c0147124>] wait_for_locked_buffers [kernel] 0x84 (0xcc497e88)
May 27 12:18:36 curtis kernel: [sync_buffers+53/64] sync_buffers [kernel] 0x35 (0xcc497eb0)
May 27 12:18:36 curtis kernel: [<c0147185>] sync_buffers [kernel] 0x35 (0xcc497eb0)
May 27 12:18:36 curtis kernel: [fsync_no_super+22/32] fsync_no_super [kernel] 0x16 (0xcc497edc)
May 27 12:18:36 curtis kernel: [<c0147266>] fsync_no_super [kernel] 0x16 (0xcc497edc)
May 27 12:18:36 curtis kernel: [blkdev_put+64/224] blkdev_put [kernel] 0x40 (0xcc497ef4)
May 27 12:18:36 curtis kernel: [<c014da70>] blkdev_put [kernel] 0x40 (0xcc497ef4)
May 27 12:18:36 curtis kernel: [__fput+43/208] __fput [kernel] 0x2b (0xcc497f0c)
May 27 12:18:36 curtis kernel: [<c0146b9b>] __fput [kernel] 0x2b (0xcc497f0c)
May 27 12:18:36 curtis kernel: [filp_close+158/176] filp_close [kernel] 0x9e (0xcc497f38)
May 27 12:18:36 curtis kernel: [<c01457ae>] filp_close [kernel] 0x9e (0xcc497f38)
May 27 12:18:36 curtis kernel: [put_files_struct+77/224] put_files_struct [kernel] 0x4d (0xcc497f5c)
May 27 12:18:36 curtis kernel: [<c011f1fd>] put_files_struct [kernel] 0x4d (0xcc497f5c)
May 27 12:18:36 curtis kernel: [do_exit+311/624] do_exit [kernel] 0x137 (0xcc497f78)
May 27 12:18:36 curtis kernel: [<c011fa47>] do_exit [kernel] 0x137 (0xcc497f78)
May 27 12:18:36 curtis kernel: [blkdev_ioctl+38/64] blkdev_ioctl [kernel] 0x26 (0xcc497f80)
May 27 12:18:36 curtis kernel: [<c014db56>] blkdev_ioctl [kernel] 0x26 (0xcc497f80)
May 27 12:18:36 curtis kernel: [sys_ioctl+599/672] sys_ioctl [kernel] 0x257 (0xcc497f94)
May 27 12:18:36 curtis kernel: [<c01558a7>] sys_ioctl [kernel] 0x257 (0xcc497f94)
May 27 12:18:36 curtis kernel: [sys_ioctl+659/672] sys_ioctl [kernel] 0x293 (0xcc497fa4)
May 27 12:18:36 curtis kernel: [<c01558e3>] sys_ioctl [kernel] 0x293 (0xcc497fa4)
May 27 12:18:36 curtis kernel: [system_call+51/56] system_call [kernel] 0x33 (0xcc497fc0)
May 27 12:18:36 curtis kernel: [<c01073e3>] system_call [kernel] 0x33 (0xcc497fc0)
May 27 12:18:36 curtis kernel:

(That's for the raidsetfaulty command; I have the rest of the dump if 
 you want it.)

All kinds of stuff stopped working after that. We rebooted, and the system
came up fine with both paths.
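
[For reference, not from the original report: the raidtools path-recovery
 sequence (presumably what the HP document above describes) is roughly the
 three commands below; the first is the step that hung here. Exact flags
 may vary by raidtools version.]

# mark the recovered path faulty, so md releases its stale state
raidsetfaulty -c raidtab /dev/md1 /dev/sde1
# detach it from the array ...
raidhotremove /dev/md1 /dev/sde1
# ... and re-add it, so it comes back as a fresh spare path
raidhotadd /dev/md1 /dev/sde1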

/August.

Version-Release number of selected component (if applicable):
kernel-smp-2.4.9-e.35

How reproducible:
Didn't try

Steps to Reproduce:
1. See above.

Additional info:

Comment 1 Jim Paradis 2006-06-08 21:58:53 UTC
RHEL2.1 is currently accepting only critical security fixes.  This issue is
outside the current scope of support.

