Bug 473454 - device-mapper multipath: kmpathd oops in process_queued_ios
Summary: device-mapper multipath: kmpathd oops in process_queued_ios
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.7
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Assignee: LVM and device-mapper development team
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-11-28 15:58 UTC by Bryn M. Reeves
Modified: 2018-10-27 14:32 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-01-20 20:11:46 UTC
Target Upstream Version:
Embargoed:


Attachments
disassembly of dm-multipath.ko (246.96 KB, text/plain)
2008-11-28 17:32 UTC, Bryn M. Reeves

Description Bryn M. Reeves 2008-11-28 15:58:29 UTC
Description of problem:
<ffffffffa01f79e8>{:dm_multipath:process_queued_ios+148}
PML4 c6a5b3067 PGD 0
Oops: 0000 [1] SMP
CPU 7
Modules linked in: vfat fat cpqci(U) mptctl mptbase ipmi_devintf ipmi_si ipmi_msghandler nfsd exportfs lockd nfs_acl parport_pc lp parport netconsole autofs4 i2c_dev i2c_core sg sunrpc qioctlmod ide_dump cciss_dump scsi_dump diskdump zlib_deflate ext3 jbd button battery ac joydev ohci_hcd ehci_hcd uhci_hcd k8_edac edac_mc md5 ipv6 bonding(U) dm_round_robin dm_emc dm_multipath lock_dlm(U) dlm(U) gfs(U) lock_harness(U) cman(U) qla6312 qla2400 qla2300 qla2xxx scsi_transport_fc cciss sd_mod scsi_mod dm_snapshot dm_mirror dm_mod e1000
Pid: 2624, comm: kmpathd/7 Tainted: P      2.6.9-55.0.2.ELsmp
RIP: 0010:[<ffffffffa01f79e8>] <ffffffffa01f79e8>{:dm_multipath:process_queued_ios+148}
RSP: 0018:0000010c6eddfe38  EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000246
RDX: 0000000000000246 RSI: 0000000000000246 RDI: 000001106f606218
RBP: 000001106f606200 R08: 000001046beedc08 R09: 6db6db6db6db6db7
R10: 0000000300000000 R11: 000000006f606280 R12: 000001106f606218
R13: 0000000000000246 R14: 000001106f606220 R15: 0000000000000001
FS:  0000002a958a1b00(0000) GS:ffffffff804edf00(0000) knlGS:00000000f7fe36c0
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000007fe0e000 CR4: 00000000000006e0
Process kmpathd/7 (pid: 2624, threadinfo 0000010c6edde000, task 000001106f7b7030)
Stack: 
000000016f606280 000001106f606280 000001106f606288 000001046beedbc0
0000000000000246 000001106f606200 ffffffffa01f7954 ffffffff80147c42 
000001000800aa80 ffffffffffffffff
Call Trace:
<ffffffffa01f7954>{:dm_multipath:process_queued_ios+0}
<ffffffff80147c42>{worker_thread+419} 
<ffffffff801341cc>{default_wake_function+0}
<ffffffff8013421d>{__wake_up_common+67} 
<ffffffff801341cc>{default_wake_function+0}
<ffffffff8014b990>{keventd_create_kthread+0} 
<ffffffff80147a9f>{worker_thread+0}
<ffffffff8014b990>{keventd_create_kthread+0}
<ffffffff8014b967>{kthread+200}
<ffffffff80110f47>{child_rip+8}
<ffffffff8014b990>{keventd_create_kthread+0}
<ffffffff8014b89f>{kthread+0} <ffffffff80110f3f>{child_rip+0}

Code: 48 8b 43 10 49 8b 0e 48 8d 53 20 4c 89 f7 8b 70 2c ff 51 20
RIP <ffffffffa01f79e8>{:dm_multipath:process_queued_ios+148} RSP <0000010c6eddfe38>
CR2: 0000000000000010
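
For reference, decoding the Code: bytes (assuming the dump starts at the faulting RIP) gives:

48 8b 43 10    mov    0x10(%rbx),%rax    <- faulting load: %rbx is NULL, matching CR2 = 0x10
49 8b 0e       mov    (%r14),%rcx
48 8d 53 20    lea    0x20(%rbx),%rdx
4c 89 f7       mov    %r14,%rdi
8b 70 2c       mov    0x2c(%rax),%esi
ff 51 20       callq  *0x20(%rcx)

So an indirect call is being set up from two structures, and the one held in %rbx is NULL at the time of the fault.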

The above oops occurred during a period of very frequent I/O errors caused by an internal LCC (Link Control Card) failure on the attached Clariion storage. The failure makes the LUNs unavailable via the preferred (primary) storage processor, forcing trespasses to the secondary controller.

The EMC hardware handler appeared to be trying to reinstate paths via the primary processor, leading to repeated trespasses and failovers (the primary processor cannot access these LUNs at all until the LCC has been replaced).
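
Reading the registers against the module disassembly, this looks like the hardware handler pg_init call in process_queued_ios being reached with a NULL current path: %rbx would be the pgpath pointer (NULL), %r14 the hardware handler, and the faulting load at 0x10(%rbx) the owning priority group. For illustration only, a simplified sketch of that kind of logic (field and function names are approximate; this is not the verbatim RHEL 4 source):

/*
 * Sketch of the suspected window: a path is chosen under the lock, but the
 * hardware handler init is dispatched after the lock is dropped.  If every
 * path has just failed, pg_init_required can still be set while pgpath is
 * NULL, and the call below dereferences it.
 */
static void process_queued_ios_sketch(struct multipath *m)
{
        struct hw_handler *hwh = &m->hw_handler;
        struct pgpath *pgpath;
        unsigned init_required;
        unsigned long flags;

        spin_lock_irqsave(&m->lock, flags);

        if (!m->current_pgpath)
                __choose_pgpath(m);          /* may find no usable path */

        pgpath = m->current_pgpath;          /* can be NULL here        */

        init_required = m->pg_init_required;
        if (init_required)
                m->pg_init_required = 0;

        spin_unlock_irqrestore(&m->lock, flags);

        if (init_required)
                /* NULL pgpath faults at a small offset, as in the oops */
                hwh->type->pg_init(hwh, pgpath->pg->bypassed, &pgpath->path);
}

With all paths to the preferred SP failing and being retried constantly during the LCC failure, a window like this would be much easier to hit than in normal operation.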

Version-Release number of selected component (if applicable):
2.6.9-55.0.2.ELsmp

How reproducible:
Unclear

Steps to Reproduce:
1. Requires a Clariion with a broken LCC on SP-A
2. Configure some LUNs that are owned by SP-A
3. Break the LCC on SP-A that serves these LUNs
4. Watch dm-multipath try to reinstate paths via SP-A
  
Actual results:
Paths flap: the storage cannot service requests via SP-A at this point, but dm-multipath keeps attempting to fail back to that SP. The oops above eventually occurred while this was going on.

Expected results:
It is probably possible to handle this situation more gracefully, but at a minimum kmpathd should not panic here.

Additional info:
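A possible workaround (not verified against this configuration) is to disable automatic failback for the affected LUNs, so that multipathd stops moving I/O back to the failed SP while the LCC is bad. A minimal multipath.conf sketch; the "DGC" vendor string is the usual Clariion value, and the rest of the stock Clariion device settings (hardware handler, path checker, priority callout) would need to be carried over unchanged:

devices {
        device {
                vendor   "DGC"
                product  ".*"
                # only the failback policy is changed from the defaults
                failback manual
        }
}

This would only reduce the path flapping; it does not address the oops itself.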

Comment 3 Bryn M. Reeves 2008-11-28 17:32:45 UTC
Created attachment 325030 [details]
disassembly of dm-multipath.ko

Output of objdump -d --line-numbers on dm-multipath.ko.

Comment 7 Jerry Levy 2009-05-14 13:52:58 UTC
This will probably not be reproducible on CX4 with R26 or later FLARE code, as a lower-level redirector prevents this issue from occurring.

Comment 9 Tom Coughlan 2010-01-20 20:11:46 UTC
This problem has received low priority because it has only been seen when there is a hardware failure (a bad Clariion Link Control Card), while running with older firmware. Comment 7 indicates that the path flipping behavior that caused the crash will not happen on CX4 with R26 or later FLARE code. 

Although it is possible that some other scenario could trigger this crash, we are not currently able to reproduce the problem. This prevents us from developing and thoroughly testing a fix. At this stage in the life of RHEL 4, we believe the risk associated with making a change outweighs the risk that this problem will occur. Re-open this BZ if the problem is seen again on current hw/fw.

Comment 11 Young Choi 2010-03-03 01:23:34 UTC
I hit the same oops in my environment; the only difference from this bug is that I am using a HUAWEI S2600 storage array instead of an EMC CX3.

The kernel panic I hit can be reproduced with the following steps:
1. Host is Oracle Enterprise Linux 4 Update 5, with an Emulex LPe11000-M4 FC card.
   It is connected through a switch to both controller A and controller B of the storage.
2. The storage presents several LUNs to the host, and the multipathd daemon creates dm-0, dm-1, etc.
3. Use "dd" to run I/O against dm-0 and dm-1; iostat shows the I/O going to controller A of the storage.
4. Reboot controller A of the storage; I/O fails over to the paths connected to controller B.
   Before the failover, my private pg_init is called to send a MODE SELECT command, just like the other hardware handlers.
5. Repeat the above several times; it triggers a kernel panic in roughly one or two out of ten runs.

This BZ may be re-opened for a better fix.

