Bug 718239

Summary: be2iscsi: panic in multipath configuration on Clariion CX4-480
Product: Red Hat Enterprise Linux 6 Reporter: Martin Wilck <martin.wilck>
Component: kernelAssignee: Rob Evers <revers>
Status: CLOSED NOTABUG QA Contact: Storage QE <storage-qe>
Severity: high Docs Contact:
Priority: high    
Version: 6.1CC: czhang, fge, gasmith, lmcilroy, ltroan, mchristi, michael.hagmann
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-08-07 09:22:27 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 696653, 720397, 782183, 840683    

Description Martin Wilck 2011-07-01 14:39:49 UTC
Description of problem:
Installation on Fujitsu PRIMERGY BX960S1 over iSCSI boot + multipath using be2iscsi (Emulex OCe10102 CNA) and multipath. Panic during first boot after installation, when the EMC multipath handler (dm_emc) is loaded. Before the panic, there are lots of errors when the kernel tries to access the backup path of the Clariion (I think that's normal behavior).

Version-Release number of selected component (if applicable):
2.6.32-131.0.15.el6

How reproducible:
always

Steps to Reproduce:
1. Install RHEL6.1 on a iSCSI volume accessed through be2iscsi with multipath.
  
Actual results:
Panic at first reboot

Expected results:
No panic

Additional info:

All looks OK at first, until the EMC driver jumps in.

iscsi: registered transport (be2iscsi)
be2iscsi 0000:10:00.2: PCI INT C -> GSI 35 (level, low) -> IRQ 35
scsi0 : ServerEngines 10Gbe open-iscsi Initiator Driver
eth0: Link up
(beiscsi_async_link_state_process():243):Link UP on Physical Port 0
be2iscsi 0000:10:00.3: PCI INT D -> GSI 36 (level, low) -> IRQ 36
scsi1 : ServerEngines 10Gbe open-iscsi Initiator Driver
(beiscsi_async_link_state_process():243):Link UP on Physical Port 1
Logging into iqn.1992-04.com.emc:cx.fnm00082400064.a10 172.17.140.133:3260,1^M
iscsistart: Connection1:0 to [target: iqn.1992-04.com.emc:cx.fnm00082400064.a10, portal: 172.17.140.133,3260] through [iface: be2iscsi.00:00:c9:c0:dc:ef] is operational now^M
scsi 0:0:0:0: Direct-Access     DGC      RAID 5           0430 PQ: 0 ANSI: 4
iscsistart: Logging into iqn.1992-04.com.emc:cx.fnm00082400064.b11 172.17.141.133:3260,1^M
scsi 1:0:0:0: Direct-Access     DGC      RAID 5           0430 PQ: 0 ANSI: 4
iscsistart: Connection2:0 to [target: iqn.1992-04.com.emc:cx.fnm00082400064.b11, portal: 172.17.141.133,3260] through [iface: be2iscsi.00:00:c9:c0:dc:f1] is operational now^M
...
sd 0:0:0:0: [sda] 41943040 512-byte logical blocks: (21.4 GB/20.0 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sdb] 41943040 512-byte logical blocks: (21.4 GB/20.0 GiB)
sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2
sd 1:0:0:0: [sdb] Test WP failed, assume Write Enabled
sd 1:0:0:0: [sdb] Asking for cache data failed
sd 1:0:0:0: [sdb] Assuming drive cache: write through
sd 0:0:0:0: [sda] Attached SCSI disk
sd 1:0:0:0: [sdb] Test WP failed, assume Write Enabled
sd 1:0:0:0: [sdb] Asking for cache data failed
sd 1:0:0:0: [sdb] Assuming drive cache: write through
 sdb:
sd 1:0:0:0: [sdb] Device not ready
sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 1:0:0:0: [sdb] Sense Key : Not Ready [current]
sd 1:0:0:0: [sdb] Add. Sense: Logical unit not ready, manual intervention required
sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
end_request: I/O error, dev sdb, sector 0

...
[ I think this is normal, the Clariion backup path is not responding, no way to handle this until multipath is up ]
...


sd 0:0:0:0: emc: connected to SP A Port 10 (owned, default SP A)
sd 1:0:0:0: emc: detected Clariion CX4-480, flags 0
sd 1:0:0:0: emc: connected to SP B Port 11 (bound, default SP A)
emc: device handler registered
device-mapper: multipath round-robin: version 1.0.0 loaded
sd 0:0:0:0: emc: at SP A Port 10 (owned, default SP A)
sd 0:0:0:0: emc: at SP A Port 10 (owned, default SP A)
BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
IP: [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi]
PGD 4639bd067 PUD 4638ff067 PMD 0
Oops: 0002 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-1/dev
CPU 0
Modules linked in: dm_round_robin scsi_dh_emc sd_mod crc_t10dif ahci ixgbe dca mdio crc32c_intel be2iscsi be2net dm_multipath dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs [last unloaded: scsi_wait_scan]

Modules linked in: dm_round_robin scsi_dh_emc sd_mod crc_t10dif ahci ixgbe dca mdio crc32c_intel be2iscsi be2net dm_multipath dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 PRIMERGY BX960 S1
RIP: 0010:[<ffffffffa008e5aa>]  [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi]
RSP: 0018:ffff880036603dd0  EFLAGS: 00010206
RAX: 0000000081000000 RBX: ffff88046694e080 RCX: ffff88046a638558
RDX: 0000000000810000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff880036603e50 R08: ffff88046a6384c0 R09: ffff88046778ba48
R10: ffff88046aafb300 R11: 0000000000000001 R12: ffff88046730cca0
R13: ffff880469d30688 R14: 0000000000000000 R15: 0000000000000202
FS:  0000000000000000(0000) GS:ffff880036600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000000000e0 CR3: 00000004639b8000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a2d020)
Stack:
 ffff88046573cc18 ffff880400000005 ffff88046dad0000 ffff88046a6384c0
<0> fe05000181268d35 ffff880469d308c4 ffff88046573cc18 0000000000000000
<0> ffff880036610f88 0000000000010f40 0000000000000000 000000000000000a
Call Trace:
 <IRQ>
 [<ffffffffa008e764>] be_iopoll+0x24/0x70 [be2iscsi]
 [<ffffffff8124fd96>] blk_iopoll_softirq+0x96/0x110
 [<ffffffffa008deb1>] ? be_isr_msix+0x1a1/0x260 [be2iscsi]
 [<ffffffff8106f717>] __do_softirq+0xb7/0x1e0
 [<ffffffff810d6940>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30
 [<ffffffff8100df05>] do_softirq+0x65/0xa0
 [<ffffffff8106f505>] irq_exit+0x85/0x90
 [<ffffffff814e3505>] do_IRQ+0x75/0xf0
 [<ffffffff8100bad3>] ret_from_intr+0x0/0x11
 <EOI>
 [<ffffffff812bb86e>] ? intel_idle+0xde/0x170
 [<ffffffff812bb851>] ? intel_idle+0xc1/0x170
 [<ffffffff813eccb7>] cpuidle_idle_call+0xa7/0x140
 [<ffffffff81009e96>] cpu_idle+0xb6/0x110
 [<ffffffff814c376a>] rest_init+0x7a/0x80
 [<ffffffff81bbdf28>] start_kernel+0x41d/0x429
 [<ffffffff81bbd33a>] x86_64_start_reservations+0x125/0x129
 [<ffffffff81bbd438>] x86_64_start_kernel+0xfa/0x109
Code: 84 dc fe ff ff 49 8b 88 90 00 00 00 8b 43 08 4d 8b 0f 44 8b 7b 04 4c 8b 51 30 89 45 a4 8b 03 49 8b 78 50 89 c2 c1 ea 08 0f b6 f2 <89> b7 e0 00 00 00 89 c6 c1 ee 10 40 84 f6 75 5e c1 e8 18 83 c8
RIP  [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi]
 RSP <ffff880036603dd0>
CR2: 00000000000000e0
---[ end trace 4a76a2e6379c3ab8 ]---

Comment 2 Martin Wilck 2011-07-01 16:00:31 UTC
I think this is the following code:

1133 static void
1134 be_complete_io(struct beiscsi_conn *beiscsi_conn,
1135            struct iscsi_task *task, struct sol_cqe *psol)
1136 {
...
1156     status = ((psol->dw[offsetof(struct amap_sol_cqe, i_sts) / 32]
1157                         & SOL_STS_MASK) >> 8);
1158     flags = ((psol->dw[offsetof(struct amap_sol_cqe, i_flags) / 32]
1159                     & SOL_FLAGS_MASK) >> 24) | 0x80;
1160
1161     task->sc->result = (DID_OK << 16) | status;    <==== Crash

Looks as if be_complete_io() is calles with a task with sc == NULL. It is apparently possible that task->sc can be NULL, at least there are some places in the code where this condition is checked.

Maybe the dh_emc driver creates a special command that would cause this situation to occur?

Comment 4 Larry Troan 2011-10-04 23:54:29 UTC
Setting FJ6.2bugs tracker (was FJ6.1Bugs tracker) and requested an exception until we understand the importance to Fujitsu to fix in 6.2 or 6.2.z.

Bug is public so Emulex and EMC can view it.

Comment 5 Larry Troan 2011-10-05 00:40:39 UTC
May be DUP of Bug 738934 per comment #3 above.

Comment 6 Martin Wilck 2011-10-05 07:57:03 UTC
This is of course important (it's a PANIC), and we haven't been able to release be2iscsi for RHEL6.1 for this reason (and bug 726353).

I can't see comment #3, nor bug 738924.

Comment 7 Gary Smith 2011-10-05 08:35:25 UTC
I've already requested FJ access to Bug 738934 before possibly closing it as DUP - awaiting response.

Comment 9 Mike Christie 2012-03-16 01:34:20 UTC
The oops in this bz described in comment #2 should be fixed in 6.2. It got fixed in a general be2iscsi update in 6.2.

https://bugzilla.redhat.com/show_bug.cgi?id=738934 is for a different issue. We saw the oops too in that bz, but we are trying to figure out why a scsi command timed out. In our test, when we enabled CHAP it would lead to scsi commands timing out and the scsi eh running. We would then hit the same oops you guys hit in here. If we disabled CHAP though, it all worked ok. No scsi command timeouts. So we only have 738934 open to investigate why CHAP is causing problems for some of our setups.

Comment 13 Gary Smith 2012-08-06 09:10:59 UTC
@martin

Have you checked this issue with 6.3?

Mike's comment#9 suggests that this particular oops was resolved in 6.2 and the other similar problem from bug 738934 was resolved in kernel-2.6.32-279.el6 / RHEL6.3 via http://rhn.redhat.com/errata/RHSA-2012-0862.html

"Enabling CHAP (Challenge-Handshake Authentication Protocol) on an iSCSI target for the be2iscsi driver results in kernel panic. To work around this issue, disable CHAP on the iSCSI target."

Also, bug 726353 was closed awaiting a retest from your QA guys, so a retest on 6.3 would help clarify the current status.

Comment 14 Martin Wilck 2012-08-07 08:23:00 UTC
We haven't seen this any more. The bug can be closed.

Comment 15 Gary Smith 2012-08-07 09:22:27 UTC
Thanks for the clarification, closing as requested.