Hide Forgot
Description of problem: Installation on Fujitsu PRIMERGY BX960S1 over iSCSI boot + multipath using be2iscsi (Emulex OCe10102 CNA) and multipath. Panic during first boot after installation, when the EMC multipath handler (dm_emc) is loaded. Before the panic, there are lots of errors when the kernel tries to access the backup path of the Clariion (I think that's normal behavior). Version-Release number of selected component (if applicable): 2.6.32-131.0.15.el6 How reproducible: always Steps to Reproduce: 1. Install RHEL6.1 on a iSCSI volume accessed through be2iscsi with multipath. Actual results: Panic at first reboot Expected results: No panic Additional info: All looks OK at first, until the EMC driver jumps in. iscsi: registered transport (be2iscsi) be2iscsi 0000:10:00.2: PCI INT C -> GSI 35 (level, low) -> IRQ 35 scsi0 : ServerEngines 10Gbe open-iscsi Initiator Driver eth0: Link up (beiscsi_async_link_state_process():243):Link UP on Physical Port 0 be2iscsi 0000:10:00.3: PCI INT D -> GSI 36 (level, low) -> IRQ 36 scsi1 : ServerEngines 10Gbe open-iscsi Initiator Driver (beiscsi_async_link_state_process():243):Link UP on Physical Port 1 Logging into iqn.1992-04.com.emc:cx.fnm00082400064.a10 172.17.140.133:3260,1^M iscsistart: Connection1:0 to [target: iqn.1992-04.com.emc:cx.fnm00082400064.a10, portal: 172.17.140.133,3260] through [iface: be2iscsi.00:00:c9:c0:dc:ef] is operational now^M scsi 0:0:0:0: Direct-Access DGC RAID 5 0430 PQ: 0 ANSI: 4 iscsistart: Logging into iqn.1992-04.com.emc:cx.fnm00082400064.b11 172.17.141.133:3260,1^M scsi 1:0:0:0: Direct-Access DGC RAID 5 0430 PQ: 0 ANSI: 4 iscsistart: Connection2:0 to [target: iqn.1992-04.com.emc:cx.fnm00082400064.b11, portal: 172.17.141.133,3260] through [iface: be2iscsi.00:00:c9:c0:dc:f1] is operational now^M ... sd 0:0:0:0: [sda] 41943040 512-byte logical blocks: (21.4 GB/20.0 GiB) sd 0:0:0:0: [sda] Write Protect is off sd 1:0:0:0: [sdb] 41943040 512-byte logical blocks: (21.4 GB/20.0 GiB) sd 0:0:0:0: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 1:0:0:0: [sdb] Test WP failed, assume Write Enabled sd 1:0:0:0: [sdb] Asking for cache data failed sd 1:0:0:0: [sdb] Assuming drive cache: write through sd 0:0:0:0: [sda] Attached SCSI disk sd 1:0:0:0: [sdb] Test WP failed, assume Write Enabled sd 1:0:0:0: [sdb] Asking for cache data failed sd 1:0:0:0: [sdb] Assuming drive cache: write through sdb: sd 1:0:0:0: [sdb] Device not ready sd 1:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 1:0:0:0: [sdb] Sense Key : Not Ready [current] sd 1:0:0:0: [sdb] Add. Sense: Logical unit not ready, manual intervention required sd 1:0:0:0: [sdb] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00 end_request: I/O error, dev sdb, sector 0 ... [ I think this is normal, the Clariion backup path is not responding, no way to handle this until multipath is up ] ... sd 0:0:0:0: emc: connected to SP A Port 10 (owned, default SP A) sd 1:0:0:0: emc: detected Clariion CX4-480, flags 0 sd 1:0:0:0: emc: connected to SP B Port 11 (bound, default SP A) emc: device handler registered device-mapper: multipath round-robin: version 1.0.0 loaded sd 0:0:0:0: emc: at SP A Port 10 (owned, default SP A) sd 0:0:0:0: emc: at SP A Port 10 (owned, default SP A) BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0 IP: [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi] PGD 4639bd067 PUD 4638ff067 PMD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/devices/virtual/block/dm-1/dev CPU 0 Modules linked in: dm_round_robin scsi_dh_emc sd_mod crc_t10dif ahci ixgbe dca mdio crc32c_intel be2iscsi be2net dm_multipath dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs [last unloaded: scsi_wait_scan] Modules linked in: dm_round_robin scsi_dh_emc sd_mod crc_t10dif ahci ixgbe dca mdio crc32c_intel be2iscsi be2net dm_multipath dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iscsi_ibft iscsi_boot_sysfs [last unloaded: scsi_wait_scan] Pid: 0, comm: swapper Not tainted 2.6.32-131.0.15.el6.x86_64 #1 PRIMERGY BX960 S1 RIP: 0010:[<ffffffffa008e5aa>] [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi] RSP: 0018:ffff880036603dd0 EFLAGS: 00010206 RAX: 0000000081000000 RBX: ffff88046694e080 RCX: ffff88046a638558 RDX: 0000000000810000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff880036603e50 R08: ffff88046a6384c0 R09: ffff88046778ba48 R10: ffff88046aafb300 R11: 0000000000000001 R12: ffff88046730cca0 R13: ffff880469d30688 R14: 0000000000000000 R15: 0000000000000202 FS: 0000000000000000(0000) GS:ffff880036600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b CR2: 00000000000000e0 CR3: 00000004639b8000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper (pid: 0, threadinfo ffffffff81a00000, task ffffffff81a2d020) Stack: ffff88046573cc18 ffff880400000005 ffff88046dad0000 ffff88046a6384c0 <0> fe05000181268d35 ffff880469d308c4 ffff88046573cc18 0000000000000000 <0> ffff880036610f88 0000000000010f40 0000000000000000 000000000000000a Call Trace: <IRQ> [<ffffffffa008e764>] be_iopoll+0x24/0x70 [be2iscsi] [<ffffffff8124fd96>] blk_iopoll_softirq+0x96/0x110 [<ffffffffa008deb1>] ? be_isr_msix+0x1a1/0x260 [be2iscsi] [<ffffffff8106f717>] __do_softirq+0xb7/0x1e0 [<ffffffff810d6940>] ? handle_IRQ_event+0x60/0x170 [<ffffffff8100c2cc>] call_softirq+0x1c/0x30 [<ffffffff8100df05>] do_softirq+0x65/0xa0 [<ffffffff8106f505>] irq_exit+0x85/0x90 [<ffffffff814e3505>] do_IRQ+0x75/0xf0 [<ffffffff8100bad3>] ret_from_intr+0x0/0x11 <EOI> [<ffffffff812bb86e>] ? intel_idle+0xde/0x170 [<ffffffff812bb851>] ? intel_idle+0xc1/0x170 [<ffffffff813eccb7>] cpuidle_idle_call+0xa7/0x140 [<ffffffff81009e96>] cpu_idle+0xb6/0x110 [<ffffffff814c376a>] rest_init+0x7a/0x80 [<ffffffff81bbdf28>] start_kernel+0x41d/0x429 [<ffffffff81bbd33a>] x86_64_start_reservations+0x125/0x129 [<ffffffff81bbd438>] x86_64_start_kernel+0xfa/0x109 Code: 84 dc fe ff ff 49 8b 88 90 00 00 00 8b 43 08 4d 8b 0f 44 8b 7b 04 4c 8b 51 30 89 45 a4 8b 03 49 8b 78 50 89 c2 c1 ea 08 0f b6 f2 <89> b7 e0 00 00 00 89 c6 c1 ee 10 40 84 f6 75 5e c1 e8 18 83 c8 RIP [<ffffffffa008e5aa>] beiscsi_process_cq+0x63a/0x7d0 [be2iscsi] RSP <ffff880036603dd0> CR2: 00000000000000e0 ---[ end trace 4a76a2e6379c3ab8 ]---
I think this is the following code: 1133 static void 1134 be_complete_io(struct beiscsi_conn *beiscsi_conn, 1135 struct iscsi_task *task, struct sol_cqe *psol) 1136 { ... 1156 status = ((psol->dw[offsetof(struct amap_sol_cqe, i_sts) / 32] 1157 & SOL_STS_MASK) >> 8); 1158 flags = ((psol->dw[offsetof(struct amap_sol_cqe, i_flags) / 32] 1159 & SOL_FLAGS_MASK) >> 24) | 0x80; 1160 1161 task->sc->result = (DID_OK << 16) | status; <==== Crash Looks as if be_complete_io() is calles with a task with sc == NULL. It is apparently possible that task->sc can be NULL, at least there are some places in the code where this condition is checked. Maybe the dh_emc driver creates a special command that would cause this situation to occur?
Setting FJ6.2bugs tracker (was FJ6.1Bugs tracker) and requested an exception until we understand the importance to Fujitsu to fix in 6.2 or 6.2.z. Bug is public so Emulex and EMC can view it.
May be DUP of Bug 738934 per comment #3 above.
This is of course important (it's a PANIC), and we haven't been able to release be2iscsi for RHEL6.1 for this reason (and bug 726353). I can't see comment #3, nor bug 738924.
I've already requested FJ access to Bug 738934 before possibly closing it as DUP - awaiting response.
The oops in this bz described in comment #2 should be fixed in 6.2. It got fixed in a general be2iscsi update in 6.2. https://bugzilla.redhat.com/show_bug.cgi?id=738934 is for a different issue. We saw the oops too in that bz, but we are trying to figure out why a scsi command timed out. In our test, when we enabled CHAP it would lead to scsi commands timing out and the scsi eh running. We would then hit the same oops you guys hit in here. If we disabled CHAP though, it all worked ok. No scsi command timeouts. So we only have 738934 open to investigate why CHAP is causing problems for some of our setups.
@martin Have you checked this issue with 6.3? Mike's comment#9 suggests that this particular oops was resolved in 6.2 and the other similar problem from bug 738934 was resolved in kernel-2.6.32-279.el6 / RHEL6.3 via http://rhn.redhat.com/errata/RHSA-2012-0862.html "Enabling CHAP (Challenge-Handshake Authentication Protocol) on an iSCSI target for the be2iscsi driver results in kernel panic. To work around this issue, disable CHAP on the iSCSI target." Also, bug 726353 was closed awaiting a retest from your QA guys, so a retest on 6.3 would help clarify the current status.
We haven't seen this any more. The bug can be closed.
Thanks for the clarification, closing as requested.