Description of problem:
Running a recommended test case from BZ 662154 Causes a panic on CCISS systems.

Steps to Reproduce:
1. Install 5.6 onto a CCISS system and run the following test case.

#!/bin/bash
((i=0))
while (( i<100000))
do
    sg_turs /dev/cciss/c0d0
    ((i+=1))
done

The system will panic.

# ./test.sh
Unable to handle kernel NULL pointer dereference at 0000000000000030 RIP:
 [<ffffffff880b94fc>] :cciss:cciss_softirq_done+0xea/0x36d
PGD 0
Oops: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/resource
CPU 1
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 xfrm_nalgo crypto_api loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ata_piix libata ide_cd be2iscsi i5000_edac cdrom pcspkr libiscsi2 be2net edac_mc tpm_tis scsi_transport_iscsi2 serio_raw scsi_transport_iscsi 8021q hpilo tpm tpm_bios bnx2 shpchp dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-236.el5 #1
RIP: 0010:[<ffffffff880b94fc>] [<ffffffff880b94fc>] :cciss:cciss_softirq_done+0xea/0x36d
RSP: 0018:ffff8101aff37ec0 EFLAGS: 00010246
RAX: 0000000040002988 RBX: 0000000000000002 RCX: ffff81019b287608
RDX: ffff8101aff37f20 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff810037e00034 R08: 0000000000000000 R09: ffff8101aff31e38
R10: 0000000000000082 R11: 0000000000000048 R12: 0000000000000000
R13: ffff81019b2875f8 R14: ffff8101aff50000 R15: ffff810037e00000
FS: 0000000000000000(0000) GS:ffff8101aff147c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000030 CR3: 0000000000201000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff8101aff30000, task ffff8101aff18100)
Stack: 0000000000000086 0000000000000082 ffffffff880bc1c9 0000000000000046
 ffff8101aff50000 0000000000000046 0000000000000001 ffffffff8043cfc0
 000000000000000a 0000000000000001 ffffffff8044f280 ffffffff80037e5a
Call Trace:
 <IRQ> [<ffffffff880bc1c9>] :cciss:do_cciss_intr+0xaab/0xae8
 [<ffffffff80037e5a>] blk_done_softirq+0x5f/0x6d
 [<ffffffff80012464>] __do_softirq+0x89/0x133
 [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006d5f5>] do_softirq+0x2c/0x7d
 [<ffffffff8006d485>] do_IRQ+0xec/0xf5
 [<ffffffff80057083>] mwait_idle+0x0/0x20
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI> [<ffffffff8006b981>] mwait_idle_with_hints+0x66/0x67
 [<ffffffff8005708f>] mwait_idle+0xc/0x20
 [<ffffffff80049360>] cpu_idle+0x95/0xb8
 [<ffffffff80078672>] start_secondary+0x490/0x49f

Code: 8b 57 30 74 0a 89 d5 81 e5 00 fe ff ff eb 07 41 8b ad dc 00
RIP [<ffffffff880b94fc>] :cciss:cciss_softirq_done+0xea/0x36d
 RSP <ffff8101aff37ec0>
CR2: 0000000000000030
 <0>Kernel panic - not syncing: Fatal exception
(In reply to comment #0) I was able to reproduce this on another system, will look into it.
Hi Mike, Steve, it's very likely that this is caused by the latest patchset - update to 3.6.22 ported to RHEL5. Please look at this issue.
Created attachment 471810 [details]
fix panic in blk_rq_bytes

When the TUR is sent down, rq->bio is NULL inside blk_rq_bytes, which causes a NULL pointer dereference here:

    int nr_sectors = bio_sectors(rq->bio);

Let me know if the patch is OK for you.
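The fault address in the oops (CR2 = 0x30, with RDI = 0) is consistent with reading a field through a NULL rq->bio: the first code bytes, 8b 57 30, decode as a 32-bit load from offset 0x30 of RDI. The sketch below checks that arithmetic against a mock layout. The struct mock_bio_layout and its field list are assumptions modeled on a 2.6.18-era struct bio on x86_64 and have not been verified against the RHEL5 source; fixed-width integers stand in for the kernel's pointer and unsigned long members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical mirror of the leading fields of struct bio on x86_64.
 * This is an assumed layout for illustration, not copied from the
 * RHEL5 kernel headers.
 */
struct mock_bio_layout {
    uint64_t bi_sector;          /* sector_t */
    uint64_t bi_next;            /* pointer in the kernel */
    uint64_t bi_bdev;            /* pointer in the kernel */
    uint64_t bi_flags;           /* unsigned long */
    uint64_t bi_rw;              /* unsigned long */
    uint16_t bi_vcnt;
    uint16_t bi_idx;
    uint16_t bi_phys_segments;
    uint16_t bi_hw_segments;
    uint32_t bi_size;            /* the field bio_sectors() reads */
};

/* Address a load of bi_size would touch for a bio located at bio_addr. */
static uint64_t mock_fault_address(uint64_t bio_addr)
{
    return bio_addr + offsetof(struct mock_bio_layout, bi_size);
}

/* Under the assumed layout, bi_size sits at offset 0x30. */
static_assert(offsetof(struct mock_bio_layout, bi_size) == 0x30,
              "assumed layout places bi_size at 0x30");
```

With a NULL bio, mock_fault_address(0) gives 0x30, which matches the CR2 value reported in the oops under these assumptions.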
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Tomas, your patch attachment 471810 [details] looks fine.

For comparison, here is what we have in our CVS for the driver which we build for RHEL5 for the blk_rq_bytes function:

/**
 * blk_rq_bytes - Returns bytes left to complete in the entire request
 * @rq: the request being processed
 * this function is copied from later kernels (2.6.29-ish), where it is
 * normally defined in blk/blk-core.c
 * Slightly modified for older kernels.
 **/
static unsigned int blk_rq_bytes(struct request *rq)
{
        int nr_sectors;

        if (blk_fs_request(rq)) {
                BUG_ON(!rq->bio);
                nr_sectors = bio_sectors(rq->bio);
                return nr_sectors << 9;
        }
        return rq->data_len;
}

-- steve
(In reply to comment #7)
> Tomas, your patch attachment 471810 [details] looks fine.

Thanks Steve.
In our git the blk_rq_bytes is called only when blk_pc_request(rq) is true, so the branch

> if (blk_fs_request(rq)) {
> BUG_ON(!rq->bio);
> nr_sectors = bio_sectors(rq->bio);
> return nr_sectors << 9;
> }

is never taken.

Maybe I could misuse the opportunity :) - could you review also the
https://bugzilla.redhat.com/attachment.cgi?id=467508
in
https://bugzilla.redhat.com/show_bug.cgi?id=635143#c28
?
(In reply to comment #8)
> (In reply to comment #7)
> > Tomas, your patch attachment 471810 [details] looks fine.
> Thanks Steve.
> In our git the blk_rq_bytes is called only when blk_pc_request(rq) is true
> so the branch
> > if (blk_fs_request(rq)) {
> > BUG_ON(!rq->bio);
> > nr_sectors = bio_sectors(rq->bio);
> > return nr_sectors << 9;
> > }
> is never taken.

Hmm. This appears to be true in our driver as well. Good catch. Not that it hurts much as it is. The compiler might even be smart enough to figure that out, since blk_rq_bytes is only called in one place that I see and so probably gets inlined, and those blk_xx_request() are macros iirc (that got removed in later kernels for some reason)... but might be a long shot for it to be that smart.

>
> Maybe I could misuse the opportunity :) - could you review also the
> https://bugzilla.redhat.com/attachment.cgi?id=467508
> in
> https://bugzilla.redhat.com/show_bug.cgi?id=635143#c28
> ?

Ok, will take a look.

-- steve
in kernel-2.6.18-241.el5

You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
When a TUR (Test Unit Ready) command was executed with the cciss driver, the rq->bio pointer in the blk_rq_bytes function was NULL. This resulted in a NULL pointer dereference and, consequently, a kernel panic. With this update, the rq->bio pointer is used only when the blk_fs_request(rq) condition is true, so the kernel panic no longer occurs.
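The guard described in the technical note can be sketched in userspace as follows. This is a minimal illustration, not the contents of attachment 471810: every mock_* type and function is a hypothetical stand-in for the corresponding kernel name, and assert() plays the role of BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio, reduced to the one field used here. */
struct mock_bio {
    unsigned int bi_size;        /* bytes remaining in this bio */
};

/* Hypothetical stand-in for the request type checked by blk_fs_request(). */
enum mock_cmd_type { MOCK_FS_REQUEST, MOCK_PC_REQUEST };

/* Hypothetical stand-in for struct request. */
struct mock_request {
    enum mock_cmd_type cmd_type;
    struct mock_bio *bio;        /* NULL for a data-less command such as TUR */
    unsigned int data_len;       /* byte count for non-filesystem requests */
};

static int mock_blk_fs_request(const struct mock_request *rq)
{
    return rq->cmd_type == MOCK_FS_REQUEST;
}

static unsigned int mock_bio_sectors(const struct mock_bio *bio)
{
    return bio->bi_size >> 9;    /* 512-byte sectors */
}

/* rq->bio is dereferenced only on the filesystem-request path. */
static unsigned int mock_blk_rq_bytes(const struct mock_request *rq)
{
    if (mock_blk_fs_request(rq)) {
        assert(rq->bio != NULL); /* stands in for BUG_ON(!rq->bio) */
        return mock_bio_sectors(rq->bio) << 9;
    }
    return rq->data_len;
}
```

A request shaped like the TUR case (MOCK_PC_REQUEST with a NULL bio and data_len of 0) returns 0 from mock_blk_rq_bytes without touching the bio pointer, while a filesystem request with a 4096-byte bio returns 4096.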
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html