Bug 443053

Summary:

cciss driver crash

Product:

Red Hat Enterprise Linux 4

Reporter:

Vivek Goyal <vgoyal>

Component:

kernel

Assignee:

Tomas Henzl <thenzl>

Status:

CLOSED ERRATA

QA Contact:

Martin Jenner <mjenner>

Severity:

high

Priority:

high

Version:

4.7

CC:

atodorov, coldwell, coughlan, dchapman, jburke, jgiles, luyu, mike.miller, steve.cameron, tcamuso

Hardware:

All

OS:

Linux

Fixed In Version:

RHSA-2008-0665

Doc Type:

Bug Fix

Last Closed:

2008-07-24 19:29:12 UTC

Attachments:

Description	Flags
simple reproducer	none
disable SCSI_IOCTL_SEND_COMMAND for cciss	none

Description Vivek Goyal 2008-04-18 13:00:47 UTC

Description of problem:

The cciss driver caused a system crash.

Version-Release number of selected component (if applicable):

RHEL4 U7 68.32

How reproducible:

Noticed once in RHTS Tier2 testing

Steps to Reproduce:
1.
2.
3.
  
Actual results:

System crashed

Expected results:

System should not crash

Additional info:

Failure logs are here.

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=2732796

Comment 1 Luming Yu 2008-04-21 03:08:20 UTC

The log shows :
>anaconda[710]: bugcheck! 0 [1]
>....
>Pid: 710, CPU 3, comm:             anaconda
>psr : 0000101008022018 ifs : 8000000000000a9d ip  : [<a0000002002e8a90>]    Not
>tainted
>ip is at do_cciss_request+0x11d0/0x1320 [cciss]

The corresponding kernel code is:

>  case 0: /* unknown error (used by GCC for __builtin_abort()) */
>  if (notify_die(DIE_BREAK, "break 0", regs, break_num, TRAP_BRKPT, SIGTRAP)
>                         == NOTIFY_STOP)
>      return;
>  if (die_if_kernel("bugcheck!", regs, break_num))
>      return;
>  sig = SIGILL; code = ILL_ILLOPC;
>      break;


So please verify whether it is caused by __builtin_abort in this driver.
Then we probably need to figure out how the driver could reach __builtin_abort.

Comment 2 Tomas Henzl 2008-04-21 11:58:53 UTC

I was able to boot the kernel 2.6.9-68.32 on the affected machine manually.
Could the problem be somewhere in anaconda?

[root@hp-bl860c-01 ~]# uname -a
Linux hp-bl860c-01.rhts.boston.redhat.com 2.6.9-68.32.EL #1 SMP Mon Apr 7
15:34:52 EDT 2008 ia64 ia64 ia64 GNU/Linux

Comment 3 Vivek Goyal 2008-04-21 15:46:24 UTC

(In reply to comment #1)
 
> So please verify whether it is caused by __builtin_abort in this driver.
> Then we probably need to figure out how the driver could reach __builtin_abort.

My hunch is that we are hitting a BUG() in the cciss driver, which in turn
calls ia64_abort(). Somehow the BUG() message is not visible in the logs, and
that could be because of the log level.

I did a disassembly of the cciss code and offset 11d0 seems to be mapped to line
3097 of cciss.c

drivers/block/cciss.c:3097
   10026:       46 00 01 00 00 00             (p25) break.i 0x1004
   1002c:       00 00 00 00                         break.i 0x0
   10030:       00 00 00 00 00 00       [MII]       break.m 0x0
   10036:       00 00 00 00 47 5f                   addp4 r0=-8192,r0
   1003c:       76 00 03 51                   (p62) tnat.z.or p50,p0=r96

And line 3097 is BUG().

So to me it looks like the cciss driver thinks that it got an invalid request.
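
For readers following along, here is a minimal userspace sketch of how a BUG()-style macro could lead to the "bugcheck! 0" report: it assumes, as this thread suggests, that BUG() prints a "kernel BUG at file:line!" message and then executes a trap. MODEL_BUG, model_trap, counting_trap and check_request_type are hypothetical names made up for illustration; this is not the kernel's code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the trap instruction; the real kernel does not
 * call abort(), it executes a break that the trap handler reports. */
static void default_trap(void) { abort(); }

/* Hook (an assumption of this sketch) so a caller can observe the trap. */
static void (*model_trap)(void) = default_trap;
static int model_trap_count;
static void counting_trap(void) { model_trap_count++; }

/* Last message produced by MODEL_BUG(), kept for inspection. */
static char model_bug_msg[256];

/* Print the location first, then trap: if the print is lost (for example
 * because of the console log level), only the trap report remains. */
#define MODEL_BUG() do {                                              \
        snprintf(model_bug_msg, sizeof(model_bug_msg),                \
                 "kernel BUG at %s:%d!", __FILE__, __LINE__);         \
        fprintf(stderr, "%s\n", model_bug_msg);                       \
        model_trap();                                                 \
} while (0)

/* Returns 0 for a request type the model knows, -1 after a simulated BUG. */
static int check_request_type(int known_type)
{
        if (!known_type) {
                MODEL_BUG();
                return -1;
        }
        return 0;
}
```

Setting model_trap to counting_trap lets a test count traps instead of aborting; with the default hook an unknown request type terminates the process, which is the closest userspace analogue of the panic.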

Comment 4 Vivek Goyal 2008-04-21 15:49:09 UTC

(In reply to comment #2)
> I was able to boot the kernel 2.6.9-68.32 on the affected machine manually.
> Could the problem be somewhere in anaconda?
> 
> [root@hp-bl860c-01 ~]# uname -a
> Linux hp-bl860c-01.rhts.boston.redhat.com 2.6.9-68.32.EL #1 SMP Mon Apr 7
> 15:34:52 EDT 2008 ia64 ia64 ia64 GNU/Linux
> 

I can also boot 68.32 successfully. Looking at the backtrace, it looks like some
kind of ioctl is being invoked on a device managed by the cciss driver, and that
crashes it. It looks like anaconda calls that ioctl; in a normal boot we don't
call it, hence we are fine.

I will see if I can reproduce the issue...

Comment 5 Vivek Goyal 2008-04-21 17:26:35 UTC

I just tried to install this machine with 68.32 again through rhts and it
crashes again. So this issue is reproducible on this machine and certainly
anaconda does some operation (most likely invoking an ioctl) and it crashes the
kernel...

Comment 6 Luming Yu 2008-04-22 02:34:57 UTC

Adding Doug.

Comment 7 Luming Yu 2008-04-22 02:38:11 UTC

Because this is an HP box, Doug probably knows something about the crash.

Comment 8 Luming Yu 2008-04-22 03:13:49 UTC

Based on comment #3, this bug sounds more like a driver-specific issue than an
IA64 architecture problem, although it was observed on an IPF box. So it should
affect all platforms with devices managed by the cciss driver.
Moving it to the generic category.

Comment 9 Doug Chapman 2008-04-22 03:22:16 UTC

Since (I assume) this has not been seen on any other systems I will check the
cciss firmware on the box.  Perhaps it is out of date.

Comment 10 Doug Chapman 2008-04-22 04:54:59 UTC

Firmware update didn't make any difference.  We still hit the panic on the
ioctl.  I will see if I can reproduce this on any other hardware.

Here is the stack trace of the panic:

anaconda[710]: bugcheck! 0 [1]
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6
raid5 xor raid1 raid0 qla2400 cciss mptscsih mptsas mptspi mptscsi mptbase
qla2xxx scsi_transport_fc tg3 ohci_hcd ehci_hcd sr_mod sd_mod scsi_mod
lapic_status loop nfs nfs_acl lockd sunrpc vfat fat cramfs

Pid: 710, CPU 3, comm:             anaconda
psr : 0000101008022018 ifs : 8000000000000a9d ip  : [<a0000002002e8a90>]    Not
tainted
ip is at do_cciss_request+0x11d0/0x1320 [cciss]
unat: 0000000000000000 pfs : 0000000000000a9d rsc : 0000000000000003
rnat: 3841a5799c0eb221 bsps: ea596e21aabae15c pr  : 00400280009599a9
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002002e8a90 b6  : a00000010006eb40 b7  : a00000010038bea0
f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
f8  : 1003e00000000000023dc f9  : 1003e000000000e580000
f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
r1  : a0000001009e0e30 r2  : 0000000000000001 r3  : 0000000000100000
r8  : 000000000000002a r9  : 0000000000000001 r10 : e00000000103540c
r11 : 0000000000000003 r12 : e00000000bbcf740 r13 : e00000000bbc8000
r14 : 0000000000004000 r15 : e00000000bbc8de0 r16 : e000000001034af0
r17 : 0000000000000014 r18 : e0000100fd82802c r19 : e000000001035400
r20 : e000000001034ac0 r21 : 0000000000000002 r22 : 0000000000000001
r23 : e0000100fd828040 r24 : e000000001035b60 r25 : e000000001035b58
r26 : e000000001035b38 r27 : 0000000000000074 r28 : 0000000000000074
r29 : 0000000000000065 r30 : e0000100fd828050 r31 : 00000000356f424c

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e00000000bbcf2b0 bsp=e00000000bbc9430
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e00000000bbcf480 bsp=e00000000bbc93e0
 [<a00000010003e9b0>] die+0x150/0x240
                                sp=e00000000bbcf4a0 bsp=e00000000bbc93a0
 [<a00000010003eae0>] die_if_kernel+0x40/0x60
                                sp=e00000000bbcf4a0 bsp=e00000000bbc9370
 [<a00000010003ec80>] ia64_bad_break+0x180/0x600
                                sp=e00000000bbcf4a0 bsp=e00000000bbc9348
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e00000000bbcf570 bsp=e00000000bbc9348
 [<a0000002002e8a90>] do_cciss_request+0x11d0/0x1320 [cciss]
                                sp=e00000000bbcf740 bsp=e00000000bbc9260
 [<a0000001003722f0>] __generic_unplug_device+0xd0/0x100
                                sp=e00000000bbcfb30 bsp=e00000000bbc9240
 [<a000000100372350>] generic_unplug_device+0x30/0x60
                                sp=e00000000bbcfb30 bsp=e00000000bbc9218
 [<a000000100373a40>] blk_execute_rq+0x1a0/0x220
                                sp=e00000000bbcfb30 bsp=e00000000bbc91d8
 [<a000000100381100>] scsi_cmd_ioctl+0xfe0/0x1520
                                sp=e00000000bbcfbb0 bsp=e00000000bbc9150
 [<a0000002002effb0>] cciss_ioctl+0x1990/0x3c60 [cciss]
                                sp=e00000000bbcfd10 bsp=e00000000bbc9080
 [<a00000010037ab00>] blkdev_ioctl+0x220/0xc00
                                sp=e00000000bbcfe20 bsp=e00000000bbc9038
 [<a0000001001409c0>] block_ioctl+0x40/0x60
                                sp=e00000000bbcfe20 bsp=e00000000bbc9000
 [<a000000100159a20>] sys_ioctl+0x6a0/0xb20
                                sp=e00000000bbcfe20 bsp=e00000000bbc8f68
 [<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20
                                sp=e00000000bbcfe30 bsp=e00000000bbc8f68

Comment 11 Chip Coldwell 2008-04-22 15:35:35 UTC

The stack trace looks like a race between the ioctl path and the normal block
I/O path.

Chip

Comment 12 Doug Chapman 2008-04-22 16:02:28 UTC

I reproduced this on hp-sapphire-02.rhts which has the same model of cciss card
(P600) so this does not appear to be specific to that system.  We need to see if
this crash happens on x86 also.  I am adding Tony Camuso who handles our
proliant systems.

I tried last night running anaconda in userspace in --test mode but was unable
to reproduce this.  However, I had only updated anaconda and the kernel, perhaps
I should try a full yum upgrade and then try.

Comment 13 Doug Chapman 2008-04-22 17:18:45 UTC

Also reproduced on an hp rx4640 with an older 5304-256 model smart array card so
this appears to be a general cciss driver bug, not specific to a specific smart
array card.

Adding Mike Miller to the CC list.

Comment 14 Doug Chapman 2008-04-22 18:01:37 UTC

(In reply to comment #11)
> The stack trace looks like a race between the ioctl path and the normal block
> I/O path.
> 
> Chip
> 

I tried booting with maxcpus=1 to see if that would avoid this (which it should
if it were a race); however, I still hit the panic.

I scheduled a reserve workflow on hp-ml370g5-01 (the only free HP x86_64 box
with cciss I could find) however RHTS still has not scheduled the job.  Once
that runs hopefully it will tell us if this is ia64 specific.

- Doug

Comment 15 Tomas Henzl 2008-04-28 14:50:24 UTC

I successfully installed build RHEL4.7-20080424-i386 with kernel '2.6.9-69.ELsmp
#1 SMP Tue Apr 15 18:33:35 EDT 2008 i686 i686 i386 GNU/Linux' on a box with HP
Smart Array 5i Controller.
This could mean that it is architecture dependent (ia64).

Comment 16 Doug Chapman 2008-04-28 15:24:20 UTC

(In reply to comment #15)
> I successfully installed build RHEL4.7-20080424-i386 with kernel '2.6.9-69.ELsmp
> #1 SMP Tue Apr 15 18:33:35 EDT 2008 i686 i686 i386 GNU/Linux' on a box with HP
> Smart Array 5i Controller.
> This could mean that it is architecture dependent (ia64).

Now this is fun, I just tried this tree on one of my ia64 systems that hit the
crash last week and it works now too.

So, either this magically got fixed, or it is some race condition we just don't
hit all the time.  It would be best if we could try on the system that it was
originally reported on; however, it was moved to the new lab and isn't back up yet.

Comment 19 Doug Chapman 2008-05-06 18:56:52 UTC

I just hit this again on hp-sapphire-02 when installing RHEL4.7/kernel-2.6.9-69

I have tried multiple things to try to reproduce this at runtime without any
success.  I am open to ideas as trying to debug this under anaconda at install
time seems pretty much impossible.

anaconda[1072]: bugcheck! 0 [1]
Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6
raid5 xor raid1 raid0 cciss e1000 usb_storage ohci_hcd ehci_hcd sr_mod sd_mod
scsi_mod lapic_status loop nfs nfs_acl lockd sunrpc vfat fat cramfs

Pid: 1072, CPU 0, comm:             anaconda
psr : 0000101008022038 ifs : 8000000000000a9d ip  : [<a0000002002c0a90>]    Not
tainted
ip is at do_cciss_request+0x11d0/0x1320 [cciss]
unat: 0000000000000000 pfs : 0000000000000a9d rsc : 0000000000000003
rnat: 2fe3e13bb0aec078 bsps: b75fee1a3745b502 pr  : 004002800095a9a9
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000002002c0a90 b6  : a000000100016020 b7  : a00000010038bea0
f6  : 1003e0000000000001200 f7  : 1003e8080808080808081
f8  : 1003e00000000000023dc f9  : 1003e000000000e580000
f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf
r1  : a0000001009e0e30 r2  : 0000000000006000 r3  : 0000000000006000
r8  : 000000000000002a r9  : 00000000000000fd r10 : a0000001007f3880
r11 : 0000000000000600 r12 : e000000102327740 r13 : e000000102320000
r14 : 0000000000004000 r15 : c0000000fee00000 r16 : 00000000000000fd
r17 : 0000000000000006 r18 : a000000100a087c0 r19 : a000000100a087c0
r20 : 0000000000000004 r21 : 0000000000000000 r22 : 0000000000000000
r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000004
r26 : e00000003dd20dd0 r27 : 0000000000000000 r28 : e000000102320dd4
r29 : e00000003dd20dd4 r30 : e000000100838050 r31 : 00000000356f424c

Call Trace:
 [<a000000100016e40>] show_stack+0x80/0xa0
                                sp=e0000001023272b0 bsp=e000000102321430
 [<a000000100017750>] show_regs+0x890/0x8c0
                                sp=e000000102327480 bsp=e0000001023213e0
 [<a00000010003e9b0>] die+0x150/0x240
                                sp=e0000001023274a0 bsp=e0000001023213a0
 [<a00000010003eae0>] die_if_kernel+0x40/0x60
                                sp=e0000001023274a0 bsp=e000000102321370
 [<a00000010003ec80>] ia64_bad_break+0x180/0x600
                                sp=e0000001023274a0 bsp=e000000102321348
 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260
                                sp=e000000102327570 bsp=e000000102321348
 [<a0000002002c0a90>] do_cciss_request+0x11d0/0x1320 [cciss]
                                sp=e000000102327740 bsp=e000000102321260
 [<a0000001003722f0>] __generic_unplug_device+0xd0/0x100
                                sp=e000000102327b30 bsp=e000000102321240
 [<a000000100372350>] generic_unplug_device+0x30/0x60
                                sp=e000000102327b30 bsp=e000000102321218
 [<a000000100373a40>] blk_execute_rq+0x1a0/0x220
                                sp=e000000102327b30 bsp=e0000001023211d8
 [<a000000100381100>] scsi_cmd_ioctl+0xfe0/0x1520
                                sp=e000000102327bb0 bsp=e000000102321150
 [<a0000002002c7fb0>] cciss_ioctl+0x1990/0x3c60 [cciss]
                                sp=e000000102327d10 bsp=e000000102321080
 [<a00000010037ab00>] blkdev_ioctl+0x220/0xc00
                                sp=e000000102327e20 bsp=e000000102321038
 [<a0000001001409c0>] block_ioctl+0x40/0x60
                                sp=e000000102327e20 bsp=e000000102321000
 [<a000000100159a20>] sys_ioctl+0x6a0/0xb20
                                sp=e000000102327e20 bsp=e000000102320f68
 [<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20
                                sp=e000000102327e30 bsp=e000000102320f68

Comment 20 Luming Yu 2008-05-07 02:54:31 UTC

Since this looks like a race between the ioctl path and the block I/O path, we
probably need to identify what kind of ioctl is involved here. Maybe we need to
write a test case that exercises the ioctl path while stressing the block I/O
path, to help debug the problem without anaconda.

Comment 21 Mike Miller (OS Dev) 2008-05-07 18:07:41 UTC

There is a function defined in cciss_scsi.c called print_cmd. Right now it's #if
0 out. Move the #if 0, prototype it in cciss.c, and then call it before BUG.
That will print out the CDB so we can see what's being called.
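
As a rough illustration of the kind of output this would give, here is a small userspace sketch that formats a CDB as hex bytes. format_cdb is a hypothetical helper written for this example; it is not the driver's print_cmd, whose exact output is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of a CDB into out as space-separated
 * two-digit hex values. Returns the number of characters written (excluding
 * the terminating NUL), or -1 if out is too small. */
static int format_cdb(const unsigned char *cdb, size_t len,
                      char *out, size_t outsz)
{
        size_t pos = 0;

        assert(cdb != NULL && out != NULL);
        if (outsz == 0)
                return -1;
        out[0] = '\0';

        for (size_t i = 0; i < len; i++) {
                /* No leading space before the first byte. */
                int n = snprintf(out + pos, outsz - pos,
                                 i ? " %02x" : "%02x", cdb[i]);
                if (n < 0 || (size_t)n >= outsz - pos)
                        return -1;      /* output would be truncated */
                pos += (size_t)n;
        }
        return (int)pos;
}
```

In the driver the equivalent output would go through printk just before the BUG() call, so the opcode in the first byte identifies the command that triggered it.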

Comment 22 Doug Chapman 2008-05-07 19:06:19 UTC

(In reply to comment #21)
> There is a function defined in cciss_scsi.c called print_cmd. Right now it's #if
> 0 out. Move the #if 0, prototype it in cciss.c, and then call it before BUG.
> That will print out the CDB so we can see what's being called.

Building a kernel with this now.  Thanks for the info.  Hopefully once we have
this info we can reproduce this outside of anaconda.

Comment 23 Doug Chapman 2008-05-07 22:49:32 UTC

After finally figuring out how to make an initrd.img that anaconda was happy
with I thought I was making some progress.  However now I am stuck.

No matter what I seem to do I cannot get any of my debug output to show up on
the console.  This includes printk and the print_cmd stuff Mike pointed to.  My
guess is there is something funny that anaconda does (although I don't know how
this is possible).  I know I am booting the right version of my module because
if I remove the BUG() statement we get farther and panic at a later point, so
this does confirm that the panic is from cciss.c:3097 but I cannot get any debug
output.

Does anybody have any ideas on what is happening to the output?  I will look at
this with a fresh mind tomorrow.

Comment 24 Doug Chapman 2008-05-08 21:02:21 UTC

Some progress...

I never was able to get any debug output to work, I am still very perplexed
about that.  But, I was able to get netdump working under anaconda and get a
dump of this crash.

Chip mentioned he may need this system (hp-sapphire-02) in a day or two for
another cciss issue but  for now I can make it available to debug the dump. 
Contact me for login info.

Comment 25 Doug Chapman 2008-05-08 21:28:26 UTC

This may help with determining the race condition, also might explain why printk
isn't working here.

In crash if I look at the dmesg buffer via "log" I see:

kernel BUG at drivers/block/cciss.c:2999!

which is this bit of code:

   2998         if (creq->nr_phys_segments > MAXSGENTRIES)
   2999                 BUG();


But this is _not_ the ip given by the panic that follows. According to the
address of the ip, the crash happened at drivers/block/cciss.c:3097:

   3094         } else {
   3095                 printk(KERN_WARNING "cciss%d: bad request type %ld\n",
   3096                                         h->ctlr, creq->flags);
   3097                 BUG();
   3098         }


so, it appears that we are hitting both of these at the same time.  Since one
was already in the process of trying to do output to the console perhaps that is
why the "bad request" printk never happens.  Also I am guessing the fact that we
hit both of these might be our race condition?
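
To make the two checks easier to reason about, here is a minimal userspace model of them. It is a sketch under assumptions: the struct, the flag bits and the limit value are hypothetical stand-ins, and the accepted branches (filesystem and packet-command requests) are assumed from the shape of the quoted code, not copied from cciss.c.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAXSGENTRIES 31   /* assumed limit, for illustration only */
#define MODEL_REQ_CMD      0x1  /* hypothetical "filesystem request" bit */
#define MODEL_REQ_BLOCK_PC 0x2  /* hypothetical "packet command" bit */

/* Only the two fields the quoted checks look at. */
struct model_request {
        unsigned long flags;
        unsigned short nr_phys_segments;
};

enum model_verdict {
        MODEL_OK_FS,                    /* normal read/write request */
        MODEL_OK_PC,                    /* pass-through SCSI command */
        MODEL_BUG_TOO_MANY_SEGMENTS,    /* corresponds to cciss.c:2999 */
        MODEL_BUG_BAD_TYPE              /* corresponds to cciss.c:3097 */
};

/* Classify a request the way the two quoted code fragments would. */
static enum model_verdict classify_request(const struct model_request *rq)
{
        assert(rq != NULL);

        if (rq->nr_phys_segments > MODEL_MAXSGENTRIES)
                return MODEL_BUG_TOO_MANY_SEGMENTS;
        if (rq->flags & MODEL_REQ_CMD)
                return MODEL_OK_FS;
        if (rq->flags & MODEL_REQ_BLOCK_PC)
                return MODEL_OK_PC;
        return MODEL_BUG_BAD_TYPE;
}
```

In this model a request whose fields were never set up properly can fail either check, depending on what happens to be in memory, which would be consistent with seeing two different BUG() locations.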

Comment 26 Doug Chapman 2008-05-15 18:23:02 UTC

Tomas,

Do you want to grab the system I have the kernel dump on?  Is the dump even
useful?  The system is just sitting idle and if it is not going to be used for
this issue I could make other use of the system.

- Doug

Comment 27 Tomas Henzl 2008-05-16 12:09:14 UTC

(In reply to comment #26)
Doug,
comment #25 is useful; at least the second part is new code in RHEL4.7.
> Do you want to grab the system I have the kernel dump on?  Is the dump even
> useful?  The system is just sitting idle and if it is not going to be used for
> this issue I could make other use of the system.
To find anything more valuable there, I'd need access not only to the
system, but also to the system from which the affected kernel is installed, so I
could change the cciss driver and see what happens then.
Otherwise I could only repeat your test and see the failing system, which is of
no use.
Tomas

Comment 28 Tomas Henzl 2008-05-16 15:14:49 UTC

Doug,
we could start by removing all the patches, that is, use the code from RHEL4.6,
in order to see which patch is causing the problem.
The patches listed below are what RHEL4.7 adds on top of RHEL4.6.
linux-2.6.9-cciss-add-init-of-drv-cylinders-back-to-cciss_geom.patch
linux-2.6.9-cciss-Modify-proc-driver-cciss-entries-to-avoid-s.patch
linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch
linux-2.6.9-cciss-Change-version-number-to-3.6.20-RH1.patch
linux-2.6.9-cciss-Remove-read_ahead-and-use-block-layer-default.patch
linux-2.6.9-cciss-Copyright-information-updated-as-per-HP-Legal.patch
linux-2.6.9-cciss-Support-new-SAS-SATA-controllers.patch
The next step would be, if we find the right patch, trying to localize what is
wrong with it.

Comment 29 Doug Chapman 2008-05-16 18:47:52 UTC

The panic appears to have been introduced by:

linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch

which is a very big patch, will try to narrow down a bit if possible.

Comment 30 Doug Chapman 2008-05-16 19:48:17 UTC

Looks like it is the handling of one of these ioctls, which was added by this
patch.  If I ifdef out this section of code it works OK:

        /* scsi_cmd_ioctl handles these, below, though some are not */
        /* very meaningful for cciss.  SG_IO is the main one people want. */

        case SG_GET_VERSION_NUM:
        case SG_SET_TIMEOUT:
        case SG_GET_TIMEOUT:
        case SG_GET_RESERVED_SIZE:
        case SG_SET_RESERVED_SIZE:
        case SG_EMULATED_HOST:
        case SG_IO:
        case SCSI_IOCTL_SEND_COMMAND:
                return scsi_cmd_ioctl(filep, disk, cmd, argp);


I don't know which of the above triggers the panic yet but this narrows it down.

Comment 31 Doug Chapman 2008-05-16 20:03:39 UTC

Created attachment 305740 [details]
simple reproducer

I can now reproduce this outside of anaconda.  The attached program panics my
rx6600 in the same way as anaconda.

I need to run the test program twice to hit the panic.  First time OK, second
time panic.  So, perhaps we are not cleaning up something from the first time
SCSI_IOCTL_SEND_COMMAND is called.

Comment 32 Doug Chapman 2008-05-16 20:09:22 UTC

(In reply to comment #31)
> 
> I need to run the test program twice to hit the panic.  First time OK, second
> time panic.  So, perhaps we are not cleaning up something from the first time
> SCSI_IOCTL_SEND_COMMAND is called.
> 

Actually it just appears to be somewhat random, sometimes I need to run it
several times before I see the panic, but either way it does reproduce it
eventually.
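
The attachment itself is not reproduced in this report. As a rough idea of what such a test might look like, here is a minimal sketch that issues one INQUIRY through the legacy SCSI_IOCTL_SEND_COMMAND interface. It is an assumption, not the attached program: the struct name, the helper names and the reply length are made up for illustration. Do not run it against a cciss device on an affected kernel unless you expect a panic.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#ifndef SCSI_IOCTL_SEND_COMMAND
#define SCSI_IOCTL_SEND_COMMAND 1       /* value used by <scsi/scsi_ioctl.h> */
#endif

#define INQUIRY_OPCODE    0x12
#define INQUIRY_REPLY_LEN 96            /* arbitrary reply size for the sketch */

/* Buffer layout of the legacy ioctl: two lengths, then the CDB. The reply
 * overwrites data[] on return, so data[] must be at least outlen bytes. */
struct send_command_buf {
        unsigned int inlen;             /* bytes sent to the device */
        unsigned int outlen;            /* bytes expected back */
        unsigned char data[INQUIRY_REPLY_LEN];
};

/* Fill the buffer with a 6-byte INQUIRY CDB. */
static void build_inquiry(struct send_command_buf *b)
{
        assert(b != NULL);
        memset(b, 0, sizeof(*b));
        b->inlen = 0;
        b->outlen = INQUIRY_REPLY_LEN;
        b->data[0] = INQUIRY_OPCODE;
        b->data[4] = INQUIRY_REPLY_LEN; /* allocation length */
}

/* Open dev_path and issue the ioctl once. Returns the ioctl result,
 * or -1 if the device cannot be opened. */
static int send_inquiry(const char *dev_path)
{
        struct send_command_buf b;
        int fd, rc;

        fd = open(dev_path, O_RDONLY | O_NONBLOCK);
        if (fd < 0)
                return -1;
        build_inquiry(&b);
        rc = ioctl(fd, SCSI_IOCTL_SEND_COMMAND, &b);
        close(fd);
        return rc;
}
```

A caller would pass a cciss block device node, for example send_inquiry("/dev/cciss/c0d0") (the path is only an example), typically as root and possibly several times in a row, since the panic does not show up on every call.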

Comment 33 Tomas Henzl 2008-05-19 14:49:45 UTC

I tried to reproduce this on an i386 box. Initially I wasn't successful, but
after I ran the ioctl in an endless loop and started writing a big file in a
second console, I got this:
 [<c01192a7>] smp_local_timer_interrupt+0x1c/0x5f                  
 [<c011934d>] smp_apic_timer_interrupt+0x63/0x8f                   
 [<c0327a26>] apic_timer_interrupt+0x1a/0x20                       
 [<c0125b4b>] profile_tick+0x12/0x36                               
 [<c01192a7>] smp_local_timer_interrupt+0x1c/0x5f                  
 [<c0106bd4>] do_invalid_op+0x0/0xf2                               
 [<c0327ac3>] error_code+0x2f/0x38                                 
 [<f8865636>] do_cciss_request+0x44/0x484 [cciss]                  
 [<c0154326>] kmem_freepages+0x6d/0x87                             
 [<c01543bf>] slab_destroy+0x59/0x72                               
 [<c016f33e>] wake_up_buffer+0x9/0x29                              
 [<c014f57b>] mempool_free+0x11f/0x122                             
 [<c025dee5>] __blk_put_request+0x63/0x73                          
 [<c025f1a3>] end_that_request_last+0x28d/0x2ac                    
 [<f886624d>] complete_command+0x3cb/0x3e2 [cciss]                 
 [<c025cdd5>] blk_start_queue+0x21/0x3e                            
 [<f8865d60>] do_cciss_intr+0x2ea/0x40c [cciss]                    
 [<c0107f08>] handle_IRQ_event+0x25/0x4f                           
 [<c01088d6>] do_IRQ+0x18a/0x2bf                                   
 =======================                                           
 [<c0327a04>] common_interrupt+0x18/0x20                           
 [<c010403b>] default_idle+0x23/0x29                               
 [<c010408f>] cpu_idle+0x1f/0x34                                   
 [<c03e3700>] start_kernel+0x214/0x216

Comment 34 Mike Miller (OS Dev) 2008-05-21 15:26:15 UTC

Has this been tested in rhel5.x? Any results?
Can we get the sources for this install kernel so we can try to debug it here?
We would also need the config file.

Comment 35 Stephen Cameron 2008-05-21 15:52:53 UTC

Hmm, your test program is using SCSI_IOCTL_SEND_COMMAND (obsolete?  it is in
some kernels anyway).

I'm pretty sure I hit some problems with SCSI_IOCTL_SEND_COMMAND and RHEL4
before, since I wrote this comment in cciss.c:

        case SCSI_IOCTL_SEND_COMMAND:   /* scsi_cmd_ioctl is broken for this on RHEL4. */
                                        /* It creates a request with no bio. */
                                        /* Consequently, we don't support it. */

        /* scsi_cmd_ioctl would normally handle these, below, but */
        /* they aren't a good fit for cciss, as CD-ROMs are */
        /* not supported, and we don't have any bus/target/lun */
        /* which we present to the kernel. */

        case CDROM_SEND_PACKET:
        case CDROMCLOSETRAY:
        case CDROMEJECT:
        case SCSI_IOCTL_GET_IDLUN:
        case SCSI_IOCTL_GET_BUS_NUMBER:
        default:
                return -ENOTTY;
        }



Now, it doesn't seem to be returning "not a typewriter", so I'm guessing the
SCSI mid layer has changed (or has been this way for a long time?) such that
SCSI_IOCTL_SEND_COMMAND tries to cram things down the regular I/O path, like
SG_IO does, using scsi_cmd_ioctl(), but (according to my comment, which I have
only a scant recollection of writing) creates a request with no bio.


Hmmm, I wrote down this in my notes:

"Mon Apr 23 09:23:15 CDT 2007 Looks like the SCSI_IOCTL_SEND_COMMAND code path
in RHEL4 for block devices in scsi_ioctl.c and ll_rw_blk.c is broken, in that it
never maps the buffer to a bio. (In newer kernels there's a call to
blk_rq_map_kern which maps the kernel buffer that gets kmalloc'ed to a bio. In
RHEL4, this function doesn't exist, though there's a blk_rq_map_user, to map a
user buffer. Nothing like either of those gets called.) So, the upshot is, on
RHEL4, SCSI_IOCTL_SEND_COMMAND is kind of screwed. That's probably ok, since
SG_IO seems to work.

So, for 2.6 drivers series, just going to not support SCSI_IOCTL_SEND_COMMAND.

Committed SG_IO changes to cciss_2_6_18 branch. "
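
For comparison, here is a minimal sketch of how userspace would set up the same kind of command through SG_IO, the interface the notes above describe as working. The struct inquiry_io wrapper, the buffer sizes and the timeout are assumptions chosen for illustration; only the sg_io_hdr fields and constants come from <scsi/sg.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <scsi/sg.h>

#define INQ_CDB_LEN   6
#define INQ_REPLY_LEN 96        /* arbitrary reply size for the sketch */
#define INQ_SENSE_LEN 32        /* arbitrary sense buffer size */

/* Hypothetical wrapper keeping the header and its buffers together. */
struct inquiry_io {
        unsigned char cdb[INQ_CDB_LEN];
        unsigned char reply[INQ_REPLY_LEN];
        unsigned char sense[INQ_SENSE_LEN];
        struct sg_io_hdr hdr;
};

/* Fill in an SG_IO header describing a 6-byte INQUIRY. */
static void prepare_inquiry(struct inquiry_io *io)
{
        assert(io != NULL);
        memset(io, 0, sizeof(*io));

        io->cdb[0] = 0x12;                      /* INQUIRY opcode */
        io->cdb[4] = INQ_REPLY_LEN;             /* allocation length */

        io->hdr.interface_id = 'S';             /* required by SG_IO */
        io->hdr.dxfer_direction = SG_DXFER_FROM_DEV;
        io->hdr.cmd_len = INQ_CDB_LEN;
        io->hdr.mx_sb_len = INQ_SENSE_LEN;
        io->hdr.dxfer_len = INQ_REPLY_LEN;
        io->hdr.dxferp = io->reply;             /* user buffer for the data */
        io->hdr.cmdp = io->cdb;
        io->hdr.sbp = io->sense;
        io->hdr.timeout = 5000;                 /* milliseconds */
}
```

The prepared header would then be passed as ioctl(fd, SG_IO, &io.hdr) on an open device. Unlike the legacy interface, the data buffer is described explicitly by the caller.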

Comment 36 Doug Chapman 2008-05-21 16:26:32 UTC

(In reply to comment #35)
> 
> 
> Now, it doesn't seem to be returning "not a typewriter", so I'm guessing the
> SCSI mid layer is changed (or has been for a long time?) 
> 

This was added just recently in the patch:
linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch

        case SCSI_IOCTL_SEND_COMMAND:
                return scsi_cmd_ioctl(filep, disk, cmd, argp);

So no, it does not return -ENOTTY; sounds like returning it is the fix.
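
A minimal userspace model of the dispatch being discussed, assuming the fix simply moves SCSI_IOCTL_SEND_COMMAND from the forwarded group to the rejected one. model_cciss_ioctl is a hypothetical function for illustration; it is not the driver's cciss_ioctl and it does not reproduce the attached patch.

```c
#include <assert.h>
#include <errno.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/sg.h>

/* Returns 0 when the command would be handed to the generic SCSI ioctl
 * handler, or -ENOTTY when the driver would refuse it. */
static int model_cciss_ioctl(unsigned int cmd)
{
        switch (cmd) {
        /* Forwarded group, as in the list quoted earlier in this report. */
        case SG_GET_VERSION_NUM:
        case SG_SET_TIMEOUT:
        case SG_GET_TIMEOUT:
        case SG_GET_RESERVED_SIZE:
        case SG_SET_RESERVED_SIZE:
        case SG_EMULATED_HOST:
        case SG_IO:
                return 0;

        /* Assumed fix: no longer forwarded, so it falls through to the
         * same "not supported" answer as any unknown command. */
        case SCSI_IOCTL_SEND_COMMAND:
        default:
                return -ENOTTY;
        }
}
```

With this arrangement a caller of SCSI_IOCTL_SEND_COMMAND sees an ENOTTY error instead of reaching the request path that trips BUG().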

Comment 37 Doug Chapman 2008-05-21 17:25:01 UTC

Created attachment 306285 [details]
disable SCSI_IOCTL_SEND_COMMAND for cciss

Steve, Mike,

It sounds like this is the fix that should be in RHEL4?

Comment 38 Stephen Cameron 2008-05-21 18:03:50 UTC

I think so, that is what we have in the drivers we distribute for RHEL4.  Unless
you wanted to try to fix the problem in ll_rw_blk.c and scsi_ioctl.c -- if that
is what I think it is, my notes on that are more than a year old -- and I'm not
sure how to fix it anyway, but somebody must know, as it's fixed in later kernels.

-- steve

Comment 39 Doug Chapman 2008-05-21 18:18:12 UTC

It seems the smartest thing to do here is to disable SCSI_IOCTL_SEND_COMMAND as
in the patch I attached because:

1: we never had it before for cciss (was there a reason it was added?)
2: it is disabled in the HP owned driver
3: it is also disabled in RHEL5

Comment 40 Stephen Cameron 2008-05-21 18:45:42 UTC

As for #2 and #3 above, SCSI_IOCTL_SEND_COMMAND is DISabled in our "2.6.x"
series of cciss drivers which is for kernels < about 2.6.15 (e.g. RHEL4), and it
is ENabled in our "3.6.x" series of drivers which is for kernels > about 2.6.15
(e.g. RHEL5).

For our (HP's) RHEL5 driver, we currently have SCSI_IOCTL_SEND_COMMAND enabled,
and I am pretty sure it even worked last time I tried it.

-- steve

Comment 41 Tom Coughlan 2008-05-22 14:48:59 UTC

The build for RHEL 4.7 beta snapshot 1 is planned for next Tues. 5/27. We will
need a patch tested and ready for review before then. With the Monday holiday,
that means today or tomorrow...

Comment 42 Doug Chapman 2008-05-22 15:06:06 UTC

(In reply to comment #41)
> The build for RHEL 4.7 beta snapshot 1 is planned for next Tues. 5/27. We will
> need a patch tested and ready for review before then. With the Monday holiday,
> that means today or tomorrow...

I agree completely, I would like to see this submitted ASAP.  If we have
agreement here I would be happy to submit the patch I have attached here to
rhkernel list today.  I have tested it and it is minimally intrusive; all it
does is disable an ioctl that probably should not have been enabled in the first
place.

Comment 43 Mike Miller (OS Dev) 2008-05-22 15:16:33 UTC

We (HP) agree that attachment 306285 [details] should be added.

Comment 44 Doug Chapman 2008-05-22 17:33:54 UTC

Patch posted to rhkernel-list for review for RHEL4.7.

Comment 47 Doug Chapman 2008-05-27 13:14:00 UTC

*** Bug 448100 has been marked as a duplicate of this bug. ***

Comment 48 Vivek Goyal 2008-05-29 20:51:26 UTC

Committed in 71.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 51 errata-xmlrpc 2008-07-24 19:29:12 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html