Bug 443053
Summary: | cciss driver crash
---|---
Product: | Red Hat Enterprise Linux 4
Reporter: | Vivek Goyal <vgoyal>
Component: | kernel
Assignee: | Tomas Henzl <thenzl>
Status: | CLOSED ERRATA
QA Contact: | Martin Jenner <mjenner>
Severity: | high
Priority: | high
Version: | 4.7
CC: | atodorov, coldwell, coughlan, dchapman, jburke, jgiles, luyu, mike.miller, steve.cameron, tcamuso
Target Milestone: | rc
Hardware: | All
OS: | Linux
Fixed In Version: | RHSA-2008-0665
Doc Type: | Bug Fix
Last Closed: | 2008-07-24 19:29:12 UTC
Description
Vivek Goyal
2008-04-18 13:00:47 UTC
The log shows:
> anaconda[710]: bugcheck! 0 [1]
> ....
> Pid: 710, CPU 3, comm: anaconda
> psr : 0000101008022018 ifs : 8000000000000a9d ip : [<a0000002002e8a90>] Not tainted
> ip is at do_cciss_request+0x11d0/0x1320 [cciss]

The corresponding kernel code is:
> case 0: /* unknown error (used by GCC for __builtin_abort()) */
>     if (notify_die(DIE_BREAK, "break 0", regs, break_num, TRAP_BRKPT, SIGTRAP)
>         == NOTIFY_STOP)
>         return;
>     if (die_if_kernel("bugcheck!", regs, break_num))
>         return;
>     sig = SIGILL; code = ILL_ILLOPC;
>     break;

So please verify if it is caused by __builtin_abort in this driver? Then probably we need to figure out why the driver could get to __builtin_abort?

I was able to boot the kernel 2.6.9-68.32 on the affected machine manually. Couldn't it be the case that the problem is somewhere in anaconda?

[root@hp-bl860c-01 ~]# uname -a
Linux hp-bl860c-01.rhts.boston.redhat.com 2.6.9-68.32.EL #1 SMP Mon Apr 7 15:34:52 EDT 2008 ia64 ia64 ia64 GNU/Linux

(In reply to comment #1)
> So please verify if it is caused by __builtin_abort in this driver?
> Then probably we need to figure out why the driver could get to __builtin_abort?

My hunch is that we are hitting a BUG() in the cciss driver, which in turn is calling ia64_abort(). Somehow the BUG() message is not visible in the logs, and that could be because of the log level.

I did a disassembly of the cciss code and offset 11d0 seems to map to line 3097 of cciss.c:

drivers/block/cciss.c:3097
10026: 46 00 01 00 00 00 (p25) break.i 0x1004
1002c: 00 00 00 00 break.i 0x0
10030: 00 00 00 00 00 00 [MII] break.m 0x0
10036: 00 00 00 00 47 5f addp4 r0=-8192,r0
1003c: 76 00 03 51 (p62) tnat.z.or p50,p0=r96

And line 3097 is BUG(). So to me it looks like the cciss driver thinks that it got an invalid request.

(In reply to comment #2)
> I was able to boot the kernel 2.6.9-68.32 on the affected machine manually,
> couldn't it be the case where the problem is somewhere in the anaconda ?
>
> [root@hp-bl860c-01 ~]# uname -a
> Linux hp-bl860c-01.rhts.boston.redhat.com 2.6.9-68.32.EL #1 SMP Mon Apr 7
> 15:34:52 EDT 2008 ia64 ia64 ia64 GNU/Linux
>

I can also boot 68.32 successfully. Looking at the backtrace, it looks like some kind of ioctl is being invoked on a device managed by the cciss driver and it crashes. It looks like anaconda calls that ioctl, and in a normal boot we don't call that ioctl, hence we are fine. I will see if I can reproduce the issue...

I just tried to install this machine with 68.32 again through rhts and it crashes again. So this issue is reproducible on this machine, and anaconda certainly does some operation (most likely invoking an ioctl) that crashes the kernel...

Adding Doug. Because this is an HP box, Doug probably knows something about the crash.

Based on comment #3, this bug sounds more like a driver-specific issue than an IA64 arch problem, although the problem is observed on an IPF box. So it should affect all platforms with devices managed by the cciss driver. Moving it to the generic category.

Since (I assume) this has not been seen on any other systems I will check the cciss firmware on the box. Perhaps it is out of date.

Firmware update didn't make any difference. We still hit the panic on the ioctl. I will see if I can reproduce this on any other hardware.

Here is the stack trace of the panic:
anaconda[710]: bugcheck!
0 [1] Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 qla2400 cciss mptscsih mptsas mptspi mptscsi mptbase qla2xxx scsi_transport_fc tg3 ohci_hcd ehci_hcd sr_mod sd_mod scsi_mod lapic_status loop nfs nfs_acl lockd sunrpc vfat fat cramfs Pid: 710, CPU 3, comm: anaconda psr : 0000101008022018 ifs : 8000000000000a9d ip : [<a0000002002e8a90>] Not tainted ip is at do_cciss_request+0x11d0/0x1320 [cciss] unat: 0000000000000000 pfs : 0000000000000a9d rsc : 0000000000000003 rnat: 3841a5799c0eb221 bsps: ea596e21aabae15c pr : 00400280009599a9 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002002e8a90 b6 : a00000010006eb40 b7 : a00000010038bea0 f6 : 1003e0000000000001200 f7 : 1003e8080808080808081 f8 : 1003e00000000000023dc f9 : 1003e000000000e580000 f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf r1 : a0000001009e0e30 r2 : 0000000000000001 r3 : 0000000000100000 r8 : 000000000000002a r9 : 0000000000000001 r10 : e00000000103540c r11 : 0000000000000003 r12 : e00000000bbcf740 r13 : e00000000bbc8000 r14 : 0000000000004000 r15 : e00000000bbc8de0 r16 : e000000001034af0 r17 : 0000000000000014 r18 : e0000100fd82802c r19 : e000000001035400 r20 : e000000001034ac0 r21 : 0000000000000002 r22 : 0000000000000001 r23 : e0000100fd828040 r24 : e000000001035b60 r25 : e000000001035b58 r26 : e000000001035b38 r27 : 0000000000000074 r28 : 0000000000000074 r29 : 0000000000000065 r30 : e0000100fd828050 r31 : 00000000356f424c Call Trace: [<a000000100016e40>] show_stack+0x80/0xa0 sp=e00000000bbcf2b0 bsp=e00000000bbc9430 [<a000000100017750>] show_regs+0x890/0x8c0 sp=e00000000bbcf480 bsp=e00000000bbc93e0 [<a00000010003e9b0>] die+0x150/0x240 sp=e00000000bbcf4a0 bsp=e00000000bbc93a0 [<a00000010003eae0>] die_if_kernel+0x40/0x60 sp=e00000000bbcf4a0 bsp=e00000000bbc9370 [<a00000010003ec80>] ia64_bad_break+0x180/0x600 sp=e00000000bbcf4a0 bsp=e00000000bbc9348 
[<a00000010000f600>] ia64_leave_kernel+0x0/0x260 sp=e00000000bbcf570 bsp=e00000000bbc9348
[<a0000002002e8a90>] do_cciss_request+0x11d0/0x1320 [cciss] sp=e00000000bbcf740 bsp=e00000000bbc9260
[<a0000001003722f0>] __generic_unplug_device+0xd0/0x100 sp=e00000000bbcfb30 bsp=e00000000bbc9240
[<a000000100372350>] generic_unplug_device+0x30/0x60 sp=e00000000bbcfb30 bsp=e00000000bbc9218
[<a000000100373a40>] blk_execute_rq+0x1a0/0x220 sp=e00000000bbcfb30 bsp=e00000000bbc91d8
[<a000000100381100>] scsi_cmd_ioctl+0xfe0/0x1520 sp=e00000000bbcfbb0 bsp=e00000000bbc9150
[<a0000002002effb0>] cciss_ioctl+0x1990/0x3c60 [cciss] sp=e00000000bbcfd10 bsp=e00000000bbc9080
[<a00000010037ab00>] blkdev_ioctl+0x220/0xc00 sp=e00000000bbcfe20 bsp=e00000000bbc9038
[<a0000001001409c0>] block_ioctl+0x40/0x60 sp=e00000000bbcfe20 bsp=e00000000bbc9000
[<a000000100159a20>] sys_ioctl+0x6a0/0xb20 sp=e00000000bbcfe20 bsp=e00000000bbc8f68
[<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20 sp=e00000000bbcfe30 bsp=e00000000bbc8f68

The stack trace looks like a race between the ioctl path and the normal block I/O path.
Chip

I reproduced this on hp-sapphire-02.rhts, which has the same model of cciss card (P600), so this does not appear to be specific to that system.

We need to see if this crash happens on x86 also. I am adding Tony Camuso, who handles our ProLiant systems.

I tried last night running anaconda in userspace in --test mode but was unable to reproduce this. However, I had only updated anaconda and the kernel; perhaps I should try a full yum upgrade and then try again.

Also reproduced on an hp rx4640 with an older 5304-256 model smart array card, so this appears to be a general cciss driver bug, not specific to a particular smart array card.

Adding Mike Miller to the CC list.

(In reply to comment #11)
> The stack trace looks like a race between the ioctl path and the normal block
> I/O path.
>
> Chip
>

I tried booting with maxcpus=1 to see if that would avoid this (which it should if it were a race), however I still hit the panic.

I scheduled a reserve workflow on hp-ml370g5-01 (the only free HP x86_64 box with cciss I could find), however RHTS still has not scheduled the job. Once that runs, hopefully it will tell us if this is ia64 specific.
- Doug

I successfully installed build RHEL4.7-20080424-i386 with kernel '2.6.9-69.ELsmp #1 SMP Tue Apr 15 18:33:35 EDT 2008 i686 i686 i386 GNU/Linux' on a box with an HP Smart Array 5i Controller.
This could mean that it is architecture dependent (ia64).

(In reply to comment #15)
> I successfully installed build RHEL4.7-20080424-i386 with kernel '2.6.9-69.ELsmp
> #1 SMP Tue Apr 15 18:33:35 EDT 2008 i686 i686 i386 GNU/Linux' on a box with HP
> Smart Array 5i Controller.
> This could mean that it is architecture dependent (ia64).

Now this is fun: I just tried this tree on one of my ia64 systems that hit the crash last week and it works now too. So either this magically got fixed, or it is some race condition we just don't hit all the time. It would be best if we could try on the system that it was originally reported on, however it was moved to the new lab and isn't back up yet.

I just hit this again on hp-sapphire-02 when installing RHEL4.7/kernel-2.6.9-69.

I have tried multiple things to try to reproduce this at runtime without any success. I am open to ideas, as trying to debug this under anaconda at install time seems pretty much impossible.

anaconda[1072]: bugcheck!
0 [1] Modules linked in: dm_snapshot dm_mirror dm_zero dm_mod ext3 jbd msdos raid6 raid5 xor raid1 raid0 cciss e1000 usb_storage ohci_hcd ehci_hcd sr_mod sd_mod scsi_mod lapic_status loop nfs nfs_acl lockd sunrpc vfat fat cramfs Pid: 1072, CPU 0, comm: anaconda psr : 0000101008022038 ifs : 8000000000000a9d ip : [<a0000002002c0a90>] Not tainted ip is at do_cciss_request+0x11d0/0x1320 [cciss] unat: 0000000000000000 pfs : 0000000000000a9d rsc : 0000000000000003 rnat: 2fe3e13bb0aec078 bsps: b75fee1a3745b502 pr : 004002800095a9a9 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000002002c0a90 b6 : a000000100016020 b7 : a00000010038bea0 f6 : 1003e0000000000001200 f7 : 1003e8080808080808081 f8 : 1003e00000000000023dc f9 : 1003e000000000e580000 f10 : 1003e00000000356f424c f11 : 1003e44b831eee7285baf r1 : a0000001009e0e30 r2 : 0000000000006000 r3 : 0000000000006000 r8 : 000000000000002a r9 : 00000000000000fd r10 : a0000001007f3880 r11 : 0000000000000600 r12 : e000000102327740 r13 : e000000102320000 r14 : 0000000000004000 r15 : c0000000fee00000 r16 : 00000000000000fd r17 : 0000000000000006 r18 : a000000100a087c0 r19 : a000000100a087c0 r20 : 0000000000000004 r21 : 0000000000000000 r22 : 0000000000000000 r23 : 0000000000000000 r24 : 0000000000000000 r25 : 0000000000000004 r26 : e00000003dd20dd0 r27 : 0000000000000000 r28 : e000000102320dd4 r29 : e00000003dd20dd4 r30 : e000000100838050 r31 : 00000000356f424c Call Trace: [<a000000100016e40>] show_stack+0x80/0xa0 sp=e0000001023272b0 bsp=e000000102321430 [<a000000100017750>] show_regs+0x890/0x8c0 sp=e000000102327480 bsp=e0000001023213e0 [<a00000010003e9b0>] die+0x150/0x240 sp=e0000001023274a0 bsp=e0000001023213a0 [<a00000010003eae0>] die_if_kernel+0x40/0x60 sp=e0000001023274a0 bsp=e000000102321370 [<a00000010003ec80>] ia64_bad_break+0x180/0x600 sp=e0000001023274a0 bsp=e000000102321348 [<a00000010000f600>] ia64_leave_kernel+0x0/0x260 sp=e000000102327570 
bsp=e000000102321348
[<a0000002002c0a90>] do_cciss_request+0x11d0/0x1320 [cciss] sp=e000000102327740 bsp=e000000102321260
[<a0000001003722f0>] __generic_unplug_device+0xd0/0x100 sp=e000000102327b30 bsp=e000000102321240
[<a000000100372350>] generic_unplug_device+0x30/0x60 sp=e000000102327b30 bsp=e000000102321218
[<a000000100373a40>] blk_execute_rq+0x1a0/0x220 sp=e000000102327b30 bsp=e0000001023211d8
[<a000000100381100>] scsi_cmd_ioctl+0xfe0/0x1520 sp=e000000102327bb0 bsp=e000000102321150
[<a0000002002c7fb0>] cciss_ioctl+0x1990/0x3c60 [cciss] sp=e000000102327d10 bsp=e000000102321080
[<a00000010037ab00>] blkdev_ioctl+0x220/0xc00 sp=e000000102327e20 bsp=e000000102321038
[<a0000001001409c0>] block_ioctl+0x40/0x60 sp=e000000102327e20 bsp=e000000102321000
[<a000000100159a20>] sys_ioctl+0x6a0/0xb20 sp=e000000102327e20 bsp=e000000102320f68
[<a00000010000f4a0>] ia64_ret_from_syscall+0x0/0x20 sp=e000000102327e30 bsp=e000000102320f68

Since this is a race between the ioctl path and the block I/O path, we probably need to identify what kind of ioctl is involved here. Maybe we need to write a test case that exercises the ioctl path while stressing the block I/O path, to help debug the problem without anaconda.

There is a function defined in cciss_scsi.c called print_cmd. Right now it's #if 0'd out. Move the #if 0, prototype it in cciss.c, and then call it before BUG. That will print out the CDB so we can see what's being called.

(In reply to comment #21)
> There is a function defined in cciss_scsi.c called print_cmd. Right now it's #if
> 0 out. Move the #if 0, prototype it in cciss.c, and then call it before BUG.
> That will print out the CDB so we can see what's being called.

Building a kernel with this now. Thanks for the info. Hopefully once we have this info we can reproduce this outside of anaconda.

After finally figuring out how to make an initrd.img that anaconda was happy with I thought I was making some progress. However now I am stuck.
No matter what I seem to do I cannot get any of my debug output to show up on the console. This includes printk and the print_cmd stuff Mike pointed to. My guess is there is something funny that anaconda does (although I don't know how this is possible). I know I am booting the right version of my module because if I remove the BUG() statement we get farther and panic at a later point, so this does confirm that the panic is from cciss.c:3097, but I cannot get any debug output. Does anybody have any ideas on what is happening to the output? I will look at this with a fresh mind tomorrow.

Some progress... I never was able to get any debug output to work, and I am still very perplexed about that. But I was able to get netdump working under anaconda and get a dump of this crash. Chip mentioned he may need this system (hp-sapphire-02) in a day or two for another cciss issue, but for now I can make it available to debug the dump. Contact me for login info.

This may help with determining the race condition, and might also explain why printk isn't working here. In crash, if I look at the dmesg buffer via "log" I see:

kernel BUG at drivers/block/cciss.c:2999!

which is this bit of code:

2998 if (creq->nr_phys_segments > MAXSGENTRIES)
2999     BUG();

but this is _not_ the ip given by the panic that follows. According to the address of the ip, the crash happened at:

drivers/block/cciss.c: 3097
3094 } else {
3095     printk(KERN_WARNING "cciss%d: bad request type %ld\n",
3096         h->ctlr, creq->flags);
3097     BUG();
3098 }

So it appears that we are hitting both of these at the same time. Since one was already in the process of trying to do output to the console, perhaps that is why the "bad request" printk never happens. Also, I am guessing the fact that we hit both of these might be our race condition?

Thomas,
Do you want to grab the system I have the kernel dump on? Is the dump even useful?
The system is just sitting idle and if it is not going to be used for this issue I could make other use of the system.
- Doug

(In reply to comment #26)
Doug, comment #25 is useful; at least the second part is new code in RHEL4.7.
> Do you want to grab the system I have the kernel dump on? Is the dump even
> useful? The system is just sitting idle and if it is not going to be used for
> this issue I could make other use of the system.

To be able to find something more valuable there, I'd need access not only to the system, but also to the system from which the affected kernel is installed, so I could change the cciss driver and see what happens then. Otherwise I could only repeat your test and see the failing system, which is of no use.
Tomas

Doug, we could start by removing all patches, I mean use the code from RHEL4.6, in order to see which patch is causing the problems. The patches listed below form RHEL4.7 on top of RHEL4.6.

linux-2.6.9-cciss-add-init-of-drv-cylinders-back-to-cciss_geom.patch
linux-2.6.9-cciss-Modify-proc-driver-cciss-entries-to-avoid-s.patch
linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch
linux-2.6.9-cciss-Change-version-number-to-3.6.20-RH1.patch
linux-2.6.9-cciss-Remove-read_ahead-and-use-block-layer-default.patch
linux-2.6.9-cciss-Copyright-information-updated-as-per-HP-Legal.patch
linux-2.6.9-cciss-Support-new-SAS-SATA-controllers.patch

The next step would be, once we find the right patch, to try to localize what is wrong with it.

The panic appears to have been introduced by:

linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch

which is a very big patch. I will try to narrow it down a bit if possible.

It looks like it is the handling of one of these ioctls, which was added by this patch. If I ifdef out this section of code it works OK:

    /* scsi_cmd_ioctl handles these, below, though some are not */
    /* very meaningful for cciss. SG_IO is the main one people want. */
    case SG_GET_VERSION_NUM:
    case SG_SET_TIMEOUT:
    case SG_GET_TIMEOUT:
    case SG_GET_RESERVED_SIZE:
    case SG_SET_RESERVED_SIZE:
    case SG_EMULATED_HOST:
    case SG_IO:
    case SCSI_IOCTL_SEND_COMMAND:
        return scsi_cmd_ioctl(filep, disk, cmd, argp);

I don't know which of the above triggers the panic yet, but this narrows it down.

Created attachment 305740 [details]
simple reproducer
I can now reproduce this outside of anaconda. The attached program panics my
rx6600 in the same way as anaconda.
I need to run the test program twice to hit the panic. First time OK, second
time panic. So, perhaps we are not cleaning up something from the first time
SCSI_IOCTL_SEND_COMMAND is called.
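The attached reproducer is not included in this report. As a rough illustration only, a program exercising the same path might build the legacy SCSI_IOCTL_SEND_COMMAND buffer and issue the ioctl repeatedly, as sketched below. The INQUIRY command, the 96-byte reply size, and the names `send_cmd_buf`, `fill_inquiry`, and `hammer_send_command` are assumptions made for this sketch, not taken from the attachment.

```c
#include <assert.h>
#include <string.h>
#include <sys/ioctl.h>

/* Legacy ioctl number (1 on Linux); defined here so the sketch does not
 * depend on <scsi/scsi_ioctl.h> being installed. */
#ifndef SCSI_IOCTL_SEND_COMMAND
#define SCSI_IOCTL_SEND_COMMAND 1
#endif

#define INQUIRY_OPCODE 0x12
#define INQUIRY_LEN    96   /* assumed reply size for this sketch */

/* Buffer layout of the legacy interface: two length words, then the CDB.
 * The same data area receives the reply when the call returns. */
struct send_cmd_buf {
    unsigned int  inlen;              /* bytes of data sent to the device */
    unsigned int  outlen;             /* bytes of data expected back */
    unsigned char data[INQUIRY_LEN];  /* CDB on entry, reply on return */
};

/* Fill the buffer with a 6-byte INQUIRY CDB. */
static void fill_inquiry(struct send_cmd_buf *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
    b->inlen = 0;
    b->outlen = INQUIRY_LEN;
    b->data[0] = INQUIRY_OPCODE;
    b->data[4] = INQUIRY_LEN;         /* allocation length field */
}

/* Issue the ioctl 'count' times on 'fd'; returns how many calls failed. */
static int hammer_send_command(int fd, int count)
{
    struct send_cmd_buf b;
    int failures = 0;

    for (int i = 0; i < count; i++) {
        fill_inquiry(&b);
        if (ioctl(fd, SCSI_IOCTL_SEND_COMMAND, &b) < 0)
            failures++;
    }
    return failures;
}
```

A caller would open the cciss block device, pass the descriptor to `hammer_send_command`, and close it afterwards. On a kernel affected by this bug that may panic the machine, so it should only be tried on a test system.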
(In reply to comment #31)
>
> I need to run the test program twice to hit the panic. First time OK, second
> time panic. So, perhaps we are not cleaning up something from the first time
> SCSI_IOCTL_SEND_COMMAND is called.
>

Actually it just appears to be somewhat random. Sometimes I need to run it several times before I see the panic, but either way it does reproduce it eventually.

I tried to reproduce on an i386 box. Initially I wasn't successful, but after I ran the ioctl in an endless loop and added writing of a big file in a second console I got this:

[<c01192a7>] smp_local_timer_interrupt+0x1c/0x5f
[<c011934d>] smp_apic_timer_interrupt+0x63/0x8f
[<c0327a26>] apic_timer_interrupt+0x1a/0x20
[<c0125b4b>] profile_tick+0x12/0x36
[<c01192a7>] smp_local_timer_interrupt+0x1c/0x5f
[<c0106bd4>] do_invalid_op+0x0/0xf2
[<c0327ac3>] error_code+0x2f/0x38
[<f8865636>] do_cciss_request+0x44/0x484 [cciss]
[<c0154326>] kmem_freepages+0x6d/0x87
[<c01543bf>] slab_destroy+0x59/0x72
[<c016f33e>] wake_up_buffer+0x9/0x29
[<c014f57b>] mempool_free+0x11f/0x122
[<c025dee5>] __blk_put_request+0x63/0x73
[<c025f1a3>] end_that_request_last+0x28d/0x2ac
[<f886624d>] complete_command+0x3cb/0x3e2 [cciss]
[<c025cdd5>] blk_start_queue+0x21/0x3e
[<f8865d60>] do_cciss_intr+0x2ea/0x40c [cciss]
[<c0107f08>] handle_IRQ_event+0x25/0x4f
[<c01088d6>] do_IRQ+0x18a/0x2bf
=======================
[<c0327a04>] common_interrupt+0x18/0x20
[<c010403b>] default_idle+0x23/0x29
[<c010408f>] cpu_idle+0x1f/0x34
[<c03e3700>] start_kernel+0x214/0x216

Has this been tested in rhel5.x? Any results? Can we get the sources for this install kernel so we can try to debug it here? We would also need the config file.

Hmm, your test program is using SCSI_IOCTL_SEND_COMMAND (obsolete? it is in some kernels anyway).

I'm pretty sure I hit some problems with SCSI_IOCTL_SEND_COMMAND and RHEL4 before, since I wrote this comment in cciss.c:

    case SCSI_IOCTL_SEND_COMMAND:
        /* scsi_cmd_ioctl is broken for this on RHEL4. */
        /* It creates a request with no bio. */
        /* Consequently, we don't support it. */
    /* scsi_cmd_ioctl would normally handle these, below, but */
    /* they aren't a good fit for cciss, as CD-ROMs are */
    /* not supported, and we don't have any bus/target/lun */
    /* which we present to the kernel. */
    case CDROM_SEND_PACKET:
    case CDROMCLOSETRAY:
    case CDROMEJECT:
    case SCSI_IOCTL_GET_IDLUN:
    case SCSI_IOCTL_GET_BUS_NUMBER:
    default:
        return -ENOTTY;
    }

Now, it doesn't seem to be returning "not a typewriter", so I'm guessing the SCSI mid layer is changed (or has been for a long time?) such that SCSI_IOCTL_SEND_COMMAND tries to cram things down the regular io path, like SG_IO does (using scsi_cmd_ioctl()), but (according to my comment, which I have only a scant recollection of writing) creates a request with no bio.

Hmmm, I wrote down this in my notes:

"Mon Apr 23 09:23:15 CDT 2007
Looks like the SCSI_IOCTL_SEND_COMMAND code path in RHEL4 for block devices in scsi_ioctl.c and ll_rw_blk.c is broken, in that it never maps the buffer to a bio. (In newer kernels there's a call to blk_rq_map_kern which maps the kernel buffer that gets kmalloc'ed to a bio. In RHEL4, this function doesn't exist, though there's a blk_rq_map_user, to map a user buffer. Nothing like either of those gets called.) So, the upshot is, on RHEL4, SCSI_IOCTL_SEND_COMMAND is kind of screwed. That's probably ok, since SG_IO seems to work. So, for 2.6 drivers series, just going to not support SCSI_IOCTL_SEND_COMMAND. Committed SG_IO changes to cciss_2_6_18 branch."

(In reply to comment #35)
>
> Now, it doesn't seem to be returning "not a typewriter", so I'm guessing the
> SCSI mid layer is changed (or has been for a long time?)
>

This was added just recently in the patch:

linux-2.6.9-cciss-Add-SG_IO-ioctl-and-fix-error-reporting-for-S.patch

    case SCSI_IOCTL_SEND_COMMAND:
        return scsi_cmd_ioctl(filep, disk, cmd, argp);

So no, it does not return -ENOTTY. It sounds like that is the fix.

Created attachment 306285 [details]
disable SCSI_IOCTL_SEND_COMMAND for cciss
Steve, Mike,
It sounds like this is the fix that should be in RHEL4?
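Attachment 306285 itself is not reproduced in this report. Going only by the discussion above, the change amounts to moving the SCSI_IOCTL_SEND_COMMAND case label out of the group forwarded to scsi_cmd_ioctl() and into the group that returns -ENOTTY. The following is a user-space model of that dispatch, not the driver code: `model_scsi_cmd_ioctl`, `model_ioctl_before_fix`, `model_ioctl_after_fix`, and the `SKETCH_` constants are hypothetical names introduced for illustration, and the real scsi_cmd_ioctl() takes (filep, disk, cmd, argp).

```c
#include <assert.h>
#include <errno.h>

/* ioctl numbers as used on Linux, repeated so the sketch is standalone. */
#define SKETCH_SCSI_IOCTL_SEND_COMMAND 1
#define SKETCH_SG_GET_VERSION_NUM      0x2282
#define SKETCH_SG_IO                   0x2285

/* Hypothetical stand-in for scsi_cmd_ioctl(): always reports success. */
static int model_scsi_cmd_ioctl(unsigned int cmd)
{
    (void)cmd;
    return 0;
}

/* Dispatch as described for the RHEL4.7 patch: SEND_COMMAND is forwarded. */
static int model_ioctl_before_fix(unsigned int cmd)
{
    switch (cmd) {
    case SKETCH_SG_GET_VERSION_NUM:
    case SKETCH_SG_IO:
    case SKETCH_SCSI_IOCTL_SEND_COMMAND:
        return model_scsi_cmd_ioctl(cmd);
    default:
        return -ENOTTY;
    }
}

/* Dispatch with the proposed change: SEND_COMMAND is rejected up front. */
static int model_ioctl_after_fix(unsigned int cmd)
{
    switch (cmd) {
    case SKETCH_SG_GET_VERSION_NUM:
    case SKETCH_SG_IO:
        return model_scsi_cmd_ioctl(cmd);
    case SKETCH_SCSI_IOCTL_SEND_COMMAND: /* would build a request with no bio */
    default:
        return -ENOTTY;
    }
}

/* Small self-check of the two behaviours. */
static void model_self_check(void)
{
    assert(model_ioctl_before_fix(SKETCH_SCSI_IOCTL_SEND_COMMAND) == 0);
    assert(model_ioctl_after_fix(SKETCH_SCSI_IOCTL_SEND_COMMAND) == -ENOTTY);
    assert(model_ioctl_after_fix(SKETCH_SG_IO) == 0);
}
```

In this model SG_IO keeps working after the change, while a caller of SCSI_IOCTL_SEND_COMMAND gets -ENOTTY instead of reaching the request path where the BUG() was hit.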
I think so, that is what we have in the drivers we distribute for RHEL4. Unless you wanted to try to fix the problem in ll_rw_blk.c and scsi_ioctl.c -- if that is what I think it is, my notes on that are more than a year old -- and I'm not sure how to fix it anyway, but somebody must know, as it's fixed in later kernels.
-- steve

It seems the smartest thing to do here is to disable SCSI_IOCTL_SEND_COMMAND as in the patch I attached because:
1: we never had it before for cciss (was there a reason it was added?)
2: it is disabled in the HP owned driver
3: it is also disabled in RHEL5

As for #2 and #3 above, SCSI_IOCTL_SEND_COMMAND is DISabled in our "2.6.x" series of cciss drivers, which is for kernels < about 2.6.15 (e.g. RHEL4), and it is ENabled in our "3.6.x" series of drivers, which is for kernels > about 2.6.15 (e.g. RHEL5). For our (HP's) RHEL5 driver, we currently have SCSI_IOCTL_SEND_COMMAND enabled, and I am pretty sure it even worked last time I tried it.
-- steve

The build for RHEL 4.7 beta snapshot 1 is planned for next Tues. 5/27. We will need a patch tested and ready for review before then. With the Monday holiday, that means today or tomorrow...

(In reply to comment #41)
> The build for RHEL 4.7 beta snapshot 1 is planned for next Tues. 5/27. We will
> need a patch tested and ready for review before then. With the Monday holiday,
> that means today or tomorrow...

I agree completely, I would like to see this submitted ASAP. If we have agreement here I would be happy to submit the patch I have attached here to rhkernel list today. I have tested it and it is minimally intrusive; all it does is disable an ioctl that probably should not have been enabled in the first place.

We (HP) agree that attachment 306285 [details] should be added.
Patch posted to rhkernel-list for review for RHEL4.7.

*** Bug 448100 has been marked as a duplicate of this bug. ***

Committed in 71.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2008-0665.html