Kernel OOPS during the function call scsi_get_host_dev in 2.4.18-3smp (RH 7.3)

On a standard RH 7.3 SMP install on a dual-processor machine with an Intel Integrated RAID controller, using kernel 2.4.18-3smp with the "gdth" module loaded, I get a segmentation fault (kernel oops) during:

1. Shutdown/halt or "rmmod gdth" (basically calling gdth_flush())
2. Accessing the "gdth" information from the /proc filesystem, e.g. with "cat /proc/scsi/gdth/3".

Note: This problem occurs ONLY with the Red Hat kernel source for 2.4.18-3. The 2.4.18 kernel downloaded from kernel.org, built with kernel-2.4.18-i686-smp.config, does not exhibit this behaviour, which leads me to believe that some Red Hat custom kernel patch might be responsible. The problem also occurs only with SMP kernels.

Preliminary debugging shows that the function scsi_get_host_dev() (scsi.c) seems to be causing the segmentation fault. This function is called twice in gdth_proc.c (gdth_get_info() and gdth_set_info()) and once in gdth.c (gdth_flush()).

Following is the OOPS for "cat /proc/scsi/gdth/3":

Jun 13 12:12:19 localhost kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Jun 13 12:12:19 localhost kernel: printing eip:
Jun 13 12:12:19 localhost kernel: d080e2f1
Jun 13 12:12:19 localhost kernel: *pde = 00000000
Jun 13 12:12:19 localhost kernel: Oops: 0002
Jun 13 12:12:19 localhost kernel: gdth binfmt_misc autofs eepro100 usb-ohci usbcore ext3 jbd aic7xxx sd_mod scsi
Jun 13 12:12:19 localhost kernel: CPU: 1
Jun 13 12:12:19 localhost kernel: EIP: 0010:[<d080e2f1>] Not tainted
Jun 13 12:12:19 localhost kernel: EFLAGS: 00010086
Jun 13 12:12:19 localhost kernel:
Jun 13 12:12:19 localhost kernel: EIP is at scsi_build_commandblocks [scsi_mod] 0x21 (2.4.18-3smp)
Jun 13 12:12:19 localhost kernel: eax: 00000000 ebx: cd3a9a00 ecx: 00000000 edx: cd3a9a18
Jun 13 12:12:20 localhost kernel: esi: c0350000 edi: cd3a9b10 ebp: 00000000 esp: cde03a90
Jun 13 12:12:20 localhost kernel: ds: 0018 es: 0018 ss: 0018
Jun 13 12:12:20 localhost kernel: Process cat (pid: 1308, stackpage=cde03000)
Jun 13 12:12:20 localhost kernel: Stack: cd3a9a18 c0350000 00000286 cd3a9a00 c0350000 cd3a9b10 00000000 d080fcdb
Jun 13 12:12:20 localhost kernel: cd3a9a00 00000000 d08f7ce0 cde03ea3 d08f0574 c0350000 00000000 00000000
Jun 13 12:12:20 localhost kernel: cde03ea3 00000000 00000000 00000000 00000004 c013117c 00000292 00001000
Jun 13 12:12:20 localhost kernel: Call Trace: [<d080fcdb>] scsi_get_host_dev_Rsmp_f651c1f2 [scsi_mod] 0x4b
Jun 13 12:12:20 localhost kernel: [<d08f7ce0>] .rodata.str1.32 [gdth] 0x5c0
Jun 13 12:12:20 localhost kernel: [<d08f0574>] gdth_get_info [gdth] 0xa8
Jun 13 12:12:20 localhost kernel: [<c013117c>] filemap_nopage [kernel] 0xbc
Jun 13 12:12:20 localhost kernel: [<c0139d92>] __alloc_pages [kernel] 0x72
Jun 13 12:12:20 localhost kernel: [<c013ee4b>] page_add_rmap [kernel] 0x3b
Jun 13 12:12:20 localhost kernel: [<c012cdfd>] do_no_page [kernel] 0x23d
Jun 13 12:12:20 localhost kernel: [<d086a735>] ext3_mark_iloc_dirty [ext3] 0x25
Jun 13 12:12:20 localhost kernel: [<c012d3a6>] vm_enough_memory [kernel] 0x36
Jun 13 12:12:20 localhost kernel: [<c022ef60>] rb_insert_color [kernel] 0x70
Jun 13 12:12:20 localhost kernel: [<c0117a2d>] do_page_fault [kernel] 0x12d
Jun 13 12:12:20 localhost kernel: [<c0117900>] do_page_fault [kernel] 0x0
Jun 13 12:12:20 localhost kernel: [<c0108d5c>] error_code [kernel] 0x34
Jun 13 12:12:20 localhost kernel: [<c0117900>] do_page_fault [kernel] 0x0
Jun 13 12:12:20 localhost kernel: [<c0108d5c>] error_code [kernel] 0x34
Jun 13 12:12:20 localhost kernel: [<c022daeb>] clear_user [kernel] 0x2b
Jun 13 12:12:20 localhost kernel: [<c0160b2c>] padzero [kernel] 0x1c
Jun 13 12:12:20 localhost kernel: [<c0161c32>] load_elf_binary [kernel] 0x9e2
Jun 13 12:12:20 localhost kernel: [<c0139d92>] __alloc_pages [kernel] 0x72
Jun 13 12:12:20 localhost kernel: [<c013ee4b>] page_add_rmap [kernel] 0x3b
Jun 13 12:12:20 localhost kernel: [<c012cba7>] do_anonymous_page [kernel] 0x117
Jun 13 12:12:20 localhost kernel: [<d08ef195>] gdth_proc_info [gdth] 0x111
Jun 13 12:12:20 localhost kernel: [<d0811a2f>] proc_scsi_read [scsi_mod] 0x3f
Jun 13 12:12:20 localhost kernel: [<c0164b10>] proc_file_read [kernel] 0xe0
Jun 13 12:12:20 localhost kernel: [<c0141ad6>] sys_read [kernel] 0x96
Jun 13 12:12:20 localhost kernel: [<c012d5b5>] sys_brk [kernel] 0xc5
Jun 13 12:12:20 localhost kernel: [<c0108c6b>] system_call [kernel] 0x33
Jun 13 12:12:20 localhost kernel:
Jun 13 12:12:20 localhost kernel:
Jun 13 12:12:20 localhost kernel: Code: f0 fe 08 0f 88 ad 1d 00 00 80 bb 05 01 00 00 00 75 19 8b 54

---------- Action by: bojikannanthanam
Issue Registered

---------- Action by: rlandry
Is this same behavior present with the 2.4.18-4 errata kernel for 7.3?
Category set to: Kernel
Status set to: Waiting on Client
Issue assigned to: rlandry

---------- Action by: bojikannanthanam
The same behaviour is present with the 2.4.18-4 errata kernel for 7.3. I used kernel-smp-2.4.18-4.i686.rpm.
The same bug exists with the Red Hat beta "null" (kernel 2.4.18-11). This bug was resolved in 2.4.18-5 for RH 7.3; it seems the patch did not make it into the Red Hat beta?
In initial testing with the new board you sent, we have not reproduced this. Perhaps you need to tell us more about the configuration in which you are seeing the problem. Could you do that, please?
From: "Kannanthanam, Boji T" <boji.t.kannanthanam>

We saw the problem on the standard SMP kernel on Red Hat 7.3 and on the Red Hat beta (Null). To reproduce the problem, just run "cat /proc/scsi/gdth/#", where "#" is the nth SCSI controller on the system. You can also see the kernel OOPS when you unload the "gdth" module. I usually have Red Hat installed on a RAID drive on the controller.
The cat /proc/scsi/gdth/# is exactly what I did to try to reproduce the problem.
This got fixed several errata ago.
The problem occurs again in Red Hat Enterprise Linux release 2.9.5AS, kernel 2.4.21-1.1931.2.399.entsnmp! I debugged it and found out what is going wrong. The problem occurs in scsi_build_commandblocks() in scsi.c, called from scsi_get_host_dev(). scsi_get_host_dev() allocates a new Scsi_Device structure, sets all its fields to 0, and passes the structure pointer to scsi_build_commandblocks(). That function calls spin_lock_irqsave() with request_queue.queue_lock as the first parameter, but this field is never initialized and is therefore a NULL pointer!

The problem does not occur with the "original" 2.4.21 kernel. There, scsi_build_commandblocks() calls spin_lock_irqsave() with a pointer to a defined spinlock_t variable, "device_request_lock", as the first parameter.
See Bugzilla bug #104520 for the resolution in RHEL 3.0.