Bug 97866

Summary: [scsi] LTC3224 - trap while running 'echo "scsi dump 2" > /proc/scsi/scsi '
Product: Red Hat Enterprise Linux 2.1
Component: kernel
Version: 2.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Kaena Freitas <kaena>
Assignee: Doug Ledford <dledford>
QA Contact: Brian Brock <bbrock>
CC: olof, riel, tao
Doc Type: Bug Fix
Last Closed: 2004-08-17 11:27:38 UTC
Bug Blocks: 116727    
Attachments (description, flags):
  trace (flags: none)
  scsi_dump_status.patch (flags: none)
  scsi_dump_status.patch (flags: none)
  Test patch for locking problem (flags: none)

Description Kaena Freitas 2003-06-23 14:07:34 UTC
Hardware Environment:
plinuxt7.cn.ibm.com(9.181.27.227)
P630 70286C3

Software Environment:
RedHat Enterprise Linux 3 Alpha4
Kernel: 2.4.20-1.1931.2.231.2.11.ent

Steps to Reproduce:
1. Install the build on the box. Select everything when prompted for packages.
2. Log in to the newly installed OS.
3. Run the following command:
 echo "scsi dump 2" > /proc/scsi/scsi

 The system drops into xmon and stops.

Actual Results:
	The serial output is as follows:
----------------------------------------------Begin of output-------------------
NIP: d000000000021fb0 XER: 0000000000000000 LR: d000000000021fa8 REGS: c0000000f638b6d0 TRAP: 0300    Not tainted
NIP is at .scsi_dump_status [scsi_mod] 0x26c (2.4.20-1.1931.2.231.2.11.ent)
MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
TASK = c0000000f6388000[1186] 'bash' Last syscall: 4
last math 0000000000000000 CPU: 0
GPR00: d000000000021fa8 c0000000f638b950 d000000000049918 0000000000000006 
GPR04: 0000000000000001 0000000000000001 c0000000002eabe0 c0000000002eaba0 
GPR08: 0000000000000004 0000000000000000 0000000000008fe1 0000000000008fe4 
GPR12: c000000000628d1c c0000000f6388000 00000000ffffec58 0000000000000000 
GPR16: 00000000100be7f0 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000001 c00000000046fec0 b000000000009032 c00000000046f000 
GPR24: c0000000fe9e9e80 c0000000fe9cde00 0000000000000000 0000000000000000 
GPR28: c0000000006c1c28 c0000000006c1c28 d000000000047da0 0000000000000000 
Call Trace: 
[<d000000000021fa8>] .scsi_dump_status [scsi_mod] 0x264
[<d0000000000204a4>] .proc_scsi_gen_write [scsi_mod] 0x1b4
[<c0000000000e6c40>] .proc_file_write [kernel] 0x58
[<c0000000000a4ec0>] .sys_write [kernel] 0xe4
[<c00000000000fd48>] .ret_from_syscall_1 [kernel] 0x0

cpu 0: Vector: 300 (Data Access) at  [c0000000f638b6d0]
    pc: d000000000021fb0
    lr: d000000000021fa8
    sp: c0000000f638b950
   msr: 9000000000009032
   dar: 18
 dsisr: 40000000
  current = 0xc0000000f6388000
  paca    = 0xc00000000046f000
  current = c0000000f6388000, pid = 1186, comm = bash
0:mon>


Expected Results:
	The command should output a list of internal SCSI command blocks, as
described in section 8.3 of "The Linux 2.4 SCSI subsystem HOWTO", and the
system should not drop into xmon and stop.
	

Additional Information:
	1. On the RHEL A3 build, the same procedure does not drop into xmon. It
returns without any output.
	2. After a reboot, the following related messages can be found
in /var/log/messages:
	
-----------------------------------------Begin of the message-------------------
Jun 23 20:24:34 plinuxt4 kernel: Dump of scsi host parameters:
Jun 23 20:24:34 plinuxt4 kernel:  0 0 0 : 0 0
Jun 23 20:24:34 plinuxt4 last message repeated 3 times
Jun 23 20:24:34 plinuxt4 kernel:
Jun 23 20:24:34 plinuxt4 kernel:
Jun 23 20:24:34 plinuxt4 kernel: Dump of scsi command parameters:
Jun 23 20:24:34 plinuxt4 kernel: h:c:t:l (dev sect nsect cnumsec sg) (ret all flg) (to/cmd to ito) cmd snse result
Jun 23 20:24:34 plinuxt4 kernel: (  0)  0:0: 8: 0 ( 08:02 7602200    8    8 ffffffff 2) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  1)  0:0: 8: 0 ( 08:02   32    8    8 ffffffff 0) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  2)  0:0: 8: 0 ( 08:02 69856    8    8    1 0) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  3)  0:0: 8: 0 ( 08:02 13107272    8    8 ffffffff 0) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  4)  0:0: 8: 0 ( 08:02 13107624    8    8 ffffffff 2) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  5)  0:0: 8: 0 ( 08:02 12845088    8    8 ffffffff 0) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  6)  0:0: 8: 0 ( 08:02 17563704    8    8 ffffffff 0) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  7)  0:0: 8: 0 ( 08:02 29884440    8    8 ffffffff 2) (0 5 0x 0) (6000    0    0) 0x2a 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  8)  0:0: 9: 0 ( 08:10 8198    2    2 ffffffff 4) (0 5 0x 0) (6000    0    0) 0x28 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: (  9)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 10)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 11)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 12)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 13)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 14)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 15)  0:0: 9: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: h:c:t:l (dev sect nsect cnumsec sg) (ret all flg) (to/cmd to ito) cmd snse result
Jun 23 20:24:34 plinuxt4 kernel: h:c:t:l (dev sect nsect cnumsec sg) (ret all flg) (to/cmd to ito) cmd snse result
Jun 23 20:24:34 plinuxt4 kernel: ( 16)  2:0: 1: 0 ( 00:00    0    0    0 ffffffff 0) (0 5 0x 0) (1000    0    0) 0x1e 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: ( 17)  2:0: 1: 0 ( 00:00    0    0    0 ffffffff 0) (0 0 0x 0) (   0    0    0) 0x00 0x00 0x00000000
Jun 23 20:24:34 plinuxt4 kernel: h:c:t:l (dev sect nsect cnumsec sg) (ret all flg) (to/cmd to ito) cmd snse result
Jun 23 20:24:34 plinuxt4 kernel: Dump of pending block device requests
Jun 23 20:24:34 plinuxt4 kernel: 0: parport_pc lp parport autofs e100 sr_mod cdrom ext3 jbd sym53c8xx sd_mod scsi_mod
Jun 23 20:24:34 plinuxt4 kernel: NIP: d000000000021fb0 XER: 0000000000000000 LR: d000000000021fa8 REGS: c0000000f638b6d0 TRAP: 0300    Not tainted
Jun 23 20:24:34 plinuxt4 kernel: NIP is at .scsi_dump_status [scsi_mod] 0x26c (2.4.20-1.1931.2.231.2.11.ent)
Jun 23 20:24:34 plinuxt4 kernel: MSR: 9000000000009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11
Jun 23 20:24:34 plinuxt4 kernel: TASK = c0000000f6388000[1186] 'bash' Last syscall: 4
Jun 23 20:24:34 plinuxt4 kernel: last math 0000000000000000 CPU: 0
Jun 23 20:24:34 plinuxt4 kernel: GPR00: d000000000021fa8 c0000000f638b950 d000000000049918 0000000000000006
Jun 23 20:24:34 plinuxt4 kernel: GPR04: 0000000000000001 0000000000000001 c0000000002eabe0 c0000000002eaba0
Jun 23 20:24:34 plinuxt4 kernel: GPR08: 0000000000000004 0000000000000000 0000000000008fe1 0000000000008fe4
Jun 23 20:24:35 plinuxt4 kernel: GPR12: c000000000628d1c c0000000f6388000 00000000ffffec58 0000000000000000
Jun 23 20:24:35 plinuxt4 kernel: GPR16: 00000000100be7f0 0000000000000000 0000000000000000 0000000000000000
Jun 23 20:24:35 plinuxt4 kernel: GPR20: 0000000000000001 c00000000046fec0 b000000000009032 c00000000046f000
Jun 23 20:24:35 plinuxt4 kernel: GPR24: c0000000fe9e9e80 c0000000fe9cde00 0000000000000000 0000000000000000
Jun 23 20:24:35 plinuxt4 kernel: GPR28: c0000000006c1c28 c0000000006c1c28 d000000000047da0 0000000000000000
Jun 23 20:24:35 plinuxt4 kernel: Call Trace:
Jun 23 20:24:35 plinuxt4 kernel: [<d000000000021fa8>] .scsi_dump_status [scsi_mod] 0x264
Jun 23 20:24:35 plinuxt4 kernel: [<d0000000000204a4>] .proc_scsi_gen_write [scsi_mod] 0x1b4
Jun 23 20:24:35 plinuxt4 kernel: [<c0000000000e6c40>] .proc_file_write [kernel] 0x58
Jun 23 20:24:35 plinuxt4 kernel: [<c0000000000a4ec0>] .sys_write [kernel] 0xe4
------------------------------------------End of the messages-------------------

------- Additional Comment #1 From Daynerd K. Freitas 2003-06-23 10:05 ------- 
Over to Olof for investigation

Comment 1 Kaena Freitas 2003-06-23 19:48:26 UTC
------- Additional Comments From olof.com(prefers email via 
olof.com)  2003-23-06 15:46 -------
I was not able to reproduce this on my own system yet.

However, the way the request queue is traversed in the loop in scsi_dump_status
worries me, since there are no locks taken.

I'll look at this some more and check what locks are serializing access to the
list in question.


Comment 2 Kaena Freitas 2003-06-25 14:10:48 UTC
------- Additional Comments From zhouwu.com  2003-24-06 20:47 -------
To Kaena: We ran this test on the new A4 build, and it trapped. On A3 there is
no such symptom, which is really odd.

To Olof: did you run the test on the new A4 build? I can set up an
environment for you to debug if you like. I also went through the source of
scsi_dump_status. Execution seems to stop somewhere in the following lines:

        for (i = 0; i < MAX_BLKDEV; i++) {
                struct list_head * queue_head;

                queue_head = &blk_dev[i].request_queue.queue_head;
                if (!list_empty(queue_head)) {
                        struct request *req;
                        struct list_head * entry;

                        printk(KERN_INFO "%d: ", i);
                        entry = queue_head->next;
                        do {
                                req = blkdev_entry_to_request(entry);
                                printk("(%s %d %ld %ld %ld) ",
                                       kdevname(req->rq_dev),
                                       req->cmd,
                                       req->sector,
                                       req->nr_sectors,
                                       req->current_nr_sectors);
                        } while ((entry = entry->next) != queue_head);
                        printk("\n");
                }
        }

On the first iteration, printk(KERN_INFO "%d: ", i) did print "0: ", and then
execution stopped before or at the next printk.


Comment 3 Kaena Freitas 2003-06-26 00:59:31 UTC
------- Additional Comments From olof.com(prefers email via 
olof.com)  2003-25-06 18:44 -------
The list access is completely unprotected in the proc code, which could cause
crashes just as we've seen.

Wu, can you download the tarfile from below and try out the kernel and initrd?

http://olof.austin.ibm.com/kernels/3224.tar

Let me know if this seems to work better. The problem is more likely to happen
under heavy SCSI load, since the queues will be changed more often then.

Thanks.
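
The unprotected traversal described above can be illustrated in user space. The following is a minimal sketch, not the kernel code and not the eventual fix: toy_lock, struct node and count_locked are hypothetical names, and a C11 atomic_flag spinlock stands in for whichever kernel lock serializes access to the request queue (assumed here to be a single lock that every writer also takes).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel lock that guards the request queue. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

void toy_lock_acquire(void)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(&toy_lock, memory_order_acquire))
        ;
}

void toy_lock_release(void)
{
    atomic_flag_clear_explicit(&toy_lock, memory_order_release);
}

/* Circular doubly linked list node, modeled on the kernel's struct list_head. */
struct node {
    struct node *next, *prev;
};

/* An initialized, empty list points back at itself. */
void node_init(struct node *head)
{
    head->next = head;
    head->prev = head;
}

/* Link item in just before head, i.e. at the tail of the list. */
void node_add_tail(struct node *item, struct node *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/*
 * Count the entries while holding the lock, so that a writer (which must
 * take the same lock) cannot unlink an entry in the middle of the walk.
 */
size_t count_locked(struct node *head)
{
    size_t n = 0;

    assert(head != NULL);
    toy_lock_acquire();
    for (struct node *p = head->next; p != head; p = p->next)
        n++;
    toy_lock_release();
    return n;
}
```

The point of the sketch is only the shape: take the lock, walk, release. Without it, an entry can be removed between reading entry->next and following it.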

Comment 4 Kaena Freitas 2003-06-26 12:43:02 UTC
------- Additional Comments From zhouwu.com  2003-26-06 06:05 -------
We tried the initrd and vmlinux you gave us. They did not help: the same error
occurs on the A4 build. We also saw an additional message:
"Kernel panic: kernel access of bad area pc d000000000021fb0 lr 
d000000000021fa8 address 18 tsk bash/12333"
From a document that I can't find now, I recall that the address space the
PPC64 Linux kernel can access is limited to 0xC000000000000000 through
0xC00001FFFFFFFFFF. Is that right? Could the trap be related to this?

The xmon output is as follows.
2:mon> e 
cpu 2: Vector: 300 (Data Access) at  [c0000001d24df6d0]
    pc: d000000000021fb0
    lr: d000000000021fa8
    sp: c0000001d24df950
   msr: 9000000000009032
   dar: 18
 dsisr: 40000000
  current = 0xc0000001d24dc000
  paca    = 0xc000000000473000
  current = c0000001d24dc000, pid = 12333, comm = bash
2:mon> Success in copy number 19 a of 10000K_file in dir2.
Success in copy number 19 b of 1K_file in dir2.
Success in copy number 19 b of 10K_file in dir3.

Unrecognized command: \x3 (type ? for help)
2:mon> 
2:mon> x
Kernel panic: kernel access of bad area pc d000000000021fb0 lr d000000000021fa8 
address 18 tsk bash/12333


Comment 5 Kaena Freitas 2003-06-26 21:12:48 UTC
------- Additional Comments From olof.com(prefers email via 
olof.com)  2003-26-06 15:55 -------
Is that really the console output? It seems to indicate that the other CPUs
in the system are still running even though one entered XMON (The "Success in
copy ..." messages).

Please include a stack trace (the t command) whenever you provide XMON data;
without it, the data is not very useful.

0xd... addresses are valid kernel addresses, so that's not a concern at this time.
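
The faulting address reported by xmon (dar: 18) is a small offset from zero, which is the usual signature of reading a structure member through a NULL pointer rather than of an address-range limit. A minimal sketch follows. struct toy_request is a hypothetical layout invented for illustration on an LP64 target; it is not taken from the real struct request, and which member the kernel actually touched is not established in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* Hypothetical layout; the real struct request may differ. */
struct toy_request {
    struct toy_list_head queue; /* offsets 0x00..0x0f on LP64 */
    int32_t seq;                /* offset 0x10 */
    int32_t status;             /* offset 0x14 */
    uint16_t dev;               /* offset 0x18 */
};

/*
 * Address that a load of member 'dev' would touch, given a pointer to the
 * embedded list node. Nothing is dereferenced; only the arithmetic that a
 * list_entry()-style conversion followed by a member access implies is done.
 */
uintptr_t dev_access_address(const struct toy_list_head *entry)
{
    uintptr_t base;

    /* The list node is the first member, so the conversion subtracts 0. */
    assert(offsetof(struct toy_request, queue) == 0);
    base = (uintptr_t)entry - offsetof(struct toy_request, queue);
    return base + offsetof(struct toy_request, dev);
}
```

With entry equal to NULL, the computed address is simply the member's offset, so a fault at a small address such as 0x18 suggests a NULL list pointer, not a problem with the 0xd... module address range.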

Comment 6 Kaena Freitas 2003-06-29 17:24:19 UTC
------- Additional Comments From zhouwu.com  2003-29-06 04:39 -------
Yes, that is indeed the console output. While I was reproducing the bug, some
I/O test cases were running, hence the "Success in copy ..." messages. The
trap only stopped cpu 2; the other CPUs kept running.

I repeated the procedure without any heavy load, and it trapped again.
Attached is the xmon output, including the backtrace.

Comment 7 Kaena Freitas 2003-06-29 20:47:49 UTC
Created attachment 92684
trace

Comment 8 Kaena Freitas 2003-07-08 16:39:03 UTC
------- Additional Comments From olof.com(prefers email via 
olof.com)  2003-08-07 12:10 -------
What's RedHat's input on this? I doubt that it's a pSeries-specific problem. 
Thanks.

Comment 9 Rik van Riel 2003-07-08 17:29:36 UTC
it's a known and very old bug that doesn't seem to have bothered too many people

it is something we will want to fix, but not the very highest priority bug at
the moment

Comment 10 Kaena Freitas 2003-07-22 16:47:53 UTC
olof.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED
        Owning Team|pSeries                     |Red Hat


Comment 11 Olof Johansson 2003-07-28 13:18:11 UTC
------- Additional Comments From zhouwu.com  2003-28-07 04:44 -------
This problem still exists in the B1 build.

Comment 12 Arjan van de Ven 2003-08-05 10:11:13 UTC
degrading to SHOULDFIX

Comment 14 Olof Johansson 2003-09-08 19:43:18 UTC
salina.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Owner|khoa.com             |salina.com

Comment 15 Olof Johansson 2003-09-08 19:43:47 UTC
salina.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|OPEN                        |ASSIGNED

Comment 16 Olof Johansson 2003-09-08 20:00:45 UTC
------- Additional Comments From salina.com  2003-08-09 15:44 -------
I'll take a look at this in the next few days. The pSeries machine is on loan
to another group, so I need to get it back from them first.

Comment 17 Olof Johansson 2003-09-08 20:35:49 UTC
------- Additional Comments From salina.com  2003-08-09 16:30 -------
Found this in the kernel archives
http://www.ussg.iu.edu/hypermail/linux/kernel/0206.1/0430.html

Comment 18 IBM Bug Proxy 2003-10-16 00:44:49 UTC
------ Additional Comments From salina.com  2003-15-10 16:31 -------
Chinmay,

Info on KDB for the crash ... after applying your patch. Same problem still 
there.

My system is SLES 8 PPC64, latest RC3 kernel kernel-ppc64-2.4.21-90

[3]kdb> dump
dump-all
dump-basic
kdb_cmd[0] : excp
cpu 3: Vector: 300 (Data Access) at  [c000000015b9f6d0]
    pc: c000000000266c10 (scsi_dump_status+0x2a4)
    lr: c000000000266bd4 (scsi_dump_status+0x268)
    sp: c000000015b9f950
   msr: 9000000000009032
   dar: 0
 dsisr: 40000000
  current = 0xc000000015b9c000
  paca    = 0xc000000000658000
  current = c000000015b9c000, pid = 2001, comm = bash
kdb_cmd[1] : bt
0xc000000015b9c000 00002001 00001999  0  003  run   0xc000000015b9c5e0*bash
          SP(esp)            PC(eip)      Function(args)
0xc000000015b9f950  0xc000000000266c10  .scsi_dump_status +0x2a4
0xc000000015b9fa70  0xc000000000265060  .proc_scsi_gen_write +0x208
0xc000000015b9fb40  0xc0000000000fa908  .proc_file_write +0x58
0xc000000015b9fbc0  0xc0000000000ad730  .sys_write +0xe4
0xc000000015b9fc60  0xc000000000010814  .ret_from_syscall_1
  [exception: c00:(System Call) regs 0xc000000015b9fcd0] nip:[0xfe5e160] gpr[1]:[0xffffe250]
    kdb: Not a kernel-space address 0xffffe260
<Stack drops into userspace here 00000000ffffe250>
kdb_cmd[2] : rd
gpr0  = 0xc000000000266bd4 gpr1  = 0xc000000015b9f950
gpr2  = 0xc000000000728000 gpr3  = 0x0000000000000006
gpr4  = 0x0000000000000001 gpr5  = 0x0000000000000001
gpr6  = 0xc00000000044d3d8 gpr7  = 0xc00000000044d390
gpr8  = 0x0000000000000006 gpr9  = 0xc0000000007c0e00
gpr10 = 0xc00000000044d3d0 gpr11 = 0x0000000000003455
gpr12 = 0xc0000000007c0a04 gpr13 = 0xc000000000658000
gpr14 = 0x0000000000000000 gpr15 = 0x0000000000000000
gpr16 = 0x0000000000000000 gpr17 = 0x0000000000000000
gpr18 = 0x0000000000000000 gpr19 = 0x0000000000000000
gpr20 = 0x0000000000000001 gpr21 = 0xc000000000658ec0
gpr22 = 0xb000000000009032 gpr23 = 0x0000000000004000
gpr24 = 0xc0000001fe033180 gpr25 = 0xc0000001fe006e00
gpr26 = 0x0000000000000000 gpr27 = 0x0000000000000000
gpr28 = 0xc0000000008601b0 gpr29 = 0xc0000000008601b0
gpr30 = 0xc000000000501500 gpr31 = 0x0000000000000000
nip   = 0xc000000000266c10 msr   = 0x9000000000009032
esp   = 0xc000000015b9f950 orig_gpr3 = 0xc000000015b9f860
ctr   = 0xc0000000003769f0 link  = 0xc000000000266bd4
xer   = 0x0000000000000000 ccr   = 0x0000000028022822
mq    = 0x0000000000000000 trap  = 0x0000000000000300
dar   = 0x0000000000000000 dsisr = 0x0000000040000000
result = 0x0000000000000000 &regs = 0xc000000015b9f6d0
.......

I disassembled the code 


[2]kdb> id 0xc000000000266c10
0xc000000000266c10 .scsi_dump_status+0x2a4     ld       r31,0(r31)
0xc000000000266c14 .scsi_dump_status+0x2a8     cmpd     r31,r29
0xc000000000266c18 .scsi_dump_status+0x2ac     bne      0xc000000000266bdc .scsi_dump_status+0x270
0xc000000000266c1c .scsi_dump_status+0x2b0     ld       r3,-32560(r30)
0xc000000000266c20 .scsi_dump_status+0x2b4     bl       0xc0000000000691ec .printk
0xc000000000266c24 .scsi_dump_status+0x2b8     nop
0xc000000000266c28 .scsi_dump_status+0x2bc     b        0xc000000000266b6c .scsi_dump_status+0x200
0xc000000000266c2c .scsi_dump_status+0x2c0     .long 0x0
0xc000000000266c30 .scsi_dump_status+0x2c4     .long 0x2041
0xc000000000266c34 .scsi_dump_status+0x2c8     lwz      r0,256(r8)
0xc000000000266c38 .scsi_dump_status+0x2cc     .long 0x0
0xc000000000266c3c .scsi_dump_status+0x2d0     .long 0x2c0
0xc000000000266c40 .scsi_dump_status+0x2d4     .long 0x107363
0xc000000000266c44 .scsi_dump_status+0x2d8     andi.    r9,r27,24420
0xc000000000266c48 .scsi_dump_status+0x2dc     andis.   r13,r11,28767
0xc000000000266c4c .scsi_dump_status+0x2e0     andi.    r20,r27,24948
[2]kdb>


I will also mail you a tar file of scsi.c, scsi.s and scsi.i (I saved the
temporary files during the compile).

Comment 19 IBM Bug Proxy 2003-10-16 00:46:14 UTC
------ Additional Comments From salina.com  2003-15-10 18:23 -------
It looks like we are crashing on my machine here

    } while ((entry = entry->next) != queue_head);

entry is 0

I need to find out why the code expects entry->next == queue_head to mark the
end of the list, rather than entry == 0.
When is the structure initialized?

Chinmay, end of the day for me, you can continue. 
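
The loop's termination test makes sense for a circular list: in the kernel's list_head convention an initialized, empty list points back at itself, so a walk ends when it returns to the head and never at NULL. A head that was only zero-filled and never initialized breaks that convention. A minimal user-space sketch follows, with struct toy_list_head and its helpers modeled on (but not copied from) the kernel's list macros; all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/* Proper initialization: the empty list is a one-element ring. */
void toy_init_list_head(struct toy_list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Zero-fill only, as happens when a containing structure is merely cleared. */
void toy_zero_list_head(struct toy_list_head *head)
{
    memset(head, 0, sizeof(*head));
}

/* Same test as the circular-list convention: empty means next == head. */
bool toy_list_empty(const struct toy_list_head *head)
{
    assert(head != NULL);
    return head->next == head;
}
```

For a zero-filled head, toy_list_empty() reports "not empty" because next is NULL rather than the head itself. A walk that then starts at head->next begins with a NULL entry, and its first dereference faults at a small address, which would be consistent with entry being 0 in the crash above if the queue head was never initialized.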

Comment 20 IBM Bug Proxy 2003-10-29 14:59:02 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-29-10 09:50 -------
Hi,

The trap is happening when the scsi_dump_status function is trying to print the
pending block device requests. The dump of the scsi host parameters and the scsi
command parameters goes on fine.

For some reason the structures in this block of code are not being populated.
Trying to decipher the reason for this.

Regards
-Chinmay 

Comment 21 IBM Bug Proxy 2003-11-05 16:55:53 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-05-11 05:32 -------
Hi,

Given below is a patch which prevents this trap from occurring. It checks for an
empty "entry" structure and breaks from the loop if it is empty. The root cause
for this trap is that the queue_head structure is being null initialized and so
any access to this triggers a trap. This patch is a workaround and does not
solve the problem of the queue_head being NULL.

We haven't been able to simulate a scenario wherein there are a lot of pending
block requests to the scsi subsystem. This could also be one of the reasons why
the request_queue is empty.

Could you please test this in a scenario wherein there are pending block device
requests and let us know.

Regards
-Chinmay 
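
The workaround described above can be sketched in user space as follows. This is an illustration under stated assumptions, not the attached patch: struct toy_list_head and toy_walk_guarded are hypothetical names, and the guard simply stops the walk when it meets a NULL link.

```c
#include <assert.h>
#include <stddef.h>

struct toy_list_head {
    struct toy_list_head *next, *prev;
};

/*
 * Walk from head->next until the walk returns to head, or stop early if a
 * NULL link is found. Returns the number of entries visited and sets
 * *broken to 1 when the walk was cut short by a NULL link.
 */
size_t toy_walk_guarded(const struct toy_list_head *head, int *broken)
{
    size_t visited = 0;
    const struct toy_list_head *entry;

    assert(head != NULL && broken != NULL);
    *broken = 0;
    for (entry = head->next; entry != head; entry = entry->next) {
        if (entry == NULL) {
            /* Uninitialized or corrupted list: bail out instead of faulting. */
            *broken = 1;
            break;
        }
        visited++;
    }
    return visited;
}
```

Such a guard only hides the symptom: a zero-filled head is still reported as non-empty, and the walk is still unprotected against concurrent changes to the list.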

Comment 22 IBM Bug Proxy 2003-11-05 16:56:54 UTC
Created attachment 95732
scsi_dump_status.patch

Comment 23 IBM Bug Proxy 2003-11-05 16:57:04 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-05-11 05:34 -------
 
patch to check if entry structure is populated or not 

Comment 24 IBM Bug Proxy 2003-11-10 16:56:12 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-10-11 00:24 -------
Hi,

I wrote to Patrick Mansfield (scsi guru at IBM) and he was of the opinion that
there have not been many users of this feature and hence it is not present in
the 2.6 kernel. He also mentioned that a dump of block requests should be done
from the block layer.  The mailing lists were also of the opinion that this
feature could be discontinued. 

Khoa, 
There are two approaches to working around this problem: either use the patch
that checks whether the entry structure is NULL, or comment out the code
altogether (this was also suggested on the mailing lists; the URL is attached
in the bug report). We would like your comments on how to proceed.

Regards
- Chinmay 

Comment 25 IBM Bug Proxy 2003-11-12 13:13:52 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-12-11 03:30 -------
Hi,
Given below is a patch to comment out the scsi_dump_status feature. 

Comment 26 IBM Bug Proxy 2003-11-12 13:15:12 UTC
Created attachment 95922
scsi_dump_status.patch

Comment 27 IBM Bug Proxy 2003-11-12 13:15:24 UTC
------ Additional Comments From achinmay.com(prefers email via albal.com)  2003-12-11 03:31 -------
 
patch to comment out the scsi_dump_status feature in 2.4 kernel 

Comment 28 IBM Bug Proxy 2004-01-13 02:58:13 UTC
----- Additional Comments From zhouwu.com  2004-01-12 20:02 -------
This defect still exists in the latest RHEL3 Update 1 build (20040108).
Inspecting the kernel source shows that the above patch was not incorporated
into this build.

Comment 29 IBM Bug Proxy 2004-03-26 03:35:49 UTC
----- Additional Comments From davidyao.com  2004-03-25 22:36 -------
This bug is not fixed in 0316 RHEL3 U2 Beta.
===========================
[root@plinuxt14 root]# echo "scsi dump 2" > /proc/scsi/scsi

Message from syslogd@plinuxt14 at Fri Mar 26 11:33:42 2004 ...
plinuxt14 kernel: Kernel panic: kernel access of bad area pc d000000000022e88 
lr d000000000022e80 address 18 tsk bash/1817
========================= 

Comment 31 Doug Ledford 2004-05-20 21:38:10 UTC
A proposed fix for this has been created.  Currently, the fix can be
found in my working code repository (a bk repository hosted on
bkbits.net) as: bk://linux-scsi.bkbits.net/rhel3-scsi-test and the
relevant fix is changeset number 1.14.  I'm also going to attach a
copy of the patch, but it may not apply cleanly to a kernel that isn't
up to date with my source tree (or it may, I haven't checked).

Comment 32 Doug Ledford 2004-05-20 21:39:40 UTC
Created attachment 100392
Test patch for locking problem

Comment 33 Ernie Petrides 2004-06-05 04:36:26 UTC
A fix for this problem has been committed to the RHEL3 U3
patch pool this evening (in kernel version 2.4.21-15.6.EL).


Comment 36 IBM Bug Proxy 2004-07-19 09:37:50 UTC
----- Additional Comments From zhouwu.com  2004-07-19 04:59 -------
Just finished verifying this defect on the U3 Beta ISO released on 07/09. It is
fixed. Closing it. Thanks.

Comment 37 Jay Turner 2004-08-17 11:27:38 UTC
Closing out issue on verification of resolution from original reporter.

Comment 38 John Flanagan 2004-09-02 04:30:37 UTC
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html