Bug 97866
Summary: | [scsi] LTC3224 - trap while running 'echo "scsi dump 2" > /proc/scsi/scsi' | |
---|---|---|---
Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Kaena Freitas <kaena>
Component: | kernel | Assignee: | Doug Ledford <dledford>
Status: | CLOSED ERRATA | QA Contact: | Brian Brock <bbrock>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 2.1 | CC: | olof, riel, tao
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2004-08-17 11:27:38 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 116727 | |
Attachments: | | |
Description
Kaena Freitas
2003-06-23 14:07:34 UTC
------- Additional Comments From olof.com(prefers email via olof.com) 2003-23-06 15:46 -------
I was not able to reproduce this on my own system yet. However, the way the request queue is traversed in the loop in scsi_dump_status worries me, since no locks are taken. I'll look at this some more and check which locks serialize access to the list in question.

------- Additional Comments From zhouwu.com 2003-24-06 20:47 -------
To Kaena: We ran this test on the new A4 build and it trapped. On A3 there is no such symptom, which is really odd.

To Olof: did you run the test on the new A4 build? I can set up an environment for you to debug if you like.

I also went through the source of scsi_dump_status. It seems to stop within the following lines:

    for (i = 0; i < MAX_BLKDEV; i++) {
        struct list_head * queue_head;

        queue_head = &blk_dev[i].request_queue.queue_head;
        if (!list_empty(queue_head)) {
            struct request *req;
            struct list_head * entry;

            printk(KERN_INFO "%d: ", i);
            entry = queue_head->next;
            do {
                req = blkdev_entry_to_request(entry);
                printk("(%s %d %ld %ld %ld) ",
                       kdevname(req->rq_dev),
                       req->cmd,
                       req->sector,
                       req->nr_sectors,
                       req->current_nr_sectors);
            } while ((entry = entry->next) != queue_head);
            printk("\n");
        }
    }

On the first pass through the loop, the 'printk(KERN_INFO "%d: ", i);' did print 0, and then execution stopped before or at the next printk.

------- Additional Comments From olof.com(prefers email via olof.com) 2003-25-06 18:44 -------
The list access is completely unprotected in the proc code, which could cause crashes just like the ones we've seen. Wu, can you download the tarfile below and try out the kernel and initrd?

http://olof.austin.ibm.com/kernels/3224.tar

Let me know if this seems to work better. The problem is more likely to happen under heavy SCSI load, since the queues change more often then. Thanks.

------- Additional Comments From zhouwu.com 2003-26-06 06:05 -------
We tried out the initrd and vmlinux you gave us. They don't work.
The same error occurs on the A4 build, and we now see an additional message:

    "Kernel panic: kernel access of bad area pc d000000000021fb0 lr d000000000021fa8 address 18 tsk bash/12333"

From a document that I can't find now, I have the impression that the address space the PPC64 Linux kernel can access is limited to 0xC000000000000000 through 0xC00001FFFFFFFFFF. Is that right? Maybe the trap is related to this? The xmon output is as follows.

    2:mon> e
    cpu 2: Vector: 300 (Data Access) at [c0000001d24df6d0]
    pc: d000000000021fb0
    lr: d000000000021fa8
    sp: c0000001d24df950
    msr: 9000000000009032
    dar: 18
    dsisr: 40000000
    current = 0xc0000001d24dc000
    paca = 0xc000000000473000
    current = c0000001d24dc000, pid = 12333, comm = bash
    2:mon> Success in copy number 19 a of 10000K_file in dir2.
    Success in copy number 19 b of 1K_file in dir2.
    Success in copy number 19 b of 10K_file in dir3.
    Unrecognized command: \x3 (type ? for help)
    2:mon>
    2:mon> x
    Kernel panic: kernel access of bad area pc d000000000021fb0 lr d000000000021fa8 address 18 tsk bash/12333

------- Additional Comments From olof.com(prefers email via olof.com) 2003-26-06 15:55 -------
Is that really the console output? It seems to indicate that the other CPUs in the system are still running even though one entered XMON (the "Success in copy ..." messages). Please include a stack trace (t command) whenever you provide XMON data; it is very useful information.

0xd addresses are valid kernel addresses, so that's not a concern at this time.

------- Additional Comments From zhouwu.com 2003-29-06 04:39 -------
Yes, that is indeed the console output. While I reproduced the bug, some I/O test cases were running, hence the "Success in copy ..." messages. The trap only stopped CPU 2; the others kept running.

I ran the above procedure again without any heavy load, and it trapped again. Attached is the xmon output, including the backtrace.

Created attachment 92684
trace
------- Additional Comments From olof.com(prefers email via olof.com) 2003-08-07 12:10 -------
What's Red Hat's input on this? I doubt that it's a pSeries-specific problem. Thanks.

It's a known and very old bug that doesn't seem to have bothered too many people. It is something we will want to fix, but it is not the very highest priority bug at the moment.

olof.com changed:

    What        |Removed  |Added
    ----------------------------------
    Status      |OPEN     |ASSIGNED
    Owning Team |pSeries  |Red Hat

------- Additional Comments From zhouwu.com 2003-28-07 04:44 -------
This problem still exists in the B1 build.

Degrading to SHOULDFIX.

salina.com changed:

    What        |Removed  |Added
    ----------------------------------
    Owner       |khoa.com |salina.com

salina.com changed:

    What        |Removed  |Added
    ----------------------------------
    Status      |OPEN     |ASSIGNED

------- Additional Comments From salina.com 2003-08-09 15:44 -------
I'll take a look at this in the next few days. The pSeries machine is on loan to another group; I need to get it back from them first.

------- Additional Comments From salina.com 2003-08-09 16:30 -------
Found this in the kernel archives:
http://www.ussg.iu.edu/hypermail/linux/kernel/0206.1/0430.html

------ Additional Comments From salina.com 2003-15-10 16:31 -------
Chinmay,
Here is the KDB info for the crash after applying your patch. The same problem is still there.
My system is SLES 8 PPC64, latest RC3 kernel, kernel-ppc64-2.4.21-90.

    [3]kdb> dump dump-all dump-basic
    kdb_cmd[0] : excp
    cpu 3: Vector: 300 (Data Access) at [c000000015b9f6d0]
    pc: c000000000266c10 (scsi_dump_status+0x2a4)
    lr: c000000000266bd4 (scsi_dump_status+0x268)
    sp: c000000015b9f950
    msr: 9000000000009032
    dar: 0
    dsisr: 40000000
    current = 0xc000000015b9c000
    paca = 0xc000000000658000
    current = c000000015b9c000, pid = 2001, comm = bash
    kdb_cmd[1] : bt
    0xc000000015b9c000 00002001 00001999 0 003 run 0xc000000015b9c5e0*bash
    SP(esp)            PC(eip)            Function(args)
    0xc000000015b9f950 0xc000000000266c10 .scsi_dump_status +0x2a4
    0xc000000015b9fa70 0xc000000000265060 .proc_scsi_gen_write +0x208
    0xc000000015b9fb40 0xc0000000000fa908 .proc_file_write +0x58
    0xc000000015b9fbc0 0xc0000000000ad730 .sys_write +0xe4
    0xc000000015b9fc60 0xc000000000010814 .ret_from_syscall_1
    [exception: c00:(System Call) regs 0xc000000015b9fcd0] nip:[0xfe5e160] gpr[1]: [0xffffe250]
    kdb: Not a kernel-space address 0xffffe260
    <Stack drops into userspace here 00000000ffffe250>
    kdb_cmd[2] : rd
    gpr0 = 0xc000000000266bd4   gpr1 = 0xc000000015b9f950
    gpr2 = 0xc000000000728000   gpr3 = 0x0000000000000006
    gpr4 = 0x0000000000000001   gpr5 = 0x0000000000000001
    gpr6 = 0xc00000000044d3d8   gpr7 = 0xc00000000044d390
    gpr8 = 0x0000000000000006   gpr9 = 0xc0000000007c0e00
    gpr10 = 0xc00000000044d3d0  gpr11 = 0x0000000000003455
    gpr12 = 0xc0000000007c0a04  gpr13 = 0xc000000000658000
    gpr14 = 0x0000000000000000  gpr15 = 0x0000000000000000
    gpr16 = 0x0000000000000000  gpr17 = 0x0000000000000000
    gpr18 = 0x0000000000000000  gpr19 = 0x0000000000000000
    gpr20 = 0x0000000000000001  gpr21 = 0xc000000000658ec0
    gpr22 = 0xb000000000009032  gpr23 = 0x0000000000004000
    gpr24 = 0xc0000001fe033180  gpr25 = 0xc0000001fe006e00
    gpr26 = 0x0000000000000000  gpr27 = 0x0000000000000000
    gpr28 = 0xc0000000008601b0  gpr29 = 0xc0000000008601b0
    gpr30 = 0xc000000000501500  gpr31 = 0x0000000000000000
    nip = 0xc000000000266c10    msr = 0x9000000000009032
    esp = 0xc000000015b9f950    orig_gpr3 = 0xc000000015b9f860
    ctr = 0xc0000000003769f0    link = 0xc000000000266bd4
    xer = 0x0000000000000000    ccr = 0x0000000028022822
    mq = 0x0000000000000000     trap = 0x0000000000000300
    dar = 0x0000000000000000    dsisr = 0x0000000040000000
    result = 0x0000000000000000
    &regs = 0xc000000015b9f6d0
    .......

I disassembled the code:

    [2]kdb> id 0xc000000000266c10
    0xc000000000266c10 .scsi_dump_status+0x2a4 ld r31,0(r31)
    0xc000000000266c14 .scsi_dump_status+0x2a8 cmpd r31,r29
    0xc000000000266c18 .scsi_dump_status+0x2ac bne 0xc000000000266bdc .scsi_dump_status+0x270
    0xc000000000266c1c .scsi_dump_status+0x2b0 ld r3,-32560(r30)
    0xc000000000266c20 .scsi_dump_status+0x2b4 bl 0xc0000000000691ec .printk
    0xc000000000266c24 .scsi_dump_status+0x2b8 nop
    0xc000000000266c28 .scsi_dump_status+0x2bc b 0xc000000000266b6c .scsi_dump_status+0x200
    0xc000000000266c2c .scsi_dump_status+0x2c0 .long 0x0
    0xc000000000266c30 .scsi_dump_status+0x2c4 .long 0x2041
    0xc000000000266c34 .scsi_dump_status+0x2c8 lwz r0,256(r8)
    0xc000000000266c38 .scsi_dump_status+0x2cc .long 0x0
    0xc000000000266c3c .scsi_dump_status+0x2d0 .long 0x2c0
    0xc000000000266c40 .scsi_dump_status+0x2d4 .long 0x107363
    0xc000000000266c44 .scsi_dump_status+0x2d8 andi. r9,r27,24420
    0xc000000000266c48 .scsi_dump_status+0x2dc andis. r13,r11,28767
    0xc000000000266c4c .scsi_dump_status+0x2e0 andi. r20,r27,24948
    [2]kdb>

I will also mail you the tar file of scsi.c, scsi.s and scsi.i (I saved the temp files during the compile).

------ Additional Comments From salina.com 2003-15-10 18:23 -------
It looks like we are crashing on my machine here:

    } while ((entry = entry->next) != queue_head);

entry is 0. We need to find out why the code expects entry->next == queue_head to mark the end of the list rather than entry == 0. When is the structure initialized?

Chinmay, it's the end of the day for me; you can continue.
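The question raised above (why the loop waits for entry->next to come back to queue_head instead of stopping at 0) comes down to how a circular list head is set up. The following is a userspace sketch, not kernel code: struct list_head, init_list_head, list_empty and list_add are simplified stand-ins for the kernel definitions, and walk is a hypothetical helper that mirrors the shape of the quoted loop but reports a NULL link instead of dereferencing it. It assumes, as the later comments in this report suggest, that the head was zero-filled rather than initialized.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's circular doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* A properly initialized, empty head points at itself in both directions. */
void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Same idea as the kernel test: the list is empty when next == head. */
int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Insert entry directly after head; head must already be initialized. */
void list_add(struct list_head *entry, struct list_head *head)
{
    assert(head->next != NULL);
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/*
 * Walk the list with the same do/while shape as the quoted loop.
 * Returns the number of entries visited, or -1 when a NULL link is
 * reached (the point where the kernel loop would fault).
 */
int walk(const struct list_head *head)
{
    const struct list_head *entry;
    int visited = 0;

    if (list_empty(head))
        return 0;

    entry = head->next;
    do {
        if (entry == NULL)
            return -1;
        visited++;
    } while ((entry = entry->next) != head);

    return visited;
}
```

With a zero-filled head, list_empty() returns false because NULL is not equal to the head's own address, so the loop body runs with entry == NULL; walk() returns -1 in that case, 0 for an initialized empty head, and the entry count otherwise.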
------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-29-10 09:50 -------
Hi,
The trap happens when the scsi_dump_status function tries to print the pending block device requests. The dump of the SCSI host parameters and the SCSI command parameters goes fine. For some reason the structures in this block of code are not being populated. We are trying to work out the reason for this.
Regards
-Chinmay

------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-05-11 05:32 -------
Hi,
Given below is a patch that prevents this trap from occurring. It checks for an empty "entry" structure and breaks out of the loop if it is empty. The root cause of the trap is that the queue_head structure is null-initialized, so any access through it triggers a trap. This patch is a workaround and does not solve the problem of the queue_head being NULL. We haven't been able to simulate a scenario in which there are a lot of pending block requests to the SCSI subsystem, which could also be one of the reasons why the request_queue is empty. Could you please test this in a scenario with pending block device requests and let us know?
Regards
-Chinmay

Created attachment 95732
scsi_dump_status.patch
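The attached patch itself is not reproduced on this page. As an illustration only, a guard of the kind described (stop as soon as the entry pointer is empty) could look like the userspace sketch below. Everything here is hypothetical: sketch_request is a cut-down stand-in for a block request, entry_to_request plays the role of blkdev_entry_to_request, and dump_queue writes into a caller-supplied buffer instead of calling printk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, cut-down stand-in for a queued block request. */
struct sketch_request {
    long sector;
    long nr_sectors;
    struct list_head queue;   /* link into the request queue */
};

/* Recover the request that embeds a given list link. */
#define entry_to_request(ptr) \
    ((struct sketch_request *)((char *)(ptr) - \
                               offsetof(struct sketch_request, queue)))

/*
 * Format every queued request into buf as "(sector nr_sectors) ".
 * The walk stops when it returns to queue_head, and also when it meets
 * a NULL link, which is the guard described in the comment above.
 * Returns the number of requests written.
 */
int dump_queue(struct list_head *queue_head, char *buf, size_t len)
{
    struct list_head *entry;
    size_t used = 0;
    int printed = 0;

    assert(queue_head != NULL && buf != NULL);
    if (len == 0)
        return 0;
    buf[0] = '\0';

    entry = queue_head->next;
    while (entry != NULL && entry != queue_head) {
        struct sketch_request *req = entry_to_request(entry);
        int n = snprintf(buf + used, len - used, "(%ld %ld) ",
                         req->sector, req->nr_sectors);

        if (n < 0 || (size_t)n >= len - used) {
            buf[used] = '\0';   /* drop the partial entry: buffer is full */
            break;
        }
        used += (size_t)n;
        printed++;
        entry = entry->next;
    }
    return printed;
}
```

As the comment above says about the real patch, a guard like this only avoids the fault. It does not explain why the head was never initialized, and it does nothing about concurrent modification of the list.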
------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-05-11 05:34 -------
Patch to check whether the entry structure is populated or not.

------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-10-11 00:24 -------
Hi,
I wrote to Patrick Mansfield (SCSI guru at IBM). His opinion was that this feature has not had many users, which is why it is not present in the 2.6 kernel. He also mentioned that a dump of block requests should be done from the block layer. The mailing lists were also of the opinion that this feature could be discontinued.

Khoa,
There are two approaches to working around this problem: use the patch that checks whether the entry structure is NULL, or comment out the code altogether (this was also suggested on the mailing lists; the URL is attached to the bug report). We would like your comments on how to proceed.
Regards
- Chinmay

------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-12-11 03:30 -------
Hi,
Given below is a patch to comment out the scsi_dump_status feature.

Created attachment 95922
scsi_dump_status.patch
------ Additional Comments From achinmay.com(prefers email via albal.com) 2003-12-11 03:31 -------
Patch to comment out the scsi_dump_status feature in the 2.4 kernel.

----- Additional Comments From zhouwu.com 2004-01-12 20:02 -------
This defect still exists in the latest RHEL3 Update 1 build (20040108). Inspecting the kernel source shows that the above patch wasn't incorporated into this build.

----- Additional Comments From davidyao.com 2004-03-25 22:36 -------
This bug is not fixed in the 0316 RHEL3 U2 Beta.

    ===========================
    [root@plinuxt14 root]# echo "scsi dump 2" > /proc/scsi/scsi

    Message from syslogd@plinuxt14 at Fri Mar 26 11:33:42 2004 ...
    plinuxt14 kernel: Kernel panic: kernel access of bad area pc d000000000022e88 lr d000000000022e80 address 18 tsk bash/1817
    =========================

A proposed fix for this has been created. Currently, the fix can be found in my working code repository (a bk repository hosted on bkbits.net) as:

bk://linux-scsi.bkbits.net/rhel3-scsi-test

The relevant fix is changeset number 1.14. I'm also going to attach a copy of the patch, but it may not apply cleanly to a kernel that isn't up to date with my source tree (or it may; I haven't checked).

Created attachment 100392
Test patch for locking problem
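The contents of the test patch are not shown on this page, and the report does not say which kernel lock it takes. The general technique named in the attachment title, serializing list readers against list writers, can be sketched in userspace with a pthread mutex. All names below (locked_queue, queue_add, queue_del, queue_count, stress) are hypothetical, and a kernel fix would use the kernel's own locking primitives rather than pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical queue: a circular list plus the lock that guards it. */
struct locked_queue {
    pthread_mutex_t lock;
    struct list_head head;
};

void queue_init(struct locked_queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head.next = q->head.prev = &q->head;
}

void queue_add(struct locked_queue *q, struct list_head *entry)
{
    pthread_mutex_lock(&q->lock);
    entry->next = q->head.next;
    entry->prev = &q->head;
    q->head.next->prev = entry;
    q->head.next = entry;
    pthread_mutex_unlock(&q->lock);
}

void queue_del(struct locked_queue *q, struct list_head *entry)
{
    pthread_mutex_lock(&q->lock);
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = entry->prev = NULL;
    pthread_mutex_unlock(&q->lock);
}

/* The "dump" side: the whole traversal happens with the lock held. */
int queue_count(struct locked_queue *q)
{
    int n = 0;

    pthread_mutex_lock(&q->lock);
    for (struct list_head *e = q->head.next; e != &q->head; e = e->next)
        n++;
    pthread_mutex_unlock(&q->lock);
    return n;
}

struct churn_args {
    struct locked_queue *q;
    int rounds;
};

/* Writer thread: repeatedly queue and dequeue one transient entry. */
static void *churn(void *arg)
{
    struct churn_args *a = arg;
    struct list_head transient;

    for (int i = 0; i < a->rounds; i++) {
        queue_add(a->q, &transient);
        queue_del(a->q, &transient);
    }
    return NULL;
}

/*
 * Count the queue while another thread modifies it. Returns 1 if every
 * count was 1 or 2 and the final count is 1, 0 otherwise, and -1 if the
 * writer thread could not be started.
 */
int stress(int rounds)
{
    struct locked_queue q;
    struct list_head resident;
    struct churn_args args = { &q, rounds };
    pthread_t tid;
    int ok = 1;

    queue_init(&q);
    queue_add(&q, &resident);
    if (pthread_create(&tid, NULL, churn, &args) != 0)
        return -1;
    for (int i = 0; i < rounds; i++) {
        int n = queue_count(&q);
        if (n < 1 || n > 2)
            ok = 0;
    }
    pthread_join(tid, NULL);
    ok = ok && queue_count(&q) == 1;
    pthread_mutex_destroy(&q.lock);
    return ok;
}
```

Because the reader holds the same lock as the writers for the whole walk, it can never observe a half-linked entry. On toolchains that keep pthreads in a separate library, build with -pthread.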
A fix for this problem has been committed to the RHEL3 U3 patch pool this evening (in kernel version 2.4.21-15.6.EL).

----- Additional Comments From zhouwu.com 2004-07-19 04:59 -------
Just finished verifying this defect on the U3 Beta ISO released 07/09. It is fixed. Closing it. Thanks.

Closing out issue on verification of resolution from original reporter.

An errata has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2004-433.html