Description of problem:
I am noticing the following panic on Redhat 5. Our driver is a Multipath failover module and we are using "scsi_execute_async" API for routing IO's. In earlier kernels we used "scsi_do_req" API.

Version-Release number of selected component (if applicable):

How reproducible:
Very easily reproducible.

Steps to Reproduce:
1. load MPP failover driver
2. discover devices
3. make filesystems
4. start io application to issue filesystem io's.
5. panic noticed within 5 min

Actual results:

Panic Message
--------------
general protection fault: 0000 [1] SMP
last sysfs file: /block/sdbl/sdbl2/dev
CPU 2
Modules linked in: mppVhba(U) qla2xxx lpfc ata_piix libata mptspi scsi_transport_spi mptfc mptscsih scsi_transport_fc mptbase aacraid mppUpper(U) sg sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 525, comm: mpp_dcr Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff800074ff>] [<ffffffff800074ff>] kmem_cache_free+0x5f/0x1cb
RSP: 0018:ffff81007e2c9da0 EFLAGS: 00010286
RAX: c3c031c3c031c3c0 RBX: ffff81007df91900 RCX: 00000000000fe000
RDX: c3c033804031c3c0 RSI: 000001bc80000000 RDI: 00000000000007f0
RBP: ffff81007f553fc0 R08: 0000000000000800 R09: 000000000000ffff
R10: 0000000000000800 R11: ffffffff80042baf R12: ffff810037f27100
R13: 0000000000000000 R14: 0000000000000800 R15: ffff81007eddac88
FS: 0000000000000000(0000) GS:ffff81007ff1de40(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005f03a5 CR3: 000000007f799000 CR4: 00000000000006e0
Process mpp_dcr (pid: 525, threadinfo ffff81007e2c8000, task ffff81007e09a100)
Stack: ffff81007dd58b08 0000000000000800 ffff81007df91900 ffff81007f553fc0
 0000000000000000 ffff81007df91900 0000000000000800 ffffffff80040046
 ffff81007df91900 0000000000000800 ffff81007dd58b08 ffffffff88079310
Call Trace:
 [<ffffffff80040046>] bio_free+0x33/0x43
 [<ffffffff88079310>] :scsi_mod:scsi_execute_async+0x18a/0x3ac
 [<ffffffff88295343>] :mppVhba:mppLnx_do_queuecommand+0x76e/0x7ab
 [<ffffffff88293e49>] :mppVhba:mppLnx_scsi_done+0x0/0xa4
 [<ffffffff8828d6ce>] :mppVhba:mppLnx_dpc_handler+0xf5/0x218
 [<ffffffff8005bfe5>] child_rip+0xa/0x11
 [<ffffffff8828d5d9>] :mppVhba:mppLnx_dpc_handler+0x0/0x218
 [<ffffffff8005bfdb>] child_rip+0x0/0x11

Code: 8b 02 f6 c4 40 74 04 48 8b 52 10 8b 02 84 c0 78 0a 0f 0b 68
RIP [<ffffffff800074ff>] kmem_cache_free+0x5f/0x1cb
 RSP <ffff81007e2c9da0>
 <0>Kernel panic - not syncing: Fatal exception

Expected results:
No panic should be noticed.

Additional info:
I found the following link with similar issue
http://www.mail-archive.com/linux-scsi.org/msg04233.html
LSI, did you test the patch in the link: http://marc.theaimsgroup.com/?l=linux-scsi&m=116464965414878&w=2 or are you just posting on that thread? From their debug output on the thread, it looks like the patch will solve the problem. LSI, you guys can also just use the block layer functions directly.
Mike, I will try "Patch 1" from the link and update you accordingly. For patching I am using a SLES 10 machine, hence I posted on the other thread. I see the panics on both Red Hat 5 and SLES 10. Since we started development during the Red Hat 5 beta, we reworked our code away from "scsi_do_req" and scsi_request and went ahead with the "scsi_execute_async" functionality. It is already too late in the program to make changes to use the block layer functions. If we can work around the problem with the patch, we will go ahead with the current changes and then start heading in the direction you suggested, using the block layer functions. regards, Sudhir
Sudhir, how did you guys get that scatterlist with the offset at the end? Are you guys using something like sg iovecs, or did you use multiple kmalloc'd buffers? You may just want to use alloc_page or alloc_pages, as sg and st do now.
Mike, we register with the scsi_host_template's queuecommand interface. I am passing the request_buffer and request_bufflen that we get through the scsi_cmnd from queuecommand directly to scsi_execute_async. We are not using sg iovecs, and we are not allocating buffers for the scsi_cmnd. Sudhir
I applied the patch to SLES 10 and SLES 10 SP1, since the source code was readily available, and I compiled only "scsi_mod". The problem had been reproduced on those distributions as well. With the patch, the IO's have been running successfully for more than 12 hours on one system running SLES 10 and another running SLES 10 SP1. Is there documentation on how to get the RHEL source code and compile it? I guess the patch is very trivial, since KABI promised the availability of "scsi_execute_async" we went ahead and used it. What is the mechanism for getting the patch into Red Hat 5, since the release is very close? Sudhir
(In reply to comment #9)
> I guess the patch is very trivial, since KABI promised the availability
> of "scsi_execute_async" we went ahead and used it.

I am not sure we made any promise that we would support it in the way you are using it, though. It was only meant to support what is in the upstream kernel. If your code is not upstream and you never try to get it there, it is very difficult for people to guess what you want to do.
LSI, have you guys found a workaround? Given your usage comments upstream, the only thing I could think of is to clone the bios and bvecs from incoming requests, stick the clones on a new request, then send the new request with blk_execute_rq_nowait. If the request did not have a bio, then you would have to set the data field to the data field of the incoming request (this would not work for scatterlists, only for data buffers).
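To make the cloning suggestion above a little more concrete, here is a minimal userspace sketch in C. It is only a toy model and not kernel code: `struct seg`, `struct toy_bio`, `struct toy_request`, `clone_request`, and `free_clone` are all hypothetical names invented for illustration, and a real driver would use the kernel's own bio and request interfaces instead. The sketch shows only the shape of the idea: the clone gets its own copy of the segment descriptors (buffer pointer, offset, length) while the data itself stays shared, and a request with no bio falls back to copying the flat data pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a bio_vec: one segment of I/O data. */
struct seg {
    void *page;          /* backing buffer, shared with the original */
    unsigned int offset; /* byte offset of the data inside the buffer */
    unsigned int len;    /* number of data bytes in this segment */
};

/* Hypothetical stand-in for a bio: an array of segments. */
struct toy_bio {
    struct seg *vecs;
    size_t nvecs;
};

/* Hypothetical stand-in for a request: a bio, or a flat data buffer. */
struct toy_request {
    struct toy_bio *bio; /* NULL when only a flat buffer is attached */
    void *data;
    unsigned int data_len;
};

/* Copy the segment descriptors of src; the data buffers are not copied. */
static struct toy_bio *clone_bio(const struct toy_bio *src)
{
    struct toy_bio *b;

    assert(src != NULL);
    b = malloc(sizeof(*b));
    if (!b)
        return NULL;
    b->nvecs = src->nvecs;
    /* Allocate at least one slot so a zero-segment bio is not an error. */
    b->vecs = malloc((src->nvecs ? src->nvecs : 1) * sizeof(*b->vecs));
    if (!b->vecs) {
        free(b);
        return NULL;
    }
    if (src->nvecs)
        memcpy(b->vecs, src->vecs, src->nvecs * sizeof(*b->vecs));
    return b;
}

/* Build a new request that describes the same data as the incoming one. */
struct toy_request *clone_request(const struct toy_request *in)
{
    struct toy_request *out = calloc(1, sizeof(*out));

    if (!out)
        return NULL;
    if (in->bio) {
        out->bio = clone_bio(in->bio);
        if (!out->bio) {
            free(out);
            return NULL;
        }
    } else {
        /* No bio: reuse the flat data buffer of the incoming request. */
        out->data = in->data;
        out->data_len = in->data_len;
    }
    return out;
}

/* Release a clone; the shared data buffers belong to the original. */
void free_clone(struct toy_request *rq)
{
    if (!rq)
        return;
    if (rq->bio) {
        free(rq->bio->vecs);
        free(rq->bio);
    }
    free(rq);
}
```

In this model the caller would pass the incoming request to `clone_request`, submit the returned clone, and call `free_clone` from the completion path. Because only the descriptors are copied, any offset inside a segment is preserved exactly as it arrived.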
Mike, currently we are working around the problem by not calling scsi_execute_async and scsi_execute_req. Soon we will try to follow your suggestion and clone the bios and bvecs. This problem can be closed, since in the future you are going to remove scsi_execute_async entirely. Thanks for all your help. regards, Sudhir
Closing per Comment #15.