230782 – Panic during the usage of API scsi_execute_async

Summary: Panic during the usage of API scsi_execute_async

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	urgent
Target Milestone:	---
Target Release:	---
Assignee:	Mike Christie
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	217103

Reported:	2007-03-02 20:48 UTC by Sudhir Dachepalli
Modified:	2008-02-29 19:23 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2007-05-16 18:00:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Description Sudhir Dachepalli 2007-03-02 20:48:53 UTC

Description of problem:
I am seeing the following panic on Red Hat Enterprise Linux 5.
Our driver is a multipath failover module that uses the
"scsi_execute_async" API to route I/O.
In earlier kernels we used the "scsi_do_req" API.


Version-Release number of selected component (if applicable):


How reproducible:
Very easily reproducible.

Steps to Reproduce:
1. Load the MPP failover driver.
2. Discover devices.
3. Make filesystems.
4. Start an I/O application that issues filesystem I/O.
5. The panic occurs within 5 minutes.
  
Actual results:

Panic Message
--------------
general protection fault: 0000 [1] SMP
last sysfs file: /block/sdbl/sdbl2/dev
CPU 2
Modules linked in: mppVhba(U) qla2xxx lpfc ata_piix libata mptspi scsi_transport_spi mptfc mptscsih scsi_transport_fc mptbase aacraid mppUpper(U) sg sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
Pid: 525, comm: mpp_dcr Not tainted 2.6.18-8.el5 #1
RIP: 0010:[<ffffffff800074ff>]  [<ffffffff800074ff>] kmem_cache_free+0x5f/0x1cb
RSP: 0018:ffff81007e2c9da0  EFLAGS: 00010286
RAX: c3c031c3c031c3c0 RBX: ffff81007df91900 RCX: 00000000000fe000
RDX: c3c033804031c3c0 RSI: 000001bc80000000 RDI: 00000000000007f0
RBP: ffff81007f553fc0 R08: 0000000000000800 R09: 000000000000ffff
R10: 0000000000000800 R11: ffffffff80042baf R12: ffff810037f27100
R13: 0000000000000000 R14: 0000000000000800 R15: ffff81007eddac88
FS:  0000000000000000(0000) GS:ffff81007ff1de40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005f03a5 CR3: 000000007f799000 CR4: 00000000000006e0
Process mpp_dcr (pid: 525, threadinfo ffff81007e2c8000, task ffff81007e09a100)
Stack:  ffff81007dd58b08 0000000000000800 ffff81007df91900 ffff81007f553fc0
 0000000000000000 ffff81007df91900 0000000000000800 ffffffff80040046
 ffff81007df91900 0000000000000800 ffff81007dd58b08 ffffffff88079310
Call Trace:
 [<ffffffff80040046>] bio_free+0x33/0x43
 [<ffffffff88079310>] :scsi_mod:scsi_execute_async+0x18a/0x3ac
 [<ffffffff88295343>] :mppVhba:mppLnx_do_queuecommand+0x76e/0x7ab
 [<ffffffff88293e49>] :mppVhba:mppLnx_scsi_done+0x0/0xa4
 [<ffffffff8828d6ce>] :mppVhba:mppLnx_dpc_handler+0xf5/0x218
 [<ffffffff8005bfe5>] child_rip+0xa/0x11
 [<ffffffff8828d5d9>] :mppVhba:mppLnx_dpc_handler+0x0/0x218
 [<ffffffff8005bfdb>] child_rip+0x0/0x11

Code: 8b 02 f6 c4 40 74 04 48 8b 52 10 8b 02 84 c0 78 0a 0f 0b 68
RIP  [<ffffffff800074ff>] kmem_cache_free+0x5f/0x1cb
 RSP <ffff81007e2c9da0>
 <0>Kernel panic - not syncing: Fatal exception

Expected results:
No panic should be noticed.

Additional info:
I found the following link describing a similar issue:
http://www.mail-archive.com/linux-scsi.org/msg04233.html

Comment 3 Mike Christie 2007-03-02 21:37:21 UTC

LSI, did you test the patch in this link:
http://marc.theaimsgroup.com/?l=linux-scsi&m=116464965414878&w=2

or are you just posting on that thread? From their debug output on the thread, it
looks like the patch will solve the problem.

LSI, you can also just use the block layer functions directly.

Comment 4 Sudhir Dachepalli 2007-03-02 21:44:35 UTC

Mike,

  I will try "Patch 1" from the link and update you accordingly. For
patching I am using a SLES 10 machine, hence I posted on a different thread.
I see the panics on both Red Hat 5 and SLES 10.

regards,
Sudhir

Comment 5 Sudhir Dachepalli 2007-03-02 21:47:51 UTC

Mike,

  I will try "Patch 1" from the link and update you accordingly. For
patching I am using a SLES 10 machine, hence I posted on a different thread.
I see the panics on both Red Hat 5 and SLES 10.

Since we started development during the Red Hat 5 beta, we removed our use of
"scsi_do_req" and scsi_request and went ahead with the "scsi_execute_async"
functionality. It is already too late in the program to make changes to use
block layer functions.

If we can work around the problem using the patch, we will go ahead with the
current changes and start heading in the direction you suggested, using block
layer functions.

regards,
Sudhir

Comment 6 Mike Christie 2007-03-02 22:04:00 UTC

Sudhir, how did you get that scatterlist with the offset at the end? Are
you using something like sg iovecs, or did you use multiple kmalloc'd
buffers? You may just want to use alloc_page or alloc_pages, as sg and st do now.
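
To make the question about offsets concrete, here is a minimal userspace sketch in C. It is not kernel code and not the check the kernel performs. It assumes 4 KiB pages and uses hypothetical names (model_sg, model_sg_first_gap) to model one plausible reading of the concern: segments that come from whole pages start at offset 0, while segments from slab-style allocations can start or end in the middle of a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u  /* assumption: 4 KiB pages, as on x86_64 */

/* Hypothetical userspace stand-in for one scatterlist entry. */
struct model_sg {
    uintptr_t addr;  /* start address of the segment */
    size_t    len;   /* segment length in bytes */
};

/* Offset of a segment's start within its page. */
static size_t model_sg_offset(const struct model_sg *sg)
{
    return (size_t)(sg->addr % MODEL_PAGE_SIZE);
}

/*
 * Return the index of the first entry that leaves a gap inside a page,
 * or -1 if the list is gap free: every entry except the first starts on
 * a page boundary, and every entry except the last ends on one.
 */
static int model_sg_first_gap(const struct model_sg *sgl, int nents)
{
    assert(sgl != NULL && nents > 0);
    for (int i = 0; i < nents; i++) {
        size_t off = model_sg_offset(&sgl[i]);
        size_t end = (off + sgl[i].len) % MODEL_PAGE_SIZE;

        if (i > 0 && off != 0)
            return i;   /* starts in the middle of a page */
        if (i < nents - 1 && end != 0)
            return i;   /* ends in the middle of a page */
    }
    return -1;
}
```

With page-aligned segments such as {0x10000, 4096} followed by {0x20000, 2048} the function returns -1. If the second segment instead starts at 0x20800, as a slab allocation might, it returns 1.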

Comment 7 Sudhir Dachepalli 2007-03-02 23:18:06 UTC

Mike,

 We register with scsi_host_template's queuecommand interface.
I am passing the request_buffer and request_bufflen that we get through the
scsi_cmnd from queuecommand directly to scsi_execute_async.

We are not using sg iovecs, and we are not allocating buffers for the scsi_cmnd.

Sudhir

Comment 9 Sudhir Dachepalli 2007-03-03 15:31:48 UTC

I applied the patch to SLES 10 and SLES 10 SP1, since the source code was
readily available, and recompiled only "scsi_mod". The problem had been
reproduced on those distributions as well.
With the patch, I/O has been running successfully for more than 12 hours on one
system running SLES 10 and another running SLES 10 SP1.

Is there documentation on how to get the RHEL source code and compile it?
I guess the patch is very trivial. Since KABI promised the availability
of "scsi_execute_async", we went ahead and used it.

What is the mechanism for getting the patch into Red Hat 5, given that the
release is very close?

Sudhir

Comment 10 Mike Christie 2007-03-04 00:19:47 UTC

(In reply to comment #9)
> I guess the patch is very trivial, since KABI promised the availability
> of "scsi_execute_async" we went ahead and used it.

I am not sure that we ever promised to support it the way you are using it,
though. It was only meant to support what is in the upstream kernel. If your
code is not upstream and you never try to get it there, it is very difficult for
people to guess what you want to do.

Comment 14 Mike Christie 2007-03-12 16:35:17 UTC

LSI, have you found a workaround? Given your usage comments upstream, the
only thing I can think of is to clone the bios and bvecs from incoming requests,
attach the clones to a new request, then send the new request with
blk_execute_rq_nowait. If the request does not have a bio, you would have to
set the data field to the data field of the incoming request (this would work
only for plain data buffers, not for scatterlists).
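
The shape of this workaround can be modeled in userspace C. The sketch below is only an illustration under stated assumptions: all of the model_* types and functions are hypothetical stand-ins, not the kernel's struct bio, struct bio_vec, or struct request. It shows the two cases described above: when the incoming request carries a bio, the vector table is copied while the data pages stay shared; when it does not, the flat data pointer is reused. Submission of the new request is left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for bio_vec, bio and request. */
struct model_bvec { void *page; unsigned len; unsigned offset; };
struct model_bio  { struct model_bvec *vecs; int cnt; };
struct model_req  { struct model_bio *bio; void *data; unsigned data_len; };

/* Copy the vector table; the pages it points to are shared, not copied. */
static struct model_bio *model_bio_clone(const struct model_bio *src)
{
    assert(src != NULL && src->cnt > 0);
    struct model_bio *bio = malloc(sizeof(*bio));
    if (!bio)
        return NULL;
    bio->vecs = malloc((size_t)src->cnt * sizeof(*bio->vecs));
    if (!bio->vecs) {
        free(bio);
        return NULL;
    }
    memcpy(bio->vecs, src->vecs, (size_t)src->cnt * sizeof(*bio->vecs));
    bio->cnt = src->cnt;
    return bio;
}

/* Build a new request from an incoming one. Returns 0 on success, -1 on
 * allocation failure. */
static int model_req_clone(struct model_req *dst, const struct model_req *src)
{
    assert(dst != NULL && src != NULL);
    dst->bio = NULL;
    dst->data = NULL;
    dst->data_len = 0;

    if (src->bio) {
        /* Case 1: the incoming request has a bio, so clone it. */
        dst->bio = model_bio_clone(src->bio);
        return dst->bio ? 0 : -1;
    }
    /* Case 2: no bio, so reuse the flat data buffer of the incoming request. */
    dst->data = src->data;
    dst->data_len = src->data_len;
    return 0;
}

/* Free what model_req_clone allocated; shared pages are left untouched. */
static void model_req_release(struct model_req *rq)
{
    if (rq->bio) {
        free(rq->bio->vecs);
        free(rq->bio);
        rq->bio = NULL;
    }
}
```

In this model the clone owns only its vector table, so releasing it never frees memory that belongs to the incoming request.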

Comment 15 Sudhir Dachepalli 2007-03-12 19:30:08 UTC

Mike,

 Currently we are working around the problem by not calling
scsi_execute_async and scsi_execute_req. Soon we will try to follow your
suggestion and clone the bios and bvecs.

This problem can be closed, since you are going to remove
scsi_execute_async entirely in the future.

Thanks for all your help.

regards,
Sudhir

Comment 16 Andrius Benokraitis 2007-05-16 18:00:15 UTC

Closing per Comment #15.
