Bug 138517 (IT_53295)

Summary: Memory Leak in AS 2.1 and 3.0 kernels
Product: Red Hat Enterprise Linux 2.1 Reporter: Peter Martuccelli <peterm>
Component: kernelAssignee: Doug Ledford <dledford>
Status: CLOSED ERRATA QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 2.1CC: benl, coughlan, jparadis, riel, tao
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: RHSA-2005-529 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2005-08-25 13:29:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 143573    
Attachments:
Description Flags
Proposed patch none

Description Peter Martuccelli 2004-11-09 18:56:09 UTC
There is a memory leak in the scsi add-single-device path.
The leak drops queue_depth command blocks for each upper level driver
when a new device is added. This means around 16k lost every time a
new device is added.

In the hardcoded scan routine scsi_build_commandblocks is called once
for each upper level driver after the first that sets attached. This
is the main problem. A secondary problem is that
scsi_build_commandblocks unconditionally overwrites the
SDpnt->device_queue pointer so already allocated commands on that
list are simply lost.


Here is a patch to address the first issue:


scsi_scan_cmnd_leak.rhel.diff
-----------------------------------------
---
/home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi_scan.c	Tue
May  4 07:09:41 2004
+++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi_scan.c	Fri Oct 29 12:42:23 2004
@@ -440,7 +440,7 @@
 			for (sdtpnt = scsi_devicelist; sdtpnt; sdtpnt = sdtpnt->next) {
 				if (sdtpnt->attach) {
 					(*sdtpnt->attach) (oldSDpnt);
-					if (oldSDpnt->attached) {
+					if (oldSDpnt->attached && !oldSDpnt->has_cmdblocks) {
 						scsi_build_commandblocks(oldSDpnt);
 						if (0 == oldSDpnt->has_cmdblocks) {
 							printk("scan_scsis: DANGER, no command blocks\n");


---------------------------------------

This is the minimal change. This fixes the current problem.



It may also be worth fixing scsi_buildcommand_blocks.
Below are two possible patches that do this in different ways.


scsi_cmnd_leak.rhel1.diff
---------------------------------------
--- /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c	Tue May
4 07:09:41 2004
+++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c	Fri Oct 29 12:40:25 2004
@@ -1475,6 +1475,9 @@
 	Scsi_Cmnd *SCpnt;
 	request_queue_t *q = &SDpnt->request_queue;

+        if (SDpnt->has_cmdblocks)
+                return;
+
 	/*
 	 * Init the spin lock that protects alloc/free of command blocks
 	 */

----------------------------------------



scsi_cmnd_leak.rhel2.diff
----------------------------------------
diff -ruN -X ../patches/dontdiff
/home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c
linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c
--- /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c	Tue May
4 07:09:41 2004
+++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c	Wed Oct 27 09:04:03 2004
@@ -1472,10 +1472,17 @@
 	unsigned long flags;
 	struct Scsi_Host *host = SDpnt->host;
 	int j;
-	Scsi_Cmnd *SCpnt;
+	Scsi_Cmnd *SCpnt , *SCnext;
 	request_queue_t *q = &SDpnt->request_queue;
-
-	/*
+
+        if (SDpnt->has_cmdblocks){
+                for (SCpnt = SDpnt->device_queue; SCpnt; SCpnt =
SCnext) {
+                        SDpnt->device_queue = SCnext = SCpnt->next;
+                        list_del(&SCpnt->sc_list);
+                        kfree((char *) SCpnt);
+                }
+        }
+        /*
 	 * Init the spin lock that protects alloc/free of command blocks
 	 */
 	spin_lock_init(&SDpnt->device_request_lock);


-----------------------------------------



These both have the same result (not leaking memory). The second can
take into account that the current number of commands is not correct
and build a correct number. The first assumes the commands are okay.

----------
Action by: smorin
Issue Registered
----------
Action by: mgahagan
Wendy, RHEL 3 U4?

What about 2.1?

They are reporting this with both 2.1 and 3 (both patches in the top
of the ticket)




----------
Action by: mgahagan
We already have a fix for this for RHEL 3 scheduled for U4, I have
included the RHEL 3 U4 patch below for reference.

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
#   2004/08/31 13:47:52-04:00 dledford compaq xsintricity com
#   drivers/scsi/scsi.c:
#    Don't re-init existing command blocks
#
# drivers/scsi/scsi.c
#   2004/08/31 13:47:51-04:00 dledford compaq xsintricity com +6 -0
#   Don't re-init existing command blocks
#
diff -Nru a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c
--- a/drivers/scsi/scsi.c 2004-08-31 13:48:48 -04:00
+++ b/drivers/scsi/scsi.c 2004-08-31 13:48:48 -04:00
@@ -1477,6 +1477,12 @@
request_queue_t *q = &SDpnt->request_queue;

/*
+ * Only init things once.
+ */
+ if (SDpnt->has_cmdblocks)
+ return;
+
+ /*
* Init the spin lock that protects alloc/free of command blocks
*/
spin_lock_init(&SDpnt->device_request_lock);



----------
Action by: peterm
Escalated to Bugzilla

Comment 1 Peter Martuccelli 2004-12-08 21:56:22 UTC

*** This bug has been marked as a duplicate of 131521 ***

Comment 3 Peter Martuccelli 2005-02-08 20:36:10 UTC
A fix for this problem has just been committed to the RHEL3 U4
patch pool this evening (in kernel version 2.4.21-20.4.EL).



Comment 8 Doug Ledford 2005-02-17 23:55:06 UTC
*** Bug 138941 has been marked as a duplicate of this bug. ***

Comment 9 Doug Ledford 2005-02-17 23:59:16 UTC
Created attachment 111186 [details]
Proposed patch

The attached patch should solve the regression that was found in testing of the
original AS2.1 patch.  In case the people have forgotten, the original
regression was that multi-lun devices were no longer properly probed with the
original memleak fix patch.  Patch being submitted for internal review,
although testing will have to be done by someone else since I don't have any
multi-lun devices (I believe the original regression was found when testing on
a MegaRAID controller with multiple logical volumes).

Comment 10 Doug Ledford 2005-02-18 07:13:23 UTC
Patch partially tested (I don't have a multilun setup, but I tested
that it works in non-multilun configurations) and submitted for review
and further testing.

Comment 11 Tom Coughlan 2005-02-28 14:46:41 UTC
I applied the old patch and re-created the regression on a megaraid controller
(as reported in BZ 140454). Then I applied the new patch and verified that the
problem does not occur. This patch looks good to go. 

Comment 12 Jim Paradis 2005-03-03 03:54:07 UTC
A fix for this problem has just been committed to the RHEL2.1 U7
patch pool this evening (in kernel version 2.4.9-61).


Comment 20 Jim Paradis 2005-04-26 21:46:05 UTC
Due to operator error, this patch in fact did *not* make it into the U7 kernel
build, so it is now slated for U8.


Comment 22 Jim Paradis 2005-05-23 21:28:15 UTC
I just verified that this patch *is* in the kernel source pool that will become U8.


Comment 24 Red Hat Bugzilla 2005-08-25 13:29:27 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-529.html