There is a memory leak in the scsi add-single-device path. The leak drops queue_depth command blocks for each upper level driver when a new device is added. This means around 16k lost every time a new device is added. In the hardcoded scan routine scsi_build_commandblocks is called once for each upper level driver after the first that sets attached. This is the main problem. A secondary problem is that scsi_build_commandblocks unconditionally overwrites the SDpnt->device_queue pointer so already allocated commands on that list are simply lost. Here is a patch to address the first issue: scsi_scan_cmnd_leak.rhel.diff ----------------------------------------- --- /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi_scan.c Tue May 4 07:09:41 2004 +++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi_scan.c Fri Oct 29 12:42:23 2004 @@ -440,7 +440,7 @@ for (sdtpnt = scsi_devicelist; sdtpnt; sdtpnt = sdtpnt->next) { if (sdtpnt->attach) { (*sdtpnt->attach) (oldSDpnt); - if (oldSDpnt->attached) { + if (oldSDpnt->attached && !oldSDpnt->has_cmdblocks) { scsi_build_commandblocks(oldSDpnt); if (0 == oldSDpnt->has_cmdblocks) { printk("scan_scsis: DANGER, no command blocks\n"); --------------------------------------- This is the minimal change. This fixes the current problem. It may also be worth fixing scsi_buildcommand_blocks. Below are two possible patches that do this in different ways. scsi_cmnd_leak.rhel1.diff --------------------------------------- --- /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c Tue May 4 07:09:41 2004 +++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c Fri Oct 29 12:40:25 2004 @@ -1475,6 +1475,9 @@ Scsi_Cmnd *SCpnt; request_queue_t *q = &SDpnt->request_queue; + if (SDpnt->has_cmdblocks) + return; + /* * Init the spin lock that protects alloc/free of command blocks */ ---------------------------------------- scsi_cmnd_leak.rhel2.diff ---------------------------------------- diff -ruN -X ../patches/dontdiff /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c --- /home/osbuild/RHEL3.1/linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c Tue May 4 07:09:41 2004 +++ linux-2.4.21-9.ELsmp/drivers/scsi/scsi.c Wed Oct 27 09:04:03 2004 @@ -1472,10 +1472,17 @@ unsigned long flags; struct Scsi_Host *host = SDpnt->host; int j; - Scsi_Cmnd *SCpnt; + Scsi_Cmnd *SCpnt , *SCnext; request_queue_t *q = &SDpnt->request_queue; - - /* + + if (SDpnt->has_cmdblocks){ + for (SCpnt = SDpnt->device_queue; SCpnt; SCpnt = SCnext) { + SDpnt->device_queue = SCnext = SCpnt->next; + list_del(&SCpnt->sc_list); + kfree((char *) SCpnt); + } + } + /* * Init the spin lock that protects alloc/free of command blocks */ spin_lock_init(&SDpnt->device_request_lock); ----------------------------------------- These both have the same result (not leaking memory). The second can take into account that the current number of commands is not correct and build a correct number. The first assumes the commands are okay. ---------- Action by: smorin Issue Registered ---------- Action by: mgahagan Wendy, RHEL 3 U4? What about 2.1? They are reporting this with both 2.1 and 3 (both patches in the top of the ticket) ---------- Action by: mgahagan We already have a fix for this for RHEL 3 scheduled for U4, I have included the RHEL 3 U4 patch below for reference. # This is a BitKeeper generated diff -Nru style patch. # # ChangeSet # 2004/08/31 13:47:52-04:00 dledford compaq xsintricity com # drivers/scsi/scsi.c: # Don't re-init existing command blocks # # drivers/scsi/scsi.c # 2004/08/31 13:47:51-04:00 dledford compaq xsintricity com +6 -0 # Don't re-init existing command blocks # diff -Nru a/drivers/scsi/scsi.c b/drivers/scsi/scsi.c --- a/drivers/scsi/scsi.c 2004-08-31 13:48:48 -04:00 +++ b/drivers/scsi/scsi.c 2004-08-31 13:48:48 -04:00 @@ -1477,6 +1477,12 @@ request_queue_t *q = &SDpnt->request_queue; /* + * Only init things once. + */ + if (SDpnt->has_cmdblocks) + return; + + /* * Init the spin lock that protects alloc/free of command blocks */ spin_lock_init(&SDpnt->device_request_lock); ---------- Action by: peterm Escalated to Bugzilla
*** This bug has been marked as a duplicate of 131521 ***
A fix for this problem has just been committed to the RHEL3 U4 patch pool this evening (in kernel version 2.4.21-20.4.EL).
*** Bug 138941 has been marked as a duplicate of this bug. ***
Created attachment 111186 [details] Proposed patch The attached patch should solve the regression that was found in testing of the original AS2.1 patch. In case the people have forgotten, the original regression was that multi-lun devices were no longer properly probed with the original memleak fix patch. Patch being submitted for internal review, although testing will have to be done by someone else since I don't have any multi-lun devices (I believe the original regression was found when testing on a MegaRAID controller with multiple logical volumes).
Patch partially tested (I don't have a multilun setup, but I tested that it works in non-multilun configurations) and submitted for review and further testing.
I applied the old patch and re-created the regression on a megaraid controller (as reported in BZ 140454). Then I applied the new patch and verified that the problem does not occur. This patch looks good to go.
A fix for this problem has just been committed to the RHEL2.1 U7 patch pool this evening (in kernel version 2.4.9-61).
Due to operator error, this patch in fact did *not* make it into the U7 kernel build, so it is now slated for U8.
I just verified that this patch *is* in the kernel source pool that will become U8.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-529.html