Bug 55420 - Kernel panic with > 128 drives
Status: CLOSED ERRATA
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i386 Linux
Priority: high  Severity: high
Assigned To: Pete Zaitcev
QA Contact: Brock Organ
Reported: 2001-10-30 17:46 EST by Rob Landry
Modified: 2005-10-31 17:00 EST (History)
Doc Type: Bug Fix
Last Closed: 2002-01-16 13:34:26 EST

Attachments
Add variable number of HBAs to scsi_debug.c (6.32 KB, patch)
2001-11-27 15:18 EST, Pete Zaitcev
Oops in account_io_start(). (661 bytes, patch)
2001-11-29 22:07 EST, Pete Zaitcev
.data corruption with kfree (2.96 KB, patch)
2001-11-29 22:17 EST, Pete Zaitcev
One-line version of the same, for EMC testing (343 bytes, patch)
2001-11-30 17:30 EST, Pete Zaitcev
All relevant fixes, but not the final form (10.78 KB, patch)
2001-12-05 16:42 EST, Pete Zaitcev

Description Rob Landry 2001-10-30 17:46:46 EST
From Bugzilla Helper:
User-Agent: Mozilla/4.78 [en] (X11; U; Linux 2.4.7-10 i686)

Description of problem:
With the 2.4 kernel, the system panics when it attempts to assign more than 128
SCSI drives.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Attach more than 128 SCSI drives to a machine.
2. Boot a 2.4 kernel.

Actual Results:  kernel panic

Expected Results:  The 2.2 kernel would continue, but you could only access
up to 128 drives.  That would be acceptable, but a panic is not.

Additional info:
Comment 1 Bill Nottingham 2001-10-30 22:50:44 EST
Rob, do you have a local machine with 128 drives? :)
Comment 2 Rob Landry 2001-10-31 10:26:08 EST
No such luck.  I'm not sure I have access to a machine with more than 7.
Comment 3 Pete Zaitcev 2001-11-08 17:16:03 EST
Can we ask EMC people if a reproduction with a serial console
is feasible? I am dying to see dmesg (which obviously cannot be
obtained by normal means in this case).
Comment 4 Pete Zaitcev 2001-11-10 18:02:27 EST
Emulator (scsi_debug) does not reproduce the problem.
I need info from EMC about this, like I said before.
Here's what I've got:

[root@pentabug /]# grep Direct-Access /proc/scsi/scsi| wc 
    130     780    8580
[root@pentabug /]# dmesg | tail
 sdam: sdam1 sdam2 sdam3
SCSI device sdan: 8418061 512-byte hdwr sectors (4310 MB)
sdan: Write Protect is off
 sdan: sdan1 sdan2 sdan3
SCSI device sdao: 8418061 512-byte hdwr sectors (4310 MB)
sdao: Write Protect is off
 sdao: sdao1 sdao2 sdao3
SCSI device sdap: 8418061 512-byte hdwr sectors (4310 MB)
sdap: Write Protect is off
 sdap: sdap1 sdap2 sdap3
[root@pentabug /]# 
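
The device names in the dmesg tail above can be related to disk indexes with a short sketch. It assumes the conventional sd naming scheme (sda through sdz, then sdaa, sdab, and so on); the function name `sd_name` is hypothetical and is not taken from the driver.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format the name of the disk with zero-based index disknum into buf.
 * Assumes the conventional scheme: sda..sdz, then sdaa, sdab, ...
 * Only one- and two-letter suffixes are handled (indexes 0..701).
 */
static void sd_name(unsigned int disknum, char *buf, size_t len)
{
    assert(disknum < 26 + 26 * 26);

    if (disknum < 26)
        snprintf(buf, len, "sd%c", (int)('a' + disknum));
    else
        snprintf(buf, len, "sd%c%c",
                 (int)('a' + disknum / 26 - 1),   /* first letter  */
                 (int)('a' + disknum % 26));      /* second letter */
}
```

Under this scheme, index 41 (the 42nd disk) formats as sdap, the last name in the output above.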
Comment 5 Pete Zaitcev 2001-11-27 15:18:13 EST
Created attachment 38844
Add variable number of HBAs to scsi_debug.c
Comment 6 Pete Zaitcev 2001-11-29 22:07:27 EST
Created attachment 39124
Oops in account_io_start().
Comment 7 Pete Zaitcev 2001-11-29 22:17:36 EST
Created attachment 39125
.data corruption with kfree
Comment 8 Pete Zaitcev 2001-11-29 23:01:07 EST
Guys, I MUST know the exact kernel version that EMC were running,
even if the oops traceback is long gone. I am sorry to say it,
but "2.boot a 2.4 kernel" does not make a sufficient bug report,
because two fixes that I attached are in 2.4.16 (some are before that).
Comment 9 Mark Tsombakos 2001-11-30 09:36:21 EST
We were using 2.4.9-7.  I have not gotten the boot info via a serial console because I
don't have the time to research how to do that.  If you'd like to email me the procedure,
I'd appreciate it.

This is the patch we've used to get around the problem.  It effectively limits sd_mod from
seeing more than 16 drives - fine for our application, but a hack nonetheless.

*** drivers/scsi/sd.c.orig      Tue Nov  6 12:29:46 2001
--- drivers/scsi/sd.c   Tue Nov  6 12:30:31 2001
***************
*** 1077,1082 ****
--- 1077,1084 ----
        if (sd_template.dev_max > N_SD_MAJORS * SCSI_DISKS_PER_MAJOR)
                sd_template.dev_max = N_SD_MAJORS * SCSI_DISKS_PER_MAJOR;

+       sd_template.dev_max = 16;
+
        if (!sd_registered) {
                for (i = 0; i < N_USED_SD_MAJORS; i++) {
                        if (devfs_register_blkdev(SD_MAJOR(i), "sd", &sd_fops)) {
***************
*** 1271,1277 ****
                        break;

        if (i >= sd_template.dev_max)
!               panic("scsi_devices corrupt (sd)");

        rscsi_disks[i].device = SDp;
        rscsi_disks[i].has_part_table = 0;
--- 1273,1279 ----
                        break;

        if (i >= sd_template.dev_max)
!               return (1);

        rscsi_disks[i].device = SDp;
        rscsi_disks[i].has_part_table = 0;
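
The second hunk above replaces a panic with an error return when no free slot is left in the disk table. A minimal userspace sketch of that pattern follows. The names `disk_slot`, `disk_table`, `attach_disk`, and `MAX_DISKS` are hypothetical and not taken from sd.c, and the values 8 and 16 are assumptions chosen so that the product matches the 128-drive limit in the bug title.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values: 8 majors x 16 disks each gives the 128-disk limit. */
#define N_MAJORS        8
#define DISKS_PER_MAJOR 16
#define MAX_DISKS       (N_MAJORS * DISKS_PER_MAJOR)

/* Hypothetical stand-in for one entry of the driver's disk table. */
struct disk_slot {
    const void *device;   /* NULL means the slot is free */
};

static struct disk_slot disk_table[MAX_DISKS];

/*
 * Claim the first free slot for dev.
 * Returns 0 on success, 1 if the table is full (instead of panicking).
 */
static int attach_disk(const void *dev)
{
    size_t i;

    assert(dev != NULL);

    /* Linear scan for the first unused entry, as in the patched code. */
    for (i = 0; i < MAX_DISKS; i++)
        if (disk_table[i].device == NULL)
            break;

    if (i >= MAX_DISKS)
        return 1;   /* table full: refuse the disk, keep running */

    disk_table[i].device = dev;
    return 0;
}
```

In this sketch, the 129th call to `attach_disk` returns 1 and leaves the table untouched, which is the behaviour the reporter asked for in place of a panic.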
Comment 10 Pete Zaitcev 2001-11-30 17:30:04 EST
Created attachment 39244
One-line version of the same, for EMC testing
Comment 11 Pete Zaitcev 2001-12-05 16:42:40 EST
Created attachment 39767
All relevant fixes, but not the final form
Comment 12 Pete Zaitcev 2001-12-05 16:55:30 EST
I think I plugged all holes here, so the hacking part is over.
Three issues remain.
1. A misdesign or a bitrot around ->attached and failing sd_init
   require a change that is visible to drivers, so it's a
   customer visible impact. To be reviewed with dledford & sct.
2. How do we package this so that EMC gets the change
   (what release or errata -- depends on #1)?
   Also, do we put this into HEAD or filter through Marcelo
   and the proper review process?
3. Allocation of huge arrays in sd.c must be changed.

The panic that the EMC patch plugged is an internal error,
possibly caused by a reuse of freed memory (rscsi_disks).
With my fixes they must never hit it. I changed it to a
printout anyways, for ease of reporting and debugging.
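
One common way to guard against the kind of reuse of a freed table suspected above is to clear the pointer and the capacity together when the table is released, so a later attach fails cleanly. The userspace sketch below illustrates only that general idea; it is not the content of the attached patches, and `slot_table`, `table_init`, `table_release`, and `table_attach` are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical table of attached devices. */
struct slot_table {
    void **slots;      /* NULL when no table is allocated */
    size_t capacity;   /* 0 when no table is allocated */
};

/* Allocate n empty slots. Returns 0 on success, -1 on failure. */
static int table_init(struct slot_table *t, size_t n)
{
    t->slots = NULL;
    t->capacity = 0;
    if (n == 0)
        return -1;

    t->slots = calloc(n, sizeof(*t->slots));
    if (t->slots == NULL)
        return -1;

    t->capacity = n;
    return 0;
}

/* Free the table and reset both fields so stale use is detectable. */
static void table_release(struct slot_table *t)
{
    free(t->slots);
    t->slots = NULL;
    t->capacity = 0;
}

/* Store dev in the first free slot. Returns 0 on success, -1 otherwise. */
static int table_attach(struct slot_table *t, void *dev)
{
    size_t i;

    if (t->slots == NULL)
        return -1;   /* table was never set up, or already released */

    for (i = 0; i < t->capacity; i++) {
        if (t->slots[i] == NULL) {
            t->slots[i] = dev;
            return 0;
        }
    }
    return -1;       /* table full */
}
```

After `table_release`, every `table_attach` call returns -1 instead of writing through a dangling pointer.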

Comment 13 Pete Zaitcev 2001-12-12 15:23:25 EST
pensacola 2.4.9-18 (2.4.9-17.5)
Comment 14 Pete Zaitcev 2001-12-13 13:26:59 EST
Rawhide 2.4.16-1 (2.4.16-0.12)
Comment 15 Pete Zaitcev 2001-12-13 13:32:13 EST
The bug is lined up to be closed as RAWHIDE, with the important
offshoot of the hd_struct array allocation change. This must be
made a standalone RFE or folded into Doug's work.

Any customer satisfaction or last minute notes?
Comment 16 Mark Tsombakos 2001-12-13 13:47:51 EST
I see patches for 2.4.7 and one for 2.4.16.  If we're using the latest RH kernel
(not rawhide), can I apply the 2.4.7 patch?  Is there a "final form" version of
the patch available?  Thanks, "zaitcev".
Comment 17 Pete Zaitcev 2001-12-18 14:54:16 EST
Working with the customer by e-mail.
Mark is testing 2.4.16-0.13.
Comment 18 Pete Zaitcev 2002-01-16 13:20:15 EST
I pushed the fix to Marcelo for 2.4.18, so unless someone
finds a hole, it will be there.

Also, I worked on the split allocation of hd_struct's, but
that caused very suspicious and tricky oopses, so I have
to suspend it for a while. I may open a new RFE to track that.
Mark did not tell me if he is ever bitten by sd_mod refusing
to load, so perhaps the split allocation is not that urgent.
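
As an illustration of what a split allocation could look like in general, the sketch below replaces one large contiguous array with several smaller chunks that are allocated separately and indexed through a pointer table. It is a userspace sketch under stated assumptions, not the suspended change itself: `part_info`, `split_alloc`, `split_free`, `part_lookup`, and the chunk sizes are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define N_CHUNKS  8     /* assumed: one chunk per major */
#define PER_CHUNK 256   /* assumed: entries covered by one chunk */

/* Hypothetical stand-in for a per-partition record. */
struct part_info {
    unsigned long start_sect;
    unsigned long nr_sects;
};

static struct part_info *chunks[N_CHUNKS];

/* Release every chunk and clear the pointer table. */
static void split_free(void)
{
    int i;

    for (i = 0; i < N_CHUNKS; i++) {
        free(chunks[i]);
        chunks[i] = NULL;
    }
}

/* Allocate each chunk separately. Returns 0 on success, -1 on failure. */
static int split_alloc(void)
{
    int i;

    for (i = 0; i < N_CHUNKS; i++) {
        chunks[i] = calloc(PER_CHUNK, sizeof(*chunks[i]));
        if (chunks[i] == NULL) {
            split_free();   /* undo the partial allocation */
            return -1;
        }
    }
    return 0;
}

/* Map a flat index to its entry, or NULL if out of range or unallocated. */
static struct part_info *part_lookup(unsigned int index)
{
    struct part_info *chunk;

    if (index >= N_CHUNKS * PER_CHUNK)
        return NULL;

    chunk = chunks[index / PER_CHUNK];
    return chunk ? &chunk[index % PER_CHUNK] : NULL;
}
```

Each request is then PER_CHUNK entries rather than N_CHUNKS * PER_CHUNK, at the cost of one extra pointer lookup per access.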
Comment 19 Pete Zaitcev 2002-01-16 13:34:21 EST
RFE #58442
Comment 20 Pete Zaitcev 2002-01-24 12:41:25 EST
2.4.9-21 is out.

I do not know yet if the installation images were
updated with it.
Comment 21 Pete Zaitcev 2002-02-20 15:23:13 EST
See also: Bug 59370. It seems we have more issues to resolve,
but the panic around corrupt rscsi_disks is now behind us.
