Bug 55420
Summary: | Kernel panic with > 128 drives | |
---|---|---|---
Product: | [Retired] Red Hat Linux | Reporter: | Rob Landry <rlandry>
Component: | kernel | Assignee: | Pete Zaitcev <zaitcev>
Status: | CLOSED ERRATA | QA Contact: | Brock Organ <borgan>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 7.1 | CC: | lesliek, tsombakos_mark
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | i386 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2002-01-16 18:34:26 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description
Rob Landry
2001-10-30 22:46:46 UTC
Rob, do you have a local machine with 128 drives? :)

No such luck. I'm not sure I have access to a machine with more than 7.

Can we ask EMC people if a reproduction with a serial console is feasible? I am dying to see dmesg (which obviously cannot be obtained by normal means in this case).

Emulator (scsi_debug) does not reproduce the problem. I need info from EMC about this, like I said before. Here's what I've got:

[root@pentabug /]# grep Direct-Access /proc/scsi/scsi| wc
    130     780    8580
[root@pentabug /]# dmesg | tail
 sdam: sdam1 sdam2 sdam3
SCSI device sdan: 8418061 512-byte hdwr sectors (4310 MB)
sdan: Write Protect is off
 sdan: sdan1 sdan2 sdan3
SCSI device sdao: 8418061 512-byte hdwr sectors (4310 MB)
sdao: Write Protect is off
 sdao: sdao1 sdao2 sdao3
SCSI device sdap: 8418061 512-byte hdwr sectors (4310 MB)
sdap: Write Protect is off
 sdap: sdap1 sdap2 sdap3
[root@pentabug /]#

Created attachment 38844 [details]
Add variable number of HBAs to scsi_debug.c
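
For readers without a shell on the affected machine, the drive count quoted above (`grep Direct-Access /proc/scsi/scsi | wc`) can be approximated in C. This is only a sketch: the function name `count_direct_access` and the idea of passing the listing in as a string are assumptions made for illustration and are not part of the bug report or its attachments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the lines of a /proc/scsi/scsi style listing that mention
 * "Direct-Access", mirroring the line count of `grep Direct-Access | wc`.
 * `text` is a NUL-terminated copy of the listing (hypothetical input;
 * a real tool would read the file itself).
 */
size_t count_direct_access(const char *text)
{
    const char *needle = "Direct-Access";
    size_t count = 0;

    assert(text != NULL);
    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) : strlen(text);
        const char *hit = strstr(text, needle);

        /* Count the line once if a match starts and ends inside it. */
        if (hit != NULL && (size_t)(hit - text) + strlen(needle) <= len)
            count++;

        /* Advance past this line and its newline, if any. */
        text += len;
        if (*text == '\n')
            text++;
    }
    return count;
}
```

Each disk contributes one `Type: Direct-Access` line to the listing, so the line count equals the number of disks the SCSI layer has seen, whether or not sd managed to attach them.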
Created attachment 39124 [details]
Oops in account_io_start().
Created attachment 39125 [details]
.data corruption with kfree
Guys, I MUST know the exact kernel version that EMC were running, even if the oops traceback is long gone. I am sorry to say it, but "2.boot a 2.4 kernel" does not make a sufficient bug report, because two fixes that I attached are in 2.4.16 (some are before that).

We were using 2.4.9-7. I have not gotten the boot info via a serial console because I don't have the time to research how to do that. If you'd like to email me the procedure, I'd appreciate it.

This is the patch we've used to get around the problem. It effectively limits sd_mod from seeing more than 16 drives - fine for our application, but a hack nonetheless.

*** drivers/scsi/sd.c.orig Tue Nov 6 12:29:46 2001
--- drivers/scsi/sd.c Tue Nov 6 12:30:31 2001
***************
*** 1077,1082 ****
--- 1077,1084 ----
      if (sd_template.dev_max > N_SD_MAJORS * SCSI_DISKS_PER_MAJOR)
          sd_template.dev_max = N_SD_MAJORS * SCSI_DISKS_PER_MAJOR;

+     sd_template.dev_max = 16;
+
      if (!sd_registered) {
          for (i = 0; i < N_USED_SD_MAJORS; i++) {
              if (devfs_register_blkdev(SD_MAJOR(i), "sd", &sd_fops)) {
***************
*** 1271,1277 ****
              break;

      if (i >= sd_template.dev_max)
!         panic("scsi_devices corrupt (sd)");

      rscsi_disks[i].device = SDp;
      rscsi_disks[i].has_part_table = 0;
--- 1273,1279 ----
              break;

      if (i >= sd_template.dev_max)
!         return (1);

      rscsi_disks[i].device = SDp;
      rscsi_disks[i].has_part_table = 0;

Created attachment 39244 [details]
One-line version of the same, for EMC testing
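
The two checks touched by the workaround patch can be illustrated outside the kernel. The sketch below is a user-space approximation, not the code from any attachment: `struct disk_slot`, `clamp_dev_max`, and `attach_disk` are hypothetical names, and the values 8 and 16 for `N_SD_MAJORS` and `SCSI_DISKS_PER_MAJOR` are assumed from 2.4-era sources (the report shows only the macro names). Under that assumption the table tops out at 128 disks, which matches the threshold in the bug summary.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed 2.4-era values; the report only shows the macro names. */
#define N_SD_MAJORS          8
#define SCSI_DISKS_PER_MAJOR 16
#define SD_MAX_DISKS (N_SD_MAJORS * SCSI_DISKS_PER_MAJOR)

/* Hypothetical stand-in for one rscsi_disks[] entry. */
struct disk_slot {
    const void *device;   /* NULL means the slot is free */
};

/* Clamp a requested table size to what the majors can address. */
int clamp_dev_max(int requested)
{
    return requested > SD_MAX_DISKS ? SD_MAX_DISKS : requested;
}

/*
 * Find a free slot and attach `dev` to it.
 * Returns the slot index, or -1 with a diagnostic when the table is
 * full, rather than calling panic() as the unpatched code did.
 * (The workaround patch returns 1; -1 is a choice made for this sketch.)
 */
int attach_disk(struct disk_slot *table, int dev_max, const void *dev)
{
    int i;

    assert(table != NULL && dev != NULL);
    for (i = 0; i < dev_max; i++)
        if (table[i].device == NULL)
            break;

    if (i >= dev_max) {
        fprintf(stderr, "sd: no free slot (dev_max=%d)\n", dev_max);
        return -1;
    }
    table[i].device = dev;
    return i;
}
```

Reporting the failure and returning an error leaves the machine up with the disks that did fit, which is the behaviour the workaround aims for, although the hard cap of 16 in the patch is far stricter than the clamp shown here.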
Created attachment 39767 [details]
All relevant fixes, but not the final form
I think I plugged all holes here, so the hacking part is over. Three issues remain.

1. A misdesign or a bitrot around ->attached and failing sd_init require a change that is visible to drivers, so it has a customer-visible impact. To be reviewed with dledford & sct.
2. How do we package this so that EMC gets the change (what release or errata -- depends on #1)? Also, do we put this into HEAD or filter it through Marcelo and the proper review process?
3. Allocation of huge arrays in sd.c must be changed.

The panic that the EMC patch plugged is an internal error, possibly caused by a reuse of freed memory (rscsi_disks). With my fixes they must never hit it. I changed it to a printout anyway, for ease of reporting and debugging.

pensacola 2.4.9-18 (2.4.9-17.5)
Rawhide 2.4.16-1 (2.4.16-0.12)

The bug is lined up to be closed as RAWHIDE, with the important offshoot of the hd_struct array allocation change. This must be made a standalone RFE or folded into Doug's work. Any customer satisfaction or last minute notes?

I see patches for 2.4.7 and one for 2.4.16. If we're using the latest RH kernel (not rawhide), can I apply the 2.4.7 patch? Is there a "final form" version of the patch available? Thanks, "zaitcev".

Working with the customer by e-mail. Mark is testing 2.4.16-0.13.

I pushed the fix to Marcelo for 2.4.18, so unless someone finds a hole, it will be there. Also, I worked on the split allocation of hd_struct's, but that caused very suspicious and tricky oopses, so I have to suspend it for a while. I may open a new RFE to track that. Mark did not tell me if he is ever bitten by sd_mod refusing to load, so perhaps the split allocation is not that urgent.

RFE #58442

2.4.9-21 is out. I do not know yet if the installation images were updated with it.

See also: Bug 59370. It seems we have more issues to resolve, but the panic around corrupt rscsi_disks is now behind us.
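
The "split allocation" idea mentioned in the comments can be pictured as replacing one huge contiguous array with one smaller chunk per major. The sketch below is a user-space illustration only and does not reproduce the suspended kernel work: `struct part_info` is a simplified, hypothetical stand-in for `hd_struct`, the sizes are assumed, and `calloc`/`free` stand in for kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

#define N_MAJORS         8    /* assumed, as in 2.4-era sd.c */
#define MINORS_PER_MAJOR 256  /* assumed: 16 disks x 16 minors each */

/* Hypothetical, simplified stand-in for the kernel's struct hd_struct. */
struct part_info {
    long start_sect;
    long nr_sects;
};

/* One independently allocated chunk per major instead of one huge array. */
struct part_table {
    struct part_info *chunk[N_MAJORS];
};

/* Release every chunk and clear the pointers so none can be reused. */
void part_table_free(struct part_table *t)
{
    int m;

    assert(t != NULL);
    for (m = 0; m < N_MAJORS; m++) {
        free(t->chunk[m]);
        t->chunk[m] = NULL;
    }
}

/* Allocate chunks for the first `majors` majors; 0 on success, -1 on failure. */
int part_table_alloc(struct part_table *t, int majors)
{
    int m;

    assert(t != NULL && majors >= 0 && majors <= N_MAJORS);
    for (m = 0; m < N_MAJORS; m++)
        t->chunk[m] = NULL;
    for (m = 0; m < majors; m++) {
        t->chunk[m] = calloc(MINORS_PER_MAJOR, sizeof(struct part_info));
        if (t->chunk[m] == NULL) {
            part_table_free(t);   /* undo the partial allocation */
            return -1;
        }
    }
    return 0;
}

/* Look up the entry for a flat minor index, or NULL if it is not allocated. */
struct part_info *part_lookup(struct part_table *t, int index)
{
    int major = index / MINORS_PER_MAJOR;

    if (index < 0 || major >= N_MAJORS || t->chunk[major] == NULL)
        return NULL;
    return &t->chunk[major][index % MINORS_PER_MAJOR];
}
```

Smaller per-major chunks are easier to satisfy than a single large request, which is presumably why the large allocation is tied to sd_mod refusing to load. Clearing pointers on free also guards against the kind of reuse of freed memory suspected for rscsi_disks.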