Bug 30976
Summary: | [ABIT KT7-RAID]Crashed install just before copying install image to disk | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Edward Kuns <eddie.kuns> |
Component: | kernel | Assignee: | Arjan van de Ven <arjanv> |
Status: | CLOSED RAWHIDE | QA Contact: | Brock Organ <borgan> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 7.1 | ||
Hardware: | i386 | ||
OS: | Linux | ||
Doc Type: | Bug Fix | ||
Last Closed: | 2001-03-19 09:28:07 UTC | ||
Description
Edward Kuns
2001-03-07 19:11:12 UTC
This defect is considered MUST-FIX (show-stopper) for Florence GOLD.

Wow...that is strange. When you say, 'earlier betas', does that include Wolverine? If this is a CD that you burned yourself, can you check the md5sums and make sure they are the same as the ones on the ftp site?

I'm pretty sure that beta-3 was the last beta I tested by installing. With Wolverine, I just upgraded a few select packages (not the kernel or XFree86 or anything like that) to test those packages. So, no, I didn't test this with Wolverine. Would you like me to?

Oh, sorry, I also checked the md5sums of the CDs just before I burned them. I compared against the contents of the MD5SUM files in the ftp directories containing the Florence RC-2 iso images. Being too lazy to check each one individually, I ran "md5sum --check MD5SUM" and got an OK on each file. (Except the Japanese version of disk 1 :) Let me know if I can do anything to provide additional information.

Your original problem sounds like a kernel bug to me, but that's just a guess. There were a lot of bugs fixed between Fisher and Wolverine, so yeah, if you have the time, I think Wolverine will treat you better than Fisher. :)

Today, I reinstalled Florence RC-2 to test something else, and now even in text mode the installer freezes. The only hardware change made between last night and today is pulling one SCSI disk. (Which I doubt is the problem.) Also, an X install hung at the usual spot. It actually copied the install image to hard disk, *then* hung in the usual way. No change there. That's where text installs are hanging today.

By "kernel bug" do you mean a new one maybe introduced in RC-2? The installer is copying to my SCSI disk (not the one pulled). My controller uses the advansys driver. Were there many changes to "advansys" or to other SCSI stuff between Wolverine and RC-2? I'll try Wolverine ASAP ... hopefully tonight ... and will post the results here.

By kernel bug, I mean hard locking the machine.
Technically speaking, I don't think the installer should ever be able to hard lock the machine. If there was a kernel bug in beta 3, which you did the original install on, I was hoping that the bug had been fixed with Wolverine or RC-2. Apparently this is not the case. I'm reassigning the bug to the kernel team and changing the component to kernel.

Are you prepared to test an experimental kernel to see if a proposed patch fixes this? (You cannot install with this kernel, but you can install it on an existing install and try to load the disks as much as possible.)

Sure. I'll install that experimental kernel and then try to compile a kernel. That should load the disks, eh? At the same time, I could run some of the daily cron jobs that keep the disks busy. To really test this, I should do the same with the RC-2 kernel and see that it crashes. Point me at the experimental kernel and I'll do both.

Good call on this being a kernel issue. Wolverine loads perfectly, so that isolates this as being a 2.4.2 issue. On Wolverine, I installed the RC-2 kernel package and set up lilo so I could boot either. The root file system is on hde on the ATA100 controller, so that driver must be at fault in 2.4.2.

To reproduce the problem, I had to boot with a restricted amount of memory. (This machine has 256 Meg.) Booting the 2.4.2 kernel with "mem=32M" and then running in three separate windows:

    find / -name asdfadsf

caused this crash the first time. Not immediately, but after some amount of "find" activity. Doing the same (with "mem=32M") on the 2.4.1 kernel worked perfectly.

Compiling the kernel wasn't a good test, as I found out, because it generates less disk activity than CPU activity. I should have known that! I am downloading the test kernel right now.
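The reproduction recipe described above (boot with "mem=32M", then run several "find" commands at once) can also be scripted instead of typed into separate windows. The following is only a minimal sketch and is not something used in the thread: the helper names `build_find_commands` and `run_parallel` are made up for illustration, and it assumes a `find` binary on the PATH.

```python
import subprocess


def build_find_commands(count=3, root="/", pattern="asdfadsf"):
    """Build `count` identical find invocations, as in the manual test."""
    return [["find", root, "-name", pattern] for _ in range(count)]


def run_parallel(commands):
    """Start every command at once, wait for all, and return the exit codes."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for cmd in commands
    ]
    return [proc.wait() for proc in procs]
```

Calling `run_parallel(build_find_commands(count=3))` corresponds to the three-window test. On an affected machine the expected outcome is a hard lockup rather than a return value, so it should only be run where that is acceptable.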
I'll try it out ASAP.

dbench (ftp://ftp.samba.org/pub/tridge/dbench) is a nice tool for generating disk load. (Use it with a parameter of no more than 48 unless you have a boatload of RAM.) Thanks for testing!

Trying to install the experimental kernel, it complains about needing modutils newer than that in RC-2. Should I force it? I'll force it and see how things go.

Unfortunately, this test kernel does NOT solve the problem. Running several simultaneous "find" commands after booting with "mem=32M" still locks the system up solid. This simple test actually seems to generate a more solid disk load than dbench, but I've tried both. (Actually, since the "find" test crashed the test kernel, I didn't try dbench on it. I tried dbench on the Wolverine kernel, however, without any crash.)

ok... back to the drawing board then. Thanks for testing.

If this only occurs when memory is artificially made low (with mem=32M), then it really doesn't sound like a hardware bug. In that case, would you be willing to try the new test kernel I put up on the same location?

It's not that it ONLY occurs when memory is artificially made low. It's just that it's far easier to generate a disk load when memory is small enough that the disk cache can't get in the way by preventing disk access. My machine has 256M of memory. The easiest way I could think of to figure this bug out was to restrict memory. I'll try out the new test kernel ASAP.

Booting without restricting memory, the new kernel crashes with dbench. (running "dbench 48")

FYI: QA0309 fails the same way. It must be the HPT370 drivers. Looking at the source to the driver (I assume hpt366.c is the appropriate source), I notice something curious. My drive model is listed in:

    const char *bad_ata66_4[] = {
        "IBM-DTLA-307075",
        "IBM-DTLA-307060",
        "IBM-DTLA-307045",
        "IBM-DTLA-307030",
        "IBM-DTLA-307020",
        "IBM-DTLA-307015",
        "IBM-DTLA-305040",
        "IBM-DTLA-305030",
        "IBM-DTLA-305020",
        "WDC AC310200R",
        NULL
    };

My drive model is the 307020.
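For readers who want to check their own drive against the list quoted above, here is a small user-space sketch. It is an illustration only: the model strings are copied from the quoted bad_ata66_4 array, while the function name `in_drive_list` and the exact-match rule (after trimming whitespace) are assumptions and may not mirror how the driver itself compares model strings.

```python
# Model strings copied from the bad_ata66_4 list quoted in the thread.
BAD_ATA66_4 = (
    "IBM-DTLA-307075",
    "IBM-DTLA-307060",
    "IBM-DTLA-307045",
    "IBM-DTLA-307030",
    "IBM-DTLA-307020",
    "IBM-DTLA-307015",
    "IBM-DTLA-305040",
    "IBM-DTLA-305030",
    "IBM-DTLA-305020",
    "WDC AC310200R",
)


def in_drive_list(model, drive_list=BAD_ATA66_4):
    """Return True if the trimmed model string appears in drive_list.

    Exact matching is an assumption made for this sketch.
    """
    return model.strip() in drive_list
```

With the reporter's drive, `in_drive_list("IBM-DTLA-307020")` is true, which is what makes the list worth a closer look.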
A later comment in the code clarifies the meaning of "bad":

    /*
     * This allows the configuration of ide_pci chipset registers
     * for cards that learn about the drive's UDMA, DMA, PIO capabilities
     * after the drive is reported by the OS. Initally for designed for
     * HPT366 UDMA chipset by HighPoint|Triones Technologies, Inc.
     *
     * check_in_drive_lists(drive, bad_ata66_4)
     * check_in_drive_lists(drive, bad_ata66_3)
     * check_in_drive_lists(drive, bad_ata33)
     *
     */
    static int config_chipset_for_dma (ide_drive_t *drive)
    {

The version of source I am looking at is 0.18, dated June 9, 2000. I haven't looked at the source in Wolverine or RC-2 or QA0309. This may or may not be helpful information. I haven't had IDE problems with Wolverine ... just with the later releases.

Could you try booting with "ide=nodma" on the lilo command prompt? That should turn off DMA. If the test still fails, it's NOT a hardware/DMA issue; if it succeeds, can you give me the output of

    cat /proc/ide/hdX/model

and

    cat /proc/ide/hdX/settings

(where hdX is most likely hde/hdf/hdg/hdh, the name of the drive(s) in question)?

Booting with "ide=nodma" works. It installs. Here is output from console 2 during the install ... after the install is complete I'll get the "settings" files without the nodma option.

For "model":

    IBM-DTLA-307020

For "settings":

    name                value        min   max      mode
    ----                -----        ---   ---      ----
    bios_cyl            39870        0     65535    rw
    bios_head           16           0     255      rw
    bios_sect           63           0     63       rw
    breada_readahead    4            0     127      rw
    bswap               0            0     1        r
    current_speed       12           0     69       rw
    file_readahead      0            0     2097151  rw
    ide_scsi            0            0     1        rw
    init_speed          12           0     69       rw
    io_32bit            0            0     3        rw
    keepsettings        0            0     1        rw
    lun                 0            0     7        rw
    max_kb_per_request  64           1     127      rw
    multcount           0            0     8        rw
    nice1               1            0     1        rw
    nowerr              0            0     1        rw
    number              0            0     3        rw
    pio_mode            write-only   0     255      w
    slow                0            0     1        rw
    unmaskirq           0            0     1        rw
    using_dma           0            0     1        rw

And I verified that the driver running is indeed hpt366.

Could you also get "hdparm -i /dev/hdX" output for the no-parameter case? Thanks.

OK. QA0309 installs perfectly (if slowly!)
when I boot with "ide=nodma". Here is the information from "proc" with IDE not disabled: /proc/ide/hde/settings: name value min max mode ---- ----- --- --- ---- bios_cyl 39870 0 65535 rw bios_head 16 0 255 rw bios_sect 63 0 63 rw breada_readahead 4 0 127 rw bswap 0 0 1 r current_speed 69 0 69 rw file_readahead 0 0 2097151 rw ide_scsi 0 0 1 rw init_speed 69 0 69 rw io_32bit 0 0 3 rw keepsettings 0 0 1 rw lun 0 0 7 rw max_kb_per_request 64 1 127 rw multcount 8 0 8 rw nice1 1 0 1 rw nowerr 0 0 1 rw number 0 0 3 rw pio_mode write-only 0 255 w slow 0 0 1 rw unmaskirq 0 0 1 rw using_dma 1 0 1 rw /proc/ide/htp366: HPT370 Chipset. --------------- Primary Channel ---------------- Secondary Channel ------------- enabled enabled --------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------ DMA enabled: yes no yes no UDMA DMA PIO /proc/ide/ide2/config: pci bus 00 device 98 vid 1103 did 0004 channel 0 03 11 04 00 05 00 30 02 03 00 80 01 08 78 00 00 01 b0 00 00 01 b4 00 00 01 b8 00 00 01 bc 00 00 01 c0 00 00 00 00 00 00 00 00 00 00 03 11 01 00 00 00 00 00 60 00 00 00 00 00 00 00 0a 01 08 08 31 4e 45 16 a7 4e 81 06 31 4e 45 16 a7 4e 81 06 05 00 00 00 05 00 00 00 1b 00 00 22 24 00 26 00 01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 90 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Here is the hdparm information: /dev/hde: Model=IBM-DTLA-307020, FwRev=TX3OA60A, SerialNo=YHEYHF88717 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs } RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16 CurCHS=16383/16/63, 
CurSects=-66060037, LBA=yes, LBAsects=40188960
IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5

    *udma5 <--- this is the important bit

It seems the kernel enables UDMA100 (udma5 in kernelspeak) even though your drive is blacklisted for UDMA66 (udma4 in kernelspeak). If a certain drive model clashes for UDMA66, I think it would be safe to assume it also clashes for UDMA100... I will add a patch for this to the kernel.

I'll try out the test kernel. Could a *drive* really cause a full system hang? Yowza. What a pain! It just makes sense that this is the culprit, however.

With the test kernel, the drive is correctly placed in udma3 mode, not udma4 or 5. To see if this fixed the problem, I booted with restricted memory (32M) so I could beat on the hard disk without a huge disk cache preventing actual disk access. Well, it hung while going to runlevel 5. (Solid lockup with IDE LED lit, no response to CTRL-ALT-DEL.)

Hmm. OK, well maybe reducing memory to 32M interfered with normal operation? Although I'd think 32M should be enough, with the 64M swap installed by default. Do you think it's worth tracking this down? This is Wolverine, so if there is a problem with the swapper, maybe we should ignore this. (Of course, I can't *install* the later releases to test this! All I can do is install Wolverine and then install the test kernel.)

Anyway, I booted into runlevel 1 with 32M and ran many, many simultaneous "find" commands. With previous drivers, if I ran three simultaneous "find" commands with restricted memory, it would always crash before the commands finished. Always. 100%.

Good news. I was able to run TEN simultaneous "find" commands without a crash. And NOT restricting memory, I have no problem booting to runlevel 5 and it all seems to work.
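The diagnosis above rests on two readings: the mode that "hdparm -i" marks with an asterisk, and the current_speed values (12 with "ide=nodma", 69 without) in /proc/ide/hde/settings. The sketch below shows one way to extract and interpret them. It is illustrative only: all three function names are hypothetical, the decoding assumes the Linux IDE XFER_* transfer-mode numbering (UDMA modes starting at 0x40, PIO modes at 0x08), and the cap of udma3 simply mirrors what the reporter observed with the test kernel. The actual kernel patch is not shown in this thread.

```python
import re


def active_dma_mode(hdparm_text):
    """Return the mode marked with '*' on the 'DMA modes:' line, or None."""
    for line in hdparm_text.splitlines():
        if line.strip().startswith("DMA modes:"):
            match = re.search(r"\*(\w+)", line)
            return match.group(1) if match else None
    return None


def describe_xfer_mode(value):
    """Name a current_speed/init_speed value (assumed XFER_* encoding)."""
    ranges = ((0x40, 8, "udma"), (0x20, 3, "mdma"), (0x10, 3, "sdma"), (0x08, 5, "pio"))
    for base, count, prefix in ranges:
        if base <= value < base + count:
            return f"{prefix}{value - base}"
    return "unknown"


def capped_udma_level(drive_max_level, blacklisted, cap=3):
    """Limit a blacklisted drive to `cap`; leave other drives untouched."""
    return min(drive_max_level, cap) if blacklisted else drive_max_level
```

Under these assumptions, current_speed 69 reads as udma5 and 12 as pio4, which matches the hang appearing only when DMA is left enabled. A blacklisted drive that advertises udma5 would be held at udma3.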
It would probably be worth reporting during bootup that the hpt366 driver has noticed you're using a problem drive and that it is using a slower mode. That way, if someone has a problem and you ask for "dmesg" output, you'll know right away if it might be a controller/drive problem.

I bought a new UDMA/100 drive -- non IBM :) and not on the problem list. I'll try that out, but I expect no problems. On the odd chance that it gives me problems, I'll report them here. Silence means it works. :)

Should we bother tracking down the hang going to runlevel 5 with restricted memory? The IDE light was lit and the machine was locked solid ... but if there are other bugs fixed in Wolverine that could cause this ...

The VM is currently known to not be able to cope very well with low memory. That is a separate, and very important, issue and is being worked on. Thanks for testing! I will look into the printk'ing of the "you have a known problem drive" message; if it is not too intrusive it will go in. I will close this bug as fixed now (as it seems to be); if either you do get hangs or your new drive has issues, please reopen this bug.

Indeed, throwing a different ATA100 drive on there -- one NOT in the "troubled drive" list -- and RC-2 installs flawlessly. Thought I'd let you know the good news. :) Thanks for all your work.