Bug 30976

Summary: [ABIT KT7-RAID]Crashed install just before copying install image to disk
Product: [Retired] Red Hat Linux Reporter: Edward Kuns <eddie.kuns>
Component: kernel Assignee: Arjan van de Ven <arjanv>
Status: CLOSED RAWHIDE QA Contact: Brock Organ <borgan>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.1   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2001-03-19 09:28:07 UTC Type: ---

Description Edward Kuns 2001-03-07 19:11:12 UTC
On an ABIT KT7-RAID with a Creative Labs Annihilator II, installing in the
default mode or with "nofb" crashes shortly before the "copying install image
to disk" step.  It does get as far as the screen showing the Red Hat install
progress.  (No progress is shown, of course, because it never started
actually installing packages.)

Installing with "text" worked.

This machine, same hardware configuration, worked on earlier betas, not
with the frame buffer, but with "nofb," so "nofb" no longer works here.

The machine hung with IDE light solid, no response to CTRL-ALT-DEL, dead
mouse.  I had to hit the reset button.  Several consecutive attempts
produced the same behavior.  Switching away from X to a text console didn't
change anything.  I saw / get formatted, swap get installed, then *HANG*
and I was just as dead at a text console.

As I said, a text install works.

Comment 1 Glen Foster 2001-03-07 22:06:27 UTC
This defect is considered MUST-FIX (show-stopper) for Florence GOLD

Comment 2 Brent Fox 2001-03-07 23:49:09 UTC
Wow...that is strange.  When you say, 'earlier betas', does that include
Wolverine?  If this is a cd that you burned yourself, can you check the md5sum
and make sure they are the same as the ones on the ftp site?

Comment 3 Edward Kuns 2001-03-08 09:58:41 UTC
I'm pretty sure that beta-3 was the last beta I tested by installing.
With Wolverine, I just upgraded a few select packages (not the kernel
or XFree86 or anything like that) to test those packages.  So, no, I
didn't test this with Wolverine.  Would you like me to?


Comment 4 Edward Kuns 2001-03-08 10:04:10 UTC
Oh, sorry, I also checked the md5sums of the CDs just before I
burned them.  I compared against the contents of the MD5SUM files
in the ftp directories containing the florence RC-2 iso images.
Being too lazy to check each one individually, I ran
"md5sum --check MD5SUM" and got an OK on each file.  (Except the
Japanese version of disk 1 :)
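
For illustration, the check described above might be scripted roughly as
follows.  This is only a sketch: the function name verify_images is
hypothetical, and it assumes the GNU coreutils md5sum, whose --check option
reads a checksum list and exits non-zero when any listed file does not match.

```shell
#!/bin/sh
# verify_images DIR: check every file listed in DIR/MD5SUM.
# Assumes GNU md5sum; "MD5SUM" is the list file name used in this report.
verify_images() {
    dir=$1
    if [ ! -f "$dir/MD5SUM" ]; then
        echo "no MD5SUM file in $dir" >&2
        return 2
    fi
    # Run inside DIR so the relative file names in MD5SUM resolve.
    ( cd "$dir" && md5sum --check MD5SUM )
}
```

Calling "verify_images /path/to/isos" prints one OK or FAILED line per
listed image.  A listed file that is missing is reported as a failure too.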

Let me know if I can do anything to provide additional information.


Comment 5 Brent Fox 2001-03-08 15:52:50 UTC
Your original problem sounds like a kernel bug to me, but that's just a guess.
There were a lot of bugs fixed between Fisher and Wolverine, so yeah, if you
have the time, I think Wolverine will treat you better than Fisher.  :)

Comment 6 Edward Kuns 2001-03-08 19:16:41 UTC
Today, I reinstalled Florence RC-2 to test something else, and now
even in text mode the installer freezes.  The only hardware change
made between last night and today is pulling one SCSI disk.  (Which
I doubt is the problem.)

Also, an X install hung at the usual spot.  It actually copied the
install image to hard disk, *then* hung in the usual way.  No change
there.  That's where text installs are hanging today.

By "kernel bug" do you mean a new one maybe introduced in RC-2?  The
installer is copying to my SCSI disk (not the one pulled).  My
controller uses the advansys driver.  Were there many changes to
"advansys" or to other SCSI stuff between Wolverine and RC-2?

I'll try Wolverine ASAP ... hopefully tonight ... and will post the
results here.


Comment 7 Brent Fox 2001-03-08 19:36:52 UTC
By kernel bug, I mean hard locking the machine.  Technically speaking, I don't
think the installer should ever be able to hard lock the machine.  If there was
a kernel bug in beta 3, which you did the original install on, I was hoping that
the bug had been fixed with Wolverine or RC-2.  Apparently this is not the case.
I'm reassigning the bug to the kernel team and changing the component to kernel.

Comment 8 Arjan van de Ven 2001-03-09 14:46:22 UTC
Are you prepared to test an experimental kernel to see if a proposed patch fixes
this?

(you cannot install with this kernel, but you can install it on an existing 
 install and try to load the disks as much as possible)

Comment 10 Edward Kuns 2001-03-09 16:09:01 UTC
Sure.  I'll install that experimental kernel and then try to compile
a kernel.  That should load the disks, eh?  At the same time, I could
run some of the daily cron jobs that keep the disks busy.

To really test this, I should do the same with the RC-2 kernel and
see that it crashes.  Point me at the experimental kernel and I'll
do both.


Comment 11 Edward Kuns 2001-03-10 10:08:58 UTC
Good call on this being a kernel issue.  Wolverine loads perfectly,
so that isolates this as being a 2.4.2 issue.  On Wolverine, I
installed the RC-2 kernel package and set up lilo so I could boot
either.  The root file system is on hde on the ATA100 controller,
so that driver must be at fault in 2.4.2.

To reproduce the problem, I had to boot with a restricted amount
of memory.  (This machine has 256 Meg)  Booting the 2.4.2 kernel
with "mem=32M" and then running in three separate windows:

    find / -name asdfadsf

caused this crash the first time.  Not immediately, but after some
amount of "find" activity.  Doing the same (with "mem=32M") on the
2.4.1 kernel worked perfectly.  Compiling the kernel wasn't a good
test, as I found out, because it generates less disk activity than
CPU activity.  I should have known that!
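
The disk-load test above could be wrapped in a small script along these
lines.  This is only a sketch: the function name run_finds and its arguments
are made up for illustration, and it starts the same "find" several times in
the background rather than in separate windows.  The "mem=32M" option still
has to be given at boot; the script does not restrict memory itself.

```shell
#!/bin/sh
# run_finds COUNT [ROOT]: start COUNT simultaneous "find" commands over ROOT
# (default /) and wait for all of them to finish.
run_finds() {
    count=$1
    root=${2:-/}
    i=0
    while [ "$i" -lt "$count" ]; do
        # The name is chosen so that nothing matches; the point is the
        # directory traversal, not the result.
        find "$root" -name asdfadsf >/dev/null 2>&1 &
        i=$((i + 1))
    done
    # Block until every background find has exited.
    wait
    echo "all $count find commands finished"
}
```

"run_finds 3" corresponds to the three-window test described above.  On an
affected kernel the machine would be expected to hang before the final
message appears.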

I am downloading the test kernel right now.  I'll try it out ASAP.


Comment 12 Arjan van de Ven 2001-03-10 10:47:18 UTC
dbench (ftp://ftp.samba.org/pub/tridge/dbench) is a nice tool for generating
diskload.
(use it with a parameter of no more than 48 unless you have a boatload of ram)
Thanks for testing!

Comment 13 Edward Kuns 2001-03-10 10:52:01 UTC
Trying to install the experimental kernel, it complains about needing
modutils newer than that in RC-2.  Should I force it?  I'll force it
and see how things go.


Comment 14 Edward Kuns 2001-03-10 11:17:25 UTC
Unfortunately, this test kernel does NOT solve the problem.  Running
several simultaneous "find" commands after booting with "mem=32M"
still locks the system up solid.  This simple test actually seems to
generate a more solid disk load than dbench, but I've tried both.
(Actually, since the "find" test crashed the test kernel, I didn't
try dbench on it.  I tried dbench on the Wolverine kernel, however,
without any crash.)


Comment 15 Arjan van de Ven 2001-03-10 12:38:09 UTC
ok... back to the drawing board then.
Thanks for testing.

Comment 16 Arjan van de Ven 2001-03-12 13:36:21 UTC
If this only occurs when memory is artificially made low (with mem=32M), then
it really doesn't sound like a hardware bug. In that case, would you be willing
to try the new test kernel I put up on the same location?

Comment 17 Edward Kuns 2001-03-12 14:12:31 UTC
It's not that it ONLY occurs when memory is artificially made low.
It's just that it's far easier to generate a disk load when memory
is small enough that the disk cache can't get in the way by
preventing disk access.  My machine has 256M of memory.  The easiest
way I could think of to figure this bug out was to restrict memory.

I'll try out the new test kernel ASAP.


Comment 18 Edward Kuns 2001-03-12 15:41:21 UTC
Booting without restricting memory, the new kernel crashes
with dbench.  (running "dbench 48")


Comment 19 Edward Kuns 2001-03-18 10:33:09 UTC
FYI:  QA0309 fails the same way.  It must be the
HPT370 drivers.


Comment 20 Edward Kuns 2001-03-18 10:45:24 UTC
Looking at the source to the driver (I assume hpt366.c is the
appropriate source), I notice something curious.  My drive model
is listed in:

const char *bad_ata66_4[] = {
	"IBM-DTLA-307075",
	"IBM-DTLA-307060",
	"IBM-DTLA-307045",
	"IBM-DTLA-307030",
	"IBM-DTLA-307020",
	"IBM-DTLA-307015",
	"IBM-DTLA-305040",
	"IBM-DTLA-305030",
	"IBM-DTLA-305020",
	"WDC AC310200R",
	NULL
};

My drive model is the 307020.  A later comment in the code clarifies
the meaning of "bad":

/*
 * This allows the configuration of ide_pci chipset registers
 * for cards that learn about the drive's UDMA, DMA, PIO capabilities
 * after the drive is reported by the OS.  Initally for designed for
 * HPT366 UDMA chipset by HighPoint|Triones Technologies, Inc.
 *
 * check_in_drive_lists(drive, bad_ata66_4)
 * check_in_drive_lists(drive, bad_ata66_3)
 * check_in_drive_lists(drive, bad_ata33)
 *
 */
static int config_chipset_for_dma (ide_drive_t *drive)
{


The version of source I am looking at is 0.18, dated June 9, 2000.
I haven't looked at the source in Wolverine or RC-2 or QA0309.
This may or may not be helpful information.  I haven't had IDE
problems with Wolverine ... just with the later releases.


Comment 21 Arjan van de Ven 2001-03-18 10:57:14 UTC
Could you try booting with "ide=nodma" on the lilo command prompt?
That should turn off DMA. If the test still fails, it's NOT a hardware/DMA issue;
if it succeeds, can you give me the output of
cat /proc/ide/hdX/model
and
cat /proc/ide/hdX/settings
(where hdX is most likely hde/hdf/hdg/hdh, the name of the drive(s) in question)
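
A sketch of collecting the requested files in one go, assuming the /proc/ide
layout named above.  The function name dump_ide_info and its optional second
argument (an alternative root directory, handy for trying it on a machine
without /proc/ide) are hypothetical.

```shell
#!/bin/sh
# dump_ide_info DRIVE [PROC_IDE]: print the model and settings files of DRIVE.
dump_ide_info() {
    drive=$1
    base=${2:-/proc/ide}
    if [ ! -d "$base/$drive" ]; then
        echo "$drive: not present under $base" >&2
        return 1
    fi
    echo "== $drive model =="
    cat "$base/$drive/model"
    echo "== $drive settings =="
    cat "$base/$drive/settings"
}
```

Running it as
for d in hde hdf hdg hdh; do dump_ide_info "$d"; done
covers the four likely drive names and reports the ones that are absent.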


Comment 22 Edward Kuns 2001-03-18 12:24:28 UTC
Booting with "ide=nodma" works.  It installs.  Here is output
from console 2 from during the install ... after the install is
complete I'll get the "settings" files without the nodma option.

For "model":

IBM-DTLA-307020

For "settings":

name			value		min		max		mode
----			-----		---		---		----
bios_cyl                39870           0               65535           rw
bios_head               16              0               255             rw
bios_sect               63              0               63              rw
breada_readahead        4               0               127             rw
bswap                   0               0               1               r
current_speed           12              0               69              rw
file_readahead          0               0               2097151         rw
ide_scsi                0               0               1               rw
init_speed              12              0               69              rw
io_32bit                0               0               3               rw
keepsettings            0               0               1               rw
lun                     0               0               7               rw
max_kb_per_request      64              1               127             rw
multcount               0               0               8               rw
nice1                   1               0               1               rw
nowerr                  0               0               1               rw
number                  0               0               3               rw
pio_mode                write-only      0               255             w
slow                    0               0               1               rw
unmaskirq               0               0               1               rw
using_dma               0               0               1               rw

And I verified that the driver running is indeed hpt366.


Comment 23 Arjan van de Ven 2001-03-18 12:46:42 UTC
Could you also get "hdparm -i /dev/hdX" output for the no-parameter case ?
thanks.

Comment 24 Edward Kuns 2001-03-18 12:56:34 UTC
OK.  QA0309 installs perfectly (if slowly!) when I boot with "ide=nodma".
Here is the information from "proc" with DMA not disabled:

          /proc/ide/hde/settings:

name			value		min		max		mode
----			-----		---		---		----
bios_cyl                39870           0               65535           rw
bios_head               16              0               255             rw
bios_sect               63              0               63              rw
breada_readahead        4               0               127             rw
bswap                   0               0               1               r
current_speed           69              0               69              rw
file_readahead          0               0               2097151         rw
ide_scsi                0               0               1               rw
init_speed              69              0               69              rw
io_32bit                0               0               3               rw
keepsettings            0               0               1               rw
lun                     0               0               7               rw
max_kb_per_request      64              1               127             rw
multcount               8               0               8               rw
nice1                   1               0               1               rw
nowerr                  0               0               1               rw
number                  0               0               3               rw
pio_mode                write-only      0               255             w
slow                    0               0               1               rw
unmaskirq               0               0               1               rw
using_dma               1               0               1               rw


                      /proc/ide/hpt366:

                                HPT370 Chipset.
--------------- Primary Channel ---------------- Secondary Channel -------------
                 enabled                          enabled
--------------- drive0 --------- drive1 -------- drive0 ---------- drive1 ------
DMA enabled:    yes              no              yes               no 
UDMA
DMA
PIO

                   /proc/ide/ide2/config:

pci bus 00 device 98 vid 1103 did 0004 channel 0
03 11 04 00 05 00 30 02 03 00 80 01 08 78 00 00
01 b0 00 00 01 b4 00 00 01 b8 00 00 01 bc 00 00
01 c0 00 00 00 00 00 00 00 00 00 00 03 11 01 00
00 00 00 00 60 00 00 00 00 00 00 00 0a 01 08 08
31 4e 45 16 a7 4e 81 06 31 4e 45 16 a7 4e 81 06
05 00 00 00 05 00 00 00 1b 00 00 22 24 00 26 00
01 00 22 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 90 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
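
The difference between the two settings listings is easier to see once
current_speed is decoded.  The sketch below assumes the transfer-mode
encoding used by the Linux IDE headers of that era (64 + n for UDMA mode n,
32 + n for multiword DMA, 16 + n for single-word DMA, 8 + n for PIO); treat
that mapping as an assumption to check against the kernel source.  Both
function names are hypothetical.

```shell
#!/bin/sh
# settings_value FILE NAME: print the "value" column for NAME from a
# /proc/ide/hdX/settings style listing.
settings_value() {
    awk -v n="$2" '$1 == n { print $2 }' "$1"
}

# decode_xfer_mode CODE: translate a current_speed number into a mode name.
# The ranges below are the assumed encoding described in the lead-in.
decode_xfer_mode() {
    code=$1
    if [ "$code" -ge 64 ] && [ "$code" -le 71 ]; then
        echo "udma$((code - 64))"
    elif [ "$code" -ge 32 ] && [ "$code" -le 34 ]; then
        echo "mdma$((code - 32))"
    elif [ "$code" -ge 16 ] && [ "$code" -le 18 ]; then
        echo "sdma$((code - 16))"
    elif [ "$code" -ge 8 ] && [ "$code" -le 13 ]; then
        echo "pio$((code - 8))"
    else
        echo "unknown($code)"
    fi
}
```

Under that assumption, the 69 shown here decodes to udma5 and the 12 seen
with "ide=nodma" decodes to pio4, which agrees with the *udma5 marker in
the hdparm output in Comment 25.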



Comment 25 Edward Kuns 2001-03-18 13:04:01 UTC
Here is the hdparm information:


/dev/hde:

 Model=IBM-DTLA-307020, FwRev=TX3OA60A, SerialNo=YHEYHF88717
 Config={ HardSect NotMFM HdSw>15uSec Fixed DTR>10Mbs }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=40
 BuffType=DualPortCache, BuffSize=1916kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=-66060037, LBA=yes, LBAsects=40188960
 IORDY=on/off, tPIO={min:240,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes: pio0 pio1 pio2 pio3 pio4 
 DMA modes: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5 



Comment 26 Arjan van de Ven 2001-03-18 13:11:10 UTC
*udma5 <--- this is the important bit
It seems the kernel enables UDMA100 (udma5 in kernelspeak) even though your
drive is blacklisted for UDMA66 (udma4 in kernelspeak). If a certain drive model
clashes for UDMA66, I think it would be safe to assume it also clashes for
UDMA100... I will add a patch for this to the kernel.
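
The idea behind such a fix can be sketched as follows.  This is not the
actual kernel patch (which would be C code in the driver and is not shown in
this report); it is a hypothetical shell illustration that uses the model
names from the bad_ata66_4 list quoted in Comment 20 and holds a listed
drive at udma3, the mode Comment 28 reports for the test kernel.  The
function names and the exact-match comparison are simplifications made for
the example.

```shell
#!/bin/sh
# is_bad_ata66_4 MODEL: succeed if MODEL is on the list quoted in Comment 20.
is_bad_ata66_4() {
    case $1 in
        IBM-DTLA-307075|IBM-DTLA-307060|IBM-DTLA-307045) return 0 ;;
        IBM-DTLA-307030|IBM-DTLA-307020|IBM-DTLA-307015) return 0 ;;
        IBM-DTLA-305040|IBM-DTLA-305030|IBM-DTLA-305020) return 0 ;;
        "WDC AC310200R") return 0 ;;
    esac
    return 1
}

# pick_udma_mode MODEL WANTED: print the UDMA mode number to use.
pick_udma_mode() {
    model=$1
    wanted=$2
    if is_bad_ata66_4 "$model" && [ "$wanted" -ge 4 ]; then
        # A drive listed as bad for udma4 is also kept out of udma5.
        echo 3
    else
        echo "$wanted"
    fi
}
```

With this sketch, "pick_udma_mode IBM-DTLA-307020 5" prints 3, while a model
that is not on the list keeps the mode it asked for.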

Comment 27 Edward Kuns 2001-03-18 21:44:34 UTC
I'll try out the test kernel.  Could a *drive* really cause a full
system hang?  Yowza.  What a pain!  It just makes sense that this
is the culprit, however.


Comment 28 Edward Kuns 2001-03-19 09:27:56 UTC
With the test kernel, the drive is correctly placed in udma3 mode,
not udma4 or 5.  To see if this fixed the problem, I booted with
restricted memory (32M) so I could beat on the hard disk without
a huge disk cache preventing actual disk access.  Well, it hung
while going to runlevel 5.  (Solid lockup with IDE LED lit, no
response to CTRL-ALT-DEL)

Hmm.  OK, well maybe reducing memory to 32M interfered with normal
operation?  Although I'd think 32M should be enough, with the 64M
swap installed by default.  Do you think it's worth tracking this
down?  This is Wolverine, so if there is a problem with the swapper,
maybe we should ignore this.  (Of course, I can't *install* the
later releases to test this!  All I can do is install Wolverine and
then install the test kernel.)

Anyway, I booted into runlevel 1 with 32M and ran many, many, 
simultaneous "find" commands.  With previous drivers, if I ran three
simultaneous "find" commands with restricted memory, it would always
crash before the commands finished.  Always.  100%.

Good news.  I was able to run TEN simultaneous "find" commands without
a crash.  And NOT restricting memory, I have no problem booting to
runlevel 5 and it all seems to work.

It would probably be worth reporting during bootup that the hpt366
driver has noticed you're using a problem drive and that it is using
a slower mode.  That way, if someone has a problem and you ask for
"dmesg" output, you'll know right away if it might be a controller/drive
problem.

I bought a new UDMA/100 drive -- non IBM :) and not on the problem
list.  I'll try that out, but I expect no problems.  On the odd occasion
that it gives me problems, I'll report them here.  Silence means it
works.  :)

Should we bother tracking down the hang going to run level 5 with
restricted memory?  The IDE light was lit and the machine was locked
solid ... but if there are other bugs fixed in wolverine that could
cause this ...


Comment 29 Arjan van de Ven 2001-03-19 09:33:01 UTC
The VM is currently known not to cope very well with low memory.
That is a separate, and very important, issue and is being worked on.

Thanks for testing! I will look into the printk'ing of the "you have a known
problem drive" message; if it is not too intrusive it will go in.

I will close this bug as fixed now (as it seems to be); if either you do get
hangs or your new drive has issues, please reopen this bug.

Comment 30 Edward Kuns 2001-03-19 16:21:24 UTC
Indeed, throwing a different ATA100 drive on there -- one NOT in the
"troubled drive" list -- and RC-2 installs flawlessly.  Thought I'd
let you know the good news.  :)  Thanks for all your work.