Bug 54520 - AIC78xx module on 2940U2W w/U3 drives has SEVERE filesystem corruption, various random failures on multiple installs with the same options picked
Status: CLOSED NOTABUG
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.1
Hardware: i386 Linux
Priority: medium
Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brock Organ
Reported: 2001-10-10 18:00 EDT by David Hahn
Modified: 2007-04-18 12:37 EDT

Doc Type: Bug Fix
Last Closed: 2001-10-10 18:50:08 EDT


Attachments: None
Description David Hahn 2001-10-10 18:00:01 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)

Description of problem:
Various errors occur after installation to my SCSI drive that do not 
happen when installing to an IDE drive. Errors are completely random, i.e., I 
may be kicked out to a login prompt immediately after logging in, or some 
programs may be broken, or I may not notice anything until I try to run X, 
or various programs in X. It appears VERY unstable, poorly configured, 
and not usable, which is not my experience with Red Hat distros.

SCSI drive works fine with Windows (not sure that means anything...) and 
I'm pretty well-versed in the ways of SCSI setup; tried disabling write 
cache, tried another SCSI drive, only way it will run right is to install 
to an IDE drive. 

Version-Release number of selected component (if applicable):

Default SMP kernel supplied with RH 7.1

How reproducible:
Always

Steps to Reproduce:
1. Install RH 7.1 with the SMP kernel (no issues apparent or present during 
install) or, I believe, the standard kernel will reproduce it (not sure). 
2. Try to boot it up. It may. It may not. 
3. Various errors occur; the file system is reported as not unmounted cleanly 
after a standard shutdown now -r, I may not be able to log in after the first 
boot, etc. See below.

Actual Results:  I am a retail owner of both RedHat 5.1 and 6.2; 6.2 would 
not run for me, because of similar issues, so I downloaded and burned 7.1, 
thinking the SCSI module would be updated & maybe it's fixed. These CD's 
are OK, I don't get any read errors from them or anything, and I burned 
them at 2x just to be on the safe side. 

I will be as thorough as possible in describing this problem. This problem 
is also reproducible on RH 6.2, although to a somewhat lesser degree. 
Description below is STRICTLY based on experience with RH 7.1.

**************
Symptoms: RANDOM
Running the install, created /boot partition on HDA, / , swap, and /usr 
and /home as well, to the Seagate Barracuda drive, which is SDD. (I tried 
both WITH and WITHOUT write-cache enabled, also tried installing to IBM U3 
drive with same partitioning scheme) 

I seem to be getting a LOT of filesystem errors. I have run the install 
about 25 times, and frequently, if I can get the machine to boot at all, 
and after issuing the shutdown now -r command, the system reboots and 
claims partitions were NOT UNMOUNTED CLEANLY. FSCK finds huge amounts of 
errors. 

I've tried one of the IBM Ultra3 drives as the install target with the 
same results. 

Various things happen or are broken after numerous installs, I don't think 
specifics are relevant, but it is all indicating to me that files are 
being written corrupt to the disk. 

Couple of examples: enter "root" and my password at the login prompt, and I am 
immediately kicked back out to the login prompt. 

Reinstalled, and *could log in* after selecting NOTHING different in the 
installer. Tried to run X, which I set up during install, and the system 
said it failed to connect to the server. 

Reinstalled *again* and was able to get into X this time. Again, EXACT 
SAME OPTIONS in the installer. Tried to download updates, as I realized 
something was whacked, and installed them (using Red Hat Network); the 
install completed. Nothing in X that relies on GTK seems to run anymore. 
So, I went to the dir that had the packages in it, issued rpm -Uvh *, 
and got "core dump" in response. ONE package at a time would work, 
though. So, I issued shutdown now -r again, restarted, file system 
reported as corrupt once more, kernel panic, could not mount root 
filesystem. Ran FSCK from recovery, rebooted, everything's hosed. 

Reinstalled again, and again, and again, with similar results; it's 
never EXACTLY the same, but I ALWAYS have the "not unmounted cleanly" 
after a few reboots if the system is usable at all.

Something is really borked here. 

Could this be a bug in the kernel module? Note that the patch provided at 
Justin Gibbs site apparently won't do me any good, because I'd have to 
have a system up and running before I applied the patch, wouldn't I? He's 
done some updates to the code that are more recent than RH 7.1. 

I swear I read somewhere that there were issues with 2940 U2W and either 
U2 drives, or U3 drives on U2 controllers, but I can't remember where I 
saw that. I am going bonkers trying to get this running. Perhaps there's a 
way to install his patch before setup is actually up and running? PLEASE 
HELP! I'll try it! I really want to use my SCSI drive for Linux! 

Summary: CANNOT INSTALL A USABLE SYSTEM ON THE SCSI SUBSYSTEM

Expected Results:  RH 7.1 should install and run as per normal on this 
system. 2940U2W is fully supported, and I have used this exact same card 
(with only UW drives attached) to run RH 6.2 for a long time. 

Additional info:

System description:

Asus CUVX-D, 512 MB CAS-2 PC133 SDRAM, 2 PIII 1 GHz Coppermines 
(sequential serial numbers!),  w/ 3COM 3c905b-TX, Sound Blaster Live, 
Geforce DDR, Adaptec 2940U2W, running vanilla RH 7.1 SMP kernel (must 
disable MPS 1.4 in BIOS to avoid UNKNOWN IO-APIC error at boot, AND IT 
WORKS WITH 6.2! But that's another story!)

Drives connected to SCSI card: 
2 IBM-PSG Tornado 9 WLS, ID 0 and 1 (Ultra3/SCSI160)
1 Fujitsu MAE3182LP U2 LVD drive ID 2
1 Seagate Cheetah ST39173LW U2 LVD drive ID 3 

The above drives are on the U2 bus, using an U2 cable, with integrated 
active terminator. 

On The UW bus of the same card are: 
1 IBM DGHS18U UW SCSI drive ID 8

1 Fujitsu MAA3182SP UW SCSI drive ID 9

These drives are in an external enclosure, with an active terminator on 
the enclosure. 

SCSI card is at ID7, termination is *enabled* on the U2 bus (not 
automatic), termination is set to automatic on the UW bus (in case the 
drive box isn't connected.) Nothing configured oddly in SCSI BIOS, pretty 
much set to the defaults. All LVD/U3 drives report running in LVD mode, 
all drives on UW bus (of course) report running in SE mode.  

Connected to integrated IDE:

1 Maxtor 91024D4 (primary master)
2 Maxtor 92048U8 drive (primary slave and secondary master)
1 TEAC CD-w54E CD-RW drive (secondary slave)

NOTE: I installed to the END of HDA, the Maxtor IDE drive. Used same 
options that FAIL with SCSI drive, and it runs beautifully! BUT I WANT TO 
USE MY SCSI DRIVE! 

Further note: This SCSI subsystem runs both Windows 2000, and Windows XP 
(which WAS installed on the IBM that I tested RH 7.1 on a couple of 
times), with ABSOLUTELY no problems (using these drives as the OS drive). 
I think, perhaps, this should indicate that this is not a 
cabling/termination issue. All SCSI ID's are non-conflicting, both buses 
are terminated at both ends (the SCSI card should be terminating both 
buses, and the ends of both chains are terminated.) 

Even further note: I am choosing the following options in setup, as far as 
the lines for Lilo and such... it's installing Lilo into the root 
superblock of HDA2 (/boot) using the default ide-scsi parameter line that 
the installer fills in for me. I also am leaving "use linear mode" 
checked, because I tried disabling that and it wouldn't boot at all. I'm 
pretty much just using what the installer selects for me, thinking that it 
probably knows best. 

Oh, and using Boot Magic 7 as the boot loader to hook into LILO at the 
root superblock of HDA2. 

What the heck am I doing wrong, or is this a bug in the driver module? If 
it's a bug, how do I install an updated module at boot-time?
Comment 1 Doug Ledford 2001-10-10 18:24:23 EDT
First off, unless the SMP kernel won't boot with a 1.4 MPS table, go ahead and
change it back to that.  The error message you are talking about is mostly so
that we get notices of new apic IDs, but it shouldn't keep it from working and
the benefits of a 1.4 table outweigh the harmless message.

Now, as to the easy part of my answer.  If you want to try Justin Gibbs' driver,
then boot the installer using the command:

expert noprobe dd

and when it gets to the point when it asks for a driver disk you can insert the
disk you make by downloading the latest driver image from Justin's ftp site. 
Then you can select that driver from the list of available SCSI drivers (make
sure you scan the entire list of SCSI drivers because whichever one comes off of
the driver update disk should get appended to the end of the list, so don't
immediately grab the first Adaptec driver you find).  That should allow you to
try Justin's driver.  If you don't want to download Justin's latest driver, or
if he doesn't have a recent driver disk for a 7.1 install, then drop the dd part
off of the boot line, and select the New Adaptec SCSI driver from the list of
available drivers and it should load Justin's driver instead of mine (I think
6.1.13 is in 7.1, but I could be wrong on that, it may be older).
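A driver disk has to be written as a raw image rather than copied as a file; a minimal sketch of that step with dd (the image filename is hypothetical, and on real hardware the output target would be the floppy device /dev/fd0):

```shell
# Stand-in for the downloaded 1.44 MB driver-disk image (hypothetical name).
dd if=/dev/zero of=driverdisk.img bs=1440k count=1 2>/dev/null
# Byte-for-byte raw copy of the image; on real hardware: of=/dev/fd0
dd if=driverdisk.img of=floppy.out bs=1440k 2>/dev/null
ls -l floppy.out
```

The raw copy matters because the driver disk's filesystem and boot metadata live inside the image itself; copying the .img as a plain file onto a formatted floppy would not produce a usable disk.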

Now, if that doesn't work, then I'm guessing that your actual problem is related
to PCI or main bus corruption resulting from possible PCI caching options set in
your BIOS.  I would go through the BIOS disabling all of the PCI speedup options
and see if things start working OK.  If that helps, then add things back one at
a time until you find the culprit.

That gives you several things to try; I'm marking this bug as NEEDINFO until you
can get the results back to us.  I suspect that the actual case with your
machine is that there is motherboard-related corruption that Windows works
around with some sort of motherboard blacklist entry that we don't know about,
and I'm hoping that changing things in the BIOS will correct the problem.
Comment 2 David Hahn 2001-10-10 18:50:03 EDT
Thanks so much for the info, Doug! 

Enabling the MPS 1.4 setting actually causes a hard lock (keyboard lights for 
caps lock and such don't even respond...) but that works just dandy with the 
kernel from RH 6.2. What do you make of that? Should I file a separate bug 
report?

Anyway, about your advice...

I do have "PCI 2.1 delayed transaction" enabled in the BIOS, and a couple of 
other options enabled that are NOT enabled by default (tweaking...)

I will investigate these settings, honestly I don't have a clue what some of 
them do (AWARD BIOS for non-Intel chipsets always have some oddball settings, 
this is my first non-Intel chip set mainboard)

None of these settings seemed to adversely affect performance in Win2K, but 
haven't seemed to HELP anything, either. 

I hadn't even considered that it could be a PCI bus issue... 

Will test with ALL that stuff disabled (default BIOS config), and the ORIGINAL 
driver, and get back to you. This install is only 2 days old, so I don't mind 
nuking it if I can get it up and running on SCSI! 

If that does not work, I will try Justin's driver. His driver page is kind of 
foreboding about using the driver images vs. the patch... I took that to mean 
there was something really complicated about using it. It sounds fairly easy; 
I'll just make the driver disk in my working Linux install before I blow it 
away (have to, / is where /boot will go when running from SCSI.)

Is there a good way to get DMESG output into a text file so I can send that 
along if it dumps still? 

Appreciate the info! I love Red Hat 7.1... when it's working!
Comment 3 David Hahn 2001-10-11 08:58:55 EDT
OK... I've changed this to "NOTABUG" - I'm not sure if that's right, but here's 
what I did. I don't know how to use Justin's driver, because all he has on his 
page is a source .gz file, or the patch .gz file. Maybe I'm missing the 
obvious, but I don't see it described as a driver disk image, and I don't know 
how to make one from the patch, if you even can. 

BUT... I disabled PCI 2.1 Delayed Transaction in the BIOS (it is disabled by 
default; I had enabled it - seems like I'd want that...) and I booted up from 
the install CD and used the expert noprobe line. 

I installed the provided driver- YOUR driver, NOT the NEW EXPERIMENTAL driver, 
and yes I realize doing noprobe was relatively pointless in that respect... but 
anyway, I did a very basic install. It booted OK, and as a test, I copied the 
entire /usr directory over to the /home partition, then IMMEDIATELY rebooted 
with a shutdown now -r. 

On rebooting, home reported clean. I issued a umount /home, ran FSCK manually, 
and it also reported clean (not sure how much difference that makes vs the 
bootup check). I did this several times. No problems. X runs. X programs all 
run. No issues. 
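The manual check described above (unmount, then fsck by hand) can be sketched against a throwaway file-backed ext2 image instead of a live partition; the file names here are stand-ins, not from the original system:

```shell
# Create a small file and put an ext2 filesystem on it.
dd if=/dev/zero of=home.img bs=1M count=4 2>/dev/null
mke2fs -q -F home.img          # -F: target is a plain file, not a block device
# Force a full check in read-only mode, like running fsck -f on the real
# (unmounted) /home device; exit status 0 means the filesystem is clean.
fsck.ext2 -f -n home.img
```

The -f flag is the relevant difference: without it, e2fsck trusts the "clean" flag in the superblock and skips the scan, which may be why a forced manual run is a stronger test than the routine boot-time check.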

I have now reinstalled using my typical selection of packages, which is about a 
1.4 GB install, and added some updates and stuff, and it appears to be working. 
Working very well, I might add. 2 installs in a row that work perfectly fine is 
FAR better than anything I've been able to do prior to switching that setting. 
 
I think that's a fair test, and I think your analysis of it being a PCI 
bus/main bus corruption problem was dead-on. I am on the one hand very 
impressed at how easily you solved that, and on the other disappointed that I 
didn't think of that... :)

This seems to be one of those situations where X (Linux) seems to be at fault, 
but in fact, X is merely revealing a flaw or bug in Y (my BIOS settings, BIOS, 
or mainboard) that Z (Windows 2000) did not reveal. 

It seems that Windows 2000 compensates for the havoc that the PCI 2.1 delayed 
transaction feature in my BIOS causes on the PCI bus; I assumed, what with my 
mainboard stating it is PCI 2.1 compliant, that I would WANT this feature 
enabled, yet it's caused me DAYS of frustration. Ahem, perhaps that's why it's 
disabled by default. :) 

You are great, Doug. Thanks so much, I don't know how I can thank you enough; 
really. Now I can get down to the task of learning more about the OS.
