Bug 126391

Summary:	both 'update' kernels for fedora 2 fail to boot with SCSI
Product:	[Fedora] Fedora	Reporter:	William W. Austin <waustin>
Component:	kernel	Assignee:	Arjan van de Ven <arjanv>
Status:	CLOSED ERRATA	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	2	CC:	rob, zaitcev
Target Milestone:	---
Target Release:	---
Hardware:	athlon
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2004-08-05 12:28:58 UTC	Type:	---

Description William W. Austin 2004-06-21 05:10:43 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040510

Description of problem:
This problem did not occur with the "original" kernel-2.6.5-1.358.i686.rpm
version. (Apologies if this is a dup, but I searched and didn't see it.)

After installing (rpm -ivh) the kernel-2.6.6-1.435.i686.rpm, on
attempting to boot, the system hangs immediately after remounting the
root file system in R/W mode when attempting to access a swap
partition on a scsi drive connected to an adaptec 29160 controller. 
Eventually things time out and the boot attempts to proceed, however
it fails because it can find NONE of the scsi partitions in /etc/fstab
to mount.  Nor can fdisk -l /dev/sda (sdb, sdc) find any, although the
drives are clearly present in /proc/scsi/scsi.
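
The mismatch described above (drives listed in /proc/scsi/scsi, but no partitions found) can be checked with a short script. This is only a minimal sketch and not part of the original report: the function names are hypothetical, and it assumes the usual layouts of /proc/scsi/scsi (one "Host:" line per attached device) and /proc/partitions (device name in the fourth column).

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not from the report.

# Count attached devices: /proc/scsi/scsi has one "Host:" line per device.
count_scsi_devices() {
    grep -c '^Host:' "${1:-/proc/scsi/scsi}"
}

# List sd* partitions (sda1, sdb2, ...) from column 4 of /proc/partitions.
list_sd_partitions() {
    awk '$4 ~ /^sd[a-z]+[0-9]+$/ { print $4 }' "${1:-/proc/partitions}"
}

# Succeeds when the symptom from this bug is present:
# SCSI devices are attached, but no sd* partition is visible.
scsi_partitions_missing() {
    [ "$(count_scsi_devices "$1")" -gt 0 ] && [ -z "$(list_sd_partitions "$2")" ]
}
```

Called without arguments, scsi_partitions_missing reads the live /proc files; passing two file names lets it run against saved copies taken from the affected kernel.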

Going back to 2.6.5-1.358 kernel, system boots and runs correctly, so
I tried (a) reloading 2.6.6-1.435 kernel (twice - same results) and
then the 2.6.6-1.427 kernel (ditto).

The aic7xxx module loaded at least once (sorry - I didn't always
remember to do an lsmod to check), and removing and reloading it
caused the system to freeze altogether.


Version-Release number of selected component (if applicable): BOTH
kernel-2.6.6-1.427.i686.rpm AND kernel-2.6.6-1.435.i686.rpm

How reproducible:
Always

Steps to Reproduce:
1. Install either 2.6.6-1.427 or .435 kernel (i686)
2. Reboot system

Actual Results:  After remounting root file system (on /dev/hda)
system freezes, and after timeout and recovery cannot find any scsi
partitions.  Ultimately boot fails unless ALL scsi entries are removed
from /etc/fstab (not an acceptable workaround).

Expected Results:  System should boot normally.

Additional info:

Hardware: athlon xp2500+, gigabyte ga-7nnxp MB, 1gb memory,
hdc=160gb ide133, sda=80gb scsi3, sdb=40gb scsi2, sdc=40gb scsi3.
Same system works fine under original kernel-2.6.5-1.358.i686 kernel
(with or without 8k stacks and with or without commercial NVidia
driver installed), also works fine under previous fedora 1
installation and even older rh9 installation (happened to have old
disks not yet reused and tried it just as a test).

Comment 1 Niels Weber 2004-06-21 09:54:18 UTC

I got a problem that I think is related:

On a fresh installed and updated FC2 system (without scsi card)
everything works fine and I can mount a usb harddisk.

When plugging in a scsi controller (tested two different adaptec both
needing the aic7xxx module) with a hard disk attached, kudzu
configures it at the next boot but I cannot mount the scsi disk and
the aic7xxx module isn't loaded.
I'm not able to load the module unless I remove the ohci_hcd module
first. Then the controller and disk appear fine but obviously the usb
harddisk can't be mounted anymore.

That's quite a problem, not being able to use scsi and usb at the same
time and something that worked fine on FC1 and before.

Comment 2 William W. Austin 2004-07-05 04:37:11 UTC

Tonight I tried loading the latest kernel
(kernel-2.6.6-1.435.2.3.i686.rpm + source + docs, btw) and the problem
persists.  (FWIW I have 4 machines with root on /dev/hda1 and
additional scsi drives and all 4 exhibit the same problem).

The boot cycle appears normal - the scsi cards (2 in the machine) are
detected (29160, 2940), and the devices attached to them are detected,
the disk partitions are listed, etc.

However after loading the ehci_hcd and ohci_hcd modules, when the
system attempts to enable the first scsi swap partition, it hangs.
Again, eventually the timeout occurs and the system fails to find ANY
scsi partitions, giving root a chance to log in and fix  the problem.
But of course at this point there is no fix.

At this point (single user, attempting to fix problem), if I do a
rmmod sd_mod, the system locks completely, requiring a hard reboot.

It is also effectively not possible to do an rmmod aic7xxx at this
point, since if this is done and then an attempt is made to reload it,
the system hangs, again requiring a hard reboot.

IMHO this is probably a serious bug since it effectively disables
systems which have both ide and scsi drives attached.  On one machine
I tried disabling ALL usb on the MB and removing the two lines
    alias usb-controller ehci-hcd
    alias usb-controller1 ohci-hcd
from /etc/modprobe.conf, but I must have missed something because they
were still loaded and the problem didn't change.
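
The experiment above (dropping the usb-controller aliases, then checking whether the USB modules still load) can be sketched as follows. This is an illustrative sketch under assumptions: the function names are hypothetical, the edit is written to stdout rather than applied in place, and the report does not establish why the modules were still loaded after the aliases were removed.

```shell
#!/bin/sh
# Hypothetical helper: comment out every "alias usb-controller*" line.
# Reads the file named in $1 and writes the result to stdout,
# so the original config is left untouched.
disable_usb_aliases() {
    sed 's/^[[:space:]]*alias[[:space:]][[:space:]]*usb-controller/# &/' "$1"
}

# Hypothetical helper: print the USB host-controller modules found in
# lsmod-style output on stdin (module name is the first column,
# and the first line is the header).
loaded_usb_hcds() {
    awk 'NR > 1 && $1 ~ /^(ehci|ohci|uhci)_hcd$/ { print $1 }'
}
```

For example, "disable_usb_aliases /etc/modprobe.conf > /tmp/modprobe.conf.new" produces a candidate file to review, and after a reboot "lsmod | loaded_usb_hcds" prints nothing only if none of the three host-controller modules was loaded.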

Comment 3 Rob Hughes 2004-07-05 12:41:54 UTC

I'm running a system with all partitions on SCSI drives attached to 
an Adaptec 2940 U2W controller, and this affects me as well. I'm 
unable to boot with any kernel above 2.6.5-1.358, and disabling USB 
isn't an option for me.

Comment 4 William W. Austin 2004-07-07 03:29:25 UTC

I repeated my experiment of disabling all USB in the motherboard bios
and this time was able to boot the new kernels successfully - however,
for me disabling usb is not an option either.  (I monitor the network
UPS's via USB, not to mention other USB devices.)

FWIW, all 4 machines which exhibit this behaviour (all the machines I
have access to in terms of installing a new kernel, anyway) have the
same MB, a gigabyte 7nnxp, all with the same bios (nVidia, f17).  All
have adaptec 29160 controllers, and 2 have a second scsi controller,
one is an adaptec 2940U and the other is a 2940UW.

With usb disabled in the bios, the modules which seem to trigger the
problem (ohci_hcd and ehci_hcd) did load (but did nothing of course).  The
directory /proc/bus/usb was created, but was empty (as expected).

All 4 boxes have 1Gb memory with athlon xp 2500+ cpu's.

If you need further info to help track this one down, please contact
me.  Thanks.

Comment 5 William W. Austin 2004-07-07 04:05:21 UTC

I forgot to mention in comment #4 above that someone had written me,
suggesting that if I used the open source nv driver instead of the
proprietary nvidia driver for my video board, the problem would go
away.  It doesn't.  Also removing the nvidia video board (5200 ultra,
128mb geforce4) altogether didn't help, but as the only replacement
board I had was an older board with a non-accelerated nvidia chip,
this may not have been a fair test.

Comment 6 Rob Hughes 2004-07-14 02:41:13 UTC

Interestingly, I got the system to start booting. What I've now found
is that it seems to be kudzu that's actually hanging my system. Once I
did a chkconfig --level 3,5 kudzu off, the system started booting.
Running kudzu from the prompt also hangs the system unless I specify -s.

Comment 7 William W. Austin 2004-07-26 21:50:18 UTC

More experimentation: kudzu is not a factor.

If I remove the two lines
  alias usb-controller ehci-hcd
  alias usb-controller1 ohci-hcd
from /etc/modprobe.conf and append them to the end of
/etc/modules.conf, the system *appears* to boot normally.

HOWEVER after booting, the only thing that you can do with the scsi
drives is a df <-options> on them (didn't try an unmount).  Any
attempt to do, for instance,
  ls -laF <mount point of any scsi drive>
causes the scsi bus to be reset.

Several such attempts cause a kernel panic.

Repeating: this is repeatable on Adaptec 29160 and 2940UW boards with
gigabyte ga-7nnxp (multiple systems, almost identical), and swapping
memory, other cards [even removing all other cards except video] does
not make a difference.

If I disable all USB on the MB, then I can safely and successfully
boot any of the non-smp kernels (2.6.6-1.427, 435); however, this is
not an option.

Also if I boot the 435 kernel and wait for the timeout so that I
eventually get the chance to log in (can't find scsi drives ... fsck),
if I do an rmmod of the ohci_hcd and ehci_hcd modules, it doesn't
help.  If after removing these 2, I do an rmmod aic7xxx, the system
may hang.  If it doesn't and I do a modprobe aic7xxx, it invariably hangs.

I have been unable to track this one down in the code - clearly
something in the usb modules is interfering with the scsi module, but
I can't find it.  HELP!!!?!?!?! please.

Comment 8 William W. Austin 2004-07-29 12:39:09 UTC

Nothing new to add (lots of other approaches tried, but all failed);
however, I have had email from two other people having problems on
athlon systems using the aic7xxx driver.

This bug (126391) *may* be related to 125887, but it's hard to tell
from this end.

Comment 9 Pete Zaitcev 2004-07-30 19:27:30 UTC

William, a serial console or a netconsole dump might be useful.
Do NOT drop it into the comments box, please.
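
For anyone unsure how to capture such a dump, the two options can be sketched like this. It is a rough sketch under assumptions, not a recipe from this thread: the function names are hypothetical, the ports 6665/6666 are the netconsole defaults, and the IP addresses, interface and MAC address in the usage note are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: build the netconsole module parameter in the form
#   netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# $1 = local IP, $2 = local interface, $3 = receiver IP, $4 = receiver MAC
netconsole_param() {
    printf 'netconsole=6665@%s/%s,6666@%s/%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical helper: kernel command-line options that send console
# output to the first serial port as well as the local display.
# $1 = baud rate (defaults to 115200)
serial_console_args() {
    printf 'console=ttyS0,%sn8 console=tty0\n' "${1:-115200}"
}
```

The first string would be passed to modprobe, e.g. modprobe netconsole "$(netconsole_param 192.168.0.10 eth0 192.168.0.20 00:11:22:33:44:55)", with a UDP listener running on the receiving machine. The second would be appended to the kernel line in the boot loader configuration, with a null-modem cable to a second machine capturing the output. The resulting log is best added to the bug as an attachment.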

Comment 10 Rob Hughes 2004-07-31 17:25:36 UTC

<stern.edu>
[PATCH] USB: Fix endianness bug in UHCI driver

This patch fixes a byte-swapping error in the UHCI driver. It has been
present since 2.6.6 and only got tracked down just now! Thanks a lot to
Michel Roelofs for all his help and testing.

This should be pushed through to Linus in time to appear in 2.6.8, if
possible.


Guess we're mud :p

Comment 11 William W. Austin 2004-08-04 13:21:12 UTC

Last night I downloaded the 2.6.7-1.494.2.2 kernel from the updates
directory, and it fixes the boot-hang problem.  The system now boots
without the boot-hang and without the failure to find the scsi
partitions.

A new (or previously hidden?) problem remains that loading the kpilot
daemon or trying to access my palm pilot via jpilot (the palm pilot is
connected via usb) slows the system to a crawl - playing with it for
several hours, it *acts* almost like an interrupt conflict, but I
haven't been able to track it down further yet.

Nonetheless, this current bug can probably be closed since the
system now boots successfully -- if I can isolate the usb/slowdown
situation further, I will create a new bugzilla report on it.

Thanks for the hard work - I appreciate it (and so do all my fedora 2
boxen). :-)