Bug 178229 - LTC23422-Installer hung during "Loading DAC960 driver..."
Summary: LTC23422-Installer hung during "Loading DAC960 driver..."
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 5
Hardware: ppc64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 207140
Depends On:
Blocks:
 
Reported: 2006-01-18 18:24 UTC by Joy Latten
Modified: 2008-08-02 23:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-11-12 06:16:20 UTC
Type: ---
Embargoed:


Attachments
lspci-n.txt (3.68 KB, text/plain)
2006-06-20 17:06 UTC, IBM Bug Proxy
dac960_id_table_fixup.patch (937 bytes, text/plain)
2006-07-14 21:11 UTC, IBM Bug Proxy

Description Joy Latten 2006-01-18 18:24:50 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)

Description of problem:
Installing a 64-way pSeries lpar machine with FC5Test2.
I successfully did a netboot from a nimserver to boot up
the lpar using FC5Test2 images. The ibm scsi driver
(cannot recall the driver's name) loaded just fine. Then the
message "Loading DAC960 driver" was displayed
and the machine stays there forever unless I reboot.

Version-Release number of selected component (if applicable):
anaconda

How reproducible:
Always

Steps to Reproduce:
1. netboot using ppc/images/netboot/ppc64.img
  

Actual Results:  Machine hangs forever with message
"Loading DAC960 driver..."

Expected Results:  I was expecting the next screen, which asks me "to choose a Language", to come up so that the installation process could continue.

Additional info:

I am using vnc. At my open firmware prompt, I run "setenv boot-file vnc" so that I can use vnc for installing the lpar.

Comment 1 Paul Nasrat 2006-01-18 18:36:50 UTC
I think we should be just using ipr.  If you netboot yaboot, use "linux
nostorage", and then select the ipr driver, does it work?  What does lspci look like on
the system?

Comment 2 Joy Latten 2006-01-18 18:43:57 UTC
I am not loading these drivers myself, the installer is. 
The disk I am installing is clean in that it never had an operating
system on it so I can't do an lspci. 

Comment 3 Paul Nasrat 2006-01-18 19:11:10 UTC
If you pass nostorage as a boot argument you will be able to select and should
be able to complete an install.

Comment 4 Joy Latten 2006-01-20 21:22:03 UTC
We did as suggested and passed nostorage as a boot argument so we could select 
devices. Just selecting ipr did not help. However when we tried sym53c8xx it 
worked and our install completed.

An lspci on the machine shows 
d0:01.0 SCSI storage controller: LSI Logic/Symbios Logic 53c1010 66MHz Ultra3 
SCSI Adapter (rev 01)
d8:01.0 SCSI storage controller: Mylex Corporation AccelRAID 
600/500/400/Sapphire support Device (rev 04) 

I recall that when installing, the sym53c8xx always installed first and 
successfully and then when loading DAC960 next, it would hang.
By not installing DAC960, everything went ok. Perhaps a hw probe saw the Mylex 
driver and so attempted to load DAC960, thus the hang.



Comment 5 Paul Nasrat 2006-01-20 23:19:42 UTC
If you increase the kernel log level and try loading DAC960 on the running
system, do you get a hang?  You might want to enable Sysrq so you can drop into
xmon to get more details.

Comment 6 IBM Bug Proxy 2006-05-01 16:18:34 UTC
Connecting to IBM Ltc bug 23422... Thanks.

Comment 7 Paul Nasrat 2006-05-01 22:03:26 UTC
Does rawhide also hang?

Comment 8 Mark Smith 2006-05-01 23:02:05 UTC
ahh, I would like to find out, but I need help.  Please point me to where the
'ppc64.img'  netboot image is for rawhide.
I find the FC5 copy from mid March here:
http://download.fedora.redhat.com/pub/fedora/linux/core/5/ppc/os/images/netboot/
but
http://download.fedora.redhat.com/pub/fedora/linux/core/development/ppc64/images/
appears empty to me.  I assume 'rawhide' is synonymous with these daily builds
in the "development" branch.  true?
http://download.fedora.redhat.com/pub/fedora/linux/core/development/tree-ppc/ppc64/
has a yaboot.conf.  But the only way I know to start this on ppc64 is to copy
into an already bootable system.  Is this my best option? or is there a
ppc64.img in the rawhide tree?

Comment 9 Mark Smith 2006-05-01 23:11:28 UTC
I may have found it.  I think my mistake was going into development/ppc64.
Instead I find a netboot image here:
http://download.fedora.redhat.com/pub/fedora/linux/core/development/ppc/images/netboot/
Is this the "rawhide" image?  (sorry for the newbie questions)

Comment 10 Paul Nasrat 2006-05-02 00:59:31 UTC
Yes, the ppc tree is what you want - it's a biarch tree (32 and 64 bit) and is
what we base the trees off; the ppc64 tree is just a side effect of how we build.

You might want to check out:

http://fedoraproject.org/wiki/Testing

I'd tend to use the kernel/initrd with yaboot:

http://download.fedora.redhat.com/pub/fedora/linux/core/development/ppc/ppc/ppc64/

Perhaps it'd be a good idea to work on a ppc specific testing wiki page.

Comment 11 Mark Smith 2006-05-03 17:48:39 UTC
> Does rawhide also hang?
appears fixed in the May 1 snapshot of rawhide.
on 5 victims, both hmc-attached lpars and standalone, they all recreated on FC5
from mid-March, but did not recreate on rawhide.

Comment 12 Paul Nasrat 2006-05-03 18:01:06 UTC
Thanks for testing.  I'm going to close this out as there is a known work
around, and it's fixed moving forward.

Comment 13 Mark Smith 2006-05-11 01:12:40 UTC
This was not recreating on the May1 snapshot of rawhide.  On the May10 snapshot
I captured this morning, the problem again recreates.

Comment 14 IBM Bug Proxy 2006-05-11 01:13:01 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|CLOSED                      |REOPENED
         Resolution|FIX_BY_DISTRO               |




------- Additional Comments From marksmit.com  2006-05-10 21:18 EDT -------
This was not recreating on the May1 snapshot of rawhide.  On the May10 snapshot
I captured this morning, the problem again recreates. 

Comment 15 Paul Nasrat 2006-05-11 01:48:47 UTC
Do you happen to know the exact kernel version it was working on with the May 1
snapshot. Was it on exactly the same hardware? Had anything been run on the
hardware in between?

Comment 16 IBM Bug Proxy 2006-05-11 03:12:53 UTC
----- Additional Comments From marksmit.com  2006-05-10 23:16 EDT -------
hostname: hermeslp1  OpenPower710 was previously installed/running my "May 1"
snapshot of rawhide:
Linux version 2.6.16-1.2181_FC6 (bhcompile.redhat.com) (gcc
version 4.1.0 20060425 (Red Hat 4.1.0-11)) #1 SMP Sun Apr 30 23:03:19 EDT 2006

I then shutdown and net booted ppc64.img from today. so same exact hw config.
I then rebooted after recreate, net booted the May1 version of ppc64.img and
re-installed the May1 images. 

Comment 18 IBM Bug Proxy 2006-05-19 20:17:44 UTC
----- Additional Comments From marksmit.com  2006-05-19 16:21 EDT -------
This problem continued to recreate as I sampled ppc64.img netboots from rawhide,
until the May18 snapshot.
It recreates on kernel-2.6.16-1.2202_FC6.src.rpm  (May 15th snapshot)
and no longer on kernel-2.6.16-1.2206_FC6.src.rpm (May 18th snapshot) 

Comment 19 Paul Nasrat 2006-05-23 14:45:06 UTC
Manoj, can we try and get some more details on what may be causing this - you
might want to check the interaction between dac960 and the other devices on the
system (pci id reuse, etc).  Some sort of trace would be useful - you may have
to add some debugging to loader to make this work and create a new initrd.img
with your debugging loader.

Comment 20 IBM Bug Proxy 2006-05-27 20:30:40 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|FC5                         |Other




------- Additional Comments From marksmit.com  2006-05-26 22:31 EDT -------
I am changing the version on the LTC Bugzilla from FC5 to "other" to reflect
that this is found on Fedora-devel (rawhide).
The status remains the same. 
It did not recreate on May1 version of fedora-devel.
It recreated on May10,May15 versions of fedora-devel.
It does not recreate on May 18, May25 & today's May26 versions of fedora-devel. 

Comment 21 IBM Bug Proxy 2006-06-06 03:27:16 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|RH178229-FC5 hangs loading  |RH178229-Fedora-devel hangs
                   |DAC960 prior to install     |loading DAC960 prior to
                   |                            |install
            Version|Other                       |devel




------- Additional Comments From marksmit.com  2006-06-05 23:30 EDT -------
I am changing the version to 'devel' to properly reflect that this recreates on
rawhide.

The status of this bug has changed. It now recreates on the June5 version of
rawhide. 
kernel-2.6.16-1.2245_FC6.ppc64.rpm 

Comment 22 IBM Bug Proxy 2006-06-08 05:47:24 UTC
----- Additional Comments From marksmit.com  2006-06-08 01:50 EDT -------
And now it is gone again with June6 snapshot:
kernel-2.6.16-1.2252_FC6.ppc64.rpm 

Comment 23 IBM Bug Proxy 2006-06-08 06:22:31 UTC
----- Additional Comments From marksmit.com  2006-06-08 02:28 EDT -------
oops. the kernel levels are correct, but calling it a June6 snapshot is wrong. 
It is the June7 snapshot with that kernel that no longer recreates. 

Comment 24 IBM Bug Proxy 2006-06-08 23:47:19 UTC
----- Additional Comments From marksmit.com  2006-06-08 19:51 EDT -------
And now it is back again with the June8 snapshot of rawhide:
kernel-2.6.16-1.2255_FC6.ppc64.rpm 

Comment 25 Jeremy Katz 2006-06-19 21:41:27 UTC
This just sounds like the dac960 is exporting that it supports a PCI id which it
really doesn't.  That, or your virtual scsi is saying that it's a dac960 when
it's really not.

Can you provide the output of lspci and lspci -n?
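
For readers comparing the numeric IDs requested here, a minimal sketch of pulling the vendor:device pairs out of `lspci -n` style output. The sample lines are hypothetical (they are not the attached lspci-n.txt), and the regular expression assumes the usual "slot class: vendor:device" layout, with an optional "Class " prefix that some pciutils versions print.

```python
import re

# Matches lines such as "d8:01.0 0104: 1069:b166 (rev 04)".
# Some pciutils versions print "Class 0104:", so "Class " is optional.
LSPCI_N_LINE = re.compile(
    r"^(?P<slot>\S+)\s+(?:Class\s+)?(?P<cls>[0-9a-fA-F]{4}):\s+"
    r"(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})"
)


def parse_lspci_n(text):
    """Return (slot, class, vendor, device) tuples with IDs lowercased."""
    rows = []
    for line in text.splitlines():
        m = LSPCI_N_LINE.match(line.strip())
        if m:
            rows.append(
                (m["slot"], m["cls"].lower(), m["vendor"].lower(), m["device"].lower())
            )
    return rows


# Hypothetical sample input, for illustration only.
SAMPLE = """\
d0:01.0 0100: 1000:0021 (rev 01)
d8:01.0 Class 0104: 1069:b166 (rev 04)
"""

if __name__ == "__main__":
    for row in parse_lspci_n(SAMPLE):
        print(row)
```

Lines that do not follow the assumed layout are skipped rather than reported as errors.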

Comment 26 IBM Bug Proxy 2006-06-20 17:06:44 UTC
Created attachment 131208 [details]
lspci-n.txt

Comment 27 IBM Bug Proxy 2006-06-20 17:07:23 UTC
----- Additional Comments From marksmit.com  2006-06-20 13:11 EDT -------
 
lspci and lspci -n output for two different ppc64 recreates. 

Comment 28 IBM Bug Proxy 2006-06-20 17:16:37 UTC
----- Additional Comments From marksmit.com  2006-06-20 13:19 EDT -------
Customers are also encountering this bug:
http://www-128.ibm.com/developerworks/forums/dw_thread.jsp?message=13821904&cat=5&thread=119552&treeDisplayType=threadmode1&forum=375#13821904

I posted the workaround. 

Comment 29 IBM Bug Proxy 2006-06-24 20:36:52 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gcwilson.com




------- Additional Comments From gcwilson.com  2006-06-24 16:41 EDT -------
I am encountering this bug installing FC5 on the same 64-way Squadrons H that
Joy Latten reported seeing it on back in January.  I can't speak for rawhide
because I need FC5 + updates + selected rawhide packages for LSPP testing.  Same
work around still applies.  However, on boot after the install completes,
start_udev hangs forever.  I booted it with init=/bin/bash and ran start_udev
manually in the background.  Turns out start_udev is attempting to modprobe
DAC960.  modprobe hangs using 100% CPU and start_udev never completes.  It
generated a stack dump, which unfortunately got wiped from my terminal history.
Workaround is to remove DAC960.ko from the modules tree.  It appears not to
recreate when I move the module back into place and run start_udev now--maybe it
has something to do with the preexisting device nodes?

Comment 30 IBM Bug Proxy 2006-07-05 16:51:33 UTC
----- Additional Comments From marksmit.com  2006-07-05 12:55 EDT -------
This recreated using the FC6_test1 network boot image. 

Comment 31 IBM Bug Proxy 2006-07-13 17:22:12 UTC
----- Additional Comments From marksmit.com  2006-07-13 13:28 EDT -------
This may be related to the IBM Power RAID IPR driver.
July 10 rawhide did not recreate.
July 11 rawhide does.
So I took a SF4 (p5-550) hmc-attached system and started removing devices from 
the lpar profile (power-off, then activate to try new).
The non-raid scsi adapter that does _not_ use IPR boots past the point of hang.
Putting the integrated IPR (T14) device or either of the slot 2 or 4 raid scsi 
adapters (IBM f/c 5703) will cause the dac970 hang problem to occur on netboot.

I will get the hmc info on those adapters and append shortly. 

Comment 32 IBM Bug Proxy 2006-07-13 17:57:08 UTC
----- Additional Comments From marksmit.com  2006-07-13 14:00 EDT -------
device that recreates:
feature code 5703
vendor id 1069
device id B166
subsyst vend id 1014
subsyst dev id  0278
class code 0104
revision id 04

This is on slot C2 of 9123-720 (OpenPower 720) ninagal
integrated T14 IPR and raid adapter in slot C4 (all recreate alike) have 
identical f/c and properties to these.
hmc access: sqh14lte.upt.austin.ibm.com   hscroot abc123

I am also getting info from the p5-550.  paytonlp1 (on same hmc for ease).

f/c and properties of older non-raid adapter in slot C08 of external reliance 
io drawer: this does not recreate, but loads the sym53C8xx device instead of 
ipr
PCI 160MB Ultra3 SCSI LVD
f/c 6203
vendor id 1000
device id 0021
subsyst vend id 1000
subsyst dev id  1010
class code 0100
revision id 01 

Comment 33 IBM Bug Proxy 2006-07-13 18:37:57 UTC
----- Additional Comments From marksmit.com  2006-07-13 14:42 EDT -------
I have more resources in paytonlp1.
both f/c 5702 and 5703 adapters recreate.
These adapters also cause the hang on 'DAC960'  
 (my mistake calling out 'DAC970' - 2ndprior comment - in this bug - typo)
Storage Controller
f/c 5702
vendor id 1069
device id B166
subsyst vend id 1014
subsyst dev id  0266
class code 0100
revision id 04

These lpfc Fibre Channel Serial Bus  adapters do not recreate
f/c <none>
vendor id 10DF
device id FA00
subsyst vend id 10DF
subsyst dev id  FA00
class code 0C04
revision id 01 
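
The adapter properties listed in the last two comments can be grouped by vendor/device pair to make the pattern visible. This is only a sketch over the values quoted above; the short labels and the function names are chosen here for illustration.

```python
from collections import defaultdict

# (label, vendor id, device id, subsystem vendor, subsystem device, hangs?)
# Values copied from the adapter listings in the comments above.
ADAPTERS = [
    ("f/c 5703 RAID SCSI", "1069", "B166", "1014", "0278", True),
    ("f/c 5702 storage controller", "1069", "B166", "1014", "0266", True),
    ("f/c 6203 Ultra3 SCSI", "1000", "0021", "1000", "1010", False),
    ("lpfc Fibre Channel", "10DF", "FA00", "10DF", "FA00", False),
]


def group_by_chip(adapters):
    """Map (vendor, device) to the list of (label, hangs) sharing that pair."""
    by_chip = defaultdict(list)
    for label, vendor, device, _subvendor, _subdevice, hangs in adapters:
        by_chip[(vendor, device)].append((label, hangs))
    return dict(by_chip)


def chips_that_hang(adapters):
    """Return the (vendor, device) pairs for which any adapter hangs."""
    groups = group_by_chip(adapters)
    return sorted(chip for chip, members in groups.items()
                  if any(hangs for _label, hangs in members))


if __name__ == "__main__":
    for chip, members in group_by_chip(ADAPTERS).items():
        print(chip, members)
    print("hang seen with:", chips_that_hang(ADAPTERS))
```

On this data, both adapters that recreate the hang share vendor 1069 and device B166 and differ only in their subsystem device id.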

Comment 34 IBM Bug Proxy 2006-07-13 19:06:46 UTC
----- Additional Comments From bjking1.com(prefers email via brking.com)  2006-07-13 15:12 EDT -------
This is a DAC960 bug. The pci id table in DAC960.c indicates it supports all
devices with PCI vendor id 1069 and PCI device id B166, which it does not. There
are several ipr adapters that use this same chip. The DAC960 driver needs to be
fixed to specify which PCI subsystem ids it supports so it does not try to
initialize ipr adapters. Can we simply not build DAC960 for ppc64pseries and
ppc64iseries as a short term solution? 
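
To illustrate the matching rule described above, here is a simplified model of how a PCI id table entry with wildcard subsystem ids claims every board built on the same chip. It is a sketch, not the kernel code: `PCI_ANY_ID` is modeled as `None`, class/class-mask matching is omitted, and the narrowed table entry is a hypothetical example rather than the content of any actual patch.

```python
PCI_ANY_ID = None  # stand-in for the kernel's wildcard value

MYLEX = 0x1069
IBM = 0x1014


def entry_matches(entry, dev):
    """entry and dev are (vendor, device, subvendor, subdevice) tuples."""
    return all(want is PCI_ANY_ID or want == have
               for want, have in zip(entry, dev))


def claims(id_table, dev):
    """True if any entry in the driver's id table matches the device."""
    return any(entry_matches(entry, dev) for entry in id_table)


# Broad entry as described in the comment: any subsystem ids are accepted.
BROAD_TABLE = [(MYLEX, 0xB166, PCI_ANY_ID, PCI_ANY_ID)]

# Hypothetical narrowed entry: only boards whose subsystem vendor is Mylex.
NARROW_TABLE = [(MYLEX, 0xB166, MYLEX, PCI_ANY_ID)]

# ipr adapters from this bug: same chip, IBM subsystem ids.
IPR_5703 = (MYLEX, 0xB166, IBM, 0x0278)
IPR_5702 = (MYLEX, 0xB166, IBM, 0x0266)

if __name__ == "__main__":
    for name, dev in (("f/c 5703", IPR_5703), ("f/c 5702", IPR_5702)):
        print(name, "broad:", claims(BROAD_TABLE, dev),
              "narrow:", claims(NARROW_TABLE, dev))
```

With the broad table both ipr adapters are claimed; once the subsystem vendor is pinned, neither is, which is the behavior the comment asks for.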

Comment 35 IBM Bug Proxy 2006-07-13 20:05:54 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
              Owner|csiddali.com         |bjking1.com




------- Additional Comments From bjking1.com(prefers email via brking.com)  2006-07-13 16:09 EDT -------
I sent a couple notes out to try to track down a contact for the DAC960
driver/hardware... 

Comment 36 IBM Bug Proxy 2006-07-14 21:11:12 UTC
Created attachment 132464 [details]
dac960_id_table_fixup.patch

Comment 37 IBM Bug Proxy 2006-07-14 21:12:43 UTC
----- Additional Comments From bjking1.com(prefers email via brking.com)  2006-07-14 17:16 EDT -------
 
Proposed fix to the DAC960 driver

The proposed fix should prevent the DAC960 driver from talking to ipr adapters. 

Comment 38 IBM Bug Proxy 2006-07-14 21:13:15 UTC
----- Additional Comments From bjking1.com(prefers email via brking.com)  2006-07-14 17:17 EDT -------
Can someone who has this hardware try out the patch and verify it fixes the problem? 

Comment 39 Paul Nasrat 2006-07-19 10:59:17 UTC
Manoj have you tried this patch out?

Comment 40 Manoj Iyer 2006-07-20 21:51:26 UTC
I was able to reproduce this problem on a squadron using the RHEL5 7/11
boot.iso. I will build a kernel with Brian's patches and see if that fixes the
issue.

Comment 41 IBM Bug Proxy 2006-07-26 22:20:51 UTC
----- Additional Comments From bjking1.com(prefers email via brking.com)  2006-07-26 18:27 EDT -------
Has anyone been able to try out this patch? 

Comment 42 Manoj Iyer 2006-07-28 20:53:43 UTC
Tested the patch with FC5 Gold kernel, the DAC960 does not hang, and makes
progress. I need to do more testing and do a full install with FC development /
RHEL tree. 

Comment 43 Manoj Iyer 2006-08-03 15:35:29 UTC
I was able to boot the install disk with Rawhide kernel plus this patch and did
not hang on DAC960.

Comment 44 IBM Bug Proxy 2006-08-07 13:36:31 UTC
----- Additional Comments From bjking1.com (prefers email at brking.com)  2006-08-07 09:41 EDT -------
Code has been submitted upstream:

http://marc.theaimsgroup.com/?l=linux-scsi&m=115463141705264&w=2 

Comment 45 Paul Nasrat 2006-09-20 07:21:46 UTC
Brian, did you receive any feedback on this patch? Is it in any upstream trees?

Could someone from the kernel team review this patch possibly?

Comment 46 Paul Nasrat 2006-09-20 07:24:25 UTC
*** Bug 207140 has been marked as a duplicate of this bug. ***

Comment 47 IBM Bug Proxy 2006-09-20 13:45:51 UTC
----- Additional Comments From bjking1.com (prefers email at brking.com)  2006-09-20 09:40 EDT -------
It's currently in James Bottomley's scsi-misc tree, so it should get pushed with
his scsi update for 2.6.19.

http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=fddafd3d21953d5ea740f7b2f27149f7dd493194 

Comment 49 IBM Bug Proxy 2006-09-21 14:41:14 UTC
----- Additional Comments From rschelle.com  2006-09-21 10:35 EDT -------
Will Red Hat be backporting the DAC960 driver patch to the RHEL5 kernel?  

If so, when should we expect a RHEL5 beta build that integrates this patch? 

Comment 50 Konrad Rzeszutek 2006-09-21 17:45:31 UTC
rschelle:

I looked at the Beta2 kernels (http://people.redhat.com/dzickus/el5/)
and the linux-2.6-ppc-dac960-ipr-clash.patch has the patch.

If you can, pls verify the kernel.


Comment 51 IBM Bug Proxy 2006-10-03 15:20:55 UTC
----- Additional Comments From rschelle.com  2006-10-03 11:15 EDT -------
The 20060927 code drop for rhel5 beta1 (2.6.18-1.2702.el5) no longer runs into
the DAC960 driver hang on install.  The ipr driver is correctly probed and
loaded without user assistance. 

Comment 52 IBM Bug Proxy 2006-10-05 15:51:43 UTC
----- Additional Comments From bjking1.com (prefers email at brking.com)  2006-10-05 11:48 EDT -------
Mark,

Can you verify this is fixed in fc6-test3? 

Comment 53 IBM Bug Proxy 2006-10-05 16:16:09 UTC
----- Additional Comments From marksmit.com  2006-10-05 12:11 EDT -------
Hi Brian,
I can certainly confirm that this bug does not recreate on FC6-test3.  I've 
been using those iso's for many ppc64 installs now.
But here's the tricky part. When I tested the daily builds of rawhide (June, 
July), this recreate came & went frequently.  
I could never figure out if it is something random in the build process, but 
to this day cannot tell in advance whether a build (without your fix) will 
recreate the hang or not. 

Comment 54 IBM Bug Proxy 2006-10-05 17:25:52 UTC
----- Additional Comments From bjking1.com (prefers email at brking.com)  2006-10-05 13:21 EDT -------
According to the kernel rpm changelog in fc6-test3, the dac960/ipr collision
patch was included on 08/03. 

Comment 55 IBM Bug Proxy 2006-10-05 17:26:23 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED




------- Additional Comments From bjking1.com (prefers email at brking.com)  2006-10-05 13:22 EDT -------
Closing, as the problem is fixed in FC6-test3. 

Comment 57 Dave Jones 2006-10-16 18:06:26 UTC
A new kernel update has been released (Version: 2.6.18-1.2200.fc5)
based upon a new upstream kernel release.

Please retest against this new kernel, as a large number of patches
go into each upstream release, possibly including changes that
may address this problem.

This bug has been placed in NEEDINFO state.
Due to the large volume of inactive bugs in bugzilla, if this bug is
still in this state in two weeks time, it will be closed.

Should this bug still be relevant after this period, the reporter
can reopen the bug at any time. Any other users on the Cc: list
of this bug can request that the bug be reopened by adding a
comment to the bug.

In the last few updates, some users upgrading from FC4->FC5
have reported that installing a kernel update has left their
systems unbootable. If you have been affected by this problem
please check you only have one version of device-mapper & lvm2
installed.  See bug 207474 for further details.

If this bug is a problem preventing you from installing the
release this version is filed against, please see bug 169613.

If this bug has been fixed, but you are now experiencing a different
problem, please file a separate bug for the new problem.

Thank you.

Comment 58 Mark Smith 2006-10-18 22:39:03 UTC
Although it is not possible to test the recreate on Version: 2.6.18-1.2200.fc5 
without either a netboot image or an install kernel, I am able to observe 
using lsmod, that mylex DAC960 module is loaded by default on this ppc64 
machine (with only ipr devices, not dac960 devices)  when booting from the 
stock FC5 kernel (2.6.15) after a fresh install.  And yes, you do have to 
apply the nostorage workaround to do the fresh install, else this recreates.
By comparison, after doing yum update and rebooting with the Version:
2.6.18-1.2200.fc5 kernel, DAC960 module is no longer loaded by default.
Based on this, I would guess that if a netboot image were made available of 
this Version: 2.6.18-1.2200.fc5 kernel, then this would not recreate.
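
The lsmod check described above can be scripted. A minimal sketch, assuming the usual lsmod layout (one header line, then one module per line with the name in the first column); the sample output and its sizes are hypothetical.

```python
def loaded_modules(lsmod_output):
    """Return the set of module names from lsmod-style text."""
    lines = lsmod_output.splitlines()
    # Skip the "Module  Size  Used by" header and any blank lines.
    return {line.split()[0] for line in lines[1:] if line.strip()}


def is_loaded(module, lsmod_output):
    """True if the named module appears in the lsmod-style text."""
    return module in loaded_modules(lsmod_output)


# Hypothetical lsmod output, for illustration only.
SAMPLE = """\
Module                  Size  Used by
DAC960                 93000  0
ipr                   110000  2
scsi_mod              200000  3 ipr,sd_mod
"""

if __name__ == "__main__":
    print("DAC960 loaded:", is_loaded("DAC960", SAMPLE))
```

On a live system the text would come from running lsmod; here it is passed in as a string so the check can be tried without the hardware.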

I have opened a new bug: 211383 for a regression seen while testing.

