Bug 127385 - (IT_54259) Machines don't boot on LSI22320-R adapters
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: ia64 Linux
Priority: medium, Severity: high
Assigned To: Tom Coughlan
Reported: 2004-07-07 10:53 EDT by Pierre Fumery
Modified: 2007-11-30 17:06 EST
Doc Type: Bug Fix
Last Closed: 2004-11-12 13:36:42 EST

Attachments
patch to Chaparral entry in the SCSI whitelist (1.39 KB, patch)
2004-07-19 17:30 EDT, Tom Coughlan
scsi_mod.o for 2.4.21-15.18.EL, with modified entry for Chaparral (57.36 KB, application/octet-stream)
2004-07-19 17:35 EDT, Tom Coughlan
Patch for scsi_scan, to limit LUN scan on Chaparral (second try) (705 bytes, patch)
2004-07-19 17:41 EDT, Tom Coughlan
64bit version of this tentative patch (95.68 KB, patch)
2004-07-26 12:24 EDT, Pierre Fumery
ISO image driver disk for AS2.1 U5 (120.92 KB, application/octet-stream)
2004-08-24 12:51 EDT, Doug Ledford
Driver disk for RHEL3 U3 (129.19 KB, application/octet-stream)
2004-08-24 12:52 EDT, Doug Ledford
Yet another disk. (130.14 KB, application/octet-stream)
2004-09-07 15:32 EDT, Doug Ledford
New AS2.1 disk image (243.44 KB, application/octet-stream)
2004-09-07 15:34 EDT, Doug Ledford
Picture of Panic (255.63 KB, image/jpeg)
2004-09-07 16:53 EDT, Bill Peck
serial dump of the panic. (33.96 KB, text/plain)
2004-09-07 17:11 EDT, Bill Peck
Yet another RHEL3 driver disk iso (227.52 KB, application/octet-stream)
2004-09-07 17:54 EDT, Doug Ledford
another serial console capture... (33.82 KB, text/plain)
2004-09-08 13:39 EDT, Bill Peck

Description Pierre Fumery 2004-07-07 10:53:31 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7)
Gecko/20040514

Description of problem:
We ran several tests (same machine and adapter). The combinations and
results are listed below:

Booting on LSI22320-R:
   - RHAS2.1-U4 : KO
   - RHAS3-U3 (early release) : KO

   - RHAS2.1-U3 : OK
   - RHAS3-U1 : OK

Booting on Adaptec SCSI CARD 39160:
   - OK on all four RHAS versions listed above.

For RHAS2.1, Issue Track #43391 tracks this problem.

Version-Release number of selected component (if applicable):
kernel-2.4.21-15.5.EL

How reproducible:
Always

Steps to Reproduce:
1. Use the LSI22320-R as the boot device.
2. Try to boot.
3. The same boot with the Adaptec SCSI card works fine.
    

Actual Results:  Unable to boot.

Expected Results:  We should boot ...

Additional info:

Issue Tracker #43391 identifies this problem as one to be fixed on RHEL2.1-U4 too.

A workaround has been found by using previous driver version 2.05.05
instead of current version 2.05.11.

This regression found on RHEL2.1-U4 (vs. U3) and RHEL3-U3 (vs. U1)
should be fixed in the current RHEL3-betaU3.
Comment 1 Pierre Fumery 2004-07-07 13:13:37 EDT
A version has been posted for RH2.1. Could we expect to get the same fix
for RHEL3 soon? Thanks in advance.
Comment 2 Pierre Fumery 2004-07-13 08:48:56 EDT
We're waiting for the fix to be integrated into RHEL3-U3 as soon as
possible. We just got the "RHEL 2.1 Update 5 Beta Preview ISOs", but it
seems the fix was not integrated there either.
Comment 3 Tom Coughlan 2004-07-13 11:15:35 EDT
We are trying to reproduce this. We have several adapters that use the
same driver, but we do not seem to have an LSI22320-R adapter. I will
continue to investigate.

Does the LSI22320-R device work correctly when the system boots off
something else? That is, when you boot one of the failing systems
(like RHAS2.1-U4 or RHAS3-U3) from the 39160 adapter, does the
LSI22320-R adapter work correctly after the system is up, when it is
used to access secondary storage?
Comment 4 Pierre Fumery 2004-07-13 12:24:41 EDT
On both of our machines (Bull HW and Intel HW) we experienced problems
with the LSI22320-R adapter when going through it to access the boot disk.

I've been told that another Bull team hit this problem too when they
were booting from an Adaptec card while an LSI22320-R adapter was
connected to another disk (not the boot disk). I didn't check their
configuration myself, or which LSI22320-R adapter/FW version was used,
so I cannot be 100% sure either.

If you could get such LSI22320-R adapter to set up an in-house
configuration, it will be easier for you to investigate this problem.
Also, it has been well identified that it's linked to LSI22320-R
driver level (see IT #43391: 2.05.05 version is OK, 2.05.11 version is
NOT).

Thanks in advance for your investigation.
Comment 5 Tom Coughlan 2004-07-13 12:52:59 EDT
I just booted RHEL 3 U3 on an Intel Tiger.  The boot device is an:

Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)

I know the LSI22320-R is also a dual Ultra320 SCSI, but I am not sure
if this is an exact match or not.

The driver version is: 

Fusion MPT SCSI Host driver 2.05.16
scsi0 : ioc0: LSI53C1030, FwRev=01030600h

I would like to compare my system to yours in detail.  Please post a
sysreport for your Intel system with the LSI22320-R installed.
Ideally, this would be from an o.s. version that fails to boot
LSI22320-R but works with the 39160 adapter.  If this is not readily
available, then a sysreport from any o.s. you can boot on the Intel
box will provide a starting point.
Comment 6 Pierre Fumery 2004-07-13 12:58:31 EDT
It has already been posted on IT #43391. Please check comment "Event
posted 07-06-2004 07:00am by Pierre.Fumery".

Explanation and sysreport are already available there.

Thanks for your investigation.
Comment 7 Tom Coughlan 2004-07-13 16:09:49 EDT
Thanks for the pointer to the sysreport in IT #43391, I had not seen it.

The sysreport (cbrunet.975.tar.bz2) shows a successful boot of a disk
attached to an LSI Logic adapter:

- boot disk is sdb at scsi2, channel 0, id 2, lun 0 
- scsi2 : ioc2: LSI53C1030, FwRev=01030a00h
- disk is Vendor: MAXTOR    Model: ATLAS10K4_73SCA   Rev: DFV0
- RHEL 3 pre-U3 (-15.18.EL).  
- Fusion MPT SCSI Host driver 2.05.16
- lspci shows two LSI Logic adapters:
  Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
  Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
  (not sure if the rev difference matters)

So, in fact, it is possible to boot some storage from an LSI Logic
adapter with a recent driver version.  The problem, as you have said,
occurs when you try to boot the Chaparral SR0812 storage on the LSI
Logic Adapter. 

Can you confirm that the Chaparral is connected to a different LSI
Logic adapter than the MAXTOR ATLAS?  Is it possible to try it
on the same adapter as the MAXTOR ATLAS? 
 

Comment 8 Tom Coughlan 2004-07-14 18:05:29 EDT
So, if you would, please try booting from the MAXTOR ATLAS, with the
Chaparral attached to the other port on the same adapter as the MAXTOR
ATLAS.  Then try it again with the Chaparral on one of the ports on
the other mpt fusion adapter.  Please post any error messages that
occur when the system tries to configure the Chaparral. Thanks.
Comment 9 Pierre Fumery 2004-07-15 05:18:17 EDT
Thanks for your investigation.
I apologize for not answering your questions sooner, but July 14th is
Bastille Day here and nobody was on site to look at your request.

Unfortunately, Claude, who has done all the test investigation for the
past month, left on vacation last Tuesday, so he won't be able to answer
questions himself about the tests he performed to identify, reproduce,
and isolate this problem.

However, I've checked with him before he left and he confirmed his
machine did boot on an Adaptec card only and it didn't boot on a
LSI22320-R one.
From your investigation (Additional Comment #7 From Tom Coughlan), it
seems you discovered this machine successfully booted from scsi2 :
ioc2: LSI53C1030 (Fusion MPT SCSI Host driver 2.05.16) and I would
suspect it's from its motherboard.

Anyway, I'll need to investigate further and ask our test team to
assign another person to run more tests if and when needed. It'll
be harder, as Claude already spent a month on this and we'll have to
start investigating and testing again.
Comment 10 Pierre Fumery 2004-07-15 05:57:36 EDT
Tom,
I just double-checked Claude's logbooks and it seems "stlinux9" was
his victim and I found the following:
stlinux9 motherboard
ioc2: LSI53C1030, FwRev=01030a00h, Ports=1, MaxQ=255
ioc3: LSI53C1030, FwRev=01030a00h, Ports=1, MaxQ=255

It would confirm what you saw in his traces, I mean that booting from
motherboard drivers did work (LSI internal chip) but booting from
LSI22320-R (add-on card) did fail.
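
When comparing adapters across machines, as in the traces quoted in this report, it can help to pull the chip name and firmware revision out of the mpt fusion boot lines mechanically. The following is only a minimal sketch: the function name and the regular expression are illustrative assumptions based on the line format shown in the comments above, not part of any Red Hat tool.

```python
import re

# Assumed line format, taken from the traces quoted in this report, e.g.
#   "ioc2: LSI53C1030, FwRev=01030a00h, Ports=1, MaxQ=255"
#   "scsi0 : ioc0: LSI53C1030, FwRev=01030600h"
IOC_LINE = re.compile(r"(ioc\d+): (\w+), FwRev=([0-9a-fA-F]+)h")


def parse_ioc_lines(log_text):
    """Return {ioc name: (chip, firmware revision)} for each matching line."""
    found = {}
    for match in IOC_LINE.finditer(log_text):
        name, chip, fw_rev = match.groups()
        found[name] = (chip, fw_rev.lower())
    return found


# Lines copied from the comments above.
stlinux9 = (
    "ioc2: LSI53C1030, FwRev=01030a00h, Ports=1, MaxQ=255\n"
    "ioc3: LSI53C1030, FwRev=01030a00h, Ports=1, MaxQ=255\n"
)
tiger = "scsi0 : ioc0: LSI53C1030, FwRev=01030600h\n"

print(parse_ioc_lines(stlinux9))
print(parse_ioc_lines(tiger))
```

Both machines report the same chip name, so a comparison like this only exposes the firmware revision difference; it says nothing about which board the chip sits on.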

Do you know if the LSI53C1030 (internal chip) uses the same driver as
the LSI22320-R (LSI adapter card)? I ask because I also discovered that
your traces said: driver 2.05.16

We know that this LSI22320-R card was working well (driver 2.05.05) on
RHAS2.1-U3 and RHAS3-U1.
We know that this LSI22320-R card did NOT work (driver 2.05.11) on
RHAS2.1-U4 and RHAS3-betaU3.
We know that this LSI22320-R card did NOT work (driver ??.??.??) on
RHAS2.1-betaU5. Do you know which LSI driver version is integrated in
RHAS2.1-betaU5 ? I would expect 2.05.11 as for RHAS3-betaU3 that would
explain why it does not work.

Could we upgrade LSI driver version on both RHAS2.1-betaU5 and
RHAS3-betaU3 to 2.05.16 ? And I would expect it could work. Could you
agree on that ? Could you check on your side if you use 2.05.16 as
well when it works ?
Comment 11 Tom Coughlan 2004-07-15 09:31:37 EDT
Yes, I was aware of your holiday.  No problem at all and, Happy
Bastille day!

Both the internal chip and the add-on card use the same driver (mpt
fusion).

RHEL 3 U3 and AS 2.1 U5 both contain mptfusion driver 2.05.16.
(Bastien, please update the Issue Tracker, I do not have write access
to it.)

So at this point we have:
2.05.05 is OK, 
2.05.11 fails with add-on card connected to Chaparral storage
2.05.16 fails, as above.

I am currently investigating a possible issue that is specific to the
Chaparral.  I'll be looking at driver differences as well.

If you are able at some point to test the Chaparral connected to the
internal chip, that may be helpful.  Also, if there are any additional
error messages when the failure occurs, that will help. (I understand
about delays due to vacations. No problem.) 

Tom


Comment 12 Pierre Fumery 2004-07-15 09:52:19 EDT
I'll try to figure out how (people+machine) to test what you asked for.

In the meantime, I dug a little further into Claude's logbooks
and I found the following (I just translated his sentences):
During another test, if you try to install a RHL AS2.1 U4 on a SR0812
connected through a LSI22320-R, installation hangs when the mptscsih
driver is being loaded.

I'm not sure but if it can bring another little clue ...
Comment 13 Pierre Fumery 2004-07-15 10:05:49 EDT
Also some other tests done with SJ0812 gave different results but it's
not obvious to understand what did work and what didn't.

As far as I understood and as far as I know, when using RAID feature
(SJ0812) it could have worked one time through LSI22320-R (with RHEL3
Update pre-beta3 (=kernel 2.4.21-15.18)). Are there different
configuration paths ? I'm not sure at all about what I tried to
understand and I wouldn't like to add confusion. So, please don't put
too much importance on this current note. But if it could help ...
Comment 14 Pierre Fumery 2004-07-15 12:11:56 EDT
I investigated a little more around here and I think I can answer
your initial question resulting from your sysreport analysis.

Can you confirm that the Chaparral is connected to a different LSI
Logic adapter than the MAXTOR ATLAS?
===> Yes, the MAXTOR ATLAS is internal boot disk which is
linked/reached through LSI Logic adapter (on-board LSI53C1030 chip).
===> "The LSI22320-R adapter is not used to access to the boot disk;
it's only used to access data in a SCSI disk subsystem (SR0812 from
Chaparral)." (see IT #43391).

Is it possible to try it on the same adapter as the MAXTOR ATLAS?
===> I'm not sure we can connect the Chaparral to the on-board
LSI53C1030 chip. I need to ask people who work on these machines to
know (1) if it can be done and then (2) to try it. It'll be tomorrow
as it's already late here.
Comment 15 Pierre Fumery 2004-07-15 12:22:57 EDT
I apologize for my incorrect statement this morning (on 2004-07-15 05:18),
which may have misled you, when I wrote:
"he confirmed his machine did boot on an Adaptec card only and it
didn't boot on a LSI22320-R one."

In fact, the tests were done by using the SR0812 from Chaparral for
data access (not booting), either through an Adaptec adapter (result OK)
or through an LSI22320-R adapter (result KO).
Boot was done on internal MAXTOR ATLAS disk reached through on-board
LSI53C1030 chip.
Comment 16 Tom Coughlan 2004-07-19 17:28:01 EDT
I have a theory about the cause of this problem.  

The most interesting difference between the 2.05.05 driver that works,
and the next version, 2.05.11.03, that does not work,  is:

-#define MPT_LAST_LUN                   31
+#define MPT_LAST_LUN                   255

I have seen some external SCSI RAID subsystems that do not handle
being probed for LUNs > 31 very well at all.  

One way to test this theory is to change the SCSI "whitelist" so that
the system will only probe the Chaparral box for sequential LUNs, up
until it finds a gap, where it will stop probing. This change is shown
in the attached patch. 
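
The difference between the two scanning behaviours described above can be sketched with a toy model. This is not the kernel's scan code: the function name, its parameters, and the example LUN layout are assumptions made for illustration. Only the last-LUN values 31 and 255 come from the driver diff quoted above.

```python
def probed_luns(configured, last_lun, sparse):
    """Return the LUN numbers a scan would probe on one target.

    configured -- set of LUN numbers that actually exist on the device
    last_lun   -- highest LUN number the driver allows (31 or 255 above)
    sparse     -- True: probe every LUN up to last_lun, gaps included
                  False: stop after the first LUN that is not there
    """
    probed = []
    for lun in range(last_lun + 1):
        probed.append(lun)
        if not sparse and lun not in configured:
            break  # sequential scan: the first gap ends the scan
    return probed


# Hypothetical storage box with three LUNs configured.
box = {0, 1, 2}

print(len(probed_luns(box, last_lun=31, sparse=True)))    # old driver limit
print(len(probed_luns(box, last_lun=255, sparse=True)))   # new driver limit
print(probed_luns(box, last_lun=255, sparse=False))       # sequential scan
```

In this model the sequential scan never reaches the high LUN numbers, whatever the driver's limit is, which is the effect the whitelist change is meant to test.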

I have also attached a re-built scsi_mod.o with this patch applied. 
It is built for -15.18.EL. If you need something different let me
know. To test this, 

1. mv /lib/modules/2.4.21-15.18.EL/kernel/drivers/scsi/scsi_mod.o
to a safe place.

2. put the attached scsi_mod.o in 
/lib/modules/2.4.21-15.18.EL/kernel/drivers/scsi/scsi_mod.o

3. make a new initrd (mkinitrd), and add a new entry to elilo.conf.

4. boot the new initrd, with the Chaparral attached to the LSI22320-R.

Thanks.

Tom
Comment 17 Tom Coughlan 2004-07-19 17:30:08 EDT
Created attachment 102053 [details]
patch to Chaparral entry in the SCSI whitelist
Comment 18 Tom Coughlan 2004-07-19 17:35:06 EDT
Created attachment 102054 [details]
scsi_mod.o for 2.4.21-15.18.EL, with modified entry for Chaparral

Use bunzip2 to restore the file.
Comment 19 Tom Coughlan 2004-07-19 17:41:56 EDT
Created attachment 102055 [details]
Patch for scsi_scan, to limit LUN scan on Chaparral (second try)

Oops.  That first patch had an extra change in it that I did not intend.  Only
the last hunk was intended. The scsi_mod.o file is okay as-is.
Comment 20 Pierre Fumery 2004-07-20 11:23:53 EDT
Hi Tom,

We did try without success to use your scsi_mod.o binary. "mkinitrd"
failed before I discovered that your binary was an ia32/x86 binary.

We're using IA64 boxes and we need ia64/Itanium binaries.

Could you please provide us with an ia64 scsi_mod.o binary so that we
can try your patch? The people testing it are test staff, and they don't
have the full environment needed to compile it in the right binary format.
Thanks in advance.
Comment 21 Pierre Fumery 2004-07-21 03:19:00 EDT
I took these following comments from IT #43391 that was for RHEL2.1.
This defect is for RHEL3 and this patch has been built against
2.4.21-15.18.EL, so it should be for RHEL3.

My guess is that the Chaparral is having a hard time when it is probed
for LUNs > 31. This patch will prevent this by making the system stop
probing when it finds the first undefined LUN.

-       {"CNSi", "JSS122", "*", BLIST_SPARSELUN},               // Chaparral SR0812 SR1422
+       {"CNSi", "JSS122", "*", BLIST_FORCELUN},                // Chaparral SR0812 SR1422

The attached scsi_mod.o has this patch.  This module is built for
2.4.21-15.18.EL SMP ia64.

Please test it by following these steps:

1. mv /lib/modules/2.4.21-15.18.EL/kernel/drivers/scsi/scsi_mod.o
to a safe place.

2. bunzip2 the attached file and put the resulting scsi_mod.o in
/lib/modules/2.4.21-15.18.EL/kernel/drivers/scsi/scsi_mod.o

3. make a new initrd (mkinitrd), and add a new entry to elilo.conf.

4. boot the new initrd, with the Chaparral attached to the LSI22320-R.

Tom
Comment 22 Pierre Fumery 2004-07-21 03:30:34 EDT
Tom,

Could you please (re)post your ia64 binary in this defect, as we have
not been able to extract it from IT #43391?
Thanks.
Comment 24 Pierre Fumery 2004-07-26 12:24:45 EDT
Created attachment 102205 [details]
64bit version of this tentative patch

Got this patch by e-mail.
Comment 25 Claude BRUNET 2004-07-27 05:10:20 EDT
Good news: 
   With the new scsi_mod release (patch delivered by email to Pierre
Fumery), the error message "MID not found" no longer appears and
the server boots correctly with a SR0812 disk subsystem attached to a
LSI22320-R adapter.

The tests have been done using an internal disk as system disk (not a
disk in the SR0812). In the same configuration, the server couldn't
boot correctly with the scsi_mod release delivered in RHEL3 Update 3.

Questions:
1- With this patch, it is mandatory that the LUNs in the SR0812 are
numbered consecutively from 0 (with no hole). Is my understanding correct?

2- What will be the status of the "official" delivery?

Regards,
   Claude.
Comment 27 Tom Coughlan 2004-07-27 17:30:34 EDT
1. Yes, with this patch it is mandatory that the LUNs be numbered
consecutively, starting at zero.

2. In the latest RHEL 3 U3 respin we have restored the 2.05.05 driver,
in addition to the 2.05.11.03 and 2.05.16 drivers.  This was done so
that customers can switch to the older driver, in case we are not able
to ship a better solution in time.  AS 2.1 x86 and IPF also have all
three driver versions.

3. Now that we know what the problem is, we need to pick the best
solution for U3 and U5, assuming that we are able to make any changes
at this late stage. I am still investigating our options here. I hope
to have an answer on this tomorrow.
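
To illustrate point 1: under a scan that stops at the first missing LUN, any LUN numbered after a hole is never configured. The sketch below is a hypothetical helper (its name and the example layouts are assumptions) that splits a LUN layout into the part such a scan would find and the part it would miss.

```python
def split_by_sequential_scan(configured):
    """Split LUNs into (visible, hidden) for a scan that stops at the first gap.

    configured -- iterable of LUN numbers defined on the storage device
    """
    configured = set(configured)
    visible = []
    lun = 0
    while lun in configured:
        visible.append(lun)
        lun += 1
    # Everything after the first hole is never probed, so it stays hidden.
    hidden = sorted(configured - set(visible))
    return visible, hidden


print(split_by_sequential_scan([0, 1, 2, 3]))     # consecutive from 0
print(split_by_sequential_scan([0, 1, 2, 5, 6]))  # hole at LUN 3
print(split_by_sequential_scan([1, 2]))           # LUN 0 missing
```

A layout is safe with the patched scsi_mod only when the second list comes back empty.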

Tom
Comment 28 Tom Coughlan 2004-07-29 17:50:19 EDT
Here is an update. I don't have a final resolution yet. 

Recent versions of the mpt fusion and aic79xx drivers increase the
max_lun parameter from 64 to 256.  When SPARSELUN is set for a device,
the SCSI midlayer unconditionally probes LUN values up to max_lun, so
high-numbered LUNs are being probed on these devices.

The problem occurs because on a parallel SCSI bus, the driver must use
the packetized protocol to address LUNs > 63. These drivers are
apparently not doing this, and instead, they are using the
non-packetized protocol for LUN 64 and above.  This can cause a system
hang, or non-existent devices to be configured, depending on the
details of the device and the driver.
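
The arithmetic behind this can be sketched as follows. The threshold of 63 and the max_lun values 64 and 256 are taken from the explanation above; the function and constant names are illustrative assumptions, and the model ignores everything else about the SCSI midlayer.

```python
# From the explanation above: on a parallel SCSI bus, LUNs above 63 can
# only be addressed with the packetized protocol.
HIGHEST_NON_PACKETIZED_LUN = 63


def luns_needing_packetized(max_lun):
    """LUNs a SPARSELUN scan probes that require the packetized protocol.

    max_lun is treated as a count, so the scan covers LUNs 0 .. max_lun - 1.
    """
    return [lun for lun in range(max_lun) if lun > HIGHEST_NON_PACKETIZED_LUN]


old = luns_needing_packetized(64)    # max_lun before the driver update
new = luns_needing_packetized(256)   # max_lun after the driver update

print(len(old))
print(len(new), new[0], new[-1])
```

With max_lun at 64 no probe falls in the problematic range; raising it to 256 adds 192 probes (LUNs 64 through 255) that a non-packetized driver cannot address correctly.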

The right solution to this is to fix the drivers. I have started a
discussion with the driver maintainers on this.  It is not likely that
we will be able to make significant changes to these drivers at this
late stage in U3/U5. Instead we will look for a workaround, like the
patch that removes the SPARSELUN flag from some devices. The problem
is knowing which devices. I will work with the driver maintainers to
determine what our options are, and pick the best one for U3/U5.     
Comment 29 Tom Coughlan 2004-08-05 17:47:04 EDT
I have confirmed that the problem is in the mpt fusion and aic79xx
drivers. They are probing LUNs > 63 on devices that do not support the
packetized protocol.  I expect to have a fix for this in U4/U6, but it
is too late to make a change like this in U3/U5.  

It would also be inappropriate to remove the SPARSELUN setting for the
JSS122 Chaparral storage device in U3/U5, because there is nothing
wrong with the way the device is behaving, and because if we change
it, then some customers who are running it with different drivers will
likely find that their LUNs are no longer configured.

The best solution available for U3/U5 is to manually switch to the
older mpt fusion v2.05.05. This can be selected during install with
the "expert noprobe" option.  If you are not installing to the mpt
fusion device, then edit /etc/modules.conf and re-make the initrd
after installation.   
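
The /etc/modules.conf edit could look roughly like the sketch below. It rests on several assumptions that this report does not confirm: that the adapter is referenced through "alias scsi_hostadapter..." lines, and that the 2.05.05 modules on the installed system carry the names mptbase_20505 and mptscsih_20505 (those names appear later in this report, on the driver disk). The function only transforms text; writing the file back and re-making the initrd with mkinitrd remain manual steps.

```python
import re

# Assumed alias format: "alias scsi_hostadapter<N> <module>".
ALIAS_LINE = re.compile(r"^(alias\s+scsi_hostadapter\d*\s+)(\S+)\s*$")

# Assumed module names for the 2.05.05 driver (seen on the driver disk).
RENAMES = {"mptbase": "mptbase_20505", "mptscsih": "mptscsih_20505"}


def switch_drivers(modules_conf_text, renames=RENAMES):
    """Return modules.conf text with the SCSI host adapter aliases renamed."""
    out = []
    for line in modules_conf_text.splitlines():
        match = ALIAS_LINE.match(line)
        if match and match.group(2) in renames:
            line = match.group(1) + renames[match.group(2)]
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical modules.conf content, for illustration only.
sample = (
    "alias eth0 e1000\n"
    "alias scsi_hostadapter mptbase\n"
    "alias scsi_hostadapter1 mptscsih\n"
)
print(switch_drivers(sample), end="")
```

Lines that do not name one of the listed modules are left untouched, so other adapters and network aliases are unaffected.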

There is no simple workaround for the aic79xx driver, but we have not
currently received any bug reports, and neither has the Adaptec
maintainer.  Given the fact that U3/U5 have essentially shipped at this
point, it is best to fix the drivers in U4/U6.

 
Comment 30 Susan Denham 2004-08-19 17:35:36 EDT
It looks like the needed module is not present in the BOOT kernel
because of the limited size of the floppy.

To work around this issue, we will have to create a driver
disk with the old driver inside.

A driver disk with the LSI 2.05.05 driver will be created in a couple
of days (our goal is to have it done by the end of the current week if
possible).

As some machines do not have a floppy drive, the actual medium will be
CD-ROM based.
Comment 31 Doug Ledford 2004-08-24 12:51:14 EDT
Created attachment 103031 [details]
ISO image driver disk for AS2.1 U5

This driver disk should work for AS2.1 U5.  Boot the system using the
noprobe option (that is, type linux noprobe at the elilo prompt). When asked
to select drivers, tell it you have a driver disk on CD, put in this CD image,
and load the mptscsih_20505 driver. You should be able to proceed after that.
Comment 32 Doug Ledford 2004-08-24 12:52:36 EDT
Created attachment 103032 [details]
Driver disk for RHEL3 U3

Same thing, RHEL3 U3.  If either disk fails to work, please report the exact
problem back here in this bugzilla and I'll get it taken care of.
Comment 39 Claude BRUNET 2004-08-27 05:23:37 EDT
The driver disk you delivered to us is reported as "bad" when we try an
installation in noprobe mode (elilo linux noprobe, or elilo linux
expert noprobe).

We got the following error messages:

- screen from "Ctrl Alt F4":
<4>FAT: bogus logical sector size 0
<4>VFS: Can't find a valid FAT filesystem on dev 03:00
<4>VFS: Can't find a valid ext2 filesystem on dev ide0(3:0)

- screen from "Ctrl Alt F3":
trying to mount hda
not a new format driver, checking for old
can't find either disk identifier, bad driver disk.

Remark:
   We encountered another kind of problem with the driver CD for AS2.1 U5
(see IT#43391)

Regards,
   Claude.
Comment 40 Claude BRUNET 2004-08-30 07:35:58 EDT
Could you tell us what the final solution might be for this problem,
which is major for Bull?

What do you think about the proposal made by Didier Marcon (see his
email to Susan S. Denham)?  Could Red Hat deliver to Bull a specific
CD1 that uses the old driver (2.05.05) during the boot phase and
afterwards as the default driver, and ensure full support for this delivery?

Regards,
   Claude.
Comment 45 Susan Denham 2004-09-07 06:29:22 EDT
To the Bull folks:  Since late last week, we've been  running into a
few problems making this driver disk.  (Unfortunately, yesterday was
the Labor Day holiday in the U.S.)  We expect to have an update on the
problem today.  The goal is obviously to provide you with a driver
disk as quickly as possible so that you can use it to run the
certification tests on the NS5160 and 6160.
Comment 49 Bill Peck 2004-09-07 11:51:58 EDT
Trying this now.
Comment 50 Bill Peck 2004-09-07 12:37:46 EDT
This is still a no go..

elilo linux dd

Prompted for the Driver Disk, inserted and received this error:

No Devices of the appropriate type were found on this driver disk. 
Would you like to manually select the driver, continue anyway, or load
another driver disk?

On virtual console 3 I have this output:

modules to insert e100 e1000 mptbase_20505 mptscsih_20505 qla2300
module(s) e100 e1000 mptbase_20505 mptscsih_20505 qla2300 not found
load module set done

I then selected the "Manually choose" option and scrolling to the
bottom of the list is:
mptfusion SCSI driver module (mptscsih_20505)

And again on VC3:

modules to insert mptbase_20505 mptscsih_20505
module(s) mptbase_20505 mptscsih_20505 not found
load module set done
Comment 51 Doug Ledford 2004-09-07 15:32:26 EDT
Created attachment 103552 [details]
Yet another disk.

Found a bug in the creation of the modules.cgz file on the disk image.  Try it
and see if this image solves the problem.
Comment 52 Doug Ledford 2004-09-07 15:34:27 EDT
Created attachment 103554 [details]
New AS2.1 disk image

Same bug existed on the AS2.1 disk image, so new image uploaded.
Comment 53 Bill Peck 2004-09-07 16:37:56 EDT
Gets further!

Driver disk recognizes the mptscsih_20505 driver and loads it.  Further
on in the install right after I fill in the info for Network install
(NFS server and directory) I get a page fault.

I'll post a picture of the panic but I'm afraid the interesting info
has scrolled off.  I'll get serial console going and capture it again.
Comment 54 Bill Peck 2004-09-07 16:53:55 EDT
Created attachment 103560 [details]
Picture of Panic
Comment 55 Bill Peck 2004-09-07 17:11:52 EDT
Created attachment 103561 [details]
serial dump of the panic.

I didn't see it in the picture, but in the serial output you can see
mpt_base_reply is referenced.
Comment 56 Doug Ledford 2004-09-07 17:54:08 EDT
Created attachment 103565 [details]
Yet another RHEL3 driver disk iso

I would have expected the base scsi module to have been loaded by the loader
already, I didn't see that in the output, so I put scsi_mod.o into the
modules.cgz file and explicitly called it out in modules.dep.  See if that
solves your problem.
Comment 57 Bill Peck 2004-09-08 12:09:57 EDT
No Change...
 
Pid: 0, comm:              swapper
EIP is at mpt_base_reply [mptbase_20505] 0x290 (2.4.21-20.EL)
psr : 0000101008022038 ifs : 800000000000040b ip  :
[<a0000000002f0a30>]    Not tainted
unat: 0000000000000000 pfs : 000000000000040b rsc : 0000000000000003
rnat: 0000000000000000 bsps: e000000004cafd00 pr  : 80000000af756927
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270033f
b0  : a0000000002f0990 b6  : e0000000044a0b60 b7  : a0000000002f07a0
f6  : 1003e00000000000000de f7  : 0ffdd8000000000000000
f8  : 1003e0000000000005340 f9  : 1003e0000000000000060
r1  : a000000000306938 r2  : 0000000000000000 r3  : 0000000000000008
r8  : e00000000645fd08 r9  : 0000000000005340 r10 : e0000000066128a8
r11 : 00000000000000de r12 : e0000000049f7ca0 r13 : e0000000049f0000
r14 : 0000000000000060 r15 : e00000000498c4e8 r16 : e0000000044a0b60
r17 : 0000000000000060 r18 : 0000000000000000 r19 : a000000000307c78
r20 : 000000000000000f r21 : 0000000000000060 r22 : e0000000064500b8
r23 : e0000000064500c0 r24 : 0000000000000060 r25 : e00000007f41804a
r26 : 0000000000000001 r27 : 0000000000000060 r28 : e00000007f41804e
r29 : 0000000000000001 r30 : 0000000078f34040 r31 : e0000000064e4000
 
Call Trace: [<e0000000044158e0>] sp=0xe0000000049f78a0
bsp=0xe0000000049f1578 show_stack [kernel] 0x80
[<e000000004431ae0>] sp=0xe0000000049f7a70 bsp=0xe0000000049f1550 die
[kernel] 0x200
[<e000000004451330>] sp=0xe0000000049f7a70 bsp=0xe0000000049f14f8
ia64_do_page_fault [kernel] 0x310
[<e00000000440e9a0>] sp=0xe0000000049f7b00 bsp=0xe0000000049f14f8
ia64_leave_kernel [kernel] 0x0
[<a0000000002f0a30>] sp=0xe0000000049f7ca0 bsp=0xe0000000049f14b8
mpt_base_reply [mptbase_20505] 0x290
 <0>Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing
Comment 58 Doug Ledford 2004-09-08 13:11:56 EDT
Did you get a serial console dump from the updated driver disk?  I'm
curious to know if the scsi subsystem initialization message shows up
this time.
Comment 59 Bill Peck 2004-09-08 13:39:54 EDT
Created attachment 103596 [details]
another serial console capture...

this is the serial console output..  I don't think you're going to see the scsi
subsystem initialization from the serial console.  When I load the driver disk
with the console going to video I can switch to console 4 (I think it's console
4) and see the scsi disks (sda1, sda2, etc...)
Comment 60 Doug Ledford 2004-09-08 15:28:05 EDT
When is the panic happening then?  If you are using the driver disk,
and you switch to console 4 and see the SCSI disks, at what stage does
the kernel oops?
Comment 61 Doug Ledford 2004-09-08 15:50:57 EDT
OK, I see it now.  This is a bug in Anaconda I think.  Bill, can you
try this on your machine and see if it works (it does here):

boot with linux noprobe dd
load the driver disk, select the mptscsih_20505 driver
select the proper network driver(s)
run the install

that should work.  The problem appears to be that if you don't disable
the autoprobing of devices then anaconda tries to load the mptscsih
module (the new one that's on our boot disks, not the version 2.05.05
from the driver disk that we've already loaded) and when you try to
load that driver twice, both copies of the driver end up trying to
access the same hardware and of course things break horribly.  Using
the noprobe option here settles that issue, but does mean all devices
have to be selected from the list, none are found automatically.
Comment 62 Bill Peck 2004-09-08 16:57:26 EDT
Yup, adding noprobe prevents the system panic.  It's installing now.

This will work but I'm going to open an anaconda bug stating that it
should remove internal pci-ids that are provided by driver disks.  
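
The idea proposed here can be sketched as a simple table filter. This is not anaconda's code: the function name, the table layout, and the sample entries are assumptions for illustration (1000:0030 is used as the PCI ID of the 53c1030, and the e100 entry is only an example of an unrelated device).

```python
def drop_overridden_ids(builtin_table, driver_disk_table):
    """Remove built-in entries whose PCI IDs are claimed by a driver disk.

    Both tables map (vendor_id, device_id) to a module name.
    """
    claimed = set(driver_disk_table)
    return {
        ids: module
        for ids, module in builtin_table.items()
        if ids not in claimed
    }


# Illustrative tables only, not copied from a real pcitable.
builtin = {
    (0x1000, 0x0030): "mptscsih",   # newer driver on the boot media
    (0x8086, 0x1229): "e100",
}
driver_disk = {
    (0x1000, 0x0030): "mptscsih_20505",  # older driver from the driver disk
}

print(drop_overridden_ids(builtin, driver_disk))
```

With the overlapping entry removed, automatic probing would no longer load a second driver for hardware the driver disk already handles, which is the double-load case described in the previous comment.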

That won't help for this time though.
Comment 63 Tim Burke 2004-09-09 20:59:56 EDT
Pierre,

Can you have the Bull team confirm that by using the boot syntax noted
in comment #61 that the install can be accomplished?  By communicating
this information to your customers you should then have a viable
workaround.
Comment 64 Jérôme ALEXANDRE 2004-09-10 02:45:06 EDT
As we explained some time ago, including by phone to some people at
Red Hat, the machine doesn't have a floppy drive, so it is not possible
to insert a floppy in the machine.
For AS 2.1:
There are (in my opinion) only three ways to get a correct boot:
 1- a new version of the mpt driver without the regression is built into
the boot CD
 2- the 2.05.05 version is built into the boot CD
 3- the loader releases the CD so that a new driver CD can be
inserted.

For RHEL 3 AS U3:
 we are still waiting for a driver CD

Regards

Comment 66 Doug Ledford 2004-09-10 12:04:22 EDT
For RHEL3, the driver CD on this bugzilla is *the* driver CD.  It's
done.  You have to use the noprobe option or the machine will flake
out, but that can't be avoided without breaking other things, so this
is the best that it's going to get without spinning a new U3 install
ISO that has a fixed version of anaconda.
Comment 67 Claude BRUNET 2004-09-15 12:38:58 EDT
What you delivered to us DOES NOT WORK.

Could you confirm that your "final" proposal is:
===================================================
- EFI command: elilo linux noprobe dd
- Driver CD from the attachment delivered in "Comment #56" (2004-09-07
17:54)


Content of this CD:
===================
[root@stlinux11 root]# ll /mnt/cdrom
total 231
-r-xr-xr-x    1 root     root           63 Aug 24 18:34 modinfo
-r-xr-xr-x    1 root     root       233278 Sep  7 23:52 modules.cgz
-r-xr-xr-x    1 root     root           39 Sep  7 23:52 modules.dep
-r-xr-xr-x    1 root     root          541 Aug 24 18:45 pcitable
-r-xr-xr-x    1 root     root           51 Aug 24 18:08 rhdd-6.1
[root@stlinux11 root]# 

ERROR Message returned:
=======================

No devices of the appropriate type
were found on this driver disk.
Would you like to manually select 
the driver, continue anyway or
load another driver.

REMARK:
=======
- using "elilo linux noprobe expert", we got the same error
- using "elilo linux noprobe", the CD1 cannot be ejected when the
window "Insert your Driver Disk" is displayed.
- the CD itself has readable content (no transfer problem). In
modules.cgz, there is a mptscsih_20505.o file, but under the
"2.4.21-20.EL" directory. Is that OK?

Could you tell me what to do?
Comment 68 Bill Peck 2004-09-15 12:55:55 EDT
The missing piece here is a proper document describing the steps
needed. I will attempt to fill in as best I can, but Docs will have to
produce a proper procedure for this.

When you receive your "error message" you then have to choose to
manually select the driver.  This is because auto-probing has to be
turned off to prevent the broken driver from loading later.

You will be presented with a list of different drivers. Scroll all the
way to the bottom and you will see an entry for the mptscsih_20505 driver.
Hit return here and the driver will load.

If the install is being done from CDROM you can stop loading drivers
and continue with the install.  If a network install is desired then
the network drivers will have to be loaded in this fashion as well.
Comment 72 Claude BRUNET 2004-09-16 05:38:24 EDT
GOOD NEWS: it is OK!

It was not obvious how to get to the mptscsih_20505 driver:
- continue after the "error message" by choosing the "Manually choose" option
- look right to the end of the long list to find mptscsih_20505 (and not
the line for "MPT Fusion" at the beginning of the list)
...
after that the installation is OK.

I will have to ask you for some further clarifications about the final result.
But I did not want to wait any longer before sending you this good news.

Regards,
    Claude.
Comment 73 Claude BRUNET 2004-09-16 10:03:52 EDT
And now, my questions about the installation result:

1- In /etc/modules.conf, there is no line for "mptbase".
        Usually there is such a line. Even after a standard RHEL3
Update3 installation. Is this line useless? Was this line useless in
the previous releases for RHEL3 and RHEL2.1?

2- The qla2300 and aic7xxx drivers are not automatically loaded by
this installation.
        This is not the standard behaviour.
        Moreover, these drivers are defined in /etc/modules.conf but are
not present in /boot/efi/efi/redhat/initrd-2.4.21-20.EL.img (they are
not in linuxrc). This is inconsistent. A final "mkinitrd" command
seems to have been forgotten at the end of the installation phase.
        Warning! The /etc/modules.conf file has to be modified before
running mkinitrd because the 2 MPT Fusion releases are present
simultaneously: mptscsih and mptscsih_20505. Running mkinitrd with an
unmodified modules.conf produces a kernel panic when the server is rebooted.

3- To get a completely correct installation, I had to:

- remove the "alias scsi_hostadapter1 mptscsih" line from
/etc/modules.conf
- run mkinitrd
- reboot.

The installation itself in "noprobe dd" mode is not simple at all.

After that, it is very hard to ask our customers to modify the
/etc/modules.conf file, run a mkinitrd command and reboot the
server again.

What could you propose to simplify all of that?
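
The modules.conf edit described in point 3 of this comment could be scripted roughly as follows. This is only a minimal sketch, not a supported procedure: the function name drop_stock_mptscsih_alias is hypothetical, the sample modules.conf content is invented for illustration, and it assumes the alias line has the form quoted in the comment (any scsi_hostadapter index is matched). Running mkinitrd and rebooting would still have to follow by hand.

```python
import re

# Matches an alias line that points a scsi_hostadapter slot at the stock
# "mptscsih" module. "mptscsih_20505" is not matched, because only
# whitespace may follow the module name before the end of the line.
STOCK_ALIAS = re.compile(r"^\s*alias\s+scsi_hostadapter\d*\s+mptscsih\s*$")

def drop_stock_mptscsih_alias(text):
    """Return modules.conf text without the stock mptscsih alias line(s)."""
    kept = [line for line in text.splitlines() if not STOCK_ALIAS.match(line)]
    return "\n".join(kept) + "\n"

# Hypothetical modules.conf content, for illustration only.
sample = (
    "alias eth0 e1000\n"
    "alias scsi_hostadapter mptscsih_20505\n"
    "alias scsi_hostadapter1 mptscsih\n"
    "alias scsi_hostadapter2 qla2300\n"
)

cleaned = drop_stock_mptscsih_alias(sample)
print(cleaned, end="")
```

Matching the whole line rather than searching for the substring "mptscsih" is what keeps the mptscsih_20505 alias in place.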
        
Comment 74 Susan Denham 2004-09-28 10:11:58 EDT
Back to the actual patch issue, restated here again to re-ground
everyone in the discussion for what version of the mptfusion driver is
requested in RHEL 3 U4 and RHEL 2.1 U6:

"LSI mptfusion driver 2.05.16 included in RHEL 3 U3 and RHEL 2.1 U5
provided new features but introduced a regression from at least
version 2.05.11 that prevents several of Bull's NovaScale systems from
booting. As a result, several Bull NovaScale systems that contain the
LSI adapter are in manufacturing Stop Ship.  This regression has been
discussed on both RHEL2.1-U4 (IT #43391) and RHEL3-U3 (BZ #127385). 
Bull needs an updated LSI mptfusion driver version into RHEL 2.1 U6
that fixes the regression."

Here is the status of including the fix in a RHEL update: Red Hat
*will* include a patch in RHEL 3 U4 and RHEL 2.1 U6 that updates the
mptfusion driver from 2.05.16 to 2.05.16.02.  This
patch addresses BZ 127385, FZ 131392 (AS2.1) and FZ 131393 (RHEL3).

And this from the mptfusion maintainer Eric Dean Moore at LSI on 15
September; note that LSI says that the 2.05.23 driver **is not** ready
for submission upstream: 

"Please apply the 2.05.16.02 driver to your Red Hat kernels.
This is the driver which solved the "max_lun on non-packetized 
SCSI devices" issue reported by Tom Coughlan back in July.

This fix was implemented by Larry Stephens, and perhaps Larry could
submit his changes/patch upstream to Kernel.org.

Regarding the 2.05.23 driver. I spoke to my manager, Terry Gibbons, 
just yesterday on having this submitted upstream. Terry
suggested that we hold off on submitting that driver, as this
driver version hasn't been widely accepted by various customers."

Sue here again:  As a result (and to repeat), RH will not include the
2.05.23 driver in RHEL 3 U4 and RHEL 2.1 U6 and *will* include the
LSI-recommended 2.05.23 driver.


  
Comment 75 Pierre Fumery 2004-09-28 10:44:52 EDT
I think you meant "*will* include the LSI-recommended 2.05.16.02
driver." Thanks.
Comment 76 Jeremy Katz 2004-09-28 10:56:46 EDT
Sue here:  Yes, that's what I meant. Bad cut and paste : (  Thanks for
correcting.
Comment 77 Susan Denham 2004-09-28 18:36:22 EDT
Posted to IT 43391 by Jeremy Katz 9/28:

There isn't a way to automatically run a script on the normal install
path.  But, if you take a different approach, then it can work easily.

Instead of replacing the mptfusion modules, you will want to do the
following.
* Copy mptbase_20505.o and mptscsih_20505.o into modules.cgz
 * Do as before without the rename
* Edit /modules/pcitable.  
 * Replace all instances of mptscsih with mptscsih_20505
* Edit /modules/modinfo.  
 * Replace all instances of mptscsih with mptscsih_20505.  
 * Replace all instances of mptbase with mptbase_20505.
* Edit /modules/modules.dep
 * Replace all instances of mptscsih with mptscsih_20505.  
 * Replace all instances of mptbase with mptbase_20505.
* Recreate boot CD based on boot.img with these changes.

Then, the old module will get loaded during the install.  It will also
be set up for use post-install.
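
The "replace all instances" steps above could be sketched as a whole-word text substitution. This is an illustration under assumptions, not the installer team's tooling: the function name rename_modules is hypothetical, the sample lines are invented, and the real layouts of pcitable, modinfo and modules.dep are not checked here. Only the module names and the _20505 suffix come from the steps above.

```python
import re

# Stock module name -> renamed 2.05.05 module name, as in the steps above.
RENAMES = {"mptscsih": "mptscsih_20505", "mptbase": "mptbase_20505"}

def rename_modules(text, names=("mptscsih", "mptbase")):
    """Replace whole-word occurrences of each stock name with its _20505 name."""
    for name in names:
        # \b keeps already-renamed entries intact: "_" is a word character,
        # so "mptscsih_20505" does not match r"\bmptscsih\b".
        text = re.sub(r"\b%s\b" % re.escape(name), RENAMES[name], text)
    return text

# Hypothetical modules.dep-style line, for illustration only.
dep_line = "mptscsih: mptbase\n"
print(rename_modules(dep_line), end="")

# For pcitable the steps only ask for the mptscsih rename.
# The IDs and description below are placeholders, not real table entries.
pci_line = '0x0000\t0x0000\t"mptscsih"\t"example entry"\n'
print(rename_modules(pci_line, names=("mptscsih",)), end="")
```

Because of the word-boundary match, running the substitution twice over the same file leaves it unchanged, which matters if the files are edited in more than one pass.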
  
Comment 78 Susan Denham 2004-09-28 18:38:00 EDT
Also posted to IT 43391 by Susan Denham 9/28:

We assume that you are taking the following steps in order to ensure
that your RHEL 2.1 U5 Itanium systems have the correct mptfusion
driver (v2.05.16.02) on the boot image.  Please confirm that you are:

1.  Shipping the unmodified RHEL 2.1 U5 CDs.
2.  Creating and including in your Bull RHEL 2.1 U5 package a new boot
CD that contains a Bull-modified initrd.  You will not be calling or
labelling this "Red Hat Enterprise Linux" but will instead call this a
"Bull Boot CD for Intel Itanium2 systems running Red Hat Enterprise
Linux 2.1 U5" or some such name.
3.  You will create this modified initrd (for the Bull Boot CD) using
the instructions that Jeremy Katz, RHEL installer maintainer, provided
above in the previous event.

We will officially support Bull's customer using this modified initrd.
 I do, for obvious reasons, hope that it is indeed only one customer!

It will no longer be necessary for Bull to ship this modified initrd
once RH delivers RHEL 2.1 U6 because U6 will contain the correct
version of the mptfusion driver (v2.05.16.02) that solves the failure
to boot problem you're currently seeing on your RHEL 2.1 U5 Itanium
systems.

Comment 79 Susan Denham 2004-10-06 11:54:57 EDT
And just to cover the RHEL 4 angle for the mpt fusion driver in this
bugzilla as well:

RH QA has tested RHEL 4 beta 1 on the NovaScale 6160 in Westford and
has confirmed that the mpt fusion patch that is required on RHEL 2.1
and 3 (included in mpt fusion 2.05.16.02) for the Bull system is _not_
needed for RHEL 4.

Background: the patch ensures that the mpt fusion driver will not scan
LUNs that are larger than the storage device can address.  This
problem is not seen in RHEL 4 because the LUN scanning in RHEL 4 is
completely different from 2.4 kernels. In RHEL 4, the SCSI layer
requests a LUN inventory (Report LUNs) from the storage device, and it
only configures those specific LUNs, rather than scanning the whole
LUN number space. The only time that the bug in the mpt
fusion driver will exhibit itself on RHEL 4 is when all of the
following are true with respect to the storage device:

- it does not support Report LUNs
- it does not support the packetized SCSI protocol, and,
- it is on the SCSI whitelist with the SPARSELUN flag.

The Chaparral storage in the Bull system meets the second two, but not
the first.

LSI Logic is planning to incorporate this fix into an upstream
version.  RH wants to see the fix get upstream, and then inherit the
fix in a RHEL 4 update at the earliest opportunity.
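
The three conditions listed in this comment can be restated as a single predicate. The sketch below only restates the comment's logic; the function and parameter names are hypothetical and nothing here is taken from the driver or the SCSI layer source.

```python
def rhel4_bug_can_appear(supports_report_luns,
                         supports_packetized,
                         whitelisted_sparselun):
    """True only when all three conditions from the comment hold."""
    return (not supports_report_luns
            and not supports_packetized
            and whitelisted_sparselun)

# The Chaparral storage meets the second and third conditions but not the
# first, i.e. it does support Report LUNs.
chaparral = rhel4_bug_can_appear(
    supports_report_luns=True,
    supports_packetized=False,
    whitelisted_sparselun=True,
)
print(chaparral)
```

Since Report LUNs support alone is enough to make the predicate false, the Chaparral case evaluates to False, which matches the QA result that the patch is not needed on RHEL 4 for this system.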
Comment 80 Claude BRUNET 2004-10-11 05:50:49 EDT
Please, could you answer the questions in my comment #73?

- is there a way to simplify the installation, especially when there is
a QLogic adapter in the machine? (As the driver for this board is not
automatically loaded, it is mandatory to create a new initrd file and
to reboot again!).
- erroneous content of the /etc/modules.conf file.
- is mptbase useful in /etc/modules.conf?

Regards,
   Claude.
Comment 85 Tom Coughlan 2004-11-12 13:36:42 EST
This BZ spawned BZ 131393 and BZ 131392, to represent the specific fix
needed in RHEL 3 and RHEL 2.1 IPF. Those BZs are now in the modified
state.  I think it is time to close this BZ.
Comment 86 Pierre Fumery 2004-11-15 04:40:18 EST
Tom, thanks a lot for your investigation and your help in addressing this
issue. We made good progress on it and now only need to check that this
problem has been fixed in the x86 binaries as well.
Comment 87 Tom Coughlan 2004-11-15 08:53:13 EST
Pierre, the fix for this problem is in RHEL 2.1 IPF U6 and RHEL 3 U4
(all architectures, of course).

We did not receive any reports of this problem on RHEL 2.1 x86,
presumably because this combination of hardware and driver is not
being used there. As a result, we did not include the fix in RHEL 2.1
x86 U6.  If you need this fix in RHEL 2.1 x86 U7, I have suggested
that JoAnne open a new Bugzilla.
Comment 88 Pierre Fumery 2004-11-15 09:07:34 EST
I already opened IT #54259 and JoAnne created BZ #139042 but I have no
access to this BZ#. Could you put me in Cc: to let me track progress
on this issue for x86 ? Thanks in advance.
Comment 89 Tom Coughlan 2004-11-15 09:26:49 EST
BZ #139042 is a "Featurezilla". I guess you should just track status
in the IT. 
Comment 90 John Flanagan 2004-12-13 15:17:05 EST
An errata has been issued which should help the problem 
described in this bug report. This report is therefore being 
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files, 
please follow the link below. You may reopen this bug report 
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2004-504.html
