Bug 130783 - LTC10800-SMP boot fails with sym53c8xx
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: powerpc Linux
Priority: high
Severity: high
Assigned To: David Howells
QA Contact: Brian Brock
Blocks: 134551
Reported: 2004-08-24 13:40 EDT by Paul Nasrat
Modified: 2007-11-30 17:10 EST (History)
Doc Type: Bug Fix
Last Closed: 2004-11-03 13:08:16 EST
Attachments
Here is the diff for the sym2 driver in CVS head for the last two months. (10.98 KB, patch)
2004-08-24 18:47 EDT, Tom Coughlan
console output from sym53c8xx blowing up (9.28 KB, text/plain)
2004-08-25 13:07 EDT, David Howells
p00001_rh_sym2_update.patch (21.46 KB, text/plain)
2004-08-27 15:13 EDT, IBM Bug Proxy
Console log when debugging with DEBUG_FLAGS set to 0xffff (17.49 KB, text/plain)
2004-09-15 14:17 EDT, David Howells

Description Paul Nasrat 2004-08-24 13:40:29 EDT
Description of problem:

Installing rhel4 onto pseries (p630) works fine.  Booting normally
fails with:

Kernel panic: Attempted to kill init!

Notable failure messages for each scsi device:

sym0:0:0: HOST RESET operation timed-out.
scsi: Device offlined - not ready after error recovery: host 0 channel
0 id 0 lun 0

Version-Release number of selected component (if applicable):

kernel-2.6.8-1.528.2.5.ppc64

How reproducible:

Always

Steps to Reproduce:
1. Install onto pseries
2. Reboot
  
Actual results:

Fails

Expected results:

Boots

Additional info:

Adding maxcpus=1 to the boot arguments allows boot to proceed normally.

Working dmesg fragment:

sym0:8: FAST-40 WIDE SCSI 80.0 MB/s ST (25.0 ns, offset 31)
  Vendor: IBM       Model: IC35L073UCDY10-0  Rev: S28C
  Type:   Direct-Access                      ANSI SCSI revision: 03
sym0:8:0: tagged command queuing enabled, command queue depth 16.
scsi(0:0:8:0): Beginning Domain Validation
sym0:8: asynchronous.
sym0:8: wide asynchronous.
sym0:8: FAST-80 WIDE SCSI 160.0 MB/s DT (12.5 ns, offset 31)
scsi(0:0:8:0): Ending Domain Validation
SCSI device sda: 143374000 512-byte hdwr sectors (73407 MB)
SCSI device sda: drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 >
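
For anyone triaging a captured console log against this report, the failure signatures quoted above can be searched for mechanically. The following is only a minimal sketch: the function name `scan_console_log` and the pattern names are hypothetical, and the patterns simply mirror the messages quoted in this description, so they may need adjusting for other kernels or drivers.

```python
import re

# Failure signatures quoted in this bug report. The dictionary keys are
# hypothetical labels chosen for this sketch, not kernel identifiers.
FAILURE_PATTERNS = {
    "host_reset_timeout": re.compile(r"HOST RESET operation timed-out"),
    "device_offlined": re.compile(r"Device offlined - not ready after error recovery"),
    "init_killed": re.compile(r"Kernel panic: Attempted to kill init!"),
}


def scan_console_log(text):
    """Return a dict mapping each signature label to the number of matching lines."""
    counts = {name: 0 for name in FAILURE_PATTERNS}
    for line in text.splitlines():
        for name, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input assembled from the messages quoted in the description.
sample = """\
sym0:0:0: HOST RESET operation timed-out.
scsi: Device offlined - not ready after error recovery: host 0 channel
0 id 0 lun 0
Kernel panic: Attempted to kill init!
"""

print(scan_console_log(sample))
```

A log from a working boot, such as the dmesg fragment above, should report zero for every signature.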
Comment 6 Tom Coughlan 2004-08-24 18:47:56 EDT
Created attachment 103044 [details]
Here is the diff for the sym2 driver in CVS head for the last two months.
Comment 7 David Woodhouse 2004-08-24 18:57:24 EDT
There's nothing useful in that diff -- it's mostly the pointless
s/0/NULL/ which sparse addicts seem to like -- and one local variable
moving from .data to .bss by removing its explicit initialisation to zero.

It shouldn't have any effect on behaviour.
Comment 9 IBM Bug Proxy 2004-08-25 11:41:08 EDT
Subject: [Bug 10800] New:  - RH130783-SMP boot fails with sym53c8xx
Importance: normal
References: <10800.bugzilla@linux.ibm.com>
In-Reply-To: <10800.bugzilla@linux.ibm.com>
X-Bugzilla-Reason: AssignedTo Reporter
X-Bugzilla-Family: Distro Development
Message-Id: <20040825151855.8108D93B69@smtp.linux.ibm.com>
Date: Wed, 25 Aug 2004 11:18:55 -0400 (EDT)

Do not reply to this note.  It was sent by a machine.  Instead append your
comments to the bug at the URL below.

https://bugzilla.linux.ibm.com/show_bug.cgi?id=10800

           Summary: RH130783-SMP boot fails with sym53c8xx
            Vendor: Red Hat Linux
           Version: RHEL4_Alpha4
          Platform: pSeries
      Architecture: PPC-64
Submitting Project: LTC Change Team
 Customer Priority: --
       Owning Team: LTC
    OSC Acceptance: N/S
   Customer Status: N/S
     Required Date: 0000-00-00 00:00:00
       Target Date: 2000-00-00 00:00:00
     Make External: NO
            Status: ASSIGNED
Technical Severity: high
 Engineer Priority: P2
         Component: Kernel
             Owner: gjlynx@us.ibm.com
       SubmittedBy: spwoods@us.ibm.com
         QAContact: thinh@us.ibm.com


Additional Information:
From RedHat bug 130783:

Comment 10 IBM Bug Proxy 2004-08-25 12:07:49 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-08-25 12:04 -------
Can you provide more of the working dmesg? The failure message is for id 0 lun
0, but your working dmesg fragment does not contain an inquiry string for that
device.
Comment 11 David Howells 2004-08-25 13:07:29 EDT
Created attachment 103084 [details]
console output from sym53c8xx blowing up

I've attached an example of console output when the sym53c8xx blows up.
Comment 12 IBM Bug Proxy 2004-08-25 15:08:47 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-08-25 15:07 -------
ok, I just grabbed the kernel-2.6.8-1.528.2.5, compiled it, and booted it on my
p630. I am not seeing issues, but my device configuration is not exactly the
same as yours.

You could try adding inq_timeout=30 as a scsi_mod module option to see if
increasing the scan timeout makes a difference. This is a long shot, as this
should only be needed for cdrom type devices on the other side of an acard chip.

Do you want me to make a patch of the sym_2 update that James did against the
2.6.8-1.528.2.5 kernel and attach it to the bug? 
Comment 13 Tom Coughlan 2004-08-25 15:25:11 EDT
It sure looks like interrupts are not being delivered.  The driver is
sending Inquiries to probe each SCSI ID and it is never getting a
response from the adapter.

This used to work and now it doesn't, and there are no apparent
changes to the driver, so the problem is likely to be elsewhere. 
James' patch couldn't hurt, but it does not look like a likely fix.
Comment 14 IBM Bug Proxy 2004-08-27 14:43:08 EDT
----- Additional Comments From khoa@us.ibm.com  2004-08-27 14:40 -------
Mike - thanks for looking into this! 
Comment 15 IBM Bug Proxy 2004-08-27 15:08:09 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-08-27 15:08 -------
Tan I have attached a patch that contains the updates James Bottomley made for
the sym2 ported to the RedHat kernel.

I will be out next week on vacation. I will send email to eserverio to get
someone to take over. 
Comment 16 IBM Bug Proxy 2004-08-27 15:13:02 EDT
Created attachment 103177 [details]
p00001_rh_sym2_update.patch
Comment 17 IBM Bug Proxy 2004-08-27 15:13:54 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-08-27 15:09 -------
 
sym2 updates ported to rh kernel

Here is a patch that applies James's sym2 updates to a rh kernel. 
Comment 18 IBM Bug Proxy 2004-08-31 12:38:57 EDT
----- Additional Comments From bjking1@us.ibm.com(prefers email via brking@us.ibm.com)  2004-08-31 12:37 -------
Can you

#define DEBUG_FLAGS 0xffff 

in sym_hipd.h and boot the kernel? This should give us a better idea what is
going on, if interrupts are occurring at all, etc.

What kernel version did this work on? 
Comment 19 David Howells 2004-09-15 09:32:35 EDT
The attached patch seems to be in the upstream kernels, but does not 
actually let me boot my p630 box without maxcpus=1. Should the patch 
be reversed before application? 
 
David 
Comment 20 IBM Bug Proxy 2004-09-15 12:29:21 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-09-15 12:29 EDT -------
Yes, I would not apply the patch anymore. 

Did you run with any of the flags suggested in previous comments? The DEBUG
flags would give us some more info during the failure.
Comment 21 David Howells 2004-09-15 14:17:49 EDT
Created attachment 103876 [details]
Console log when debugging with DEBUG_FLAGS set to 0xffff

This is the entire kernel console log. After this it just hung as far as I can
tell. I still see processors 1-3 being stuck and it only bringing up one CPU -
I don't know whether this is relevant.

David
Comment 22 IBM Bug Proxy 2004-09-16 14:34:24 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-09-16 14:34 EDT -------
I checked into the processor stuck message. It should not happen, but I did not
get a response. I cannot say if this is affecting the problem. I have a system
here that is generating the same processor stuck message and the sym2 driver
appears to be working ok. I am still investigating how to get all processors to
come online. 
Comment 23 IBM Bug Proxy 2004-09-16 14:59:41 EDT
----- Additional Comments From mike.anderson@us.ibm.com(prefers email via andmike@us.ibm.com)  2004-09-16 14:56 EDT -------
ok I guess I should have waited before creating the response. Anton just sent a
patch to linuxppc64-dev@ozlabs.org and arjanv@redhat.com about ppc64 dont use
state == SYSTEM_BOOTING. Once this patch is included the processor stuck
messages should go away, but the effect on the sym driver is still unknown.
Comment 24 David Howells 2004-09-17 09:13:11 EDT
Yay! I built Arjan's kernel, and now I see: 
 
| Processor 1 found. 
| Processor 2 found. 
| Processor 3 found. 
| Brought up 4 CPUs 
| ... 
| Loading sym53c8xx.ko module 
| sym0:8: FAST-40 WIDE SCSI 80.0 MB/s ST (25.0 ns, offset 31) 
|   Vendor: IBM       Model: IC35L073UCDY10-0  Rev: S28C 
|   Type:   Direct-Access                      ANSI SCSI revision: 03 
| ... 
| SCSI device sda: drive cache: write through 
|  sda: sda1 sda2 sda3 sda4 < sda5 > 
| ... 
| Kernel 2.6.8-1.581.dhowells on an ppc64 
| 
| pseries.cambridge.redhat.com login: 
 
I can see four CPUs in /proc/cpuinfo, and the computer booted off
discs through the sym2 driver (it used to hang before).
 
David 
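
The two checks described in this comment (the "Brought up N CPUs" console line and the CPU count in /proc/cpuinfo) could be scripted roughly as follows. This is a sketch under assumptions: the helper names `count_processors` and `cpus_brought_up` are hypothetical, and the cpuinfo sample assumes the usual "processor : N" layout rather than output captured from this machine.

```python
import re


def count_processors(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text."""
    return len(re.findall(r"^processor\s*:\s*\d+", cpuinfo_text, flags=re.MULTILINE))


def cpus_brought_up(console_text):
    """Return N from a 'Brought up N CPUs' console line, or None if absent."""
    match = re.search(r"Brought up (\d+) CPUs?", console_text)
    return int(match.group(1)) if match else None


# Console lines taken from the boot output quoted in this comment.
console_sample = """\
Processor 1 found.
Processor 2 found.
Processor 3 found.
Brought up 4 CPUs
"""

# Assumed cpuinfo layout, trimmed to the lines this sketch looks at.
cpuinfo_sample = """\
processor\t: 0
processor\t: 1
processor\t: 2
processor\t: 3
"""

print(cpus_brought_up(console_sample), count_processors(cpuinfo_sample))
```

On a live system the second helper could be fed the real file, for example `count_processors(open("/proc/cpuinfo").read())`; a mismatch between the two numbers would suggest that not every CPU came online.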
Comment 26 IBM Bug Proxy 2004-09-19 11:40:32 EDT
----- Additional Comments From khoa@us.ibm.com  2004-09-19 11:38 EDT -------
Following our new bugzilla process here, I'd like to put this bug into
SUBMITTED state because:
- A patch has been submitted to Red Hat
- Red Hat has built a private kernel RPM containing the patch
- Red Hat (David Howells) has tested the patch OK

So the next step is to wait for confirmation from Red Hat on when this fix
will make an official RHEL4 build - maybe beta2?
Thanks. 
Comment 27 IBM Bug Proxy 2004-10-11 01:35:14 EDT
----- Additional Comments From khoa@us.ibm.com  2004-10-11 01:35 EDT -------
David (Red Hat) - please confirm if this will be fixed in beta2.  Thanks. 
Comment 28 IBM Bug Proxy 2004-10-18 02:25:43 EDT
----- Additional Comments From khoa@us.ibm.com  2004-10-18 02:24 EDT -------
Red Hat has marked the bug as MODIFIED on their side, which indicates that
the fix has been accepted into RHEL4 CVS (according to Bob Johnson at Red Hat).
So I'd like to go ahead and mark this bug as ACCEPTED.  We still need to verify
when beta2 comes out before closing the bug report.  Thanks. 
Comment 29 James Laska 2004-11-03 13:08:16 EST
According to dhowells, this fix is included in RHEL4-Beta2-RC-re1027.0.  Closing
this issue as FIXED CURRENTRELEASE.  

khoa: Please re-open this issue if problems surface
Comment 30 IBM Bug Proxy 2004-11-12 18:18:04 EST
----- Additional Comments From AVenkat@us.ibm.com  2004-11-12 18:16 EDT -------
On a p650 with RHEL4.0-beta2 there is no problem in rebooting (though some minor
errors were displayed on the console) and sym53c8xx gets loaded. (see bug#10763).

On a p630 also rebooting after RHEL4.0-beta2 (installed on SCSI drive) seems to
be no problem. (see bug#10540) 
