Bug 253538

Summary: Can't boot 2.6.9-56-smp with VMware
Product: Red Hat Enterprise Linux 4 Reporter: Marcus Alves Grando <marcus>
Component: kernel Assignee: Chip Coldwell <coldwell>
Status: CLOSED ERRATA QA Contact: Martin Jenner <mjenner>
Severity: high Docs Contact:
Priority: high    
Version: 4.5 CC: clalance, coldwell, coughlan, cww, divyanshu.verma, emcnabb, eric.moore, eva, jbaron, jon.stanley, larry.stephens, sathya.prakash
Target Milestone: --- Keywords: Regression
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: RHBA-2007-0791 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-11-15 16:31:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 279571    
Attachments:
Description Flags
Serial console output for failing boot none
upstream patch (rejected by Christoph Hellwig). none
screenshot of vmware server console showing stack trace. none
Problem persist in kernel 2.6.9-59 (jbarton) none

Description Marcus Alves Grando 2007-08-20 14:24:33 UTC
Description of problem:

I installed the jbarton 2.6.9-56-smp kernel to test, and I can't boot it under VMware ESX 3.0.2.
2.6.9-55.0.2-smp works fine.

Version-Release number of selected component (if applicable):

4.5

How reproducible:

Install 2.6.9-56-smp in a VMware guest.

Actual results:

...
Loading ext3.ko module
mkrootdev: label / not found
Mounting root filesystem
mount: error 2 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!
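
The numbers in the mount and switchroot messages above look like errno values (2 would be ENOENT, 22 would be EINVAL). A minimal C sketch for decoding them, assuming the initrd tools print raw errno codes; the helper name is hypothetical:

```c
#include <errno.h>
#include <string.h>

/* Return the C library's description of an errno-style code,
 * e.g. the "2" in "mount: error 2 mounting ext3".
 * Assumption: the code printed by the initrd tools is a plain errno. */
const char *describe_mount_error(int code)
{
    return strerror(code);
}
```

Calling describe_mount_error(2) and describe_mount_error(22) gives the library's text for the two codes seen in the log; the exact wording depends on the C library.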

Expected results:

Boot normally

Comment 1 Marcus Alves Grando 2007-08-20 14:27:43 UTC
# lspci 
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:0f.0 VGA compatible controller: VMware Inc [VMware SVGA II] PCI Display Adapter
00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:12.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)

# tune2fs -l /dev/sda1 
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   /boot
Last mounted on:          <not available>
Filesystem UUID:          d4ad6eb2-4174-44ac-af3e-053c2c75155a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              16064
Block count:              64228
Reserved block count:     3211
Free blocks:              47701
Free inodes:              16022
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      250
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2008
Inode blocks per group:   251
Filesystem created:       Fri Mar  2 11:36:20 2007
Last mount time:          Mon Aug 20 11:16:54 2007
Last write time:          Mon Aug 20 11:16:54 2007
Mount count:              35
Maximum mount count:      -1
Last checked:             Fri Mar  2 11:36:20 2007
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      2820ce24-b251-463e-b163-14ce7e03177d
Journal backup:           inode blocks

# tune2fs -l /dev/sda3 
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   /
Last mounted on:          <not available>
Filesystem UUID:          b9c41902-fbef-4978-a58e-a94a3cbc363f
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1048576
Block count:              2094474
Reserved block count:     104723
Free blocks:              1414838
Free inodes:              957609
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      511
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Fri Mar  2 11:36:17 2007
Last mount time:          Mon Aug 20 11:16:53 2007
Last write time:          Mon Aug 20 11:16:53 2007
Mount count:              35
Maximum mount count:      -1
Last checked:             Fri Mar  2 11:36:17 2007
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      101a0a14-cce3-48b2-aadb-060377c91f4b
Journal backup:           inode blocks

Comment 2 Marcus Alves Grando 2007-08-20 15:02:40 UTC
(In reply to comment #0)
> Description of problem:
> 
> I install jbarton 2.6.9-56-smp to test and i can't boot in vmware 3.0.2 ESX.
> 2.6.9-55.0.2-smp works fine.

s/jbarton/Jason Baron (jbaron)/


Comment 4 RHEL Program Management 2007-09-05 21:43:59 UTC
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.

Comment 5 Tom Coughlan 2007-09-06 18:22:34 UTC
If possible, please provide a crash dump, or at least a stack trace. 

Comment 6 Chris Lalancette 2007-09-06 18:29:46 UTC
I'll attach the serial console output (which is missing the userspace LVM
errors, but shows the kernel messages).  The relevant part here seems to be this:

Fusion MPT base driver 3.02.99.00rh
Copyright (c) 1999-2007 LSI Logic Corporation
Fusion MPT SPI Host driver 3.02.99.00rh
ACPI: PCI Interrupt 0000:00:10.0[A] -> GSI 17 (level, low) -> IRQ 169
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator}
scsi0 : ioc0: LSI53C1030, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=169
Fusion MPT SAS Host driver 3.02.99.00rh
device-mapper: 4.5.5-ioctl (2006-12-01) initialised: dm-devel
cdrom: open failed.
cdrom: open failed.
Kernel panic - not syncing: Attempted to kill init!

Note the section for scsi0; in the working kernel, this looks like:

scsi0 : ioc0: LSI53C1030, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=169
  Vendor: VMware,   Model: VMware Virtual S  Rev: 1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 16777216 512-byte hdwr sectors (8590 MB)
sda: cache data unavailable
sda: assuming drive cache: write through
SCSI device sda: 16777216 512-byte hdwr sectors (8590 MB)
sda: cache data unavailable
sda: assuming drive cache: write through
 sda: sda1 sda2
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0

So it looks like it is not finishing finding the drive, scanning the partitions,
and attaching it.
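
The difference between the two logs can be checked mechanically. A minimal C sketch, assuming the console output is available as one string; the function name is hypothetical, and the marker strings are the ones that appear in the logs above:

```c
#include <stdbool.h>
#include <string.h>

/* Return true if the console log shows a SCSI disk being attached
 * after the host adapter line ("scsi0 :") was printed. */
bool disk_attached_after_host(const char *log)
{
    const char *host = strstr(log, "scsi0 :");

    if (host == NULL)
        return false;   /* the adapter itself was never reported */
    return strstr(host, "Attached scsi disk") != NULL;
}
```

Fed the failing log it returns false, because the output goes from the scsi0 line straight to the panic; fed the working log it returns true.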

Chris Lalancette

Comment 7 Chris Lalancette 2007-09-06 18:30:30 UTC
Created attachment 189101
Serial console output for failing boot

Comment 8 Chip Coldwell 2007-09-06 19:13:23 UTC
(In reply to comment #6)
> 
> So it looks like it is not finishing finding the drive, scanning the partitions,
> and attaching it.

Do you get the same "Attempted to kill init!" message if you try to boot the
system with no attached storage (using the same initrd)?

Chip


Comment 9 Chip Coldwell 2007-09-10 18:22:01 UTC
Created attachment 191831
upstream patch (rejected by Christoph Hellwig).

This patch works around a bug in the VMware-emulated 1030 mptspi adapter.

Comment 15 Tom Coughlan 2007-09-13 13:35:35 UTC
*** Bug 282411 has been marked as a duplicate of this bug. ***

Comment 16 Jon Stanley 2007-09-16 01:58:12 UTC
Created attachment 196621
screenshot of vmware server console showing stack trace.

Comment 17 Jon Stanley 2007-09-16 02:00:20 UTC
Not sure if this should be considered a separate bug or not - however an install
of 4.6 beta fails due to not finding disks.  After the install fails, the system
panics, stack trace provided in attached screenshot.

If this needs to be a separate bug, let me know - however it seems to be a
similar profile, and since no stack trace has yet been produced on this one,
figured this would be something new to add.

Comment 18 Chip Coldwell 2007-09-18 15:41:38 UTC
(In reply to comment #16)
> Created an attachment (id=196621)
> screenshot of vmware server console showing stack trace.
> 

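Looks like the module was loaded at address 0xe0925000 and the EIP at the panic
was 0xe0929b1f, i.e. offset 0x4b1f into the module, which falls in this chunk of
assembly (the left-hand addresses are offsets from the module load address; the
bracketed labels are offsets from the start of mptscsih_synchronize_cache)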

    4b0e:       8b 44 96 60             mov    0x60(%esi,%edx,4),%eax
    4b12:       8b 3c a8                mov    (%eax,%ebp,4),%edi
    4b15:       0f 84 92 00 00 00       je     4bad <mptscsih_synchronize_cache+0x1e5>
    4b1b:       85 ff                   test   %edi,%edi
    4b1d:       74 0c                   je     4b2b <mptscsih_synchronize_cache+0x163>
    4b1f:       80 7f 0c 00             cmpb   $0x0,0xc(%edi)
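
The address arithmetic behind this can be sketched in C. This is only an illustration of the subtraction, with hypothetical helper names; the input values are the ones quoted above:

```c
#include <stdint.h>

/* Offset of a faulting instruction pointer from the module's load address. */
uint32_t module_offset(uint32_t eip, uint32_t module_base)
{
    return eip - module_base;
}

/* Start of a function in the listing, recovered from a branch target and
 * the "<symbol+0xNNN>" offset printed beside it. */
uint32_t function_start(uint32_t target, uint32_t symbol_offset)
{
    return target - symbol_offset;
}
```

With the values above, module_offset(0xe0929b1f, 0xe0925000) is 0x4b1f, which is the cmpb instruction, and both branch annotations put the start of mptscsih_synchronize_cache at 0x49c8, so the faulting instruction sits at mptscsih_synchronize_cache+0x157.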

I've figured out that the corresponding bit of source code is line 4744 of
mptscsih.c, 

	while (bus < ioc->NumberOfBuses) {
		iocmd.bus = bus;
		iocmd.id = id;
		pMptTarget = ioc->Target_List[bus];
		pTarget = pMptTarget->Target[id];

		if (doConfig) {

			/* Set the negotiation flags */
			if (pTarget && !pTarget->raidVolume) { <===== panic here
				flags = pTarget->negoFlags;
			} else {


It looks like dereferencing pTarget is the problem.  Judging from the registers
in the panic message, that pointer was holding the value 0x00000010 at the time,
so it was both non-NULL and also an invalid address.
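
To spell out why the NULL test does not help here, a minimal C sketch with hypothetical helper names. It assumes, from the listing, that %edi holds pTarget and that the 0xc displacement in the cmpb is the offset of raidVolume:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the "pTarget && ..." test: only a zero pointer value is rejected. */
bool passes_null_check(uintptr_t ptr_value)
{
    return ptr_value != 0;
}

/* Address touched when a field at field_offset is read through the pointer. */
uintptr_t field_address(uintptr_t ptr_value, uintptr_t field_offset)
{
    return ptr_value + field_offset;
}
```

A pointer value of 0x10 passes the check, and under the assumptions above the read of raidVolume then goes to address 0x1c, which is not a valid kernel address.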

Still digging.

Chip


Comment 20 Jason Baron 2007-09-26 16:03:19 UTC
committed in stream U6 build 60. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/


Comment 21 Marcus Alves Grando 2007-09-26 20:03:43 UTC
(In reply to comment #20)
> committed in stream U6 build 60. A test kernel with this patch is available from
> http://people.redhat.com/~jbaron/rhel4/
> 

I took the latest 2.6.9-59 kernel and the problem is still there. I think you applied the wrong fix.

Based on your src.rpm your fix is:

--
@@ -2771,6 +2913,20 @@ GetPortFacts(MPT_ADAPTER *ioc, int portnum, int sleepFlag)
       	pfacts->IOCLogInfo = le32_to_cpu(pfacts->IOCLogInfo);
       	pfacts->MaxDevices = le16_to_cpu(pfacts->MaxDevices);
       	pfacts->PortSCSIID = le16_to_cpu(pfacts->PortSCSIID);
+       
+	max_id = (ioc->bus_type == SAS) ? pfacts->PortSCSIID :
+	    pfacts->MaxDevices;
+       ioc->DevicesPerBus = (max_id > 255) ? 256 : max_id;
+	ioc->NumberOfBuses = (ioc->DevicesPerBus < 256) ? 1 : max_id/256;
+	if ( ioc->NumberOfBuses > MPT_MAX_BUSES ) {
+      	        dinitprintk((MYIOC_s_WARN_FMT "NumberOfBuses=%d > MPT_MAX_BUSES=%d\n",
+                  ioc->name, ioc->NumberOfBuses, MPT_MAX_BUSES));
+               ioc->NumberOfBuses = MPT_MAX_BUSES;
+       }
+
+       dinitprintk((MYIOC_s_WARN_FMT "Buses=%d MaxDevices=%d DevicesPerBus=%d\n",
+                  ioc->name, ioc->NumberOfBuses, max_id, ioc->DevicesPerBus));
+               
       	pfacts->ProtocolFlags = le16_to_cpu(pfacts->ProtocolFlags);
       	pfacts->MaxPostedCmdBuffers = le16_to_cpu(pfacts->MaxPostedCmdBuffers);
       	pfacts->MaxPersistentIDs = le16_to_cpu(pfacts->MaxPersistentIDs);
--

And it still does not work. The proposed patch in the attachment works fine.
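
For reference, the arithmetic in the hunk above can be tried outside the driver. A standalone C sketch, not driver code: the struct and function names are hypothetical, and max_buses stands in for MPT_MAX_BUSES, whose real value is not given in this report:

```c
/* Hypothetical stand-in for the two adapter fields the hunk sets. */
struct bus_layout {
    int devices_per_bus;
    int number_of_buses;
};

/* Same arithmetic as the quoted hunk. max_id is PortSCSIID for SAS and
 * MaxDevices otherwise; max_buses is an assumed parameter replacing
 * the MPT_MAX_BUSES constant. */
struct bus_layout compute_bus_layout(int max_id, int max_buses)
{
    struct bus_layout layout;

    layout.devices_per_bus = (max_id > 255) ? 256 : max_id;
    layout.number_of_buses = (layout.devices_per_bus < 256) ? 1 : max_id / 256;
    if (layout.number_of_buses > max_buses)
        layout.number_of_buses = max_buses;
    return layout;
}
```

For example, a max_id of 16 gives one bus of 16 devices, and 1024 gives four buses of 256. A max_id of 0 would give one bus with zero devices; whether the VMware adapter reports such a value is not stated here.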

I'll attach boot screen.

Regards

Comment 22 Marcus Alves Grando 2007-09-26 20:12:04 UTC
I'll attach boot screen as soon as bugzilla works. Login page does not work.

Regards

Comment 23 Marcus Alves Grando 2007-09-26 20:32:27 UTC
Created attachment 207501
Problem persist in kernel 2.6.9-59 (jbarton)

In the jbaron 2.6.9-59 kernel the bug still persists. Maybe it is the wrong patch.

Comment 24 Chip Coldwell 2007-09-27 13:30:54 UTC
(In reply to comment #23)
> Created an attachment (id=207501)
> Problem persist in kernel 2.6.9-59 (jbarton)
> 
> In jbarton kernel 2.6.9-59 the related bug still persist. Maybe wrong patch.

Comment #20 says this patch is in build 60, so you should not expect to find it
in 2.6.9-59.  Please test 2.6.9-60 when it becomes available.

Thank you,

Chip

Comment 25 Marcus Alves Grando 2007-09-27 15:13:36 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > Created an attachment (id=207501)
> > Problem persist in kernel 2.6.9-59 (jbarton)
> > 
> > In jbarton kernel 2.6.9-59 the related bug still persist. Maybe wrong patch.
> 
> Comment #20 says this patch is in build 60, so you should not expect to find it
> in 2.6.9-59.  Please test 2.6.9-60 when it becomes available.
> 
> Thank-you,
> 
> Chip

I know. I'm not crazy yet. But ~jbaron does not have 2.6.9-60, and I see from
comparing the date/md5 with the old kernel src that 2.6.9-59 was modified. So I
thought you had updated 2.6.9-59 without bumping the kernel version. Am I wrong?

Regards

Comment 26 Chip Coldwell 2007-09-27 15:22:38 UTC
(In reply to comment #25)
> 
> I know. I'm not crazy yet. But in ~jbarton does not have 2.6.9-60 and i see in
> date/md5_with_old_kernel_src that 2.5.9-59 are modified. Then i think that
> update 2.5.9-59 and not bump kernel version. I'm wrong?

Yes.  Please test 2.6.9-60 when it becomes available.

Chip



Comment 29 Chip Coldwell 2007-10-01 19:51:29 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > 
> > I know. I'm not crazy yet. But in ~jbarton does not have 2.6.9-60 and i see in
> > date/md5_with_old_kernel_src that 2.5.9-59 are modified. Then i think that
> > update 2.5.9-59 and not bump kernel version. I'm wrong?
> 
> Yes.  Please test 2.6.9-60 when it becomes available.

2.6.9-60 is now available at

http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/

Chip


Comment 30 Marcus Alves Grando 2007-10-02 13:35:40 UTC
(In reply to comment #29)
> (In reply to comment #26)
> > (In reply to comment #25)
> > > 
> > > I know. I'm not crazy yet. But in ~jbarton does not have 2.6.9-60 and i see in
> > > date/md5_with_old_kernel_src that 2.5.9-59 are modified. Then i think that
> > > update 2.5.9-59 and not bump kernel version. I'm wrong?
> > 
> > Yes.  Please test 2.6.9-60 when it becomes available.
> 
> 2.6.9-60 is now available at
> 
> http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/
> 
> Chip
> 

Works fine. Thanks all.


Comment 35 errata-xmlrpc 2007-11-15 16:31:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html