Bug 253538 - Can't boot 2.6.9-56-smp with Vmware
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: i386 Linux
Priority: high
Severity: high
Assigned To: Chip Coldwell
QA Contact: Martin Jenner
Keywords: Regression
Duplicates: 282411
Depends On:
Blocks: 279571
Reported: 2007-08-20 10:24 EDT by Marcus Alves Grando
Modified: 2007-11-16 20:14 EST

Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Last Closed: 2007-11-15 11:31:42 EST

Attachments
Serial console output for failing boot (8.10 KB, text/plain)
2007-09-06 14:30 EDT, Chris Lalancette
upstream patch (rejected by Christoph Hellwig). (1.52 KB, patch)
2007-09-10 14:22 EDT, Chip Coldwell
screenshot of vmware server console showing stack trace. (65.18 KB, image/png)
2007-09-15 21:58 EDT, Jon Stanley
Problem persist in kernel 2.6.9-59 (jbarton) (25.12 KB, image/png)
2007-09-26 16:32 EDT, Marcus Alves Grando

Description Marcus Alves Grando 2007-08-20 10:24:33 EDT
Description of problem:

I installed the jbarton 2.6.9-56-smp kernel to test it, and it can't boot under VMware ESX 3.0.2.
2.6.9-55.0.2-smp works fine.

Version-Release number of selected component (if applicable):

4.5

How reproducible:

Install 2.6.9-56-smp in a VMware guest.

Actual results:

...
Loading ext3.ko module
mkrootdev: label / not found
Mounting root filesystem
mount: error 2 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!

Expected results:

Boot normally
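
The numbers in the mount and switchroot messages above look like raw errno values. Assuming that is the case (the thread does not confirm it), a minimal C sketch for decoding them follows. describe_errno is a hypothetical helper name, not part of any initrd tool.

```c
#include <errno.h>
#include <string.h>

/* Return the C library's text for a numeric errno value, such as the
 * "2" in "mount: error 2 mounting ext3". Hypothetical helper. */
const char *describe_errno(int err)
{
    return strerror(err);
}
```

On Linux, errno 2 is ENOENT and 22 is EINVAL. Under the assumption above, that would be consistent with the root device never appearing.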
Comment 1 Marcus Alves Grando 2007-08-20 10:27:43 EDT
# lspci 
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 01)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 01)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 08)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:0f.0 VGA compatible controller: VMware Inc [VMware SVGA II] PCI Display Adapter
00:10.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 01)
00:11.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)
00:12.0 Ethernet controller: Advanced Micro Devices [AMD] 79c970 [PCnet32 LANCE] (rev 10)

# tune2fs -l /dev/sda1 
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   /boot
Last mounted on:          <not available>
Filesystem UUID:          d4ad6eb2-4174-44ac-af3e-053c2c75155a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              16064
Block count:              64228
Reserved block count:     3211
Free blocks:              47701
Free inodes:              16022
First block:              1
Block size:               1024
Fragment size:            1024
Reserved GDT blocks:      250
Blocks per group:         8192
Fragments per group:      8192
Inodes per group:         2008
Inode blocks per group:   251
Filesystem created:       Fri Mar  2 11:36:20 2007
Last mount time:          Mon Aug 20 11:16:54 2007
Last write time:          Mon Aug 20 11:16:54 2007
Mount count:              35
Maximum mount count:      -1
Last checked:             Fri Mar  2 11:36:20 2007
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      2820ce24-b251-463e-b163-14ce7e03177d
Journal backup:           inode blocks

# tune2fs -l /dev/sda3 
tune2fs 1.35 (28-Feb-2004)
Filesystem volume name:   /
Last mounted on:          <not available>
Filesystem UUID:          b9c41902-fbef-4978-a58e-a94a3cbc363f
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1048576
Block count:              2094474
Reserved block count:     104723
Free blocks:              1414838
Free inodes:              957609
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      511
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Fri Mar  2 11:36:17 2007
Last mount time:          Mon Aug 20 11:16:53 2007
Last write time:          Mon Aug 20 11:16:53 2007
Mount count:              35
Maximum mount count:      -1
Last checked:             Fri Mar  2 11:36:17 2007
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      101a0a14-cce3-48b2-aadb-060377c91f4b
Journal backup:           inode blocks
Comment 2 Marcus Alves Grando 2007-08-20 11:02:40 EDT
(In reply to comment #0)
> Description of problem:
> 
> I installed the jbarton 2.6.9-56-smp kernel to test it, and it can't boot under VMware ESX 3.0.2.
> 2.6.9-55.0.2-smp works fine.

s/jbarton/Jason Baron (jbaron)/
Comment 4 RHEL Product and Program Management 2007-09-05 17:43:59 EDT
This bugzilla has Keywords: Regression.  

Since no regressions are allowed between releases, 
it is also being proposed as a blocker for this release.  

Please resolve ASAP.
Comment 5 Tom Coughlan 2007-09-06 14:22:34 EDT
If possible, please provide a crash dump, or at least a stack trace. 
Comment 6 Chris Lalancette 2007-09-06 14:29:46 EDT
I'll attach the serial console output (which is missing the userspace LVM
errors, but shows the kernel messages).  The relevant part here seems to be this:

Fusion MPT base driver 3.02.99.00rh
Copyright (c) 1999-2007 LSI Logic Corporation
Fusion MPT SPI Host driver 3.02.99.00rh
ACPI: PCI Interrupt 0000:00:10.0[A] -> GSI 17 (level, low) -> IRQ 169
mptbase: Initiating ioc0 bringup
ioc0: 53C1030: Capabilities={Initiator}
scsi0 : ioc0: LSI53C1030, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=169
Fusion MPT SAS Host driver 3.02.99.00rh
device-mapper: 4.5.5-ioctl (2006-12-01) initialised: dm-devel@redhat.com
cdrom: open failed.
cdrom: open failed.
Kernel panic - not syncing: Attempted to kill init!

Note the section for scsi0; in the working kernel, this looks like:

scsi0 : ioc0: LSI53C1030, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=169
  Vendor: VMware,   Model: VMware Virtual S  Rev: 1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 16777216 512-byte hdwr sectors (8590 MB)
sda: cache data unavailable
sda: assuming drive cache: write through
SCSI device sda: 16777216 512-byte hdwr sectors (8590 MB)
sda: cache data unavailable
sda: assuming drive cache: write through
 sda: sda1 sda2
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0

So it looks like the failing kernel never finishes finding the drive, scanning
the partitions, and attaching it.

Chris Lalancette
Comment 7 Chris Lalancette 2007-09-06 14:30:30 EDT
Created attachment 189101 [details]
Serial console output for failing boot
Comment 8 Chip Coldwell 2007-09-06 15:13:23 EDT
(In reply to comment #6)
> 
> So it looks like the failing kernel never finishes finding the drive, scanning
> the partitions, and attaching it.

Do you get the same "Attempted to kill init" message if you try to boot the
system with no attached storage (using the same initrd)?

Chip
Comment 9 Chip Coldwell 2007-09-10 14:22:01 EDT
Created attachment 191831 [details]
upstream patch (rejected by Christoph Hellwig).

This patch works around a bug in VMware's emulated 53C1030 (mptspi) adapter.
Comment 15 Tom Coughlan 2007-09-13 09:35:35 EDT
*** Bug 282411 has been marked as a duplicate of this bug. ***
Comment 16 Jon Stanley 2007-09-15 21:58:12 EDT
Created attachment 196621 [details]
screenshot of vmware server console showing stack trace.
Comment 17 Jon Stanley 2007-09-15 22:00:20 EDT
I'm not sure whether this should be considered a separate bug, but an install
of the 4.6 beta fails because it does not find any disks.  After the install
fails, the system panics; the stack trace is in the attached screenshot.

If this needs to be a separate bug, let me know.  It seems to have a similar
profile, and since no stack trace has been produced on this bug yet, I figured
it would be something new to add.
Comment 18 Chip Coldwell 2007-09-18 11:41:38 EDT
(In reply to comment #16)
> Created an attachment (id=196621)
> screenshot of vmware server console showing stack trace.
> 

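Looks like the module was loaded at address 0xe0925000 and the EIP on the panic
was 0xe0929b1f, i.e. offset 0x4b1f into the module, which is in this chunk of
assembly (the left-hand addresses are offsets into the module; the branch
targets are annotated relative to the start of mptscsih_synchronize_cache)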

    4b0e:       8b 44 96 60             mov    0x60(%esi,%edx,4),%eax
    4b12:       8b 3c a8                mov    (%eax,%ebp,4),%edi
    4b15:       0f 84 92 00 00 00       je     4bad <mptscsih_synchronize_cache+0x1e5>
    4b1b:       85 ff                   test   %edi,%edi
    4b1d:       74 0c                   je     4b2b <mptscsih_synchronize_cache+0x163>
    4b1f:       80 7f 0c 00             cmpb   $0x0,0xc(%edi)
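
The address arithmetic behind this disassembly can be checked with a short C sketch. It uses only the numbers quoted in this comment. The helper names module_offset and function_start are hypothetical, and the sketch assumes the module's code starts at the reported load address.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of an instruction inside a loaded module, given the
 * faulting EIP and the module's load address. */
uint32_t module_offset(uint32_t eip, uint32_t module_base)
{
    assert(eip >= module_base);   /* EIP must lie inside the module */
    return eip - module_base;
}

/* Start of a function, recovered from a branch target that the
 * disassembler annotated as <function+displacement>. */
uint32_t function_start(uint32_t target, uint32_t displacement)
{
    assert(target >= displacement);
    return target - displacement;
}
```

With the values above, module_offset(0xe0929b1f, 0xe0925000) is 0x4b1f, the cmpb instruction. Both annotated branch targets (4bad = +0x1e5 and 4b2b = +0x163) put the start of mptscsih_synchronize_cache at 0x49c8, so the faulting instruction sits at mptscsih_synchronize_cache+0x157.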

I've figured out that the corresponding bit of source code is line 4744 of
mptscsih.c:

	while (bus < ioc->NumberOfBuses) {
		iocmd.bus = bus;
		iocmd.id = id;
		pMptTarget = ioc->Target_List[bus];
		pTarget = pMptTarget->Target[id];

		if (doConfig) {

			/* Set the negotiation flags */
			if (pTarget && !pTarget->raidVolume) { <===== panic here
				flags = pTarget->negoFlags;
			} else {


It looks like dereferencing pTarget is the problem.  Judging from the registers
in the panic message, that pointer held the value 0x00000010 at the time, so it
passed the NULL check but still pointed at an invalid address.

Still digging.

Chip
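
Why the NULL check does not help here can be illustrated in isolation. The sketch below is not the fix that went into the kernel. It only shows that a value such as 0x00000010 passes a plain non-NULL test while lying far below the range where kernel data lives. EXAMPLE_PAGE_OFFSET, passes_null_check and plausible_kernel_pointer are hypothetical names, and 0xC0000000 assumes the usual 3G/1G address split on i386.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed start of kernel virtual addresses on i386 (3G/1G split). */
#define EXAMPLE_PAGE_OFFSET 0xC0000000u

/* The test the driver applies: any non-zero value is accepted. */
bool passes_null_check(uint32_t addr)
{
    return addr != 0;
}

/* A stricter sanity test, for illustration only: kernel data
 * pointers are expected at or above EXAMPLE_PAGE_OFFSET. */
bool plausible_kernel_pointer(uint32_t addr)
{
    return addr >= EXAMPLE_PAGE_OFFSET;
}
```

Under these assumptions, 0x00000010 satisfies passes_null_check but not plausible_kernel_pointer, which matches the observation that the pointer was non-NULL and still invalid.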
Comment 20 Jason Baron 2007-09-26 12:03:19 EDT
committed in stream U6 build 60. A test kernel with this patch is available from
http://people.redhat.com/~jbaron/rhel4/
Comment 21 Marcus Alves Grando 2007-09-26 16:03:43 EDT
(In reply to comment #20)
> committed in stream U6 build 60. A test kernel with this patch is available from
> http://people.redhat.com/~jbaron/rhel4/
> 

I took the latest 2.6.9-59 kernel and the fix is not there yet. I think you applied the wrong fix.

Based on your src.rpm your fix is:

--
@@ -2771,6 +2913,20 @@ GetPortFacts(MPT_ADAPTER *ioc, int portnum, int sleepFlag)
       	pfacts->IOCLogInfo = le32_to_cpu(pfacts->IOCLogInfo);
       	pfacts->MaxDevices = le16_to_cpu(pfacts->MaxDevices);
       	pfacts->PortSCSIID = le16_to_cpu(pfacts->PortSCSIID);
+       
+	max_id = (ioc->bus_type == SAS) ? pfacts->PortSCSIID :
+	    pfacts->MaxDevices;
+       ioc->DevicesPerBus = (max_id > 255) ? 256 : max_id;
+	ioc->NumberOfBuses = (ioc->DevicesPerBus < 256) ? 1 : max_id/256;
+	if ( ioc->NumberOfBuses > MPT_MAX_BUSES ) {
+               dinitprintk((MYIOC_s_WARN_FMT "NumberOfBuses=%d > MPT_MAX_BUSES=%d\n",
+                  ioc->name, ioc->NumberOfBuses, MPT_MAX_BUSES));
+               ioc->NumberOfBuses = MPT_MAX_BUSES;
+       }
+
+       dinitprintk((MYIOC_s_WARN_FMT "Buses=%d MaxDevices=%d DevicesPerBus=%d\n",
+                  ioc->name, ioc->NumberOfBuses, max_id, ioc->DevicesPerBus));
+               
       	pfacts->ProtocolFlags = le16_to_cpu(pfacts->ProtocolFlags);
       	pfacts->MaxPostedCmdBuffers = le16_to_cpu(pfacts->MaxPostedCmdBuffers);
       	pfacts->MaxPersistentIDs = le16_to_cpu(pfacts->MaxPersistentIDs);
--

And it still does not work. The proposed patch in the attachment works fine.

I'll attach boot screen.

Regards
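
The bus and device arithmetic in the hunk quoted above can be exercised on its own. This is a minimal sketch that mirrors the expressions in the diff, not the driver code itself. EXAMPLE_MAX_BUSES is a hypothetical stand-in, since the real value of MPT_MAX_BUSES does not appear in this thread, and struct bus_layout and compute_bus_layout are names made up for the illustration.

```c
/* Hypothetical limit; the real MPT_MAX_BUSES is not shown in the thread. */
#define EXAMPLE_MAX_BUSES 8u

struct bus_layout {
    unsigned devices_per_bus;
    unsigned number_of_buses;
};

/* Mirror the quoted hunk: cap devices per bus at 256, derive the
 * bus count from max_id, then clamp it to the maximum. */
struct bus_layout compute_bus_layout(unsigned max_id)
{
    struct bus_layout layout;

    layout.devices_per_bus = (max_id > 255) ? 256 : max_id;
    layout.number_of_buses =
        (layout.devices_per_bus < 256) ? 1 : max_id / 256;
    if (layout.number_of_buses > EXAMPLE_MAX_BUSES)
        layout.number_of_buses = EXAMPLE_MAX_BUSES;
    return layout;
}
```

For example, a max_id of 16 gives 16 devices on one bus, and a max_id of 512 gives 256 devices per bus on two buses.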
Comment 22 Marcus Alves Grando 2007-09-26 16:12:04 EDT
I'll attach the boot screen as soon as Bugzilla works. The login page is not working.

Regards
Comment 23 Marcus Alves Grando 2007-09-26 16:32:27 EDT
Created attachment 207501 [details]
Problem persist in kernel 2.6.9-59 (jbarton)

In the jbarton 2.6.9-59 kernel the bug still persists. Maybe it is the wrong patch.
Comment 24 Chip Coldwell 2007-09-27 09:30:54 EDT
(In reply to comment #23)
> Created an attachment (id=207501)
> Problem persist in kernel 2.6.9-59 (jbarton)
> 
> In the jbarton 2.6.9-59 kernel the bug still persists. Maybe it is the wrong patch.

Comment #20 says this patch is in build 60, so you should not expect to find it
in 2.6.9-59.  Please test 2.6.9-60 when it becomes available.

Thank you,

Chip
Comment 25 Marcus Alves Grando 2007-09-27 11:13:36 EDT
(In reply to comment #24)
> (In reply to comment #23)
> > Created an attachment (id=207501)
> > Problem persist in kernel 2.6.9-59 (jbarton)
> > 
> > In the jbarton 2.6.9-59 kernel the bug still persists. Maybe it is the wrong patch.
> 
> Comment #20 says this patch is in build 60, so you should not expect to find it
> in 2.6.9-59.  Please test 2.6.9-60 when it becomes available.
> 
> Thank you,
> 
> Chip

I know. I'm not crazy yet. But ~jbarton does not have 2.6.9-60, and I see from
the date/md5 compared with the old kernel src that 2.6.9-59 was modified. So I
thought you had updated 2.6.9-59 without bumping the kernel version. Am I wrong?

Regards
Comment 26 Chip Coldwell 2007-09-27 11:22:38 EDT
(In reply to comment #25)
> 
> I know. I'm not crazy yet. But ~jbarton does not have 2.6.9-60, and I see from
> the date/md5 compared with the old kernel src that 2.6.9-59 was modified. So I
> thought you had updated 2.6.9-59 without bumping the kernel version. Am I wrong?

Yes.  Please test 2.6.9-60 when it becomes available.

Chip

Comment 29 Chip Coldwell 2007-10-01 15:51:29 EDT
(In reply to comment #26)
> (In reply to comment #25)
> > 
> > I know. I'm not crazy yet. But ~jbarton does not have 2.6.9-60, and I see from
> > the date/md5 compared with the old kernel src that 2.6.9-59 was modified. So I
> > thought you had updated 2.6.9-59 without bumping the kernel version. Am I wrong?
> 
> Yes.  Please test 2.6.9-60 when it becomes available.

2.6.9-60 is now available at

http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/

Chip
Comment 30 Marcus Alves Grando 2007-10-02 09:35:40 EDT
(In reply to comment #29)
> (In reply to comment #26)
> > (In reply to comment #25)
> > > 
> > > I know. I'm not crazy yet. But ~jbarton does not have 2.6.9-60, and I see from
> > > the date/md5 compared with the old kernel src that 2.6.9-59 was modified. So I
> > > thought you had updated 2.6.9-59 without bumping the kernel version. Am I wrong?
> > 
> > Yes.  Please test 2.6.9-60 when it becomes available.
> 
> 2.6.9-60 is now available at
> 
> http://people.redhat.com/~jbaron/rhel4/RPMS.kernel/
> 
> Chip
> 

Works fine. Thanks all.
Comment 35 errata-xmlrpc 2007-11-15 11:31:42 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html
