Bug 279571 - Can't boot 2.6.18-26.el5 with Vmware ESX
Summary: Can't boot 2.6.18-26.el5 with Vmware ESX
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.1
Hardware: i386
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Chip Coldwell
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 280301
Depends On: 253538
Blocks:
Reported: 2007-09-05 21:54 UTC by Chris Lalancette
Modified: 2007-11-17 01:14 UTC (History)

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 20:03:29 UTC
Target Upstream Version:
Embargoed:


Attachments
console from kernel-2.6.18-45.el5.bz279571 panic (21.28 KB, text/plain)
2007-09-07 14:12 UTC, Chris Williams
console from kernel-2.6.18-45.el5.bz279571 panic take 2 (21.17 KB, text/plain)
2007-09-07 19:14 UTC, Chris Williams
workaround VMWare bug (737 bytes, patch)
2007-09-10 14:15 UTC, Chip Coldwell
mptspi.c patch (452 bytes, patch)
2007-09-10 17:01 UTC, Chip Coldwell
patch as submitted upstream by Eric Moore of LSI. (1.50 KB, patch)
2007-09-10 18:23 UTC, Chip Coldwell


Links
System: Red Hat Product Errata
ID: RHBA-2007:0959
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages for Red Hat Enterprise Linux 5 Update 1
Last Updated: 2007-11-08 00:47:37 UTC

Description Chris Lalancette 2007-09-05 21:54:40 UTC
+++ This bug was initially created as a clone of Bug #253538 +++

Description of problem:

Installing kernel-2.6.18-25.el5 in a VMware ESX guest instance boots fine.  In
the same guest instance, installing kernel-2.6.18-26.el5 or later results in a
panic on boot:

Loading ext3.ko module
mkrootdev: label / not found
Mounting root filesystem
mount: error 2 mounting ext3
mount: error 2 mounting none
Switching to new root
switchroot: mount failed: 22
umount /initrd/dev failed: 2
Kernel panic - not syncing: Attempted to kill init!

Between 25 and 26 is this changelog entry:

- [scsi] update MPT Fusion to 3.04.04 (Chip Coldwell ) [225177]

VMware emulates an LSI 1030 HBA.

Comment 1 Tom Coughlan 2007-09-06 18:28:56 UTC
If possible, please provide a crash dump, or at least a stack trace. 

Comment 3 Chip Coldwell 2007-09-06 20:43:32 UTC
I'm building a kernel with some MPT debug flags set

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=955347

when the build finishes, could somebody with VMware try it out and post the
kernel messages here.

Chip


Comment 4 Chip Coldwell 2007-09-06 21:04:30 UTC
(In reply to comment #3)
> I'm building a kernel with some MPT debug flags set
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=955347
> 
> when the build finishes, could somebody with VMware try it out and post the
> kernel messages here.

Well, that didn't build.  Let's try this:

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=955422

> 
> Chip
> 



Comment 5 Chris Williams 2007-09-06 21:36:23 UTC
The moment that task is done I'll test it.

Comment 6 Chris Williams 2007-09-07 14:08:59 UTC
I tested kernel-2.6.18-45.el5.bz279571, but I still see the same panic because
VolGroup00 cannot be found.

Comment 7 Chris Williams 2007-09-07 14:12:00 UTC
Created attachment 189931 [details]
console from kernel-2.6.18-45.el5.bz279571 panic

Comment 8 Chip Coldwell 2007-09-07 15:48:56 UTC
Eric -- could you have a look at the debug info in comment #7 above and see if
it sheds any light on why the VMware virtual mptspi adapter would stop working
after the most recent driver update (in both RHEL-4 and RHEL-5)?  The only thing
I noticed that looked like an error/warning was this:

mptbase: ioc0: IOC operational unexpected
mptbase: whoinit 0x2 statefault 0 force 0

Comment 9 Chip Coldwell 2007-09-07 15:50:13 UTC
(In reply to comment #8)
> Eric -- could you have a look at the debug info in comment #7 above and see if
> it sheds any light on why the VMware virtual mptspi adapter would stop working
> after the most recent driver update (in both RHEL-4 and RHEL-5)?

BTW the RHEL-4 bug is bug 253538.

Chip


Comment 10 Chip Coldwell 2007-09-07 15:56:29 UTC
(In reply to comment #6)
> I tested kernel-2.6.18-45.el5.bz279571, but I still see the same panic because
> VolGroup00 cannot be found.

OK, great.  Now, could you bring up the same kernel on bare metal and post the
dmesg right after boot?  You may need to increase the size of the kernel ring
buffer in order to hold all the debugging data.

Chip

Comment 11 Chip Coldwell 2007-09-07 16:11:00 UTC
(In reply to comment #8)
> Eric -- could you have a look at the debug info in comment #7 above and see if
> it sheds any light on why the VMware virtual mptspi adapter

That should be "mptsas" not "mptspi".

Chip



Comment 12 Chip Coldwell 2007-09-07 16:42:30 UTC
(In reply to comment #11)
> (In reply to comment #8)
> > Eric -- could you have a look at the debug info in comment #7 above and see if
> > it sheds any light on why the VMware virtual mptspi adapter
> 
> That should be "mptsas" not "mptspi".

I'm building another debug kernel (unfortunately, with the same name) that adds
some additional debugging info for the mptsas driver.  Could someone please boot
this on both bare metal and VMware and post the dmesg boot log.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=956312

Chip


Comment 13 Chris Williams 2007-09-07 16:47:43 UTC
Sure, I'll load it asap.

Comment 14 Chris Williams 2007-09-07 19:14:01 UTC
Created attachment 190331 [details]
console from kernel-2.6.18-45.el5.bz279571 panic take 2

Comment 16 Chip Coldwell 2007-09-10 14:14:11 UTC

(In reply to comment #15)
> FWIW, there is a Fedora BZ on this:
> https://bugzilla.redhat.com/show_bug.cgi?id=230703
> 
> Eric Moore (eric.moore) 
> I confirm that your intuition is true. max_id is indeed zero. If I change it to
> MPT_MAX_SCSI_DEVICES, devices are probed correctly. Below are the Port Facts
> returned by the VMWare firmware. PortSCSIID is equal to 7, but MaxDevices is 0,
> and that's what ioc->devices_per_bus is computed from (is this computation
> correct with respect to the semantic of the PortFacts values ?).

If I understand Eric Moore correctly, then the kernel here (when it finishes
building)

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=958967

implements a workaround for this VMWare bug.

Chip
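
For readers following the diagnosis quoted above, here is a minimal sketch,
under stated assumptions, of why a MaxDevices value of 0 leaves the bus with
nothing to scan. The struct and function names are hypothetical and are not
the driver's actual code. The only values taken from this report are
PortSCSIID = 7 and MaxDevices = 0.

```c
/* Hypothetical stand-in for the two PortFacts fields quoted above.
 * This is not the real driver structure. */
struct port_facts_sketch {
    unsigned int port_scsi_id;  /* reported as 7 by the VMware firmware */
    unsigned int max_devices;   /* reported as 0 by the VMware firmware */
};

/* Assumption: the per-bus device count is taken directly from
 * MaxDevices, as the quoted report describes. */
static unsigned int devices_per_bus_sketch(const struct port_facts_sketch *pf)
{
    return pf->max_devices;
}

/* Simplified model of a bus scan: target IDs 0 .. max_id - 1 are probed,
 * so a max_id of 0 means the loop body never runs. */
static unsigned int targets_probed_sketch(unsigned int max_id)
{
    unsigned int probed = 0;
    unsigned int id;

    for (id = 0; id < max_id; id++)
        probed++;
    return probed;
}
```

With the values from the report, devices_per_bus_sketch() returns 0 and the
scan model probes no targets, which would be consistent with the root disk
never appearing.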


Comment 17 Chip Coldwell 2007-09-10 14:15:51 UTC
Created attachment 191681 [details]
workaround VMWare bug

This is the patch in the kernel at

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=958967

Comment 18 Chip Coldwell 2007-09-10 15:53:27 UTC
(In reply to comment #17)
> Created an attachment (id=191681)
> workaround VMWare bug
> 
> This is the patch in the kernel at
> 
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=958967

This build has finished now ... Chris can you grab the kernel and let me know
what happens?

Comment 19 Eric Moore 2007-09-10 16:33:22 UTC
Chip, you should be using mptspi, instead of mptsas, for vmware.

I looked at the log in comment #7; it indicates no devices.  The fix in
#16 would probably fix it.  Christoph Hellwig rejected the patch.  There are
newer ESX servers that have fixed the problem; you should talk to Ed Goggin.
I believe this same issue was covered previously in bugzilla 230703.

Comment 20 Chip Coldwell 2007-09-10 16:58:42 UTC

(In reply to comment #19)
> Chip, you should be using mptspi, instead of mptsas, for vmware.

The log in comment #7 was loading both, so I got confused.

> I looked at the log in comment #7; it indicates no devices.  The fix in
> #16 would probably fix it.  Christoph Hellwig rejected the patch.  There are
> newer ESX servers that have fixed the problem; you should talk to Ed Goggin.
> I believe this same issue was covered previously in bugzilla 230703.

Thanks.

Chip



Comment 21 Chip Coldwell 2007-09-10 17:01:57 UTC
Created attachment 191801 [details]
mptspi.c patch

Comment 22 Chip Coldwell 2007-09-10 17:05:04 UTC
(In reply to comment #18)
> (In reply to comment #17)
> > Created an attachment (id=191681)
> > workaround VMWare bug
> > 
> > This is the patch in the kernel at
> > 
> > http://brewweb.devel.redhat.com/brew/taskinfo?taskID=958967
> 
> This build has finished now ... Chris can you grab the kernel and let me know
> what happens?

Ignore that one.  Use this one instead

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=959303


Comment 23 Eric Moore 2007-09-10 17:08:28 UTC
The patch in comment #21 should do.   I don't have access to 
brewweb.devel.redhat.com, I get a "Bad Gateway" error.




Comment 24 Chip Coldwell 2007-09-10 17:38:47 UTC
(In reply to comment #23)
> The patch in comment #21 should do.   I don't have access to 
> brewweb.devel.redhat.com, I get a "Bad Gateway" error.

When the build finishes, I'll copy the kernel out to an external web server
where you can reach it.

Thanks for reviewing the patch,

Chip


Comment 26 Chip Coldwell 2007-09-10 18:06:32 UTC
(In reply to comment #25)
> Chip,
> 
> That latest kernel from
> http://brewweb.devel.redhat.com/brew/getfile?taskID=958970&name=kernel-2.6.18-45.el5.bz279571.i686.rpm
> boots just fine.

That one implements a fix that is somewhat more similar to the (rejected)
upstream patch here

http://marc.info/?l=linux-scsi&m=117432237404247

> Did you also want me to test
> http://brewweb.devel.redhat.com/brew/taskinfo?taskID=959303

Actually, I think what I want to do is to use the literal patch from the link
above and see if that fixes the problem.  Sorry about all the churn.

Chip



Comment 27 Chip Coldwell 2007-09-10 18:23:57 UTC
Created attachment 191841 [details]
patch as submitted upstream by Eric Moore of LSI.

Comment 28 Eric Moore 2007-09-10 19:09:28 UTC
Chip, yeah, I believe we will need the last patch, because pfacts->PortSCSIID
would have been zero (due to the VMware emulation not initializing it in the
config page), and when the driver did IocInit, we would have told the firmware
that we don't support any devices.

Comment 29 Chip Coldwell 2007-09-10 19:30:33 UTC
(In reply to comment #28)
> Chip, yeah, I believe we will need the last patch, because pfacts->PortSCSIID
> would have been zero (due to the VMware emulation not initializing it in the
> config page), and when the driver did IocInit, we would have told the firmware
> that we don't support any devices.

You're referring to the patch in comment #27, right?

I'm building two kernels with that patch, one for RHEL-4 (bug 253538) and one
for RHEL-5.  If you will sign-off on that patch (or even test the kernels), that
will grease our internal code review process.

Thanks-a-million,

Chip



Comment 30 Chip Coldwell 2007-09-10 21:04:43 UTC
Build finished.  RPMs are available from 

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=959396

for folks on the internal Red Hat network, and at

http://people.redhat.com/coldwell/kernel/bugs/279571/

for folks outside the Red Hat network.

Thanks for any and all testing.

Chip


Comment 32 Chip Coldwell 2007-09-11 13:59:24 UTC
James:

Can you comment on this (from Doug Ledford):

> +     case SPI:
> +     default:
> +             max_id = MPT_MAX_SCSI_DEVICES;

Aside from this little bit, it looks fine to me.  It would appear to set the
max devices as though the card is wide SCSI without actually checking that it
is.  That implies that if there ever was a narrow SCSI MPT controller, and you
ran this driver on it, it had better not break when scanned for device IDs
that are too large to be on a narrow bus.  And this could be fine too; I just
don't know enough about the MPT hardware to know (and for that matter, the
chances of someone *still* running a narrow SCSI controller are somewhat slim,
although not non-existent).  If you are sure this is safe, then ACK.


Comment 35 Don Zickus 2007-09-12 18:43:08 UTC
The fix is in 2.6.18-47.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 36 Tom Coughlan 2007-09-13 13:26:09 UTC
*** Bug 280301 has been marked as a duplicate of this bug. ***

Comment 37 Tom Coughlan 2007-09-13 13:29:19 UTC
(In reply to comment #32)
> James:
> 
> Can you comment on this (from Doug Ledford):
> 
> > +     case SPI:
> > +     default:
> > +             max_id = MPT_MAX_SCSI_DEVICES;
> 
> Aside from this little bit that would appear to set the max devices as
> though the card is wide SCSI without actually checking that it is, 

Follow-up from Chip:

I think we're OK, or at least safe from regressions.  The MPT update
patch which introduced the problem with VMWare contained this 

@@ -943,14 +1354,13 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
         * max_lun = 1 + actual last lun,
         *      see hosts.h :o(
         */
-       sh->max_id = MPT_MAX_SCSI_DEVICES;
+       sh->max_id = ioc->devices_per_bus;
 
IOW, all MPT SPI devices used to set sh->max_id to
MPT_MAX_SCSI_DEVICES, and the patch above restores this behavior to
the post-update driver.
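
The behavior described above can be sketched roughly as follows. This is an
illustration, not the attached patch: only the SPI/default branch is taken
from the fragment quoted in comment #32. The enum, the function name, the
handling of the other bus types, and the value 16 for MPT_MAX_SCSI_DEVICES
are assumptions made for the example.

```c
/* Assumed value for this sketch: a wide SCSI bus has 16 target IDs. */
#define MPT_MAX_SCSI_DEVICES 16

/* Hypothetical bus-type tags, used only in this sketch. */
enum bus_type_sketch { FC, SPI, SAS };

/* Pick the number of target IDs to scan.  For SPI (and any unknown bus
 * type) ignore the firmware-reported count and use the fixed maximum,
 * which is what the pre-update driver did for every SPI adapter. */
static unsigned int max_id_sketch(enum bus_type_sketch bus,
                                  unsigned int fw_max_devices)
{
    unsigned int max_id;

    switch (bus) {
    case FC:
    case SAS:
        /* Placeholder assumption: trust the firmware-reported count. */
        max_id = fw_max_devices;
        break;
    case SPI:
    default:
        max_id = MPT_MAX_SCSI_DEVICES;
        break;
    }
    return max_id;
}
```

Under these assumptions, max_id_sketch(SPI, 0) yields 16 even when the
emulated firmware reports zero devices, so the scan still covers every
target ID on the bus.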


Comment 38 Eric Moore 2007-09-18 15:16:46 UTC
There's only one controller that works with the mptspi driver, which is wide.


Comment 41 errata-xmlrpc 2007-11-07 20:03:29 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


