Bug 426236 - Lenovo T61/X61 lock up with xen kernel
Summary: Lenovo T61/X61 lock up with xen kernel
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel-xen
Version: 8
Hardware: All
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Assignee: Xen Maintainance List
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 372741
Blocks:
Reported: 2007-12-19 14:35 UTC by Chris Tatman
Modified: 2009-12-14 20:37 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-01-09 05:33:20 UTC
Type: ---
Embargoed:



Description Chris Tatman 2007-12-19 14:35:43 UTC
+++ This bug was initially created as a clone of Bug #372741 +++

Description of problem:
I just installed a Lenovo T61 notebook with RHEL5.1 x86. It locked up tight
during first boot and required a power cycle to clear it. After getting through
first boot, it continued to lock up approximately every 5-6 minutes with the xen
kernel, requiring a power cycle to continue. Once locked up, I could no longer
ping it from an adjacent system. 

Running the non-xen (2.6.18-53.el5) kernel, the system appears to run fine.

These are the RHEL5.1 gold bits. The T61 was borrowed from GIT. It's the system
planned to be deployed inside Red Hat. I haven't tried another unit to rule out
a hardware problem but suspect the xen kernel because the unit appears to work
fine with the standard kernel.

This bug is preventing the system from successfully completing the certification
suite.

Version-Release number of selected component (if applicable):
2.6.18-53.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.1 gold x86 (32 bit)
2. At first boot, leave the system at the prompt for approximately 5 minutes
3. Locks up tight
  
Actual results:
Locks up tight requiring a power cycle to recover using the xen kernel. Appears
to work fine with the standard kernel.

Expected results:
Proper operation.

Additional info:

Unit is in my office at Centennial.

sosreport attached (created with non-xen kernel as I was not able to complete
one with the xen kernel -- locked up)

Running the non-xen kernel as root, after several minutes, I got the message
[root@dhcp59-128 ~]# 
Message from syslogd@ at Fri Nov  9 08:18:58 2007 ...
dhcp59-128 kernel: Disabling IRQ #82

-- Additional comment from ltroan on 2007-11-09 09:11 EST --
Created an attachment (id=252761)
/sosreport-ltroan.6151-356020-f41a04.tar.bz2


-- Additional comment from ltroan on 2007-11-09 13:01 EST --
Created an attachment (id=253151)
md5 for above sosreport


-- Additional comment from ltroan on 2007-11-09 13:02 EST --
Captured a sysreport for the xen-kernel, hopefully before the system hung. 
See below...

-- Additional comment from ltroan on 2007-11-09 13:06 EST --
Created an attachment (id=253171)
correct md5 for non-xen kernel sosreport


-- Additional comment from ltroan on 2007-11-09 13:20 EST --
Created an attachment (id=253191)
sosreport for kernel-xen (hopefully completed before the system hung)


-- Additional comment from ltroan on 2007-11-09 13:21 EST --
Created an attachment (id=253201)
md5 for kernel-xen sosreport


-- Additional comment from clalance on 2007-11-09 15:21 EST --
Larry,
     It looks like VT is disabled in the BIOS.  Out of curiosity, can you try
enabling it in the BIOS, powering off the machine, and then booting back up? 
I'm wondering if this has something to do with that.  I'll try to get my hands
on a T61 for some local testing as well.

Thanks,
Chris Lalancette

-- Additional comment from clalance on 2007-11-12 10:16 EST --
Interesting.  After setting up netconsole, I am getting this:

irq 23: nobody cared (try booting with the "irqpoll" option)
 [<c0442206>] __report_bad_irq+0x2b/0x69
 [<c04423f7>] note_interrupt+0x1b3/0x1ec
 [<c05731a5>] usb_hcd_irq+0x23/0x50
 [<c0441a3d>] handle_IRQ_event+0x49/0x51
 [<c0441af8>] __do_IRQ+0xb3/0xe8
 [<c0406d9b>] do_IRQ+0x93/0xae
 [<c0541651>] evtchn_do_upcall+0x64/0x9b
 [<c0405515>] hypervisor_callback+0x3d/0x48
 [<c04084c3>] raw_safe_halt+0x8c/0xaf
 [<c040321a>] xen_idle+0x22/0x2e
 [<c0403339>] cpu_idle+0x91/0xab
 [<c06d99f9>] start_kernel+0x381/0x388
 =======================
handlers:
[<c0573182>] (usb_hcd_irq+0x0/0x50)
Disabling IRQ #23
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:d5:f3:68/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 out
         res 40/00:00:e4:e0:03/00:00:00:00:00/e0 Emask 0x4 (timeout)
hda: lost interrupt
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs

What it looks like happened is that some USB driver didn't respond, and then it
disabled that IRQ; however, I am surmising that this IRQ is also the one that
the hard drive is attached to, so everything comes to a screeching halt.

Next is to try to find out why that USB device isn't receiving interrupts.

Chris Lalancette
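
As a rough aid for reading captures like the netconsole output above, the sketch below scans log text for the two messages quoted in this comment ("irq N: nobody cared" and "Disabling IRQ #N"). The function name and return shape are hypothetical and not part of any existing tool; it only assumes the message wording shown in this bug.

```python
import re

# Patterns taken from the messages quoted in this bug report.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLING = re.compile(r"Disabling IRQ #(\d+)")


def find_disabled_irqs(log_text):
    """Return (irq, had_nobody_cared_report) for each 'Disabling IRQ' line.

    Hypothetical helper: it reports which IRQs the kernel disabled and
    whether a 'nobody cared' report for the same IRQ appeared earlier
    in the captured text.
    """
    reported = set()
    disabled = []
    for line in log_text.splitlines():
        match = NOBODY_CARED.search(line)
        if match:
            reported.add(int(match.group(1)))
            continue
        match = DISABLING.search(line)
        if match:
            irq = int(match.group(1))
            disabled.append((irq, irq in reported))
    return disabled


if __name__ == "__main__":
    capture = (
        'irq 23: nobody cared (try booting with the "irqpoll" option)\n'
        "handlers:\n"
        "[<c0573182>] (usb_hcd_irq+0x0/0x50)\n"
        "Disabling IRQ #23\n"
        "hda: lost interrupt\n"
    )
    print(find_disabled_irqs(capture))
```

Fed the excerpt above, this yields a single entry for IRQ 23 with the flag set, which matches the order of events in the capture.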

-- Additional comment from clalance on 2007-11-12 13:41 EST --
Indeed, removing the ehci_hcd module on the subsequent reboot, I don't get these
issues.  There seem to be 2 bugs here:

1)  Whatever is causing the ehci_hcd module to not acknowledge interrupts
2)  Disabling interrupt 23 ends up disabling all interrupt sources.  I was wrong
earlier; the only thing on IRQ 23 is the ehci_hcd, so disabling that one
interrupt shouldn't cause the ATA driver to go wacky.
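
One way to check which handlers sit on a given IRQ line is to read /proc/interrupts. The sketch below parses that text under a simplifying assumption about its layout: a header row of CPU names, then per-IRQ rows with one count per CPU, one controller word, and a comma-separated handler list. Real kernels vary in the columns they print, so treat this as illustrative only. The sample table is invented for the example and is not output from the affected T61.

```python
def parse_interrupts(text):
    """Map IRQ label -> (total count, list of handler names).

    Assumes the simplified layout described above; rows that do not
    carry one numeric count per CPU are skipped.
    """
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < ncpu or not all(f.isdigit() for f in fields[:ncpu]):
            continue
        total = sum(int(f) for f in fields[:ncpu])
        tail = fields[ncpu:]
        # tail[0] is taken to be the controller name; the rest are handlers.
        names = " ".join(tail[1:])
        handlers = [n.strip() for n in names.split(",") if n.strip()]
        table[label.strip()] = (total, handlers)
    return table


# Invented sample data, for illustration only.
SAMPLE = """\
           CPU0       CPU1
 14:      48213          0   Phys-irq   ide0
 22:       9120          0   Phys-irq   uhci_hcd:usb3, HDA Intel
 23:     100000          0   Phys-irq   ehci_hcd:usb1
NMI:          0          0
"""

if __name__ == "__main__":
    for irq, (total, handlers) in parse_interrupts(SAMPLE).items():
        print(irq, total, handlers)
```

On a live system the same function could be given the contents of /proc/interrupts; an IRQ whose handler list has a single entry is not shared with any other driver.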

Also interesting is that I watched the interrupts on that device.  It was taking
interrupts just fine up until interrupt 100,000, at which point it crapped out,
since nobody handled 99,000 of the previous 100,000 interrupts.  This also
happens on the bare-metal kernel (that's what the "Disabling IRQ 82" is all
about), but in that case it just stops taking interrupts from that device, not
the ATA device.
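
The accounting described above can be modeled roughly as follows. This is a toy sketch, not kernel code: the class and function names are hypothetical, and the window (100,000) and unhandled limit (99,000) are simply the figures quoted in this comment; the exact threshold in a given kernel version may differ.

```python
from dataclasses import dataclass


@dataclass
class IrqLine:
    """Toy bookkeeping for one interrupt line (hypothetical model)."""
    window: int = 100_000          # interrupts per accounting window (from the comment)
    unhandled_limit: int = 99_000  # unhandled count that trips the check (from the comment)
    count: int = 0
    unhandled: int = 0
    disabled: bool = False

    def note_interrupt(self, handled: bool) -> None:
        """Record one interrupt; disable the line if a full window was mostly unhandled."""
        if self.disabled:
            return
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < self.window:
            return
        if self.unhandled >= self.unhandled_limit:
            self.disabled = True   # corresponds to "Disabling IRQ #N"
        # Start a fresh accounting window.
        self.count = 0
        self.unhandled = 0


def interrupts_until_disabled(line, handled_every, max_interrupts=1_000_000):
    """Deliver interrupts where only every `handled_every`-th one is claimed.

    Returns how many interrupts were delivered before the line was
    disabled, or None if it stayed enabled for `max_interrupts`.
    """
    for delivered in range(1, max_interrupts + 1):
        line.note_interrupt(handled=(delivered % handled_every == 0))
        if line.disabled:
            return delivered
    return None


if __name__ == "__main__":
    print(interrupts_until_disabled(IrqLine(), handled_every=1000))
    print(interrupts_until_disabled(IrqLine(), handled_every=1, max_interrupts=300_000))
```

In this model a line whose handler claims only one interrupt in a thousand is disabled at exactly the 100,000th interrupt, while a line whose handler claims every interrupt is never disabled.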

This is also being discussed on kernel.org Bugzilla:

http://bugzilla.kernel.org/show_bug.cgi?id=8853

Chris Lalancette

-- Additional comment from clalance on 2007-11-14 14:38 EST --
Created an attachment (id=258661)
Patch to mask pirq's when we go to disable them

I just posted this patch upstream; it actually masks out the pirq on the IOAPIC
when we go to disable them, which seems to prevent the crash on the T61.  I've
posted this now for upstream review.

Chris Lalancette

-- Additional comment from atodorov on 2007-11-16 12:09 EST --
(In reply to comment #0)
> Steps to Reproduce:
> 1. Install RHEL5.1 gold x86 (32 bit)

Could it be that the installed bits are 32-bit while this laptop has a 64-bit CPU?
I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
I am just guessing in the dark as I've never tried running 32bit RHEL on the
64bit CPU.

-- Additional comment from clalance on 2007-11-16 17:31 EST --
(In reply to comment #11)
> (In reply to comment #0)
> > Steps to Reproduce:
> > 1. Install RHEL5.1 gold x86 (32 bit)
> 
> Could it be that the installed bits are 32-bit while this laptop has a 64-bit CPU?
> I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
> I am just guessing in the dark as I've never tried running 32bit RHEL on the
> 64bit CPU.

No, installing a 32-bit OS on a 64-bit CPU is a fine thing to do.  This was
tracked down to a couple of problems, explained in Comment #9; I sent the
attached patch upstream, and we are still testing it out.

Chris Lalancette


-- Additional comment from ltroan on 2007-11-19 15:54 EST --
I just tried the X61 and it locked up during firstboot as well. It also locked
up during normal operation after a similar error message: "Disabling IRQ #233".

However, with the non-Xen kernel, it seems to run after the IRQ message. As I
recall, I didn't see the IRQ message on the T61 with the non-Xen kernel. 

Changing Summary to reflect T61/X61.

-- Additional comment from clalance on 2007-11-20 16:16 EST --
Created an attachment (id=265331)
Alternative patch from upstream to disable pirq's.

This is an alternative patch suggested by Keir.  It seems to work on the T61
laptop in question, as well.  I'm guessing that upstream will go with this fix,
so attaching here since this is what we will want for 5.2.

Chris Lalancette

-- Additional comment from clalance on 2007-11-21 10:27 EST --
Also note that the above patch was committed to upstream Xen linux-2.6.18-xen.hg:

http://xenbits.xensource.com/staging/linux-2.6.18-xen.hg?rev/51b2b0d0921c

Chris Lalancette

-- Additional comment from riek on 2007-11-29 11:10 EST --
Adjusting priority to high.

-- Additional comment from mjenner on 2007-12-06 20:15 EST --
QE ack for RHEL 5.2

-- Additional comment from dzickus on 2007-12-17 14:37 EST --
in 2.6.18-61.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 1 Chris Tatman 2007-12-19 14:40:37 UTC
I am seeing this exact same behavior on an F8 system running the latest xen
kernel.  Removing the ehci_hcd module on reboot seems to prevent the issue as
well.  Is there anything I can get you from my reproducer system?  A sosreport,
etc...

--Chris

Comment 2 Jan Wildeboer 2007-12-20 16:03:35 UTC
Adding a me-too here. X61 with BIOS version 1.16.

Will try test-kernel.

Jan

Comment 3 Chris Lalancette 2007-12-20 16:09:30 UTC
Apparently the latest BIOS update from Lenovo fixes the runaway interrupt, so
you won't run into the issue anymore.  This is still a bug in Xen; however, for
your particular situation, the BIOS update might do it.

Chris Lalancette

Comment 4 Jan Wildeboer 2008-02-08 15:21:38 UTC
After updating the needed packages I still get a royal crash when booting xen on
my X61 with RHEL5.1 32bit.

Running with:

kernel-2.6.18-78.el5.i686.rpm
kernel-xen-2.6.18-78.el5.i686.rpm
xorg-x11-drv-i810-1.6.5-9.6.0.3.el5.i386.rpm
xorg-x11-server-Xnest-1.1.1-48.36.el5.i386.rpm
xorg-x11-server-Xorg-1.1.1-48.36.el5.i386.rpm

HTH



Comment 5 Chris Lalancette 2008-02-08 22:02:18 UTC
That's probably a different bug, however.  The original bug was for RHEL-5,
where the machine would lock up after a while (not crash).  This is a clone of
that bug for Fedora-8.  If you have a crash signature, please open another bug
against RHEL-5 for it.

Thanks,
Chris Lalancette

Comment 6 Bug Zapper 2008-11-26 09:04:23 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 7 Bug Zapper 2009-01-09 05:33:20 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

