Red Hat Bugzilla – Bug 426236
Lenovo T61/X61 lock up with xen kernel
Last modified: 2009-12-14 15:37:17 EST
+++ This bug was initially created as a clone of Bug #372741 +++
Description of problem:
I just installed a Lenovo T61 notebook with RHEL5.1 x86. It locked up tight
during first boot and required a power cycle to clear it. After getting through
first boot, it continued to lock up approximately every 5-6 minutes with the xen
kernel, requiring a power cycle to continue. Once locked up, I could no longer
ping it from an adjacent system.
Running the non-xen (2.6.18-53.el5) kernel, the system appears to run fine.
These are the RHEL5.1 gold bits. The T61 was borrowed from GIT. It's the system
planned to be deployed inside Red Hat. I haven't tried another unit to rule out
a hardware problem but suspect the xen kernel because the unit appears to work
fine with the standard kernel.
This bug is preventing the system from successfully completing the certification
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Install RHEL5.1 gold x86 (32 bit)
2. At first boot, leave the system at the prompt for approximately 5 minutes
3. Locks up tight
Locks up tight requiring a power cycle to recover using the xen kernel. Appears
to work fine with the standard kernel.
Unit is in my office at Centenniel.
sosreport attached (created with non-xen kernel as I was not able to complete
one with the xen kernel -- locked up)
Running the non-xen kernel as root, after several minutes, I got the message:
Message from syslogd@ at Fri Nov 9 08:18:58 2007 ...
dhcp59-128 kernel: Disabling IRQ #82
-- Additional comment from email@example.com on 2007-11-09 09:11 EST --
Created an attachment (id=252761)
-- Additional comment from firstname.lastname@example.org on 2007-11-09 13:01 EST --
Created an attachment (id=253151)
md5 for above sosreport
-- Additional comment from email@example.com on 2007-11-09 13:02 EST --
Captured a sysreport for the xen-kernel, hopefully before the system hung.
-- Additional comment from firstname.lastname@example.org on 2007-11-09 13:06 EST --
Created an attachment (id=253171)
correct md5 for non-xen kernel sosreport
-- Additional comment from email@example.com on 2007-11-09 13:20 EST --
Created an attachment (id=253191)
sosreport for kernel-xen (hopefully completed before the system hung)
-- Additional comment from firstname.lastname@example.org on 2007-11-09 13:21 EST --
Created an attachment (id=253201)
md5 for kernel-xen sosreport
-- Additional comment from email@example.com on 2007-11-09 15:21 EST --
It looks like VT is disabled in the BIOS. Out of curiosity, can you try
enabling it in the BIOS, powering off the machine, and then booting back up?
I'm wondering if this has something to do with that. I'll try to get my hands
on a T61 for some local testing as well.
-- Additional comment from firstname.lastname@example.org on 2007-11-12 10:16 EST --
Interesting. After setting up netconsole, I am getting this:
irq 23: nobody cared (try booting with the "irqpoll" option)
Disabling IRQ #23
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:d5:f3:68/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 out
res 40/00:00:e4:e0:03/00:00:00:00:00/e0 Emask 0x4 (timeout)
hda: lost interrupt
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
What it looks like happened is that some USB driver didn't respond, and then it
disabled that IRQ; however, I am surmising that this IRQ is also the one that
the hard drive is attached to, so everything comes to a screeching halt.
Next is to try to find out why that USB device isn't receiving interrupts.
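For triage, the two kernel messages in the netconsole capture above can be pulled out of a saved log with a short script. This is a minimal sketch in Python; `find_irq_events` is a hypothetical helper name, and the patterns only cover the message wording quoted in this report.

```python
import re

# Patterns match only the message wording quoted in this bug report.
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")
DISABLING = re.compile(r"Disabling IRQ #(\d+)")


def find_irq_events(log_text):
    """Return (unhandled, disabled) sets of IRQ numbers found in log_text."""
    unhandled = {int(m.group(1)) for m in NOBODY_CARED.finditer(log_text)}
    disabled = {int(m.group(1)) for m in DISABLING.finditer(log_text)}
    return unhandled, disabled


if __name__ == "__main__":
    sample = (
        'irq 23: nobody cared (try booting with the "irqpoll" option)\n'
        "Disabling IRQ #23\n"
        "hda: lost interrupt\n"
    )
    print(find_irq_events(sample))
```

An IRQ that shows up in both sets is one the kernel gave up on after its handlers stopped claiming the interrupt.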
-- Additional comment from email@example.com on 2007-11-12 13:41 EST --
Indeed, removing the ehci_hcd module on the subsequent reboot, I don't get these
issues. There seem to be 2 bugs here:
1) Whatever is causing the ehci_hcd module to not acknowledge interrupts
2) When interrupt 23 is disabled, all interrupt sources appear to get disabled
along with it. I was wrong earlier; the only thing on IRQ 23 is the ehci_hcd, so
disabling that one interrupt shouldn't cause the ATA driver to go wacky.
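One way to check which drivers sit on a given interrupt line is to read /proc/interrupts. The sketch below, in Python, parses text in that format. `handlers_on_irq` is a hypothetical helper, the sample text is made up for illustration (it is not taken from the affected T61), and the column layout varies between kernel versions, so treat the parsing as an assumption.

```python
def handlers_on_irq(interrupts_text, irq):
    """List the handler names registered on `irq` in /proc/interrupts-style text.

    Assumes a header row of CPUn columns, one count per CPU on each row,
    and handler names that contain no spaces.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    ncpu = sum(1 for tok in lines[0].split() if tok.startswith("CPU"))
    for line in lines[1:]:
        fields = line.split()
        if not fields or fields[0] != f"{irq}:":
            continue
        # Everything after the per-CPU counts: controller type, then handlers.
        rest = " ".join(fields[1 + ncpu:])
        chunks = [c.strip() for c in rest.split(",")]
        # The first chunk still carries the controller type; keep its last token.
        first = chunks[0].split()
        chunks[0] = first[-1] if first else ""
        return [c for c in chunks if c]
    return []


# Illustrative sample only; not captured from the system in this report.
SAMPLE = """\
           CPU0       CPU1
 16:         10          0   IO-APIC-level  uhci_hcd:usb2, libata
 23:     100000          0   IO-APIC-level  ehci_hcd:usb1
"""

if __name__ == "__main__":
    print(handlers_on_irq(SAMPLE, 23))
```

On a live system the same function could be fed the contents of /proc/interrupts; a single-entry result means the line is not shared.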
Also interesting is that I watched the interrupts on that device. It was taking
interrupts just fine up until interrupt 100,000, at which point it crapped out,
since nobody handled 99,000 of the previous 100,000 interrupts. This also
happens on the bare-metal kernel (that's what the "Disabling IRQ 82" is all
about), but in that case it just stops taking interrupts from that device, not
the ATA device.
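The behaviour described above, where a line is shut off once almost every interrupt in a window of 100,000 went unhandled, can be sketched as a simple counter. This is a toy model in Python, not kernel code; the class name `IrqLine` is hypothetical, and the 100,000 and 99,000 figures are taken from the comment above, so the kernel's actual constants may differ.

```python
class IrqLine:
    """Toy model of per-IRQ stuck-interrupt detection (not kernel code)."""

    def __init__(self, window=100000, unhandled_limit=99000):
        # Figures from the comment above; assumptions, not verified constants.
        self.window = window
        self.unhandled_limit = unhandled_limit
        self.count = 0
        self.unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt and whether any handler claimed it."""
        if self.disabled:
            return
        if not handled:
            self.unhandled += 1
        self.count += 1
        if self.count < self.window:
            return
        # End of the window: disable the line if too few were handled,
        # then start a fresh window.
        if self.unhandled > self.unhandled_limit:
            self.disabled = True
        self.count = 0
        self.unhandled = 0


if __name__ == "__main__":
    line = IrqLine()
    for _ in range(100000):
        line.note_interrupt(handled=False)
    print("disabled:", line.disabled)
```

In this model nothing happens until the window fills, which matches the observation that the device took interrupts up to number 100,000 before being disabled.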
This is also being discussed on kernel.org Bugzilla:
-- Additional comment from firstname.lastname@example.org on 2007-11-14 14:38 EST --
Created an attachment (id=258661)
Patch to mask pirq's when we go to disable them
I just posted this patch upstream; it actually masks out the pirq on the IOAPIC
when we go to disable them, which seems to prevent the crash on the T61. I've
posted this now for upstream review.
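To illustrate why masking at the interrupt controller matters, here is a toy model in Python. It is not the attached patch and does not reflect the real Xen or IOAPIC interfaces; `ToyIoApicPin`, `soft_disabled`, and `masked` are hypothetical names. The assumption is a device that keeps its line asserted: marking the IRQ disabled in software stops the handler from running, but the controller keeps delivering unless the pin is masked.

```python
class ToyIoApicPin:
    """Toy interrupt pin; all names and behaviour are illustrative assumptions."""

    def __init__(self):
        self.masked = False         # masked at the controller
        self.soft_disabled = False  # only flagged as disabled in software
        self.delivered = 0          # interrupts that reached the CPU
        self.handled = 0            # interrupts for which a handler ran

    def tick(self, line_asserted):
        """One time step: deliver if the line is asserted and not masked."""
        if not line_asserted or self.masked:
            return
        self.delivered += 1
        if not self.soft_disabled:
            self.handled += 1


def run(pin, ticks):
    """Drive a permanently asserted line for `ticks` steps."""
    for _ in range(ticks):
        pin.tick(line_asserted=True)
    return pin.delivered


if __name__ == "__main__":
    soft = ToyIoApicPin()
    soft.soft_disabled = True
    hard = ToyIoApicPin()
    hard.masked = True
    print("soft-disabled deliveries:", run(soft, 1000))
    print("masked deliveries:", run(hard, 1000))
```

In the model, the soft-disabled pin still delivers on every tick while the masked pin delivers nothing, which is the difference the patch description points at.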
-- Additional comment from email@example.com on 2007-11-16 12:09 EST --
(In reply to comment #0)
> Steps to Reproduce:
> 1. Install RHEL5.1 gold x86 (32 bit)
Could it be that the installed bits are 32bit while this laptop has a 64bit CPU?
I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
I am just guessing in the dark as I've never tried running 32bit RHEL on the
64bit CPU.
-- Additional comment from firstname.lastname@example.org on 2007-11-16 17:31 EST --
(In reply to comment #11)
> (In reply to comment #0)
> > Steps to Reproduce:
> > 1. Install RHEL5.1 gold x86 (32 bit)
> Could it be that the installed bits are 32bit while this laptop has a 64bit CPU?
> I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
> I am just guessing in the dark as I've never tried running 32bit RHEL on the
> 64bit CPU.
No, installing a 32-bit OS on a 64-bit CPU is a fine thing to do. This was
tracked down to a couple of problems, explained in Comment #9; I sent the
attached patch upstream, and we are still testing it out.
-- Additional comment from email@example.com on 2007-11-19 15:54 EST --
I just tried the X61 and it locked up during firstboot as well. It also locked
up during normal operation after a similar error message: "Disabling IRQ #233".
However, with the non-Xen kernel, it seems to run after the IRQ message. As I
recall, I didn't see the IRQ message on the T61 with the non-Xen kernel.
Changing Summary to reflect T61/X61.
-- Additional comment from firstname.lastname@example.org on 2007-11-20 16:16 EST --
Created an attachment (id=265331)
Alternative patch from upstream to disable pirq's.
This is an alternative patch suggested by Keir. It seems to work on the T61
laptop in question, as well. I'm guessing that upstream will go with this fix,
so attaching here since this is what we will want for 5.2.
-- Additional comment from email@example.com on 2007-11-21 10:27 EST --
Also note that the above patch was committed to upstream Xen linux-2.6.18-xen.hg:
-- Additional comment from firstname.lastname@example.org on 2007-11-29 11:10 EST --
Adjusting priority to high.
-- Additional comment from email@example.com on 2007-12-06 20:15 EST --
QE ack for RHEL 5.2
-- Additional comment from firstname.lastname@example.org on 2007-12-17 14:37 EST --
You can download this test kernel from http://people.redhat.com/dzickus/el5
I am seeing this exact same behavior on an F8 system running the latest xen
kernel. Removing the ehci_hcd module on reboot seems to prevent the issue as
well. Is there anything I can get you from my reproducer system? A sosreport,
Adding a metoo here. X61 with BIOS version 1.16.
Will try test-kernel.
Apparently the latest BIOS update from Lenovo fixes the runaway interrupt, so
you won't run into the issue anymore. This is still a bug in Xen; however, for
your particular situation, the BIOS update might do it.
After updating the needed packages I still get a royal crash when booting xen on
my X61 with RHEL5.1 32bit.
That's probably a different bug, however. The original bug was for RHEL-5; it
would lock up the machine after a while (not crash). This is a clone of that
bug for Fedora-8. If you have a crash signature, please open another bug
against RHEL-5 for it.
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora
'version' of '8'.
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version prior to Fedora 8's end of life.
Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 8 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora please change the 'version' of this
bug to the applicable version. If you are unable to change the version,
please add a comment here and someone will do it for you.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
The process we are following is described here:
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version.
Thank you for reporting this bug and we are sorry it could not be fixed.