+++ This bug was initially created as a clone of Bug #372741 +++

Description of problem:
I just installed a Lenovo T61 notebook with RHEL5.1 x86. It locked up tight during first boot and required a power cycle to clear it. After getting through first boot, it continued to lock up approximately every 5-6 minutes with the xen kernel, requiring a power cycle to continue. Once locked up, I could no longer ping it from an adjacent system. Running the non-xen (2.6.18-53.el5) kernel, the system appears to run fine.

These are the RHEL5.1 gold bits. The T61 was borrowed from GIT. It's the system planned to be deployed inside Red Hat. I haven't tried another unit to rule out a hardware problem but suspect the xen kernel because the unit appears to work fine with the standard kernel.

This bug is preventing the system from successfully completing the certification suite.

Version-Release number of selected component (if applicable):
2.6.18-53.el5xen

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL5.1 gold x86 (32 bit)
2. At first boot, leave the system at the prompt for approximately 5 minutes
3. Locks up tight

Actual results:
Locks up tight requiring a power cycle to recover using the xen kernel. Appears to work fine with the standard kernel.

Expected results:
Proper operation.

Additional info:
Unit is in my office at Centenniel. sosreport attached (created with non-xen kernel as I was not able to complete one with the xen kernel -- locked up)

Running the non-xen kernel as root, after several minutes, I got the message

[root@dhcp59-128 ~]#
Message from syslogd@ at Fri Nov 9 08:18:58 2007 ...
dhcp59-128 kernel: Disabling IRQ #82

-- Additional comment from ltroan on 2007-11-09 09:11 EST --
Created an attachment (id=252761)
/sosreport-ltroan.6151-356020-f41a04.tar.bz2

-- Additional comment from ltroan on 2007-11-09 13:01 EST --
Created an attachment (id=253151)
md5 for above sosreport

-- Additional comment from ltroan on 2007-11-09 13:02 EST --
Captured a sysreport for the xen-kernel, hopefully before the system hung. See below...

-- Additional comment from ltroan on 2007-11-09 13:06 EST --
Created an attachment (id=253171)
correct md5 for non-xen kernel sosreport

-- Additional comment from ltroan on 2007-11-09 13:20 EST --
Created an attachment (id=253191)
sosreport for kernel-xen (hopefully completed before the system hung)

-- Additional comment from ltroan on 2007-11-09 13:21 EST --
Created an attachment (id=253201)
md5 for kernel-xen sosreport

-- Additional comment from clalance on 2007-11-09 15:21 EST --
Larry,
It looks like VT is disabled in the BIOS. Out of curiosity, can you try enabling it in the BIOS, powering off the machine, and then booting back up? I'm wondering if this has something to do with that. I'll try to get my hands on a T61 for some local testing as well.

Thanks,
Chris Lalancette

-- Additional comment from clalance on 2007-11-12 10:16 EST --
Interesting.
After setting up netconsole, I am getting this:

irq 23: nobody cared (try booting with the "irqpoll" option)
 [<c0442206>] __report_bad_irq+0x2b/0x69
 [<c04423f7>] note_interrupt+0x1b3/0x1ec
 [<c05731a5>] usb_hcd_irq+0x23/0x50
 [<c0441a3d>] handle_IRQ_event+0x49/0x51
 [<c0441af8>] __do_IRQ+0xb3/0xe8
 [<c0406d9b>] do_IRQ+0x93/0xae
 [<c0541651>] evtchn_do_upcall+0x64/0x9b
 [<c0405515>] hypervisor_callback+0x3d/0x48
 [<c04084c3>] raw_safe_halt+0x8c/0xaf
 [<c040321a>] xen_idle+0x22/0x2e
 [<c0403339>] cpu_idle+0x91/0xab
 [<c06d99f9>] start_kernel+0x381/0x388
 =======================
handlers:
[<c0573182>] (usb_hcd_irq+0x0/0x50)
Disabling IRQ #23
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: cmd ca/00:08:d5:f3:68/00:00:00:00:00/e7 tag 0 cdb 0x0 data 4096 out
         res 40/00:00:e4:e0:03/00:00:00:00:00/e0 Emask 0x4 (timeout)
hda: lost interrupt
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs

What it looks like happened is that some USB driver didn't respond, and then it disabled that IRQ; however, I am surmising that this IRQ is also the one that the hard drive is attached to, so everything comes to a screeching halt. Next is to try to find out why that USB device isn't receiving interrupts.

Chris Lalancette

-- Additional comment from clalance on 2007-11-12 13:41 EST --
Indeed, removing the ehci_hcd module on the subsequent reboot, I don't get these issues. There seem to be 2 bugs here:

1) Whatever is causing the ehci_hcd module to not acknowledge interrupts
2) When interrupt 23 is disabled, the fact that it disabled all interrupt sources.

I was wrong earlier; the only thing on IRQ 23 is the ehci_hcd, so disabling that one interrupt shouldn't cause the ATA driver to go wacky.

Also interesting is that I watched the interrupts on that device.
It was taking interrupts just fine up until interrupt 100,000, at which point it crapped out, since nobody handled 99,000 of the previous 100,000 interrupts. This also happens on the bare-metal kernel (that's what the "Disabling IRQ 82" is all about), but in that case it just stops taking interrupts from that device, not the ATA device.

This is also being discussed on kernel.org Bugzilla:

http://bugzilla.kernel.org/show_bug.cgi?id=8853

Chris Lalancette

-- Additional comment from clalance on 2007-11-14 14:38 EST --
Created an attachment (id=258661)
Patch to mask pirq's when we go to disable them

I just posted this patch upstream; it actually masks out the pirq on the IOAPIC when we go to disable them, which seems to prevent the crash on the T61. I've posted this now for upstream review.

Chris Lalancette

-- Additional comment from atodorov on 2007-11-16 12:09 EST --
(In reply to comment #0)
> Steps to Reproduce:
> 1. Install RHEL5.1 gold x86 (32 bit)

May it be that the bits installed are 32bit and this laptop having a 64bit CPU?
I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
I am just guessing in the dark as I've never tried running 32bit RHEL on the 64bit CPU.

-- Additional comment from clalance on 2007-11-16 17:31 EST --
(In reply to comment #11)
> (In reply to comment #0)
> > Steps to Reproduce:
> > 1. Install RHEL5.1 gold x86 (32 bit)
>
> May it be that the bits installed are 32bit and this laptop having a 64bit CPU?
> I have a Lenovo T60 running 5.1 xen kernel(x86_64) without any problems.
> I am just guessing in the dark as I've never tried running 32bit RHEL on the
> 64bit CPU.

No, installing a 32-bit OS on a 64-bit CPU is a fine thing to do. This was tracked down to a couple of problems, explained in Comment #9; I sent the attached patch upstream, and we are still testing it out.

Chris Lalancette

-- Additional comment from ltroan on 2007-11-19 15:54 EST --
I just tried the X61 and it locked up during firstboot as well.
It also locked up during normal operation after a similar error message: "Disabling IRQ #233". However, with the non-Xen kernel, it seems to run after the IRQ message. As I recall, I didn't see the IRQ message on the T61 with the non-Xen kernel. Changing Summary to reflect T61/X61.

-- Additional comment from clalance on 2007-11-20 16:16 EST --
Created an attachment (id=265331)
Alternative patch from upstream to disable pirq's.

This is an alternative patch suggested by Keir. It seems to work on the T61 laptop in question, as well. I'm guessing that upstream will go with this fix, so attaching here since this is what we will want for 5.2.

Chris Lalancette

-- Additional comment from clalance on 2007-11-21 10:27 EST --
Also note that the above patch was committed to upstream Xen linux-2.6.18-xen.hg:

http://xenbits.xensource.com/staging/linux-2.6.18-xen.hg?rev/51b2b0d0921c

Chris Lalancette

-- Additional comment from riek on 2007-11-29 11:10 EST --
Adjusting priority to high.

-- Additional comment from mjenner on 2007-12-06 20:15 EST --
QE ack for RHEL 5.2

-- Additional comment from dzickus on 2007-12-17 14:37 EST --
in 2.6.18-61.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
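For readers following the cloned comments above: the "nobody cared" / "Disabling IRQ" behavior (the line is switched off once almost none of the last 100,000 interrupts were handled) can be modeled roughly as below. This is only a toy sketch in Python, not kernel code. The class and method names are hypothetical, and the 99,000 threshold is simply the figure quoted in the comment; the value the kernel actually uses may differ.

```python
class IrqLine:
    """Toy model of one interrupt line with a spurious-interrupt detector."""

    WINDOW = 100_000          # interrupts per accounting window (figure from the comment)
    UNHANDLED_LIMIT = 99_000  # assumed threshold, taken from the comment's wording

    def __init__(self, number):
        self.number = number
        self.count = 0        # interrupts seen in the current window
        self.unhandled = 0    # of those, how many no handler claimed
        self.disabled = False

    def note_interrupt(self, handled):
        """Record one interrupt; return True if this call disabled the line."""
        if self.disabled:
            return False
        self.count += 1
        if not handled:
            self.unhandled += 1
        if self.count < self.WINDOW:
            return False
        # End of window: decide, then start a fresh window.
        stuck = self.unhandled > self.UNHANDLED_LIMIT
        self.count = 0
        self.unhandled = 0
        if stuck:
            self.disabled = True
        return stuck


# A device whose interrupts are never claimed by any driver.
line = IrqLine(23)
fired = 0
while not line.disabled:
    line.note_interrupt(handled=False)
    fired += 1
print(f"Disabling IRQ #{line.number} after {fired} interrupts")
```

The point of the model is that disabling the line is a last-resort safety valve: it is only meant to silence the one runaway source, which is why the loss of ATA interrupts under Xen was treated as a second, separate bug.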
I am seeing this exact same behavior on an F8 system running the latest xen kernel. Removing the ehci_hcd module on reboot seems to prevent the issue as well. Is there anything I can get you from my reproducer system? A sosreport, etc...

--Chris
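One thing that may be worth collecting from a reproducer is which handlers are registered on the affected IRQ, since the earlier analysis hinged on whether the USB controller shared its line with the disk. A minimal sketch for pulling that out of /proc/interrupts-style text follows. The function name is hypothetical, the sample table is made up for illustration (it is not from the attached sosreports), and the parser assumes the simple column layout of per-CPU counts, one controller-type field, then a comma-separated handler list.

```python
def devices_by_irq(text):
    """Map each numeric IRQ to the list of handler names registered on it."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        label = label.strip()
        if not label.isdigit():
            continue  # skip NMI, LOC, ERR and similar summary rows
        fields = rest.split()
        # fields: per-CPU counts, then the controller type, then handler names
        names = " ".join(fields[ncpus + 1:])
        table[int(label)] = [n.strip() for n in names.split(",") if n.strip()]
    return table


# Hypothetical sample, not taken from the attached sosreports.
SAMPLE = """\
           CPU0       CPU1
 14:       5321          0        Phys-irq  ide0
 22:      48210          0        Phys-irq  uhci_hcd:usb3, libata
 23:     100000          0        Phys-irq  ehci_hcd:usb2
NMI:          0          0
"""

# IRQs with more than one registered handler are shared lines.
shared = {irq: devs for irq, devs in devices_by_irq(SAMPLE).items() if len(devs) > 1}
print(shared)
```

On a live system the same function could be fed the contents of /proc/interrupts; layouts with extra columns would need the field offset adjusted.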
Adding a metoo here. X61 with BIOS version 1.16. Will try test-kernel. Jan
Apparently the latest BIOS update from Lenovo fixes the runaway interrupt, so you won't run into the issue anymore. This is still a bug in Xen; however, for your particular situation, the BIOS update might do it.

Chris Lalancette
After updating the needed packages I still get a royal crash when booting xen on my X61 with RHEL5.1 32bit.

Running with:
kernel-2.6.18-78.el5.i686.rpm
kernel-xen-2.6.18-78.el5.i686.rpm
xorg-x11-drv-i810-1.6.5-9.6.0.3.el5.i386.rpm
xorg-x11-server-Xnest-1.1.1-48.36.el5.i386.rpm
xorg-x11-server-Xorg-1.1.1-48.36.el5.i386.rpm

HTH
That's probably a different bug, however. The original bug was for RHEL-5; it would lock up the machine after a while (not crash). This is a clone of that bug for Fedora-8. If you have a crash signature, please open another bug against RHEL-5 for it.

Thanks,
Chris Lalancette
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.