Bug 707966

Summary: 2.6.18-238.1.1.el5 or newer won't boot under Xen HVM due to linux-2.6-virt-nmi-don-t-print-nmi-stuck-messages-on-guests.patch
Product: Red Hat Enterprise Linux 5
Reporter: Jan Kundrát <jkt>
Component: kernel-xen
Assignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Priority: urgent
Version: 5.5
CC: ajb, cww, dhoward, drjones, dzickus, honza801, jzheng, leiwang, mrezanin, pbonzini, pcao, qguan, qwan, sforsber, xen-maint
Target Milestone: rc
Keywords: Regression, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: kernel-2.6.18-285.el5
Doc Type: Bug Fix
Doc Text:
A previously applied patch, which helped clean up after a failed nmi_watchdog check by disabling various registers, caused single-vcpu Xen HVM guests to become unresponsive during boot when the host CPU was an Intel Xeon Processor E5405 or an Intel Xeon Processor E5420 and the VM configuration did not have the apic = 1 parameter set. With this update, NMI_NONE is the default watchdog on AMD64 HVM guests, which fixes this issue.
Last Closed: 2012-02-21 03:48:55 UTC
Bug Blocks: 514489, 739823    
Attachments:
Description                                                              Flags
Xen config of the HVM domain                                             none
xm-dmesg-2.6.18-221.                                                     none
amend NMI stuck printk patch so that APIC stuff runs only on bare-metal  none
make NMI_NONE the default watchdog in x86_64 hvm guests                  none

Description Jan Kundrát 2011-05-26 12:48:39 UTC
Created attachment 501071 [details]
Xen config of the HVM domain

Summary: our fully-virtualized HVM Xen guests won't boot with any 5.x kernel newer than (or including) 2.6.18-238.1.1.el5 unless we remove patch#25523 (linux-2.6-virt-nmi-don-t-print-nmi-stuck-messages-on-guests.patch, available at [1]) from the build process.

Details:
We've observed this issue on multiple Scientific Linux 5 machines running various combinations of dom0 kernels, including 2.6.18-194.32.1.el5 and 2.6.18-238.1.1.el5. The common configuration is that the dom0 is running Xen, and that the guests utilize full HW virtualization (Xen HVM -- please see the attached domU configuration file for details). This combination has worked fine for many months, up to the time when Scientific Linux 5.5 imported kernel packages from 5.6, like the 2.6.18-238.1.1.el5. When using these kernels inside the domU, the guest kernel gets stuck at one of these phases of its boot:

Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
[it gets stuck here]
[...more messages...]
PCI: Setting latency timer of device 0000:00:01.1 to 64
     ide0: BM-DMA at 0xc000-0xc007, BIOS settings: hda:pio, hdb:pio
     ide1: BM-DMA at 0xc008-0xc00f, BIOS settings: hdc:pio, hdd:pio 
[this is the second possible place]
Probing IDE interface ide0..
[on one occasion, it made it up to this place]

Commenting out the "serial='pty'" line in my Xen config file for the domU will prevent it getting stuck at the first possible place, but it still always hangs up just after printing the "ide1". We have tried many VBD options, including direct export of a FC block device which we normally use, as well as the tap:aio: access and the old-school file:/ prefix, but there's no observable difference in behavior.

The very same image of the domU machine (including the problematic new kernel) boots fine on a desktop computer with recent Gentoo/KVM/libvirt. A weird thing is that, back on the production infrastructure based on 5.5, when I ask the grub embedded in my domU image to boot the Xen version of the kernel (so that the dom0 runs Xen on bare metal as usual, and there's one more Xen running inside the domU, and this second Xen then runs the Xenified kernel (..."el5xen" vmlinuz) inside the domU), the guest boots just fine.

I've verified that commenting out application of patch#25523 in the 2.6.18-238.5.1.el5's spec file causes all symptoms to go away; I can use the resulting RPM for booting the guests and everything works just fine.

Please let me know if you require any more details. I've already spent a considerable amount of time fighting with bisecting the 5k+ kernel patches in order to find out which one caused this regression, so I'd love to work with you on fixing this issue.

With kind regards,
Jan Kundrát

[1] http://dev.gentoo.org/~jkt/tmp/linux-2.6-virt-nmi-don-t-print-nmi-stuck-messages-on-guests.patch

Comment 1 Andrew Jones 2011-05-26 15:45:37 UTC
Hi Jan,

assuming there wasn't some strange issue that led the bisection to this patch incorrectly, then the only thing it could be is the apic_write. I wonder why this hasn't shown up in our testing or from other deployments yet, though.

One quick thing to test is to boot with nmi_watchdog=2 on the guest's kernel command line. That'll set the nmi_watchdog to NMI_LOCAL_APIC and call lapic_watchdog_stop instead (which I believe is a no-op, if not you can try adding nolapic to the guest kernel command line too to make sure it is). 

If booting with this (these) command line options makes all your guests happy over several reboot cycles then we can be pretty confident that the issue is the apic_write and we'll start looking at writing a patch.

thanks,
Drew

Comment 2 Jan Kundrát 2011-05-27 08:14:38 UTC
Hi Drew, I tried playing with 2.6.18-245.el5. Without that extra parameter, that kernel gets stuck during boot. Adding just the "nmi_watchdog=2" to the guest's command line allows it to boot, and it behaves consistently over several `xm destroy`/`xm create` cycles, so I guess you've indeed found the culprit.

I'm not sure how relevant this is, but the physical machine on which this happens is an HP Proliant DL360 G5, dmidecode reports its BIOS version as "P58" from "08/03/2008". The CPUs are Intel Xeon E5420. However, I'm pretty sure I tried that even on some Dell or Supermicro machines (could check if it helps).

Cheers,
Jan

Comment 3 Andrew Jones 2011-05-27 11:15:55 UTC
Hi Jan,

thanks for the additional testing. I actually believe this is a regression introduced from a hypervisor patch, rather than this kernel patch. I believe this is the culprit

[xen] emulate injection of guest NMI

which first appeared in the -222 build. Since this kernel patch you've pointed to first appeared in the -216 build, we could actually try a -216 build or anything < -222 and see if the problem is gone. Then see if the problem appears with -222, for our final confirmation.

If this is the problem patch, then it looks like it may be possible to work around it by allocating at least two vcpus for the guest, i.e. change the config to be vcpus=2.

Also, for the next round of tests please add 'loglvl=all guest_loglvl=all' to your hypervisor command line (xen.gz in grub) and reboot it. Then, after the failure grab 'xm dmesg' output.

thanks,
Drew

Comment 4 Jan Kundrát 2011-05-27 16:46:16 UTC
(In reply to comment #3)
> [xen] emulate injection of guest NMI
> 
> which first appeared in the -222 build.

Please note that at the time I first hit this issue, the dom0 was running kernel 2.6.18-194.32.1.el5xen, ie. a kernel which is not supposed to be affected. At that time, I also tried going between the -194.32.1.el5xen and -238.1.1.el5xen versions of the dom0 kernels, but did not see any difference -- no matter what version was running in the dom0, as soon as I switched to -238.1.1 inside the domU, the domU wouldn't boot anymore.

> If this is the problem patch, then it looks like it may be possible to work
> around it by allocating at least two vcpus for the guest, i.e. change the
> config to be vcpus=2.

I can confirm that when I try the -245.el5 kernel inside the domU (without any change in the dom0 at all, ie. still at -238.1.1.el5xen inside the dom0), change the vcpu count to 2 in the Xen config for that particular domU and remove the nmi_watchdog bit from its kernel command line, the guest boots properly.

> Also, for the next round of tests please add 'loglvl=all guest_loglvl=all' to
> your hypervisor command line (xen.gz in grub) and reboot it. Then, after the
> failure grab 'xm dmesg' output.

Do you still want me to reboot with those settings, or are the comments above enough?

Comment 5 Andrew Jones 2011-05-27 17:29:36 UTC
Having issues with the -194 dom0 is a strange data point, but I don't believe the apic_write(APIC_LVT0, APIC_DM_NMI... should have done anything at all without the patch I pointed to. So I'm inclined to believe the -194 dom0 problem was a different issue. The vcpu test also seems to confirm it's the hypervisor patch causing the issues (triggered by the kernel patch you pointed to). The extra logging could help us further confirm this (and fix it) as well, so please grab them if you can.

Comment 6 Jan Kundrát 2011-05-27 17:33:56 UTC
OK. Where can I find the RPMs (or SRPMs) of the dom0 kernel and Xen combination I should try? Any specific domU options this time?

Comment 7 Andrew Jones 2011-05-30 14:23:51 UTC
Hi Jan,

I've uploaded some rpms to here

http://people.redhat.com/drjones/707966/

I'd appreciate the following tests

- Install the -221 kernel/xen on your host and then add 'loglvl=all guest_loglvl=all' to the xen.gz command line.
- Make sure your guest config only has one vcpu assigned to it
- After booting up the host on the new kernel, attempt to boot an HVM guest that has a >= -216 kernel (you can use the -264 that I put in the directory in order to use the very latest).

I believe this will work.

- Then install the -222 kernel/xen on the host and try booting the guest again.

I believe this will fail. Please capture the logs from 'xm dmesg' after it fails (the loglvl=all guest_loglvl=all should still be there).

- Then to be 100% thorough you can try installing the -264 kernel/xen on your host and repeat the boot test.

I believe this will also fail, and the 'xm dmesg' logs should be similar. I'd like these logs as well though.

Thanks for all the testing!!

Drew

Comment 8 Paolo Bonzini 2011-05-30 16:12:53 UTC
It seems to me that the apic_write call is wrong.  The APIC setup of Xen HVM guests will deliver external interrupts directly to the LAPIC, not to the IOAPIC.  For this reason, the hvmloader will do apic_write(APIC_LVT0, APIC_DM_EXTINT) before starting Linux.

Linux detects this, and does not mask the external interrupts.  This is shown in dmesg during boot (loglevel=9 acpi=debug) with something like

enabled ExtINT on CPU#0
ENABLING IO-APIC IRQs
init IO_APIC IRQs
 IOAPIC (apicid-pin) 1-0, 1-16, [etc.]
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1

Testing the NMI watchdog starts by writing APIC_DM_NMI to the LVT0 register.  In the old code, the register kept this value forever.  Here is where something is missing in my theory, because I do not get how the old code can work.

However, now the NMI watchdog code writes APIC_DM_NMI | APIC_LVT_MASKED and this is probably the source of the problem: the new value is effectively masking external interrupts including, guess what, the serial port and IDE controller interrupts.

   Note: with 5.7, it will work again by chance, because patch b41ec96 ([xen] 
   x86/hvm: Enable delivering 8259 interrupts to VCPUs != 0, 2011-03-02) will 
   default to delivering interrupts to VCPU#0.  Before that patch, the 
   interrupts simply will not be delivered.
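
To make the register values discussed above easier to compare, here is a small illustrative sketch (not kernel code). The three constants use the values found in Linux's apicdef.h; the name DELIVERY_MODE_MASK and the decode_lvt0() helper are hypothetical and exist only for this example. It decodes an LVT0 value into its delivery mode and mask bit, and makes no claim about how Xen actually delivers interrupts.

```python
# Illustrative only: decode the LVT0 values mentioned in this comment.
# Constant values follow Linux's apicdef.h; the helper names are made up.
APIC_DM_NMI = 0x400        # delivery mode: NMI
APIC_DM_EXTINT = 0x700     # delivery mode: ExtINT (external, 8259-style)
APIC_LVT_MASKED = 1 << 16  # mask bit of an LVT entry
DELIVERY_MODE_MASK = 0x700  # hypothetical name for bits 8..10

DELIVERY_MODES = {APIC_DM_NMI: "NMI", APIC_DM_EXTINT: "ExtINT"}


def decode_lvt0(value):
    """Return (delivery mode name, masked?) for an LVT0 register value."""
    mode = DELIVERY_MODES.get(value & DELIVERY_MODE_MASK, "other")
    return mode, bool(value & APIC_LVT_MASKED)


if __name__ == "__main__":
    for label, value in [
        ("set by hvmloader", APIC_DM_EXTINT),
        ("old watchdog test", APIC_DM_NMI),
        ("patched cleanup", APIC_DM_NMI | APIC_LVT_MASKED),
    ]:
        print(label, hex(value), decode_lvt0(value))
```

Only the third value has the mask bit set, which is the difference this comment points at.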

In any case, the correct fix is one of the following:

1) not muck with LVT0 at all on the grounds that it has always worked (!).  Possibly, complete the above theory to understand _why_ it has always worked.

2) save the old value of LVT0 in enable_NMI_through_LVT0 and add a disable_NMI_through_LVT0 that restores it.
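
Option 2 can be sketched as a toy model, assuming a fake register file in place of the real local APIC. The FakeApic class and the Python function names are hypothetical; disable_nmi_through_lvt0 is only proposed in this comment and is not an existing kernel function. The register offset and delivery-mode values follow Linux's apicdef.h.

```python
# Toy model of option 2: remember LVT0 before switching it to NMI delivery
# and put the old value back afterwards. Not kernel code.
APIC_LVT0 = 0x350       # LVT0 register offset
APIC_DM_NMI = 0x400     # delivery mode: NMI
APIC_DM_EXTINT = 0x700  # delivery mode: ExtINT


class FakeApic:
    """Hypothetical stand-in for the local APIC register file."""

    def __init__(self, lvt0):
        self.regs = {APIC_LVT0: lvt0}
        self.saved_lvt0 = None

    def read(self, reg):
        return self.regs[reg]

    def write(self, reg, value):
        self.regs[reg] = value


def enable_nmi_through_lvt0(apic):
    """Save the current LVT0 value, then route LVT0 as NMI."""
    apic.saved_lvt0 = apic.read(APIC_LVT0)
    apic.write(APIC_LVT0, APIC_DM_NMI)


def disable_nmi_through_lvt0(apic):
    """Restore whatever LVT0 held before enable_nmi_through_lvt0()."""
    if apic.saved_lvt0 is not None:
        apic.write(APIC_LVT0, apic.saved_lvt0)
        apic.saved_lvt0 = None


if __name__ == "__main__":
    apic = FakeApic(APIC_DM_EXTINT)  # state left by hvmloader, per this comment
    enable_nmi_through_lvt0(apic)
    disable_nmi_through_lvt0(apic)
    print(hex(apic.read(APIC_LVT0)))
```

The point of the design is that the cleanup path never has to guess the right value for LVT0; it restores whatever the firmware or loader had programmed.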


By the way, note that depending on whether the apic2/pin2 is -1 or not in the above message, the routines that disconnect the APIC to prepare for reboot work very differently:

- if it is -1, disconnect_bsp_APIC(0) is called.  It will basically do apic_write(APIC_LVT0, APIC_DM_EXTINT)

- if it is not -1, disable_IO_APIC will do the IOAPIC equivalent of apic_write(APIC_LVT0, APIC_DM_EXTINT).  Then, disconnect_bsp_APIC(1) is called which will do apic_write(APIC_LVT0, APIC_LVT_MASKED).

Now, the APIC_LVT_MASKED bit is set, so the delivery mode is not interesting.  As such, the apic_write in the patch is equivalent to

                        /* Disable LVT0 */
                        apic_write(APIC_LVT0, APIC_LVT_MASKED);

which is the "wrong choice" when running under Xen.


Also by the way, none of this IOAPIC business is done by the kernel when running as dom0---the hypervisor does it instead---which is why running the guest as a "nested" dom0 works.

Comment 9 Paolo Bonzini 2011-05-30 16:15:39 UTC
The text between "By the way" and "Also by the way" above doesn't make much sense and should not have been there. :)

Comment 10 Jan Kundrát 2011-06-01 17:05:25 UTC
(In reply to comment #7)
> - Install the -221 kernel/xen on your host and then add 'loglvl=all
> guest_loglvl=all' to the xen.gz command line.
> - Make sure your guest config only has one vcpu assigned to it
> - After booting up the host on the new kernel, then attempt to boot an HVM
> guest that has a >= -216 kernel (you can use the -264 that I put in the
> directory in order to use the very latest).
> 
> I believe this will work.

Hi Andrew, sorry for the delay. I've followed these instructions (ie. 2.6.18-221.el5xen in the dom0, the -264 in domU, boot arguments to Xen in dom0), but the guest is getting stuck immediately after the "Serial..." line. I'll attach the xm dmesg log shortly (bugzie won't allow me now).

I haven't done anything else.

Comment 11 Jan Kundrát 2011-06-01 17:08:24 UTC
Created attachment 502326 [details]
xm-dmesg-2.6.18-221.

Please note that I started the domU once, saw that it gets stuck, `xm destroy`ed it, started again, observed the same behavior and only then grabbed the `xm dmesg` log.

Comment 12 Andrew Jones 2011-06-01 18:21:33 UTC
yeah, these are the interesting logs

(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x30
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20

It looks like I stand corrected. The kernel patch alone appears to start the problems. We can see that now, because we're running without the HV NMI-related patch and are still having problems.

I uploaded kernel-2.6.18-215.el5.x86_64.rpm to the same place. This is the kernel build right before the patch you pointed to. To be completely thorough you could try it on your guest running this -221 host and see that it boots without any problem. However, looking at these logs we see the apic_write is certainly causing some havoc.
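
For anyone repeating these tests, the relevant lines can be picked out of saved 'xm dmesg' output with a short script. This is a minimal sketch: the regular expression is written against the four lines quoted above only, the function name is hypothetical, and the reading of "d2" as the domain ID is an assumption.

```python
import re
from collections import Counter

# Matches lines such as:
#   (XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x30
# The "d<N>" token is assumed to be the domain ID.
PATTERN = re.compile(
    r"\(XEN\) vlapic\.c:\d+:d(?P<domid>\d+) "
    r"Local APIC Write to read-only register (?P<reg>0x[0-9a-fA-F]+)"
)


def count_readonly_writes(log_text):
    """Count read-only register writes per (domain ID, register offset)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            key = (int(match.group("domid")), int(match.group("reg"), 16))
            counts[key] += 1
    return counts


SAMPLE = """\
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x30
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:687:d2 Local APIC Write to read-only register 0x20
"""

if __name__ == "__main__":
    for (domid, reg), n in sorted(count_readonly_writes(SAMPLE).items()):
        print(f"domain {domid}: register {reg:#x} written {n} time(s)")
```

The script takes the log as text, so it can be fed a saved file rather than run on the host itself.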

Comment 13 Jan Kundrát 2011-06-01 19:45:40 UTC
(In reply to comment #12)
> I uploaded kernel-2.6.18-215.el5.x86_64.rpm to the same place. This is the
> kernel build right before the patch you pointed to. To be completely thorough
> you could try it on your guest running this -221 host and see that it boots
> without any problem. However, looking at these logs we see the apic_write is
> certainly causing some havoc.

I can confirm that the -215 guest boots fine on the -221 host, as you suspected.

Please let me know if you could use any further testing.

Comment 14 RHEL Program Management 2011-08-12 16:29:37 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 15 Qin Guan 2011-08-23 13:10:29 UTC
This problem can be reproduced on a host with an Intel Xeon E5405 CPU, using the guest config described in attachment 501071 [details].

Host: 2.6.18-238.1.1.el5xen
Guest: 2.6.18-238.1.1.el5

However, if "apic = 1" is added to the config, the guest boots successfully.

BTW, this problem was not found on an AMD host (tested on AMD Opteron 1216), even without "apic = 1" in the guest config.

Comment 16 Laszlo Ersek 2011-08-29 11:34:30 UTC
Created attachment 520358 [details]
amend NMI stuck printk patch so that APIC stuff runs only on bare-metal

fix up commit b1c317b so that this block is only run on bare-metal:

    Also a little bit of extra code is in this patch to help clean-up a
    failed nmi_watchdog check by disabling various registers as noticed
    during testing.

Comment 18 Paolo Bonzini 2011-08-29 11:46:04 UTC
I like Laszlo's patch, it's the simplest that can work.  Perhaps it would be _too_ simple for upstream or even RHEL6, but it's better to be surgical in RHEL5.

Comment 19 Laszlo Ersek 2011-08-29 16:19:23 UTC
I tested the following configurations. I created the guests with virt-manager, and then updated the kernel in the RHEL-5.6.

Guest                                     Host
----------------------  -----------------------------------------
                        2.6.18-238.24.1.el5xen  2.6.18-281.el5xen

2.6.18-238.el5      G1  boot OK                 boot OK
2.6.18-238.24.1.el5 G1  boot OK                 boot OK
2.6.18-274.el5      G1  boot OK                 boot OK

G1
  name = "rhel57-64bit-hvm-bz707966"
  uuid = "a0203b16-91a0-1958-27d6-f97e36cea95f"
  maxmem = 512
  memory = 512
  vcpus = 1
  builder = "hvm"
  kernel = "/usr/lib/xen/boot/hvmloader"
  boot = "c"
  pae = 1
  acpi = 1
  apic = 1
  localtime = 0
  on_poweroff = "destroy"
  on_reboot = "restart"
  on_crash = "restart"
  device_model = "/usr/lib64/xen/bin/qemu-dm"
  sdl = 0
  vnc = 1
  vncunused = 1
  keymap = "en-us"
  disk = [ "file:/var/lib/xen/images/rhel57-64bit-hvm-bz707966.img,hda,w",",hdc:cdrom,r" ]
  vif = [ "mac=00:16:36:0c:d4:4b,bridge=xenbr0,script=vif-bridge" ]
  parallel = "none"
  serial = "pty"

No problems seen. The host CPU is a Xeon W3550.

I'll retry without apic=1.

Comment 20 Laszlo Ersek 2011-08-29 16:29:14 UTC
Commented out the acpi and apic lines in the guest config seen in comment 19, still can't reproduce the problem (host: 2.6.18-238.24.1.el5xen, guest: 2.6.18-238.24.1.el5).

Am I doing something wrong? AFAICT b41ec96 was not backported to 5.6.z.

Jan, can you try upgrading to 2.6.18-238.24.1? Thanks.

Comment 21 Laszlo Ersek 2011-08-29 16:34:39 UTC
(In reply to comment #20)

> Jan, can you try upgrading to 2.6.18-238.24.1?

Or can you please apply attachment 520358 [details] on top of whatever canned kernel fails for you and retest? Thank you!

Comment 22 Laszlo Ersek 2011-08-29 18:45:43 UTC
(In reply to comment #15)
> This problem can be reproduced on host CPU model Intel Xeon E5405 with guest
> conf as attachment 501071 [details] described.
> 
> Host: 2.6.18-238.1.1.el5xen
> Guest: 2.6.18-238.1.1.el5
> 
> While, if add "apic = 1" in the conf, the guest can boot up successfully.
> 
> BTW, this problem not found on AMD host (tested on AMD Opteron 1216) even
> without "apic = 1" in the guest conf.

For completeness, I tried to reproduce the problem as follows:
- host: W3550 CPU, 2.6.18-238.1.1.el5xen
- guest: 2.6.18-238.1.1.el5; vm config: both with and without apic & acpi 

No hang.

Comment 27 Laszlo Ersek 2011-08-30 10:52:29 UTC
I finally managed to reproduce the hang on a Xeon E5405 host.

Host:
* -283 (hypervisor and dom0)

Guests: 
* checked both -238.24.1 (most recent 5.6.z atm) and -283 (most recent 5.8
  working build)

* Guest config (commenting out acpi and apic is critical -- when I left those
  uncommented, the boots worked flawlessly):

  name = "rhel56-64bit-hvm-bz707966"
  uuid = "cc9c19ec-41d5-4f89-894b-d24e5260c1d3"
  maxmem = 512
  memory = 512
  vcpus = 1
  builder = "hvm"
  kernel = "/usr/lib/xen/boot/hvmloader"
  boot = "c"
  pae = 1
  # acpi = 1
  # apic = 1
  localtime = 0
  on_poweroff = "destroy"
  on_reboot = "restart"
  on_crash = "restart"
  device_model = "/usr/lib64/xen/bin/qemu-dm"
  sdl = 0
  vnc = 1
  vncunused = 1
  keymap = "en-us"
  disk = [ "file:/var/lib/xen/images/rhel56-64bit-hvm-bz707966.img,hda,w",
           ",hdc:cdrom,r" ]
  vif = [ "mac=00:16:3e:5e:89:15,bridge=xenbr0,script=vif-bridge" ]
  parallel = "none"
  serial = "pty"

* hang reproduced under both guest kernels, after the following message was
  printed:

  Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled

Furthermore, the xm dmesg entries listed in comment 12 show up again:

(XEN) vlapic.c:689:d6 Local APIC Write to read-only register 0x30
(XEN) vlapic.c:689:d6 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:689:d6 Local APIC Write to read-only register 0x20
(XEN) vlapic.c:689:d6 Local APIC Write to read-only register 0x20

(In reply to comment #8)

>    Note: with 5.7, it will work again by chance, because patch b41ec96 ([xen] 
>    x86/hvm: Enable delivering 8259 interrupts to VCPUs != 0, 2011-03-02) will 
>    default to delivering interrupts to VCPU#0.  Before that patch, the 
>    interrupts simply will not be delivered.

I might be testing under circumstances that are not valid for the quoted paragraph, but it doesn't seem to work like this. The hypervisor is -283 and the hang reproduces. (The first build to contain b41ec96 is -256.)

I'll check my guest patch (= attachment 520358 [details]) on:
- host = -283
- guest = -283
- acpi & apic commented out in the vm config

Comment 28 Laszlo Ersek 2011-08-30 11:22:10 UTC
> I'll check my guest patch (= attachment 520358 [details]) on:
> - host = -283
> - guest = -283
> - acpi & apic commented out in the vm config

Guest boot succeeds.

Comment 33 Jan Kundrát 2011-08-30 14:12:06 UTC
Here's a quick summary of what I've done again today:

Host: 2.6.18-238.19.1.el5xen
Guest: 2.6.18-238.5.1.el5

without "apic=1", with just "vcpus=1": won't boot
without "apic=1", with "vcpus=2": boots
with "apic=1", with just "vcpus=1": boots
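
The three results above can be written down as a small lookup, together with a rule of thumb that matches them. This is only a summary of what this thread reports for the affected Intel hosts and unfixed guest kernels; the function name is hypothetical and the rule is not derived from the kernel or hypervisor source.

```python
# Results reported in this comment (host 2.6.18-238.19.1.el5xen,
# guest 2.6.18-238.5.1.el5). Key: (apic option set, vcpus), value: boots?
OBSERVED = {
    (False, 1): False,
    (False, 2): True,
    (True, 1): True,
}


def predicted_to_hang(apic, vcpus):
    """Hypothetical rule of thumb fitted to the three observed rows."""
    return not apic and vcpus == 1


if __name__ == "__main__":
    for (apic, vcpus), boots in OBSERVED.items():
        verdict = "hangs" if predicted_to_hang(apic, vcpus) else "boots"
        print(f"apic={int(apic)} vcpus={vcpus}: observed "
              f"{'boots' if boots else 'hangs'}, rule says {verdict}")
```

The combination apic=1 with vcpus=2 was not part of this test run, so the rule's answer for it is an extrapolation.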

I'm on Scientific Linux, and I currently don't see an RPM for the -238.24.1 in there, and hence can't really test it. If you can build/provide one for me, I'll be happy to test it in the guest.

Comment 34 Jan Kundrát 2011-08-30 14:17:41 UTC
I'm also happy to build my own RPM, but didn't find a suitable SRPM at ftp://ftp.redhat.com/redhat/linux/enterprise/5Server/en/os/SRPMS, sorry.

Comment 35 Laszlo Ersek 2011-08-30 15:06:01 UTC
Hello Jan,

(In reply to comment #34)
> I'm also happy to build my own RPM, but didn't find a suitable SRPM at
> ftp://ftp.redhat.com/redhat/linux/enterprise/5Server/en/os/SRPMS, sorry.

the patch for the guest (attachment 520358 [details]) also applies to 2.6.18-238.19.1.el5, and should work the same way.

Comment 37 Laszlo Ersek 2011-09-06 13:43:23 UTC
Created attachment 521671 [details]
make NMI_NONE the default watchdog in x86_64 hvm guests

x86 already defaults to NMI_NONE, there's no need to check if we're running as a guest.

x86 & x86_64: if the user specified "nmi_watchdog=...", the warning is warranted.
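
The selection logic described by this patch summary might be modelled roughly as follows. This is a sketch under stated assumptions, not the patch itself: the function and its parameters are hypothetical, NMI_LOCAL_APIC = 2 is taken from comment 1, and the other numeric values (including the NMI_DEFAULT sentinel meaning "nothing given on the command line") are assumptions. The bare-metal x86_64 default is passed in rather than asserted.

```python
# Assumed watchdog identifiers; only NMI_LOCAL_APIC = 2 is stated in this thread.
NMI_DEFAULT = -1     # assumption: user gave no nmi_watchdog= option
NMI_NONE = 0         # assumption
NMI_IO_APIC = 1      # assumption
NMI_LOCAL_APIC = 2


def pick_watchdog(requested, arch, is_hvm_guest, bare_metal_default):
    """Model of the default choice described in this comment (hypothetical)."""
    if requested != NMI_DEFAULT:
        # An explicit "nmi_watchdog=..." is honoured, so a later
        # "NMI appears to be stuck" warning is warranted.
        return requested
    if arch == "x86":
        # x86 already defaults to NMI_NONE, guest or not.
        return NMI_NONE
    if is_hvm_guest:
        # New behaviour: x86_64 HVM guests default to no watchdog.
        return NMI_NONE
    return bare_metal_default


if __name__ == "__main__":
    print(pick_watchdog(NMI_DEFAULT, "x86_64", True, NMI_IO_APIC))
    print(pick_watchdog(NMI_LOCAL_APIC, "x86_64", True, NMI_IO_APIC))
```

Keeping the explicit request ahead of the guest check reflects the second remark above: a user who asks for a watchdog still gets it, along with any warning.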

Comment 38 Paolo Bonzini 2011-09-06 14:33:50 UTC
Comment on attachment 521671 [details]
make NMI_NONE the default watchdog in x86_64 hvm guests

looks good

Comment 39 Don Zickus 2011-09-06 16:14:38 UTC
I think this patch is the better approach.

Cheers,
Don

Comment 44 Laszlo Ersek 2011-09-07 14:09:41 UTC
(In reply to comment #37)
> Created attachment 521671 [details]
> make NMI_NONE the default watchdog in x86_64 hvm guests

x86_64 hvm guest, apic=0 vm option
* -284 hangs,
* -284+patch works,
* -284+patch + nmi_watchdog=1 warns
  (WARNING: CPU#0: NMI appears to be stuck (0->0)!) and hangs,
* -284+patch + nmi_watchdog=2 doesn't warn and doesn't hang

x86_64 hvm guest, apic=1 vm option
* -284 works, /proc/sys/kernel/nmi_watchdog says 0
* -284+patch works, /proc/sys/kernel/nmi_watchdog says 0

Comment 45 Laszlo Ersek 2011-09-07 14:32:22 UTC
(In reply to comment #37)
> Created attachment 521671 [details]
> make NMI_NONE the default watchdog in x86_64 hvm guests

Sanity checked on 32-bit hvm guest (pae=1 vm option and -284+patch i686 PAE guest kernel): works with both apic=0 and apic=1 vm opts.

Comment 48 Jarod Wilson 2011-09-16 20:19:48 UTC
Patch(es) available in kernel-2.6.18-285.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

Comment 52 Martin Prpič 2011-10-27 09:19:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
A previously applied patch, which helped clean up after a failed nmi_watchdog check by disabling various registers, caused single-vcpu Xen HVM guests to become unresponsive during boot when the host CPU was an Intel Xeon Processor E5405 or an Intel Xeon Processor E5420 and the VM configuration did not have the apic = 1 parameter set. With this update, NMI_NONE is the default watchdog on AMD64 HVM guests, which fixes this issue.

Comment 53 Qin Guan 2011-12-14 13:02:57 UTC
Verified the fix with guest kernel 2.6.18-301.el5. Also reproduced the problem with the released RHEL 5.7 kernel (-274).

Version:
Host CPU model: Intel Xeon E5405
Host kernel: kernel-xen-2.6.18-300.el5
Guest kernel: kernel 2.6.18-301.el5

Verification steps:
1. Set the following two options in the guest config:
vcpus = 1
apic = 0 (or comment out this option)

2. Create the guest with the above config.

Test result:
The guest starts up without getting stuck, and logging in succeeds.

Comment 54 errata-xmlrpc 2012-02-21 03:48:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0150.html