Bug 717742 - [RHEL5.7][kernel-xen] HVM guests hang during installation on AMD systems
Summary: [RHEL5.7][kernel-xen] HVM guests hang during installation on AMD systems
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.7
Hardware: Unspecified
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Paolo Bonzini
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On: 719894 725928
Blocks: 514489 684637 719066
Reported: 2011-06-29 18:40 UTC by Jeff Burke
Modified: 2013-01-10 00:02 UTC (History)

Fixed In Version: kernel-2.6.18-273.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-21 09:56:17 UTC
Target Upstream Version:
Embargoed:


Attachments
RHEL5.7-Server-20110409.3 x86_64 HVM DomU hang after boot from boot.iso (3.29 KB, text/plain)
2011-06-30 06:12 UTC, Qixiang Wan
rhel5.7 64bit hvm boot failed (26.60 KB, image/png)
2011-06-30 10:02 UTC, Pengzhen Cao
dmesg after "hang" at disk init during install (15.77 KB, text/plain)
2011-06-30 10:20 UTC, Andrew Jones
rhel5.7-20110413.3-x86_64.amd.guest-boot.iso-panic (14.90 KB, image/png)
2011-07-01 03:52 UTC, Pengzhen Cao
x86_64 guest crash over 273 xen on some of AMD cpus (8.68 KB, text/plain)
2011-07-08 08:32 UTC, Qixiang Wan
the hypervisor log (12.38 KB, text/plain)
2011-07-08 09:17 UTC, Qixiang Wan
[PATCH] xen: svm: fix emulator (2.32 KB, patch)
2011-07-08 13:23 UTC, Andrew Jones


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2011:1065 0 normal SHIPPED_LIVE Important: Red Hat Enterprise Linux 5.7 kernel security and bug fix update 2011-07-21 09:21:37 UTC

Description Jeff Burke 2011-06-29 18:40:03 UTC
Description of problem:
 While testing the 2.6.18-271.el5 kernel, we hit an issue where HVM guests can no longer be installed on AMD hosts.

Version-Release number of selected component (if applicable):
kernel-xen 2.6.18-271.el5 x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install an AMD system with Snapshot5 (RHEL5.7-Server-20110622.0) x86_64
2. Upgrade the Dom0 kernel to kernel-xen 2.6.18-271.el5 x86_64
3. Use virt-install to install an HVM guest
  
Actual results:
 The system hangs just after clicking "Yes" at the prompt to destroy all data. If you go to tty2 and run top, lvm is at the top of the list.

Expected results:
System should install

Additional info:
If I use kernel-xen 2.6.18-270.el5 x86_64, the installation completes. This seems to be a regression in the 2.6.18-271.el5 kernel.

Comment 1 Don Dutile (Red Hat) 2011-06-29 21:12:58 UTC
Which AMD system -- 
What was guest being installed?

Comment 2 Jeff Burke 2011-06-30 00:43:20 UTC
(In reply to comment #1)
> Which AMD system -- 
I was able to duplicate this issue on any AMD system that supports HVM.
> What was guest being installed?
Not sure what you are asking here. I was installing an HVM guest using RHEL5.7-Server-20110622.0 x86_64.

Comment 3 Qixiang Wan 2011-06-30 05:20:18 UTC
We tried with:

Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : FAIL
Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110409.3 x86_64 HVM DomU : FAIL
Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110622.0 i386   HVM DomU : PASS
Host kernel-xen-2.6.18-271.el5 + RHEL5.6 released   x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS

Comment 4 Qixiang Wan 2011-06-30 06:09:13 UTC
The failures of RHEL5.7 20110409.3 and RHEL5.7 20110622.0 are a little different:

[1] The 20110409.3 x86_64 HVM DomU hangs after booting from the boot.iso. The console log will be attached soon.

[2] The RHEL5.7 20110622.0 x86_64 HVM DomU hangs at or before storage initialization.

I think the failure of 20110409.3 is not the same issue, will try with 270 host.

Comment 5 Qixiang Wan 2011-06-30 06:12:07 UTC
Created attachment 510578 [details]
RHEL5.7-Server-20110409.3 x86_64 HVM DomU hang after boot from boot.iso

host kernel : 2.6.18-271.el5xen

Comment 6 Qixiang Wan 2011-06-30 06:47:27 UTC
(In reply to comment #4)
> I think the failure of 20110409.3 is not the same issue, will try with 270
> host.

I was wrong: there is no issue installing 20110409.3 on the 270 host, so it's probably the same issue.

Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110409.3 x86_64 HVM DomU : PASS

Comment 7 Andrew Jones 2011-06-30 08:52:44 UTC
Hmm, the bisection of the HV definitely points at

commit eba8ca99b31737c482e49a612516a17c435c3685
Author: Andrew Jones <drjones>
Date:   Thu May 19 14:13:14 2011 -0400

    [xen] hvm: svm support cleanups

However, this once worked: see bug 702657 comment 32, which shows that we actually tested it and it worked. That means something else changed. I now have access to amd-dinar-05.lab.bos.redhat.com, so using that box I'll revert everything back to the way it was when the patch worked, and then incrementally bring in the new stuff to see where it breaks.

Comment 8 Jan Stancek 2011-06-30 09:16:30 UTC
That patch also went to 5-6-Z, so I tried manually installing the following combinations:

Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-238.15.1.el5xen + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-238.16.1.el5xen + RHEL5.7 20110622.0 x86_64 HVM DomU : FAIL

Comment 9 Pengzhen Cao 2011-06-30 10:01:36 UTC
You can also boot an existing RHEL5.7 20110622.0 x86_64 guest to test this issue.
The guest VM will hang while device-mapper is scanning logical volumes; see the attached screenshot.

Comment 10 Pengzhen Cao 2011-06-30 10:02:30 UTC
Created attachment 510621 [details]
rhel5.7 64bit hvm boot failed

Comment 11 Andrew Jones 2011-06-30 10:18:37 UTC
I reproduced the "hang" at the storage initialization point. The quotes are around hang because, for the most part, it's not a hang.

anaconda is angry and thus doesn't do anything - which appears like a hang, but the guest is still running as far as xen is concerned, and you can even switch to vterm2 in virt-viewer and use the shell on the guest. I did that and was able to scp dmesg to my notebook (I'll attach it). The last messages are device-mapper related, which is consistent with comment 10.

I also have successfully installed a 5.6 guest on the same host (which is a build that still has the patch in question - comment 7).

So, so far it appears there's an issue caused by combining the xen patch AND something that went into 5.7, which is file system related.

Comment 12 Andrew Jones 2011-06-30 10:20:38 UTC
Created attachment 510629 [details]
dmesg after "hang" at disk init during install

Comment 13 RHEL Program Management 2011-06-30 11:59:54 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 14 Paolo Bonzini 2011-06-30 14:10:02 UTC
It's true we did test it (bug 703715), but unfortunately Pengzhen didn't mention the version of the guest he used for testing.  That would explain the problem if the filesystem issue is in the guest.  But since booting an existing guest also fails, perhaps you can try bisecting the guest kernels instead?  It's painful because you need to reboot the host multiple times, but it's possible.

Also, I suppose all of you are using file images.  Perhaps you can also try using raw partitions to check if the filesystem issue (current working hypothesis) is in the guest or the host.

Finally (and actually the more interesting part): do you see the "Mismatch between expected and actual instruction bytes:" in "xm dmesg", either before or after the breakage?  Unfortunately it has not been attached to the BZ yet.

Comment 16 Andrew Jones 2011-06-30 19:33:08 UTC
I've been focussing on figuring out what the guest kernel is trying to do when
it hangs. Thus far I've been leaving the hypervisor patch alone (even though
it's clearly connected in some way). So here are some experiment results using
a -272 host and guests that I installed while running on -270.

My 5.7 guest obviously doesn't boot (we knew that), and it always hangs at the
same place, i.e. right after printing

    Waiting for driver initialization.
    Scanning and configuring dmraid supported devices
    Scanning logical volumes

This means it hangs right after we start a vgscan. Had the vgscan
succeeded, we would have seen these messages next

      Reading all physical volumes.  This may take a while...
      Found volume group "VolGroup00" using metadata type lvm2
    Activating logical volumes
      2 logical volume(s) in volume group "VolGroup00" now active
    Trying to resume from /dev/VolGroup00/LogVol01

xenctx shows that one proc is off in the weeds and the other three are in
rip: ffffffff8006be1c default_idle+0x29

weeds proc

rip: 00000000004d2ff7 
flags: 00000206 i nz p
rsp: 00007fff3d78fb90
rax: 0000000080000000 rcx: 0000000000000006 rdx: 000000000050d140
rbx: 0000000068747541 rsi: 0000000002008140 rdi: 0000000000a00000
rbp: 0000000000080000  r8: 00000000007af9b0  r9: 2f2f2f2f2f2f2f2f
r10: 0000000000021000 r11: 0000000000000014 r12: 0000000000010000
r13: 00007fff3d78fed8 r14: 0000000000000001 r15: 00000000004b1810
 cs: 0033  ss: 002b  ds: 0000  es: 0000
 fs: 0000 @ 0000000000000000
 gs: 0000 @ 0000000000000000/0000000000000000

If attempting with UP, then there's only the weeds proc.
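
One value in the register dump above can be read directly: rbx holds 0x68747541, and those four bytes, taken least significant first, are the ASCII characters "Auth". That is the start of the "AuthenticAMD" vendor string that CPUID leaf 0 returns in EBX, which hints that the stuck process had been executing CPUID. The sketch below, with a hypothetical helper name `reg32_to_ascii`, only illustrates the decoding; it is not part of any tool mentioned in this bug.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack a 32-bit register value into four characters, least significant
 * byte first (the order x86 would store them in memory). Non-printable
 * bytes are shown as '.'. The helper name is hypothetical. */
static void reg32_to_ascii(uint32_t reg, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)(reg >> (8 * i));
        out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
    }
    out[4] = '\0';
}
```

Passing 0x68747541 (the low 32 bits of rbx in the dump) fills the buffer with "Auth".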

My 5.6 guest boots fine (we knew that too).

Here's what we didn't know. My 5.6 guest boots fine using the -272 kernel as
well, and my 5.7 guest still doesn't boot using the -238 (5.6 GA) kernel.

Then I cloned my 5.6 guest, booted it, and yum updated lvm2 from the 5.7 repo,
which brought in device-mapper dependencies.

(1/4): device-mapper-event-1.02.63-4.el5.x86_64.rpm      |  23 kB     00:00     
(2/4): device-mapper-1.02.63-4.el5.i386.rpm              | 776 kB     00:00     
(3/4): device-mapper-1.02.63-4.el5.x86_64.rpm            | 807 kB     00:00     
(4/4): lvm2-2.02.84-6.el5.x86_64.rpm                     | 3.1 MB     00:03     

After the yum update completed successfully I typed 'vgscan' and the guest
hung.
xenctx showed that it hung the same way. This guest now hangs in a new place at
boot, right after

   md: Autodetecting RAID arrays.
   md: autorun ...
   md: ... autorun DONE.
   device-mapper: multipath: version 1.0.6 loaded
   Setting up Logical Volume Management:

but it has the same xenctx signature.


We should get some device-mapper and lvm folk to take a look at this in order
to help with the debug.

Comment 17 Tom Coughlan 2011-06-30 23:01:52 UTC
(In reply to comment #16)

> We should get some device-mapper and lvm folk to take a look at this in order
> help with the debug.

I have done so.

The last changes to device-mapper and lvm were in snapshot 4, I believe. 

Did this test run on, and pass, snapshot 3, 4, 5?

Comment 18 Pengzhen Cao 2011-07-01 01:20:21 UTC
The rhel5.7 guest I used when verifying bug 703715 is:
RHEL5.7-Server-20110513.0, 
guest kernel version: kernel-2.6.18-261.el5.x86_64

device-mapper and lvm pkg version:
dmraid-1.0.0.rc13-65.el5
dmraid-events-1.0.0.rc13-65.el5
lvm2-2.02.84-3.el5
device-mapper-multipath-0.4.7-46.el5
device-mapper-1.02.63-2.el5
device-mapper-event-1.02.63-2.el5
device-mapper-1.02.63-2.el5

This guest works fine on host kernels 2.6.18-272 and 2.6.18-238.17.1

(In reply to comment #14)
> It's true we did test it (bug 703715), but unfortunately Pengzhen didn't
> mention the version of the guest he used for testing.  That would explain the
> problem if the filesystem issue is in the guest.  But since booting an existing
> guest also fails, perhaps you can try bisecting the guest kernels instead? 
> It's painful because you need to reboot the host multiple times, but it's
> possible.
> 
> Also, I suppose all of you are using file images.  Perhaps you can also try
> using raw partitions to check if the filesystem issue (current working
> hypothesis) is in the guest or the host.
> 
> Finally (and actually the more interesting part): do you see the "Mismatch
> between expected and actual instruction bytes:" in "xm dmesg", either before or
> after the breakage?  Unfortunately it has not been attached to the BZ yet.

Comment 19 Milan Broz 2011-07-01 03:45:41 UTC
(In reply to comment #16)
> After the yum update completed successfully I typed 'vgscan' and the guest
> hung.

Please can you update from the latest 5.7 repo (lvm2 should be lvm2-2.02.84-6.el5, device-mapper-1.02.63-4.el5).

Then for the hanging vgscan add -vvvv option and attach debug output (IOW run "vgscan -vvvv"). Also task list (echo t>/proc/sysrq-trigger) and output from "dmsetup info -c --noopencount" would be very useful.

Comment 20 Pengzhen Cao 2011-07-01 03:47:27 UTC
I have tried again with two rhel5.7 x86_64 guests on the same AMD machine with the 272 xen kernel.

1. rhel5.7-20110409.3 x86_64: booted the guest with boot.iso,
http://download.englab.nay.redhat.com/pub/rhel/rel-eng/RHEL5.7-Server-20110409.3/tree-x86_64/images/boot.iso
The guest hangs and the guest kernel panics; see the attachment.


2. rhel5.7-20110513.0 x86_64: booting the guest with boot.iso, installing it from the installation DVD, and booting the installed guest all work without issue.

I think the 20110409.3 guest panic might be a different issue from the one seen with the latest rhel5.7 guest, although both may share the same root cause in the host's kernel-xen.

Comment 21 Pengzhen Cao 2011-07-01 03:52:58 UTC
Created attachment 510794 [details]
rhel5.7-20110413.3-x86_64.amd.guest-boot.iso-panic

Comment 22 Andrew Jones 2011-07-01 09:35:43 UTC
I had momentarily forgotten that my experiments with anaconda last night had proven this wasn't a real hang. This morning I tried an ingenious thing (ctrl-C) after starting vgscan on a guest with updated lvm2. It worked. So it's easy to experiment with this, as I can run vgscan as many times as I want in my guest. Paolo suggested we run vgscan in gdb, so I did and got the instruction.

0x00000000004d2ff7 <init_cacheinfo+327>:	cpuid

The emulation of this was indeed changed with

eba8ca9 [xen] hvm: svm support cleanups

specifically we now just return, rather than attempt to emulate the instruction, if we don't get the expected instruction length. Paolo has a hunch why we might be failing this condition. He's attempting to write a reproducer so we can take lvm out of the equation.

Comment 23 Paolo Bonzini 2011-07-01 09:49:24 UTC
It's a Xen bug.  Reproducer:

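#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>

/* xor %eax, %eax; pusha; cpuid; popa; ret */
static unsigned char cpuid_bytes[] = { 0x33, 0xc0, 0x60, 0x0f, 0xa2, 0x61, 0xc3 };

int main(void)
{
  /* Map two adjacent pages; the stub goes at the very end of the first one. */
  void *m = mmap(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC,
                 MAP_PRIVATE|MAP_ANON, -1, 0);
  if (m == MAP_FAILED)
    exit (1);
  unsigned long maddr = (unsigned long)m + 4096 - sizeof cpuid_bytes;
  /* Make the second page inaccessible, so nothing past the stub is readable. */
  mprotect((void *) (maddr + sizeof cpuid_bytes), 4096, PROT_NONE);
  void (*cpuid)(void) = (void (*)(void)) maddr;
  memcpy((void *) maddr, cpuid_bytes, sizeof cpuid_bytes);
  /* CPUID now executes fewer than 15 bytes before an unreadable page. */
  cpuid();
  exit (0);
}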

(must be compiled 32-bit, i.e. with -m32).  The bug occurs when CPUID is less than 15 bytes from the end of a page, and the next page is not readable.
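
The condition described above can be checked with simple page arithmetic. The following is a toy model, not Xen code: it assumes 4096-byte pages and the architectural maximum x86 instruction length of 15 bytes, and the names `fetchable_bytes`, `strict_can_emulate` and `tolerant_can_emulate` are hypothetical. The "strict" policy only loosely mirrors the behaviour described in comment 22 (give up unless the expected number of bytes is available); the "tolerant" policy is one conceivable alternative and is not a description of the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE     4096u
#define MAX_INSN_LEN  15u   /* architectural maximum x86 instruction length */

/* Number of bytes an instruction fetch starting at rip can read, given
 * whether the page following rip's page is readable. Toy model only. */
static size_t fetchable_bytes(uint64_t rip, int next_page_readable)
{
    size_t to_page_end = PAGE_SIZE - (size_t)(rip % PAGE_SIZE);
    if (to_page_end >= MAX_INSN_LEN || next_page_readable)
        return MAX_INSN_LEN;
    return to_page_end;
}

/* Strict policy: refuse to emulate unless a full 15-byte window was read. */
static int strict_can_emulate(uint64_t rip, int next_page_readable)
{
    return fetchable_bytes(rip, next_page_readable) == MAX_INSN_LEN;
}

/* Tolerant policy: only the bytes of the instruction itself are required. */
static int tolerant_can_emulate(uint64_t rip, int next_page_readable,
                                size_t insn_len)
{
    assert(insn_len >= 1 && insn_len <= MAX_INSN_LEN);
    return fetchable_bytes(rip, next_page_readable) >= insn_len;
}
```

In the reproducer the CPUID opcode starts 4 bytes before the page boundary (page offset 4092), and the rip reported in comment 16 and comment 22, 0x4d2ff7, lies 9 bytes before one. If the following page is unreadable, both fall short of a 15-byte window even though the two-byte CPUID opcode itself is fully readable.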

Comment 29 Jarod Wilson 2011-07-06 15:24:52 UTC
Patch(es) available in kernel-2.6.18-273.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
...
Note: this kernel contains patches that are under embargo until 2011.07.07, so
it will not actually be available until the 7th or 8th.

Comment 31 Qixiang Wan 2011-07-08 08:32:27 UTC
Created attachment 511865 [details]
x86_64 guest crash over 273 xen on some of AMD cpus

The fix introduced another regression: RHEL5.6 and RHEL5.7 64-bit HVM guests crash during boot on some models of AMD processors (e.g. Dual-Core 1220, Athlon(tm) Dual Core 5400B). It can't be reproduced on an AMD Phenom(tm) II X4 B95 Processor, and it can't be reproduced with i386 guests.

guest log is attached.

Setting up hotplug.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at arch/x86_64/kernel/smp.c:77
invalid opcode: 0000 [1] SMP 
last sysfs file: /class/firmware/timeout
CPU 0 
Modules linked in:
Pid: 1, comm: init Not tainted 2.6.18-238.el5 #1
RIP: 0010:[<ffffffff8002b32e>]  [<ffffffff8002b32e>] flush_tlb_page+0x6d/0xda
RSP: 0000:ffff81003ff95cb8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000000000
RDX: ffff81003fa08768 RSI: ffff81003fb2ed98 RDI: ffff81003ff95cd8
RBP: ffff81003fb2eac0 R08: ffff810000012b00 R09: ffff8100016f05a0
R10: 0000000018f7b9c0 R11: ffff81003fa08298 R12: 0000000018f7b9c4
R13: ffff8100016f05a0 R14: ffff81003fb2eac0 R15: ffff81003fb3ebd8
FS:  0000000018f7b930(0063) GS:ffffffff80425000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000018f7b9c4 CR3: 000000003fb2f000 CR4: 00000000000006e0
Process init (pid: 1, threadinfo ffff81003ff94000, task ffff81003ff827a0)
Stack:  0000000000000000 0000000000000000 0000000000000000 0000000000000000
 0000000000000012 0000000000000001 ffff8100016f05c8 ffffffff800111a4
 ffff81003fb2eac0 ffff81003fa0b638 0000000018f7b9c4 ffff81003fa08768
Call Trace:
 [<ffffffff800111a4>] do_wp_page+0x3fd/0x902
 [<ffffffff8000866f>] copy_page_range+0x6a1/0x795
 [<ffffffff800096ce>] __handle_mm_fault+0xf6b/0x1039
 [<ffffffff800a0282>] attach_pid+0x7c/0xa9
 [<ffffffff8006720b>] do_page_fault+0x4cb/0x874
 [<ffffffff80062ff0>] thread_return+0x62/0xfe
 [<ffffffff8005dde9>] error_exit+0x0/0x84


Code: 0f 0b 68 f5 09 2b 80 c2 4d 00 65 48 8b 04 25 48 00 00 00 90 
RIP  [<ffffffff8002b32e>] flush_tlb_page+0x6d/0xda
 RSP <ffff81003ff95cb8>
 <0>Kernel panic - not syncing: Fatal exception
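
The "Code:" line of the oops above can be cross-checked against the reported location. Assuming the 2.6.18-era x86_64 `BUG()` macro expands to `ud2` followed by a `pushq` of the file name pointer and a `ret` whose immediate is the line number (an assumption about that kernel's layout, not something stated in this report), the bytes decode as sketched below. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bug_info {
    uint64_t file_ptr;  /* sign-extended address of the file name string */
    unsigned line;      /* source line number */
};

/* Decode "ud2; pushq $imm32; ret $imm16" from an oops "Code:" dump.
 * Returns 1 on success, 0 if the bytes do not match that pattern.
 * The layout is an assumption about the 2.6.18 x86_64 BUG() macro. */
static int decode_bug_frame(const unsigned char *code, size_t len,
                            struct bug_info *out)
{
    assert(code != NULL && out != NULL);
    if (len < 10)
        return 0;
    if (code[0] != 0x0f || code[1] != 0x0b)   /* ud2 */
        return 0;
    if (code[2] != 0x68 || code[7] != 0xc2)   /* push imm32, ret imm16 */
        return 0;
    uint32_t imm32 = (uint32_t)code[3] | (uint32_t)code[4] << 8 |
                     (uint32_t)code[5] << 16 | (uint32_t)code[6] << 24;
    /* pushq sign-extends its 32-bit immediate to 64 bits */
    out->file_ptr = (imm32 & 0x80000000u)
                        ? (0xffffffff00000000ull | imm32)
                        : (uint64_t)imm32;
    out->line = (unsigned)code[8] | (unsigned)code[9] << 8;
    return 1;
}
```

For the bytes 0f 0b 68 f5 09 2b 80 c2 4d 00, the decoded line number is 0x4d, i.e. 77, which matches "smp.c:77" in the oops.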

Comment 32 Qixiang Wan 2011-07-08 09:17:31 UTC
Created attachment 511875 [details]
the hypervisor log

Attaching the hypervisor log from before and after the guest crash. I didn't see anything in it that looks related.

Comment 33 Paolo Bonzini 2011-07-08 11:06:44 UTC
The crash is bug 719894.

Comment 34 Andrew Jones 2011-07-08 13:23:12 UTC
Created attachment 511933 [details]
[PATCH] xen: svm: fix emulator

 arch/x86/hvm/svm/svm.c |   26 ++++++++------------------
 1 files changed, 8 insertions(+), 18 deletions(-)

Comment 35 Andrew Jones 2011-07-08 13:39:45 UTC
The crash in comment 31 will be covered under bug 719894. This bug can be set to verified since it fixed the originally reported problem on the originally reported machine.

Comment 36 Qixiang Wan 2011-07-15 06:21:50 UTC
Move to VERIFIED per comment 35.

Comment 37 errata-xmlrpc 2011-07-21 09:56:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html

