Bug 717742
Summary: | [RHEL5.7][kernel-xen] HVM guests hang during installation on AMD systems | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Jeff Burke <jburke> |
Component: | kernel-xen | Assignee: | Paolo Bonzini <pbonzini> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 5.7 | CC: | agk, arozansk, coughlan, ddutile, drjones, jarod, jstancek, jwest, leiwang, mbroz, mjenner, mshao, pbunyan, pcao, pmatouse, qwan, syeghiay, tburke, xen-maint |
Target Milestone: | rc | Keywords: | Regression, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-2.6.18-273.el5 | Doc Type: | Bug Fix |
Last Closed: | 2011-07-21 09:56:17 UTC | Type: | --- |
Bug Depends On: | 719894, 725928 | ||
Bug Blocks: | 514489, 684637, 719066 | ||
Attachments: |
Description
Jeff Burke
2011-06-29 18:40:03 UTC
Which AMD system --
What was guest being installed?

(In reply to comment #1)
> Which AMD system --
Any AMD that supports HVM I was able to duplicate this issue on.

> What was guest being installed?
Not sure what you are asking here? I was installing an HVM guest using RHEL5.7-Server-20110622.0 x86_64

We tried with:
Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : FAIL
Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110409.3 x86_64 HVM DomU : FAIL
Host kernel-xen-2.6.18-271.el5 + RHEL5.7 20110622.0 i386 HVM DomU : PASS
Host kernel-xen-2.6.18-271.el5 + RHEL5.6 released x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS

The failures of RHEL5.7 20110409.3 and RHEL5.7 20110622.0 are a little different:
[1] The 20110409.3 x86_64 HVM DomU hangs after booting from the boot.iso; the console log will be attached soon.
[2] The RHEL5.7 20110622.0 x86_64 HVM DomU hangs at the point of initializing the storage, or before that.

I think the failure of 20110409.3 is not the same issue, will try with 270 host.

Created attachment 510578 [details]
RHEL5.7-Server-20110409.3 x86_64 HVM DomU hang after boot from boot.iso
host kernel : 2.6.18-271.el5xen
(In reply to comment #4)
> I think the failure of 20110409.3 is not the same issue, will try with 270
> host.
I was wrong, there is no issue with installing 20110409.3 over the 270 host, so it's probably the same issue.
Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110409.3 x86_64 HVM DomU : PASS

Hmm, the bisection of the HV definitely points at

commit eba8ca99b31737c482e49a612516a17c435c3685
Author: Andrew Jones <drjones>
Date: Thu May 19 14:13:14 2011 -0400
[xen] hvm: svm support cleanups

However, this once worked; see bug 702657 comment 32 to see that we actually tested it, and it worked. Which means something else changed. I now have access to amd-dinar-05.lab.bos.redhat.com, so using that box I'll revert everything back to the way it was when the patch worked, and then incrementally bring in the new stuff to see where it breaks.

That patch also went to 5-6-Z, so I tried manually installing the following combinations:
Host kernel-xen-2.6.18-270.el5 + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-238.15.1.el5xen + RHEL5.7 20110622.0 x86_64 HVM DomU : PASS
Host kernel-xen-2.6.18-238.16.1.el5xen + RHEL5.7 20110622.0 x86_64 HVM DomU : FAIL

You can also boot an existing RHEL5.7 20110622.0 x86_64 guest to test this issue. The guest VM will hang while device-mapper is scanning logical volumes; check the attached screenshot.

Created attachment 510621 [details]
rhel5.7 64bit hvm boot failed
I reproduced the "hang" at the point of initializing the storage. The quotes are around hang because it's not a hang for the most part. anaconda is angry and thus doesn't do anything - which appears like a hang, but the guest is still running as far as xen is concerned, and you can even switch to vterm2 in virt-viewer and use the shell on the guest. I did that and was able to scp dmesg to my notebook (I'll attach it). The last messages are device-mapper related, which is consistent with comment 10.

I have also successfully installed a 5.6 guest on the same host (which is a build that still has the patch in question - comment 7).

So, so far it appears there's an issue caused by combining the xen patch AND something that went into 5.7, which is file system related.

Created attachment 510629 [details]
dmesg after "hang" at disk init during install
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

It's true we did test it (bug 703715), but unfortunately Pengzhen didn't mention the version of the guest he used for testing. That would explain the problem if the filesystem issue is in the guest. But since booting an existing guest also fails, perhaps you can try bisecting the guest kernels instead? It's painful because you need to reboot the host multiple times, but it's possible.

Also, I suppose all of you are using file images. Perhaps you can also try using raw partitions to check if the filesystem issue (current working hypothesis) is in the guest or the host.

Finally (and actually the more interesting part): do you see the "Mismatch between expected and actual instruction bytes:" in "xm dmesg", either before or after the breakage? Unfortunately it has not been attached to the BZ yet.

I've been focusing on figuring out what the guest kernel is trying to do when it hangs. Thus far I've been leaving the hypervisor patch alone (even though it's clearly connected in some way). So here are some experiment results using a -272 host and guests that I installed while running on -270.

My 5.7 guest obviously doesn't boot (we knew that), and it always hangs at the same place, i.e. right after printing

Waiting for driver initialization.
Scanning and configuring dmraid supported devices
Scanning logical volumes

This means it hangs right after we start a vgscan. If the vgscan had succeeded, we would have seen these messages next

Reading all physical volumes. This may take a while...
Found volume group "VolGroup00" using metadata type lvm2
Activating logical volumes
2 logical volume(s) in volume group "VolGroup00" now active
Trying to resume from /dev/VolGroup00/LogVol01

xenctx shows that one proc is off in the weeds and the other three are in

rip: ffffffff8006be1c default_idle+0x29

weeds proc
rip: 00000000004d2ff7
flags: 00000206 i nz p
rsp: 00007fff3d78fb90
rax: 0000000080000000 rcx: 0000000000000006 rdx: 000000000050d140
rbx: 0000000068747541 rsi: 0000000002008140 rdi: 0000000000a00000
rbp: 0000000000080000 r8: 00000000007af9b0 r9: 2f2f2f2f2f2f2f2f
r10: 0000000000021000 r11: 0000000000000014 r12: 0000000000010000
r13: 00007fff3d78fed8 r14: 0000000000000001 r15: 00000000004b1810
cs: 0033 ss: 002b ds: 0000 es: 0000
fs: 0000 @ 0000000000000000
gs: 0000 @ 0000000000000000/0000000000000000

If attempting with UP, then there's only the weeds proc.

My 5.6 guest boots fine (we knew that too). Here's what we didn't know: my 5.6 guest boots fine using the -272 kernel as well, and my 5.7 guest still doesn't boot using the -238 (5.6 GA) kernel.

Then I cloned my 5.6 guest, booted it, and yum updated lvm2 from the 5.7 repo, which brought in device-mapper dependencies.

(1/4): device-mapper-event-1.02.63-4.el5.x86_64.rpm | 23 kB 00:00
(2/4): device-mapper-1.02.63-4.el5.i386.rpm | 776 kB 00:00
(3/4): device-mapper-1.02.63-4.el5.x86_64.rpm | 807 kB 00:00
(4/4): lvm2-2.02.84-6.el5.x86_64.rpm | 3.1 MB 00:03

After the yum update completed successfully I typed 'vgscan' and the guest hung. xenctx showed that it hung the same way. This guest now hangs in a new place at boot, right after

md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: multipath: version 1.0.6 loaded
Setting up Logical Volume Management:

but it has the same xenctx signature.

We should get some device-mapper and lvm folk to take a look at this in order to help with the debug.
(In reply to comment #16)
> We should get some device-mapper and lvm folk to take a look at this in order
> help with the debug.
I have done so.

The last changes to device-mapper and lvm were in snapshot 4, I believe. Did this test run on, and pass, snapshot 3, 4, 5?

The rhel5.7 guest I used when verifying bug 703715 is RHEL5.7-Server-20110513.0, guest kernel version: kernel-2.6.18-261.el5.x86_64

device-mapper and lvm pkg versions:
dmraid-1.0.0.rc13-65.el5
dmraid-events-1.0.0.rc13-65.el5
lvm2-2.02.84-3.el5
device-mapper-multipath-0.4.7-46.el5
device-mapper-1.02.63-2.el5
device-mapper-event-1.02.63-2.el5
device-mapper-1.02.63-2.el5

This guest works fine on host kernels 2.6.18-272 and 2.6.18-238.17.1

(In reply to comment #14)
> It's true we did test it (bug 703715), but unfortunately Pengzhen didn't
> mention the version of the guest he used for testing. That would explain the
> problem if the filesystem issue is in the guest. But since booting an existing
> guest also fails, perhaps you can try bisecting the guest kernels instead?
> It's painful because you need to reboot the host multiple times, but it's
> possible.
>
> Also, I suppose all of you are using file images. Perhaps you can also try
> using raw partitions to check if the filesystem issue (current working
> hypothesis) is in the guest or the host.
>
> Finally (and actually the more interesting part): do you see the "Mismatch
> between expected and actual instruction bytes:" in "xm dmesg", either before or
> after the breakage? Unfortunately it has not been attached to the BZ yet.

(In reply to comment #16)
> After the yum update completed successfully I typed 'vgscan' and the guest
> hung.
Please can you update from the latest 5.7 repo (lvm2 should be lvm2-2.02.84-6.el5, device-mapper-1.02.63-4.el5). Then for the hanging vgscan add the -vvvv option and attach the debug output (IOW run "vgscan -vvvv").

Also the task list (echo t>/proc/sysrq-trigger) and the output from "dmsetup info -c --noopencount" would be very useful.
I have tried again with two rhel5.7 x86_64 guests, on the same AMD machine with the 272 xen kernel.

1. rhel5.7-20110409.3 x86_64, boot the guest with boot.iso,
http://download.englab.nay.redhat.com/pub/rhel/rel-eng/RHEL5.7-Server-20110409.3/tree-x86_64/images/boot.iso
The guest hangs and the guest kernel panics, see the attachment.

2. rhel5.7-20110513.0 x86_64: booting the guest with boot.iso, installing it with the Installation DVD, or booting the installed guest all work fine without issue.

I think the issue with the 20110409.3 guest panic might be different from the latest rhel5.7 guest, although this may be the same root cause due to the host's kernel-xen.

Created attachment 510794 [details]
rhel5.7-20110413.3-x86_64.amd.guest-boot.iso-panic
I had momentarily forgotten that my experiments with anaconda had proven this wasn't a real hang last night. This morning I tried an ingenious thing (ctrl-C) after starting vgscan on a guest with updated lvm2. It worked. So it's easy to experiment with this as I can run vgscan as many times as I want in my guest. Paolo suggested we run vgscan in gdb, so I did and got the instruction.

0x00000000004d2ff7 <init_cacheinfo+327>: cpuid

The emulation of this was indeed changed with

eba8ca9 [xen] hvm: svm support cleanups

specifically we now just return, rather than attempt to emulate the instruction, if we don't get the expected instruction length. Paolo has a hunch why we might be failing this condition. He's attempting to write a reproducer so we can take lvm out of the equation.

It's a Xen bug. Reproducer:

#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>

/* xor %eax, %eax; pusha; cpuid; popa; ret */
static unsigned char cpuid_bytes[] = { 0x33, 0xc0, 0x60, 0x0f, 0xa2, 0x61, 0xc3 };

int main()
{
    /* Map two pages; the code goes at the very end of the first one. */
    void *m = mmap(NULL, 8192, PROT_READ|PROT_WRITE|PROT_EXEC,
                   MAP_PRIVATE|MAP_ANON, -1, 0);
    unsigned long maddr = (unsigned long)m + 4096 - sizeof cpuid_bytes;
    /* Make the second page inaccessible (protection 0 is PROT_NONE). */
    mprotect((void *) (maddr + sizeof cpuid_bytes), 4096, 0);
    void (*cpuid)(void) = (void (*)(void)) maddr;
    memcpy(cpuid, cpuid_bytes, sizeof cpuid_bytes);
    cpuid();
    exit (0);
}

(must be compiled 32-bit, i.e. with -m32). The bug occurs when CPUID is less than 15 bytes from the end of a page, and the next page is not readable.

Patch(es) available in kernel-2.6.18-273.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.

... Note: this kernel contains patches that are under embargo until 2011.07.07, so it will not actually be available until the 7th or 8th.

Created attachment 511865 [details]
x86_64 guest crash over 273 xen on some of AMD cpus
The fix introduced another regression: RHEL5 (.6/.7) 64-bit HVM guests crash during boot on some models of AMD processors (e.g. Dual-Core 1220, Athlon(tm) Dual Core 5400B). I can't reproduce it with an AMD Phenom(tm) II X4 B95 Processor, and it can't be reproduced with i386 guests.
guest log is attached.
Setting up hotplug.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at arch/x86_64/kernel/smp.c:77
invalid opcode: 0000 [1] SMP
last sysfs file: /class/firmware/timeout
CPU 0
Modules linked in:
Pid: 1, comm: init Not tainted 2.6.18-238.el5 #1
RIP: 0010:[<ffffffff8002b32e>] [<ffffffff8002b32e>] flush_tlb_page+0x6d/0xda
RSP: 0000:ffff81003ff95cb8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000000000
RDX: ffff81003fa08768 RSI: ffff81003fb2ed98 RDI: ffff81003ff95cd8
RBP: ffff81003fb2eac0 R08: ffff810000012b00 R09: ffff8100016f05a0
R10: 0000000018f7b9c0 R11: ffff81003fa08298 R12: 0000000018f7b9c4
R13: ffff8100016f05a0 R14: ffff81003fb2eac0 R15: ffff81003fb3ebd8
FS: 0000000018f7b930(0063) GS:ffffffff80425000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000018f7b9c4 CR3: 000000003fb2f000 CR4: 00000000000006e0
Process init (pid: 1, threadinfo ffff81003ff94000, task ffff81003ff827a0)
Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
0000000000000012 0000000000000001 ffff8100016f05c8 ffffffff800111a4
ffff81003fb2eac0 ffff81003fa0b638 0000000018f7b9c4 ffff81003fa08768
Call Trace:
[<ffffffff800111a4>] do_wp_page+0x3fd/0x902
[<ffffffff8000866f>] copy_page_range+0x6a1/0x795
[<ffffffff800096ce>] __handle_mm_fault+0xf6b/0x1039
[<ffffffff800a0282>] attach_pid+0x7c/0xa9
[<ffffffff8006720b>] do_page_fault+0x4cb/0x874
[<ffffffff80062ff0>] thread_return+0x62/0xfe
[<ffffffff8005dde9>] error_exit+0x0/0x84
Code: 0f 0b 68 f5 09 2b 80 c2 4d 00 65 48 8b 04 25 48 00 00 00 90
RIP [<ffffffff8002b32e>] flush_tlb_page+0x6d/0xda
RSP <ffff81003ff95cb8>
<0>Kernel panic - not syncing: Fatal exception
Created attachment 511875 [details]
the hypervisor log
Attaching the hypervisor log from before and after the guest crash; I didn't see anything in it that looks related.
The crash is bug 719894.

Created attachment 511933 [details]
[PATCH] xen: svm: fix emulator
arch/x86/hvm/svm/svm.c | 26 ++++++++------------------
1 files changed, 8 insertions(+), 18 deletions(-)
The crash in comment 31 will be covered under bug 719894. This bug can be set to verified since it fixed the originally reported problem on the originally reported machine.

Move to VERIFIED per comment 35.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-1065.html